In the first post of this series, 机器学习漫谈(1), we discussed generalization bounds for binary classification. For a concrete method, however, analyzing the error bound over its hypothesis space is fairly involved, and it becomes even harder for models such as neural networks, where $\hat{f}$ is obtained by optimizing with SGD.
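As a quick reminder, using generic notation here (an assumption, since the earlier post's exact symbols are not reproduced in this one): write $R(f)$ for the population risk and $\hat{R}_n(f)$ for the empirical risk on the $n$ training samples. The quantity a generalization bound controls is the gap between the two for the learned predictor $\hat{f}$:

$$
R(\hat{f}) \;=\; \underbrace{\hat{R}_n(\hat{f})}_{\text{training error}} \;+\; \underbrace{R(\hat{f}) - \hat{R}_n(\hat{f})}_{\text{generalization gap}} .
$$

Controlling the second term, typically uniformly over the hypothesis space, is exactly what becomes delicate for deep networks.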
For deep learning there are two main difficulties:
- the enormous number of parameters;
- the strong coupling between optimization and the generalization bound in deep networks.
The first difficulty means that the network is over-parameterized: its hypothesis space is correspondingly enormous, and the network can essentially fit the training data perfectly. This regime is hard to handle with classical tools for generalization bounds such as VC-dimension arguments.
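To see concretely why, recall the shape of a classical uniform-convergence bound, stated up to constants and logarithmic factors as a reminder rather than as a result from this post: with high probability over the sample,

$$
R(f) \;\le\; \hat{R}_n(f) \;+\; \tilde{O}\!\left(\sqrt{\frac{d_{\mathrm{VC}}(\mathcal{F})}{n}}\right)
\qquad \text{for all } f \in \mathcal{F}.
$$

For modern networks $d_{\mathrm{VC}}(\mathcal{F})$ grows with the number of weights, which is typically far larger than $n$, so the right-hand side is vacuous even though such networks often generalize well in practice.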
The second difficulty is that the $\hat{f}$ produced by a deep network depends heavily on the training algorithm (SGD): the initialization and the dynamics of the parameter updates all influence the $\hat{f}$ that is finally obtained.
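To make both difficulties concrete, here is a minimal toy sketch, written in PyTorch purely for illustration (an assumption; the post itself contains no code): an over-parameterized two-layer ReLU network trained by full-batch gradient descent can essentially interpolate purely random labels, and rerunning with a different seed yields a different $\hat{f}$ away from the training points.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
X = torch.randn(32, 10)             # 32 samples, 10 features
Y = (torch.rand(32) > 0.5).float()  # random 0/1 labels: no signal to learn

def fit(seed: int, width: int = 512, steps: int = 3000) -> nn.Module:
    """Train an over-parameterized MLP; only the seed differs between runs."""
    torch.manual_seed(seed)  # controls the random initialization
    model = nn.Sequential(nn.Linear(10, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.2)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):  # full-batch gradient descent
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), Y)
        loss.backward()
        opt.step()
    acc = ((model(X).squeeze(-1) > 0).float() == Y).float().mean().item()
    print(f"seed={seed}: train loss {loss.item():.4f}, train acc {acc:.2f}")
    return model

m0, m1 = fit(seed=0), fit(seed=1)
probe = torch.randn(4, 10)             # unseen inputs
print(m0(probe).squeeze(-1).tolist())  # the two near-interpolating solutions
print(m1(probe).squeeze(-1).tolist())  # generally disagree off the training set
```

Both runs drive the training error toward zero on pure noise, which is exactly the regime where a VC-style bound has nothing to say, and the two learned functions differ, which is why the analysis cannot ignore the algorithm.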
Analysis methods for generalization bounds in deep learning
Parameter norm-based
Uniform stability-based
Other generalization theories
Another noteworthy line of research on generalization bounds is based on information theory^17. Mutual information (MI) is used to measure the information transfer and information loss in deep learning models and training algorithms. Beyond this, other techniques for bounding the generalization error include model compression, margin theory, path-length estimates, and the linear stability of optimization algorithms.
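As one representative example of this flavor, stated here as the standard sub-Gaussian mutual-information bound rather than as a result taken from this post: if the loss $\ell(w, Z)$ is $\sigma$-sub-Gaussian for every fixed $w$, and the algorithm maps the i.i.d. sample $S$ of size $n$ to weights $W$, then

$$
\bigl|\,\mathbb{E}\!\left[ R(W) - \hat{R}_n(W) \right]\bigr|
\;\le\;
\sqrt{\frac{2\sigma^2}{n}\, I(S; W)} .
$$

In words, the less the learned weights reveal about the particular training sample, the smaller the expected generalization gap; this is the intuition that the information-theoretic line of work formalizes.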
Footnotes:
1. Chao Ma, Lei Wu, and Weinan E. A priori estimates of the population risk for two-layer neural networks. arXiv preprint arXiv:1810.06397, 2018.
2. Chao Ma, Qingcan Wang, and Weinan E. A priori estimates of the population risk for residual networks. arXiv preprint arXiv:1903.02154, 2019.
3. Zhong Li, Chao Ma, and Lei Wu. Complexity measures for neural networks with general activation functions using path-based norms. arXiv preprint arXiv:2009.06132, 2020.
4. Weinan E, Chao Ma, and Lei Wu. The Barron space and the flow-induced function spaces for neural network models. Constructive Approximation, pages 1–38, 2021.
5. Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference on Learning Theory, pages 297–299. PMLR, 2018.
6. Peter Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.
7. Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 888–896. PMLR, 2019.
8. Zhuozhuo Tu, Fengxiang He, and Dacheng Tao. Understanding generalization in recurrent neural networks. In International Conference on Learning Representations, 2019.
9. William H. Rogers and Terry J. Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, pages 506–514, 1978.
10. Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pages 610–626. PMLR, 2020.
11. Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.
12. Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
13. Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703. PMLR, 2017.
14. Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
15. Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory, pages 1980–2022. PMLR, 2017.
16. Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. In Conference on Learning Theory, pages 605–638. PMLR, 2018.
17. Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.