Double descent
{{Short description|Concept in machine learning}}
{{For|the concept of double descent in anthropology|Kinship#Descent rules}}{{Machine learning bar}}
[[File:Double descent in a two-layer neural network (Figure 3a from Rocks et al. 2022).png|thumb|As the ratio of parameters to data points increases, the test error first falls, then rises, then falls again.<ref>{{cite journal |last1=Rocks |first1=Jason W. |title=Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models |journal=Physical Review Research |date=2022 |volume=4 |issue=1 |doi=10.1103/PhysRevResearch.4.013201 |url=https://journals.aps.org/prresearch/abstract/10.1103/PhysRevResearch.4.013201 |arxiv=2010.13933 }}</ref> The vertical line marks the "interpolation threshold" boundary between the underparameterized region (more data points than parameters) and the overparameterized region (more parameters than data points).]]
'''Double descent''' in statistics and machine learning is the phenomenon where a model with a small number of parameters and a model with an extremely large number of parameters both have a small test error, but a model whose number of parameters is about the same as the number of data points used to train the model will have a much greater test error than one with a much larger number of parameters.<ref>{{Cite web |date=2019-12-05 |title=Deep Double Descent |url=https://openai.com/blog/deep-double-descent/ |access-date=2022-08-12 |website=OpenAI |language=en}}</ref> This phenomenon has been considered surprising, as it contradicts assumptions about overfitting in classical machine learning.<ref>{{Cite arXiv |eprint=2303.14151v1 |class=cs.LG |first1=Rylan |last1=Schaeffer |first2=Mikail |last2=Khona |title=Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle |date=2023-03-24 |language=en |last3=Robertson |first3=Zachary |last4=Boopathy |first4=Akhilan |last5=Pistunova |first5=Kateryna |last6=Rocks |first6=Jason W. |last7=Fiete |first7=Ila Rani |last8=Koyejo |first8=Oluwasanmi}}</ref>
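The effect can be illustrated with a small simulation. The sketch below is a minimal illustration rather than a reproduction of any cited experiment: it assumes a random-feature model family fitted by minimum-norm ("ridgeless") least squares and an arbitrary noisy target, and sweeps the number of features ''p'' for a fixed training set of ''n'' = 100 points. The measured test error typically peaks when ''p'' is close to ''n'' and decreases again for much larger ''p''.

<syntaxhighlight lang="python">
# Minimal sketch of model-wise double descent with random ReLU features and
# minimum-norm least squares (illustrative assumptions, not a cited experiment).
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 2000, 10

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.sin(X @ np.ones(d)) + 0.1 * rng.normal(size=n)  # arbitrary noisy target
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def test_error(p):
    """Test MSE of a width-p random-feature model fitted by minimum-norm least squares."""
    W = rng.normal(size=(d, p)) / np.sqrt(d)                       # random first-layer weights
    F_tr, F_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)  # ReLU random features
    coef = np.linalg.pinv(F_tr) @ y_tr                             # minimum-norm least-squares fit
    return np.mean((F_te @ coef - y_te) ** 2)

for p in [10, 50, 90, 100, 110, 150, 300, 1000]:
    print(p, np.mean([test_error(p) for _ in range(10)]))
# The test error typically peaks near p = n_train = 100 (the interpolation
# threshold) and falls again as p grows well beyond n_train.
</syntaxhighlight>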
== History ==
Early observations of what would later be called double descent in specific models date back to 1989.<ref>{{Cite journal |last1=Vallet |first1=F. |last2=Cailton |first2=J.-G. |last3=Refregier |first3=Ph |date=June 1989 |title=Linear and Nonlinear Extension of the Pseudo-Inverse Solution for Learning Boolean Functions |url=https://dx.doi.org/10.1209/0295-5075/9/4/003 |journal=Europhysics Letters |language=en |volume=9 |issue=4 |pages=315 |doi=10.1209/0295-5075/9/4/003 |bibcode=1989EL......9..315V |issn=0295-5075 |url-access=subscription }}</ref><ref>{{Cite journal |last1=Loog |first1=Marco |last2=Viering |first2=Tom |last3=Mey |first3=Alexander |last4=Krijthe |first4=Jesse H. |last5=Tax |first5=David M. J. |date=2020-05-19 |title=A brief prehistory of double descent |journal=Proceedings of the National Academy of Sciences |language=en |volume=117 |issue=20 |pages=10625–10626 |doi=10.1073/pnas.2001875117 |doi-access=free |issn=0027-8424 |pmc=7245109 |pmid=32371495 |arxiv=2004.04328 |bibcode=2020PNAS..11710625L }}</ref>
The term "double descent" was coined by Belkin et al.<ref>{{Cite journal |last1=Belkin |first1=Mikhail |last2=Hsu |first2=Daniel |last3=Ma |first3=Siyuan |last4=Mandal |first4=Soumik |date=2019-08-06 |title=Reconciling modern machine learning practice and the bias-variance trade-off |journal=Proceedings of the National Academy of Sciences |volume=116 |issue=32 |pages=15849–15854 |arxiv=1812.11118 |doi=10.1073/pnas.1903070116 |issn=0027-8424 |pmc=6689936 |pmid=31341078 |doi-access=free}}</ref> in 2019, when the phenomenon gained popularity as a broader concept exhibited by many models.<ref>{{Cite journal |last1=Spigler |first1=Stefano |last2=Geiger |first2=Mario |last3=d'Ascoli |first3=Stéphane |last4=Sagun |first4=Levent |last5=Biroli |first5=Giulio |last6=Wyart |first6=Matthieu |date=2019-11-22 |title=A jamming transition from under- to over-parametrization affects loss landscape and generalization |journal=Journal of Physics A: Mathematical and Theoretical |volume=52 |issue=47 |pages=474001 |doi=10.1088/1751-8121/ab4c8b |issn=1751-8113 |arxiv=1810.09665 }}</ref><ref>{{Cite journal |last1=Viering |first1=Tom |last2=Loog |first2=Marco |date=2023-06-01 |title=The Shape of Learning Curves: A Review |url=https://ieeexplore.ieee.org/document/9944190 |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=6 |pages=7799–7819 |doi=10.1109/TPAMI.2022.3220744 |pmid=36350870 |issn=0162-8828 |arxiv=2103.10948 }}</ref> This development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in a model result in a significant overfitting error (an extrapolation of the bias–variance tradeoff)<ref>{{cite journal |last1=Geman |first1=Stuart |author-link1=Stuart Geman |last2=Bienenstock |first2=Élie |last3=Doursat |first3=René |year=1992 |title=Neural networks and the bias/variance dilemma |url=http://web.mit.edu/6.435/www/Geman92.pdf |journal=Neural Computation |volume=4 |pages=1–58 |doi=10.1162/neco.1992.4.1.1 |s2cid=14215320}}</ref> and the empirical observations in the 2010s that some modern machine learning techniques tend to perform better with larger models.<ref>{{cite journal |author1=Preetum Nakkiran |author2=Gal Kaplun |author3=Yamini Bansal |author4=Tristan Yang |author5=Boaz Barak |author6=Ilya Sutskever |date=29 December 2021 |title=Deep double descent: where bigger models and more data hurt |journal=Journal of Statistical Mechanics: Theory and Experiment |publisher=IOP Publishing Ltd and SISSA Medialab srl |volume=2021 |issue=12 |page=124003 |arxiv=1912.02292 |bibcode=2021JSMTE2021l4003N |doi=10.1088/1742-5468/ac3a74 |s2cid=207808916}}</ref>
== Theoretical models ==
Double descent occurs in linear regression with isotropic Gaussian covariates and isotropic Gaussian noise.<ref>{{Cite arXiv |eprint=1912.07242v1 |class=stat.ML |first=Preetum |last=Nakkiran |title=More Data Can Hurt for Linear Regression: Sample-wise Double Descent |date=2019-12-16 |language=en}}</ref>
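A minimal numerical sketch of this setting is given below; the specific dimensions, noise level, and the use of the Moore–Penrose pseudoinverse (the minimum-norm least-squares fit) are illustrative assumptions rather than details taken from the cited analysis. Holding the number of parameters ''d'' fixed and varying the number of training samples ''n'', the averaged test error typically peaks near the interpolation threshold ''n'' = ''d''.

<syntaxhighlight lang="python">
# Minimal sketch of sample-wise double descent for minimum-norm least squares with
# isotropic Gaussian covariates and isotropic Gaussian noise (illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
d = 100                       # fixed number of parameters (feature dimension)
sigma = 0.5                   # noise standard deviation
beta = rng.normal(size=d)
beta /= np.linalg.norm(beta)  # true coefficient vector, normalized to unit length

def test_mse(n_train, n_test=2000):
    """Test MSE of the minimum-norm least-squares fit trained on n_train samples."""
    X = rng.normal(size=(n_train, d))                # isotropic Gaussian covariates
    y = X @ beta + sigma * rng.normal(size=n_train)  # isotropic Gaussian noise
    beta_hat = np.linalg.pinv(X) @ y                 # minimum-norm solution
    X_test = rng.normal(size=(n_test, d))
    y_test = X_test @ beta + sigma * rng.normal(size=n_test)
    return np.mean((X_test @ beta_hat - y_test) ** 2)

for n in [25, 50, 75, 90, 100, 110, 150, 300, 1000]:
    print(n, np.mean([test_mse(n) for _ in range(20)]))
# The averaged test error typically peaks near n = d = 100 (the interpolation
# threshold) and decreases again as n grows past d.
</syntaxhighlight>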
A model of double descent in the thermodynamic limit has been analyzed using the replica trick, and the result has been confirmed numerically.<ref>{{Cite journal |last1=Advani |first1=Madhu S. |last2=Saxe |first2=Andrew M. |last3=Sompolinsky |first3=Haim |date=2020-12-01 |title=High-dimensional dynamics of generalization error in neural networks |journal=Neural Networks |volume=132 |pages=428–446 |doi=10.1016/j.neunet.2020.08.022 |issn=0893-6080 |doi-access=free |pmid=33022471 |pmc=7685244 }}</ref>
== Empirical examples ==
The scaling behavior of double descent has been found to follow a broken neural scaling law functional form.<ref>{{cite conference |last1=Caballero |first1=Ethan |last2=Gupta |first2=Kshitij |last3=Rish |first3=Irina |last4=Krueger |first4=David |year=2022 |title=Broken Neural Scaling Laws |conference=International Conference on Learning Representations (ICLR), 2023}}</ref>
== References ==
{{Reflist}}
== Further reading ==
* {{cite journal|title=Two Models of Double Descent for Weak Features|author1=Mikhail Belkin|author2=Daniel Hsu|author3=Ji Xu|journal=SIAM Journal on Mathematics of Data Science|volume=2|issue=4|year=2020|pages=1167–1180 |doi=10.1137/20M1336072|doi-access=free|arxiv=1903.07571}}
* {{cite web|url=https://win-vector.com/2024/04/03/the-m-n-machine-learning-anomaly/|title=The m = n Machine Learning Anomaly|first=John|last=Mount|date=3 April 2024}}
* {{cite journal|title=Deep double descent: where bigger models and more data hurt|author1=Preetum Nakkiran|author2=Gal Kaplun|author3=Yamini Bansal|author4=Tristan Yang|author5=Boaz Barak|author6=Ilya Sutskever|journal=Journal of Statistical Mechanics: Theory and Experiment|volume=2021|date=29 December 2021|issue=12 |page=124003 |publisher=IOP Publishing Ltd and SISSA Medialab srl|arxiv=1912.02292|doi=10.1088/1742-5468/ac3a74|bibcode=2021JSMTE2021l4003N |s2cid=207808916 }}
* {{cite journal|title=The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve|author1=Song Mei|author2=Andrea Montanari|journal=Communications on Pure and Applied Mathematics|volume=75|issue=4|date=April 2022|pages=667–766 |doi=10.1002/cpa.22008|arxiv=1908.05355|s2cid=199668852 }}
* {{cite journal|title=Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks|author1=Xiangyu Chang|author2=Yingcong Li|author3=Samet Oymak|author4=Christos Thrampoulidis|journal=Proceedings of the AAAI Conference on Artificial Intelligence|volume=35|issue=8|year=2021|arxiv=2012.08749}}
== External links ==
* {{cite web|url=https://mlu-explain.github.io/double-descent/|title=Double Descent: Part 1: A Visual Introduction|author1=Brent Werness|author2=Jared Wilber}}
* {{cite web|url=https://mlu-explain.github.io/double-descent2/|title=Double Descent: Part 2: A Mathematical Explanation|author1=Brent Werness|author2=Jared Wilber}}
* [https://www.lesswrong.com/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent Understanding "Deep Double Descent"] at evhub.
{{Statistics|state=collapsed}}
{{Artificial intelligence navbox}}
[[Category:Statistical classification]]
{{stat-stub}}