LeNet

{{short description|Convolutional neural network structure}}

File:LeNet-5_architecture.svg

LeNet is a series of convolutional neural network architectures created by a research group at AT&T Bell Laboratories, centered around Yann LeCun, between 1988 and 1998. They were designed for reading small grayscale images of handwritten digits and letters, and were used in ATMs for reading cheques.

Convolutional neural networks are a kind of feed-forward neural network whose artificial neurons respond only to a local region of the input (their receptive field), and they perform well in large-scale image processing. LeNet-5 was one of the earliest convolutional neural networks and was historically important during the development of deep learning.{{Cite book |last1=Zhang |first1=Aston |title=Dive into deep learning |last2=Lipton |first2=Zachary |last3=Li |first3=Mu |last4=Smola |first4=Alexander J. |date=2024 |publisher=Cambridge University Press |isbn=978-1-009-38943-3 |location=Cambridge New York Port Melbourne New Delhi Singapore |chapter=7.6. Convolutional Neural Networks (LeNet) |chapter-url=https://d2l.ai/chapter_convolutional-neural-networks/lenet.html}}

In general, when "LeNet" is referred to without a number, it refers to the 1998 version, the most well-known version. It is also sometimes called "LeNet-5" or "LeNet5".

Development history

File:MNIST_dataset_example.png: The MNIST database, published in 1994. Before 1994, the LeNet series was mainly trained and tested on images similar to this; after 1994, it was mainly trained and tested on this dataset.

In 1988, LeCun joined the Adaptive Systems Research Department at AT&T Bell Laboratories in Holmdel, New Jersey, United States, headed by Lawrence D. Jackel.

File:Yann_LeCun_-_2018_(cropped).jpg

That same year, LeCun et al. published a neural network design that recognized handwritten zip code digits. However, its convolutional kernels were hand-designed.{{Cite journal |last1=Denker |first1=John |last2=Gardner |first2=W. |last3=Graf |first3=Hans |last4=Henderson |first4=Donnie |last5=Howard |first5=R. |last6=Hubbard |first6=W. |last7=Jackel |first7=L. D. |last8=Baird |first8=Henry |last9=Guyon |first9=Isabelle |date=1988 |title=Neural Network Recognizer for Hand-Written Zip Code Digits |url=https://proceedings.neurips.cc/paper/1988/hash/a97da629b098b75c294dffdc3e463904-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=1}}

In 1989, Yann LeCun et al. at Bell Labs first applied the backpropagation algorithm to practical applications, believing that the network's ability to generalize could be greatly enhanced by providing constraints from the task's domain. He trained a convolutional neural network by backpropagation to read handwritten numbers and successfully applied it to identifying handwritten zip code numbers provided by the US Postal Service. This was the prototype of what later came to be called LeNet-1. In the same year, LeCun described a small handwritten digit recognition problem in another paper, and showed that even though the problem is linearly separable, single-layer networks exhibited poor generalization capabilities, whereas a multi-layered, constrained network using shift-invariant feature detectors performed very well. He argued that these results demonstrated that minimizing the number of free parameters in a neural network enhances its generalization ability.

In 1990, their paper again described the application of backpropagation networks to handwritten digit recognition. Only minimal preprocessing was performed on the data, and the model was carefully designed and highly constrained for the task. The input data consisted of images, each containing a single digit, and tests on zip code data provided by the US Postal Service showed that the model had an error rate of only 1% and a rejection rate of about 9%.{{cite conference | last1=LeCun | last2=Boser | first2=B. E. | last3=Denker | first3=J. S. | last4=Henderson | first4=D. | last5=Howard | first5=R. | last6=Hubbard | first6=W. E. | last7=Jackel | first7=L. D. | title=Handwritten digit recognition with a back-propagation network | book-title=Advances in Neural Information Processing Systems 2 (NIPS 1989) | editor-last=Touretsky | editor-first=David S. | publisher=Morgan-Kaufmann | date=1989 | pages=396–404 | url=http://yann.lecun.com/exdb/publis/pdf/lecun-90c.pdf}}

Their research continued for the next four years. In 1994, the MNIST database was developed, for which LeNet-1 was too small; a new network, LeNet-4, was therefore trained on it.{{Cite book |last1=Bottou |first1=L. |last2=Cortes |first2=C. |last3=Denker |first3=J.S. |last4=Drucker |first4=H. |last5=Guyon |first5=I. |last6=Jackel |first6=L.D. |last7=LeCun |first7=Y. |last8=Muller |first8=U.A. |last9=Sackinger |first9=E. |last10=Simard |first10=P. |last11=Vapnik |first11=V. |chapter=Comparison of classifier methods: A case study in handwritten digit recognition |date=1994 |title=Proceedings of the 12th IAPR International Conference on Pattern Recognition (Cat. No.94CH3440-5) |chapter-url=https://ieeexplore.ieee.org/document/576879 |publisher=IEEE Comput. Soc. Press |volume=2 |pages=77–82 |doi=10.1109/ICPR.1994.576879 |isbn=978-0-8186-6270-6}}

A year later, the AT&T Bell Labs group introduced LeNet-5 and published a paper reviewing various methods for handwritten character recognition, comparing them on a standard handwritten digit recognition benchmark. The results showed that the latest network outperformed the other models.{{Cite journal |last1=LeCun |first1=Yann |last2=Jackel |first2=L. |last3=Bottou |first3=L. |last4=Cortes |first4=Corinna |last5=Denker |first5=J. |last6=Drucker |first6=H. |last7=Guyon |first7=Isabelle M. |last8=Muller |first8=Urs |last9=Sackinger |first9=E. |last10=Simard |first10=Patrice Y. |last11=Vapnik |first11=V. |date=1995 |title=Learning algorithms for classification: A comparison on handwritten digit recognition |s2cid=13411815 }}

By 1998 Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner were able to provide examples of practical applications of neural networks, such as two systems for recognizing handwritten characters online and models that could read millions of checks per day.{{cite journal |last1=Lecun |first1=Y. |last2=Bottou |first2=L. |last3=Bengio |first3=Y. |last4=Haffner |first4=P. |date=1998 |title=Gradient-based learning applied to document recognition |url=https://hal.science/hal-03926082/document |journal=Proceedings of the IEEE |volume=86 |issue=11 |pages=2278–2324 |doi=10.1109/5.726791 |s2cid=14542261}}

The research achieved great success and aroused scholarly interest in the study of neural networks. While the architectures of today's best-performing neural networks are not the same as that of LeNet, the network was the starting point for a large number of neural network architectures and brought inspiration to the field.

class="wikitable"

|+Timeline

|1989

|Yann LeCun et al. proposed the original form of LeNet (LeNet-1)

1989

|Yann LeCun demonstrates that minimizing the number of free parameters in neural networks can enhance the generalization ability of neural networks.

1990

|Application of backpropagation to LeNet-1 in handwritten digit recognition.

1994

|MNIST database and LeNet-4 developed

1995

|LeNet-5 developed, various methods applied to handwritten character recognition reviewed and compared with standard handwritten digit recognition benchmarks. The results show that convolutional neural networks outperform all other models.

1998

|Practical applications

Architecture

File:Comparison_image_neural_networks.svg

LeNet has several common motifs of modern convolutional neural networks, such as convolutional layers, pooling layers, and fully connected layers.{{Cite journal|last1=LeCun|first1=Y.|last2=Boser|first2=B.|last3=Denker|first3=J. S.|last4=Henderson|first4=D.|last5=Howard|first5=R. E.|last6=Hubbard|first6=W.|last7=Jackel|first7=L. D.|date=December 1989|title=Backpropagation Applied to Handwritten Zip Code Recognition|journal=Neural Computation|volume=1|issue=4|pages=541–551|doi=10.1162/neco.1989.1.4.541|s2cid=41312633|issn=0899-7667}}

  • Every convolutional layer includes three parts: convolution, pooling, and a nonlinear activation function
  • Convolutions to extract spatial features (convolution was originally described in terms of receptive fields)
  • Average pooling ("subsampling") layers
  • tanh activation functions
  • Fully connected layers at the end for classification
  • Sparse connections between layers to reduce computational complexity

In 1989, LeCun et al. published a report describing networks "Net-1" to "Net-5". There were many subsequent refinements up to 1998, and the naming is inconsistent. Generally, people speak only of "LeNet-5", not of the earlier forms; when they do, they mean the 1998 LeNet.

LeNet-1, LeNet-4, and LeNet-5 have been referred to in the literature, but it is unclear what LeNet-2 or LeNet-3 might refer to.

= 1988 Net =

The first neural network published by LeCun's research group appeared in 1988. It was a hybrid approach: the first stage scaled, deskewed, and skeletonized the input image; the second stage was a convolutional layer with 18 hand-designed kernels; the third stage was a fully connected network with one hidden layer.

The dataset was a collection of handwritten digit images extracted from actual U.S. Mail, which was the same dataset used in the famed 1989 report.

= Net-1 to Net-5 =

Net-1 to Net-5 were published in a 1989 report. The last layer of each was fully connected. The original paper does not explain the padding strategy. All cells have an independent bias, including the output cells of convolutional layers. A code sketch of Net-5 follows the list.

  • Net-1: No hidden layer. Fully connected. (16 \times 16) \to 10.
  • Net-2: One hidden fully connected layer with 12 hidden units. (16 \times 16) \to 12 \to 10.
  • Net-3: Two hidden locally connected layers (without weight sharing). (16 \times 16) \to (8 \times 8) \to (4 \times 4) \to 10. Both layers have input shape 3\times 3 and stride 2.
  • Net-4: Two hidden layers, the first is a convolution, the second is locally connected. (16 \times 16) \to (8 \times 8 \times 2) \to (4 \times 4) \to 10. The convolution layer has 2 kernels of shape 3\times 3 and stride 2. The locally connected layer has input shape 5 \times 5 \times 2 and stride 1.
  • Net-5: Two convolutional hidden layers. (16 \times 16) \to (8 \times 8 \times 2) \to (4 \times 4 \times 4) \to 10. The first convolution layer has 2 kernels of shape 3\times 3 and stride 2. The second convolutional layer has 4 kernels of shape 5 \times 5 \times 2 and stride 1.
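
The following is a minimal PyTorch sketch of Net-5, written for illustration only (it is not from the original report). It assumes a padding of 1 in the first layer so that the 16×16 input maps to 8×8 with stride 2, tanh hidden activations, and the per-feature-map biases of nn.Conv2d rather than the per-cell biases described above, so its parameter count differs from the table below.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class Net5(nn.Module):
    """Sketch of Net-5: two shared-weight (convolutional) hidden layers."""
    def __init__(self):
        super().__init__()
        # Padding of 1 is an assumption; the 1989 report does not state the padding strategy.
        self.conv1 = nn.Conv2d(1, 2, kernel_size=3, stride=2, padding=1)  # 1x16x16 -> 2x8x8
        self.conv2 = nn.Conv2d(2, 4, kernel_size=5, stride=1)             # 2x8x8  -> 4x4x4
        self.fc = nn.Linear(4 * 4 * 4, 10)                                # fully connected output
        self.act = nn.Tanh()

    def forward(self, x):
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        return self.fc(torch.flatten(x, 1))

net = Net5()
print(net(torch.randn(1, 1, 16, 16)).shape)  # torch.Size([1, 10])
</syntaxhighlight>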

The dataset contained 480 binary images, each sized 16×16 pixels. Originally, 12 examples of each digit were hand-drawn on a 16×13 bitmap using a mouse, resulting in 120 images. Then, each image was shifted horizontally in 4 consecutive positions to generate a 16×16 version, yielding the 480 images.

From these, 320 images (32 per digit) were randomly selected for training, and the remaining 160 images (16 per digit) were used for testing. Performance on the training set was 100% for all networks, but they differed in test set performance.

class="wikitable"

|+Performance of Net-1 to Net-5{{Cite book |last1=Hastie |first1=Trevor |title=The elements of statistical learning: data mining, inference, and prediction |last2=Tibshirani |first2=Robert |last3=Friedman |first3=Jerome H. |date=2017 |publisher=Springer |isbn=978-0-387-84857-0 |edition=Second |series=Springer Series in Statistics |location=New York, NY |chapter=11.7 Example: ZIP Code Data}}

!Name

!Connections

!Independent parameters

!% correct

Net-1

|2570

|2570

|80.0

Net-2

|3214

|3214

|87.0

Net-3

|1226

|1226

|88.5

Net-4

|2266

|1132

|94.0

Net-5

|5194

|1060

|98.4

= 1989 LeNet =

The LeNet published in 1989 has 3 hidden layers (H1-H3) and an output layer. It has 1256 units, 64660 connections, and 9760 independent parameters (these counts are checked in the code sketch after the list).

  • H1 (Convolutional): 16 \times 16 \to 12 \times 8 \times 8 with 5\times 5 kernels.
  • H2 (Convolutional): 12 \times 8 \times 8 \to 12 \times 4 \times 4 with 8 \times 5\times 5 kernels.
  • H3: 30 units fully connected to H2.
  • Output: 10 units fully connected to H3, representing the 10 digit classes (0-9).
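
As a check on the figures above, the following sketch (an illustration, not from the original paper) recomputes the unit, connection, and parameter counts, assuming that every unit has its own bias and that each H2 kernel sees 8 of the 12 H1 feature maps:

<syntaxhighlight lang="python">
# Unit counts per layer: input 16x16, H1 12x8x8, H2 12x4x4, H3 30, output 10.
h1, h2, h3, out = 12 * 8 * 8, 12 * 4 * 4, 30, 10
units = 16 * 16 + h1 + h2 + h3 + out                # 1256

# Connections: each unit connects to its inputs plus one bias.
connections = (h1 * (5 * 5 + 1)                     # H1: one 5x5 window per unit
               + h2 * (8 * 5 * 5 + 1)               # H2: 5x5 windows on 8 of the 12 maps
               + h3 * (h2 + 1)                      # H3: fully connected to H2
               + out * (h3 + 1))                    # output: fully connected to H3
# 64660

# Independent parameters: kernels are shared within a feature map, biases are not.
parameters = (12 * 5 * 5 + h1                       # H1 kernels + per-unit biases
              + 12 * 8 * 5 * 5 + h2                 # H2 kernels + per-unit biases
              + h3 * (h2 + 1)                       # H3
              + out * (h3 + 1))                     # output
# 9760
print(units, connections, parameters)               # 1256 64660 9760
</syntaxhighlight>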

The dataset (US Postal Service database) was 9298 grayscale images of resolution 16×16, digitized from handwritten zip codes that appeared on U.S. mail passing through the Buffalo, New York post office. The training set had 7291 data points, and the test set had 2007. Both training and test sets contained ambiguous, unclassifiable, and misclassified data. The task was rather difficult: on the test set, two humans made errors at an average rate of 2.5%.{{Cite journal |last1=Simard |first1=Patrice |last2=LeCun |first2=Yann |last3=Denker |first3=John |date=1992 |title=Efficient Pattern Recognition Using a New Transformation Distance |url=https://proceedings.neurips.cc/paper_files/paper/1992/hash/26408ffa703a72e8ac0117e74ad46f33-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=5}}

Training took 3 days on a Sun workstation.

Compared to the previous 1988 architecture, there was no skeletonization, and the convolutional kernels were learned automatically by backpropagation.

= 1990 LeNet =

A later version of the 1989 LeNet has four hidden layers (H1-H4) and an output layer. It takes a 28x28 pixel image as input, though the active region is 16x16 to avoid boundary effects (a simplified code sketch follows the layer list).{{Cite book |last1=Le Cun |first1=Y. |last2=Matan |first2=O. |last3=Boser |first3=B. |last4=Denker |first4=J.S. |last5=Henderson |first5=D. |last6=Howard |first6=R.E. |last7=Hubbard |first7=W. |last8=Jacket |first8=L.D. |last9=Baird |first9=H.S. |chapter=Handwritten zip code recognition with multilayer networks |date=1990 |title=[1990] Proceedings. 10th International Conference on Pattern Recognition |chapter-url=https://ieeexplore.ieee.org/document/119325 |publisher=IEEE Comput. Soc. Press |volume=ii |pages=35–40 |doi=10.1109/ICPR.1990.119325 |isbn=978-0-8186-2062-1}}

  • H1 (Convolutional): 28 \times 28 \to 4 \times 24 \times 24 with 5\times 5 kernels. This layer has 104 trainable parameters (100 from kernels, 4 from biases).
  • H2 (Pooling): 4 \times 24 \times 24 \to 4 \times 12 \times 12 by 2\times 2 average pooling.
  • H3 (Convolutional): 4 \times 12 \times 12 \to 12 \times 8 \times 8 with 5\times 5 kernels. Some kernels take input from 1 feature map, while others take inputs from 2 feature maps.
  • H4 (Pooling): 12 \times 8 \times 8 \to 12 \times 4 \times 4 by 2\times 2 average pooling.
  • Output: 10 units fully connected to H4, representing the 10 digit classes (0-9).
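
The following is a minimal PyTorch sketch of this network, written for illustration only (not the original code). It assumes plain average pooling, tanh activations, and full connectivity between H2 and H3, whereas the original uses trainable subsampling coefficients and a sparse H2-to-H3 connection scheme, so the parameter count differs from the 2578 quoted below.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LeNet1(nn.Module):
    """Simplified sketch of the 1990 LeNet (LeNet-1)."""
    def __init__(self):
        super().__init__()
        self.h1 = nn.Conv2d(1, 4, kernel_size=5)   # H1: 1x28x28 -> 4x24x24
        self.h2 = nn.AvgPool2d(2)                  # H2: 4x24x24 -> 4x12x12
        self.h3 = nn.Conv2d(4, 12, kernel_size=5)  # H3: 4x12x12 -> 12x8x8 (full connectivity assumed)
        self.h4 = nn.AvgPool2d(2)                  # H4: 12x8x8  -> 12x4x4
        self.out = nn.Linear(12 * 4 * 4, 10)       # output: fully connected
        self.act = nn.Tanh()

    def forward(self, x):
        x = self.h2(self.act(self.h1(x)))
        x = self.h4(self.act(self.h3(x)))
        return self.out(torch.flatten(x, 1))

print(LeNet1()(torch.randn(1, 1, 28, 28)).shape)   # torch.Size([1, 10])
</syntaxhighlight>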

The network has 4635 units, 98442 connections, and 2578 trainable parameters. It was derived from a previous CNN{{Cite book |last1=Le Cun |first1=Y. |last2=Jackel |first2=L. D. |last3=Boser |first3=B. |last4=Denker |first4=J. S. |last5=Graf |first5=H. P. |last6=Guyon |first6=I. |last7=Henderson |first7=D. |last8=Howard |first8=R. E. |last9=Hubbard |first9=W. |chapter=Handwritten Digit Recognition: Applications of Neural Net Chips and Automatic Learning |date=1990 |editor-last=Soulié |editor-first=Françoise Fogelman |editor2-last=Hérault |editor2-first=Jeanny |title=Neurocomputing |chapter-url=https://link.springer.com/chapter/10.1007/978-3-642-76153-9_35 |language=en |location=Berlin, Heidelberg |publisher=Springer |pages=303–318 |doi=10.1007/978-3-642-76153-9_35 |isbn=978-3-642-76153-9}} with 4 times as many trainable parameters, which was then pruned using Optimal Brain Damage.{{Cite journal |last1=LeCun |first1=Yann |last2=Denker |first2=John |last3=Solla |first3=Sara |date=1989 |title=Optimal Brain Damage |url=https://proceedings.neurips.cc/paper/1989/hash/6c9882bbac1c7093bd25041881277658-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=2}} One forward pass requires about 140,000 multiply-add operations. Its size is 50 kB in memory. It was also called LeNet-1. On a SPARCstation 10, it took 0.5 weeks to train, and 0.015 seconds to classify one image.

= 1994 LeNet =

1994 LeNet was a larger version of the 1989 LeNet, designed to fit the larger MNIST database. It was also called LeNet-4. It had more feature maps in its convolutional layers, and an additional layer of hidden units fully connected to both the last convolutional layer and to the output units. It has 2 convolutions, 2 average poolings, and 2 fully connected layers, and about 17000 trainable parameters.

One forward pass requires about 260,000 multiply-adds. Its size is 60 kB in memory. On a SPARCstation 10, it took 2 weeks to train, and 0.03 seconds to classify one image.

= 1998 LeNet =

File:LeNet-5_architecture_block_diagram.svg
File:LeNet_architecture.png

1998 LeNet is similar to 1994 LeNet, but with more fully connected layers. Its architecture is shown in the image on the right. It has 2 convolutions, 2 subsamplings, and 3 fully connected layers. It is usually called LeNet-5. It has around 60000 trainable parameters.

Specifically, it has the following layers (a simplified code sketch follows the list):

  • Input: (Implicitly 1 \times 32 \times 32 pixel image)
  • C1 (Convolutional): 1 \times 32 \times 32 \to 6 \times 28 \times 28 with 5\times 5 kernels, with 156 trainable parameters.
  • S2 (Subsampling): 6 \times 28 \times 28 \to 6 \times 14 \times 14 with 2\times 2 pooling, with 12 trainable parameters.
  • C3 (Convolutional): 6 \times 14 \times 14 \to 16 \times 10 \times 10 with k \times 5\times 5 kernels for various values of k (see Table below for exact connections in this layer), with 1516 trainable parameters.
  • S4 (Subsampling): 16 \times 10 \times 10 \to 16 \times 5 \times 5 with 2\times 2 pooling, with 32 trainable parameters.
  • C5 (Convolutional): 16 \times 5 \times 5 \to 120 \times 1 \times 1 with 16 \times 5\times 5 kernels, with 48120 trainable parameters.
  • F6: 84 units fully connected to C5, with 10164 trainable parameters.
  • Output: 10 units (RBF) fully connected to F6, representing the 10 digit classes.
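
The following is a minimal PyTorch sketch of this architecture, written for illustration only (it is not the original implementation). It substitutes ordinary average pooling, plain tanh activations, full S2-to-C3 connectivity, and a linear output layer for the trainable subsampling, scaled tanh, sparse connections, and RBF units described below, so its parameter count differs slightly from the figures above.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Simplified sketch of the 1998 LeNet (LeNet-5)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # C1: 1x32x32 -> 6x28x28
            nn.AvgPool2d(2),                               # S2: 6x28x28 -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # C3: 6x14x14 -> 16x10x10
            nn.AvgPool2d(2),                               # S4: 16x10x10 -> 16x5x5
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # C5: 16x5x5 -> 120x1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),                 # F6
            nn.Linear(84, 10),                             # output (RBF units in the original)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)           # torch.Size([1, 10])
</syntaxhighlight>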

Each of the "subsampling" layers is not precisely an average pooling, but with trainable parameters. Specifically, consider a single cell in S2. It takes as input 4 cells in C1. Let these have values x_1, \dots, x_4, then the cell in S2 has value \sigma \left(w \left(\sum_{i=1}^4 x_i \right) + b \right) , where w, b \in \R are trainable parameters, and \sigma is a sigmoidal activation function, sometimes called "LeCun's tanh". It is a scaled version of the hyperbolic tangent activation function: 1.7159 \tanh(2x/3). It was designed so that it maps the interval [-1, +1] to itself, thus ensuring that the overall gain is around 1 in "normal operating conditions", and that |f''(x)| is at maximum when x = -1, +1, which improves convergence at the end of training.{{cite book |last1=LeCun |first1=Y. |title=Connectionism in Perspective: Proceedings of the International Conference Connectionism in Perspective, University of Zurich, 10–13 October 1988 |date=1989 |publisher=Elsevier |editor-last1=Pfeifer |editor-first1=R. |location=Amsterdam |chapter=Generalization and network design strategies |editor-last2=Schreter |editor-first2=Z. |editor-last3=Fogelman |editor-first3=F. |editor-last4=Steels |editor-first4=L. |chapter-url=https://masters.donntu.ru/2012/fknt/umiarov/library/lecun.pdf}}{{Citation |last1=LeCun |first1=Yann |title=Efficient BackProp |date=1998 |work=Neural Networks: Tricks of the Trade |pages=9–50 |editor-last=Orr |editor-first=Genevieve B. |url=https://link.springer.com/chapter/10.1007/3-540-49430-8_2 |access-date=2024-10-05 |place=Berlin, Heidelberg |publisher=Springer |language=en |doi=10.1007/3-540-49430-8_2 |isbn=978-3-540-49430-0 |last2=Bottou |first2=Leon |last3=Orr |first3=Genevieve B. |last4=Müller |first4=Klaus -Robert |editor2-last=Müller |editor2-first=Klaus-Robert}}

In the following table, each column indicates which of the 6 feature maps in S2 are combined by the units in each of the 16 feature maps of C3.

class="wikitable"
! 0

! 1

! 2

! 3

! 4

! 5

! 6

! 7

! 8

! 9

! 10

! 11

! 12

! 13

! 14

!15

0

| X

|

|

|

| X

| X

| X

|

|

| X

| X

| X

| X

|

| X

|X

1

| X

| X

|

|

|

| X

| X

| X

|

|

| X

| X

| X

| X

|

|X

2

| X

| X

| X

|

|

|

| X

| X

| X

|

|

| X

|

| X

| X

|X

3

|

| X

| X

| X

|

|

| X

| X

| X

| X

|

|

| X

|

| X

|X

4

|

|

| X

| X

| X

|

|

| X

| X

| X

| X

|

| X

| X

|

|X

5

|

|

|

| X

| X

| X

|

|

| X

| X

| X

| X

|

| X

| X

|X
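
The connection scheme in the table can be written down as a boolean mask. The following sketch (an illustration, not original code) encodes it with rows indexed by the S2 feature maps and columns by the C3 feature maps, and checks that it reproduces the 1516 trainable parameters quoted for C3:

<syntaxhighlight lang="python">
# Rows: the 6 feature maps of S2. Columns: the 16 feature maps of C3.
S2_TO_C3 = [
    [1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1],
]

# Each C3 feature map has one 5x5 kernel per connected S2 map, plus a single bias.
params = sum(sum(column) * 5 * 5 + 1 for column in zip(*S2_TO_C3))
print(params)  # 1516
</syntaxhighlight>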

Even though C5 has output shape 120 \times 1 \times 1 , it is not a fully connected layer, because the network is designed to be able to take in input shapes of arbitrary height and width, much larger than the 1 \times 32 \times 32 that the network is trained on. In those cases, the output shape of C5 would be larger. Similarly, the output of F6 would also be larger than 84 \times 1 \times 1. Indeed, in modern language, the layer F6 is better described as a 1 \times 1 convolution.{{cite arXiv | eprint=1412.6806 | author1=Jost Tobias Springenberg | last2=Dosovitskiy | first2=Alexey | last3=Brox | first3=Thomas | last4=Riedmiller | first4=Martin | title=Striving for Simplicity: The All Convolutional Net | date=2014 | class=cs.LG }}

The output of the convolutional part of the network has 84 neurons, and this is not a coincidence: 84 = 7×12, so the output of the network can be viewed as a small 7×12 grayscale image.

The output layer has RBF units, similar to RBF networks. Each of the 10 units has 84 parameters, which might either be hand-designed and fixed, or trained. When hand-designed, it was designed so that, when viewed as a 7×12 grayscale image, it looks like the digit to be recognized.
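
Concretely, writing x_1, \dots, x_{84} for the F6 activations and w_{i1}, \dots, w_{i,84} for the parameters of output unit i, each RBF unit computes a squared Euclidean distance, y_i = \sum_{j=1}^{84} (x_j - w_{ij})^2 , so the predicted class corresponds to the smallest output value.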

1998 LeNet was trained with a stochastic Levenberg–Marquardt algorithm using a diagonal approximation of the Hessian. It was trained for about 20 epochs over MNIST. Training took 2 to 3 days of CPU time on a Silicon Graphics Origin 2000 server, using a single 200 MHz R10000 processor.

= LeNet7 =

A certain "LeNet7" was mentioned in 2005. It was benchmarked on the NYU Object Recognition Benchmark (NORB) as being superior to SVM. It had 90,857 trainable parameters and 4.66 million connections. One forward pass requires 3,896,920 multiply-adds. It was trained by the same method as 1998 LeNet, for about 250,000 parameter updates.{{Cite web |last=LeCun |first=Yann |title=NORB: Generic Object Recognition in Images |url=https://cs.nyu.edu/~yann/research/norb/ |access-date=2025-04-26 |website=cs.nyu.edu}}{{Cite book |last1=LeCun |first1=Y. |last2=Fu Jie Huang |last3=Bottou |first3=L. |chapter=Learning methods for generic object recognition with invariance to pose and lighting |date=2004 |title=Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004 |chapter-url=https://ieeexplore.ieee.org/document/1315150 |publisher=IEEE |volume=2 |pages=97–104 |doi=10.1109/CVPR.2004.1315150 |isbn=978-0-7695-2158-9}}

Application

Recognizing simple digit images is the most classic application of LeNet, since it was created for that purpose. After the development of the 1989 LeNet, as a demonstration of real-time application,{{Cite AV media |url=https://www.youtube.com/watch?v=FwFduRA_L6Q |title=Convolutional Network Demo from 1989 |date=2014-06-02 |last=Yann LeCun |access-date=2024-10-31 |via=YouTube}} the group loaded the neural network into an AT&T DSP-32C digital signal processor{{Cite journal |last1=Fuccio |first1=M.L. |last2=Gadenz |first2=R.N. |last3=Garen |first3=C.J. |last4=Huser |first4=J.M. |last5=Ng |first5=B. |last6=Pekarich |first6=S.P. |last7=Ulery |first7=K.D. |date=December 1988 |title=The DSP32C: AT&Ts second generation floating point digital signal processor |url=https://ieeexplore.ieee.org/document/16779 |journal=IEEE Micro |volume=8 |issue=6 |pages=30–48 |doi=10.1109/40.16779 |issn=0272-1732|url-access=subscription }} with a peak performance of 12.5 million multiply-add operations per second. It could normalize and classify 10 digits a second, or classify 30 already-normalized digits a second. Shortly afterwards, the research group started working with a development group and a product group at NCR (acquired by AT&T in 1991). This resulted in ATMs that could read the numerical amounts on checks using a LeNet loaded on the DSP-32C.

NCR later deployed a similar system in large cheque-reading machines in bank back offices, starting in June 1996; as of 2001, it was estimated to read 20 million checks a day, or 10% of all the checks in the US.[https://leon.bottou.org/talks/gtn Graph Transformer Networks], presentation by Leon Bottou at ICML Workshop 2001. The system was a "graph transformer", with its main component being the LeNet as reported in 1998, with ~60000 trainable parameters. According to a draft report, the system was called HCAR50 (Holmdel Courtesy Amount Reader).{{NoteTag|The "courtesy amount" is the value of the check written in numerals, either handwritten or machine-printed. The "legal amount" is the value of the check written long-form in words.}} There were two previous versions, HCAR30 and HCAR40.{{cite book |last1=Bottou |first1=Léon |url=http://leon.bottou.org/papers/bottou-1996 |title=Document Analysis with Transducers |last2=Bengio |first2=Yoshua |last3=LeCun |first3=Yann |date=July 1996 |type=Technical Report}}{{cite book |last1=Haffner |first1=P. |title=The HCAR50 check amount reading system |last2=Bottou |first2=L. |last3=Bromley |first3=J. |last4=Burges |first4=C.J.C. |last5=Cauble |first5=T. |last6=Le Cun |first6=Y. |last7=Nohl |first7=C. |last8=Stanton |first8=C. |last9=Stenard |first9=C. |date=1996 |publisher=Lucent Technologies, Bell Labs Innovation |type=Technical Report |last10=Vincent |first10=P.}}

Subsequent work

LeNet-5 marked the emergence of CNNs and defined their basic components. However, it was not popular at the time because of the lack of suitable hardware, especially GPUs, and because alternative algorithms, such as the SVM, could achieve similar or even better results.

Since the success of AlexNet in 2012, CNNs have become the dominant choice for computer vision applications, and many different types of CNN have been created, such as the R-CNN series. Today's CNN models are quite different from LeNet, but they were all developed on its basis.

A three-layer tree architecture imitating LeNet-5 and consisting of only one convolutional layer has achieved a similar success rate on the CIFAR-10 dataset.{{Cite journal |last1=Meir |first1=Yuval |last2=Ben-Noam |first2=Itamar |last3=Tzach |first3=Yarden |last4=Hodassman |first4=Shiri |last5=Kanter |first5=Ido |date=2023-01-30 |title=Learning on tree architectures outperforms a convolutional feedforward network |journal=Scientific Reports |language=en |volume=13 |issue=1 |pages=962 |doi=10.1038/s41598-023-27986-6 |issn=2045-2322 |pmc=9886946 |pmid=36717568|bibcode=2023NatSR..13..962M }}

Increasing the number of filters for the LeNet architecture results in a power law decay of the error rate. These results indicate that a shallow network can achieve the same performance as deep learning architectures.{{Cite journal |last1=Meir |first1=Yuval |last2=Tevet |first2=Ofek |last3=Tzach |first3=Yarden |last4=Hodassman |first4=Shiri |last5=Gross |first5=Ronit D. |last6=Kanter |first6=Ido |date=2023-04-20 |title=Efficient shallow learning as an alternative to deep learning |journal=Scientific Reports |language=en |volume=13 |issue=1 |pages=5423 |doi=10.1038/s41598-023-32559-8 |issn=2045-2322 |pmc=10119101 |pmid=37080998|arxiv=2211.11106 |bibcode=2023NatSR..13.5423M }}

References

{{reflist}}{{reflist|group=note}}