Pooling layer

{{Short description|Architectural motif in neural networks for aggregating information.}}

In neural networks, a pooling layer is a kind of network layer that downsamples and aggregates information that is dispersed among many vectors into fewer vectors.{{Cite book |last1=Zhang |first1=Aston |title=Dive into deep learning |last2=Lipton |first2=Zachary |last3=Li |first3=Mu |last4=Smola |first4=Alexander J. |date=2024 |publisher=Cambridge University Press |isbn=978-1-009-38943-3 |location=Cambridge New York Port Melbourne New Delhi Singapore |chapter=7.5. Pooling |chapter-url=https://d2l.ai/chapter_convolutional-neural-networks/pooling.html}} It has several uses: it removes redundant information, reducing the amount of computation and memory required; it makes the model more robust to small variations in the input; and it increases the receptive field of neurons in later layers of the network.

Convolutional neural network pooling

Pooling is most commonly used in convolutional neural networks (CNN). Below is a description of pooling in 2-dimensional CNNs. The generalization to n-dimensions is immediate.

As notation, we consider a tensor x \in \R^{H \times W \times C}, where H is height, W is width, and C is the number of channels. A pooling layer outputs a tensor y \in \R^{H' \times W' \times C'}.

We define two variables f, s called "filter size" (aka "kernel size") and "stride". Sometimes, it is necessary to use a different filter size and stride for horizontal and vertical directions. In such cases, we define 4 variables f_H, f_W, s_H, s_W.

The receptive field of an entry in the output tensor y is the set of all entries in x that can affect that entry.

= Max pooling =

File:Convolutional_neural_network,_maxpooling.png

Max Pooling (MaxPool) is commonly used in CNNs to reduce the spatial dimensions of feature maps.

Define
:\mathrm{MaxPool}(x | f, s)_{0, 0, 0} = \max(x_{0:f-1, 0:f-1, 0})
where 0:f-1 denotes the index range 0, 1, \dots, f-1 (inclusive, to avoid an off-by-one error). The next entry is
:\mathrm{MaxPool}(x | f, s)_{1, 0, 0} = \max(x_{s:s+f-1, 0:f-1, 0})
and so on. The receptive field of y_{i, j, c} is x_{is:is+f-1, js:js+f-1, c}, so in general
:\mathrm{MaxPool}(x | f, s)_{i,j,c} = \max(x_{is:is+f-1, js:js+f-1, c})
If the horizontal and vertical filter sizes and strides differ, then
:\mathrm{MaxPool}(x | f, s)_{i,j,c} = \max(x_{is_H:is_H+f_H-1, js_W:js_W+f_W-1, c})
More succinctly, we can write y_k = \max(\{x_{k'} \mid k' \text{ in the receptive field of } k\}).

File:Convolutional_neural_network,_boundary_conditions.png

If H is not expressible as ks + f where k is an integer, then computing the entries of the output tensor near the boundary would require max pooling to read entries that lie outside the tensor. How these non-existent entries are handled depends on the padding condition, illustrated on the right.

Global Max Pooling (GMP) is a specific kind of max pooling where the output tensor is in \R^{C} and the receptive field of y_c is all of x_{0:H-1, 0:W-1, c}. That is, it takes the maximum over each entire channel. It is often used just before the final fully connected layers in a CNN classification head.
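For concreteness, the following is a minimal NumPy sketch of 2-dimensional max pooling and global max pooling; the function names and the choice of "valid" boundary handling (windows that would extend past the edge are simply dropped) are illustrative assumptions, not a reference implementation.

<syntaxhighlight lang="python">
import numpy as np

def max_pool(x, f, s):
    """Max pooling of a (H, W, C) tensor with filter size f and stride s.
    Valid boundary handling: windows extending past the edge are dropped."""
    H, W, C = x.shape
    H_out = (H - f) // s + 1
    W_out = (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            # receptive field of y[i, j, :] is x[i*s : i*s+f, j*s : j*s+f, :]
            y[i, j, :] = x[i*s:i*s+f, j*s:j*s+f, :].max(axis=(0, 1))
    return y

def global_max_pool(x):
    """Global max pooling: one maximum per channel, output shape (C,)."""
    return x.max(axis=(0, 1))

x = np.arange(6 * 6 * 3, dtype=float).reshape(6, 6, 3)
print(max_pool(x, f=2, s=2).shape)   # (3, 3, 3)
print(global_max_pool(x).shape)      # (3,)
</syntaxhighlight>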

= Average pooling =

Average pooling (AvgPool) is similarly defined:
:\mathrm{AvgPool}(x | f, s)_{i,j,c} = \mathrm{average}(x_{is:is+f-1, js:js+f-1, c}) = \frac{1}{f^2} \sum_{k \in is:is+f-1} \sum_{l \in js:js+f-1} x_{k, l, c}

Global Average Pooling (GAP) is defined similarly to GMP. It was first proposed in Network-in-Network.{{Cite arXiv |eprint=1312.4400 |class=cs.NE |first1=Min |last1=Lin |first2=Qiang |last2=Chen |title=Network In Network |date=2013 |last3=Yan |first3=Shuicheng}} Similarly to GMP, it is often used just before the final fully connected layers in a CNN classification head.
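A corresponding NumPy sketch of average pooling and global average pooling, under the same illustrative conventions (valid boundaries, hypothetical function names) as the max-pooling sketch above:

<syntaxhighlight lang="python">
import numpy as np

def avg_pool(x, f, s):
    """Average pooling of a (H, W, C) tensor with filter size f and stride s."""
    H, W, C = x.shape
    H_out = (H - f) // s + 1
    W_out = (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=float)
    for i in range(H_out):
        for j in range(W_out):
            # mean over the f-by-f receptive field, separately per channel
            y[i, j, :] = x[i*s:i*s+f, j*s:j*s+f, :].mean(axis=(0, 1))
    return y

def global_avg_pool(x):
    """Global average pooling: one mean per channel, output shape (C,)."""
    return x.mean(axis=(0, 1))
</syntaxhighlight>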

= Interpolations =

There are several poolings that interpolate between max pooling and average pooling.

Mixed Pooling is a linear combination of max pooling and average pooling.{{Cite book |last1=Yu |first1=Dingjun |title=Rough Sets and Knowledge Technology |last2=Wang |first2=Hanli |last3=Chen |first3=Peiqiu |last4=Wei |first4=Zhihua |date=2014 |publisher=Springer International Publishing |isbn=978-3-319-11740-9 |editor-last=Miao |editor-first=Duoqian |series=Lecture Notes in Computer Science |volume=8818 |location=Cham |pages=364–375 |chapter=Mixed Pooling for Convolutional Neural Networks |doi=10.1007/978-3-319-11740-9_34 |editor2-last=Pedrycz |editor2-first=Witold |editor3-last=Ślȩzak |editor3-first=Dominik |editor4-last=Peters |editor4-first=Georg |editor5-last=Hu |editor5-first=Qinghua |editor6-last=Wang |editor6-first=Ruizhi |chapter-url=https://link.springer.com/chapter/10.1007/978-3-319-11740-9_34}} That is,

:\mathrm{MixedPool}(x | f, s, w) = w\, \mathrm{MaxPool}(x | f, s) + (1-w)\, \mathrm{AvgPool}(x | f, s)
where w \in [0, 1] is either a hyperparameter, a learnable parameter, or randomly sampled anew every time.

Lp Pooling is like average pooling, but uses the L^p norm average instead of the plain average:
:y_k = \left(\frac 1N \sum_{k' \text{ in the receptive field of } k} |x_{k'}|^p\right)^{1/p}
where N is the size of the receptive field, and p \geq 1 is a hyperparameter. If all activations are non-negative, then average pooling is the case p = 1, and max pooling is the case p \to \infty. Square-root pooling is the case p = 2.{{Cite journal |last1=Boureau |first1=Y-Lan |last2=Ponce |first2=Jean |last3=LeCun |first3=Yann |date=2010-06-21 |title=A theoretical analysis of feature pooling in visual recognition |url=https://dl.acm.org/doi/abs/10.5555/3104322.3104338 |journal=Proceedings of the 27th International Conference on International Conference on Machine Learning |series=ICML'10 |location=Madison, WI, USA |publisher=Omnipress |pages=111–118 |isbn=978-1-60558-907-7}}

Stochastic pooling samples a random activation x_{k'} from the receptive field with probability \frac{x_{k'}}{\sum_{k} x_{k}}, where the sum runs over the receptive field. In expectation, its output is the activation-weighted average \frac{\sum_{k'} x_{k'}^2}{\sum_{k} x_{k}}, rather than the plain average.{{cite arXiv |last1=Zeiler |first1=Matthew D. |title=Stochastic Pooling for Regularization of Deep Convolutional Neural Networks |date=2013-01-15 |eprint=1301.3557 |last2=Fergus |first2=Rob|class=cs.LG }}

Softmax pooling is like max pooling, but uses a softmax-weighted average:
:\frac{\sum_{k'} e^{\beta x_{k'}} x_{k'}}{\sum_{k} e^{\beta x_{k}}}
where \beta > 0. Average pooling is the limit \beta \downarrow 0, and max pooling is the limit \beta \uparrow \infty.
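The interpolating poolings above can be sketched as NumPy functions acting on the flattened activations of a single receptive field; the function names are illustrative, and mixed pooling is shown with a fixed weight w rather than a learned or randomly sampled one.

<syntaxhighlight lang="python">
import numpy as np

def mixed_pool(window, w=0.5):
    """Mixed pooling: linear combination of max and average pooling, w in [0, 1]."""
    return w * window.max() + (1 - w) * window.mean()

def lp_pool(window, p=2):
    """Lp pooling: the Lp-norm average. For non-negative activations, p=1 is average
    pooling, p -> infinity approaches max pooling, and p=2 is square-root pooling."""
    return np.mean(np.abs(window) ** p) ** (1.0 / p)

def stochastic_pool(window, rng=None):
    """Stochastic pooling: sample one activation with probability proportional to
    its value (assumes non-negative activations)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(window, p=window / window.sum())

def softmax_pool(window, beta=1.0):
    """Softmax pooling: softmax-weighted average. beta -> 0 gives average pooling,
    beta -> infinity approaches max pooling."""
    weights = np.exp(beta * (window - window.max()))  # shift by max for numerical stability
    return np.sum(weights * window) / np.sum(weights)

window = np.array([0.1, 0.5, 0.2, 0.9])  # activations in one receptive field
print(mixed_pool(window), lp_pool(window), softmax_pool(window, beta=5.0))
</syntaxhighlight>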

Local Importance-based Pooling generalizes softmax pooling by
:\frac{\sum_{k'} e^{g(x_{k'})} x_{k'}}{\sum_{k} e^{g(x_{k})}}
where g is a learnable function.{{Cite journal |last1=Gao |first1=Ziteng |last2=Wang |first2=Limin |last3=Wu |first3=Gangshan |date=2019 |title=LIP: Local Importance-Based Pooling |url=https://openaccess.thecvf.com/content_ICCV_2019/html/Gao_LIP_Local_Importance-Based_Pooling_ICCV_2019_paper.html |pages=3355–3364|arxiv=1908.04156 }}

File:RoI_pooling_animated.gif

= Other poolings =

Spatial pyramid pooling applies max pooling (or any other form of pooling) in a pyramid structure: it applies global max pooling, then max pooling to the image divided into 4 equal parts, then 16, and so on. The results are then concatenated. It is a hierarchical form of global pooling, and like global pooling, it is often used just before a classification head.{{Cite journal |last1=He |first1=Kaiming |last2=Zhang |first2=Xiangyu |last3=Ren |first3=Shaoqing |last4=Sun |first4=Jian |date=2015-09-01 |title=Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition |url=https://ieeexplore.ieee.org/document/7005506 |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=37 |issue=9 |pages=1904–1916 |arxiv=1406.4729 |doi=10.1109/TPAMI.2015.2389824 |issn=0162-8828 |pmid=26353135}}
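A minimal NumPy sketch of spatial pyramid pooling over a single-channel feature map, assuming max pooling at pyramid levels 1×1, 2×2 and 4×4; the level choice and function name are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def spatial_pyramid_pool(x, levels=(1, 2, 4)):
    """Spatial pyramid pooling of a (H, W) feature map: for each pyramid level n,
    max-pool over an n-by-n grid of roughly equal cells, then concatenate the results."""
    H, W = x.shape
    outputs = []
    for n in levels:
        row_edges = np.linspace(0, H, n + 1).astype(int)
        col_edges = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = x[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
                outputs.append(cell.max())
    return np.array(outputs)  # length 1 + 4 + 16 = 21 for levels (1, 2, 4)

x = np.random.rand(13, 17)            # arbitrary spatial size
print(spatial_pyramid_pool(x).shape)  # (21,) regardless of H and W
</syntaxhighlight>

The fixed output length, independent of the input's spatial size, is what makes it usable directly before a classification head.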

Region of Interest Pooling (also known as RoI pooling) is a variant of max pooling used in R-CNNs for object detection.{{Cite book |last1=Zhang |first1=Aston |title=Dive into deep learning |last2=Lipton |first2=Zachary |last3=Li |first3=Mu |last4=Smola |first4=Alexander J. |date=2024 |publisher=Cambridge University Press |isbn=978-1-009-38943-3 |location=Cambridge New York Port Melbourne New Delhi Singapore |chapter=14.8. Region-based CNNs (R-CNNs) |chapter-url=https://d2l.ai/chapter_computer-vision/rcnn.html}} It is designed to take an arbitrarily-sized input matrix and output a fixed-size output matrix.

Covariance pooling computes the covariance matrix of the vectors \{x_{k, l, 0:C-1}\}_{k \in is:is+f-1,\ l \in js:js+f-1}, which is then flattened to a C^2-dimensional vector y_{i,j, 0:C^2-1}. Global covariance pooling is used similarly to global max pooling. Since the average is a first-order statistic and the covariance is a second-order statistic, covariance pooling is also called "second-order pooling". It can be generalized to higher-order poolings.{{cite book |last1=Tuzel |first1=Oncel |chapter=Region Covariance: A Fast Descriptor for Detection and Classification |date=2006 |title=Computer Vision – ECCV 2006 |volume=3952 |pages=589–600 |editor-last=Leonardis |editor-first=Aleš |chapter-url=http://link.springer.com/10.1007/11744047_45 |access-date=2024-09-09 |place=Berlin, Heidelberg |publisher=Springer Berlin Heidelberg |doi=10.1007/11744047_45 |isbn=978-3-540-33834-5 |last2=Porikli |first2=Fatih |last3=Meer |first3=Peter |editor2-last=Bischof |editor2-first=Horst |editor3-last=Pinz |editor3-first=Axel}}{{Cite journal |last1=Wang |first1=Qilong |last2=Xie |first2=Jiangtao |last3=Zuo |first3=Wangmeng |last4=Zhang |first4=Lei |last5=Li |first5=Peihua |date=2020 |title=Deep CNNs Meet Global Covariance Pooling: Better Representation and Generalization |url=https://ieeexplore.ieee.org/document/9001240 |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=43 |issue=8 |pages=2582–2597 |doi=10.1109/TPAMI.2020.2974833 |pmid=32086198 |issn=0162-8828|arxiv=1904.06836 }}
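A sketch of global covariance pooling in NumPy (the function name is illustrative); pooling over a local window applies the same computation to each receptive field instead of the whole feature map.

<syntaxhighlight lang="python">
import numpy as np

def global_covariance_pool(x):
    """Global covariance pooling of a (H, W, C) feature map: treat the H*W spatial
    positions as samples of C-dimensional feature vectors, compute their C-by-C
    covariance matrix, and flatten it to a C^2-dimensional vector."""
    H, W, C = x.shape
    samples = x.reshape(H * W, C)
    cov = np.cov(samples, rowvar=False)  # shape (C, C)
    return cov.reshape(-1)               # shape (C**2,)
</syntaxhighlight>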

Blur Pooling means applying a blurring filter before downsampling. For example, Rect-2 blur pooling means taking an average pooling with f = 2, s = 1, then subsampling every second pixel (identity pooling with s = 2).{{Cite journal |last=Zhang |first=Richard |date=2018-09-27 |title=Making Convolutional Networks Shift-Invariant Again |arxiv=1904.11486 |url=https://openreview.net/forum?id=SklVEnR5K7 }}

Vision Transformer pooling

In Vision Transformers (ViT), the following kinds of pooling are commonly used.

BERT-like pooling uses a dummy [CLS] token ("classification"). For classification, the output vector at the [CLS] position is processed by a LayerNorm-feedforward-softmax module into the network's predicted class probability distribution. This is the approach used by the original ViT{{cite arXiv |eprint=2010.11929 |class=cs.CV |first1=Alexey |last1=Dosovitskiy |first2=Lucas |last2=Beyer |title=An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale |date=2021-06-03 |last3=Kolesnikov |first3=Alexander |last4=Weissenborn |first4=Dirk |last5=Zhai |first5=Xiaohua |last6=Unterthiner |first6=Thomas |last7=Dehghani |first7=Mostafa |last8=Minderer |first8=Matthias |last9=Heigold |first9=Georg |last10=Gelly |first10=Sylvain |last11=Uszkoreit |first11=Jakob}} and the Masked Autoencoder.{{Cite book |last1=He |first1=Kaiming |last2=Chen |first2=Xinlei |last3=Xie |first3=Saining |last4=Li |first4=Yanghao |last5=Dollar |first5=Piotr |last6=Girshick |first6=Ross |chapter=Masked Autoencoders Are Scalable Vision Learners |date=June 2022 |title=2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |chapter-url=http://dx.doi.org/10.1109/cvpr52688.2022.01553 |pages=15979–15988 |publisher=IEEE |doi=10.1109/cvpr52688.2022.01553|arxiv=2111.06377 |isbn=978-1-6654-6946-3 }}

Global average pooling (GAP) does not use the dummy token, but simply takes the average of all output tokens as the classification token. It was mentioned in the original ViT as being equally good.

Multihead attention pooling (MAP) applies a multiheaded attention block to pooling. Specifically, it takes as input a list of vectors x_1, x_2, \dots, x_n, which might be thought of as the output vectors of a layer of a ViT. It then applies a feedforward layer \mathrm{FFN} to each vector, resulting in a matrix V = [\mathrm{FFN}(x_1), \dots, \mathrm{FFN}(x_n)]. This is then sent to a multiheaded attention, resulting in \mathrm{MultiheadedAttention}(Q, V, V), where Q is a matrix of trainable parameters. This was first proposed in the Set Transformer architecture.{{Cite journal |last1=Lee |first1=Juho |last2=Lee |first2=Yoonho |last3=Kim |first3=Jungtaek |last4=Kosiorek |first4=Adam |last5=Choi |first5=Seungjin |last6=Teh |first6=Yee Whye |date=2019-05-24 |title=Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks |url=https://proceedings.mlr.press/v97/lee19d.html |journal=Proceedings of the 36th International Conference on Machine Learning |publisher=PMLR |pages=3744–3753|arxiv=1810.00825 }}
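A simplified single-head sketch of attention pooling in NumPy, assuming a learnable query matrix Q and omitting the feedforward layer and the multi-head split for brevity; all names are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(V, Q):
    """Single-head attention pooling: V is the (n, d) matrix of token vectors and
    Q is a learnable (m, d) matrix of query vectors. Returns an (m, d) pooled output."""
    d = V.shape[1]
    scores = Q @ V.T / np.sqrt(d)     # (m, n) attention logits
    attn = softmax(scores, axis=-1)   # each query attends over the n tokens
    return attn @ V                   # (m, d)

tokens = np.random.randn(196, 64)       # e.g. the output tokens of a ViT layer
Q = np.random.randn(1, 64)              # one learnable query -> one pooled vector
print(attention_pool(tokens, Q).shape)  # (1, 64)
</syntaxhighlight>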

Later papers demonstrated that GAP and MAP both perform better than BERT-like pooling.{{Cite book |last1=Zhai |first1=Xiaohua |last2=Kolesnikov |first2=Alexander |last3=Houlsby |first3=Neil |last4=Beyer |first4=Lucas |chapter=Scaling Vision Transformers |date=June 2022 |title=2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |chapter-url=http://dx.doi.org/10.1109/cvpr52688.2022.01179 |pages=1204–1213 |publisher=IEEE |doi=10.1109/cvpr52688.2022.01179|arxiv=2106.04560 |isbn=978-1-6654-6946-3 }}{{cite arXiv |last1=Karamcheti |first1=Siddharth |title=Language-Driven Representation Learning for Robotics |date=2023-02-24 |eprint=2302.12766 |last2=Nair |first2=Suraj |last3=Chen |first3=Annie S. |last4=Kollar |first4=Thomas |last5=Finn |first5=Chelsea |last6=Sadigh |first6=Dorsa |last7=Liang |first7=Percy|class=cs.RO }}

Graph neural network pooling

{{Copyedit|date=September 2024|section|for=notations need to be explained and unified with the rest of the article}}

In graph neural networks (GNN), there are also two forms of pooling: global and local. Global pooling can be seen as a local pooling whose receptive field is the entire graph.

  1. Local pooling: a local pooling layer coarsens the graph via downsampling. Local pooling is used to increase the receptive field of a GNN, in a similar fashion to pooling layers in convolutional neural networks. Examples include k-nearest neighbours pooling, top-k pooling,{{cite arXiv |last1=Gao |first1=Hongyang |last2=Ji |first2=Shuiwang Ji |title=Graph U-Nets |date=2019 |class=cs.LG |eprint=1905.05178}} and self-attention pooling.
  2. Global pooling: a global pooling layer, also known as a readout layer, provides a fixed-size representation of the whole graph. The global pooling layer must be permutation invariant, so that permutations in the ordering of graph nodes and edges do not alter the final output. Examples include element-wise sum, mean, or maximum.

Local pooling layers coarsen the graph via downsampling. We present here several learnable local pooling strategies that have been proposed.{{cite arXiv |last1=Liu |first1=Chuang |last2=Zhan |first2=Yibing |last3=Li |first3=Chang |last4=Du |first4=Bo |last5=Wu |first5=Jia |last6=Hu |first6=Wenbin |last7=Liu |first7=Tongliang |last8=Tao |first8=Dacheng |title=Graph Pooling for Graph Neural Networks: Progress, Challenges, and Opportunities |date=2022 |class=cs.LG |eprint=2204.07321}} In each case, the input is a graph represented by a matrix \mathbf{X} of node features and the graph adjacency matrix \mathbf{A}. The output is the new matrix \mathbf{X}' of node features and the new graph adjacency matrix \mathbf{A}'.

= Top-k pooling =

We first set

\mathbf{y} = \frac{\mathbf{X}\mathbf{p}}{\Vert\mathbf{p}\Vert}

where \mathbf{p} is a learnable projection vector. The projection vector \mathbf{p} computes a scalar projection value for each graph node.

The top-k pooling layer can then be formalised as follows:

:\mathbf{X}' = (\mathbf{X} \odot \text{sigmoid}(\mathbf{y}))_{\mathbf{i}}

:\mathbf{A}' = \mathbf{A}_{\mathbf{i}, \mathbf{i}}

where \mathbf{i} = \text{top}_k(\mathbf{y}) is the subset of nodes with the k highest projection scores, \odot denotes element-wise matrix multiplication, and \text{sigmoid}(\cdot) is the sigmoid function. In other words, the nodes with the k highest projection scores are retained in the new adjacency matrix \mathbf{A}'. The \text{sigmoid}(\cdot) gating makes the projection vector \mathbf{p} trainable by backpropagation; the top-k selection alone would produce only discrete, non-differentiable outputs.
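A NumPy sketch of the top-k pooling layer as formalised above; the function name is illustrative, and the learnable projection vector \mathbf{p} is passed in explicitly.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def top_k_pool(X, A, p, k):
    """Top-k graph pooling: X is the (n, d) node-feature matrix, A the (n, n)
    adjacency matrix, p the learnable (d,) projection vector, k the number of kept nodes."""
    y = X @ p / np.linalg.norm(p)           # projection score per node, shape (n,)
    idx = np.argsort(y)[-k:]                # indices of the k highest-scoring nodes
    X_new = (X * sigmoid(y)[:, None])[idx]  # gate the features, then keep selected nodes
    A_new = A[np.ix_(idx, idx)]             # adjacency induced on the kept nodes
    return X_new, A_new
</syntaxhighlight>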

= Self-attention pooling =

We first set

:\mathbf{y} = \text{GNN}(\mathbf{X}, \mathbf{A})

where \text{GNN} is a generic permutation equivariant GNN layer (e.g., GCN, GAT, MPNN).

The Self-attention pooling layer{{cite arXiv |last1=Lee |first1=Junhyun |last2=Lee |first2=Inyeop |last3=Kang |first3=Jaewoo |title=Self-Attention Graph Pooling |date=2019 |class=cs.LG |eprint=1904.08082}} can then be formalised as follows:

:\mathbf{X}' = (\mathbf{X} \odot \mathbf{y})_{\mathbf{i}}

:\mathbf{A}' = \mathbf{A}_{\mathbf{i}, \mathbf{i}}

where \mathbf{i} = \text{top}_k(\mathbf{y}) is the subset of nodes with the k highest projection scores, and \odot denotes element-wise matrix multiplication.

The self-attention pooling layer can be seen as an extension of the top-k pooling layer. Unlike top-k pooling, the self-attention scores computed in self-attention pooling account for both the graph features and the graph topology.
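A corresponding sketch of self-attention pooling, assuming for illustration that the scoring GNN is a single graph-convolution layer with normalised adjacency and a learnable weight vector w; any permutation-equivariant GNN layer could be substituted.

<syntaxhighlight lang="python">
import numpy as np

def self_attention_pool(X, A, w, k):
    """Self-attention graph pooling with a one-layer GCN as the scoring GNN (an
    illustrative choice): the scores depend on both the features X and the topology A."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                        # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    y = D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ w  # attention score per node, shape (n,)
    idx = np.argsort(y)[-k:]                     # keep the k highest-scoring nodes
    X_new = (X * y[:, None])[idx]                # gate the features by the scores
    A_new = A[np.ix_(idx, idx)]
    return X_new, A_new
</syntaxhighlight>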

History

In the early 20th century, neuroanatomists noticed a certain motif where multiple neurons synapse to the same neuron. This was given a functional explanation as "local pooling", which makes vision translation-invariant. (Hartline, 1940){{Cite journal |last=Hartline |first=H. K. |date=1940-09-30 |title=The Receptive Fields of Optic Nerve Fibers |url=https://www.physiology.org/doi/10.1152/ajplegacy.1940.130.4.690 |journal=American Journal of Physiology. Legacy Content |volume=130 |issue=4 |pages=690–699 |doi=10.1152/ajplegacy.1940.130.4.690 |issn=0002-9513}} gave supporting evidence for the theory with electrophysiological experiments on the receptive fields of retinal ganglion cells. The Hubel and Wiesel experiments showed that the vision system in cats is similar to a convolutional neural network, with some cells summing over inputs from the lower layer.{{Cite journal |last1=Hubel |first1=D. H. |last2=Wiesel |first2=T. N. |date=January 1962 |title=Receptive fields, binocular interaction and functional architecture in the cat's visual cortex |journal=The Journal of Physiology |volume=160 |issue=1 |pages=106–154.2 |doi=10.1113/jphysiol.1962.sp006837 |issn=0022-3751 |pmc=1359523 |pmid=14449617}}{{Pg|location=Fig. 19, 20}} See (Westheimer, 1965){{Cite journal |last=Westheimer |first=G |date=December 1965 |title=Spatial interaction in the human retina during scotopic vision. |journal=The Journal of Physiology |volume=181 |issue=4 |pages=881–894 |doi=10.1113/jphysiol.1965.sp007803 |issn=0022-3751 |pmc=1357689 |pmid=5881260}} for references to this early literature.

During the 1970s, to explain the effects of depth perception, some researchers, such as (Julesz and Chang, 1976),{{Cite journal |last1=Julesz |first1=Bela |last2=Chang |first2=Jih Jie |date=March 1976 |title=Interaction between pools of binocular disparity detectors tuned to different disparities |url=http://link.springer.com/10.1007/BF00320135 |journal=Biological Cybernetics |volume=22 |issue=2 |pages=107–119 |doi=10.1007/BF00320135 |pmid=1276243 |issn=0340-1200}} proposed that the vision system implements a disparity-selective mechanism by global pooling, where the outputs from matching pairs of retinal regions in the two eyes are pooled in higher-order cells. See {{Cite journal |last1=Schumer |first1=Robert |last2=Ganz |first2=Leo |date=1979-01-01 |title=Independent stereoscopic channels for different extents of spatial pooling |url=https://www.sciencedirect.com/science/article/abs/pii/0042698979902025 |journal=Vision Research |volume=19 |issue=12 |pages=1303–1314 |doi=10.1016/0042-6989(79)90202-5 |pmid=532098 |issn=0042-6989}} for references to this early literature.

In artificial neural networks, max pooling was used in 1990 for speech processing (1-dimensional convolution),{{cite conference |last1=Yamaguchi |first1=Kouichi |last2=Sakamoto |first2=Kenji |last3=Akabane |first3=Toshio |last4=Fujimoto |first4=Yoshiji |date=November 1990 |title=A Neural Network for Speaker-Independent Isolated Word Recognition |url=https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |conference=First International Conference on Spoken Language Processing (ICSLP 90) |location=Kobe, Japan |archive-url=https://web.archive.org/web/20210307233750/https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |archive-date=2021-03-07 |access-date=2019-09-04 |url-status=dead}} and was first used for image processing in the Cresceptron of 1992.{{Cite journal |last=Weng |first=J. |last2=Ahuja |first2=N. |last3=Huang |first3=T.S. |date=1992 |title=Cresceptron: a self-organizing neural network which grows adaptively |url=https://ieeexplore.ieee.org/document/287150/ |publisher=IEEE |volume=1 |pages=576–581 |doi=10.1109/IJCNN.1992.287150 |isbn=978-0-7803-0559-5}}

See also

References