Vision transformer

{{Short description|Machine learning model for vision processing}}

File:Vision_Transformer.png

A vision transformer (ViT) is a transformer designed for computer vision.{{cite arXiv |eprint=2010.11929 |class=cs.CV |first1=Alexey |last1=Dosovitskiy |first2=Lucas |last2=Beyer |title=An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale |date=2021-06-03 |last3=Kolesnikov |first3=Alexander |last4=Weissenborn |first4=Dirk |last5=Zhai |first5=Xiaohua |last6=Unterthiner |first6=Thomas |last7=Dehghani |first7=Mostafa |last8=Minderer |first8=Matthias |last9=Heigold |first9=Georg |last10=Gelly |first10=Sylvain |last11=Uszkoreit |first11=Jakob}} A ViT decomposes an input image into a series of patches (rather than text into tokens), serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

ViTs were designed as alternatives to convolutional neural networks (CNNs) in computer vision applications. They have different inductive biases, training stability, and data efficiency. Compared to CNNs, ViTs are less data efficient, but have higher capacity. Some of the largest modern computer vision models are ViTs, such as one with 22B parameters.{{Citation |last1=Dehghani |first1=Mostafa |title=Scaling Vision Transformers to 22 Billion Parameters |date=2023-02-10 |arxiv=2302.05442 |last2=Djolonga |first2=Josip |last3=Mustafa |first3=Basil |last4=Padlewski |first4=Piotr |last5=Heek |first5=Jonathan |last6=Gilmer |first6=Justin |last7=Steiner |first7=Andreas |last8=Caron |first8=Mathilde |last9=Geirhos |first9=Robert}}{{Cite web |title=Scaling vision transformers to 22 billion parameters |url=http://research.google/blog/scaling-vision-transformers-to-22-billion-parameters/ |access-date=2024-08-07 |website=research.google |language=en}}

Following its publication, many variants were proposed, including hybrid architectures that combine features of ViTs and CNNs. ViTs have found application in image recognition, image segmentation, weather prediction, and autonomous driving.{{Cite journal |last1=Han |first1=Kai |last2=Wang |first2=Yunhe |last3=Chen |first3=Hanting |last4=Chen |first4=Xinghao |last5=Guo |first5=Jianyuan |last6=Liu |first6=Zhenhua |last7=Tang |first7=Yehui |last8=Xiao |first8=An |last9=Xu |first9=Chunjing |last10=Xu |first10=Yixing |last11=Yang |first11=Zhaohui |last12=Zhang |first12=Yiman |last13=Tao |first13=Dacheng |date=2023-01-01 |title=A Survey on Vision Transformer |url=https://ieeexplore.ieee.org/document/9716741 |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=1 |pages=87–110 |doi=10.1109/TPAMI.2022.3152247 |pmid=35180075 |issn=0162-8828|arxiv=2012.12556 }}{{Cite journal |last1=Khan |first1=Salman |last2=Naseer |first2=Muzammal |last3=Hayat |first3=Munawar |last4=Zamir |first4=Syed Waqas |last5=Khan |first5=Fahad Shahbaz |last6=Shah |first6=Mubarak |date=2022-09-13 |title=Transformers in Vision: A Survey |url=https://doi.org/10.1145/3505244 |journal=ACM Comput. Surv. |volume=54 |issue=10s |pages=200:1–200:41 |doi=10.1145/3505244 |issn=0360-0300|arxiv=2101.01169 }}

History

Transformers were introduced in Attention Is All You Need (2017),{{cite journal |last1=Vaswani |first1=Ashish |author1-link= Ashish Vaswani |last2=Shazeer |first2=Noam |last3=Parmar |first3=Niki |last4=Uszkoreit |first4=Jakob |last5=Jones |first5=Llion |last6=Gomez |first6=Aidan N |author6-link= Aidan Gomez |last7=Kaiser |first7=Łukasz |last8=Polosukhin |first8=Illia |title=Attention is All you Need |journal=Advances in Neural Information Processing Systems |date=2017 |volume=30 |url=https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf |publisher=Curran Associates, Inc.}} and have found widespread use in natural language processing. A 2019 paper{{Cite journal |last1=Ramachandran |first1=Prajit |last2=Parmar |first2=Niki |last3=Vaswani |first3=Ashish |last4=Bello |first4=Irwan |last5=Levskaya |first5=Anselm |last6=Shlens |first6=Jon |date=2019 |title=Stand-Alone Self-Attention in Vision Models |url=https://proceedings.neurips.cc/paper/2019/hash/3416a75f4cea9109507cacd8e2f2aefc-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32|arxiv=1906.05909 }} applied ideas from the Transformer to computer vision. Specifically, the authors started with a ResNet, a standard convolutional neural network used for computer vision, and replaced all convolutional kernels with the self-attention mechanism found in a Transformer. This resulted in superior performance, but the model was not yet a vision transformer.

In 2020, an encoder-only Transformer was adapted for computer vision, yielding the ViT, which reached state-of-the-art performance in image classification, overcoming the previous dominance of CNNs. The masked autoencoder (2022) extended ViT to unsupervised (self-supervised) training. The vision transformer and the masked autoencoder, in turn, stimulated new developments in convolutional neural networks.{{Cite journal |last1=Liu |first1=Zhuang |last2=Mao |first2=Hanzi |last3=Wu |first3=Chao-Yuan |last4=Feichtenhofer |first4=Christoph |last5=Darrell |first5=Trevor |last6=Xie |first6=Saining |date=2022 |title=A ConvNet for the 2020s |url=https://openaccess.thecvf.com/content/CVPR2022/html/Liu_A_ConvNet_for_the_2020s_CVPR_2022_paper.html |language=en |pages=11976–11986|arxiv=2201.03545 }}{{Cite journal |last1=Woo |first1=Sanghyun |last2=Debnath |first2=Shoubhik |last3=Hu |first3=Ronghang |last4=Chen |first4=Xinlei |last5=Liu |first5=Zhuang |last6=Kweon |first6=In So |last7=Xie |first7=Saining |date=2023 |title=ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders |url=https://openaccess.thecvf.com/content/CVPR2023/html/Woo_ConvNeXt_V2_Co-Designing_and_Scaling_ConvNets_With_Masked_Autoencoders_CVPR_2023_paper.html |language=en |pages=16133–16142|arxiv=2301.00808 }}

Subsequently, there was cross-fertilization between the previous CNN approach and the ViT approach.

In 2021, several important variants of the vision transformer were proposed. These variants are mainly intended to be more efficient, more accurate, or better suited to a specific domain. Two studies{{Cite arXiv |eprint=2006.03677 |class=cs.CV |last1=Wu |first1=Bichen |last2=Xu |first2=Chenfeng |last3=Dai |first3=Xiaoliang |last4=Wan |first4=Alvin |last5=Zhang |first5=Peizhao |last6=Yan |first6=Zhicheng |last7=Tomizuka |first7=Masayoshi |last8=Gonzalez |first8=Joseph |last9=Keutzer |first9=Kurt |last10=Vajda |first10=Peter |title=Visual Transformers: Token-based Image Representation and Processing for Computer Vision |year=2020}} improved the efficiency and robustness of ViT by adding a CNN as a preprocessor. The Swin Transformer{{cite arXiv|last1=Liu|first1=Ze|last2=Lin|first2=Yutong|last3=Cao|first3=Yue|last4=Hu|first4=Han|last5=Wei|first5=Yixuan|last6=Zhang|first6=Zheng|last7=Lin|first7=Stephen|last8=Guo|first8=Baining|date=2021-03-25|title=Swin Transformer: Hierarchical Vision Transformer using Shifted Windows|class=cs.CV |eprint=2103.14030|language=en}} achieved state-of-the-art results on some object detection datasets such as COCO, by using convolution-like sliding windows of attention and the pyramidal processing of classical computer vision.

Overview

File:Vision_Transformer.svg

The basic architecture, used by the original 2020 paper, is as follows. In summary, it is a BERT-like encoder-only Transformer.

The input image is of type \R^{H\times W \times C}, where H, W, C are the height, width, and number of channels (e.g., 3 for RGB). It is then split into square patches, each of type \R^{P\times P \times C}.

Each patch is then flattened and pushed through a linear operator to obtain a vector (its "patch embedding"). The position of the patch is also transformed into a vector by "position encoding". The two vectors are added, and the resulting sequence is pushed through several Transformer encoder layers.
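
A minimal PyTorch sketch of this patch-embedding step; the 224x224 input, 16x16 patches, and 768-dimensional embedding follow common ViT configurations, and all names are illustrative:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

H = W = 224          # image height and width
C = 3                # RGB channels
P = 16               # patch size
D = 768              # embedding dimension
num_patches = (H // P) * (W // P)   # 14 * 14 = 196

# Split the image into non-overlapping P x P patches and flatten each one.
image = torch.randn(1, C, H, W)                       # (batch, C, H, W)
patches = image.unfold(2, P, P).unfold(3, P, P)       # (1, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, C * P * P)

# A single matrix multiplication maps each flattened patch to a D-dimensional vector.
patch_embed = nn.Linear(C * P * P, D)
pos_embed = nn.Parameter(torch.zeros(1, num_patches, D))  # learned position encoding

tokens = patch_embed(patches) + pos_embed             # (1, 196, D), fed to the encoder
</syntaxhighlight>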

The attention mechanism in a ViT repeatedly transforms representation vectors of image patches, incorporating more and more semantic relations between image patches in an image. This is analogous to how in natural language processing, as representation vectors flow through a transformer, they incorporate more and more semantic relations between words, from syntax to semantics.

The above architecture turns an image into a sequence of vector representations. To use these for downstream applications, an additional head needs to be trained to interpret them.

For example, to use it for classification, one can add a shallow MLP on top of it that outputs a probability distribution over classes. The original paper uses a linear-GeLU-linear-softmax network.
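
A minimal sketch of such a head in PyTorch, following the linear-GeLU-linear-softmax description above (the layer widths are illustrative assumptions):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

D, num_classes = 768, 1000        # encoder width and number of labels (illustrative)

# Shallow MLP head: linear -> GELU -> linear -> softmax over classes.
head = nn.Sequential(
    nn.Linear(D, D),
    nn.GELU(),
    nn.Linear(D, num_classes),
    nn.Softmax(dim=-1),
)

cls_vector = torch.randn(1, D)    # encoder output used for classification (e.g. the [CLS] token)
probs = head(cls_vector)          # (1, num_classes) probability distribution
</syntaxhighlight>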

Variants

= Original ViT =

The original ViT was an encoder-only Transformer trained with supervision to predict the image label from the patches of the image. As in BERT, a special [CLS] token is added on the input side, and the corresponding output vector is used as the only input to the final MLP head. The special token is an architectural hack that lets the model compress all information relevant for predicting the image label into one vector.

File:Vision Transformer.gif
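
A sketch of how the special token can be prepended to the patch embeddings (sizes and names are illustrative, not taken from the original implementation):

<syntaxhighlight lang="python">
import torch

D, num_patches = 768, 196
tokens = torch.randn(1, num_patches, D)                 # patch embeddings with positions added
cls_token = torch.nn.Parameter(torch.zeros(1, 1, D))    # learned [CLS] embedding

# Prepend the [CLS] token, giving a sequence of 1 + num_patches tokens.
x = torch.cat([cls_token.expand(tokens.shape[0], -1, -1), tokens], dim=1)  # (1, 197, D)
# After the encoder, the output at position 0 (the [CLS] slot) is the single vector
# passed to the classification head.
</syntaxhighlight>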

Transformers found their initial applications in natural language processing tasks, as demonstrated by language models such as BERT and GPT-3. By contrast the typical image processing system uses a convolutional neural network (CNN). Well-known projects include Xception, ResNet, EfficientNet,{{cite journal |last1=Tan |first1=Mingxing |last2=Le |first2=Quoc |date=23 June 2021 |title=EfficientNetV2: Smaller Models and Faster Training |url=https://proceedings.mlr.press/v139/tan21a/tan21a.pdf |journal=Proceedings of the 38th International Conference on Machine Learning (PMLR) |volume=139 |issue= |pages=10096–10106 |doi= |arxiv=2104.00298 |access-date=31 October 2023}} DenseNet,{{Cite arXiv |last1=Huang|first1=Gao|last2=Liu|first2=Zhuang|last3=van der Maaten|first3=Laurens|last4=Q. Weinberger|first4=Kilian|date= 28 Jan 2018|title=Densely Connected Convolutional Networks |class=cs.CV|eprint=1608.06993}} and Inception.{{Cite web |last=Sarkar |first=Arjun |date=2021-05-20 |title=Are Transformers better than CNN's at Image Recognition? |url=https://towardsdatascience.com/are-transformers-better-than-cnns-at-image-recognition-ced60ccc7c8 |access-date=2021-07-11 |website=Medium |language=en}}

Transformers measure the relationships between pairs of input tokens (words in the case of text strings), termed attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel, but computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT computes relationships among pixels within small sections of the image (e.g., 16x16 pixels), at a drastically reduced cost. Each section is flattened into a vector and multiplied by a learnable embedding matrix; the result, together with its positional embedding (also a learnable vector), is placed in a sequence and fed to the transformer.
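
For example, with the 224x224 inputs and 16x16 patches used in the original ViT, the image becomes (224/16)^2 = 196 tokens, so attention compares roughly 196^2 \approx 38,000 patch pairs, whereas treating each of the 224^2 = 50,176 pixels as a token would require about 2.5 billion pairwise comparisons.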

= Architectural improvements =

== Pooling ==

{{Main|Pooling layer}}

After the ViT processes an image, it produces some embedding vectors. These must be converted to a single class probability prediction by some kind of network. In the original ViT and the Masked Autoencoder, a dummy [CLS] token is used, in emulation of the BERT language model. The output at [CLS] is the classification token, which is then processed by a LayerNorm-feedforward-softmax module into a probability distribution.

Global average pooling (GAP) does not use the dummy token, but simply takes the average of all output tokens as the classification token. It was mentioned in the original ViT as being equally good.

Multihead attention pooling (MAP) applies a multiheaded attention block to pooling. Specifically, it takes as input a list of vectors x_1, x_2, \dots, x_n, which might be thought of as the output vectors of a layer of a ViT. The output from MAP is \mathrm{MultiheadedAttention}(Q, V, V), where Q is a trainable query vector, and V is the matrix with rows being x_1, x_2, \dots, x_n.{{Cite book |last1=Zhai |first1=Xiaohua |last2=Kolesnikov |first2=Alexander |last3=Houlsby |first3=Neil |last4=Beyer |first4=Lucas |chapter=Scaling Vision Transformers |date=June 2022 |title=2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |chapter-url=http://dx.doi.org/10.1109/cvpr52688.2022.01179 |pages=1204–1213 |publisher=IEEE |doi=10.1109/cvpr52688.2022.01179|arxiv=2106.04560 |isbn=978-1-6654-6946-3 }} This was first proposed in the Set Transformer architecture.{{Cite journal |last1=Lee |first1=Juho |last2=Lee |first2=Yoonho |last3=Kim |first3=Jungtaek |last4=Kosiorek |first4=Adam |last5=Choi |first5=Seungjin |last6=Teh |first6=Yee Whye |date=2019-05-24 |title=Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks |url=https://proceedings.mlr.press/v97/lee19d.html |journal=Proceedings of the 36th International Conference on Machine Learning |language=en |publisher=PMLR |pages=3744–3753|arxiv=1810.00825 }}
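
A minimal sketch of MAP pooling using PyTorch's built-in multi-head attention; the dimensions and the use of nn.MultiheadAttention are illustrative assumptions rather than the papers' exact implementation:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

D, heads, n = 768, 12, 196                      # embedding dim, attention heads, tokens

attn = nn.MultiheadAttention(D, heads, batch_first=True)
query = nn.Parameter(torch.zeros(1, 1, D))      # trainable query vector Q

x = torch.randn(2, n, D)                        # output tokens of a ViT layer, batch of 2
# MultiheadedAttention(Q, V, V): the tokens serve as both keys and values.
pooled, _ = attn(query.expand(x.shape[0], -1, -1), x, x)
pooled = pooled.squeeze(1)                      # (batch, D): one pooled vector per image
</syntaxhighlight>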

Later papers demonstrated that GAP and MAP both perform better than BERT-like pooling.{{Citation |last1=Karamcheti |first1=Siddharth |title=Language-Driven Representation Learning for Robotics |date=2023-02-24 |arxiv=2302.12766 |last2=Nair |first2=Suraj |last3=Chen |first3=Annie S. |last4=Kollar |first4=Thomas |last5=Finn |first5=Chelsea |last6=Sadigh |first6=Dorsa |last7=Liang |first7=Percy}} A variant of MAP was proposed as class attention, which applies MAP, then feedforward, then MAP again.{{Cite journal |last1=Touvron |first1=Hugo |last2=Cord |first2=Matthieu |last3=Sablayrolles |first3=Alexandre |last4=Synnaeve |first4=Gabriel |last5=Jégou |first5=Hervé |date=2021 |title=Going Deeper With Image Transformers |url=https://openaccess.thecvf.com/content/ICCV2021/html/Touvron_Going_Deeper_With_Image_Transformers_ICCV_2021_paper.html |language=en |pages=32–42|arxiv=2103.17239 }}

Re-attention was proposed to allow the training of deeper ViTs; it modifies the multiheaded attention module.{{Citation |last1=Zhou |first1=Daquan |title=DeepViT: Towards Deeper Vision Transformer |date=2021-04-19 |arxiv=2103.11886 |last2=Kang |first2=Bingyi |last3=Jin |first3=Xiaojie |last4=Yang |first4=Linjie |last5=Lian |first5=Xiaochen |last6=Jiang |first6=Zihang |last7=Hou |first7=Qibin |last8=Feng |first8=Jiashi}}

= Masked Autoencoder =

File:Masked Autoencoder.svg

The Masked Autoencoder{{Cite arXiv |eprint=2111.06377 |class=cs.CV |first1=Kaiming |last1=He |first2=Xinlei |last2=Chen |title=Masked Autoencoders Are Scalable Vision Learners |date=2021 |last3=Xie |first3=Saining |last4=Li |first4=Yanghao |last5=Dollár |first5=Piotr |last6=Girshick |first6=Ross}} took inspiration from denoising autoencoders and context encoders.{{Cite book |last1=Pathak |first1=Deepak |last2=Krahenbuhl |first2=Philipp |last3=Donahue |first3=Jeff |last4=Darrell |first4=Trevor |last5=Efros |first5=Alexei A. |chapter=Context Encoders: Feature Learning by Inpainting |date=June 2016 |title=2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |chapter-url=https://ieeexplore.ieee.org/document/7780647 |publisher=IEEE |pages=2536–2544 |doi=10.1109/CVPR.2016.278 |isbn=978-1-4673-8851-1|arxiv=1604.07379 }} It has two ViTs put end-to-end. The first one ("encoder") takes in image patches with positional encoding, and outputs vectors representing each patch. The second one (called "decoder", even though it is still an encoder-only Transformer) takes in vectors with positional encoding and outputs image patches again. During training, both the encoder and the decoder ViTs are used. During inference, only the encoder ViT is used.

During training, each image is cut into patches, and their positional embeddings are added. Of these, a random 25% of the patches are selected and the rest are discarded. The encoder ViT processes only the selected patches; no mask tokens are used at this stage. Then, mask tokens are inserted at the discarded positions, and positional embeddings are added again. These are processed by the decoder ViT, which outputs a reconstruction of the full image. The loss is the total mean-squared error in pixel space over all masked patches (reconstruction loss is not computed for non-masked patches).
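
A sketch of this masking procedure under the assumptions above (25% of patches kept); the encoder and decoder networks themselves are omitted, and all names are illustrative:

<syntaxhighlight lang="python">
import torch

num_patches, D = 196, 768
keep = int(0.25 * num_patches)                       # 25% of patches are kept

tokens = torch.randn(1, num_patches, D)              # patch embeddings + positions
perm = torch.randperm(num_patches)
kept_idx = perm[:keep]                               # random subset visible to the encoder
masked_idx = perm[keep:]

visible = tokens[:, kept_idx]                        # encoder input: visible patches only
# ... the encoder ViT processes `visible` ...

# Before the decoder, mask tokens are inserted at the discarded positions.
mask_token = torch.zeros(1, 1, D)                    # learned in practice
decoder_in = torch.zeros(1, num_patches, D)
decoder_in[:, kept_idx] = visible                    # in practice, the *encoded* visible tokens
decoder_in[:, masked_idx] = mask_token
# Positional embeddings are added again, the decoder reconstructs all patches,
# and the MSE loss is computed only at the masked positions.
</syntaxhighlight>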

A similar architecture, BEiT (BERT pre-training of image transformers), was published concurrently.{{Cite journal |last1=Bao |first1=Hangbo |last2=Dong |first2=Li |last3=Piao |first3=Songhao |last4=Wei |first4=Furu |date=2021-10-06 |title=BEiT: BERT Pre-Training of Image Transformers |url=https://openreview.net/forum?id=p-BhZSz59o4 |journal=International Conference on Learning Representations |arxiv=2106.08254 |language=en}}

= DINO =

Like the Masked Autoencoder, the DINO (self-distillation with no labels) method is a way to train a ViT by self-supervision. DINO is a form of teacher-student self-distillation. In DINO, the student is the model itself, and the teacher is an exponential average of the student's past states. The method is similar to previous works like momentum contrast{{Cite journal |last1=He |first1=Kaiming |last2=Fan |first2=Haoqi |last3=Wu |first3=Yuxin |last4=Xie |first4=Saining |last5=Girshick |first5=Ross |date=2020 |title=Momentum Contrast for Unsupervised Visual Representation Learning |url=https://openaccess.thecvf.com/content_CVPR_2020/html/He_Momentum_Contrast_for_Unsupervised_Visual_Representation_Learning_CVPR_2020_paper.html |pages=9729–9738|arxiv=1911.05722 }} and bootstrap your own latent (BYOL).{{Cite journal |last1=Grill |first1=Jean-Bastien |last2=Strub |first2=Florian |last3=Altché |first3=Florent |last4=Tallec |first4=Corentin |last5=Richemond |first5=Pierre |last6=Buchatskaya |first6=Elena |last7=Doersch |first7=Carl |last8=Avila Pires |first8=Bernardo |last9=Guo |first9=Zhaohan |last10=Gheshlaghi Azar |first10=Mohammad |last11=Piot |first11=Bilal |last12=kavukcuoglu |first12=koray |last13=Munos |first13=Remi |last14=Valko |first14=Michal |date=2020 |title=Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning |url=https://proceedings.neurips.cc/paper/2020/hash/f3ada80d5c4ee70142b17b8192b2958e-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=21271–21284}}

The loss function used in DINO is the cross-entropy loss between the output of the teacher network (f_{\theta'_t}) and the output of the student network (f_{\theta_t}). The teacher network is an exponentially decaying average of the student network's past parameters: \theta'_t = \alpha \theta_t + \alpha(1-\alpha) \theta_{t-1} + \cdots. The inputs to the networks are two different crops of the same image, represented as T(x) and T'(x), where x is the original image. The loss function is written as L(f_{\theta'_t}(T(x)), f_{\theta_t}(T'(x))). One issue is that the network can "collapse" by always outputting the same value regardless of the input. To prevent this collapse, DINO employs two strategies, sketched in code after the following list:

  • Sharpening: The teacher network's output is sharpened using a softmax function with a lower temperature. This makes the teacher more "confident" in its predictions, forcing the student to learn more meaningful representations to match the teacher's sharpened output.
  • Centering: The teacher network's output is centered by averaging it with its previous outputs. This prevents the teacher from becoming biased towards any particular output value, encouraging the student to learn a more diverse set of features.
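
A compact sketch of these update rules; the momentum, temperatures, and centering rate are placeholder values, not the paper's exact settings:

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.01):
    # theta'_t = alpha * theta_t + (1 - alpha) * theta'_{t-1}: the teacher is an
    # exponential moving average of the student's past parameters.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(1 - alpha).add_(ps, alpha=alpha)

def update_center(center, teacher_out, m=0.9):
    # Centering: the center is a running average of previous teacher outputs.
    return m * center + (1 - m) * teacher_out.mean(dim=0, keepdim=True)

def dino_loss(teacher_out, student_out, center, t_teacher=0.04, t_student=0.1):
    # Sharpening: the teacher uses a lower softmax temperature than the student.
    t = F.softmax((teacher_out - center) / t_teacher, dim=-1)   # centered, sharpened
    s = F.log_softmax(student_out / t_student, dim=-1)
    return -(t * s).sum(dim=-1).mean()                          # cross-entropy
</syntaxhighlight>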

In 2023, Meta AI Research released an updated version called DINOv2{{Cite arXiv|last1=Oquab |first1=Maxime |last2=Darcet |first2=Timothée |last3=Moutakanni |first3=Théo |last4=Vo |first4=Huy |last5=Szafraniec |first5=Marc |last6=Khalidov |first6=Vasil |last7=Fernandez |first7=Pierre |last8=Haziza |first8=Daniel |last9=Massa |first9=Francisco |date=2023-04-14 |title=DINOv2: Learning Robust Visual Features without Supervision |class=cs.CV |language=en |eprint=2304.07193}} with improvements in architecture, loss function, and optimization technique. It was trained on a larger and more diverse dataset. The features learned by DINOv2 are more transferable, meaning that the model performs better on downstream tasks.

= Swin Transformer =

The Swin Transformer ("Shifted windows") took inspiration from standard CNNs:

  • Instead of performing self-attention over the entire sequence of tokens, one for each patch, it performs "shifted window based" self-attention, which means only performing attention over square-shaped blocks of patches. One block of patches is analogous to the receptive field of one convolution.
  • After every few attention blocks, there is a "merge layer", which merges neighboring 2x2 tokens into a single token (see the sketch after this list). This is analogous to pooling (by 2x2 convolution kernels, with stride 2). Merging means concatenation followed by multiplication with a matrix.
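
A sketch of the 2x2 patch-merging step; the grid size and channel widths are illustrative (Swin doubles the channel dimension while quartering the number of tokens):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

D = 96                                    # token dimension before merging
x = torch.randn(1, 14, 14, D)             # tokens laid out on a 14 x 14 grid

# Merge each 2x2 neighbourhood: concatenate the four tokens, then project.
merged = torch.cat(
    [x[:, 0::2, 0::2], x[:, 1::2, 0::2], x[:, 0::2, 1::2], x[:, 1::2, 1::2]],
    dim=-1,                               # (1, 7, 7, 4 * D)
)
reduction = nn.Linear(4 * D, 2 * D)       # the matrix multiplication after concatenation
merged = reduction(merged)                # (1, 7, 7, 2 * D): half the tokens per side
</syntaxhighlight>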

It was improved by Swin Transformer V2,{{Cite web |last1=Liu |first1=Ze |last2=Hu |first2=Han |last3=Lin |first3=Yutong |last4=Yao |first4=Zhuliang |last5=Xie |first5=Zhenda |last6=Wei |first6=Yixuan |last7=Ning |first7=Jia |last8=Cao |first8=Yue |last9=Zhang |first9=Zheng |last10=Dong |first10=Li |last11=Wei |first11=Furu |last12=Guo |first12=Baining | publisher=Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition| date=2022 |title=Swin Transformer V2: Scaling Up Capacity and Resolution |url=https://openaccess.thecvf.com/content/CVPR2022/html/Liu_Swin_Transformer_V2_Scaling_Up_Capacity_and_Resolution_CVPR_2022_paper.html |language=en |pages=12009–12019}} which modifies the attention and normalization mechanisms of the original design in three ways{{Pg|location=Figure 1}}:

  • LayerNorm immediately after each attention and feedforward layer ("res-post-norm");
  • scaled cosine attention to replace the original dot product attention (sketched in code after this list);
  • log-spaced continuous relative position bias, which allows transfer learning across different window resolutions.
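
A sketch of scaled cosine attention for a single attention head; the temperature and bias here are placeholders (in Swin Transformer V2 the temperature is learned per head and the bias comes from a small network over log-spaced relative coordinates):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def scaled_cosine_attention(q, k, v, tau, rel_bias):
    # Cosine similarity between queries and keys, divided by a temperature tau,
    # plus a relative position bias, replaces the scaled dot product.
    sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
    weights = F.softmax(sim / tau + rel_bias, dim=-1)
    return weights @ v

n, d = 49, 32                                   # tokens in one window, head dimension
q, k, v = (torch.randn(n, d) for _ in range(3))
tau = torch.tensor(0.1)                         # learned per head in practice
rel_bias = torch.zeros(n, n)                    # produced by a small MLP in Swin V2
out = scaled_cosine_attention(q, k, v, tau, rel_bias)   # (n, d)
</syntaxhighlight>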

= TimeSformer =

The TimeSformer{{cite arXiv |eprint=2102.05095 |class=cs.CV |first1=Gedas |last1=Bertasius |first2=Heng |last2=Wang |title=Is Space-Time Attention All You Need for Video Understanding? |date=2021-02-09 |language=en |last3=Torresani |first3=Lorenzo}} was designed for video understanding tasks, and it applied a factorized self-attention, similar to the factorized convolution kernels found in the Inception CNN architecture.{{Cite journal |last1=Szegedy |first1=Christian |last2=Vanhoucke |first2=Vincent |last3=Ioffe |first3=Sergey |last4=Shlens |first4=Jon |last5=Wojna |first5=Zbigniew |date=2016 |title=Rethinking the Inception Architecture for Computer Vision |url=https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.html |pages=2818–2826|arxiv=1512.00567 }} Schematically, it divides a video into frames, and each frame into a square grid of patches (same as ViT). Let each patch coordinate be denoted by x, y, t, denoting horizontal, vertical, and time.

  • A space attention layer is a self-attention layer where each query patch q_{x, y, t} attends to only the key and value patches k_{x', y', t'}, v_{x', y', t'} such that t = t'.
  • A time attention layer is where the requirement is x' = x, y' = y instead.

The TimeSformer also considered other attention layer designs, such as the "height attention layer" where the requirement is x' = x, t' = t. However, they found empirically that the best design interleaves one space attention layer and one time attention layer.
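
A sketch of this factorized ("divided") space-time attention; residual connections, feedforward layers, and the model's exact multi-head implementation are omitted, the use of nn.MultiheadAttention is an assumption, and all sizes are illustrative:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

D, heads = 768, 12
T, Hp, Wp = 8, 14, 14                     # frames, patch rows, patch columns

time_attn = nn.MultiheadAttention(D, heads, batch_first=True)
space_attn = nn.MultiheadAttention(D, heads, batch_first=True)

x = torch.randn(1, T, Hp * Wp, D)         # one token per (frame, patch), batch size 1

# Time attention: each patch position attends across frames (x' = x, y' = y).
xt = x.permute(0, 2, 1, 3).reshape(Hp * Wp, T, D)
xt = time_attn(xt, xt, xt)[0].reshape(1, Hp * Wp, T, D).permute(0, 2, 1, 3)

# Space attention: each frame's patches attend to one another (t' = t).
xs = xt.reshape(T, Hp * Wp, D)
out = space_attn(xs, xs, xs)[0].reshape(1, T, Hp * Wp, D)
</syntaxhighlight>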

= ViT-VQGAN =

In ViT-VQGAN,{{Cite arXiv |last1=Yu |first1=Jiahui |last2=Li |first2=Xin |last3=Koh |first3=Jing Yu |last4=Zhang |first4=Han |last5=Pang |first5=Ruoming |last6=Qin |first6=James |last7=Ku |first7=Alexander |last8=Xu |first8=Yuanzhong |last9=Baldridge |first9=Jason |last10=Wu |first10=Yonghui |date=2021 |title=Vector-quantized Image Modeling with Improved VQGAN |class=cs.CV |eprint=2110.04627}} there are two ViT encoders and a discriminator. One encodes 8x8 patches of an image into a list of vectors, one for each patch. The vectors are restricted to a discrete codebook of possible values, as in vector quantization. A second ViT maps the quantized vectors back to image patches. The training objective attempts to make the reconstructed image (the output image) faithful to the input image. The discriminator (usually a convolutional network, but other networks are allowed) attempts to decide whether an image is an original real image or an image reconstructed by the ViT.

The idea is essentially the same as vector quantized variational autoencoder (VQVAE) plus generative adversarial network (GAN).
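
A sketch of the vector-quantization step, in which each encoder output is snapped to its nearest codebook entry; the codebook size, vector dimension, and image size are illustrative assumptions:

<syntaxhighlight lang="python">
import torch

codebook_size, D = 8192, 32
codebook = torch.randn(codebook_size, D)          # discrete set of code vectors

def quantize(z):
    # Replace each encoder output vector with its nearest codebook entry.
    dist = torch.cdist(z, codebook)               # (num_tokens, codebook_size)
    indices = dist.argmin(dim=-1)                 # symbols: one integer per patch
    return codebook[indices], indices

num_tokens = 1024                                 # e.g. a 256x256 image in 8x8 patches
z = torch.randn(num_tokens, D)                    # encoder outputs, one per patch
quantized, symbols = quantize(z)
# `symbols` is the discrete token sequence on which an autoregressive transformer
# can be trained to generate images token by token.
</syntaxhighlight>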

After such a ViT-VQGAN is trained, it can be used to code an arbitrary image into a list of symbols, and to code an arbitrary list of symbols back into an image. The list of symbols can be used to train a standard autoregressive transformer (like GPT) to generate images autoregressively. Further, one can take a list of caption-image pairs, convert the images into strings of symbols, and train a standard GPT-style transformer. Then at test time, one can give an image caption and have it autoregressively generate the image. This is the structure of Google Parti.{{Cite web |title=Parti: Pathways Autoregressive Text-to-Image Model |url=https://sites.research.google/parti/ |access-date=2023-11-03 |website=sites.research.google}}

= Others =

Other examples include the visual transformer,{{Citation |last1=Wu |first1=Bichen |title=Visual Transformers: Token-based Image Representation and Processing for Computer Vision |date=2020-11-19 |arxiv=2006.03677 |last2=Xu |first2=Chenfeng |last3=Dai |first3=Xiaoliang |last4=Wan |first4=Alvin |last5=Zhang |first5=Peizhao |last6=Yan |first6=Zhicheng |last7=Tomizuka |first7=Masayoshi |last8=Gonzalez |first8=Joseph |last9=Keutzer |first9=Kurt}} CoAtNet,{{cite arXiv |eprint=2106.04803 |class=cs.CV |first1=Zihang |last1=Dai |first2=Hanxiao |last2=Liu |title=CoAtNet: Marrying Convolution and Attention for All Data Sizes |date=2021-06-09 |language=en |last3=Le |first3=Quoc V. |last4=Tan |first4=Mingxing}} CvT,{{cite arXiv |eprint=2103.15808 |class=cs.CV |first1=Haiping |last1=Wu |first2=Bin |last2=Xiao |title=CvT: Introducing Convolutions to Vision Transformers |date=2021-03-29 |language=en |last3=Codella |first3=Noel |last4=Liu |first4=Mengchen |last5=Dai |first5=Xiyang |last6=Yuan |first6=Lu |last7=Zhang |first7=Lei}} the data-efficient ViT (DeiT),{{Cite book |last1=Touvron |first1=Hugo |last2=Cord |first2=Matthieu |last3=Jégou |first3=Hervé |chapter=DeiT III: Revenge of the ViT |series=Lecture Notes in Computer Science |date=2022 |volume=13684 |editor-last=Avidan |editor-first=Shai |editor2-last=Brostow |editor2-first=Gabriel |editor3-last=Cissé |editor3-first=Moustapha |editor4-last=Farinella |editor4-first=Giovanni Maria |editor5-last=Hassner |editor5-first=Tal |title=Computer Vision – ECCV 2022 |chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-20053-3_30 |language=en |location=Cham |publisher=Springer Nature Switzerland |pages=516–533 |doi=10.1007/978-3-031-20053-3_30 |isbn=978-3-031-20053-3}} etc.

In the Transformer in Transformer architecture, each layer applies a vision Transformer layer on each image patch embedding, adds the resulting tokens back to the embedding, then applies another vision Transformer layer.{{Cite journal |last1=Han |first1=Kai |last2=Xiao |first2=An |last3=Wu |first3=Enhua |last4=Guo |first4=Jianyuan |last5=XU |first5=Chunjing |last6=Wang |first6=Yunhe |date=2021 |title=Transformer in Transformer |url=https://proceedings.neurips.cc/paper/2021/hash/854d9fca60b4bd07f9bb215d59ef5561-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=34 |pages=15908–15919}}

Comparison with CNNs

Typically, ViT uses patch sizes larger than standard CNN kernels (3x3 to 7x7). ViT is more sensitive to the choice of the optimizer, hyperparameters, and network depth. Preprocessing with a layer of smaller-size, overlapping (stride < size) convolutional filters helps with performance and stability.{{cite arXiv|last1=Xiao|first1=Tete|last2=Singh|first2=Mannat|last3=Mintun|first3=Eric|last4=Darrell|first4=Trevor|last5=Dollár|first5=Piotr|last6=Girshick|first6=Ross|date=2021-06-28|title=Early Convolutions Help Transformers See Better|class=cs.CV|eprint=2106.14881}}
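
A sketch of such a convolutional preprocessor: a small stack of overlapping, strided 3x3 convolutions replacing the single patch-projection layer (the channel widths are illustrative, not a specific published configuration):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

D = 768
# Overlapping 3x3 filters with stride 2 downsample a 224x224 image to a 14x14 token grid.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(256, D, kernel_size=3, stride=2, padding=1),
)
tokens = stem(torch.randn(1, 3, 224, 224))        # (1, D, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)        # (1, 196, D), fed to the transformer
</syntaxhighlight>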

This different behavior seems to derive from the different inductive biases they possess.

A CNN applies the same set of filters across the entire image, which makes it more data efficient and less sensitive to local perturbations.{{cite arXiv|last1=Raghu|first1=Maithra|last2=Unterthiner|first2=Thomas|last3=Kornblith|first3=Simon|last4=Zhang|first4=Chiyuan|last5=Dosovitskiy|first5=Alexey|date=2021-08-19|title=Do Vision Transformers See Like Convolutional Neural Networks?|class=cs.CV |eprint=2108.08810}} ViT applies self-attention, which allows it to easily capture long-range relationships between patches. ViTs require more data to train, but they continue to benefit from additional training data, whereas CNNs may stop improving once the training dataset is large enough. ViT also appears more robust to input image distortions such as adversarial patches or permutations.{{cite arXiv|last1=Naseer|first1=Muzammal|last2=Ranasinghe|first2=Kanchana|last3=Khan|first3=Salman|last4=Hayat|first4=Munawar|last5=Khan|first5=Fahad Shahbaz|last6=Yang|first6=Ming-Hsuan|date=2021-05-21|title=Intriguing Properties of Vision Transformers|class=cs.CV |eprint=2105.10497|language=en}}

Applications

ViTs have been used in many computer vision tasks with excellent results, in some cases even state-of-the-art, including image classification, object detection, video deepfake detection,{{cite book |last1=Coccomini |first1=Davide |title=Image Analysis and Processing – ICIAP 2022 |last2=Messina |first2=Nicola |last3=Gennaro |first3=Claudio |last4=Falchi |first4=Fabrizio |year=2022 |isbn=978-3-031-06432-6 |series=Lecture Notes in Computer Science |volume=13233 |pages=219–229 |language=en |chapter=Combining Efficient Net and Vision Transformers for Video Deepfake Detection |doi=10.1007/978-3-031-06433-3_19 |arxiv=2107.02612 |s2cid=235742764}} image segmentation,{{Cite journal |last1=Kirillov |first1=Alexander |last2=Mintun |first2=Eric |last3=Ravi |first3=Nikhila |last4=Mao |first4=Hanzi |last5=Rolland |first5=Chloe |last6=Gustafson |first6=Laura |last7=Xiao |first7=Tete |last8=Whitehead |first8=Spencer |last9=Berg |first9=Alexander C. |last10=Lo |first10=Wan-Yen |last11=Dollar |first11=Piotr |last12=Girshick |first12=Ross |date=2023 |title=Segment Anything |url=https://openaccess.thecvf.com/content/ICCV2023/html/Kirillov_Segment_Anything_ICCV_2023_paper.html |language=en |pages=4015–4026}} anomaly detection, image synthesis, cluster analysis, and autonomous driving.

ViTs have been used for image generation as backbones for GANs{{Cite journal |last1=Jiang |first1=Yifan |last2=Chang |first2=Shiyu |last3=Wang |first3=Zhangyang |date=2021 |title=TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up |url=https://proceedings.neurips.cc/paper_files/paper/2021/hash/7c220a2091c26a7f5e9f1cfb099511e3-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=34 |pages=14745–14758|arxiv=2102.07074 }} and for diffusion models (the diffusion transformer, or DiT).{{Cite arXiv |eprint=2212.09748v2 |class=cs.CV |first1=William |last1=Peebles |first2=Saining |last2=Xie |title=Scalable Diffusion Models with Transformers |date=March 2023 |language=en}}

DINO{{Cite book |last1=Caron |first1=Mathilde |last2=Touvron |first2=Hugo |last3=Misra |first3=Ishan |last4=Jegou |first4=Herve |last5=Mairal |first5=Julien |last6=Bojanowski |first6=Piotr |last7=Joulin |first7=Armand |chapter=Emerging Properties in Self-Supervised Vision Transformers |date=October 2021 |pages=9630–9640 |title=2021 IEEE/CVF International Conference on Computer Vision (ICCV) |chapter-url=http://dx.doi.org/10.1109/iccv48922.2021.00951 |publisher=IEEE |doi=10.1109/iccv48922.2021.00951|arxiv=2104.14294 |isbn=978-1-6654-2812-5 }} has been demonstrated to learn useful representations for clustering images and exploring morphological profiles on biological datasets, such as images generated with the Cell Painting assay.{{Cite journal |last1=Doron |first1=Michael |last2=Moutakanni |first2=Théo |last3=Chen |first3=Zitong S. |last4=Moshkov |first4=Nikita |last5=Caron |first5=Mathilde |last6=Touvron |first6=Hugo |last7=Bojanowski |first7=Piotr |last8=Pernice |first8=Wolfgang M. |last9=Caicedo |first9=Juan C. |date=2023-06-18 |title=Unbiased single-cell morphology with self-supervised vision transformers |url=http://dx.doi.org/10.1101/2023.06.16.545359 |access-date=2024-02-12 |journal=BioRxiv: The Preprint Server for Biology|pages=2023.06.16.545359 |doi=10.1101/2023.06.16.545359 |pmid=37398158 |pmc=10312751 }}

In 2024, a 113 billion-parameter ViT model was proposed (the largest ViT to date) for weather and climate prediction, and trained on the Frontier supercomputer with a throughput of 1.6 exaFLOPs.{{cite arXiv |eprint=2404.14712 |last1=Wang |first1=Xiao |last2=Liu |first2=Siyan |last3=Tsaris |first3=Aristeidis |last4=Choi |first4=Jong-Youl |last5=Aji |first5=Ashwin |last6=Fan |first6=Ming |last7=Zhang |first7=Wei |last8=Yin |first8=Junqi |last9=Ashfaq |first9=Moetasim |last10=Lu |first10=Dan |last11=Balaprakash |first11=Prasanna |title=ORBIT: Oak Ridge Base Foundation Model for Earth System Predictability |date=2024 |class=physics.ao-ph }}

References

{{Reflist}}

Further reading

  • {{Cite book |last1=Zhang |first1=Aston |title=Dive into deep learning |last2=Lipton |first2=Zachary |last3=Li |first3=Mu |last4=Smola |first4=Alexander J. |date=2024 |publisher=Cambridge University Press |isbn=978-1-009-38943-3 |location=Cambridge New York Port Melbourne New Delhi Singapore |chapter=11.8. Transformers for Vision |chapter-url=https://d2l.ai/chapter_attention-mechanisms-and-transformers/vision-transformer.html}}
  • {{cite arXiv |last1=Steiner |first1=Andreas |last2=Kolesnikov |first2=Alexander |last3=Zhai |first3=Xiaohua |last4=Wightman |first4=Ross |last5=Uszkoreit |first5=Jakob |last6=Beyer |first6=Lucas |date=June 18, 2021 |title=How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers |class=cs.CV |eprint=2106.10270 }}

{{Artificial intelligence navbox}}

Category:Neural network architectures

Category:Computer vision

Category:Artificial neural networks

Category:Image processing