{{Short description|Technique in neural networks for learning joint representations of text and images}}
{{Infobox software
| name = CLIP
| caption = Contrastive Language-Image Pre-training (CLIP)
| logo = Contrastive Language-Image Pretraining.png
| developer = OpenAI
| released = January 5, 2021
| license = MIT License
| website = [https://openai.com/research/clip openai.com/research/clip]
| repo = https://github.com/OpenAI/CLIP
| programming language = Python
}}
Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective.{{Cite conference |last1=Radford |first1=Alec |last2=Kim |first2=Jong Wook |last3=Hallacy |first3=Chris |last4=Ramesh |first4=Aditya |last5=Goh |first5=Gabriel |last6=Agarwal |first6=Sandhini |last7=Sastry |first7=Girish |last8=Askell |first8=Amanda |last9=Mishkin |first9=Pamela |last10=Clark |first10=Jack |last11=Krueger |first11=Gretchen |last12=Sutskever |first12=Ilya |date=2021-07-01 |title=Learning Transferable Visual Models From Natural Language Supervision |url=https://proceedings.mlr.press/v139/radford21a |conference=Proceedings of the 38th International Conference on Machine Learning |publisher=PMLR |pages=8748–8763}}
This method has enabled broad applications across multiple domains, including cross-modal retrieval,{{Cite journal |last=Hendriksen |first=Mariya |last2=Bleeker |first2=Maurits |last3=Vakulenko |first3=Svitlana |last4=van Noord |first4=Nanne |last5=Kuiper |first5=Ernst |last6=de Rijke |first6=Maarten |date=2022 |editor-last=Hagen |editor-first=Matthias |editor2-last=Verberne |editor2-first=Suzan |editor3-last=Macdonald |editor3-first=Craig |editor4-last=Seifert |editor4-first=Christin |editor5-last=Balog |editor5-first=Krisztian |editor6-last=Nørvåg |editor6-first=Kjetil |editor7-last=Setty |editor7-first=Vinay |title=Extending CLIP for Category-to-Image Retrieval in E-Commerce |url=https://link.springer.com/chapter/10.1007/978-3-030-99736-6_20 |journal=Advances in Information Retrieval |language=en |location=Cham |publisher=Springer International Publishing |pages=289–303 |doi=10.1007/978-3-030-99736-6_20 |isbn=978-3-030-99736-6}} text-to-image generation,{{cite web |date=17 September 2022 |title=Stable Diffusion Repository on GitHub |url=https://github.com/CompVis/stable-diffusion |url-status=live |archive-url=https://web.archive.org/web/20230118183342/https://github.com/CompVis/stable-diffusion |archive-date=January 18, 2023 |access-date=17 September 2022 |publisher=CompVis - Machine Vision and Learning Research Group, LMU Munich}} and aesthetic ranking.{{Citation |title=LAION-AI/aesthetic-predictor |date=2024-09-06 |url=https://github.com/LAION-AI/aesthetic-predictor |access-date=2024-09-08 |publisher=LAION AI}}
== Algorithm ==
[[File:Contrastive_Language-Image_Pretraining.png|thumb|Architecture overview of Contrastive Language-Image Pre-training (CLIP).]]
The CLIP method trains a pair of models contrastively. One model takes in a piece of text as input and outputs a single vector representing its semantic content. The other model takes in an image and similarly outputs a single vector representing its visual content. The models are trained so that the vectors corresponding to semantically similar text-image pairs are close together in the shared vector space, while those corresponding to dissimilar pairs are far apart.
To train a pair of CLIP models, one would start by preparing a large dataset of image-caption pairs. During training, the models are presented with batches of <math>N</math> image-caption pairs. Let the outputs from the text and image models on the <math>i</math>-th pair be respectively <math>v_i</math> and <math>w_i</math>. Two vectors are considered "similar" if their dot product is large.
The loss incurred on this batch is the multi-class N-pair loss,{{Cite journal |last=Sohn |first=Kihyuk |date=2016 |title=Improved Deep Metric Learning with Multi-class N-pair Loss Objective |url=https://proceedings.neurips.cc/paper/2016/hash/6b180037abbebea991d8b1232f8a8ca9-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=29}} which is a symmetric cross-entropy loss over similarity scores:
<math display="block">L = -\frac{1}{N}\sum_{i=1}^N \ln\frac{e^{v_i \cdot w_i / T}}{\sum_{j=1}^N e^{v_i \cdot w_j / T}} - \frac{1}{N}\sum_{j=1}^N \ln\frac{e^{v_j \cdot w_j / T}}{\sum_{i=1}^N e^{v_i \cdot w_j / T}}</math>
In essence, this loss function encourages the dot product between matching image and text vectors (<math>v_i \cdot w_i</math>) to be high, while discouraging high dot products between non-matching pairs. The parameter <math>T > 0</math> is the temperature, which is parameterized in the original CLIP model as <math>T = e^{-\tau}</math>, where <math>\tau</math> is a learned parameter.
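The following is a minimal PyTorch sketch of this loss (not OpenAI's released training code). It assumes the text and image embeddings of a batch are stacked into matrices <code>v</code> and <code>w</code> of shape (N, d), already L2-normalized, and that <code>tau</code> is the learned scalar <math>\tau</math>:

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def clip_loss(v: torch.Tensor, w: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """Symmetric cross-entropy (multi-class N-pair) loss over a batch of N pairs."""
    logits = (v @ w.T) * torch.exp(tau)                 # pairwise dot products divided by T = exp(-tau)
    labels = torch.arange(v.shape[0], device=v.device)  # the i-th caption matches the i-th image
    loss_text_to_image = F.cross_entropy(logits, labels)    # softmax over images for each caption
    loss_image_to_text = F.cross_entropy(logits.T, labels)  # softmax over captions for each image
    return loss_text_to_image + loss_image_to_text      # the original pseudocode averages the two terms instead
</syntaxhighlight>

Because the embeddings are L2-normalized before the loss is computed, the dot products are cosine similarities.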
Other loss functions are possible. For example, Sigmoid CLIP (SigLIP){{Cite conference|last1=Zhai |first1=Xiaohua |last2=Mustafa |first2=Basil |last3=Kolesnikov |first3=Alexander |last4=Beyer |first4=Lucas |date=2023|conference=IEEE/CVF International Conference on Computer Vision (ICCV) |title=Sigmoid Loss for Language Image Pre-Training |url=https://openaccess.thecvf.com/content/ICCV2023/html/Zhai_Sigmoid_Loss_for_Language_Image_Pre-Training_ICCV_2023_paper.html |pages=11975–11986}} proposes the following loss function:
<math display="block">L = \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^N f\!\left((2\delta_{i,j} - 1)\left(\frac{v_i \cdot w_j}{T} + b\right)\right)</math>
where <math>f(x) = \ln(1 + e^{-x})</math> is the negative log sigmoid loss, <math>b</math> is a learned bias, and the Kronecker delta <math>\delta_{i,j}</math> is 1 if <math>i = j</math> and 0 otherwise.
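A corresponding sketch of the sigmoid loss, under the same assumptions (L2-normalized embedding matrices <code>v</code> and <code>w</code> of shape (N, d); <code>t</code> and <code>b</code> are learned scalars with <math>T = e^{-t}</math>):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def siglip_loss(v: torch.Tensor, w: torch.Tensor, t: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Sigmoid loss over all N*N image-text pairs in the batch."""
    logits = (v @ w.T) * torch.exp(t) + b                    # scaled pairwise similarities plus learned bias
    signs = 2 * torch.eye(v.shape[0], device=v.device) - 1   # +1 on the diagonal (matching pairs), -1 elsewhere
    # f(x) = ln(1 + exp(-x)) is softplus(-x); summed over all pairs and divided by N, as in the formula above
    return F.softplus(-signs * logits).sum() / v.shape[0]
</syntaxhighlight>

Unlike the softmax-based loss, each image-text pair contributes an independent binary term, so the loss does not require normalizing over the whole batch.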
== CLIP models ==
While the original model was developed by OpenAI, subsequent models have been trained by other organizations as well.
=== Image model ===
The image encoding models used in CLIP are typically vision transformers (ViT). The naming convention for these models often reflects the specific ViT architecture used. For instance, "ViT-L/14" means a "vision transformer large" (compared to other models in the same series) with a patch size of 14, meaning that the image is divided into 14-by-14 pixel patches before being processed by the transformer (so a 224x224 input yields a 16-by-16 grid of patches). The size indicator ranges over B, L, H, and G (base, large, huge, giant), in increasing order of size.
Besides ViT, the image model can also be a convolutional neural network, such as ResNet (in the original series by OpenAI) or ConvNeXt{{Cite conference|last1=Liu |first1=Zhuang |last2=Mao |first2=Hanzi |last3=Wu |first3=Chao-Yuan |last4=Feichtenhofer |first4=Christoph |last5=Darrell |first5=Trevor |last6=Xie |first6=Saining |date=2022 |conference=IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)|title=A ConvNet for the 2020s |url=https://openaccess.thecvf.com/content/CVPR2022/html/Liu_A_ConvNet_for_the_2020s_CVPR_2022_paper.html |pages=11976–11986}} (in the OpenCLIP model series by LAION{{Citation |last1=Ilharco |first1=Gabriel |title=OpenCLIP |date=July 2021 |url=https://github.com/mlfoundations/open_clip |access-date=2024-09-06 |last2=Wortsman |first2=Mitchell |last3=Wightman |first3=Ross |last4=Gordon |first4=Cade |last5=Carlini |first5=Nicholas | author5-link=Nicholas Carlini |last6=Taori |first6=Rohan |last7=Dave |first7=Achal |last8=Shankar |first8=Vaishaal |last9=Namkoong |first9=Hongseok|doi=10.5281/zenodo.5143773 }}).
Since the output vectors of the image model and the text model must have exactly the same length so that they can be compared by dot product, both models produce fixed-length vector outputs of the same dimension, which the original report calls the "embedding dimension".{{NoteTag|Similar to the "embedding dimension" of text embedding in Transformer models.}}
For example, in the original OpenAI model, the ResNet models have embedding dimensions ranging from 512 to 1024,{{Cite arXiv |eprint=2103.00020 |class=cs.CV |first1=Alec |last1=Radford |first2=Jong Wook |last2=Kim |title=Learning Transferable Visual Models From Natural Language Supervision |date=2021 |last3=Hallacy |first3=Chris |last4=Ramesh |first4=Aditya |last5=Goh |first5=Gabriel |last6=Agarwal |first6=Sandhini |last7=Sastry |first7=Girish |last8=Askell |first8=Amanda |last9=Mishkin |first9=Pamela |last10=Clark |first10=Jack |last11=Krueger |first11=Gretchen |last12=Sutskever |first12=Ilya}}{{Pg|location=Table 19}} and for the ViTs, from 512 to 768.{{Pg|location=Table 20}}
class="wikitable"
|+Models released by OpenAI{{Citation |title=openai/CLIP |date=2024-09-06 |url=https://github.com/openai/CLIP/ |access-date=2024-09-06 |publisher=OpenAI}}{{NoteTag|text= !pip install git+https://github.com/openai/CLIP.git !wget https://github.com/openai/CLIP/raw/main/CLIP.png -O CLIP.png import torch import clip from PIL import Image import numpy as np device = "cuda" if torch.cuda.is_available() else "cpu" for m in clip.available_models(): model, preprocess = clip.load(m, device=device) input_resolution = model.visual.input_resolution context_length = model.context_length vocab_size = model.vocab_size print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}") print("Input resolution:", input_resolution) print("Context length:", context_length) print("Vocab size:", vocab_size) n_params_vision = sum(p.numel() for p in model.visual.parameters()) n_params_text = sum(p.numel() for p in model.transformer.parameters()) image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device) image_features = model.encode_image(image) print(f"Model: {m}, #vision parameters: {n_params_vision:,}, #text parameters: {n_params_text:,}, embedding dimension: {image_features.shape[1]}") del model, preprocess, image, image_features }} !Model name !Resolution !Parameters (total, in millions) !Parameters (vision) !Parameters (text) !Embedding dimension !Size (MB) !Release date |
RN50
|224 |102 |38.3 |63.1 |1024 |244 |2021-01 |
RN101
|224 |120 |56.3 |63.1 |512 |278 |2021-03 |
RN50x4
|288 |178 |87.1 |90.7 |640 |402 |2021-03 |
RN50x16
|384 |291 |167.3 |123.0 |768 |630 |2021-07 |
RN50x64
|448 |623 |420.4 |201.8 |1024 |1260 |2022-01 |
ViT-B/32
|224 |151 |87.8 |63.1 |512 |338 |2021-01 |
ViT-B/16
|224 |150 |86.2 |63.1 |512 |335 |2021-07 |
ViT-L/14
|224 |428 |304.0 |123.0 |768 |890 |2022-01 |
ViT-L/14@336px
|336 |428 |304.3 |123.0 |768 |891 |2022-04 |
CLIP's implementation of ViT was the same as the original one,{{cite arXiv |eprint=2010.11929 |class=cs.CV |first1=Alexey |last1=Dosovitskiy |first2=Lucas |last2=Beyer |title=An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale |date=2021-06-03 |last3=Kolesnikov |first3=Alexander |last4=Weissenborn |first4=Dirk |last5=Zhai |first5=Xiaohua |last6=Unterthiner |first6=Thomas |last7=Dehghani |first7=Mostafa |last8=Minderer |first8=Matthias |last9=Heigold |first9=Georg |last10=Gelly |first10=Sylvain |last11=Uszkoreit |first11=Jakob}} with one modification: after the position embeddings are added to the initial patch embeddings, a LayerNorm is applied.
CLIP's implementation of ResNet was the same as the original one,{{Cite conference |last1=He |first1=Kaiming |last2=Zhang |first2=Xiangyu |last3=Ren |first3=Shaoqing |last4=Sun |first4=Jian |date=10 Dec 2015 |title=Deep Residual Learning for Image Recognition |arxiv=1512.03385}} with three modifications:
- At the start of the CNN (the "stem"), they used three stacked 3x3 convolutions instead of a single 7x7 convolution, as suggested by earlier work.{{cite arXiv |last1=He |first1=Tong |title=Bag of Tricks for Image Classification with Convolutional Neural Networks |date=2018-12-05 |eprint=1812.01187 |last2=Zhang |first2=Zhi |last3=Zhang |first3=Hang |last4=Zhang |first4=Zhongyue |last5=Xie |first5=Junyuan |last6=Li |first6=Mu|class=cs.CV }}
- There is a stride-2 average pooling at the start of each downsampling convolutional layer (they called it "rect-2 blur pooling", following the terminology of Zhang{{Cite journal |last=Zhang |first=Richard |date=2018-09-27 |title=Making Convolutional Networks Shift-Invariant Again |url=https://openreview.net/forum?id=SklVEnR5K7 |language=en}}). This has the effect of blurring images before downsampling, for antialiasing.{{cite arXiv |last=Zhang |first=Richard |title=Making Convolutional Networks Shift-Invariant Again |date=2019-06-08 |class=cs.CV |eprint=1904.11486}}
- The final convolutional layer is followed by a multiheaded attention pooling.
ALIGN, a model with similar capabilities trained by researchers from Google,{{Cite journal |last1=Jia |first1=Chao |last2=Yang |first2=Yinfei |last3=Xia |first3=Ye |last4=Chen |first4=Yi-Ting |last5=Parekh |first5=Zarana |last6=Pham |first6=Hieu |last7=Le |first7=Quoc |last8=Sung |first8=Yun-Hsuan |last9=Li |first9=Zhen |last10=Duerig |first10=Tom |date=2021-07-01 |title=Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision |url=https://proceedings.mlr.press/v139/jia21b.html |journal=Proceedings of the 38th International Conference on Machine Learning |publisher=PMLR |pages=4904–4916}} used EfficientNet,{{cite arXiv |last1=Tan |first1=Mingxing |title=EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks |date=2020-09-11 |eprint=1905.11946 |last2=Le |first2=Quoc V.|class=cs.LG }} a kind of convolutional neural network.
=== Text model ===
[[File:Transformer,_one_decoder_block.png|thumb|One decoder block of a Transformer.]]
The text encoding models used in CLIP are typically Transformers.
In the original OpenAI report, they reported using a Transformer (63M-parameter, 12-layer, 512-wide, 8 attention heads) with lower-cased byte pair encoding (BPE) with 49152 vocabulary size. Context length was capped at 76 for efficiency. Like GPT, it was decoder-only, with only causally-masked self-attention.{{Pg|quote=Masked self-attention was used in the text encoder to preserve the ability to initialize with a pre-trained language model or add language modeling as an auxiliary objective, though exploration of this is left as future work.|page=5}} Its architecture is the same as GPT-2.{{Cite journal |last1=Radford |first1=Alec |last2=Wu |first2=Jeff |last3=Child |first3=R. |last4=Luan |first4=D. |last5=Amodei |first5=Dario |last6=Sutskever |first6=I. |date=2019 |title=Language Models are Unsupervised Multitask Learners |s2cid=160025533 }}
Like BERT, the text sequence is bracketed by two special tokens <code>[SOS]</code> and <code>[EOS]</code> ("start of sequence" and "end of sequence"). The activations of the highest layer of the transformer at the <code>[EOS]</code> token are taken, a LayerNorm is applied, and then a final linear map produces the text encoding of the input sequence. The final linear map has output dimension equal to the embedding dimension of whatever image encoder it is paired with. These models all had context length 77 and vocabulary size 49408.
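As an illustration, the pooling step at the top of the text encoder can be sketched as follows (a simplified sketch following the structure of the openai/CLIP implementation; the argument names are illustrative):

<syntaxhighlight lang="python">
import torch
from torch import nn

def pool_text(hidden: torch.Tensor, tokens: torch.Tensor,
              ln_final: nn.LayerNorm, text_projection: torch.Tensor) -> torch.Tensor:
    """Pool top-layer transformer activations into one text embedding per sequence.

    hidden: activations of the highest transformer layer, shape (batch, 77, width)
    tokens: BPE token ids, shape (batch, 77); [EOS] has the largest id in CLIP's vocabulary
    """
    eos_positions = tokens.argmax(dim=-1)                          # position of the [EOS] token in each sequence
    pooled = hidden[torch.arange(hidden.shape[0]), eos_positions]  # activations at the [EOS] position
    return ln_final(pooled) @ text_projection                      # LayerNorm, then the final linear map
</syntaxhighlight>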
== Dataset ==
=== WebImageText ===
The CLIP models released by OpenAI were trained on a dataset called "WebImageText" (WIT) containing 400 million pairs of images and their corresponding captions scraped from the internet. The total number of words in this dataset is similar in scale to the WebText dataset used for training GPT-2, which contains about 40 gigabytes of text data.
The dataset contains 500,000 text queries, with up to 20,000 (image, text) pairs per query. The text queries were generated by starting with all words occurring at least 100 times in English Wikipedia, then extended with bigrams with high mutual information, the names of all Wikipedia articles above a certain search volume, and WordNet synsets.
The dataset is private and has not been released to the public, and there is no further information on it.{{NoteTag|It is not the same as the Wikipedia-based Image Text dataset, also called "WIT".{{Cite book |last1=Srinivasan |first1=Krishna |last2=Raman |first2=Karthik |last3=Chen |first3=Jiecao |last4=Bendersky |first4=Michael |last5=Najork |first5=Marc |chapter=WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning |date=2021-07-11 |title=Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval |pages=2443–2449 |doi=10.1145/3404835.3463257|arxiv=2103.01913 |isbn=978-1-4503-8037-9 }}
}}
==== Data preprocessing ====
For the CLIP image models, the input images are preprocessed by first dividing each of the R, G, B values of an image by the maximum possible value, so that these values fall between 0 and 1, then subtracting the per-channel means <code>[0.48145466, 0.4578275, 0.40821073]</code>, and dividing by the per-channel standard deviations <code>[0.26862954, 0.26130258, 0.27577711]</code>. The rationale is that these are the means and standard deviations of the images in the WebImageText dataset, so this preprocessing step roughly whitens the image tensor. These numbers slightly differ from the standard preprocessing for ImageNet, which uses <code>[0.485, 0.456, 0.406]</code> and <code>[0.229, 0.224, 0.225]</code>.{{Cite web |title=std and mean for image normalization different from ImageNet · Issue #20 · openai/CLIP |url=https://github.com/openai/CLIP/issues/20 |access-date=2024-09-19 |website=GitHub |language=en}}
If the input image does not have the same resolution as the native resolution (224x224 for all models except ViT-L/14@336px, which uses 336x336), then the input image is rescaled by bicubic interpolation so that its shorter side matches the native resolution, after which the central square of the image is cropped out.
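The following torchvision-based sketch reproduces this preprocessing pipeline; the <code>clip.load</code> function of the openai/CLIP package returns an equivalent transform:

<syntaxhighlight lang="python">
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),  # shorter side to the native resolution
    transforms.CenterCrop(224),                                                  # crop out the central square
    transforms.ToTensor(),                                                       # scale R, G, B values into [0, 1]
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),              # approximate whitening
])

image_tensor = preprocess(Image.open("CLIP.png").convert("RGB")).unsqueeze(0)    # shape (1, 3, 224, 224)
</syntaxhighlight>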
=== Others ===
ALIGN used over one billion image-text pairs, obtained by extracting images and their alt-tags from online crawling. The method was described as similar to how the Conceptual Captions dataset{{Cite journal |last1=Sharma |first1=Piyush |last2=Ding |first2=Nan |last3=Goodman |first3=Sebastian |last4=Soricut |first4=Radu |date=July 2018 |editor-last=Gurevych |editor-first=Iryna |editor2-last=Miyao |editor2-first=Yusuke |title=Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning |url=https://aclanthology.org/P18-1238/ |journal=Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |location=Melbourne, Australia |publisher=Association for Computational Linguistics |pages=2556–2565 |doi=10.18653/v1/P18-1238|doi-access=free }} was constructed, but instead of complex filtering, they only applied a frequency-based filtering.
Later models trained by other organizations used publicly released datasets. For example, LAION trained OpenCLIP with the published datasets LAION-400M, LAION-2B, and DataComp-1B.{{Cite book |last1=Cherti |first1=Mehdi |last2=Beaumont |first2=Romain |last3=Wightman |first3=Ross |last4=Wortsman |first4=Mitchell |last5=Ilharco |first5=Gabriel |last6=Gordon |first6=Cade |last7=Schuhmann |first7=Christoph |last8=Schmidt |first8=Ludwig |last9=Jitsev |first9=Jenia |chapter=Reproducible Scaling Laws for Contrastive Language-Image Learning |date=June 2023 |title=2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |pages=2818–2829 |doi=10.1109/CVPR52729.2023.00276|arxiv=2212.07143 |isbn=979-8-3503-0129-8 }}
== Training ==
In the original OpenAI CLIP report, they reported training 5 ResNet models and 3 ViT models (ViT-B/32, ViT-B/16, ViT-L/14). Each was trained for 32 epochs. The largest ResNet model took 18 days to train on 592 V100 GPUs, while the largest ViT model took 12 days on 256 V100 GPUs.
All ViT models were trained at 224x224 image resolution. The ViT-L/14 was then boosted to 336x336 resolution by FixRes,{{Cite journal |last1=Touvron |first1=Hugo |last2=Vedaldi |first2=Andrea |last3=Douze |first3=Matthijs |last4=Jegou |first4=Herve |date=2019 |title=Fixing the train-test resolution discrepancy |url=https://proceedings.neurips.cc/paper/2019/hash/d03a857a23b5285736c4d55e0bb067c8-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32}} resulting in an additional model.{{NoteTag|They referred to this as both <code>ViT-L/14-336px</code> and <code>ViT-L/14@336px</code>, inconsistently, throughout the report.}} They found this to be the best-performing model.{{Pg|location=Appendix F. Model Hyperparameters}}
In the OpenCLIP series, the ViT-L/14 model was trained on 384 A100 GPUs on the LAION-2B dataset, for 160 epochs for a total of 32B samples seen.{{Cite web |date=2023-09-10 |title=laion/CLIP-ViT-L-14-laion2B-s32B-b82K · Hugging Face |url=https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K |access-date=2024-09-06 |website=huggingface.co}}
== Applications ==
=== Cross-modal retrieval ===
Because CLIP aligns visual and textual data in a shared latent space, it can be used for cross-modal retrieval: users can retrieve images based on text descriptions and vice versa, without the need for explicit image annotations.{{Cite arXiv |last1=Hendriksen |first1=Mariya |last2=Bleeker |first2=Maurits |last3=Vakulenko |first3=Svitlana |last4=van Noord |first4=Nanne |last5=Kuiper |first5=Ernst |last6=de Rijke |first6=Maarten |date=2021 |title=Extending CLIP for Category-to-image Retrieval in E-commerce |class=cs.CV |eprint=2112.11294}} In text-to-image retrieval, users input descriptive text, and CLIP retrieves images with matching embeddings. In image-to-text retrieval, images are used to find related text content.
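For instance, a simple text-to-image retrieval loop with the openai/CLIP package ranks candidate images by the cosine similarity between their embeddings and that of a text query (the file names and query below are placeholders):

<syntaxhighlight lang="python">
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

paths = ["img0.jpg", "img1.jpg", "img2.jpg"]                     # placeholder image files
images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
query = clip.tokenize(["a dog playing in the snow"]).to(device)  # placeholder text query

with torch.no_grad():
    image_emb = model.encode_image(images)
    text_emb = model.encode_text(query)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)  # normalize so dot products are cosine similarities
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = (image_emb @ text_emb.T).squeeze(1)                  # one similarity score per image

for path, score in sorted(zip(paths, scores.tolist()), key=lambda x: -x[1]):
    print(path, round(score, 3))
</syntaxhighlight>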
CLIP’s ability to connect visual and textual data has found applications in multimedia search, content discovery, and recommendation systems.{{Citation |last=Beaumont |first=Romain |title=rom1504/clip-retrieval |date=2024-09-07 |url=https://github.com/rom1504/clip-retrieval |access-date=2024-09-08}}{{Citation |last=Haltakov |first=Vladimir |title=haltakov/natural-language-image-search |date=2024-09-03 |url=https://github.com/haltakov/natural-language-image-search |access-date=2024-09-06}}
=== Image classification ===
CLIP can perform zero-shot image classification tasks. This is achieved by prompting the text encoder with class names and selecting the class whose embedding is closest to the image embedding. For example, to classify an image, its embedding is compared with the embedding of the text "A photo of a {class}." for each candidate class, and the {class} that yields the highest dot product is output.
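A minimal zero-shot classification example using the openai/CLIP package, closely following the repository's usage example (the class names here are placeholders):

<syntaxhighlight lang="python">
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["cat", "dog", "diagram"]                                       # placeholder class names
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize([f"A photo of a {c}." for c in classes]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)    # probability per class name

print(dict(zip(classes, probs[0].tolist())))                              # the highest-probability class is the prediction
</syntaxhighlight>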
=== CLIP for multimodal learning ===
CLIP has been used as a component in multimodal learning. For example, during the training of Google DeepMind's Flamingo (2022),{{Cite journal |last1=Alayrac |first1=Jean-Baptiste |last2=Donahue |first2=Jeff |last3=Luc |first3=Pauline |last4=Miech |first4=Antoine |last5=Barr |first5=Iain |last6=Hasson |first6=Yana |last7=Lenc |first7=Karel |last8=Mensch |first8=Arthur |last9=Millican |first9=Katherine |last10=Reynolds |first10=Malcolm |last11=Ring |first11=Roman |last12=Rutherford |first12=Eliza |last13=Cabi |first13=Serkan |last14=Han |first14=Tengda |last15=Gong |first15=Zhitao |date=2022-12-06 |title=Flamingo: a Visual Language Model for Few-Shot Learning |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/960a172bc7fbf0177ccccbb411a7d800-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |volume=35 |pages=23716–23736}} the authors trained a CLIP pair, with BERT as the text encoder and NormalizerFree ResNet F6{{Cite journal |last1=Brock |first1=Andy |last2=De |first2=Soham |last3=Smith |first3=Samuel L. |last4=Simonyan |first4=Karen |date=2021-07-01 |title=High-Performance Large-Scale Image Recognition Without Normalization |url=https://proceedings.mlr.press/v139/brock21a.html |journal=Proceedings of the 38th International Conference on Machine Learning |publisher=PMLR |pages=1059–1071}} as the image encoder. The image encoder of the CLIP pair was taken with parameters frozen and the text encoder was discarded. The frozen image encoder was then combined with a frozen Chinchilla language model, by finetuning with some further parameters that connect the two frozen models.
=== Applications in other domains ===
CLIP has been used in various domains beyond its original purpose:
- Image Featurizer: CLIP's image encoder can be adapted as a pre-trained image featurizer, whose output features can then be fed into other AI models.
- Text-to-Image Generation: Models like Stable Diffusion use CLIP's text encoder to transform text prompts into embeddings for image generation.{{cite web |date=17 September 2022 |title=Stable Diffusion Repository on GitHub |url=https://github.com/CompVis/stable-diffusion |url-status=live |archive-url=https://web.archive.org/web/20230118183342/https://github.com/CompVis/stable-diffusion |archive-date=January 18, 2023 |access-date=17 September 2022 |publisher=CompVis - Machine Vision and Learning Research Group, LMU Munich}} CLIP can also be used as a gradient signal for directly guiding diffusion ("CLIP guidance"){{cite arXiv |last1=Ramesh |first1=Aditya |title=Hierarchical Text-Conditional Image Generation with CLIP Latents |date=2022-04-12 |eprint=2204.06125 |last2=Dhariwal |first2=Prafulla |last3=Nichol |first3=Alex |last4=Chu |first4=Casey |last5=Chen |first5=Mark|class=cs.CV }}{{Cite journal |last=Hendriksen |first=Mariya |last2=Bleeker |first2=Maurits |last3=Vakulenko |first3=Svitlana |last4=van Noord |first4=Nanne |last5=Kuiper |first5=Ernst |last6=de Rijke |first6=Maarten |date=2022 |editor-last=Hagen |editor-first=Matthias |editor2-last=Verberne |editor2-first=Suzan |editor3-last=Macdonald |editor3-first=Craig |editor4-last=Seifert |editor4-first=Christin |editor5-last=Balog |editor5-first=Krisztian |editor6-last=Nørvåg |editor6-first=Kjetil |editor7-last=Setty |editor7-first=Vinay |title=Extending CLIP for Category-to-Image Retrieval in E-Commerce |url=https://link.springer.com/chapter/10.1007/978-3-030-99736-6_20 |journal=Advances in Information Retrieval |language=en |location=Cham |publisher=Springer International Publishing |pages=289–303 |doi=10.1007/978-3-030-99736-6_20 |isbn=978-3-030-99736-6}} or other generative art.{{Cite web |last=Whitaker |first=Jonathan |date=2022-05-22 |title=Fun With Neural Cellular Automata |url=https://wandb.ai/johnowhitaker/nca/reports/Fun-with-Neural-Cellular-Automata--VmlldzoyMDQ5Mjg0 |access-date=2024-09-08 |website=W&B |language=en}}
- Aesthetic Ranking: Fine-tuned CLIP models can be used to rank images by aesthetic quality, aiding in dataset curation.{{Citation |title=LAION-AI/aesthetic-predictor |date=2024-09-06 |url=https://github.com/LAION-AI/aesthetic-predictor |access-date=2024-09-08 |publisher=LAION AI}}
- Image Captioning: CLIP can be used to generate image captions by matching text inputs to image embeddings.{{Cite arXiv |last1=Mokady |first1=Ron |last2=Hertz |first2=Amir |last3=Bermano |first3=Amit H. |date=2021 |title=ClipCap: CLIP Prefix for Image Captioning |class=cs.CV |eprint=2111.09734}}
== Notes ==
{{reflist|group=note}}
== References ==
{{reflist |30em}}
== External links ==
- [https://openai.com/research/clip OpenAI's CLIP webpage]
- [https://github.com/mlfoundations/open_clip OpenCLIP: An open source implementation of CLIP]
- {{Cite web |last=Arora |first=Aman |date=2023-03-11 |title=The Annotated CLIP (Part-2) |url=https://amaarora.github.io/posts/2023-03-11_Understanding_CLIP_part_2.html |access-date=2024-09-11 |website=amaarora.github.io |language=en}}