Fine-tuning (deep learning)

{{Short description|Machine learning technique}}

In deep learning, fine-tuning is an approach to transfer learning in which the parameters of a pre-trained neural network model are trained on new data.{{cite book |last1=Quinn |first1=Joanne |url=https://d2l.ai/chapter_computer-vision/fine-tuning.html#steps |title=Dive into deep learning: tools for engagement |date=2020 |isbn=978-1-5443-6137-6 |location=Thousand Oaks, California |page=551 |access-date=January 10, 2023 |archive-url=https://web.archive.org/web/20230110131250/https://d2l.ai/chapter_computer-vision/fine-tuning.html#steps |archive-date=January 10, 2023 |url-status=live}} Fine-tuning can be done on the entire neural network, or on only a subset of its layers, in which case the layers that are not being fine-tuned are "frozen" (i.e., not changed during backpropagation).{{cite web |title=CS231n Convolutional Neural Networks for Visual Recognition |url=https://cs231n.github.io/transfer-learning/ |website=cs231n.github.io |access-date=9 March 2023}} A model may also be augmented with "adapters" that consist of far fewer parameters than the original model, and fine-tuned in a parameter-efficient way by tuning the weights of the adapters and leaving the rest of the model's weights frozen.{{cite conference |last1=Liu |first1=Haokun |last2=Tam |first2=Derek |last3=Muqeeth |first3=Mohammed |last4=Mohta |first4=Jay |last5=Huang |first5=Tenghao |last6=Bansal |first6=Mohit |last7=Raffel |first7=Colin A |title=Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning |conference=Advances in Neural Information Processing Systems |volume=35 |pages=1950–1965 |year=2022 |publisher=Curran Associates, Inc. |editor-last=Koyejo |editor-first=S. |editor2-last=Mohamed |editor2-first=S. |editor3-last=Agarwal |editor3-first=A. |editor4-last=Belgrave |editor4-first=D. |editor5-last=Cho |editor5-first=K. |editor6-last=Oh |editor6-first=A. |url=https://proceedings.neurips.cc/paper_files/paper/2022/file/0cde695b83bd186c1fd456302888454c-Paper-Conference.pdf}}

For some architectures, such as convolutional neural networks, it is common to keep the earlier layers (those closest to the input layer) frozen, as they capture lower-level features, while later layers capture higher-level features more closely related to the task the model is trained on.{{cite journal |last1=Zeiler |first1=Matthew D |last2=Fergus |first2=Rob |date=2013 |title=Visualizing and Understanding Convolutional Networks |journal=ECCV |arxiv=1311.2901}}
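Freezing can be sketched as an optimizer update that simply skips the frozen layers. The following is a minimal, self-contained illustration; the layer names, toy weights, gradients, and learning rate are all hypothetical, not taken from any real model:

```python
def sgd_step(params, grads, frozen, lr):
    """Apply one SGD update, leaving parameters of frozen layers unchanged."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }

# A toy "network" with one scalar weight per layer (illustrative only).
params = {"conv1": 1.0, "conv2": 2.0, "head": 3.0}
grads  = {"conv1": 2.0, "conv2": 2.0, "head": 2.0}

# Freeze the earlier layers; only the task-specific head is updated.
updated = sgd_step(params, grads, frozen={"conv1", "conv2"}, lr=0.5)

assert updated["conv1"] == 1.0 and updated["conv2"] == 2.0  # unchanged
assert updated["head"] == 2.0                               # 3.0 - 0.5 * 2.0
```

In practice the same effect is achieved by excluding frozen parameters from the optimizer or marking them as not requiring gradients.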

Models that are pre-trained on large, general corpora are usually fine-tuned by reusing their parameters as a starting point and adding a task-specific layer trained from scratch.{{cite journal |last1=Dodge |first1=Jesse |last2=Ilharco |first2=Gabriel |last3=Schwartz |first3=Roy |last4=Farhadi |first4=Ali |last5=Hajishirzi |first5=Hannaneh |last6=Smith |first6=Noah |title=Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping |date=2020 |arxiv=2002.06305}} Fine-tuning the full model is also common and often yields better results, but is more computationally expensive.

Fine-tuning is typically accomplished via supervised learning, but there are also techniques to fine-tune a model using weak supervision.{{cite journal |last1=Yu |first1=Yue |last2=Zuo |first2=Simiao |last3=Jiang |first3=Haoming |last4=Ren |first4=Wendi |last5=Zhao |first5=Tuo |last6=Zhang |first6=Chao |date=2020 |title=Fine-Tuning Pre-trained Language Model with Weak Supervision: A Contrastive-Regularized Self-Training Approach |journal=Association for Computational Linguistics |arxiv=2010.07835}} Fine-tuning can be combined with an objective based on reinforcement learning from human feedback to produce language models such as ChatGPT (a fine-tuned version of GPT models) and Sparrow.{{cite web |title=Introducing ChatGPT |url=https://openai.com/blog/chatgpt |website=openai.com |access-date=9 March 2023}}{{cite journal |last1=Glaese |first1=Amelia |last2=McAleese |first2=Nat |last3=Trębacz |first3=Maja |last4=Aslanides |first4=John |last5=Firoiu |first5=Vlad |last6=Ewalds |first6=Timo |last7=Rauh |first7=Maribeth |last8=Weidinger |first8=Laura |last9=Chadwick |first9=Martin |last10=Thacker |first10=Phoebe |last11=Campbell-Gillingham |first11=Lucy |last12=Uesato |first12=Jonathan |last13=Huang |first13=Po-Sen |last14=Comanescu |first14=Ramona |last15=Yang |first15=Fan |date=2022 |title=Improving alignment of dialogue agents via targeted human judgements |journal=DeepMind |arxiv=2209.14375 |last16=See |first16=Abigail |last17=Dathathri |first17=Sumanth |last18=Greig |first18=Rory |last19=Chen |first19=Charlie |last20=Fritz |first20=Doug |last21=Elias |first21=Jaume Sanchez |last22=Green |first22=Richard |last23=Mokrá |first23=Soňa |last24=Fernando |first24=Nicholas |last25=Wu |first25=Boxi |last26=Foley |first26=Rachel |last27=Young |first27=Susannah |last28=Gabriel |first28=Iason |last29=Isaac |first29=William |last30=Mellor |first30=John |last31=Hassabis |first31=Demis |last32=Kavukcuoglu |first32=Koray |last33=Hendricks |first33=Lisa Anne |last34=Irving |first34=Geoffrey}}

Robustness

Fine-tuning can degrade a model's robustness to distribution shifts.{{cite arXiv |last1=Radford |first1=Alec |last2=Kim |first2=Jong Wook |last3=Hallacy |first3=Chris |last4=Ramesh |first4=Aditya |last5=Goh |first5=Gabriel |last6=Agarwal |first6=Sandhini |last7=Sastry |first7=Girish |last8=Askell |first8=Amanda |last9=Mishkin |first9=Pamela |last10=Clark |first10=Jack |last11=Krueger |first11=Gretchen |last12=Sutskever |first12=Ilya |title=Learning Transferable Visual Models From Natural Language Supervision |year=2021 |eprint=2103.00020 |class=cs.CV }}{{cite journal |last1=Kumar |first1=Ananya |last2=Raghunathan |first2=Aditi |last3=Jones |first3=Robbie |last4=Ma |first4=Tengyu |last5=Liang |first5=Percy |date=2022 |title=Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution |journal=ICLR |arxiv=2202.10054}} One mitigation is to linearly interpolate a fine-tuned model's weights with the weights of the original model, which can greatly increase out-of-distribution performance while largely retaining the in-distribution performance of the fine-tuned model.{{cite arXiv |last1=Wortsman |first1=Mitchell |last2=Ilharco |first2=Gabriel |last3=Kim |first3=Jong Wook |last4=Li |first4=Mike |last5=Kornblith |first5=Simon |last6=Roelofs |first6=Rebecca |last7=Gontijo-Lopes |first7=Raphael |last8=Hajishirzi |first8=Hannaneh |last9=Farhadi |first9=Ali |last10=Namkoong |first10=Hongseok |last11=Schmidt |first11=Ludwig |title=Robust fine-tuning of zero-shot models |year=2022 |class=cs.CV |eprint=2109.01903}}
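The interpolation described above is a simple per-parameter weighted average of the two sets of weights. A minimal sketch, assuming toy weight dictionaries and a hypothetical mixing coefficient alpha (alpha = 0 recovers the pre-trained model, alpha = 1 the fine-tuned one):

```python
def interpolate_weights(pretrained, finetuned, alpha):
    """Return (1 - alpha) * pretrained + alpha * finetuned, per parameter."""
    return {
        name: (1 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }

# Toy weights standing in for two checkpoints of the same architecture.
pretrained = {"w1": 0.0, "w2": 4.0}
finetuned  = {"w1": 2.0, "w2": 8.0}

merged = interpolate_weights(pretrained, finetuned, alpha=0.5)
assert merged == {"w1": 1.0, "w2": 6.0}

# The endpoints recover the original checkpoints exactly.
assert interpolate_weights(pretrained, finetuned, alpha=0.0) == pretrained
assert interpolate_weights(pretrained, finetuned, alpha=1.0) == finetuned
```

Because both checkpoints share one architecture, the average is taken parameter-by-parameter with no retraining required.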

Variants

= Low-rank adaptation =

Low-rank adaptation (LoRA) is an adapter-based technique for efficiently fine-tuning models. The basic idea is to freeze the original weight matrix and learn a low-rank update matrix that is added to it.{{Cite journal |last1=Hu |first1=Edward J. |last2=Shen |first2=Yelong |last3=Wallis |first3=Phillip |last4=Allen-Zhu |first4=Zeyuan |last5=Li |first5=Yuanzhi |last6=Wang |first6=Shean |last7=Wang |first7=Lu |last8=Chen |first8=Weizhu |date=2022-01-28 |title=LoRA: Low-Rank Adaptation of Large Language Models |url=https://openreview.net/forum?id=nZeVKeeFYf9 |journal=ICLR |language=en |arxiv=2106.09685}} An adapter, in this context, is a collection of low-rank matrices which, when added to a base model, produces a fine-tuned model. It allows for performance that approaches full-model fine-tuning with lower space requirements. A language model with billions of parameters may be LoRA fine-tuned with only several million parameters.
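Concretely, a frozen weight matrix W (d × k) is augmented with a trainable update B @ A, where B is d × r and A is r × k for some small rank r, so the effective weights are W + B @ A. A toy sketch (dimensions, rank, and values are illustrative, not from the paper's experiments):

```python
def matmul(X, Y):
    """Naive matrix product of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, k, r = 6, 4, 1  # output dim, input dim, adapter rank (r << min(d, k))

W = [[1.0] * k for _ in range(d)]  # frozen pre-trained weights (d x k)
B = [[2.0] for _ in range(d)]      # trainable, d x r
A = [[0.5] * k]                    # trainable, r x k

# Effective fine-tuned weights: W' = W + B @ A
delta = matmul(B, A)
W_adapted = [[w + dw for w, dw in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
assert W_adapted[0][0] == 2.0      # 1.0 + 2.0 * 0.5

# The adapter trains far fewer parameters than full fine-tuning would:
full_params = d * k                # 24 trainable parameters
lora_params = d * r + r * k        # 10 trainable parameters
assert lora_params < full_params
```

The savings scale with the matrix size: for a 4096 × 4096 weight matrix at rank 8, the adapter has 65,536 trainable parameters versus roughly 16.8 million for the full matrix, which is how billion-parameter models can be fine-tuned with only millions of adapter parameters.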

LoRA-based fine-tuning has become popular in the Stable Diffusion community.{{cite web |url=https://github.com/cloneofsimo/lora |title=Using Low-rank adaptation to quickly fine-tune diffusion models |last=Ryu |first=Simo |date=February 13, 2023 |website=GitHub |access-date=June 19, 2023}} Support for LoRA was integrated into the Diffusers library from Hugging Face.{{cite web |url=https://huggingface.co/blog/lora |title=Using LoRA for Efficient Stable Diffusion Fine-Tuning |last1=Cuenca |first1=Pedro |last2=Paul |first2=Sayak |date=January 26, 2023 |website=Hugging Face |access-date=June 19, 2023}} Support for LoRA and similar techniques is also available for a wide range of other models through Hugging Face's Parameter-Efficient Fine-Tuning (PEFT) package.{{Cite web |title=Parameter-Efficient Fine-Tuning using 🤗 PEFT |url=https://huggingface.co/blog/peft |access-date=2023-06-20 |website=huggingface.co}}

= Representation fine-tuning =

{{One source section |find=Representation fine-tuning (ReFT) |date=May 2024}}

Representation fine-tuning (ReFT) is a technique developed by researchers at Stanford University aimed at fine-tuning large language models (LLMs) by modifying less than 1% of their representations. Unlike traditional parameter-efficient fine-tuning (PEFT) methods, which mainly focus on updating weights, ReFT targets specific parts of the model relevant to the task being fine-tuned. This approach is based on the understanding that deep learning models encode rich semantic information in their representations, suggesting that modifying representations might be a more effective strategy than updating weights.{{Citation |last1=Wu |first1=Zhengxuan |title=ReFT: Representation Finetuning for Language Models |date=2024-04-07 |arxiv=2404.03592 |last2=Arora |first2=Aryaman |last3=Wang |first3=Zheng |last4=Geiger |first4=Atticus |last5=Jurafsky |first5=Dan |last6=Manning |first6=Christopher D. |last7=Potts |first7=Christopher}}

ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations: trained interventions manipulate a small fraction of model representations at inference time to steer model behavior towards solving downstream tasks. One specific method within the ReFT family is Low-rank Linear Subspace ReFT (LoReFT), which intervenes on hidden representations in the linear subspace spanned by a low-rank projection matrix. LoReFT can be seen as the representation-based equivalent of Low-rank Adaptation (LoRA).
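In the LoReFT paper, the intervention takes the form Φ(h) = h + Rᵀ(Wh + b − Rh), where R is a low-rank projection with orthonormal rows and W, b are learned: the hidden state h is edited only within the subspace spanned by the rows of R. A minimal sketch with toy values (the dimensions, R, W, and b below are hypothetical):

```python
def matvec(M, v):
    """Product of a nested-list matrix and a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def loreft(h, R, W, b):
    """Phi(h) = h + R^T (W h + b - R h): edit h only in the row space of R."""
    target = [w + bi - rh for w, bi, rh in zip(matvec(W, h), b, matvec(R, h))]
    # Map the r-dimensional edit back into the full space via R^T.
    correction = [sum(R[i][j] * target[i] for i in range(len(R)))
                  for j in range(len(h))]
    return [x + c for x, c in zip(h, correction)]

h = [1.0, 2.0, 3.0]
R = [[1.0, 0.0, 0.0]]  # rank-1 projection with an orthonormal row
W = [[0.0, 1.0, 0.0]]  # learned projection (toy values)
b = [0.5]

out = loreft(h, R, W, b)
assert out == [2.5, 2.0, 3.0]  # only the subspace coordinate is steered
```

Because R has orthonormal rows, within the subspace the projected hidden state is replaced by the learned target Wh + b, while coordinates orthogonal to R pass through unchanged.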

Applications

= Natural language processing =

Fine-tuning is common in natural language processing (NLP), especially in the domain of language modeling. Large language models like OpenAI's series of GPT foundation models can be fine-tuned on data for specific downstream NLP tasks (tasks that use a pre-trained model) to improve performance over the unmodified pre-trained model.{{cite journal |last1=Dingliwal |first1=Saket |last2=Shenoy |first2=Ashish |last3=Bodapati |first3=Sravan |last4=Gandhe |first4=Ankur |last5=Gadde |first5=Ravi Teja |last6=Kirchhoff |first6=Katrin |date=2021 |title=Prompt Tuning GPT-2 language model for parameter-efficient domain adaptation of ASR systems |journal=InterSpeech |arxiv=2112.08718}}

Commercial models

Commercially offered large language models can sometimes be fine-tuned if the provider offers a fine-tuning API. As of June 19, 2023, language model fine-tuning APIs are offered by OpenAI and Microsoft Azure's Azure OpenAI Service for a subset of their models, as well as by Google Cloud Platform for some of their PaLM models, and by others.{{cite web |url=https://platform.openai.com/docs/guides/fine-tuning |title=Fine-tuning |publisher=OpenAI |access-date=2023-06-19}}{{cite web |url=https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/fine-tuning |title=Learn how to customize a model for your application |publisher=Microsoft |access-date=2023-06-19}}{{cite web |url=https://cloud.google.com/vertex-ai/docs/generative-ai/models/tune-models |title=Tune text foundation models |access-date=2023-06-19}}

See also

References