Vision-language-action model

{{Short description|Foundation model allowing control of robot actions}}

{{Use mdy dates|date=March 2025}}

A vision-language-action model (VLA) is a foundation model that enables the control of robot actions through visual input and natural-language commands.{{Cite journal |last1=Jeong |first1=Hyeongyo |last2=Lee |first2=Haechan |last3=Kim |first3=Changwon |last4=Shin |first4=Sungta |date=October 2024 |title=A Survey of Robot Intelligence with Large Language Models |journal=Applied Sciences |volume=14 |issue=19 |page=8868 |doi=10.3390/app14198868 |doi-access=free }}

One method of constructing a VLA is to fine-tune a vision-language model (VLM) by training it on robot trajectory data together with large-scale visual-language data{{Cite book |last1=Fan |first1=L. |last2=Chen |first2=Z. |last3=Xu |first3=M. |last4=Yuan |first4=M. |last5=Huang |first5=P. |last6=Huang |first6=W. |chapter=Language Reasoning in Vision-Language-Action Model for Robotic Grasping |date=2024 |title=2024 China Automation Congress (CAC) |pages=6656–6661 |doi=10.1109/CAC63892.2024.10865585|isbn=979-8-3503-6860-4 }} or Internet-scale vision-language tasks.{{Cite arXiv |date=July 28, 2023 |title=RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control |eprint=2307.15818 |last1=Brohan |first1=Anthony |last2=Brown |first2=Noah |last3=Carbajal |first3=Justice |last4=Chebotar |first4=Yevgen |last5=Chen |first5=Xi |last6=Choromanski |first6=Krzysztof |last7=Ding |first7=Tianli |last8=Driess |first8=Danny |last9=Dubey |first9=Avinava |last10=Finn |first10=Chelsea |last11=Florence |first11=Pete |last12=Fu |first12=Chuyuan |author13=Montse Gonzalez Arenas |last14=Gopalakrishnan |first14=Keerthana |last15=Han |first15=Kehang |last16=Hausman |first16=Karol |last17=Herzog |first17=Alexander |last18=Hsu |first18=Jasmine |last19=Ichter |first19=Brian |last20=Irpan |first20=Alex |last21=Joshi |first21=Nikhil |last22=Julian |first22=Ryan |last23=Kalashnikov |first23=Dmitry |last24=Kuang |first24=Yuheng |last25=Leal |first25=Isabel |last26=Lee |first26=Lisa |author27=Tsang-Wei Edward Lee |last28=Levine |first28=Sergey |last29=Lu |first29=Yao |last30=Michalewski |first30=Henryk |class=cs.RO |display-authors=1 }}
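In the RT-2 approach, continuous robot actions are represented as sequences of discrete tokens, so that the fine-tuned VLM can output actions the same way it outputs text: each action dimension is discretized into 256 bins. The following is a minimal illustrative sketch of this action tokenization; the 7-dimensional action space and the unit value ranges used here are assumptions for illustration, not details from the cited sources.

<syntaxhighlight lang="python">
import numpy as np

# Sketch of RT-2-style action tokenization: each continuous action
# dimension is mapped to one of 256 integer bins, so an action becomes
# a short token sequence a vision-language model can emit as text.
# The action dimensionality and ranges below are illustrative assumptions.

NUM_BINS = 256

def tokenize_action(action, low, high):
    """Map each continuous action dimension to an integer bin in [0, 255]."""
    action = np.asarray(action, dtype=np.float64)
    scaled = (action - low) / (high - low)               # normalize to [0, 1]
    return np.clip((scaled * NUM_BINS).astype(int), 0, NUM_BINS - 1).tolist()

def detokenize_action(tokens, low, high):
    """Map integer bins back to the continuous values at the bin centers."""
    bins = np.asarray(tokens, dtype=np.float64)
    return low + (bins + 0.5) / NUM_BINS * (high - low)

# Example: a hypothetical 7-DoF arm action (6 end-effector deltas + gripper),
# with each dimension assumed to lie in [-1, 1].
low, high = np.full(7, -1.0), np.full(7, 1.0)
action = [0.12, -0.4, 0.0, 0.9, -0.05, 0.3, 1.0]
tokens = tokenize_action(action, low, high)
print(tokens)                                # e.g. [143, 76, 128, 243, ...]
print(detokenize_action(tokens, low, high))  # approximate reconstruction
</syntaxhighlight>

Because actions become ordinary token sequences under this scheme, fine-tuning reduces to the VLM's usual next-token prediction on (image, instruction, action-tokens) examples, which is what lets such models transfer web-scale vision-language knowledge to robotic control.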

Examples of VLAs include RT-2 from Google DeepMind.{{Cite news |last=Dotson |first=Kyt |date=July 28, 2023 |title=Google unveils RT-2, an AI language model for telling robots what to do |url=https://siliconangle.com/2023/07/28/google-unveils-rt-2-ai-language-model-telling-robots/ |access-date=2025-03-13 |work=Silicon Angle}}

== References ==
{{Reflist}}