Load-link/store-conditional

{{short description|CPU instructions which read and modify an unaltered value in memory}}

In computer science, load-linked/store-conditional (LL/SC), sometimes known as load-reserved/store-conditional (LR/SC), are a pair of instructions used in multithreading to achieve synchronization. Load-link returns the current value of a memory location, while a subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link. Together, this implements a lock-free, atomic, read-modify-write operation.

"Load-linked" is also known as load-link,{{Cite patent|number=US20030217115A1|title=Load-linked/store conditional mechanism in a CC-NUMA system|gdate=2003-11-20|invent1=Rowlands|inventor1-first=Joseph|url=https://patents.google.com/patent/US20030217115A1/en}} load-reserved, and load-locked.{{Citation needed|date=November 2020}}

LL/SC was originally{{Cite journal |last=Herlihy |first=Maurice |date=1993-11-01 |title=A methodology for implementing highly concurrent data objects |url=https://dl.acm.org/doi/10.1145/161468.161469 |journal=ACM Transactions on Programming Languages and Systems |volume=15 |issue=5 |pages=745–770 |doi=10.1145/161468.161469 |issn=0164-0925}} proposed by Jensen, Hagensen, and Broughton for the S-1 AAP multiprocessor{{failed verification|date=March 2025}} at Lawrence Livermore National Laboratory.

Comparison of LL/SC and compare-and-swap

If any updates have occurred, the store-conditional is guaranteed to fail, even if the value read by the load-link has since been restored. As such, an LL/SC pair is stronger than a read followed by a compare-and-swap (CAS), which will not detect updates if the old value has been restored (see ABA problem).

Real implementations of LL/SC do not always succeed even if there are no concurrent updates to the memory location in question. Any exceptional events between the two operations, such as a context switch, another load-link, or even (on many platforms) another load or store operation, will cause the store-conditional to spuriously fail. Older implementations will fail if there are any updates broadcast over the memory bus. This is called weak LL/SC by researchers, as it breaks many theoretical LL/SC algorithms.{{cite web |last1=Beckmann |first1=Nathan |title=Synchronization |url=https://www.cs.cmu.edu/afs/cs/academic/class/15740-f18/www/lectures/10-11-synchronization.pdf |website=15-740: Computer Architecture, Fall 2018 |publisher=Carnegie Mellon University |access-date=23 April 2021}} Weakness is relative, and some weak implementations can be used for some algorithms.

LL/SC is more difficult to emulate than CAS. Additionally, stopping running code between paired LL/SC instructions, such as when single-stepping through code, can prevent forward progress, making debugging tricky.

Nevertheless, LL/SC is equivalent to CAS in the sense that either primitive can be implemented in terms of the other, in O(1) and in a wait-free manner.

Implementations

LL/SC instructions are supported by:

Alpha: ldl_l{{Cite web |url=https://download.majix.org/dec/alpha_arch_ref.pdf#page=79 |title=Alpha Architecture Reference Manual |pages=4–9~4–12|accessdate=2024-01-26}}/stl_c{{Cite web |url=https://download.majix.org/dec/alpha_arch_ref.pdf#page=83 |title=Alpha Architecture Reference Manual |pages=4–13~4–15|accessdate=2024-01-26}} and ldq_l/stq_c
PowerPC/Power ISA: lwarx/stwcx and ldarx/stdcx{{cite book |title=The PowerPC architecture: A SPECIFICATION FOR A NEW FAMILY OF RISC PROCESSORS |last1=May |first1=Cathy |last2=Silha |first2=Ed |last3=Simpson |first3=Eick |last4=Warren |first4=Hank |isbn=1-55860-316-6 |year=1993 |publisher=Morgan Kaufmann PUblishers, Inc |pages=336–338, 465}}{{cite book |title=Optimizing PowerPC Code |last=Kacmarcik |first=Cary |isbn=0-201-40839-2 |date=1995 |publisher=Addison-Wesley Publishing Company |pages=71–72 }}
MIPS: ll/sc{{Cite web |url=https://www.cs.auckland.ac.nz/compsci313s2c/resources/MIPSLLSC.pdf#page=13 |title=APPLICATION NOTE MIPS R4000 Synchronization Primitives |page=9 |accessdate=2023-12-27}} and lld/scd{{Cite web |url=https://www.cs.auckland.ac.nz/compsci313s2c/resources/MIPSLLSC.pdf#page=9 |title=APPLICATION NOTE MIPS R4000 Synchronization Primitives |page=5 |accessdate=2023-12-27}}
ARM: ldrex/strex (ARMv6,{{Cite web |url=https://documentation-service.arm.com/static/5e8e1e0388295d1e18d368b2 |title=ARM11 MPCore™ Processor Revision: r2p0 Technical Reference Manual |page=301-302(8-7,8-8) |accessdate=2023-12-14}} v7 and v8-M{{Cite web |url=https://documentation-service.arm.com/static/615c3754e4f35d2484678be7 |title=Arm®v8-M Architecture Reference Manual |page=278 |accessdate=2023-12-14}}), and ldxr/stxr (ARMv8-A{{Cite web |url=https://documentation-service.arm.com/static/5efa1989dbdee951c1ccdea1 |title=ARMv8-A Synchronization primitives |page=6 |accessdate=2023-12-14}})
RISC-V: lr/sc
ARC: LLOCK/SCOND

Some CPUs{{which|date=January 2017|reason=I have serious doubts about the accuracy of this statement, as typically the cache is used to implement the conditional reservation}} require the address being accessed exclusively to be configured in write-through mode.

Typically, CPUs track the load-linked address at a cache-line or other granularity, such that any modification to any portion of the cache line (whether via another core's store-conditional or merely by an ordinary store) is sufficient to cause the store-conditional to fail.

All of these platforms provide weak{{Clarify|date=January 2017}} LL/SC. The PowerPC implementation allows an LL/SC pair to wrap loads and even stores to other cache lines (although this approach is vulnerable to false cache line sharing). This allows it to implement, for example, lock-free reference counting in the face of changing object graphs with arbitrary counter reuse (which otherwise requires double compare-and-swap, DCAS). RISC-V provides an architectural guarantee of eventual progress for LL/SC sequences of limited length.

Some ARM implementations define platform dependent blocks, ranging from 8 bytes to 2048 bytes, and an LL/SC attempt in any given block fails if there is between the LL and SC a normal memory access inside the same block. Other ARM implementations fail if there is a modification anywhere in the whole address space. The former implementation is the stronger and most practical.

LL/SC has two advantages over CAS when designing a load–store architecture: reads and writes are separate instructions, as required by the design philosophy (and pipeline architecture); and both instructions can be performed using only two registers (address and value), fitting naturally into common 2-operand ISAs. CAS, on the other hand, requires three registers (address, old value, new value) and a dependency between the value read and the value written. x86, being a CISC architecture, does not have this constraint; though modern chips may well translate a CAS instruction into separate LL/SC micro-operations internally.

Extensions

Hardware LL/SC implementations typically do not allow nesting of LL/SC pairs. A nesting LL/SC mechanism can be used to provide a MCAS primitive (multi-word CAS, where the words can be scattered). In 2013, Trevor Brown, Faith Ellen, and Eric Ruppert implemented in software a multi-address LL/SC extension (which they call LLX/SCX) that relies on automated code generation; they have used it to implement one of the best-performing concurrent binary search tree (actually a chromatic tree), slightly beating the JDK CAS-based skip list implementation.

References

{{Reflist|refs=

{{cite book |author=James H. Anderson |author2=Mark Moir |chapter=Universal constructions for multi-object operations |title=PODC '95 Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing |publisher=ACM |year=1995 |isbn=0-89791-710-3 |pages=184–193 |doi=10.1145/224964.224985|s2cid=8204331 }} See their Table 1, Figures 1 & 2 and Section 2 in particular.

{{cite book |first1=Trevor |last1=Brown |first2=Faith |last2=Ellen |first3=Eric |last3=Ruppert |chapter=Pragmatic primitives for non-blocking data structures |chapter-url=http://www.cs.utoronto.ca/~tabrown/scx/fullpaper.pdf |title=PODC '13 Proceedings of the 2013 ACM symposium on Principles of distributed computing |publisher=ACM |year=2013 |isbn=978-1-4503-2065-8 |pages=13–22 |doi=10.1145/2484239.2484273|arxiv=1712.06688 |s2cid=6537417 }} See also [http://www.cs.utoronto.ca/~tabrown/scx/slides.podc13.pdf slides]

{{cite book |author1=Trevor Brown |author2=Faith Ellen |author3=Eric Ruppert |chapter=A general technique for non-blocking trees |chapter-url=http://www.cs.toronto.edu/~tabrown/chromatic/fullpaper.pdf |title=PPoPP '14 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming |publisher=ACM |year=2014 |isbn=978-1-4503-2656-8 |pages=329–342 |doi=10.1145/2555243.2555267|arxiv=1712.06687 |s2cid=9442380 }}

{{cite web |author=Keno Fischer |title=Julia 1.5 Feature Preview: Time Traveling (Linux) Bug Reporting |url=https://julialang.org/blog/2020/05/rr/#hardware_and_software_limitations |access-date=2020-05-14 |date=2020-05-02}}

{{cite tech report |author=Keir Fraser |title=Practical lock-freedom |institution=University of Cambridge Computer Laboratory |number=UCAM-CL-TR-579 |page=20 |date=February 2004 |url=https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf |format=PDF}}

{{cite book |author1=James R. Larus |author2=Ravi Rajwar |title=Transactional Memory |year=2007 |publisher=Morgan & Claypool |isbn=978-1-59829-124-7 |pages=55}}

{{cite book |editor1=Andrew Waterman |editor2=Krste Asanović |editor2-link=Krste Asanović |date=2017-05-07 |title=The RISC-V Instruction Set Manual, Volume 1: User-Level ISA, Version 2.2 |chapter=7.2 Load-Reserved/Store-Conditional Instructions |url=https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf}}

{{cite web |url=http://forum.stanford.edu/wiki/index.php/S-1_project |title=S-1 project |website=Stanford Computer Science wiki |date=2018-11-30}}

}}

{{cite tech report |first=Eric H. |last=Jensen |first2=Gary W. |last2=Hagensen |first3=Jeffrey M. |last3=Broughton |title=A New Approach to Exclusive Data Access in Shared Memory Multiprocessors |institution=Lawrence Livermore National Laboratory |number=UCRL-97663 |date=November 1987 |url=https://e-reports-ext.llnl.gov/pdf/212157.pdf |format=PDF |access-date=2012-02-22 |url-status=dead |archive-url=https://web.archive.org/web/20170202203202/https://e-reports-ext.llnl.gov/pdf/212157.pdf |archive-date=2017-02-02}}
{{cite tech report |first=John D. |last=Bruner |first2=Gary W. |last2=Hagensen |first3=Eric H. |last3=Jensen |first4=Jay C. |last4=Pattin |first5=Jeffrey M. |last5=Broughton |title=Cache Coherence on the S-1 AAP |institution=Lawrence Livermore National Laboratory |number=UCRL-97646 |date=11 November 1987 |url=https://e-reports-ext.llnl.gov/pdf/212414.pdf |format=PDF |access-date=10 November 2013 |archive-url=https://web.archive.org/web/20170202203224/https://e-reports-ext.llnl.gov/pdf/212414.pdf |archive-date=2 February 2017 |url-status=dead}}
{{cite book |first1=D. |last1=Detlefs |first2=P. |last2=Martin |first3=M. |last3=Moir |first4=Guy L. |last4=Steele, Jr. |chapter=Lock-free reference counting |title=PODC '01 Proceedings of the twentieth annual ACM symposium on Principles of distributed computing |publisher=ACM |year=2001 |isbn=1-58113-383-9 |pages=190–9 |doi=10.1145/383962.384016 |citeseerx=10.1.1.92.8221}}
{{cite journal |first=Kirk |last=Reinholtz |title=Atomic Reference Counting Pointers |journal=C/C++ Users Journal |date=December 2004 |url=https://archive.org/details/DDJDVD6}}
{{cite journal |first=R. L. |last=Sites |title=Alpha AXP architecture |journal=Comm. ACM |volume=36 |issue=2 |pages=33–44 |date=February 1993 |doi=10.1145/151220.151226|s2cid=5473184 |doi-access=free }}

Category:Concurrency control

Category:Programming constructs

Category:Transactional memory