Consistent hashing

{{Short description|Hashing technique}}

{{Technical|date=October 2024}}

In computer science, consistent hashing is a special kind of hashing technique such that when a hash table is resized, only n/m keys need to be remapped on average, where n is the number of keys and m is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped, because the mapping between the keys and the slots is defined by a modular operation.

Consistent hashing evenly distributes cache keys across shards, even if some of the shards crash or become unavailable.{{Cite book |title=Designing Distributed Systems Patterns and Paradigms for Scalable, Reliable Services |publisher=O'Reilly Media |year=2018 |isbn=9781491983607}}

History

The term "consistent hashing" was introduced by David Karger et al. at MIT for use in distributed caching, particularly for the web.{{sfn|Roughgarden|Valiant|2021|p=2}} This 1997 academic paper in the Symposium on Theory of Computing introduced "consistent hashing" as a way of distributing requests among a changing population of web servers.{{sfn|Roughgarden|Valiant|2021|p=7}} Each slot is then represented by a server in a distributed system or cluster. The addition or removal of a server (during scaling or an outage) requires only num_keys/num_slots items to be re-shuffled when the number of slots (i.e. servers) changes. The authors mention linear hashing and its ability to handle sequential server addition and removal, while consistent hashing allows servers to be added and removed in an arbitrary order.

{{cite conference |url=http://portal.acm.org/citation.cfm?id=258660 |doi=10.1145/258533.258660 |author1=Karger, D. |author2=Lehman, E. |author3=Leighton, T. |author3-link=F. Thomson Leighton |author4=Panigrahy, R. |author5=Levine, M. |author6=Lewin, D. |author6-link=Daniel M. Lewin |conference=Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing |pages=654–663 |year=1997 |publisher=ACM Press New York, NY, USA |title=Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web |url-access=subscription}} The paper was later re-purposed to address the technical challenge of keeping track of a file in peer-to-peer networks such as a distributed hash table.{{sfn|Roughgarden|Valiant|2021|p=8}}I. Stoica et al., "Chord: a scalable peer-to-peer lookup protocol for Internet applications," in IEEE/ACM Transactions on Networking, vol. 11, no. 1, pp. 17–32, Feb. 2003, doi: 10.1109/TNET.2002.808407.

Teradata used this technique in their distributed database{{Citation needed|date=September 2024}}, released in 1986, although they did not use this term. Teradata still uses the concept of a hash table to fulfill exactly this purpose. Akamai Technologies was founded in 1998 by the scientists Daniel Lewin and F. Thomson Leighton (co-authors of the article coining "consistent hashing"). In Akamai's content delivery network,{{cite journal |author1=Nygren., E. |author2=Sitaraman R. K. |author3=Sun, J. |title=The Akamai Network: A Platform for High-Performance Internet Applications |journal=ACM SIGOPS Operating Systems Review |volume=44 |issue=3 |year=2010 |pages=2–19 |doi=10.1145/1842733.1842736 |s2cid=207181702 |url=https://www.akamai.com/site/en/documents/research-paper/the-akamai-network-a-platform-for-high-performance-internet-applications-technical-publication.pdf |accessdate=August 29, 2023 |archive-url=https://web.archive.org/web/20221130124826/https://www.akamai.com/site/en/documents/research-paper/the-akamai-network-a-platform-for-high-performance-internet-applications-technical-publication.pdf |archive-date=November 30, 2022 |url-status=live }} consistent hashing is used to balance the load within a cluster of servers, while a stable marriage algorithm is used to balance load across clusters.{{cite journal | author= Bruce Maggs and Ramesh Sitaraman | title = Algorithmic nuggets in content delivery | journal= ACM SIGCOMM Computer Communication Review |year=2015|volume=45|issue=3|url = http://www.sigcomm.org/sites/default/files/ccr/papers/2015/July/0000000-0000009.pdf}}

Consistent hashing has also been used to reduce the impact of partial system failures in large web applications to provide robust caching without incurring the system-wide fallout of a failure.{{cite journal |url=http://www8.org/w8-papers/2a-webserver/caching/paper2.html |doi=10.1016/S1389-1286(99)00055-9 |author1=Karger, D. |author2=Sherman, A. |author3=Berkheimer, A. |author4=Bogstad, B. |author5=Dhanidina, R. |author6=Iwamoto, K. |author7=Kim, B. |author8=Matkins, L. |author9=Yerushalmi, Y. |journal=Computer Networks |volume=31 |issue=11 |pages=1203–1213 |year=1999 |title=Web Caching with Consistent Hashing |access-date=2008-02-05 |archive-url=https://web.archive.org/web/20080721013638/http://www8.org/w8-papers/2a-webserver/caching/paper2.html |archive-date=2008-07-21 |url-status=dead |url-access=subscription }} Consistent hashing is also the cornerstone of distributed hash tables (DHTs), which employ hash values to partition a keyspace across a distributed set of nodes, then construct an overlay network of connected nodes that provide efficient node retrieval by key.

Rendezvous hashing, designed in 1996, is a simpler and more general technique {{Citation needed|date=April 2021}}. It achieves the goals of consistent hashing using the very different highest random weight (HRW) algorithm.

Basic technique

[[File:Consistent Hashing Sample Illustration.png|thumb]]

In the problem of load balancing, for example, when a BLOB has to be assigned to one of n servers in a cluster, a standard hash function could be used as follows: compute the hash value of the BLOB and, assuming the resultant value is \beta, perform a modular operation with the number of servers (n in this case) to determine the server on which to place the BLOB: \zeta = \beta\ \%\ n; the BLOB is then placed on the server whose \text{server ID} is the successor of \zeta. However, when a server is added or removed during an outage or scaling (when n changes), all the BLOBs on every server must be reassigned and moved due to rehashing, and this operation is expensive.
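
A minimal Python sketch of this modular scheme and of the remapping it causes when n changes; the key names and the choice of MD5 are illustrative assumptions rather than part of the original formulation:

<syntaxhighlight lang="python">
import hashlib

def modular_assign(blob_key: str, num_servers: int) -> int:
    """Compute the BLOB's hash value beta and return zeta = beta % n."""
    beta = int(hashlib.md5(blob_key.encode()).hexdigest(), 16)
    return beta % num_servers

keys = [f"blob-{i}" for i in range(1000)]
before = {k: modular_assign(k, 5) for k in keys}  # cluster of 5 servers
after = {k: modular_assign(k, 6) for k in keys}   # one server added

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} BLOBs changed servers")  # typically the large majority
</syntaxhighlight>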

Consistent hashing was designed to avoid the problem of having to reassign every BLOB when a server is added or removed throughout the cluster. The central idea is to use a hash function that maps both the BLOBs and the servers to a unit circle, usually 2\pi radians. For example, \zeta = \Phi\ \%\ 360 (where \Phi is the hash of a BLOB or of a server's identifier, such as an IP address or UUID). Each BLOB is then assigned to the next server that appears on the circle in clockwise order. Usually, binary search or linear search is used to find the "spot" or server on which to place a particular BLOB, in O(\log N) or O(N) time respectively; in every iteration, which happens in clockwise manner, an operation \zeta\ \le\ \Psi (where \Psi is the position of a server on the circle) is performed to find the server on which to place the BLOB. This provides an even distribution of BLOBs to servers. More importantly, if a server fails and is removed from the circle, only the BLOBs that were mapped to the failed server need to be reassigned to the next server in clockwise order. Likewise, if a new server is added, it is added to the unit circle, and only the BLOBs that now map to the new server need to be reassigned to it.
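
A minimal sketch of the ring lookup in Python, keeping server positions in a sorted list and finding the clockwise successor with binary search; the 360-degree circle follows the example above, while the hash function and server addresses are illustrative assumptions:

<syntaxhighlight lang="python">
import bisect
import hashlib

def ring_position(identifier: str) -> int:
    """Map a BLOB key or server identifier onto the 0-359 degree circle."""
    return int(hashlib.md5(identifier.encode()).hexdigest(), 16) % 360

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ring = sorted((ring_position(s), s) for s in servers)

def lookup(blob_key: str) -> str:
    """Return the first server at or after the BLOB's position, wrapping around the circle."""
    zeta = ring_position(blob_key)
    index = bisect.bisect_left(ring, (zeta, ""))
    return ring[index % len(ring)][1]

print(lookup("blob-42"))
</syntaxhighlight>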

Importantly, when a server is added or removed, the vast majority of the BLOBs keep their prior server assignments, and the addition of the n^{th} server only causes a 1/n fraction of the BLOBs to relocate. Although the process of moving BLOBs across cache servers in the cluster depends on the context, commonly the newly added cache server identifies its successor on the circle and moves from it all the BLOBs that now map to the new server (i.e. those whose hash value is less than that of the new server). However, in the case of web page caches, most implementations involve no moving or copying at all, assuming the cached BLOB is small enough. When a request hits a newly added cache server, a cache miss happens, a request to the actual web server is made, and the BLOB is cached locally for future requests. The now-redundant BLOBs on the previously used cache servers are removed as per the cache eviction policies.{{sfn|Roughgarden|Valiant|2021|p=6}}
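
A sketch of this effect, counting how many BLOBs relocate when a sixth server joins a five-server ring; the hash function, the server names and the 1000-key workload are illustrative assumptions:

<syntaxhighlight lang="python">
import bisect
import hashlib

def ring_position(identifier: str) -> int:
    return int(hashlib.md5(identifier.encode()).hexdigest(), 16) % 360

def build_ring(servers):
    return sorted((ring_position(s), s) for s in servers)

def lookup(ring, blob_key):
    index = bisect.bisect_left(ring, (ring_position(blob_key), ""))
    return ring[index % len(ring)][1]

keys = [f"blob-{i}" for i in range(1000)]
old_ring = build_ring([f"server-{i}" for i in range(5)])
new_ring = build_ring([f"server-{i}" for i in range(6)])  # the sixth server is added

moved = sum(1 for k in keys if lookup(old_ring, k) != lookup(new_ring, k))
print(f"{moved} of {len(keys)} BLOBs relocated")  # roughly 1/6 of the keys on average
</syntaxhighlight>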

=Implementation=

Let h_{b}(x) and h_{s}(x) be the hash functions used for BLOBs and for servers' unique identifiers, respectively. In practice, a binary search tree (BST) is used to dynamically maintain the \text{server ID}s within a cluster or hashring, and tree traversal is used to find the successor or minimum within the BST; a sketch of these operations follows the list below.

;Inserting x into the cluster

:Let \beta be the hash value of a BLOB x \in \mathrm{BLOB}, and let \zeta = \beta\ \%\ 360 be its position on the hashring, i.e. h_{b}(x)=\zeta. To insert x, find the successor of \zeta in the BST of \text{server ID}s. If \zeta is larger than all of the \text{server ID}s, the BLOB is placed on the server with the smallest \text{server ID} value.

;Deleting x from the cluster

:Find the successor of \zeta in the BST and remove the BLOB from the returned \text{server ID}. If \zeta has no successor, remove the BLOB from the smallest of the \text{server ID}s.{{sfn|Moitra|2016|p=2}}

;Insert a server into cluster

:Let \Phi be the hash value of a server's identifier x \in \{\text{IP address, UUID}\}, and let \theta = \Phi\ \%\ 360 be its position on the hashring, i.e. h_{s}(x)=\theta. Move all the BLOBs whose hash value is smaller than \theta from the server whose \text{server ID} is the successor of \theta. If \theta is the largest of all the \text{server ID}s, move the relevant BLOBs from the smallest of the \text{server ID}s into \theta.{{sfn|Moitra|2016|p=2–3}}

;Delete a server from cluster

:Find the successor of \theta in the BST and move the BLOBs from \theta into its successor server. If \theta has no successor, move the BLOBs into the smallest of the \text{server ID}s.{{sfn|Moitra|2016|p=3}}
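
A minimal Python sketch of these four operations, with a sorted list standing in for the BST used for successor queries; the class and helper names are illustrative assumptions, and collisions among ring positions as well as replication are ignored:

<syntaxhighlight lang="python">
import bisect
import hashlib

def _position(identifier: str) -> int:
    """Hash a BLOB key or server identifier onto the 0-359 degree circle."""
    return int(hashlib.md5(identifier.encode()).hexdigest(), 16) % 360

class HashRing:
    def __init__(self):
        self._server_ids = []  # sorted ring positions; stands in for the BST
        self._servers = {}     # ring position -> server identifier
        self._blobs = {}       # ring position -> set of BLOB keys stored on that server

    def _successor(self, position):
        """Return the server ID at or after `position`, wrapping to the smallest ID."""
        index = bisect.bisect_left(self._server_ids, position)
        return self._server_ids[index % len(self._server_ids)]

    def insert_blob(self, blob_key):
        server_id = self._successor(_position(blob_key))
        self._blobs[server_id].add(blob_key)
        return self._servers[server_id]

    def delete_blob(self, blob_key):
        self._blobs[self._successor(_position(blob_key))].discard(blob_key)

    def insert_server(self, identifier):
        theta = _position(identifier)
        old_successor = self._successor(theta) if self._server_ids else None
        bisect.insort(self._server_ids, theta)
        self._servers[theta] = identifier
        self._blobs[theta] = set()
        if old_successor is not None:
            # BLOBs on the old successor that now map to the new server move over.
            moving = {k for k in self._blobs[old_successor]
                      if self._successor(_position(k)) == theta}
            self._blobs[old_successor] -= moving
            self._blobs[theta] = moving

    def delete_server(self, identifier):
        theta = _position(identifier)
        self._server_ids.remove(theta)
        orphans = self._blobs.pop(theta)
        del self._servers[theta]
        if self._server_ids:
            # Orphaned BLOBs move to the deleted server's former successor.
            self._blobs[self._successor(theta)] |= orphans

ring = HashRing()
ring.insert_server("10.0.0.1")
ring.insert_server("10.0.0.2")
print(ring.insert_blob("blob-42"))
</syntaxhighlight>

A self-balancing BST gives the same O(\log N) successor queries while also supporting O(\log N) insertion of \text{server ID}s; the sorted list merely keeps the sketch short.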

=Variance reduction=

To avoid the skewness that arises when server positions are not uniformly distributed around the circle, multiple labels are used per server. These duplicate labels are called "virtual nodes": multiple labels that point to a single "real" label or server within the cluster. The number of virtual nodes or duplicate labels used for a particular server within a cluster is called the "weight" of that server.{{sfn|Roughgarden|Valiant|2021|p=6–7}}
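
A minimal sketch of virtual nodes in Python: each server is hashed under several labels, and a larger hash space than the 360-degree example is used so that the labels are unlikely to collide; the label format and the weight of 100 are illustrative assumptions:

<syntaxhighlight lang="python">
import bisect
import hashlib

def ring_position(identifier: str) -> int:
    return int(hashlib.md5(identifier.encode()).hexdigest(), 16) % (2**32)

def build_ring(servers, weight=100):
    """Place `weight` virtual nodes per server on the ring."""
    ring = []
    for server in servers:
        for i in range(weight):
            ring.append((ring_position(f"{server}#vnode-{i}"), server))
    ring.sort()
    return ring

def lookup(ring, blob_key):
    index = bisect.bisect_left(ring, (ring_position(blob_key), ""))
    return ring[index % len(ring)][1]

ring = build_ring(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(lookup(ring, "blob-42"))
</syntaxhighlight>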

Practical extensions

A number of extensions to the basic technique are needed for effectively using consistent hashing for load balancing in practice. In the basic scheme above, if a server fails, all its BLOBs are reassigned to the next server in clockwise order, potentially doubling the load of that server. This may not be desirable. To ensure a more even redistribution of BLOBs on server failure, each server can be hashed to multiple locations on the unit circle. When a server fails, the BLOBs assigned to each of its replicas on the unit circle will get reassigned to a different server in clockwise order, thus redistributing the BLOBs more evenly. Another extension concerns a situation where a single BLOB gets "hot" and is accessed a large number of times and will have to be hosted in multiple servers. In this situation, the BLOB may be assigned to multiple contiguous servers by traversing the unit circle in clockwise order. A more complex practical consideration arises when two BLOBs are hashed near each other in the unit circle and both get "hot" at the same time. In this case, both BLOBs will use the same set of contiguous servers in the unit circle. This situation can be ameliorated by each BLOB choosing a different hash function for mapping servers to the unit circle.
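
A sketch of the "hot" BLOB extension above, assigning a BLOB to several distinct servers found by continuing clockwise around the circle; the replication factor and helper names are illustrative assumptions:

<syntaxhighlight lang="python">
import bisect
import hashlib

def ring_position(identifier: str) -> int:
    return int(hashlib.md5(identifier.encode()).hexdigest(), 16) % (2**32)

def lookup_replicas(ring, blob_key, replicas=3):
    """Return the next `replicas` distinct servers clockwise from the BLOB's position."""
    index = bisect.bisect_left(ring, (ring_position(blob_key), ""))
    chosen = []
    for offset in range(len(ring)):
        server = ring[(index + offset) % len(ring)][1]
        if server not in chosen:
            chosen.append(server)
        if len(chosen) == replicas:
            break
    return chosen

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
ring = sorted((ring_position(s), s) for s in servers)
print(lookup_replicas(ring, "hot-blob"))
</syntaxhighlight>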

Comparison with rendezvous hashing and other alternatives

Rendezvous hashing, designed in 1996, is a simpler and more general technique, and permits fully distributed agreement on a set of k options out of a possible set of n options. It can in fact be shown that consistent hashing is a special case of rendezvous hashing. Because of its simplicity and generality, rendezvous hashing is now used in place of consistent hashing in many applications.
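
A minimal sketch of the highest random weight idea in Python: every (key, server) pair is hashed and the k highest-scoring servers are selected; the way key and server are combined into a single hash input here is an illustrative assumption:

<syntaxhighlight lang="python">
import hashlib

def hrw_servers(blob_key, servers, k=1):
    """Rank servers by the hash of (blob_key, server) and return the top k."""
    def weight(server):
        return int(hashlib.md5(f"{blob_key}:{server}".encode()).hexdigest(), 16)
    return sorted(servers, key=weight, reverse=True)[:k]

print(hrw_servers("blob-42", ["10.0.0.1", "10.0.0.2", "10.0.0.3"], k=2))
</syntaxhighlight>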

If key values will always increase monotonically, an alternative approach using a hash table with monotonic keys may be more suitable than consistent hashing.{{citation needed|date=October 2019}}

Complexity

class="wikitable"

|+Asymptotic time complexities for N nodes (or slots) and K keys

!

!Classic hash table

!Consistent hashing

add a node

|O(K)

|O(K/N + \log N)

remove a node

|O(K)

|O(K/N + \log N)

lookup a key

|O(1)

|O(\log N)

add a key

|O(1)

|O(\log N)

remove a key

|O(1)

|O(\log N)

The O(K/N) term is the average cost of redistributing keys, and the O(\log N) complexity for consistent hashing comes from the fact that a binary search among node angles is required to find the next node on the ring.{{citation needed|date=October 2019}}

Examples

Known examples of consistent hashing use include:

  • Couchbase automated data partitioning {{cite web|url=https://blog.couchbase.com/what-exactly-membase/|title=What Exactly Is Membase?|date=16 December 2014 |accessdate=2020-10-29}}
  • OpenStack's Object Storage Service Swift{{cite web|url=https://docs.openstack.org/swift/latest/ring_background.html|title=Building a Consistent Hashing Ring |date=February 2011| access-date=2019-11-17| website=openstack.org| first1=Greg|last1=Holt}}
  • Partitioning component of Amazon's storage system Dynamo{{cite journal |author1=DeCandia, G. |author2=Hastorun, D. |author3=Jampani, M. |author4=Kakulapati, G. |author5=Lakshman, A. |author6=Pilchin, A. |author7=Sivasubramanian, S. |author8=Vosshall, P. |author9=Vogels, Werner |author9-link=Werner Vogels |title=Dynamo |journal=ACM SIGOPS Operating Systems Review |year=2007 |doi=10.1145/1323293.1294281 |pages=205–220 |volume=41 |issue=6 |url=http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf |access-date=2018-06-07}}
  • Data partitioning in Apache Cassandra{{cite journal |author1=Lakshman, Avinash |author2=Malik, Prashant |title=Cassandra: a decentralized structured storage system |journal=ACM SIGOPS Operating Systems Review |year=2010 |doi=10.1145/1773912.1773922 |volume=44 |issue=2 |pages=35–40 |s2cid=916681}}

  • Data partitioning in ScyllaDB{{cite web|title=NoSQL Comparison: MongoDB vs ScyllaDB|url=https://benchant.com/blog/mongodb-vs-scylladb-comparison|website=benchant.com|accessdate=21 March 2024}}
  • Data partitioning in Voldemort{{cite web|title=Design -- Voldemort|url=http://www.project-voldemort.com/voldemort/design.html|website=www.project-voldemort.com/|accessdate=9 February 2015|archiveurl=https://web.archive.org/web/20150209164650/http://www.project-voldemort.com/voldemort/design.html|archivedate=9 February 2015|quote=Consistent hashing is a technique that avoids these problems, and we use it to compute the location of each key on the cluster.}}
  • Akka's consistent hashing router{{cite web|url=http://doc.akka.io/docs/akka/snapshot/scala/routing.html|title=Akka Routing|access-date=2019-11-16|website=akka.io}}
  • Riak, a distributed key-value database{{cite web|url=http://docs.basho.com/riak/1.4.8/theory/why-riak/|title=Riak Concepts|url-status=dead|access-date=2016-12-06|archive-url=https://web.archive.org/web/20150919042730/http://docs.basho.com/riak/1.4.8/theory/why-riak/|archive-date=2015-09-19}}
  • Gluster, a network-attached storage file system{{cite web|title=GlusterFS Algorithms: Distribution|url=http://www.gluster.org/2012/03/glusterfs-algorithms-distribution/|date=2012-03-01|access-date=2019-11-16|website=gluster.org}}
  • Akamai content delivery network{{cite web|url=http://theory.stanford.edu/~tim/s16/l/l1.pdf|title=Modern Algorithmic Toolbox|access-date=2019-11-17|date=2016-03-28|first1=Tim|first2=Gregory|last2=Valiant|last1=Roughgarden|website=stanford.edu}}
  • Discord chat application{{cite web|url=https://discord.com/blog/how-discord-scaled-elixir-to-5-000-000-concurrent-users|title=How Discord Scaled Elixir to 5,000,000 Concurrent Users|first1=Stanislav|last1=Vishnevskiy|date=2017-07-06|access-date=2022-08-16}}
  • Load balancing gRPC requests to a distributed cache in SpiceDB{{cite web|url=https://authzed.com/blog/consistent-hash-load-balancing-grpc|title=Consistent Hash Load Balancing for gRPC|date=24 November 2021 |accessdate=2023-09-04}}
  • Chord algorithm{{Cite journal |last1=Stoica |first1=I. |author-link1=Ion Stoica |last2=Morris |first2=R. |last3=Liben-Nowell |first3=D. |last4=Karger |first4=D. |last5=Kaashoek |first5=M. F. |last6=Dabek |first6=F. |last7=Balakrishnan |first7=H. |author-link7=Hari Balakrishnan |title=Chord: a scalable peer-to-peer lookup protocol for Internet applications |journal=IEEE/ACM Transactions on Networking |volume=11 |issue=1 |date=25 Feb 2003 |pages=17–32 |doi=10.1109/TNET.2002.808407 |s2cid=221276912}}

  • MinIO object storage system{{cite web|title=MinIO Versioning, Metadata and Storage Deep Dive|date=3 January 2022 |url=https://blog.min.io/minio-versioning-metadata-deep-dive|access-date=2023-10-24}}

References

{{Reflist}}

=Works cited=

  • {{cite web|url=https://people.csail.mit.edu/moitra/docs/6854lec3.pdf|publisher=Massachusetts Institute of Technology| first=Ankur|last=Moitra|date=10 February 2016|title=Advanced Algorithms, 6.854|archive-url=https://web.archive.org/web/20210413045612/http://people.csail.mit.edu/moitra/docs/6854lec3.pdf|archive-date=13 April 2021|url-status=live|access-date=8 October 2021}}
  • {{cite web|url=https://web.stanford.edu/class/cs168/l/l1.pdf|first1=Tim|last1=Roughgarden|first2=Gregory|last2=Valiant|publisher=Stanford University|title=The Modern Algorithmic Toolbox, Introduction to Consistent Hashing|date=28 March 2021|access-date=7 October 2021|archive-url=https://web.archive.org/web/20210725194111/https://web.stanford.edu/class/cs168/l/l1.pdf|archive-date=25 July 2021|url-status=live}}