SimHash

{{Short description|Technique for quickly estimating similariy of sets}}

In computer science, SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near duplicate pages. It was created by Moses Charikar. In 2021 Google announced its intent to also use the algorithm in their newly created FLoC (Federated Learning of Cohorts) system.{{Cite web|last=Cyphers|first=Bennett|date=2021-03-03|title=Google's FLoC Is a Terrible Idea|url=https://www.eff.org/deeplinks/2021/03/googles-floc-terrible-idea|access-date=2021-04-13|website=Electronic Frontier Foundation}}

Evaluation and benchmarks

A large scale evaluation has been conducted by Google in 2006

{{citation

| last1 = Henzinger

| first1 = Monika

| contribution = Finding near-duplicate web pages: a large-scale evaluation of algorithms

| title = Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

| pages = 284

| year = 2006

| doi = 10.1145/1148170.1148222

| isbn = 978-1595933690

| s2cid = 207160068

| url = http://infoscience.epfl.ch/record/99373

}}. to compare the performance of Minhash and Simhash{{citation

| author = Charikar, Moses S.

| contribution = Similarity estimation techniques from rounding algorithms

| title = Proceedings of the 34th Annual ACM Symposium on Theory of Computing

| pages = 380–388

| year = 2002

| doi = 10.1145/509907.509965| isbn = 978-1581134957

| s2cid = 4229473

}}. algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling

{{citation

| last1 = Gurmeet Singh | first1 = Manku

| last2 = Jain | first2 = Arvind

| last3 = Das Sarma | first3 = Anish

| contribution = Detecting near-duplicates for web crawling

| title = Proceedings of the 16th International Conference on World Wide Web

| pages = 141

| url = http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf

| year = 2007

| doi = 10.1145/1242572.1242592| isbn = 9781595936547

}}. and using Minhash and LSH for Google News personalization.

{{citation

| last1 = Das | first1 = Abhinandan S.

| last2 = Datar | first2 = Mayur

| last3 = Garg | first3 = Ashutosh

| last4 = Rajaram | first4 = Shyam

| contribution = Google news personalization: scalable online collaborative filtering

| title = Proceedings of the 16th International Conference on World Wide Web

| pages = 271

| year = 2007

| doi = 10.1145/1242572.1242610 |display-authors=etal| isbn = 9781595936547

| s2cid = 207163129

}}.

References

External links

[http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf Simhash Princeton Paper]
[http://matpalm.com/resemblance/simhash/ Simhash explained]
[http://proceedings.mlr.press/v33/shrivastava14.pdf Comparison of MinHash vs. Simhash]

Category:Hash functions

Category:Clustering criteria

Category:Hashing

Category:Probabilistic data structures

SimHash

Evaluation and benchmarks

See also

References

External links