Similarity search

Similarity search is the most general term for a range of mechanisms that share the principle of searching (typically very large) spaces of objects where the only available comparator is the similarity between any pair of objects. This is becoming increasingly important in an age of large information repositories whose objects do not possess any natural order, such as large collections of images, sounds and other sophisticated digital objects.

Nearest neighbor search and range queries are important subclasses of similarity search, and a number of solutions exist for both. Research in similarity search is dominated by the inherent problems of searching over complex objects. Such objects cause most known techniques to scale poorly over large collections, a manifestation of the so-called curse of dimensionality, and many problems remain unsolved. Unfortunately, in many cases where similarity search is necessary, the objects are inherently complex.
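As a point of reference, both query types can be answered by an exhaustive scan that compares the query with every object. The following is only a minimal sketch: the function names, the sample points and the choice of Euclidean distance are illustrative assumptions, and the linear cost per query is exactly what index structures try to avoid.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.dist(a, b)


def range_query(data, query, radius, dist=euclidean):
    """Return every object whose distance to the query is at most radius."""
    return [x for x in data if dist(query, x) <= radius]


def knn_query(data, query, k, dist=euclidean):
    """Return the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, data, key=lambda x: dist(query, x))


# Hypothetical sample data: four points in the plane.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
q = (0.9, 0.9)

print(range_query(points, q, 1.5))  # objects within distance 1.5 of q
print(knn_query(points, q, 2))      # the two nearest neighbors of q
```

Any other distance function can be passed as `dist`; the scan itself only needs the pairwise comparator.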

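The most general approach to similarity search relies upon the mathematical notion of a metric space. Because a metric satisfies the triangle inequality, an index can bound the distance between two objects without computing it, which allows the construction of efficient index structures that scale to large search domains.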
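One way a metric helps is pivot-based filtering. For any pivot object p, the triangle inequality gives d(q, x) ≥ |d(q, p) − d(p, x)|, so an object x can be discarded without computing d(q, x) whenever that lower bound already exceeds the query radius. The sketch below illustrates the idea with a single pivot; the function names and sample data are hypothetical, and practical metric indexes typically combine many pivots or a tree structure.

```python
import math


def build_pivot_table(data, pivot, dist):
    """Precompute d(pivot, x) for every object x (done once, at index time)."""
    return [dist(pivot, x) for x in data]


def range_query_with_pivot(data, table, pivot, query, radius, dist):
    """Range query that skips objects ruled out by the triangle inequality.

    Returns the matching objects and the number of distance computations
    performed at query time (including the one to the pivot).
    """
    d_qp = dist(query, pivot)
    computed = 1
    results = []
    for x, d_px in zip(data, table):
        # Lower bound on d(query, x); if it exceeds the radius, x cannot match.
        if abs(d_qp - d_px) > radius:
            continue
        computed += 1
        if dist(query, x) <= radius:
            results.append(x)
    return results, computed


# Hypothetical sample data: five points on a line, pivot at the origin.
data = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
pivot = (0.0, 0.0)
table = build_pivot_table(data, pivot, math.dist)

matches, computed = range_query_with_pivot(
    data, table, pivot, (9.5, 0.0), 1.0, math.dist
)
print(matches, computed)
```

In this example three of the five objects are excluded by the bound alone, so only three distances are computed at query time instead of five.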

Similarity search evolved independently in a number of different scientific and computing contexts, according to various needs. In 2008, a few leading researchers in the field felt strongly that the subject should be a research topic in its own right, to allow focus on the general issues applicable across the many diverse domains of its use. This resulted in the formation of the [http://www.sisap.org SISAP] foundation, whose main activity is a series of annual international conferences on the generic topic.

Types

= Locality-sensitive hashing =

A popular approach to similarity search is locality-sensitive hashing (LSH) (Gionis, A., Indyk, P., and Motwani, R. "Similarity search in high dimensions via hashing." VLDB, 1999). It hashes input items so that similar items map to the same "buckets" in memory with high probability (the number of buckets being much smaller than the universe of possible input items). It is often applied to nearest neighbor search on large-scale, high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases (Rajaraman, A. and Ullman, J. [http://infolab.stanford.edu/~ullman/mmds.html Mining of Massive Datasets, Ch. 3], 2010).
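The bucketing idea can be sketched with one well-known LSH family, random hyperplanes for cosine similarity: each hash bit records on which side of a random hyperplane a vector falls, so vectors separated by a small angle tend to agree on most bits. This is an illustrative assumption rather than the scheme of any particular cited work, and all function names and parameters below are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return tuple(
        int(sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0.0)
        for h in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Map each signature (bucket key) to the indices of the vectors in it."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, hyperplanes)].append(idx)
    return buckets


def candidates(query, buckets, hyperplanes):
    """Indices of stored vectors that share the query's bucket."""
    return buckets.get(signature(query, hyperplanes), [])


# Hypothetical sample data: a vector and its exact opposite.
vectors = [(1.0, 2.0, 3.0), (-1.0, -2.0, -3.0)]
planes = make_hyperplanes(num_bits=8, dim=3)
index = build_index(vectors, planes)

# A positive multiple of vectors[0] points in the same direction,
# so it receives the same signature and lands in the same bucket.
print(candidates((2.0, 4.0, 6.0), index, planes))
```

Only the candidates in the matching bucket need an exact distance computation. Practical systems usually build several such tables with independent hyperplanes to reduce the chance of missing a true neighbor.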

Bibliography

  • Pei Lee, Laks V. S. Lakshmanan, Jeffrey Xu Yu: On Top-k Structural Similarity Search. ICDE 2012:774-785
  • Zezula, P., Amato, G., Dohnal, V., and Batko, M. Similarity Search - The Metric Space Approach. Springer, 2006. {{ISBN|0-387-29146-6}}
  • Samet, H. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006. {{ISBN|0-12-369446-9}}
  • E. Chavez, G. Navarro, R.A. Baeza-Yates, J.L. Marroquin, [https://dx.doi.org/10.1145/502807.502808 Searching in metric spaces], ACM Computing Surveys, 2001
  • M.L. Hetland, [https://link.springer.com/chapter/10.1007%2F978-3-642-03625-5_9# The Basic Principles of Metric Indexing], Swarm Intelligence for Multi-objective Problems in Data Mining, Studies in Computational Intelligence Volume 242, 2009, pp 199–232

Resources

  • [http://mufin.fi.muni.cz/ The Multi-Feature Indexing Network (MUFIN) Project]
  • [https://code.google.com/p/mi-file/ MI-File (Metric Inverted File)]
  • [http://cophir.isti.cnr.it/ Content-based Photo Image Retrieval Test-Collection (CoPhIR)]
  • [https://www.sisap.org/ International Conference on Similarity Search and Applications (SISAP)]
  • [http://ann-benchmarks.com/ ANN-Benchmarks], a benchmark for approximate nearest neighbor search algorithms