Random indexing
{{short description|Dimensionality reduction method for distributional semantics}}
Random indexing is a dimensionality reduction method and computational framework for distributional semantics. It is based on the insight that very-high-dimensional vector space model implementations are impractical, that models need not grow in dimensionality when new items (e.g. new terminology) are encountered, and that a high-dimensional model can be projected into a space of lower dimensionality without compromising L2 distance metrics, provided the resulting dimensions are chosen appropriately.
This is the original point of the random projection approach to dimensionality reduction, first formulated as the Johnson–Lindenstrauss lemma; locality-sensitive hashing has some of the same starting points. Random indexing, as used in the representation of language, originates from the work of Pentti Kanerva<ref>Kanerva, Pentti; Kristoferson, Jan; Holst, Anders (2000). [https://cloudfront.escholarship.org/dist/prd/content/qt5644k0w6/qt5644k0w6.pdf Random Indexing of Text Samples for Latent Semantic Analysis]. Proceedings of the 22nd Annual Conference of the Cognitive Science Society, p. 1036. Mahwah, New Jersey: Erlbaum.</ref><ref>Sahlgren, Magnus (2005). [http://eprints.sics.se/221/1/RI_intro.pdf An Introduction to Random Indexing]. Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, August 16, Copenhagen, Denmark.</ref><ref>Sahlgren, Magnus; Holst, Anders; Kanerva, Pentti (2008). [http://eprints.sics.se/3436/01/permutationsCogSci08.pdf Permutations as a Means to Encode Order in Word Space]. Proceedings of the 30th Annual Conference of the Cognitive Science Society: 1300–1305.</ref><ref>Kanerva, Pentti (2009). [http://www.rctn.org/vs265/kanerva09-hyperdimensional.pdf Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors]. Cognitive Computation, Volume 1, Issue 2, pp. 139–159.</ref><ref>Joshi, Aditya; Halseth, Johan; Kanerva, Pentti (2014). [https://arxiv.org/abs/1412.7026 Language Recognition using Random Indexing]. arXiv preprint arXiv:1412.7026.</ref> on sparse distributed memory, and can be described as an incremental formulation of a random projection.<ref>Recchia, Gabriel, et al. (2010). [https://cloudfront.escholarship.org/dist/prd/content/qt7wc694rn/qt7wc694rn.pdf Encoding sequential information in vector space models of semantics: Comparing holographic reduced representation and random permutation]: 865–870.</ref>
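The incremental scheme can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the dimensionality, sparsity, and context-window size below are hypothetical choices made for the example. Each word gets a fixed sparse random ternary "index vector"; a word's "context vector" is built incrementally by summing the index vectors of its neighbours, so the model's dimensionality never grows as new terms appear.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 1000      # fixed model dimensionality (illustrative choice)
NONZERO = 10    # number of +1/-1 entries per index vector (illustrative)

index_vectors = {}    # fixed random sparse index vector per word
context_vectors = {}  # accumulated context vector per word

def index_vector(word):
    # Assign each new word a fixed sparse random ternary vector;
    # dimensionality stays constant as new terms are encountered.
    if word not in index_vectors:
        v = np.zeros(DIM)
        pos = rng.choice(DIM, size=NONZERO, replace=False)
        v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
        index_vectors[word] = v
    return index_vectors[word]

def train(tokens, window=2):
    # Incrementally add the index vectors of neighbouring words to
    # each word's context vector -- an incremental random projection
    # of the underlying co-occurrence matrix.
    for i, w in enumerate(tokens):
        ctx = context_vectors.setdefault(w, np.zeros(DIM))
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx += index_vector(tokens[j])

def similarity(a, b):
    # Cosine similarity between two accumulated context vectors.
    va, vb = context_vectors[a], context_vectors[b]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))
```

Because training only ever adds sparse vectors into existing accumulators, new text (and new vocabulary) can be folded in at any time without rebuilding the model.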
It can also be verified that random indexing is a random projection technique for the construction of Euclidean spaces, i.e. L2-normed vector spaces.<ref>Qasemi Zadeh, Behrang; Handschuh, Siegfried (2014). [http://aran.library.nuigalway.ie/xmlui/bitstream/handle/10379/4389/tir-rmi.pdf?sequence=1 Random Manhattan Indexing]. Proceedings of the 25th International Workshop on Database and Expert Systems Applications.</ref> In Euclidean spaces, random projections are elucidated using the Johnson–Lindenstrauss lemma:<ref>Johnson, W.; Lindenstrauss, J. (1984). [https://www.researchgate.net/profile/William_Johnson16/publication/235008656_Extensions_of_Lipschitz_maps_into_a_Hilbert_space/links/55e9abf908aeb65162649527.pdf Extensions of Lipschitz mappings into a Hilbert space]. Contemporary Mathematics, American Mathematical Society, vol. 26, pp. 189–206.</ref> if points in a high-dimensional space are projected onto a randomly chosen subspace of suitably high dimension, the pairwise distances between the points are approximately preserved.
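The lemma's content can be checked empirically with a scaled Gaussian random projection. The sketch below uses illustrative dimensions (30 points, 2000 original dimensions projected to 1000) and verifies that all pairwise L2 distances are approximately preserved; the specific numbers are assumptions for the example, not values prescribed by the lemma.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, k = 30, 2000, 1000  # points, original dim, reduced dim (illustrative)

X = rng.standard_normal((n, d))
# Gaussian random projection, scaled so squared lengths
# are preserved in expectation
R = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ R

def pairwise_dists(M):
    # All pairwise Euclidean (L2) distances between the rows of M
    diffs = M[:, None, :] - M[None, :, :]
    return np.linalg.norm(diffs, axis=-1)

mask = ~np.eye(n, dtype=bool)  # distinct pairs only
ratios = pairwise_dists(Y)[mask] / pairwise_dists(X)[mask]
```

After the projection, every ratio of projected to original distance is close to 1, which is the distance-preservation guarantee the lemma formalizes.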
The TopSig technique<ref>Geva, S.; De Vries, C.M. (2011). [http://eprints.qut.edu.au/43451/ TopSig: Topology Preserving Document Signatures]. Proceedings of the Conference on Information and Knowledge Management 2011, 24–28 October 2011, Glasgow, Scotland.</ref> extends the random indexing model to produce bit vectors for comparison with the Hamming distance similarity function. It is used for improving the performance of information retrieval and document clustering. In a similar line of research, Random Manhattan Integer Indexing (RMII)<ref>Qasemi Zadeh, Behrang; Handschuh, Siegfried (2014). [http://emnlp2014.org/papers/pdf/EMNLP2014178.pdf Random Manhattan Integer Indexing: Incremental L1 Normed Vector Space Construction]. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1713–1723, October 25–29, 2014, Doha, Qatar.</ref> is proposed for improving the performance of methods that employ the Manhattan distance between text units. Many random indexing methods primarily generate similarity from the co-occurrence of items in a corpus. Reflective Random Indexing (RRI)<ref>Cohen, T.; Schvaneveldt, R.; Widdows, D. (2009). [https://www.sciencedirect.com/science/article/pii/S1532046409001208 Reflective Random Indexing and indirect inference: a scalable method for discovery of implicit connections]. Journal of Biomedical Informatics, 43(2):240–256.</ref> generates similarity from co-occurrence and from shared occurrence with other items.
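The idea of comparing binarised signatures with the Hamming distance can be sketched as follows. This is a simplified sign-thresholding illustration of bit-signature comparison, not TopSig's actual signature-generation procedure; the vectors and their dimensionality are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def bit_signature(vec):
    # Binarise a real-valued projected vector by sign:
    # one bit per dimension, packed into bytes.
    return np.packbits(np.asarray(vec) >= 0)

def hamming_distance(sig_a, sig_b):
    # Hamming distance: the number of differing bits
    # between two packed signatures.
    return int(np.unpackbits(sig_a ^ sig_b).sum())

# Two similar vectors and one unrelated vector (hypothetical data)
base = rng.standard_normal(64)
near = base + 0.1 * rng.standard_normal(64)  # small perturbation of base
far = rng.standard_normal(64)                # independent vector

sa, sn, sf = bit_signature(base), bit_signature(near), bit_signature(far)
```

Vectors that are close in the original space tend to agree in most sign bits, so their signatures have a small Hamming distance, while unrelated vectors disagree in roughly half their bits. Bitwise comparison of packed signatures is what makes this attractive for large-scale retrieval and clustering.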
==References==
{{Reflist}}
==External links==
* Qasemi Zadeh, Behrang; Handschuh, Siegfried (2015). [http://pars.ie/publications/papers/pre-prints/random-indexing-dr-explained.pdf Random indexing explained with high probability]. TSD.