Word embedding
{{Short description|Method in natural language processing}}
{{machine learning bar}}
[[File:Word embedding illustration.svg|thumb]]
In natural language processing, a word embedding is a representation of a word used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.{{cite book |last1=Jurafsky |first1=Daniel |last2=Martin |first2=James H. |title=Speech and language processing : an introduction to natural language processing, computational linguistics, and speech recognition |date=2000 |publisher=Prentice Hall |location=Upper Saddle River, N.J. |isbn=978-0-13-095069-7 |url=https://web.stanford.edu/~jurafsky/slp3/}} Word embeddings can be obtained using language modeling and feature learning techniques, in which words or phrases from the vocabulary are mapped to vectors of real numbers.
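The geometric intuition that nearby vectors correspond to similar meanings can be illustrated with a minimal sketch using hypothetical, hand-picked 3-dimensional vectors and cosine similarity as the similarity measure; real embeddings are learned from data and typically have hundreds of dimensions.
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical, hand-picked 3-dimensional vectors for illustration only;
# learned embeddings are typically 50-1000 dimensional.
vectors = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.15]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

def cosine_similarity(a, b):
    """Similarity of direction: close to 1 for similar vectors, near 0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["queen"]))  # high: similar meaning
print(cosine_similarity(vectors["king"], vectors["apple"]))  # low: dissimilar meaning
</syntaxhighlight>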
Methods to generate this mapping include neural networks,{{cite arXiv|eprint=1310.4546|last1=Mikolov|first1=Tomas|title=Distributed Representations of Words and Phrases and their Compositionality|last2=Sutskever|first2=Ilya|last3=Chen|first3=Kai|last4=Corrado|first4=Greg|last5=Dean|first5=Jeffrey|class=cs.CL|year=2013}} dimensionality reduction on the word co-occurrence matrix,{{Cite book|arxiv=1312.5542|last1=Lebret|first1=Rémi|chapter=Word Embeddings through Hellinger PCA|title=Conference of the European Chapter of the Association for Computational Linguistics (EACL)|volume=2014|last2=Collobert|first2=Ronan|year=2013}}{{Cite conference|url=http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf|title=Neural Word Embedding as Implicit Matrix Factorization|last1=Levy|first1=Omer|conference=NIPS|year=2014|last2=Goldberg|first2=Yoav}}{{Cite conference|url=http://ijcai.org/papers15/Papers/IJCAI15-513.pdf|title=Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective|last1=Li|first1=Yitan|conference=Int'l J. Conf. on Artificial Intelligence (IJCAI)|year=2015|last2=Xu|first2=Linli}} probabilistic models,{{Cite journal|last=Globerson|first=Amir|date=2007|title=Euclidean Embedding of Co-occurrence Data|url=http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34951.pdf|journal=Journal of Machine Learning Research}} explainable knowledge-base methods,{{Cite journal|last1=Qureshi|first1=M. Atif|last2=Greene|first2=Derek|date=2018-06-04|title=EVE: explainable vector based embedding technique using Wikipedia|journal=Journal of Intelligent Information Systems|volume=53|pages=137–165|language=en|doi=10.1007/s10844-018-0511-x|issn=0925-9902|arxiv=1702.06891|s2cid=10656055}} and explicit representations in terms of the context in which words appear.{{cite conference|last1=Levy|first1=Omer|last2=Goldberg|first2=Yoav|title=Linguistic Regularities in Sparse and Explicit Word Representations|conference=CoNLL|pages=171–180|year=2014|url=https://levyomer.files.wordpress.com/2014/04/linguistic-regularities-in-sparse-and-explicit-word-representations-conll-2014.pdf}}
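As an illustration of the dimensionality-reduction family of methods, the following sketch builds a small word co-occurrence matrix from a toy corpus and compresses it with a truncated singular value decomposition; the corpus, window size and number of retained dimensions are arbitrary choices for the example, not values from the cited works.
<syntaxhighlight lang="python">
import numpy as np

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
window = 2  # symmetric context window

# Build the vocabulary and a raw co-occurrence count matrix.
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[index[w], index[sent[j]]] += 1

# Reduce the sparse, high-dimensional counts to dense 2-dimensional embeddings.
u, s, _ = np.linalg.svd(counts, full_matrices=False)
embeddings = u[:, :2] * s[:2]
for word in ("cat", "dog", "mat"):
    print(word, embeddings[index[word]].round(2))
</syntaxhighlight>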
Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing{{cite conference|last1=Socher|first1=Richard|last2=Bauer|first2=John|last3=Manning|first3=Christopher|last4=Ng|first4=Andrew|title=Parsing with compositional vector grammars|conference=Proc. ACL Conf.|year=2013|url=http://www.socher.org/uploads/Main/SocherBauerManningNg_ACL2013.pdf|access-date=2014-08-14|archive-date=2016-08-11|archive-url=https://web.archive.org/web/20160811041232/http://www.socher.org/uploads/Main/SocherBauerManningNg_ACL2013.pdf|url-status=dead}} and sentiment analysis.{{cite conference|last1=Socher|first1=Richard|last2=Perelygin|first2=Alex|last3=Wu|first3=Jean|last4=Chuang|first4=Jason|last5=Manning|first5=Chris|last6=Ng|first6=Andrew|last7=Potts|first7=Chris|title=Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank|conference=EMNLP|year=2013|url=http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf}}
Development and history of the approach
In distributional semantics, a quantitative methodological approach for understanding meaning in observed language, word embeddings or semantic feature space models have been used as a knowledge representation for some time.{{cite web|url=https://www.linkedin.com/pulse/brief-history-word-embeddings-some-clarifications-magnus-sahlgren/|first=Magnus|last=Sahlgren|title=A brief history of word embeddings}} Such models aim to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying idea that "a word is characterized by the company it keeps" was proposed in a 1957 article by John Rupert Firth,{{cite journal|last=Firth|first=J.R.|year=1957|title=A synopsis of linguistic theory 1930–1955|journal=Studies in Linguistic Analysis|pages=1–32}} Reprinted in {{cite book|editor=F.R. Palmer|title=Selected Papers of J.R. Firth 1952–1959|publisher=London: Longman|year=1968}} but also has roots in the contemporaneous work on search systems{{cite journal|last=Luhn|first=H.P.|year=1953|title=A New Method of Recording and Searching Information|journal=American Documentation|volume=4 |pages=14–16|doi=10.1002/asi.5090040104}} and in cognitive psychology.{{cite book|title=The Measurement of Meaning.|year=1957|last1=Osgood|first1=C.E.|last2=Suci|first2=G.J.|last3=Tannenbaum|first3=P.H.|publisher=University of Illinois Press}}
The notion of a semantic space with lexical items (words or multi-word terms) represented as vectors or embeddings arises from the computational challenge of capturing distributional characteristics and using them in practice to measure similarity between words, phrases, or entire documents. The first generation of semantic space models was the vector space model for information retrieval.{{cite book |last1=Salton |first1=Gerard |title=Proceedings of the December 4-6, 1962, fall joint computer conference on - AFIPS '62 (Fall) |chapter=Some experiments in the generation of word and document associations |date=1962 |pages=234–250 |doi=10.1145/1461518.1461544 |isbn=9781450378796 |s2cid=9937095 |doi-access=free }}{{cite journal |last1=Salton |first1=Gerard |last2=Wong |first2=A |last3=Yang |first3=C S |title=A Vector Space Model for Automatic Indexing |journal=Communications of the ACM |date=1975 |volume=18 |issue=11 |pages=613–620|doi=10.1145/361219.361220 |hdl=1813/6057 |s2cid=6473756 |hdl-access=free }}{{cite web |last1=Dubin |first1=David |title=The most influential paper Gerard Salton never wrote. |url=https://www.thefreelibrary.com/The+most+influential+paper+Gerard+Salton+never+wrote.-a0125151308 |access-date=18 October 2020 |date=2004 |archive-date=18 October 2020 |archive-url=https://web.archive.org/web/20201018193507/https://www.thefreelibrary.com/The+most+influential+paper+Gerard+Salton+never+wrote.-a0125151308 |url-status=dead }} Implemented in their simplest form, such vector space models for words and their distributional data result in a very sparse vector space of high dimensionality (cf. curse of dimensionality). Reducing the number of dimensions using linear algebraic methods such as singular value decomposition then led to the introduction of latent semantic analysis in the late 1980s and the random indexing approach for collecting word co-occurrence contexts. Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000): [https://cloudfront.escholarship.org/dist/prd/content/qt5644k0w6/qt5644k0w6.pdf Random Indexing of Text Samples for Latent Semantic Analysis], Proceedings of the 22nd Annual Conference of the Cognitive Science Society, p. 1036. Mahwah, New Jersey: Erlbaum, 2000.{{cite journal |last1=Karlgren |first1=Jussi |last2=Sahlgren |first2=Magnus |editor1-last=Uesaka |editor1-first=Yoshinori |editor2-last=Kanerva |editor2-first=Pentti |editor3-last=Asoh |editor3-first=Hideki |title=From words to understanding |journal=Foundations of Real-World Intelligence |date=2001 |pages=294–308 |publisher=CSLI Publications}} Sahlgren, Magnus (2005) [http://eprints.sics.se/221/1/RI_intro.pdf An Introduction to Random Indexing], Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, August 16, Copenhagen, Denmark. Sahlgren, Magnus, Holst, Anders and Pentti Kanerva (2008) [http://eprints.sics.se/3436/01/permutationsCogSci08.pdf Permutations as a Means to Encode Order in Word Space], In Proceedings of the 30th Annual Conference of the Cognitive Science Society: 1300–1305. In 2000, Bengio et al.
proposed, in a series of papers titled "A Neural Probabilistic Language Model", to reduce the high dimensionality of word representations in contexts by "learning a distributed representation for words".{{Cite journal |last1=Bengio |first1=Yoshua |last2=Ducharme |first2=Réjean |last3=Vincent |first3=Pascal |date=2000 |title=A Neural Probabilistic Language Model |url=https://proceedings.neurips.cc/paper_files/paper/2000/file/728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf |journal=NeurIPS}}{{cite journal|last1=Bengio|first1=Yoshua|author-link1=Yoshua Bengio|last2=Ducharme|first2=Réjean|last3=Vincent|first3=Pascal|last4=Jauvin|first4=Christian|year=2003|title=A Neural Probabilistic Language Model|url=https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf|journal=Journal of Machine Learning Research|volume=3|pages=1137–1155}}{{cite book |last1=Bengio |first1=Yoshua |title=Studies in Fuzziness and Soft Computing |last2=Schwenk |first2=Holger |last3=Senécal |first3=Jean-Sébastien |last4=Morin |first4=Fréderic |last5=Gauvain |first5=Jean-Luc |publisher=Springer |year=2006 |isbn=978-3-540-30609-2 |volume=194 |pages=137–186 |chapter=A Neural Probabilistic Language Model |doi=10.1007/3-540-33486-6_6}}
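A highly simplified sketch in the spirit of this line of work, not Bengio et al.'s exact model: an embedding table feeds a small feed-forward network that predicts the next word from the previous two, and the rows of the embedding table end up as the learned word representations. All data and sizes are toy values, and PyTorch is assumed.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
context_size, embed_dim = 2, 16

# (previous two words -> next word) training pairs
data = [([idx[corpus[i]], idx[corpus[i + 1]]], idx[corpus[i + 2]])
        for i in range(len(corpus) - 2)]

class NeuralLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(len(vocab), embed_dim)  # the word representations being learned
        self.ff = nn.Sequential(
            nn.Linear(context_size * embed_dim, 32), nn.Tanh(),
            nn.Linear(32, len(vocab)))

    def forward(self, context):
        return self.ff(self.embed(context).view(context.shape[0], -1))

model, loss_fn = NeuralLM(), nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
contexts = torch.tensor([c for c, _ in data])
targets = torch.tensor([t for _, t in data])
for _ in range(200):  # a few hundred steps suffice for a toy corpus
    optimizer.zero_grad()
    loss = loss_fn(model(contexts), targets)
    loss.backward()
    optimizer.step()

# The embedding table, learned as a by-product of next-word prediction,
# now contains one vector per vocabulary word.
print(model.embed.weight[idx["cat"]].detach())
</syntaxhighlight>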
A study published in NIPS (now NeurIPS) 2002 introduced the use of both word and document embeddings, applying the method of kernel CCA to bilingual (and multi-lingual) corpora, also providing an early example of self-supervised learning of word embeddings.{{cite conference|year=2002|last1=Vinokourov|first1=Alexei|last2=Cristianini|first2=Nello|last3=Shawe-Taylor|first3=John|title=Inferring a semantic representation of text via cross-language correlation analysis.|conference=Advances in Neural Information Processing Systems |volume=15
|url=https://proceedings.neurips.cc/paper/2002/file/d5e2fbef30a4eb668a203060ec8e5eef-Paper.pdf }}
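A much-simplified sketch of the idea, using plain linear CCA from scikit-learn as a stand-in for the kernel CCA of the study, on a toy sentence-aligned English-French corpus; all data and parameter choices here are illustrative.
<syntaxhighlight lang="python">
from sklearn.cross_decomposition import CCA
from sklearn.feature_extraction.text import CountVectorizer

# Toy aligned corpus: each English line is paired with its French translation.
english = ["the cat sleeps", "the dog barks", "the cat eats fish",
           "the dog eats meat", "a small cat", "a big dog"]
french = ["le chat dort", "le chien aboie", "le chat mange du poisson",
          "le chien mange de la viande", "un petit chat", "un gros chien"]

X = CountVectorizer().fit_transform(english).toarray()
Y = CountVectorizer().fit_transform(french).toarray()

# Linear CCA (a simplified stand-in for kernel CCA) finds directions in the two
# bag-of-words spaces whose projections are maximally correlated across the alignment.
cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)
print(X_c.round(2))  # 2-dimensional coordinates of the English sentences
print(Y_c.round(2))  # corresponding coordinates of the aligned French sentences
</syntaxhighlight>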
Word embeddings come in two different styles, one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of linguistic contexts in which the words occur; these different styles are studied in Lavelli et al., 2004.{{cite conference|year=2004|last1=Lavelli|first1=Alberto|last2=Sebastiani|first2=Fabrizio|last3=Zanoli|first3=Roberto|title=Distributional term representations: an experimental comparison|conference=13th ACM International Conference on Information and Knowledge Management|pages=615–624|doi=10.1145/1031171.1031284 }} Roweis and Saul published in Science how to use "locally linear embedding" (LLE) to discover representations of high dimensional data structures.{{cite journal|title=Nonlinear Dimensionality Reduction by Locally Linear Embedding|journal=Science|volume=290|issue=5500|pages=2323–6|bibcode=2000Sci...290.2323R|last1=Roweis|first1=Sam T.|last2=Saul|first2=Lawrence K.|year=2000|doi=10.1126/science.290.5500.2323|pmid=11125150|citeseerx=10.1.1.111.3313|s2cid=5987139 }} Most new word embedding techniques after about 2005 rely on a neural network architecture instead of more probabilistic and algebraic models, after foundational work done by Yoshua Bengio and colleagues.{{cite book|last1=Morin|first1=Fredric|last2=Bengio|first2=Yoshua|chapter=Hierarchical probabilistic neural network language model|chapter-url=http://proceedings.mlr.press/r5/morin05a/morin05a.pdf|title=Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics|series=Proceedings of Machine Learning Research|volume=R5|pages=246–252|year=2005 |editor1 = Cowell, Robert G. |editor2=Ghahramani, Zoubin}}{{cite journal|last1=Mnih|first1=Andriy|last2=Hinton|first2=Geoffrey|title=A Scalable Hierarchical Distributed Language Model|pages=1081–1088|url=http://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model|journal=Advances in Neural Information Processing Systems |volume=21 (NIPS 2008)|publisher=Curran Associates, Inc.|year=2009}}
The approach was adopted by many research groups after theoretical advances around 2010 in the quality of the vectors and the training speed of the models, and after hardware advances allowed a broader parameter space to be explored profitably. In 2013, a team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit that can train vector space models faster than previous approaches. The word2vec approach has been widely used in experimentation and was instrumental in raising interest in word embeddings as a technology, moving the research strand out of specialised research into broader experimentation and eventually paving the way for practical application.{{cite web |title=word2vec |url=https://code.google.com/archive/p/word2vec/|website=Google Code Archive |access-date=23 July 2021}}
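A minimal training sketch using the Gensim library's implementation of the word2vec skip-gram model; the toy corpus and hyperparameters are illustrative, and real models are trained on corpora with millions of sentences.
<syntaxhighlight lang="python">
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "ball"],
    ["the", "cat", "chases", "the", "mouse"],
] * 50  # repeated so the tiny corpus gives the model something to fit

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=20)

print(model.wv["king"][:5])                   # first few dimensions of the vector for "king"
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in the learned space
</syntaxhighlight>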
Polysemy and homonymy
Historically, one of the main limitations of static word embeddings or word vector space models is that words with multiple meanings are conflated into a single representation (a single vector in the semantic space). In other words, polysemy and homonymy are not handled properly. For example, in the sentence "The club I tried yesterday was great!", it is not clear if the term club is related to the word sense of a club sandwich, clubhouse, golf club, or any other sense that club might have. The necessity to accommodate multiple meanings per word in different vectors (multi-sense embeddings) is the motivation for several contributions in NLP to split single-sense embeddings into multi-sense ones.{{Cite book|url=https://www.aclweb.org/anthology/N10-1013/|title=Multi-Prototype Vector-Space Models of Word Meaning|last1=Reisinger|first1=Joseph|last2=Mooney|first2=Raymond J.|date=2010|publisher=Association for Computational Linguistics|isbn=978-1-932432-65-7|volume=Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics|location=Los Angeles, California|pages=109–117|access-date=October 25, 2019}}{{Cite book|title=Improving word representations via global context and multiple word prototypes|last=Huang, Eric.|date=2012|oclc=857900050}}
Most approaches that produce multi-sense embeddings can be divided into two main categories according to their word-sense representation, i.e., unsupervised and knowledge-based.{{cite arXiv|last1=Camacho-Collados|first1=Jose|last2=Pilehvar|first2=Mohammad Taher|year=2018|title=From Word to Sense Embeddings: A Survey on Vector Representations of Meaning|class=cs.CL|eprint=1805.04032}} Based on word2vec skip-gram, Multi-Sense Skip-Gram (MSSG){{Cite book|last1=Neelakantan|first1=Arvind|last2=Shankar|first2=Jeevan|last3=Passos|first3=Alexandre|last4=McCallum|first4=Andrew|title=Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) |chapter=Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space |date=2014|pages=1059–1069|location=Stroudsburg, PA, USA|publisher=Association for Computational Linguistics|doi=10.3115/v1/d14-1113|arxiv=1504.06654|s2cid=15251438}} performs word-sense discrimination and embedding simultaneously, improving its training time, while assuming a specific number of senses for each word. In the Non-Parametric Multi-Sense Skip-Gram (NP-MSSG) this number can vary from word to word. Combining prior knowledge from lexical databases (e.g., WordNet, ConceptNet, BabelNet), word embeddings and word sense disambiguation, Most Suitable Sense Annotation (MSSA){{Cite journal|last1=Ruas|first1=Terry|last2=Grosky|first2=William|last3=Aizawa|first3=Akiko|date=2019-12-01|title=Multi-sense embeddings through a word sense disambiguation process|journal=Expert Systems with Applications|volume=136|pages=288–303|doi=10.1016/j.eswa.2019.06.026|arxiv=2101.08700|issn=0957-4174|hdl=2027.42/145475|s2cid=52225306|hdl-access=free}} labels word-senses through an unsupervised and knowledge-based approach, considering a word's context in a pre-defined sliding window. Once the words are disambiguated, they can be used in a standard word embedding technique, so multi-sense embeddings are produced. The MSSA architecture allows the disambiguation and annotation process to be performed recurrently in a self-improving manner.{{Cite journal |last1=Agre |first1=Gennady |last2=Petrov |first2=Daniel |last3=Keskinova |first3=Simona |date=2019-03-01 |title=Word Sense Disambiguation Studio: A Flexible System for WSD Feature Extraction |journal=Information |language=en |volume=10 |issue=3 |pages=97 |doi=10.3390/info10030097 |issn=2078-2489|doi-access=free }}
The use of multi-sense embeddings is known to improve performance in several NLP tasks, such as part-of-speech tagging, semantic relation identification, semantic relatedness, named entity recognition and sentiment analysis.{{Cite journal|last1=Akbik|first1=Alan|last2=Blythe|first2=Duncan|last3=Vollgraf|first3=Roland|date=2018|title=Contextual String Embeddings for Sequence Labeling|url=https://www.aclweb.org/anthology/C18-1139|journal=Proceedings of the 27th International Conference on Computational Linguistics|location=Santa Fe, New Mexico, USA|publisher=Association for Computational Linguistics|pages=1638–1649}}{{Cite book|last1=Li|first1=Jiwei|last2=Jurafsky|first2=Dan|title=Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing |chapter=Do Multi-Sense Embeddings Improve Natural Language Understanding? |date=2015|pages=1722–1732|location=Stroudsburg, PA, USA|publisher=Association for Computational Linguistics|doi=10.18653/v1/d15-1200|arxiv=1506.01070|s2cid=6222768}}
As of the late 2010s, contextually meaningful embeddings such as ELMo and BERT have been developed.{{cite journal |last1=Devlin |first1=Jacob |last2=Chang |first2=Ming-Wei |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |journal=Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) |date=June 2019 |pages=4171–4186 |doi=10.18653/v1/N19-1423 |url=https://aclanthology.org/N19-1423/ |publisher=Association for Computational Linguistics|s2cid=52967399 }} Unlike static word embeddings, these embeddings are at the token level, in that each occurrence of a word has its own embedding. These embeddings better reflect the multi-sense nature of words, because occurrences of a word in similar contexts are situated in similar regions of BERT's embedding space. Lucy, Li, and David Bamman. "Characterizing English variation across social media communities with BERT." Transactions of the Association for Computational Linguistics 9 (2021): 538-556. Reif, Emily, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, and Been Kim. "Visualizing and measuring the geometry of BERT." Advances in Neural Information Processing Systems 32 (2019).
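A sketch of token-level contextual embeddings using the Hugging Face Transformers library and the bert-base-uncased checkpoint; it assumes the word of interest maps to a single WordPiece token, which holds for "club".
<syntaxhighlight lang="python">
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "he swung the club and the ball flew down the fairway",
    "the club was packed with dancers all night",
]

def word_vector(sentence, word):
    """Return the contextual embedding of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (sequence length, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]  # assumes the word is a single WordPiece token

v_golf, v_dance = (word_vector(s, "club") for s in sentences)
cos = torch.nn.functional.cosine_similarity(v_golf, v_dance, dim=0)
print(f"cosine similarity between the two occurrences of 'club': {cos.item():.2f}")
</syntaxhighlight>
Each occurrence of "club" receives its own vector, so the two tokens can land in different regions of the embedding space.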
For biological sequences: BioVectors
Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications have been proposed by Asgari and Mofrad.{{cite journal|last1=Asgari|first1=Ehsaneddin|last2=Mofrad|first2=Mohammad R.K.|title=Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics|journal=PLOS ONE|date=2015|volume=10|issue=11|page=e0141287|doi=10.1371/journal.pone.0141287|pmid=26555596|pmc=4640716|bibcode=2015PLoSO..1041287A|arxiv=1503.05140|doi-access=free}} The representations are named bio-vectors (BioVec) for biological sequences in general, protein-vectors (ProtVec) for proteins (amino-acid sequences), and gene-vectors (GeneVec) for gene sequences, and can be used widely in applications of deep learning in proteomics and genomics. The results presented by Asgari and Mofrad suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.
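A simplified sketch of the general idea rather than the authors' exact pipeline: protein sequences are split into overlapping 3-grams of amino acids, each 3-gram is treated as a "word", and a standard word-embedding model is trained over the resulting "sentences". The sequences and hyperparameters below are toy values.
<syntaxhighlight lang="python">
from gensim.models import Word2Vec

# Toy amino-acid sequences; real work uses large protein databases.
proteins = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MKVLAAGIATLLLAAGCSS",
    "MSTNPKPQRKTKRNTNRRPQDVKFPGG",
]

def kmers(sequence, k=3):
    """Overlapping k-grams of a sequence, e.g. 'MKTA' -> ['MKT', 'KTA']."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sentences = [kmers(p) for p in proteins]
model = Word2Vec(sentences, vector_size=20, window=5, min_count=1, epochs=50)
print(model.wv["MKT"][:5])  # embedding of the 3-gram "MKT"
</syntaxhighlight>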
Game design
Word embeddings with applications in game design have been proposed by Rabii and Cook{{Cite journal |last1=Rabii |first1=Younès |last2=Cook |first2=Michael |date=2021-10-04 |title=Revealing Game Dynamics via Word Embeddings of Gameplay Data |url=https://ojs.aaai.org/index.php/AIIDE/article/view/18907 |journal=Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment |language=en |volume=17 |issue=1 |pages=187–194 |doi=10.1609/aiide.v17i1.18907 |s2cid=248175634 |issn=2334-0924|doi-access=free }} as a way to discover emergent gameplay using logs of gameplay data. The process requires transcribing the actions that occur during a game in a formal language and then using the resulting text to create word embeddings. The results presented by Rabii and Cook suggest that the resulting vectors can capture expert knowledge about games like chess that is not explicitly stated in the game's rules.
Sentence embeddings
{{Main|Sentence embedding}}
The idea has been extended to embeddings of entire sentences or even documents, e.g. in the form of the thought vectors concept. In 2015, some researchers suggested "skip-thought vectors" as a means to improve the quality of machine translation.{{cite arXiv|title=Skip-Thought Vectors|eprint=1506.06726|last1=Kiros|first1=Ryan|last2=Zhu|first2=Yukun|last3=Salakhutdinov|first3=Ruslan|last4=Zemel|first4=Richard S.|last5=Torralba|first5=Antonio|last6=Urtasun|first6=Raquel|last7=Fidler|first7=Sanja|class=cs.CL|year=2015}} A more recent and popular approach for representing sentences is Sentence-BERT, or SentenceTransformers, which modifies pre-trained BERT with the use of Siamese and triplet network structures. Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982-3992. 2019.
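A short sketch using the SentenceTransformers library; the checkpoint name all-MiniLM-L6-v2 is one commonly distributed model, chosen here for illustration rather than because it is used in the cited works.
<syntaxhighlight lang="python">
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["The cat sits on the mat.",
             "A feline is resting on a rug.",
             "Stock markets fell sharply today."]
embeddings = model.encode(sentences)  # one fixed-size vector per sentence

print(util.cos_sim(embeddings[0], embeddings[1]))  # high: near-paraphrases
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: unrelated topics
</syntaxhighlight>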
Software
Software for training and using word embeddings includes Tomáš Mikolov's Word2vec, Stanford University's GloVe,{{cite web|url=http://nlp.stanford.edu/projects/glove/|title=GloVe}} GN-GloVe,{{cite arXiv|last=Zhao|first=Jieyu|collaboration=2018|title=Learning Gender-Neutral Word Embeddings|year=2018|eprint=1809.01496|class=cs.CL}} Flair embeddings, AllenNLP's ELMo,{{cite web|url=https://allennlp.org/elmo|title=Elmo|date=16 October 2024 }} BERT,{{cite arXiv|last1=Pires|first1=Telmo|last2=Schlinger|first2=Eva|last3=Garrette|first3=Dan|date=2019-06-04|title=How multilingual is Multilingual BERT?|class=cs.CL|eprint=1906.01502}} fastText, Gensim,{{cite web|url=http://radimrehurek.com/gensim/|title=Gensim}} Indra,{{cite web|url=https://github.com/Lambda-3/Indra|title=Indra|website=GitHub|date=2018-10-25}} and Deeplearning4j. Principal Component Analysis (PCA) and T-Distributed Stochastic Neighbour Embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and visualize word embeddings and clusters.{{Cite book|last1=Ghassemi|first1=Mohammad|last2=Mark|first2=Roger|last3=Nemati|first3=Shamim |title=2015 Computing in Cardiology Conference (CinC) |chapter=A visualization of evolving clinical sentiment using vector representations of clinical notes |journal= |date=2015|chapter-url=http://www.cinc.org/archives/2015/pdf/0629.pdf|volume=2015 |pages=629–632 |doi=10.1109/CIC.2015.7410989 |pmid=27774487 |pmc=5070922 |isbn=978-1-5090-0685-4 }}
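A sketch of such a visualization with scikit-learn and Matplotlib, using small pre-trained GloVe vectors fetched through Gensim's downloader; the model name and word list are illustrative, and t-SNE can be substituted for PCA in the same way.
<syntaxhighlight lang="python">
import gensim.downloader as api
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

model = api.load("glove-wiki-gigaword-50")  # small pre-trained GloVe vectors (downloads on first use)
words = ["king", "queen", "man", "woman", "paris", "london", "france", "england"]
vectors = [model[w] for w in words]

# Project the 50-dimensional vectors onto their two main directions of variation.
coords = PCA(n_components=2).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
</syntaxhighlight>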
Examples of application
For instance, fastText is also used to calculate word embeddings for the text corpora in Sketch Engine, which are available online.{{cite web|url=https://embeddings.sketchengine.co.uk/|title=Embedding Viewer|author=|website=Embedding Viewer|publisher=Lexical Computing|access-date=7 Feb 2018|archive-date=8 February 2018|archive-url=https://web.archive.org/web/20180208004241/https://embeddings.sketchengine.co.uk/|url-status=dead}}
Ethical implications
Word embeddings may contain the biases and stereotypes present in the training data. Bolukbasi et al. point out in the 2016 paper "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" that a publicly available (and popular) word2vec embedding trained on Google News texts (a commonly used data corpus), which consists of text written by professional journalists, still shows disproportionate word associations reflecting gender and racial biases when extracting word analogies.{{cite arXiv | eprint=1607.06520 | last1=Bolukbasi | first1=Tolga | last2=Chang | first2=Kai-Wei | last3=Zou | first3=James | last4=Saligrama | first4=Venkatesh | last5=Kalai | first5=Adam | title=Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings | date=2016 | class=cs.CL }} For example, one of the analogies generated using this embedding is "man is to computer programmer as woman is to homemaker".{{cite journal | url=https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00325/96463/Topic-Modeling-in-Embedding-Spaces | doi=10.1162/tacl_a_00325 | title=Topic Modeling in Embedding Spaces | date=2020 | last1=Dieng | first1=Adji B. | last2=Ruiz | first2=Francisco J. R. | last3=Blei | first3=David M. | journal=Transactions of the Association for Computational Linguistics | volume=8 | pages=439–453 | arxiv=1907.04907 }}
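Analogies of this kind are typically extracted with vector arithmetic over the embedding space. The sketch below shows the mechanism using Gensim and pre-trained Google News word2vec vectors (a large download); whether a biased completion such as "homemaker" appears depends on the model, the query and the analogy method, so the output should not be read as a reproduction of the cited result.
<syntaxhighlight lang="python">
import gensim.downloader as api

# Pre-trained word2vec vectors trained on Google News (roughly 1.6 GB download).
model = api.load("word2vec-google-news-300")

# "man is to computer programmer as woman is to ?" is solved as
# vector("computer_programmer") - vector("man") + vector("woman");
# this particular model represents some multi-word phrases with underscores.
print(model.most_similar(positive=["computer_programmer", "woman"],
                         negative=["man"], topn=5))
</syntaxhighlight>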
Research done by Jieyu Zhao et al. shows that applying these trained word embeddings without careful oversight likely perpetuates existing bias in society, which is introduced through unaltered training data. Furthermore, word embeddings can even amplify these biases.{{cite book | chapter-url=https://aclanthology.org/D17-1323/ | doi=10.18653/v1/D17-1323 | chapter=Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints | title=Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing | date=2017 | last1=Zhao | first1=Jieyu | last2=Wang | first2=Tianlu | last3=Yatskar | first3=Mark | last4=Ordonez | first4=Vicente | last5=Chang | first5=Kai-Wei | pages=2979–2989 }}{{Cite journal |last1=Petreski |first1=Davor |last2=Hashim |first2=Ibrahim C. |date=2022-05-26 |title=Word embeddings are biased. But whose bias are they reflecting? |journal=AI & Society |volume=38 |issue=2 |pages=975–982 |language=en |doi=10.1007/s00146-022-01443-w |s2cid=249112516 |issn=1435-5655|doi-access=free }}
See also
References
{{Reflist}}
{{Natural Language Processing}}
{{Artificial intelligence navbox}}
Category:Artificial neural networks
Category:Natural language processing