entity linking

In natural language processing, entity linking, also referred to as named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD), named-entity normalization (NEN), or concept recognition, is the task of assigning a unique identity to entities (such as famous individuals, locations, or companies) mentioned in text.{{cite journal | url=https://link.springer.com/content/pdf/10.1186/s12859-023-05350-9.pdf | doi=10.1186/s12859-023-05350-9 | doi-access=free | title=An analysis of entity normalization evaluation biases in specialized domains | date=2023 | last1=Ferré | first1=Arnaud | last2=Langlais | first2=Philippe | journal=BMC Bioinformatics | volume=24 | issue=1 | page=227 | pmid=37268890 | pmc=10236701 }} For example, given the sentence "Paris is the capital of France", the main idea is to first identify "Paris" and "France" as named entities, and then to determine that "Paris" refers to the city of Paris, and not to Paris Hilton or any other entity that could be referred to as "Paris", and that "France" refers to the country of France.

The entity linking task is composed of three subtasks:

Named Entity Recognition: Extraction of named entities from a text.
Candidate Generation: For each named entity, select possible candidates from a Knowledge Base (e.g. Wikipedia, Wikidata, DBpedia).
Disambiguation: Choose the correct entity from this set of candidates.
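The three subtasks above can be sketched as a small pipeline. This is only an illustration under strong assumptions: the knowledge base, the entity identifiers, the context words, and all function names are hypothetical, a dictionary lookup stands in for a real named entity recognizer, and word overlap stands in for a real disambiguation model.

```python
# Toy knowledge base: surface form -> candidate entity identifiers (all hypothetical).
TOY_KB = {
    "Paris": ["Paris_(city)", "Paris_Hilton"],
    "France": ["France_(country)"],
}

# Hypothetical context words associated with each entity.
ENTITY_CONTEXT = {
    "Paris_(city)": {"capital", "france", "city", "seine"},
    "Paris_Hilton": {"hotel", "heiress", "celebrity", "television"},
    "France_(country)": {"capital", "country", "europe", "paris"},
}


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation."""
    return [tok.strip('.,;:!?"') for tok in text.split()]


def recognize(text):
    """Subtask 1: a gazetteer lookup standing in for a real NER model."""
    return [tok for tok in tokenize(text) if tok in TOY_KB]


def generate_candidates(mention):
    """Subtask 2: look up possible entities for a mention."""
    return list(TOY_KB.get(mention, []))


def disambiguate(mention, candidates, text):
    """Subtask 3: pick the candidate sharing the most words with the sentence."""
    context = {tok.lower() for tok in tokenize(text)} - {mention.lower()}
    return max(candidates, key=lambda c: len(ENTITY_CONTEXT[c] & context))


def link_entities(text):
    """Run the three subtasks in sequence."""
    return {m: disambiguate(m, generate_candidates(m), text) for m in recognize(text)}


result = link_entities("Paris is the capital of France")
print(result)
```

In this toy setting "Paris" resolves to the city because the words "capital" and "France" appear in the sentence; real systems replace each step with far more robust components.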

File:Entity Linking - Short Example.png

Introduction

In entity linking, words of interest (names of persons, locations and companies) are mapped from an input text to corresponding unique entities in a target knowledge base. Words of interest are called named entities (NEs), mentions, or surface forms. The target knowledge base depends on the intended application, but for entity linking systems intended to work on open-domain text it is common to use knowledge-bases derived from Wikipedia (such as Wikidata or DBpedia).{{cite book |last1=Han |first1=Xianpei |last2=Sun |first2=Le |last3=Zhao |first3=Jun |title=Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval |chapter=Collective entity linking in web text |date=2011 |pages=765–774 |doi=10.1145/2009916.2010019 |chapter-url=https://dl.acm.org/citation.cfm?id=2010019 |publisher=ACM|isbn=9781450307574 |s2cid=14428938 }} In this case, each individual Wikipedia page is regarded as a separate entity. Entity linking techniques that map named entities to Wikipedia entities are also called wikification.Rada Mihalcea and Andras Csomai (2007)[https://digital.library.unt.edu/ark:/67531/metadc31001/m2/1/high_res_d/Mihalcea-2007-Wikify-Linking_Documents_to_Encyclopedic.pdf Wikify! Linking Documents to Encyclopedic Knowledge]. Proc. CIKM.

Considering again the example sentence "Paris is the capital of France", the expected output of an entity linking system will be the knowledge base entries for Paris and France, typically given as links to the corresponding pages. These uniform resource locators (URLs) can be used as unique uniform resource identifiers (URIs) for the entities in the knowledge base. Using a different knowledge base will return different URIs, but for knowledge bases built starting from Wikipedia there exist one-to-one URI mappings.{{cite web |title=Wikipedia Links |date=4 May 2023 |url=https://wiki.dbpedia.org/services-resources/documentation/datasets#WikipediaLinks}}
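As a rough illustration of such a mapping, the sketch below derives a DBpedia resource URI from an English Wikipedia page URL by reusing the page title. The function name is hypothetical, and the handling of special characters is a simplifying assumption; the mapping datasets published by the knowledge base remain the authoritative source.

```python
from urllib.parse import quote, unquote, urlparse

# DBpedia identifies English resources under this prefix, followed by the page title.
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def wikipedia_to_dbpedia(wikipedia_url):
    """Map a Wikipedia page URL to a DBpedia-style URI (hypothetical helper)."""
    path = urlparse(wikipedia_url).path
    if not path.startswith("/wiki/"):
        raise ValueError("not a Wikipedia article URL: " + wikipedia_url)
    title = unquote(path[len("/wiki/"):])
    # Assumption: keep a few common title characters unescaped.
    return DBPEDIA_PREFIX + quote(title, safe="_(),'")


for url in ("https://en.wikipedia.org/wiki/Paris",
            "https://en.wikipedia.org/wiki/France"):
    print(url, "->", wikipedia_to_dbpedia(url))
```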

In most cases, knowledge bases are manually built,Wikidata but in applications where large text corpora are available, the knowledge base can be inferred automatically from the available text.Aaron M. Cohen (2005). Unsupervised gene/protein named entity normalization using automatically extracted dictionaries. Proc. ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, pp. 17–24.

Entity linking is a critical step to bridge web data with knowledge bases, which is beneficial for annotating the huge amount of raw and often noisy data on the Web and contributes to the vision of the Semantic Web.Shen W, Wang J, Han J. Entity linking with a knowledge base: Issues, techniques, and solutions[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 27(2): 443-460. Other critical steps include, but are not limited to, event extraction,Chang Y C, Chu C H, Su Y C, et al. PIPE: a protein–protein interaction passage extraction module for BioCreative challenge[J]. Database, 2016, 2016. and event linking.Lou P, Jimeno Yepes A, Zhang Z, et al. BioNorm: deep learning-based event normalization for the curation of reaction databases[J]. Bioinformatics, 2020, 36(2): 611-620.

=Applications=

Entity linking is beneficial in fields that need to extract abstract representations from text, as happens in text analysis, recommender systems, semantic search and chatbots. In all these fields, concepts relevant to the application are separated from text and other non-meaningful data.{{cite web |last1=Slawski |first1=Bill |title=How Google Uses Named Entity Disambiguation for Entities with the Same Names |date=16 September 2015 |url=http://www.seobythesea.com/2015/09/disambiguate-entities-in-queries-and-pages/}}{{cite book |last1=Zhou |first1=Ming |title=Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing |last2=Lv |first2=Weifeng |last3=Ren |first3=Pengjie |last4=Wei |first4=Furu |last5=Tan |first5=Chuanqi |chapter=Entity Linking for Queries by Searching Wikipedia Sentences |pages=68–77 |doi=10.18653/v1/D17-1007 |chapter-url=https://aclweb.org/anthology/papers/D/D17/D17-1007/ |language=en-us|arxiv=1704.02788 |year=2017 |s2cid=1125678 }}

For example, a common task performed by search engines is to find documents that are similar to one given as input, or to find additional information about the persons that are mentioned in it.

Consider a sentence that contains the expression "the capital of France": without entity linking, a search engine that looks at the content of documents would not be able to directly retrieve documents containing the word "Paris", leading to so-called false negatives (FN). Even worse, the search engine might produce spurious matches (or false positives (FP)), such as retrieving documents referring to "France" as a country.

Many approaches orthogonal to entity linking exist to retrieve documents similar to an input document. Examples include latent semantic analysis (LSA) and the comparison of document embeddings obtained with doc2vec. However, these techniques do not allow the same fine-grained control that is offered by entity linking, as they return other documents instead of creating high-level representations of the original one. For example, obtaining schematic information about "Paris", as presented by Wikipedia infoboxes, would be much less straightforward, or sometimes even infeasible, depending on the query complexity.{{cite journal |last1=Le |first1=Quoc |last2=Mikolov |first2=Tomas |title=Distributed Representations of Sentences and Documents |journal=Proceedings of the 31st International Conference on International Conference on Machine Learning |volume=32 |date=2014 |pages=II–1188–II–1196 |url=http://dl.acm.org/citation.cfm?id=3044805.3045025 |arxiv=1405.4053 }}

Moreover, entity linking has been used to improve the performance of information retrieval systemsM. A. Khalid, V. Jijkoun and M. de Rijke (2008). [https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/khalid-impact-2008.pdf The impact of named entity normalization on information retrieval for question answering]. Proc. ECIR. and to improve search performance on digital libraries.Hui Han, Hongyuan Zha, C. Lee Giles, "Name disambiguation in author citations using a K-way spectral clustering method," ACM/IEEE Joint Conference on Digital Libraries 2005 (JCDL 2005): 334-343, 2005 Entity linking is also a key input for semantic search.{{Cite web |url=https://stics.mpi-inf.mpg.de/ |title=STICS |access-date=2015-11-16 |archive-date=2021-09-01 |archive-url=https://web.archive.org/web/20210901034108/https://stics.mpi-inf.mpg.de/ |url-status=dead }}{{Cite book |last1=Hoffart |first1=Johannes |last2=Milchevski |first2=Dragan |last3=Weikum |first3=Gerhard |chapter=STICS: Searching with strings, things, and cats |date=2014-07-03 |title=Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval |chapter-url=https://doi.org/10.1145/2600428.2611177 |series=SIGIR '14 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=1247–1248 |doi=10.1145/2600428.2611177 |isbn=978-1-4503-2257-7}}

=Challenges=

There are various difficulties in performing entity linking. Some of these are intrinsic to the task,{{cite book |last1=Rao |first1=Delip |last2=McNamee |first2=Paul |last3=Dredze |first3=Mark |title=Multi-source, Multilingual Information Extraction and Summarization |chapter=Entity Linking: Finding Extracted Entities in a Knowledge Base |date=2013 |pages=93–115 |doi=10.1007/978-3-642-28569-1_5 |publisher=Springer Berlin Heidelberg |language=en|series=Theory and Applications of Natural Language Processing |isbn=978-3-642-28568-4 |s2cid=6420241 }} such as text ambiguity. Others are relevant in real-world use, such as scalability and execution time.

Name variations: the same entity might appear with different textual representations. Sources of these variations include abbreviations (New York, NY), aliases (New York, Big Apple), or spelling variations and errors ({{not a typo|New yokr}}).
Ambiguity: the same mention can often refer to many different entities, depending on the context, as many entity names tend to be homonyms (the same sequence of letters applies to different concepts with distinct meanings, e.g., "bank" can mean a financial institution or the land immediately adjacent to a river) or polysemous (polysemy is a subtype of homonymy where the meanings are related by historical or linguistic origin). The word Paris, among other things, could be referring to the French capital or to Paris Hilton. In some cases, there may be no textual similarity between a mention in the text (e.g., "We visited France's capital last month") and the actual target entity (Paris).
Absence: named entities might not have a corresponding entity in the target knowledge base. This can happen if the entity is very specific or unusual, or is related to recent events and the knowledge base is stale, or if the knowledge base is domain-specific (for example, a biology knowledge base). In these cases, the system is generally expected to return a NIL entity link. Knowing when to return a NIL prediction is not straightforward, and many approaches have been proposed. Examples are thresholding a confidence score in the entity linking system, and including a NIL entity in the knowledge base, which is treated like any other entity. However, in some cases, linking to an incorrect but related entity may be more useful to the user than having no result at all.
Scale and speed: it is desirable for an industrial entity linking system to provide results in a reasonable time, and often in real-time. This requirement is critical for search engines, chat-bots and for entity linking systems offered by data-analytics platforms. Ensuring low execution time can be challenging when using large knowledge bases or when processing large documents.{{cite book |last1=Parravicini |first1=Alberto |last2=Patra |first2=Rhicheek |last3=Bartolini |first3=Davide B. |last4=Santambrogio |first4=Marco D. |title=Proceedings of the 2nd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) |chapter=Fast and Accurate Entity Linking via Graph Embedding |date=2019 |pages=10:1–10:9 |doi=10.1145/3327964.3328499 |chapter-url=https://dl.acm.org/citation.cfm?doid=3327964.3328499 |publisher=ACM|isbn=9781450367899 |hdl=11311/1119019 |s2cid=195357229 |hdl-access=free }} For example, Wikipedia contains nearly 9 million entities and more than 170 million relationships among them.
Evolving information: an entity linking system should also deal with evolving information, and easily integrate updates in the knowledge base. The problem of evolving information is sometimes connected to the problem of missing entities, for example when processing recent news articles in which there are mentions of events that do not have a corresponding entry in the knowledge base due to their novelty.{{cite book |last1=Hoffart |first1=Johannes |last2=Altun |first2=Yasemin |last3=Weikum |first3=Gerhard |title=Proceedings of the 23rd international conference on World wide web |chapter=Discovering emerging entities with ambiguous names |date=2014 |pages=385–396 |doi=10.1145/2566486.2568003 |chapter-url=https://dl.acm.org/citation.cfm?id=2568003 |publisher=ACM|isbn=9781450327442 |s2cid=7562986 }}
Multiple languages: an entity linking system might support queries performed in multiple languages. Ideally, the accuracy of the entity linking system should not be influenced by the input language, and entities in the knowledge base should be the same across different languages.{{cite journal |last1=Doermann |first1=David S. |last2=Oard |first2=Douglas W. |last3=Lawrie |first3=Dawn J. |last4=Mayfield |first4=James |last5=McNamee |first5=Paul |s2cid=3801685 |title=Cross-Language Entity Linking |journal= |date=2011 |language=en}}
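The confidence-thresholding strategy mentioned under "Absence" can be sketched as follows. The scores, the threshold value, and the function name are hypothetical; in practice the threshold would have to be tuned on held-out data for a given linker.

```python
NIL = "NIL"


def link_with_threshold(scores, threshold=0.5):
    """Return the best-scoring candidate, or NIL when confidence is too low.

    `scores` maps candidate entity identifiers to confidence values in [0, 1].
    """
    if not scores:
        # No candidate was generated at all.
        return NIL
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else NIL


# Hypothetical linker outputs for three mentions.
print(link_with_threshold({"Paris_(city)": 0.91, "Paris_Hilton": 0.07}))
print(link_with_threshold({"Paris_(city)": 0.31, "Paris_Hilton": 0.29}))
print(link_with_threshold({}))
```

A lower threshold links more mentions at the cost of more wrong links, which matters when an incorrect but related entity is still considered useful.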

=Related concepts=

Entity linking is related to other concepts. Definitions are often blurry and vary slightly between authors.

Named-entity disambiguation (NED) is usually considered the same as entity linking, but some authors (Alhelbawy et al.{{cite journal |last1=Alhelbawy |first1=Ayman |last2=Gaizauskas |first2=Robert |date=August 2014 |title=Collective Named Entity Disambiguation using Graph Ranking and Clique Partitioning Approaches |url=https://www.aclweb.org/anthology/C14-1147 |volume=Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers |issue=Dublin City University and Association for Computational Linguistics |pages=1544–1555}}) consider it a special case of entity linking that assumes that the entity is in the knowledge base.{{cite book |last1=Zwicklbauer |first1=Stefan |last2=Seifert |first2=Christin |last3=Granitzer |first3=Michael |title=Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval |chapter=Robust and Collective Entity Disambiguation through Semantic Embeddings |date=2016 |pages=425–434 |doi=10.1145/2911451.2911535 |chapter-url=https://dl.acm.org/citation.cfm?doid=2911451.2911535 |publisher=ACM|isbn=9781450340694 |s2cid=207237647 |url=https://ris.utwente.nl/ws/files/50950222/p425_zwicklbauer.pdf }}{{cite journal |last1=Hachey |first1=Ben |last2=Radford |first2=Will |last3=Nothman |first3=Joel |last4=Honnibal |first4=Matthew |last5=Curran |first5=James R. |title=Evaluating Entity Linking with Wikipedia |journal=Artif. Intell. |date=2013 |volume=194 |pages=130–150 |doi=10.1016/j.artint.2012.04.005 |issn=0004-3702|doi-access=free }}

Wikification is the task of linking textual mentions to entities in Wikipedia (generally, limiting the scope to the English Wikipedia in case of cross-lingual wikification).
Record linkage (RL) finds the same entity in multiple and often heterogeneous data-sets.{{cite book |last1=Tsai |first1=Chen-Tse |title=Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies |last2=Roth |first2=Dan |date=2016 |volume=Proceedings of NAACL-HLT 2016 |pages=589–598 |chapter=Cross-lingual Wikification Using Multilingual Embeddings |doi=10.18653/v1/N16-1072 |chapter-url=https://www.aclweb.org/anthology/N16-1072 |s2cid=15156124}} It is considered a broader concept than entity linking, and is a key process in digitizing archives and joining knowledge bases.
Named-entity recognition (NER) locates and classifies named entities in unstructured text into pre-defined categories such as names, organizations, locations, and more. For example, the following sentence:

{{blockquote|Paris is the capital of France.}}

:would be processed by an NER system to obtain the following output:

{{blockquote|[Paris]_City is the capital of [France]_Country.}}

:NER is usually a preprocessing step of an entity linking system, as it can be useful to know in advance which words should be linked to entities of the knowledge base.

Coreference resolution determines whether multiple words in a text refer to the same entity. It can be useful, for example, to identify the word a pronoun refers to. Consider the following example:

{{blockquote|Paris is the capital of France. It is also the largest city in France.}}

:In this example, a coreference resolution algorithm would identify that the pronoun "It" refers to Paris, and not to France or to another entity. A notable distinction compared to entity linking is that coreference resolution does not assign any unique identity to the words it matches; it only indicates whether they refer to the same entity. In that sense, predictions from a coreference resolution system could be useful to a subsequent entity linking component.

Approaches

Entity linking has been a hot topic in industry and academia for the last decade. Many challenges are unsolved, but many entity linking systems have been proposed, with widely different strengths and weaknesses.{{cite journal |last1=Ji |first1=Heng |last2=Nothman |first2=Joel |last3=Hachey |first3=Ben |last4=Florian |first4=Radu |title=Overview of TAC-KBP2015 Tri-lingual Entity Discovery and Linking |journal=TAC |date=2015}}

Broadly speaking, modern entity linking systems can be divided into two categories:

Text-based approaches, which make use of textual features extracted from large text corpora (e.g. term frequency–inverse document frequency (tf–idf), word co-occurrence probabilities, etc.).{{cite book |last1=Cucerzan |first1=Silviu |chapter=Large-Scale Named Entity Disambiguation Based on Wikipedia Data |date=June 2007 |title=Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) |pages=708–716 |chapter-url=https://aclweb.org/anthology/papers/D/D07/D07-1074/ |publisher=Association for Computational Linguistics |language=en-us}}
Graph-based approaches, which use the structure of knowledge graphs to represent the context and the relation of entities.{{cite journal |last1=Weikum |first1=Gerhard |last2=Thater |first2=Stefan |last3=Taneva |first3=Bilyana |last4=Spaniol |first4=Marc |last5=Pinkal |first5=Manfred |last6=Fürstenau |first6=Hagen |last7=Bordino |first7=Ilaria |last8=Yosef |first8=Mohamed Amir |last9=Hoffart |first9=Johannes |title=Robust Disambiguation of Named Entities in Text |journal=Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing |date=2011 |pages=782–792 |url=https://www.aclweb.org/anthology/papers/D/D11/D11-1072/ |language=en-us}}

Often entity linking systems use both knowledge graphs and textual features extracted from, for example, the text corpora used to build the knowledge graphs themselves.
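To make the text-based idea concrete, the following sketch ranks candidate entities by the cosine similarity between tf–idf vectors of the mention's sentence and of short entity descriptions. The descriptions, entity identifiers, and the smoothed idf formula are assumptions chosen for illustration; the sketch is not the method of any of the cited systems.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build tf-idf vectors (as dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf, so that terms occurring in every document keep a small weight.
        vectors.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical one-line descriptions of the candidate entities.
descriptions = {
    "Paris_(city)": "paris is the capital city of france on the seine".split(),
    "Paris_Hilton": "paris hilton is an american media personality and hotel heiress".split(),
}
mention_context = "paris is the capital of france".split()

names = list(descriptions)
vectors = tfidf_vectors([mention_context] + [descriptions[name] for name in names])
query, candidate_vectors = vectors[0], vectors[1:]
scores = {name: cosine(query, vec) for name, vec in zip(names, candidate_vectors)}
best = max(scores, key=scores.get)
print(best, scores)
```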

File:Entity Linking - Example of pipeline.png

=Text-based=

Cucerzan's seminal 2007 work presented one of the first entity linking systems. Specifically, it tackled the task of wikification, that is, linking textual mentions to Wikipedia pages. This system categorizes pages into entity, disambiguation, or list pages. The set of entities present in each entity page is used to build the entity's context. The final step is a collective disambiguation that compares binary vectors of hand-crafted features derived from each entity's context. Cucerzan's system is still used as a baseline for recent work.{{Cite conference| doi = 10.1145/1557019.1557073| title = Collective annotation of Wikipedia entities in web text| conference = Proc. 15th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD)| pages = | year = 2009| last1 = Kulkarni | first1 = Sayali| last2 = Singh | first2 = Amit| last3 = Ramakrishnan | first3 = Ganesh| last4 = Chakrabarti | first4 = Soumen| isbn = 9781605584959| citeseerx = 10.1.1.151.1904}}

Rao et al. proposed a two-step algorithm to link named entities to entities in a target knowledge base. First, candidate entities are chosen using string matching, acronyms, and known aliases. Then, the best link among the candidates is chosen with a ranking support vector machine (SVM) that uses linguistic features.
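The first step of such a two-step scheme, candidate selection by string matching, acronyms, and known aliases, might look like the sketch below. The knowledge base names, the alias table, and the function names are hypothetical, and the ranking step with a support vector machine is deliberately left out; this is not the implementation of Rao et al.

```python
# Hypothetical knowledge base: entity identifier -> known names.
KB_NAMES = {
    "New_York_City": ["New York City", "New York"],
    "New_York_(state)": ["New York", "New York State"],
    "United_Nations": ["United Nations"],
}

# Hypothetical alias table: lower-cased alias -> entity identifiers.
ALIASES = {"big apple": ["New_York_City"]}


def acronym(name):
    """Build an acronym from the first letter of each word."""
    return "".join(word[0] for word in name.split()).upper()


def generate_candidates(mention):
    """Collect candidates by exact name match, acronym match, and alias lookup."""
    surface = mention.strip()
    key = surface.lower()
    found = []
    for entity, names in KB_NAMES.items():
        exact = any(key == name.lower() for name in names)
        # Only multi-word names can produce a meaningful acronym.
        is_acronym = any(surface == acronym(name) for name in names if " " in name)
        if exact or is_acronym:
            found.append(entity)
    for entity in ALIASES.get(key, []):
        if entity not in found:
            found.append(entity)
    return found


for mention in ("New York", "UN", "Big Apple", "Atlantis"):
    print(mention, "->", generate_candidates(mention))
```

An empty candidate list, as for "Atlantis" here, is one natural point at which a system could emit a NIL link.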

More recent systems, such as the one by Tsai et al., use word embeddings obtained with a skip-gram model as language features, and can be applied to any language for which a large corpus to build word embeddings is available. Like most entity linking systems, the system by Tsai et al. has two steps: an initial candidate selection, followed by ranking with a linear SVM.

Various approaches have been tried to tackle the problem of entity ambiguity. The seminal approach of Milne and Witten applies supervised learning, using the anchor texts of Wikipedia entities as training data.David Milne and Ian H. Witten (2008). Learning to link with Wikipedia. Proc. CIKM. Other approaches also collected training data based on unambiguous synonyms.{{cite journal|last=Zhang|first=Wei|author2=Jian Su |author3=Chew Lim Tan |title=Entity Linking Leveraging Automatically Generated Annotation|journal=Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)|year=2010}}

=Graph-based=

Modern entity linking systems also use large knowledge graphs created from knowledge bases such as Wikipedia, besides textual features generated from input documents or text corpora. Moreover, multilingual entity linking based on natural language processing (NLP) is difficult, because it requires either large text corpora, which are absent for many languages, or hand-crafted grammar rules, which are widely different between languages. Graph-based entity linking uses features of the graph topology or multi-hop connections between entities, which are hidden to simple text analysis.

Han et al. propose the creation of a disambiguation graph (a subgraph of the knowledge base which contains candidate entities). This graph is used for collective ranking to select the best candidate entity for each textual mention.

Another famous approach is AIDA,{{Cite journal |last1=Yosef |first1=Mohamed Amir |last2=Hoffart |first2=Johannes |last3=Bordino |first3=Ilaria |last4=Spaniol |first4=Marc |last5=Weikum |first5=Gerhard |date=2011 |title=AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables |journal=Proceedings of the 37th International Conference on Very Large Databases |volume=VLDB 2011 |pages=1450–1453}} which uses a series of complex graph algorithms and a greedy algorithm that identifies coherent mentions on a dense subgraph by also considering context similarities and vertex importance features to perform collective disambiguation.

Alhelbawy et al. presented an entity linking system that uses PageRank to perform collective entity linking on a disambiguation graph, and to understand which entities are more strongly related to each other and so would represent a better linking. Graph ranking (or vertex ranking) algorithms such as PageRank (PR) and Hyperlink-Induced Topic Search (HITS) aim to score nodes according to their relative importance in the graph.
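The idea of ranking candidates with PageRank on a disambiguation graph can be illustrated with a plain power-iteration implementation on a toy graph. The graph, the entity identifiers, and the parameter values are hypothetical, and the sketch omits the edge weights and initial scores a real system would use; it is not the system of Alhelbawy et al.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of adjacency lists."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node, neighbours in graph.items():
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for neighbour in neighbours:
                    new_rank[neighbour] += share
            else:
                # A node without outgoing edges spreads its rank evenly.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank


# Toy disambiguation graph: an edge means the two entities are related in the
# knowledge base (each undirected edge is stored in both directions).
GRAPH = {
    "Paris_(city)": ["France_(country)"],
    "France_(country)": ["Paris_(city)"],
    "Paris_Hilton": [],
}

# Candidate entities for each mention in "Paris is the capital of France".
CANDIDATES = {
    "Paris": ["Paris_(city)", "Paris_Hilton"],
    "France": ["France_(country)"],
}

ranks = pagerank(GRAPH)
linked = {mention: max(cands, key=ranks.get) for mention, cands in CANDIDATES.items()}
print(linked)
```

Because the city and the country reinforce each other through their shared edge, the city candidate ends up with a higher score than the unconnected candidate, which is the intuition behind collective linking.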

=Mathematical=

Mathematical expressions (symbols and formulae) can be linked to semantic entities (e.g., Wikipedia articles{{cite book|author1=Giovanni Yoko Kristianto|author2=Goran Topic|author3=Akiko Aizawa|title=Digital Libraries: Knowledge, Information, and Data in an Open Access Society |chapter=Entity Linking for Mathematical Expressions in Scientific Documents |display-authors=et al.|doi=10.1007/978-3-319-49304-6_18|date=2016|series=Lecture Notes in Computer Science|volume=10075|pages=144–149|publisher=Springer|isbn=978-3-319-49303-9}} or Wikidata items{{cite conference|author1=Philipp Scharpf|author2=Moritz Schubotz|display-authors=et al.|title=Representing Mathematical Formulae in Content MathML using Wikidata|date=2018|conference=ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018)}}) labeled with their natural language meaning. This is essential for disambiguation, since symbols may have different meanings (e.g., "E" can be "energy" or "expectation value", etc.).{{cite journal|author1=Moritz Schubotz|author2=Philipp Scharpf|display-authors=et al.|title=Introducing MathQA: a Math-Aware question answering system|doi=10.1108/IDD-06-2018-0022|date=2018|journal=Information Discovery and Delivery|volume=46|issue=4|pages=214–224|publisher=Emerald Publishing Limited|arxiv=1907.01642|s2cid=49484035}} The math entity linking process can be facilitated and accelerated through annotation recommendation, e.g., using the "AnnoMathTeX" system that is hosted by Wikimedia.{{cite web|url=http://annomathtex.wmflabs.org|title=AnnoMathTeX Formula/Identifier Annotation Recommender System}}{{cite book|author1=Philipp Scharpf|author2=Ian Mackerracher|title=Proceedings of the 13th ACM Conference on Recommender Systems |chapter=AnnoMathTeX - a formula identifier annotation recommender system for STEM documents |display-authors=et al.|doi=10.1145/3298689.3347042|date=17 September 2019|pages=532–533|isbn=9781450362436|s2cid=202639987|url=https://www.gipp.com/wp-content/papercite-data/pdf/scharpf2019b.pdf}}{{cite book|author1=Philipp Scharpf|author2=Moritz Schubotz|author3=Bela Gipp|title=Companion Proceedings of the Web Conference 2021 |chapter=Fast Linking of Mathematical Wikidata Entities in Wikipedia Articles Using Annotation Recommendation |doi=10.1145/3442442.3452348|date=14 April 2021|pages=602–609|arxiv=2104.05111|isbn=9781450383134|s2cid=233210264 |url=https://www.gipp.com/wp-content/papercite-data/pdf/scharpf2021.pdf}}

To facilitate the reproducibility of Mathematical Entity Linking (MathEL) experiments, the benchmark MathMLben was created.{{cite web|url=http://mathmlben.wmflabs.org|title=MathMLben formula benchmark}}{{cite book|author1=Moritz Schubotz|author2=André Greiner-Petter|author3=Philipp Scharpf|author4=Norman Meuschke|author5=Howard Cohl|author6=Bela Gipp|title=Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries |chapter=Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context |year=2018|volume=39 |issue=3 |pages=233–242|doi=10.1145/3197026.3197058|pmid=34584342 |pmc=8474120 |arxiv=1804.04956|isbn=9781450351782|s2cid=4872257|url=https://www.gipp.com/wp-content/papercite-data/pdf/schubotz2018.pdf}} It contains formulae from Wikipedia, the arXiv and the NIST Digital Library of Mathematical Functions (DLMF). Formulae entries in the benchmark are labeled and augmented by Wikidata markup. Furthermore, the distributions of mathematical notation were examined for two large corpora from the arXiv{{cite web|url=http://arxiv.org|title=arXiv preprint repository}} and zbMATH{{cite web|url=http://zbmath.org|title=zbMath mathematical document library}} repositories. Mathematical Objects of Interest (MOI) are identified as potential candidates for MathEL.{{cite book|author1=André Greiner-Petter|author2=Moritz Schubotz|author3=Fabian Mueller|author4=Corinna Breitinger|author5=Howard S. Cohl|author6=Akiko Aizawa|author7=Bela Gipp|title=Proceedings of the Web Conference 2020 |chapter=Discovering Mathematical Objects of Interest—A Study of Mathematical Notations |year=2020|pages=1445–1456|doi=10.1145/3366423.3380218|arxiv=2002.02712|isbn=9781450370233|s2cid=211066554|url=https://www.gipp.com/wp-content/papercite-data/pdf/greinerpetter2020.pdf}}

Besides linking to Wikipedia, Schubotz and Scharpf et al. describe linking mathematical formula content to Wikidata, both in MathML and LaTeX markup. To extend classical citations with mathematical ones, they call for Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR) challenges to advance automated MathEL. Their FCD approach yields a recall of 68% for retrieving equivalent representations of frequent formulae, and 72% for extracting the formula name from the surrounding text on the NTCIR{{cite journal|author1=Akiko Aizawa|author2=Michael Kohlhase|author3=Iadh Ounis|author4=Moritz Schubotz|title=NTCIR-11 Math-2 Task Overview|journal=Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies}} arXiv dataset.
