spaCy

{{Short description|Software library for natural language processing}}

{{lowercase title}}

{{For|the 1981 film|Spacy (film)}}

{{Distinguish|Scapy}}

{{Infobox software

| title = spaCy

| name = spaCy

| logo = SpaCy logo.svg

| author = Matthew Honnibal

| developer = Explosion AI, various

| released = {{Start date and age|2015|02}}{{cite web |title=Introducing spaCy |publisher=explosion.ai |url=https://explosion.ai/blog/introducing-spacy|access-date=2016-12-18}}

| latest release version = {{wikidata|property|reference|edit| Q28406945 |P348}}

| latest release date = {{start date and age|{{wikidata|qualifier| Q28406945 |P348|P577}}}}

| programming language = Python, Cython

| operating system = Linux, Windows, macOS

| platform = Cross-platform

| genre = Natural language processing

| license = MIT License

| website = {{Official URL}}

}}

spaCy ({{IPAc-en|s|p|eɪ|ˈ|s|iː}} {{respell|spay|SEE|'}}) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.{{cite conference|last=Choi|display-authors=etal|year=2015|title=It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool|url=https://aclweb.org/anthology/P/P15/P15-1038.pdf}}{{Cite web|url=https://www.washingtonpost.com/news/wonk/wp/2016/05/18/googles-new-artificial-intelligence-cant-understand-these-sentences-can-you/|title=Google's new artificial intelligence can't understand these sentences. Can you?|website=Washington Post|access-date=2016-12-18}} The library is published under the MIT license, and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.

Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production use.{{Cite web|url=https://spacy.io/usage/facts-figures#other-libraries|title=Facts & Figures - spaCy|website=spacy.io|language=en|access-date=2020-04-04}}{{cite journal|last=Bird|first=Steven|author2=Klein, Ewan|author3=Loper, Edward|author4=Baldridge, Jason|year=2008|title=Multidisciplinary instruction with the Natural Language Toolkit|url=https://www.aclweb.org/anthology/W/W08/W08-0208.pdf|journal=Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, ACL|page=62|doi=10.3115/1627306.1627317|isbn=9781932432145|s2cid=16932735|doi-access=free}} spaCy also supports deep learning workflows: statistical models trained with popular machine learning libraries such as TensorFlow, PyTorch or MXNet can be connected through its own machine learning library, Thinc.{{Cite web|url=https://thinc.ai/docs/usage-frameworks|title=PyTorch, TensorFlow & MXNet|website=thinc.ai|access-date=2020-04-04}}{{Cite web|url=https://github.com/explosion/thinc|title=explosion/thinc|website=GitHub|access-date=2016-12-30}} Using Thinc as its backend, spaCy features convolutional neural network models for part-of-speech tagging, dependency parsing, text categorization and named entity recognition (NER). Prebuilt statistical neural network models for these tasks are available for 23 languages, including English, Portuguese, Spanish, Russian and Chinese, and there is also a multi-language NER model. Tokenization support for more than 65 languages additionally allows users to train custom models on their own datasets.{{Cite web|url=https://spacy.io/usage/models#languages|title=Models & Languages {{!}} spaCy Usage Documentation|website=spacy.io|access-date=2020-03-10}}

History

  • Version 1.0 was released on October 19, 2016, and included preliminary support for deep learning workflows by supporting custom processing pipelines.{{Cite web|url=https://github.com/explosion/spaCy/releases/tag/v1.0.0|title=explosion/spaCy|website=GitHub|access-date=2021-02-08}} It further included a rule matcher that supported entity annotations, and an officially documented training API.
  • Version 2.0 was released on November 7, 2017, and introduced convolutional neural network models for 7 different languages.{{Cite web|url=https://github.com/explosion/spaCy/releases/tag/v2.0.0|title=explosion/spaCy|website=GitHub|access-date=2021-02-08}} It also supported custom processing pipeline components and extension attributes, and featured a built-in trainable text classification component.
  • Version 3.0 was released on February 1, 2021, and introduced state-of-the-art transformer-based pipelines.{{Cite web|url=https://github.com/explosion/spaCy/releases/tag/v3.0.0|title=explosion/spaCy|website=GitHub|access-date=2021-02-08}} It also introduced a new configuration system and training workflow, as well as type hints and project templates. This version dropped support for Python 2.

Main features

  • Non-destructive tokenization
  • "Alpha tokenization" support for over 65 languages{{Cite web|url=https://spacy.io/usage/models|title=Models & Languages - spaCy|website=spacy.io|language=en|access-date=2021-02-08}}
  • Built-in support for trainable pipeline components such as named entity recognition, part-of-speech tagging, dependency parsing, text classification, entity linking and more
  • Statistical models for 19 languages{{Cite web|url=https://spacy.io/usage/models|title=Models & Languages {{!}} spaCy Usage Documentation|website=spacy.io|language=en|access-date=2021-02-08}}
  • Multi-task learning with pretrained transformers like BERT
  • Support for custom models in PyTorch, TensorFlow and other frameworks
  • State-of-the-art speed and accuracy{{Cite web|url=https://spacy.io/usage/facts-figures#benchmarks|title=Benchmarks {{!}} spaCy Usage Documentation|website=spacy.io|language=en|access-date=2021-02-08}}
  • Production-ready training system
  • Built-in visualizers for syntax and named entities
  • Easy model packaging, deployment and workflow management
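
The first feature in the list above, non-destructive tokenization, means that splitting text into tokens discards nothing, so the original string can be rebuilt exactly from the tokens. The following is a minimal pure-Python sketch of that idea only; it is not spaCy's implementation, and the names <code>Token</code>, <code>tokenize</code> and <code>detokenize</code> as well as the simple regular-expression splitting rule are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """Hypothetical token record (illustrative, not spaCy's Token class)."""
    text: str        # the token's characters
    whitespace: str  # whitespace that followed the token in the source
    idx: int         # character offset of the token in the source


# A word (run of word characters) or a single punctuation character,
# followed by any trailing whitespace, which is captured rather than dropped.
_TOKEN_RE = re.compile(r"(\w+|[^\w\s])(\s*)")


def tokenize(text: str) -> List[Token]:
    """Split text into tokens while keeping every character of the input."""
    tokens: List[Token] = []
    # Leading whitespace is stored on an empty-text token so it is not lost.
    lead = re.match(r"\s*", text).end()
    if lead:
        tokens.append(Token("", text[:lead], 0))
    for m in _TOKEN_RE.finditer(text, lead):
        tokens.append(Token(m.group(1), m.group(2), m.start(1)))
    return tokens


def detokenize(tokens: List[Token]) -> str:
    """Rebuild the exact original string from the tokens."""
    return "".join(t.text + t.whitespace for t in tokens)
```

Because each token records its trailing whitespace and its character offset, <code>detokenize(tokenize(text)) == text</code> holds for any input string, and annotations attached to a token can always be mapped back to a position in the original text.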

Extensions and visualizers

[[File:Displacy.png|thumb|Visualization generated with the displaCy visualizer]]

spaCy comes with several extensions and visualizations that are available as free, open-source libraries:

References

{{Reflist}}