BookCorpus
{{Short description|Book dataset}}
BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords.{{Cite journal |last=Bandy |first=Jack |last2=Vincent |first2=Nicholas |date=2021 |title=Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus |url=https://arxiv.org/abs/2105.05241 |journal=NeurIPS}} It was the main corpus used to train the initial GPT model by OpenAI,{{Cite web|url=https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf|title=Improving Language Understanding by Generative Pre-Training|access-date=June 9, 2020|archive-date=January 26, 2021|archive-url=https://web.archive.org/web/20210126024542/https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf|url-status=live}} and has been used as training data for other early large language models, including Google's BERT.{{cite arXiv |last1=Devlin |first1=Jacob |last2=Chang |first2=Ming-Wei |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=11 October 2018 |eprint=1810.04805v2|class=cs.CL }} The dataset contains around 985 million words, and the books that make it up span a range of genres, including romance, science fiction, and fantasy.
The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors", but this description is inaccurate: the books were written by self-published ("indie") authors who had priced them at free, and they were downloaded without the consent or permission of Smashwords or its authors, in violation of the Smashwords Terms of Service.{{cite web|title=Google swallows 11,000 novels to improve AI's conversation|last=Lea|first=Richard|work=The Guardian|date=28 September 2016|url=https://www.theguardian.com/books/2016/sep/28/google-swallows-11000-novels-to-improve-ais-conversation}} The dataset was initially hosted on a University of Toronto webpage. No official version of the original dataset remains publicly available, though at least one substitute, BookCorpusOpen, has been created. Although the original 2015 paper did not document the source of the corpus, the site from which its books were scraped is now known to be Smashwords.