List of datasets for machine-learning research

These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.{{cite web|url = https://edge.org/response-detail/26587|title = Datasets Over Algorithms|publisher = Edge.com|access-date = 8 January 2016|last = Wissner-Gross|first = A.}} High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.{{cite journal |last1=Weiss |first1=G. M. |last2=Provost |first2=F. |title=Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction |journal=Journal of Artificial Intelligence Research |date=October 2003 |volume=19 |pages=315–354 |doi=10.1613/jair.1199 }}{{cite book |last1=Abney |first1=Steven |title=Semisupervised Learning for Computational Linguistics |date=2007 |publisher=CRC Press |isbn=978-1-4200-1080-0 }}{{page needed|date=September 2024}}{{cite book |doi=10.1007/978-3-642-23808-6_39 |chapter=Active Learning with Evolving Streaming Data |title=Machine Learning and Knowledge Discovery in Databases |series=Lecture Notes in Computer Science |date=2011 |last1=Žliobaitė |first1=Indrė |last2=Bifet |first2=Albert |last3=Pfahringer |first3=Bernhard |last4=Holmes |first4=Geoff |volume=6913 |pages=597–612 |isbn=978-3-642-23807-9 }}

Many organizations, including governments, publish and share their datasets. Based on their licenses, datasets are classified as open data or non-open data.

Datasets from various governmental bodies are presented in List of open government data sites. The datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as Open API. They are organized into various types and subtypes.
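As an illustration of access through a portal API, the sketch below builds a search request for a CKAN-style portal and extracts dataset titles from a response. It assumes the CKAN Action API's `package_search` endpoint under `/api/3/action/`. The portal address `https://portal.example.org` and the sample response are placeholders invented for illustration, and the helper names `build_search_url` and `dataset_titles` are hypothetical.

```python
from __future__ import annotations

import json
from urllib.parse import urlencode


def build_search_url(portal: str, query: str, rows: int = 10) -> str:
    """Return a CKAN-style package_search URL for a free-text query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> list[str]:
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        return []
    return [item["title"] for item in payload["result"]["results"]]


# Hypothetical response, shaped like a CKAN package_search reply.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {"count": 2, "results": [
        {"name": "air-quality", "title": "Air Quality Readings"},
        {"name": "bus-routes", "title": "Bus Routes"},
    ]},
})

url = build_search_url("https://portal.example.org/", "transport", rows=5)
titles = dataset_titles(SAMPLE_RESPONSE)
```

In practice the URL would be fetched with an HTTP client and the response body passed to `dataset_titles`; the network call is left out here so the sketch stays self-contained.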

List of sorting criteria used for datasets

class="wikitable"

!Type

!Subtypes

Specific category

|Finance, Economics, Commerce, Societal, Health, Academy, Sports, Food, Agriculture, Travel, Geospatial, Political, Consumer, Transport, Logistics, Environmental, Real-Estate, Legal, Entertainment, Energy, Hospitality

Scope

|Supranational Union, National, Subnational, Municipality, Urban, Rural

Language

|Mandarin Chinese, Spanish, English, Arabic, Hindi, Bengali

Type

|Tabular, Graph, Text, Image, Sound, Video

Usage

|Training, validating, and testing

File-Formats

|CSV, JSON, XML, KML, GeoJSON, Shapefile, GML

Licenses

|Creative-Commons, GPL, Other Non-Open data licenses

Last-Updated

|Last-Hour, Last-Day, Last-Week, Last-Month, Last-Year

File-Size

|Minimum, Maximum, Range

[https://docs.openml.org/#dataset-status Status]

|Verified, In-Preparation, Deactivated (or Deprecated)

Number of records

|100s, 1000s, 10000s, 100000s, Millions

Number of variables

|Less than 10, 10s, 100s, 1000s, 10000s

Services

|Individual, Aggregation
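The sorting criteria in the table above can be treated as filterable metadata fields. The following is a minimal sketch of filtering a small catalog by file format, license, and number of records. The `DatasetRecord` structure, the `filter_catalog` helper, and the catalog entries are all hypothetical and do not correspond to any particular portal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    data_type: str      # e.g. "Tabular", "Text", "Image"
    file_format: str    # e.g. "CSV", "JSON"
    license: str        # e.g. "Creative-Commons", "GPL"
    n_records: int


def filter_catalog(catalog, file_format=None, license=None, min_records=0):
    """Keep records that match every criterion that was supplied."""
    matches = []
    for rec in catalog:
        if file_format is not None and rec.file_format != file_format:
            continue
        if license is not None and rec.license != license:
            continue
        if rec.n_records < min_records:
            continue
        matches.append(rec)
    return matches


# Made-up catalog entries, for illustration only.
CATALOG = [
    DatasetRecord("city-budget", "Tabular", "CSV", "Creative-Commons", 12_000),
    DatasetRecord("road-network", "Graph", "GeoJSON", "Creative-Commons", 850),
    DatasetRecord("court-rulings", "Text", "JSON", "GPL", 40_000),
]

selected = filter_catalog(CATALOG, file_format="CSV", min_records=1_000)
selected_names = [rec.name for rec in selected]
```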

Data portals are classified by their type of license. Portals based on open-source licenses are known as open data portals and are used by many government organizations and academic institutions.

List of open data portals

class="wikitable"

!Portal-name

!License

!List of installations of the portal

!Typical usages

Comprehensive Knowledge Archive Network (CKAN)

|AGPL

|https://ckan.github.io/ckan-instances/

https://github.com/sebneu/ckan_instances/blob/master/instances.csv

|Data repository for government or non-profit organisations, Data Management Solution for Research Institutes

[https://getdkan.org/ DKAN]

|GPL

|https://getdkan.org/community

|Data repository for government or non-profit organisations, Data Management Solution for Research Institutes

Dataverse

|Apache

|https://dataverse.org/installations

https://dataverse.org/metrics

|Data Management Solution for Research Institutes

DSpace

|BSD

|https://registry.lyrasis.org/

|Data Management Solution for Research Institutes

[https://www.openml.org/ OpenML]

|BSD

|https://www.openml.org/search?type=data&sort=runs&status=active

|Data Management Solution for sharing datasets, algorithms, and experiment results through APIs.

List of portals suitable for multiple types of applications

Some data portals list a wide variety of dataset subtypes that pertain to many machine learning applications.

class="wikitable"

Academic Torrents

|https://academictorrents.com

Amazon Datasets

|https://registry.opendata.aws/

Awesome Public Datasets Collection

|https://github.com/awesomedata/awesome-public-datasets

data.world

|https://data.world/datasets/machine-learning

Datahub – Core Datasets

|https://datahub.io/docs/core-data

DataONE

|https://www.dataone.org/

DataPortals

|https://dataportals.org/

Datasetlist.com

|https://www.datasetlist.com

Global Open Data Index – Open Knowledge Foundation

|https://okfn.org/ {{Webarchive|url=https://web.archive.org/web/20200525213547/https://index.okfn.org/ |date=25 May 2020 }}

Google Dataset Search

|https://datasetsearch.research.google.com/

Hugging Face

|https://huggingface.co/docs/datasets/

IBM's Data Asset Exchange

|https://developer.ibm.com/exchanges/data/

Jupyter – Tutorial Data

|https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html

Kaggle

|https://www.kaggle.com/datasets

Machine learning datasets

|https://macgence.com/data-sets-and-cataloges/

Major Smart Cities with Open Data

|https://rlist.io/l/major-smart-cities-with-open-data-portals

Microsoft Datasets

|https://msropendata.com/datasets

Open Data Inception

|https://opendatainception.io/

Opendatasoft

|https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en

OpenDOAR

|https://v2.sherpa.ac.uk/opendoar/

OpenML

|https://www.openml.org/search?type=data

Papers with Code

|https://paperswithcode.com/datasets

Penn Machine Learning Benchmarks

|https://github.com/EpistasisLab/pmlb/tree/master/datasets

Public APIs

|https://github.com/public-apis/public-apis

Registry of Open Access Repositories

|http://roar.eprints.org/

REgistry of REsearch Data REpositories

|https://www.re3data.org/

UCI Machine Learning Repository

|http://mlr.cs.umass.edu/ml/ {{Webarchive|url=https://web.archive.org/web/20200626215834/http://mlr.cs.umass.edu/ml/ |date=26 June 2020 }}

Speech Dataset

|https://www.shaip.com/offerings/speech-data-catalog/

Visual Data Discovery

|https://visualdata.io/discovery

List of portals suitable for a specific subtype of applications

Data portals suited to a specific subtype of machine learning application are listed in the following sections.

Image data

{{Main|List of datasets in computer vision and image processing}}

Text data

These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

= Reviews =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Netflix Prize

|Movie ratings on Netflix.

|100,480,507 ratings that 480,189 users gave to 17,770 movies

|Text, rating

|Rating prediction

|2006

|{{cite conference |last=James Bennett |author2=Stan Lanning |date=August 12, 2007 |title=The Netflix Prize |url=http://www.netflixprize.com/assets/NetflixPrizeKDD_to_appear.pdf |archive-url=https://web.archive.org/web/20070927051207/http://www.netflixprize.com/assets/NetflixPrizeKDD_to_appear.pdf |archive-date=September 27, 2007 |access-date=2007-08-25 |book-title=Proceedings of KDD Cup and Workshop 2007 |url-status=dead}}

|Netflix

Amazon reviews

| US product reviews from Amazon.com.

|None.

| 233.1 million

|Text

|Classification, sentiment analysis

|2015 (2018)

|{{cite arXiv | eprint=1506.04757 | last1=McAuley | first1=Julian | last2=Targett | first2=Christopher | last3=Shi | first3=Qinfeng | author4=Anton van den Hengel | title=Image-based Recommendations on Styles and Substitutes | year=2015 | class=cs.CV }}{{Cite web|title=Amazon review data|url=https://nijianmo.github.io/amazon/index.html|access-date=2021-10-08|website=nijianmo.github.io}}

|McAuley et al.

OpinRank Review Dataset

|Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively.

|None.

|42,230 / ~259,000 respectively

|Text

|Sentiment analysis, clustering

|2011

|{{cite journal | last1 = Ganesan | first1 = Kavita | last2 = Zhai | first2 = Chengxiang | year = 2012 | title = Opinion-based entity ranking | journal = Information Retrieval | volume = 15 | issue = 2| pages = 116–150 | doi=10.1007/s10791-011-9174-8| hdl = 2142/15252 | s2cid = 16258727 | hdl-access = free }}{{cite book |doi=10.1145/2348283.2348325 |chapter=An exploration of ranking heuristics in mobile local search |title=Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval |date=2012 |last1=Lv |first1=Yuanhua |last2=Lymberopoulos |first2=Dimitrios |last3=Wu |first3=Qiang |pages=295–304 |isbn=978-1-4503-1472-5 }}

|K. Ganesan et al.

MovieLens

|22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.

|None.

|~ 22M

|Text

|Regression, clustering, classification

|2016

|{{cite journal | last1 = Harper | first1 = F. Maxwell | last2 = Konstan | first2 = Joseph A. | year = 2015 | title = The MovieLens Datasets: History and Context | journal = ACM Transactions on Interactive Intelligent Systems | volume = 5 | issue = 4| page = 19 | doi = 10.1145/2827872 | s2cid = 16619709 }}

|GroupLens Research

Yahoo! Music User Ratings of Musical Artists

|Over 10M ratings of artists by Yahoo users.

|None described.

|~ 10M

|Text

|Clustering, regression

|2004

|{{cite book |doi=10.1145/2043932.2043964 |chapter=Yahoo! Music recommendations: Modeling music ratings with temporal dynamics and item taxonomy |title=Proceedings of the fifth ACM conference on Recommender systems |date=2011 |last1=Koenigstein |first1=Noam |last2=Dror |first2=Gideon |last3=Koren |first3=Yehuda |pages=165–172 |isbn=978-1-4503-0683-6 }}{{cite book |doi=10.1145/2187980.2188222 |chapter=The million song dataset challenge |title=Proceedings of the 21st International Conference on World Wide Web |date=2012 |last1=McFee |first1=Brian |last2=Bertin-Mahieux |first2=Thierry |last3=Ellis |first3=Daniel P.W. |last4=Lanckriet |first4=Gert R.G. |pages=909–916 |isbn=978-1-4503-1230-1 }}

|Yahoo!

Car Evaluation Data Set

|Car properties and their overall acceptability.

|Six categorical features given.

|1728

|Text

|Classification

|1997

|Bohanec, Marko, and Vladislav Rajkovic. "[https://www.researchgate.net/profile/Marko_Bohanec/publication/246614940_KNOWLEDGE_ACQUISITION_AND_EXPLANATION_FOR_MULTI-ATTRIBUTE_DECISION_MAKING/links/02e7e532152f452d87000000.pdf Knowledge acquisition and explanation for multi-attribute decision making]." 8th Intl Workshop on Expert Systems and their Applications. 1988. Tan, Peter J., and David L. Dowe. "[http://www.csse.monash.edu.au/~dld/Publications/2002/Tan+Dowe2002_MMLDecisionGraphs.ps MML inference of decision graphs with multi-way joins]." Australian Joint Conference on Artificial Intelligence. 2002.

|M. Bohanec

YouTube Comedy Slam Preference Dataset

|User vote data for pairs of videos shown on YouTube. Users voted on funnier videos.

|Video metadata given.

|1,138,562

|Text

|Classification

|2012

|{{Cite web | url = https://metatext.io/datasets | title = Quantifying comedy on YouTube: why the number of o's in your LOL matter | website = Metatext NLP Database | access-date = 2020-10-26 }}{{Cite book|chapter-url=https://link.springer.com/chapter/10.1007/978-3-642-32692-9_63|doi=10.1007/978-3-642-32692-9_63|chapter=A Classifier for Big Data|title=Convergence and Hybrid Information Technology|series=Communications in Computer and Information Science|year=2012|last1=Kim|first1=Byung Joo|volume=310|pages=505–512|isbn=978-3-642-32691-2}}

|Google

Skytrax User Reviews Dataset

|User reviews of airlines, airports, seats, and lounges from Skytrax.

|Ratings are fine-grained and include many aspects of the airport experience.

|41396

|Text

|Classification, regression

|2015

|{{cite journal | last1 = Pérezgonzález | first1 = Jose D. | last2 = Gilbey | first2 = Andrew | year = 2011 | title = Predicting Skytrax airport rankings from customer reviews | url = https://www.ingentaconnect.com/content/hsp/cam/2011/00000005/00000004/art00007| journal = Journal of Airport Management | volume = 5 | issue = 4| pages = 335–339 | doi = 10.69554/RFZC4321 }}

|Q. Nguyen

Teaching Assistant Evaluation Dataset

|Teaching assistant reviews.

|Features of each instance such as class, class size, and instructor are given.

|151

|Text

|Classification

|1997

|Loh, Wei-Yin, and Yu-Shan Shih. "[http://www3.stat.sinica.edu.tw/statistica/oldpdf/A7n41.pdf Split selection methods for classification trees]." Statistica Sinica (1997): 815–840. {{cite journal | last1 = Lim | first1 = Tjen-Sien | last2 = Loh | first2 = Wei-Yin | last3 = Shih | first3 = Yu-Shan | year = 2000 | title = A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms | journal = Machine Learning | volume = 40 | issue = 3| pages = 203–228 | doi=10.1023/a:1007608224229| s2cid = 17030953 }}

|W. Loh et al.

Vietnamese Students’ Feedback Corpus (UIT-VSFC)

|Students’ Feedback.

|Comments

|16,000

|Text

|Classification

|2018

|{{cite book |doi=10.1109/KSE.2018.8573337 |chapter=UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis |title=2018 10th International Conference on Knowledge and Systems Engineering (KSE) |date=2018 |last1=Nguyen |first1=Kiet Van |last2=Nguyen |first2=Vu Duc |last3=Nguyen |first3=Phu X. V. |last4=Truong |first4=Tham T. H. |last5=Nguyen |first5=Ngan Luu-Thuy |pages=19–24 |isbn=978-1-5386-6113-0 }}

|Nguyen et al.

Vietnamese Social Media Emotion Corpus (UIT-VSMEC)

|Users’ Facebook Comments.

|Comments

|6,927

|Text

|Classification

|2020

|{{Cite book|chapter-url=https://link.springer.com/chapter/10.1007/978-981-15-6168-9_27|doi = 10.1007/978-981-15-6168-9_27|chapter = Emotion Recognition for Vietnamese Social Media Text|title = Computational Linguistics|series = Communications in Computer and Information Science|year = 2020|last1 = Ho|first1 = Vong Anh|last2 = Nguyen|first2 = Duong Huynh-Cong|last3 = Nguyen|first3 = Danh Hoang|last4 = Pham|first4 = Linh Thi-Van|last5 = Nguyen|first5 = Duc-Vu|last6 = Nguyen|first6 = Kiet Van|last7 = Nguyen|first7 = Ngan Luu-Thuy|volume = 1215|pages = 319–333|arxiv = 1911.09339|isbn = 978-981-15-6167-2|s2cid = 208202333}}

|Nguyen et al.

Vietnamese Open-domain Complaint Detection dataset (ViOCD)

|Customer product reviews

|Comments

|5,485

|Text

|Classification

|2021

|{{cite arXiv |author=Nhung Thi-Hong Nguyen, Phuong Ha-Dieu Phan, Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen|eprint=2104.11969 |title=Vietnamese Open-domain Complaint Detection in E-Commerce Websites|date=24 April 2021|class=cs.CL }}

|Nguyen et al.

ViHOS: Hate Speech Spans Detection for Vietnamese

|Social Media Texts

|Comments

|Containing 26k spans on 11k comments

|Text

|Span Detection

|2021

|{{cite arXiv |author=Phu Gia Hoang, Canh Duc Luu, Khanh Quoc Tran, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen|eprint=2301.10186 |title=ViHOS: Hate Speech Spans Detection for Vietnamese|date=26 January 2023|class=cs.CL }}

|Hoang et al.
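Rating datasets such as Netflix Prize and MovieLens are very sparse: most user-item pairs have no rating. As a worked example, the sketch below computes the fraction of filled cells in the user-item matrix from the counts listed in the table above. The helper name `rating_density` is hypothetical, and the MovieLens figures are the rounded values from the table, so its result is approximate.

```python
def rating_density(n_ratings: int, n_users: int, n_items: int) -> float:
    """Fraction of the user-item matrix that holds an observed rating."""
    return n_ratings / (n_users * n_items)


# Figures as listed in the table above.
netflix = rating_density(100_480_507, 480_189, 17_770)
movielens = rating_density(22_000_000, 240_000, 33_000)

print(f"Netflix Prize density: {netflix:.4%}")
print(f"MovieLens density:     {movielens:.4%}")
```

Both densities come out well under 2%, which is why recommender-system methods evaluated on these datasets are generally designed for sparse input.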

= News articles =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

NYSK Dataset

|English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn.

|Filtered and presented in XML format.

|10,421

|XML, text

|Sentiment analysis, topic extraction

|2013

|{{cite conference | last1=Dermouche | first1=Mohamed | last2=Velcin | first2=Julien | last3=Khouas | first3=Leila | last4=Loudcher | first4=Sabine | title=2014 IEEE International Conference on Data Mining | chapter=A Joint Model for Topic-Sentiment Evolution over Time | publisher=IEEE | year=2014 | pages=773–778 | isbn=978-1-4799-4302-9 | doi=10.1109/icdm.2014.82 }}

|Dermouche, M. et al.

The Reuters Corpus Volume 1

|Large corpus of Reuters news stories in English.

|Fine-grained categorization and topic codes.

|810,000

|Text

|Classification, clustering, summarization

|2002

|{{cite journal | last1 = Rose | first1 = Tony | last2 = Stevenson | first2 = Mark | last3 = Whitehead | first3 = Miles | title = The Reuters Corpus Volume 1-from Yesterday's News to Tomorrow's Language Resources | journal = LREC | volume = 2 | year = 2002 | s2cid = 9239414 }}

|Reuters

The Reuters Corpus Volume 2

|Large corpus of Reuters news stories in multiple languages.

|Fine-grained categorization and topic codes.

|487,000

|Text

|Classification, clustering, summarization

|2005

|{{cite journal | last1=Amini | first1=Massih R. | last2=Usunier | first2=Nicolas | last3=Goutte | first3=Cyril | title=Learning from Multiple Partially Observed Views – an Application to Multilingual Text Categorization | url=http://papers.nips.cc/paper/3690-learning-from-multiple-partially-observed-views-an-application-to-multilingual-text-categorization | year=2009 | pages=28–36 |journal=Advances in Neural Information Processing Systems| volume=22 }}

|Reuters

Thomson Reuters Text Research Collection

|Large corpus of news stories.

|Details not described.

|1,800,370

|Text

|Classification, clustering, summarization

|2009

|{{cite conference |last=Liu |first=Ming |display-authors=etal |url=https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/download/10903/10990 |title=VRCA: a clustering algorithm for massive amount of texts |book-title=Proceedings of the 24th International Conference on Artificial Intelligence |publisher=AAAI Press |year=2015 |access-date=6 August 2019 |archive-date=5 November 2021 |archive-url=https://web.archive.org/web/20211105004605/https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/download/10903/10990 |url-status=dead }}

|T. Rose et al.

Saudi Newspapers Corpus

|31,030 Arabic newspaper articles.

|Metadata extracted.

|31,030

|JSON

|Summarization, clustering

|2015

|{{cite conference |last1=Al-Harbi |first1=S |last2=Almuhareb |first2=A |last3=Al-Thubaity |first3=A |last4=Khorsheed |first4=M. S. |last5=Al-Rajeh |first5=A |year=2008 |title=Automatic Arabic Text Classification |book-title=Proceedings of the 9th International Conference on the Statistical Analysis of Textual Data, Lyon, France}}

|M. Alhagri

RE3D (Relationship and Entity Extraction Evaluation Dataset)

|Entity- and relation-annotated data from various news and government sources. Sponsored by Dstl.

|Filtered, categorisation using Baleen types

|not known

|JSON

|Classification, Entity and Relation recognition

|2017

|{{Cite web | url=https://github.com/dstl/re3d | title=Relationship and Entity Extraction Evaluation Dataset: Dstl/re3d| website=GitHub| date=2018-12-17}}

|Dstl

Examiner Spam Clickbait Catalogue

|Clickbait, spam, crowd-sourced headlines from 2010 to 2015

|Publish date and headlines

|3,089,781

|CSV

|Clustering, Events, Sentiment

|2016

|{{Cite web | url=https://www.kaggle.com/therohk/examine-the-examiner | title=The Examiner – SpamClickBait Catalogue}}

|R. Kulkarni

ABC Australia News Corpus

|Entire news corpus of ABC Australia from 2003 to 2019

|Publish date and headlines

|1,186,018

|CSV

|Clustering, Events, Sentiment

|2020

|{{Cite web | url=https://www.kaggle.com/therohk/million-headlines | title=A Million News Headlines}}

|R. Kulkarni

Worldwide News – Aggregate of 20K Feeds

|One-week snapshot of all online headlines in 20+ languages

|Publish time, URL and headlines

|1,398,431

|CSV

|Clustering, Events, Language Detection

|2018

|{{Cite web | url=https://www.kaggle.com/therohk/global-news-week | title=One Week of Global News Feeds}}

|R. Kulkarni

Reuters News Wire Headline

|11 years of timestamped events published on the news-wire

|Publish time, Headline Text

|16,121,310

|CSV

|NLP, Computational Linguistics, Events

|2018

|{{Citation | title=Reuters News-Wire Archive|doi = 10.7910/DVN/XDB74W|year = 2018|last1 = Kulkarni|first1 = Rohit|publisher = Harvard Dataverse}}

|R. Kulkarni

The Irish Times Ireland News Corpus

|24 years of Ireland news from 1996 to 2019

|Publish time, Headline Category and Text

|1,484,340

|CSV

|NLP, Computational Linguistics, Events

|2020

|{{Cite web | url=https://www.kaggle.com/therohk/ireland-historical-news | title=IrishTimes – the Waxy-Wany News}}

|R. Kulkarni

News Headlines Dataset for Sarcasm Detection

|High-quality dataset of sarcastic and non-sarcastic news headlines.

|Clean, normalized text

|26,709

|JSON

|NLP, Classification, Linguistics

|2018

|{{Cite web|url=https://kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection|title=News Headlines Dataset For Sarcasm Detection|website=kaggle.com|access-date=2019-04-27}}

|Rishabh Misra
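Several corpora in the table above are distributed as CSV files holding a publish date and a headline. The sketch below counts headlines per publication year from such a file. The column names `publish_date` and `headline_text`, the `YYYYMMDD` date layout, and the sample rows are assumptions made for illustration; the actual files should be checked before relying on them.

```python
import csv
import io
from collections import Counter

# Invented sample rows; a real file would be opened from disk instead.
SAMPLE_CSV = """publish_date,headline_text
20030219,council approves new water plan
20030220,storm damages coastal road
20040105,farmers welcome summer rain
"""


def headlines_per_year(csv_file) -> Counter:
    """Count headlines per publication year from a date/headline CSV."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        year = int(row["publish_date"][:4])  # first four digits are the year
        counts[year] += 1
    return counts


counts = headlines_per_year(io.StringIO(SAMPLE_CSV))
```

Reading row by row with `csv.DictReader` keeps memory use flat, which matters for the larger corpora listed here with millions of headlines.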

= Messages =

style="width: 100%" class="wikitable sortable"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Enron Corpus

|Emails from employees at Enron organized into folders.

|Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com.

|~ 500,000

|Text

|Network analysis, sentiment analysis

|2004 (2015)

|Klimt, Bryan, and Yiming Yang. "[https://bklimt.com/papers/2004_klimt_ceas.pdf Introducing the Enron Corpus]." CEAS. 2004.{{cite arXiv | eprint=0806.3201 | last1=Kossinets | first1=Gueorgi | last2=Kleinberg | first2=Jon | last3=Watts | first3=Duncan | title=The Structure of Information Pathways in a Social Communication Network | year=2008 | class=physics.soc-ph }}

|Klimt, B. and Y. Yang

Ling-Spam Dataset

|Corpus containing both legitimate and spam emails.

|Four versions of the corpus, depending on whether a lemmatiser or stop-list was enabled.

|2,412 ham, 481 spam

|Text

|Classification

|2000

|{{Cite conference |arxiv=cs/0006013 |last1=Androutsopoulos |first1=Ion |last2=Koutsias |first2=John |last3= Chandrinos |first3=Konstantinos V. |last4=Paliouras |first4=George |last5= Spyropoulos |first5=Constantine D. |year=2000 |title=An evaluation of Naive Bayesian anti-spam filtering |book-title=Proceedings of the Workshop on Machine Learning in the New Information Age |conference=11th European Conference on Machine Learning, Barcelona, Spain |editor1-first=G. |editor1-last=Potamias |editor2-first=V. |editor2-last=Moustakis |editor3-first=M. |editor3-last=van Someren |volume=11 |pages=9–17 |bibcode=2000cs........6013A}}{{cite journal | last1 = Bratko | first1 = Andrej | display-authors = et al | year = 2006 | title = Spam filtering using statistical data compression models | url =http://www.jmlr.org/papers/volume7/bratko06a/bratko06a.pdf | journal = The Journal of Machine Learning Research | volume = 7 | pages = 2673–2698 }}

|Androutsopoulos, J. et al.

SMS Spam Collection Dataset

|Collected SMS spam messages.

|None.

|5,574

|Text

|Classification

|2011

|Almeida, Tiago A., José María G. Hidalgo, and Akebo Yamakami. "[http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/doceng11.pdf Contributions to the study of SMS spam filtering: new collection and results]." Proceedings of the 11th ACM symposium on Document engineering. ACM, 2011.{{cite journal | last1 = Delany | first1 = Sarah Jane | last2 = Buckley | first2 = Mark | last3 = Greene | first3 = Derek | year = 2012 | title = SMS spam filtering: methods and data | url = https://arrow.dit.ie/cgi/viewcontent.cgi?article=1022&context=scschcomart| journal = Expert Systems with Applications | volume = 39 | issue = 10| pages = 9899–9908 | doi=10.1016/j.eswa.2012.02.053| s2cid = 15546924 }}

|T. Almeida et al.

Twenty Newsgroups Dataset

|Messages from 20 different newsgroups.

|None.

|20,000

|Text

|Natural language processing

|1999

|Joachims, Thorsten. [https://apps.dtic.mil/dtic/tr/fulltext/u2/a307731.pdf A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization]. No. CMU-CS-96-118. Carnegie Mellon University, Pittsburgh, PA, Department of Computer Science, 1996.

|T. Mitchell et al.

Spambase Dataset

|Spam emails.

|Many text features extracted.

|4,601

|Text

|Spam detection, classification

|1999

|Dimitrakakis, Christos, and Samy Bengio. [https://infoscience.epfl.ch/record/82788/files/rr02-28.pdf Online Policy Adaptation for Ensemble Algorithms]. No. EPFL-REPORT-82788. IDIAP, 2002.

|M. Hopkins et al.
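Spam corpora are usually imbalanced, so raw accuracy has to be read against a trivial baseline. As a worked example, the sketch below computes the accuracy of always predicting the majority class, using the Ling-Spam class counts listed in the table above. The helper name `majority_baseline_accuracy` is hypothetical.

```python
def majority_baseline_accuracy(class_counts: dict) -> float:
    """Accuracy of always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


ling_spam = {"ham": 2412, "spam": 481}  # counts from the table above
baseline = majority_baseline_accuracy(ling_spam)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

A classifier that labels every message as legitimate already scores about 83% on this corpus, so metrics such as precision and recall on the spam class are more informative than accuracy alone.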

= Twitter and tweets =

style="width: 100%" class="wikitable sortable"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

MovieTweetings

|Movie rating dataset based on public and well-structured tweets

|~710,000

|Text

|Classification, regression

|2018

|Dooms, S. et al. "Movietweetings: a movie rating dataset collected from twitter, 2013. Available from https://github.com/sidooms/MovieTweetings."

|S. Dooms

Twitter100k

|Pairs of images and tweets

|100,000

|Text and Images

|Cross-media retrieval

|2017

|{{Cite arXiv|title=Twitter100k: A Real-world Dataset for Weakly Supervised Cross-Media Retrieval|eprint = 1703.06618|last1 = RoyChowdhury|first1 = Aruni|last2=Lin|first2=Tsung-Yu|last3=Maji|first3=Subhransu|last4=Learned-Miller|first4=Erik|class=cs.CV|year=2017}}{{Cite web|url=https://github.com/huyt16/Twitter100k|title=huyt16/Twitter100k|website=GitHub|language=en|access-date=2018-03-26}}

|Y. Hu, et al.

Sentiment140

|Tweet data from 2009 including original text, time stamp, user and sentiment.

|Classified using distant supervision based on the presence of emoticons in the tweet.

|1,578,627

|Tweets, comma-separated values

|Sentiment analysis

|2009

|{{cite journal | last1 = Go | first1 = Alec | last2 = Bhayani | first2 = Richa | last3 = Huang | first3 = Lei | year = 2009 | title = Twitter sentiment classification using distant supervision | journal = CS224N Project Report, Stanford | volume = 1 | page = 12 }}Chikersal, Prerna, Soujanya Poria, and Erik Cambria. "[https://www.aclweb.org/anthology/S15-2108 SeNTU: sentiment analysis of tweets by combining a rule-based classifier with supervised learning]." Proceedings of the International Workshop on Semantic Evaluation, SemEval. 2015.

|A. Go et al.

ASU Twitter Dataset

|Twitter network data, not actual tweets. Shows connections between a large number of users.

|None.

|11,316,811 users, 85,331,846 connections

|Text

|Clustering, graph analysis

|2009

|Zafarani, Reza, and Huan Liu. "Social computing data repository at ASU." School of Computing, Informatics and Decision Systems Engineering, Arizona State University (2009).

|R. Zafarani et al.

SNAP Social Circles: Twitter Database

|Large Twitter network data.

|Node features, circles, and ego networks.

|1,768,149

|Text

|Clustering, graph analysis

|2012

|{{cite journal | last1 = McAuley | first1 = Julian J. | last2 = Leskovec | first2 = Jure | title = Learning to Discover Social Circles in Ego Networks | journal = NIPS | volume = 2012 | page = 2012 }}{{cite journal | last1 = Šubelj | first1 = Lovro | last2 = Fiala | first2 = Dalibor | last3 = Bajec | first3 = Marko | title = Network-based statistical comparison of citation topology of bibliographic databases | journal = Scientific Reports | volume = 4 | issue = 6496| pages = 6496 | year = 2014 | doi=10.1038/srep06496| pmid = 25263231 | pmc = 4178292 | arxiv = 1502.05061 | bibcode = 2014NatSR...4.6496S }}

|J. McAuley et al.

Twitter Dataset for Arabic Sentiment Analysis

|Arabic tweets.

|Samples hand-labeled as positive or negative.

|2000

|Text

|Classification

|2014

|Abdulla, N., et al. "Arabic sentiment analysis: Corpus-based and lexicon-based." Proceedings of the IEEE conference on Applied Electrical Engineering and Computing Technologies (AEECT). 2013.{{cite journal |last1=Abooraig |first1=Raddad |last2=Al-Zu'bi |first2=Shadi |last3=Kanan |first3=Tarek |last4=Hawashin |first4=Bilal |last5=Al Ayoub |first5=Mahmoud |last6=Hmeidi |first6=Ismail |title=Automatic categorization of Arabic articles based on their political orientation |journal=Digital Investigation |date=June 2018 |volume=25 |pages=24–41 |doi=10.1016/j.diin.2018.04.003 }}

|N. Abdulla

Buzz in Social Media Dataset

|Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites.

|Data is windowed so that the user can attempt to predict the events leading up to social media buzz.

|140,000

|Text

|Regression, Classification

|2013

|Kawala, François, et al. "[https://hal.archives-ouvertes.fr/hal-00881395/document Prédictions d'activité dans les réseaux sociaux en ligne]." 4ième conférence sur les modèles et l'analyse des réseaux: Approches mathématiques et informatiques. 2013.{{cite arXiv |eprint=1601.00024|last1=Sabharwal|first1=Ashish|title=Selecting Near-Optimal Learners via Incremental Data Allocation|last2=Samulowitz|first2=Horst|last3=Tesauro|first3=Gerald|class=cs.LG|year=2015}}

|F. Kawala et al.

Paraphrase and Semantic Similarity in Twitter (PIT)

|This dataset focuses on whether pairs of tweets have (almost) the same meaning or information. Manually labeled.

|Tokenization, part-of-speech tagging, and named entity tagging

|18,762

|Text

|Regression, Classification

|2015

|Xu et al. "[https://www.aclweb.org/anthology/S15-2001 SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT)]" Proceedings of the 9th International Workshop on Semantic Evaluation. 2015. Xu et al. "[https://transacl.org/ojs/index.php/tacl/article/viewFile/498/64 Extracting Lexically Divergent Paraphrases from Twitter]" Transactions of the Association for Computational Linguistics (TACL). 2014.

|Xu et al.

Geoparse Twitter benchmark dataset

|This dataset contains tweets posted during different news events in different countries. Location mentions are manually labeled.

|Location annotations added to JSON metadata

|6,386

|Tweets, JSON

|Classification, Information Extraction

|2014

|{{cite journal|doi=10.1109/MIS.2013.126|title=Real-Time Crisis Mapping of Natural Disasters Using Social Media|journal=IEEE Intelligent Systems|volume=29|issue=2|pages=9–17|year=2014|last1=Middleton|first1=Stuart E|last2=Middleton|first2=Lee|last3=Modafferi|first3=Stefano|s2cid=15139204|url=https://eprints.soton.ac.uk/370581/1/ieee-is2014.pdf}}{{Cite web|url=https://pypi.org/project/geoparsepy|title=geoparsepy|year = 2016}} Python PyPI library

|S.E. Middleton et al.

Sarcasm, Perceived and Intended, by Reactive Supervision (SPIRS)

|Intended and perceived sarcastic tweets along with their context collected using reactive supervision; an equal number of negative (non-sarcastic) samples

|30,000

|Tweet IDs, CSV

|Classification

|2020

|{{Cite book |last1=Shmueli |first1=Boaz |last2=Ku |first2=Lun-Wei |last3=Ray |first3=Soumya |title=Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) |chapter=Reactive Supervision: A New Method for Collecting Sarcasm Data |date=2020 |chapter-url=https://aclanthology.org/2020.emnlp-main.201/ |publisher=Association for Computational Linguistics |pages=2553–2559 |doi=10.18653/v1/2020.emnlp-main.201|s2cid=221970454 }}{{Cite web |last=Shmueli |first=Boaz |title=SPIRS Sarcasm Dataset |url=https://github.com/bshmueli/SPIRS |website=GitHub}}

|B. Shmueli et al.

Dutch Social media collection

|This dataset contains COVID-19 tweets made by Dutch speakers or users from the Netherlands. The data has been machine labeled.

|Classified for sentiment; tweet text and user description translated to English; industry mentions extracted.

|271,342

|JSONL

|Sentiment, multi-label classification, machine translation

|2020

|{{cite web| title=Dutch social media collection| author=Gupta, Aakash| url=https://huggingface.co/datasets/dutch_social/blob/main/dutch_social.py| publisher=COVID-19 Data Hub| date=2020| access-date=11 November 2023| doi=10.5072/FK2/MTPTL7}}{{Cite web|title=Streamlit|url=https://huggingface.co/datasets/viewer/?dataset=dutch_social|access-date=2020-12-18|website=huggingface.co}}{{Cite web|title=Dutch Social media collection|url=https://kaggle.com/skylord/dutch-tweets|access-date=2020-12-18|website=kaggle.com|language=en}}

|Aakash Gupta, CoronaWhy

ReactionGIF dataset

|A dataset of 30K tweets and their GIF reactions

|Classified for sentiment, reaction, and emotion

|30,000

|Tweet IDs, JSONL

|Sentiment, reaction, and emotion classification

|2021

|{{Cite book |last1=Shmueli |first1=Boaz |last2=Ray |first2=Soumya |last3=Ku |first3=Lun-Wei |title=Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) |chapter=Happy Dance, Slow Clap: Using Reaction GIFs to Predict Induced Affect on Twitter |date=2021 | url=https://aclanthology.org/2021.acl-short.50 |publisher=Association for Computational Linguistics |pages=395–401 |doi=10.18653/v1/2021.acl-short.50|s2cid=235125510 }}{{Citation |last=Shmueli |first=Boaz |title=ReactionGIF |date=2023-05-05 |url=https://github.com/bshmueli/ReactionGIF |access-date=2023-10-06}}

|B. Shmueli et al.

= Dialogues =

style="width: 100%" class="wikitable sortable"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

NPS Chat Corpus

|Posts from age-specific online chat rooms.

|Hand privacy masked, tagged for part of speech and dialogue-act.

|~ 500,000

|XML

|NLP, programming, linguistics

|2007

|Forsyth, E., Lin, J., & Martell, C. (2008, June 25). The NPS Chat Corpus. Retrieved from http://faculty.nps.edu/cmartell/NPSChat.htm

|Forsyth, E., Lin, J., & Martell, C.

Twitter Triple Corpus

|A-B-A triples extracted from Twitter.

|4,232

|Text

|NLP

|2016

|{{cite arXiv | eprint=1506.06714 | last1=Sordoni | first1=Alessandro | last2=Galley | first2=Michel | last3=Auli | first3=Michael | last4=Brockett | first4=Chris | last5=Ji | first5=Yangfeng | last6=Mitchell | first6=Margaret | last7=Nie | first7=Jian-Yun | last8=Gao | first8=Jianfeng | last9=Dolan | first9=Bill | title=A Neural Network Approach to Context-Sensitive Generation of Conversational Responses | year=2015 | class=cs.CL }}

|Sordoni, A. et al.

UseNet Corpus

|UseNet forum postings.

|Anonymized e-mails and URLs. Omitted documents with lengths <500 words or >500,000 words, or that were <90% English.

|7 billion

|Text

|2011

|Shaoul, C. & Westbury C. (2013) A reduced redundancy USENET corpus (2005–2011) Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html)

|Shaoul, C., & Westbury C.

NUS SMS Corpus

|SMS messages collected between two users, with timing analysis.

|~ 10,000

|XML

|NLP

|2011

|KAN, M. (2011, January). NUS Short Message Service (SMS) Corpus. Retrieved from http://www.comp.nus.edu.sg/entrepreneurship/innovation/osr/corpus/ {{Webarchive|url=https://web.archive.org/web/20180629055042/http://www.comp.nus.edu.sg/entrepreneurship/innovation/osr/corpus/ |date=29 June 2018 }}

|KAN, M

Reddit All Comments Corpus

|All Reddit comments (as of 2015).

|~ 1.7 billion

|JSON

|NLP, research

|2015

|Stuck_In_the_Matrix. (2015, July 3). I have every publicly available Reddit comment for research. ~ 1.7 billion comments @ 250 GB compressed. Any interest in this? [Original post]. Message posted to https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/

|Stuck_In_the_Matrix

Ubuntu Dialogue Corpus

|Dialogues extracted from Ubuntu chat stream on IRC.

|930 thousand dialogues, 7.1 million utterances

|CSV

|Dialogue Systems Research

|2015

|{{cite arXiv | eprint=1506.08909 | last1=Lowe | first1=Ryan | last2=Pow | first2=Nissan | last3=Serban | first3=Iulian | last4=Pineau | first4=Joelle | title=The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems | year=2015 | class=cs.CL }}

|Lowe, R. et al.

Dialog State Tracking Challenge

|The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems.

|Transcription of spoken dialogs with labelling

|DSTC2 contains ~3.2k calls – DSTC3 contains ~2.3k calls

|JSON

|Dialogue state tracking

|2014

|Jason Williams, Antoine Raux, Matthew Henderson, "[https://www.microsoft.com/en-us/research/publication/the-dialog-state-tracking-challenge-series-a-review/ The Dialog State Tracking Challenge Series: A Review]", Dialogue & Discourse, April 2016.

|Henderson, Matthew and Thomson, Blaise and Williams, Jason D

= Legal =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

FreeLaw

|Filtered data from Court Listener, part of the FreeLaw project.

|Cleaned and normalized text

|4,940,710

|JSON

|NLP, linguistics

|2020

|{{Citation |last=Hoppe |first=Travis |title=The-Pile-FreeLaw |date=2021-12-16 |url=https://github.com/thoppe/The-Pile-FreeLaw |access-date=2023-01-11}}

|T. Hoppe

Pile of Law

|Corpus of legal and administrative data

|Cleaned, normalized, and privatized

|~50,000,000

|JSON

|NLP, linguistics, sentiment

|2022

|{{Cite book |last1=Zheng |first1=Lucia |last2=Guha |first2=Neel |last3=Anderson |first3=Brandon R. |last4=Henderson |first4=Peter |last5=Ho |first5=Daniel E. |title=Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law |chapter=When does pretraining help? |date=2021-06-21 |chapter-url=http://dx.doi.org/10.1145/3462757.3466088 |pages=159–168 |location=New York, NY, USA |publisher=ACM |doi=10.1145/3462757.3466088|isbn=9781450385268 |s2cid=233296302 }}{{Cite web |title=pile-of-law/pile-of-law · Datasets at Hugging Face |url=https://huggingface.co/datasets/pile-of-law/pile-of-law |access-date=2023-01-11 |website=huggingface.co|date=4 July 2022 }}

|L. Zheng; N. Guha; B. Anderson; P. Henderson; D. Ho

Caselaw Access Project

|All official, book-published state and federal United States case law — every volume or case designated as an official report of decisions by a court within the United States.

|Cleaned and normalized text

|~10,000

|JSON

|NLP, linguistics

|2022

|{{Cite web |title=About {{!}} Caselaw Access Project |url=https://case.law/about/ |access-date=2023-01-11 |website=case.law |language=en}}

|A. Aizman; S. Chapman; J. Cushman; K. Dulin; H. Eidolon; et al.

= Other text =

style="width: 100%" class="wikitable sortable"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Hansard French-English

|The Canadian Hansard records.

|2,869,040 French-English sentence pairs in 46.3 million words of French and 38.6 million words of English (IBM portion), and 60 million words (Bell portion)

|French-English sentence pairs

|Translation

|1995

|{{Citation |last=Roukos, Salim |title=Hansard French/English |date=1995 |url=https://catalog.ldc.upenn.edu/LDC95T20 |access-date=2025-02-26 |publisher=Linguistic Data Consortium |doi=10.35111/JHGN-RV21 |last2=Graff, David |last3=Melamed, Dan}}

|IBM, Bell Labs

Web of Science Dataset

|Hierarchical Datasets for Text Classification

|None.

|46,985

|Text

|Classification, categorization

|2017

|K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification", 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364–371. doi:10.1109/ICMLA.2017.0-134 K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "Web of Science Dataset", {{doi|10.17632/9rw3vkcfy4.6}}

|K. Kowsari et al.

Legal Case Reports

|Federal Court of Australia cases from 2006 to 2009.

|None.

|4,000

|Text

|Summarization, citation analysis

|2012

|Galgani, Filippo, Paul Compton, and Achim Hoffmann. "[https://www.aclweb.org/anthology/W12-0515 Combining different summarization techniques for legal text]." Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data. Association for Computational Linguistics, 2012.{{cite journal | last1 = Nagwani | first1 = N. K. | year = 2015 | title = Summarizing large text collection using topic modeling and clustering based on MapReduce framework | journal = Journal of Big Data | volume = 2 | issue = 1| pages = 1–18 | doi=10.1186/s40537-015-0020-5| doi-access = free }}

|F. Galgani et al.

Blogger Authorship Corpus

|Blog entries of 19,320 people from blogger.com.

|Blogger self-provided gender, age, industry, and astrological sign.

|681,288

|Text

|Sentiment analysis, summarization, classification

|2006

|{{cite journal | last1 = Schler | first1 = Jonathan | display-authors = et al | title = Effects of Age and Gender on Blogging | url = https://www.aaai.org/Papers/Symposia/Spring/2006/SS-06-03/SS06-03-039.pdf | journal = AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs | volume = 6 | year = 2006 | access-date = 6 August 2019 | archive-date = 14 November 2020 | archive-url = https://web.archive.org/web/20201114000329/https://www.aaai.org/Papers/Symposia/Spring/2006/SS-06-03/SS06-03-039.pdf | url-status = dead }} Anand, Pranav, et al. "Believe Me-We Can Do This! Annotating Persuasive Acts in Blog Text." Computational Models of Natural Argument. 2011.

|J. Schler et al.

Social Structure of Facebook Networks

|Large dataset of the social structure of Facebook.

|None.

|100 colleges covered

|Text

|Network analysis, clustering

|2012

|Traud, Amanda L., Peter J. Mucha, and Mason A. Porter. "Social structure of Facebook networks." Physica A: Statistical Mechanics and its Applications 391.16 (2012): 4165–4180.{{cite arXiv |eprint=1206.6474|last1=Richard|first1=Emile|title=Estimation of Simultaneously Sparse and Low Rank Matrices|last2=Savalle|first2=Pierre-Andre|last3=Vayatis|first3=Nicolas|class=cs.DS|year=2012}}

|A. Traud et al.

Dataset for the Machine Comprehension of Text

|Stories and associated questions for testing comprehension of text.

|None.

|660

|Text

|Natural language processing, machine comprehension

|2013

|{{cite journal | last1 = Richardson | first1 = Matthew | last2 = Burges | first2 = Christopher JC | last3 = Renshaw | first3 = Erin | title = MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text | url = https://www.aclweb.org/anthology/D13-1020| journal = EMNLP | volume = 1 | year = 2013 }}{{cite arXiv |eprint=1502.05698|last1=Weston|first1=Jason|title=Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks|last2=Bordes|first2=Antoine|last3=Chopra|first3=Sumit|last4= Rush|first4=Alexander M.|author5=Bart van Merriënboer|last6=Joulin|first6=Armand|last7=Mikolov|first7=Tomas|class=cs.AI|year=2015}}

|M. Richardson et al.

The Penn Treebank Project

|Naturally occurring text annotated for linguistic structure.

|Text is parsed into syntactic trees.

|~ 1M words

|Text

|Natural language processing, summarization

|1995

|{{cite journal | last1 = Marcus | first1 = Mitchell P. | last2 = Ann Marcinkiewicz | first2 = Mary | last3 = Santorini | first3 = Beatrice | year = 1993 | title = Building a large annotated corpus of English: The Penn Treebank | url = http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports | journal = Computational Linguistics | volume = 19 | issue = 2| pages = 313–330 }}{{cite journal | last1 = Collins | first1 = Michael | year = 2003 | title = Head-driven statistical models for natural language parsing | journal = Computational Linguistics | volume = 29 | issue = 4| pages = 589–637 | doi=10.1162/089120103322753356| doi-access = free }}

|M. Marcus et al.

DEXTER Dataset

|Task given is to determine, from features given, which articles are about corporate acquisitions.

|Features extracted include word stems. Distractor features included.

|2600

|Text

|Classification

|2008

|Guyon, Isabelle, et al., eds. [https://books.google.com/books?id=FOTzBwAAQBAJ&q=DEXTER Feature extraction: foundations and applications]. Vol. 207. Springer, 2008.

|Reuters

Google Books N-grams

|N-grams from a very large corpus of books

|None.

|2.2 TB of text

|Text

|Classification, clustering, regression

|2011

|Lin, Yuri, et al. "[https://www.aclweb.org/anthology/P/P12/P12-3029.pdf Syntactic annotations for the google books ngram corpus]." Proceedings of the ACL 2012 system demonstrations. Association for Computational Linguistics, 2012.{{cite journal | last1 = Krishnamoorthy | first1 = Niveda | display-authors = et al | title = Generating Natural-Language Video Descriptions Using Text-Mined Knowledge | url = https://www.aaai.org/ocs/index.php/AAAI/AAAI13/paper/download/6454/7204 | journal = AAAI | volume = 1 | year = 2013 | access-date = 6 August 2019 | archive-date = 6 August 2019 | archive-url = https://web.archive.org/web/20190806022756/https://www.aaai.org/ocs/index.php/AAAI/AAAI13/paper/download/6454/7204 | url-status = dead }}

|Google

Personae Corpus

|Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays.

|In addition to normal texts, syntactically annotated texts are given.

|145

|Text

|Classification, regression

|2008

|{{cite book |last1=Luyckx |first1=Kim |last2=Daelemans |first2=Walter |chapter=Personae: a corpus for author and personality prediction from text |hdl=10067/687330151162165141 |title=Proceedings of LREC-2008, the Sixth International Language Resources and Evaluation Conference |date=2008 |isbn=978-2-9517408-4-6 }}Solorio, Thamar, Ragib Hasan, and Mainul Mizan. "[https://www.aclweb.org/anthology/W13-1107 A case study of sockpuppet detection in wikipedia]." Workshop on Language Analysis in Social Media (LASM) at NAACL HLT. 2013.

|K. Luyckx et al.

PushShift

|Archives of social media websites, including Reddit, Twitter, and Hackernews.

|Text extracted and normalized from WARCs

|~100,000,000 posts

|JSON

|NLP, sentiment, linguistics

|2022

|{{Cite web |title=Pushshift Files |url=https://files.pushshift.io/ |access-date=2023-01-12 |website=files.pushshift.io |archive-date=12 January 2023 |archive-url=https://web.archive.org/web/20230112015822/https://files.pushshift.io/ |url-status=dead }}{{Cite arXiv |last1=Baumgartner |first1=Jason |last2=Zannettou |first2=Savvas |last3=Keegan |first3=Brian |last4=Squire |first4=Megan |last5=Blackburn |first5=Jeremy |date=2020-01-23 |title=The Pushshift Reddit Dataset |class=cs.SI |eprint=2001.08435 }}

|J. Baumgartner

[https://sraf.nd.edu/sec-edgar-data/ SEC Filings]

|EDGAR company filings

|Text extracted.

|CSV

|NLP

CNAE-9 Dataset

|Categorization task for free text descriptions of Brazilian companies.

|Word frequency has been extracted.

|1080

|Text

|Classification

|2012

|{{cite book |doi=10.1109/ISDA.2009.9 |chapter=Agglomeration and Elimination of Terms for Dimensionality Reduction |title=2009 Ninth International Conference on Intelligent Systems Design and Applications |date=2009 |last1=Ciarelli |first1=Patrick Marques |last2=Oliveira |first2=Elias |pages=547–552 |isbn=978-1-4244-4735-0 }}{{cite journal |last1=Zhou |first1=Mingyuan |last2=Padilla |first2=Oscar Hernan Madrid |last3=Scott |first3=James G. |title=Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes |journal=Journal of the American Statistical Association |date=2 July 2016 |volume=111 |issue=515 |pages=1144–1156 |doi=10.1080/01621459.2015.1075407 |arxiv=1404.3331 }}

|P. Ciarelli et al.

Sentiment Labeled Sentences Dataset

|3000 sentiment labeled sentences.

|Sentiment of each sentence has been hand labeled as positive or negative.

|3000

|Text

|Classification, sentiment analysis

|2015

|Kotzias, Dimitrios, et al. "[http://datalab.ics.uci.edu/papers/kdd2015_dimitris.pdf From group to individual labels using deep features]." Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.{{cite arXiv |eprint=1602.08033|last1=Ning|first1=Yue|title=Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning|last2=Muthiah|first2=Sathappan|last3=Rangwala|first3=Huzefa|last4=Ramakrishnan|first4=Naren|class=cs.SI|year=2016}}

|D. Kotzias

BlogFeedback Dataset

|Dataset to predict the number of comments a post will receive based on features of that post.

|Many features of each post extracted.

|60,021

|Text

|Regression

|2014

|Buza, Krisztian. "[http://www.cs.bme.hu/~buza/pdfs/gfkl2012_blogs.pdf Feedback prediction for blogs]." Data analysis, machine learning and knowledge discovery. Springer International Publishing, 2014. 145–152.{{cite journal | last1 = Soysal | first1 = Ömer M | year = 2015 | title = Association rule mining with mostly associated sequential patterns | journal = Expert Systems with Applications | volume = 42 | issue = 5| pages = 2582–2592 | doi=10.1016/j.eswa.2014.10.049}}

|K. Buza

[https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/ PubMed Central]

|PubMed® comprises more than 35 million citations for biomedical literature from MEDLINE, life science journals, and online books.

|None

|35 Million

|Text

|NLP

[https://bulkdata.uspto.gov USPTO]

|The United States Patent and Trademark Office

|Text

|NLP

[https://github.com/thoppe/The-Pile-PhilPapers PhilPapers]

|Open access collection of philosophy publications

|Text

|NLP

[https://github.com/soskek/bookcorpus Book Corpus]

|A popular large-scale text corpus.

|None

|Text

|NLP

|2015

|Zhu, Yukun, et al. "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books." Proceedings of the IEEE international conference on computer vision. 2015.

|Zhu, Yukun, et al.

Stanford Natural Language Inference (SNLI) Corpus

|Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs.

|Entailment class labels, syntactic parsing by the Stanford PCFG parser

|570,000

|Text

|Natural language inference/recognizing textual entailment

|2015

|{{cite arXiv | eprint=1508.05326 | last1=Bowman | first1=Samuel R. | last2=Angeli | first2=Gabor | last3=Potts | first3=Christopher | last4=Manning | first4=Christopher D. | title=A large annotated corpus for learning natural language inference | year=2015 | class=cs.CL }}

|S. Bowman et al.

DSL Corpus Collection (DSLCC)

|A multilingual collection of short excerpts of journalistic texts in similar languages and dialects.

|None

|294,000 phrases

|Text

|Discriminating between similar languages

|2017

|{{Cite web|url=http://ttg.uni-saarland.de/resources/DSLCC/|title=DSL Corpus Collection|website=ttg.uni-saarland.de|access-date=2017-09-22}}

|Tan, Liling et al.

Urban Dictionary Dataset

|Corpus of words, votes and definitions

|User names anonymised

|2,580,925

|CSV

|NLP, Machine comprehension

|May 2016

|{{Cite web | url=https://www.kaggle.com/therohk/urban-dictionary-words-dataset | title=Urban Dictionary Words and Definitions}}

|Anonymous

T-REx

|Wikipedia abstracts aligned with Wikidata entities

|Alignment of Wikidata triples with Wikipedia abstracts

|11M aligned triples

|JSON and NIF [https://hadyelsahar.github.io/t-rex/]

|NLP, Relation Extraction

|2018

|H. Elsahar, P. Vougiouklis, A. Remaci, C. Gravier, J. Hare, F. Laforest, E. Simperl, "[https://www.aclweb.org/anthology/L18-1544 T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples]", Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).

| H. Elsahar et al.

{{anchor|GLUE}}General Language Understanding Evaluation (GLUE)

|Benchmark of nine tasks

|Various

|~1M sentences and sentence pairs

|NLU

|2018

|{{Cite arXiv | eprint=1804.07461 | last1=Wang | first1=Alex | last2=Singh | first2=Amanpreet | last3=Michael | first3=Julian | last4=Hill | first4=Felix | last5=Levy | first5=Omer | last6=Bowman | first6=Samuel R. | title=GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding | year=2018 | class=cs.CL }}{{cite magazine |title=Computers Are Learning to Read—But They're Still Not So Smart |url=https://www.wired.com/story/computers-are-learning-to-read-but-theyre-still-not-so-smart/ |access-date=29 December 2019 |magazine=Wired |language=en}}{{Cite web|url=https://gluebenchmark.com/|title=GLUE Benchmark|website=gluebenchmark.com|language=en|access-date=2019-02-25}}

| Wang et al.

Contract Understanding Atticus Dataset (CUAD) (formerly known as Atticus Open Contract Dataset (AOK))

|Dataset of legal contracts with rich expert annotations

|~13,000 labels

|CSV and PDF

|Natural language processing, QnA

|2021

|[https://www.atticusprojectai.org/cuad The Atticus Project]

Vietnamese Image Captioning Dataset (UIT-ViIC)

|Vietnamese Image Captioning Dataset

|19,250 captions for 3,850 images

|CSV and PDF

|Natural language processing, Computer vision

|2020

|{{cite web| last1=Quan |first1=Hoang Lam |last2=Quang |first2=Duy Le |last3=Van Kiet |first3=Nguyen |last4=Ngan |first4=Luu-Thuy Nguyen. |url=https://www.springerprofessional.de/uit-viic-a-dataset-for-the-first-evaluation-on-vietnamese-image-/18612672 |title=UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning}}

|Lam et al.

Vietnamese Names annotated with Genders (UIT-ViNames)

|Vietnamese Names annotated with Genders

|26,850 Vietnamese full names annotated with genders

|CSV

|Natural language processing

|2020

|{{cite book| last1=To |first1=Quoc Huy |last2=Nguyen |first2=Van Kiet |last3=Nguyen |first3= Luu Thuy Ngan |last4=Nguyen |first4=Gia Tuan Anh |title=Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval|chapter=Gender Prediction Based on Vietnamese Names with Machine Learning Techniques |year=2020 |pages=55–60 |doi=10.1145/3443279.3443309 |arxiv=2010.10852 |isbn=9781450377607 |s2cid=224814110 }}

|To et al.

Vietnamese Constructive and Toxic Speech Detection Dataset (UIT-ViCTSD)

|Vietnamese Constructive and Toxic Speech Detection Dataset

|10,000 Vietnamese users' comments on online newspapers on 10 domains

|CSV

|Natural Language Processing

|2021

|{{cite book|last1=Nguyen|first1=Luan Thanh|last2=Van Nguyen|first2=Kiet|last3=Nguyen|first3=Ngan Luu-Thuy|date=2021-03-18|title=Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices|chapter=Constructive and Toxic Speech Detection for Open-Domain Social Media Comments in Vietnamese|series=Lecture Notes in Computer Science|volume=12798|pages=572–583|doi=10.1007/978-3-030-79457-6_49|arxiv=2103.10069|isbn=978-3-030-79456-9|s2cid=232269671}}

|Nguyen et al.

[https://github.com/deepmind/pg19 PG-19]

|A set of books extracted from the Project Gutenberg books library

|Text

|Natural Language Processing

|2019

|Jack W. Rae et al.

[https://github.com/deepmind/mathematics_dataset Deepmind Mathematics]

|Mathematical question and answer pairs.

|Text

|Natural Language Processing

|2018

|Saxton, David, et al. "Analysing Mathematical Reasoning Abilities of Neural Models." International Conference on Learning Representations. 2018.

|D. Saxton et al.

[https://annas-archive.org/datasets Anna's Archive]

|A comprehensive archive of published books and papers

|None

|100,356,641

|Text, epub, PDF

|Natural Language Processing

|2024

Sound data

These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.

= Speech =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Switchboard-1

|Conversational speech over telephone.

|260 hours of speech, from 543 speakers (302 male, 241 female) from across the United States, for around 2,400 two-sided telephone conversations, collected by Texas Instruments in 1990-1991.

|audio, text transcript, word-level timestamps, phonetic transcriptions

|speech recognition, phonetic transcription.

|1992 (2000)

|{{Cite book |last1=Godfrey |first1=J.J. |last2=Holliman |first2=E.C. |last3=McDaniel |first3=J. |chapter=SWITCHBOARD: Telephone speech corpus for research and development |date=1992 |pages=517–520 vol.1 |title=[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing |chapter-url=http://dx.doi.org/10.1109/icassp.1992.225858 |publisher=IEEE |doi=10.1109/icassp.1992.225858|isbn=0-7803-0532-9 }}{{Cite web |title=Switchboard-1 Release 2 - Linguistic Data Consortium |url=https://catalog.ldc.upenn.edu/LDC97S62 |access-date=2024-11-30 |website=catalog.ldc.upenn.edu |language=en}}

|NIST

Hub5'00

|Conversational speech over telephone.

|260 hours of speech, from 543 speakers (302 male, 241 female) from across the United States, for around 2,400 two-sided telephone conversations, at ~3 million words. Collected by Texas Instruments in 1990-1991.

|audio, text transcript, word-level timestamps, phonetic transcriptions

|speech recognition, phonetic transcription. The most commonly used test set for this dataset is called "Hub5'00".

|1992 (2000)

|NIST

Zero Resource Speech Challenge 2015

|Spontaneous speech (English), Read speech (Xitsonga).

|None, raw WAV files.

|English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers

|WAV (audio only)

|Unsupervised discovery of speech features/subword units/word units

|2015

|M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, A. Jansen, and E. Dupoux (2015). "The Zero Resource Speech Challenge 2015," in INTERSPEECH-2015.M. Versteegh, X. Anguera, A. Jansen, and E. Dupoux, (2016). "[https://core.ac.uk/download/pdf/82574050.pdf The Zero Resource Speech Challenge 2015: Proposed Approaches and Results]," in SLTU-2016.

|Versteegh et al.

Parkinson Speech Dataset

|Multiple recordings of people with and without Parkinson's Disease.

|Voice features extracted, disease scored by physician using unified Parkinson's disease rating scale.

|1,040

|Text

|Classification, regression

|2013

|{{cite journal | last1 = Sakar | first1 = Betul Erdogdu | display-authors = et al | year = 2013 | title = Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings | journal = IEEE Journal of Biomedical and Health Informatics| volume = 17 | issue = 4| pages = 828–834 | doi=10.1109/jbhi.2013.2245674| pmid = 25055311 | s2cid = 15491516 }}{{cite book |doi=10.1109/ICASSP.2014.6854516 |chapter=Automatic detection of expressed emotion in Parkinson's Disease |title=2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |date=2014 |last1=Zhao |first1=Shunan |last2=Rudzicz |first2=Frank |last3=Carvalho |first3=Leonardo G. |last4=Marquez-Chin |first4=Cesar |last5=Livingstone |first5=Steven |pages=4813–4817 |isbn=978-1-4799-2893-4 }}

|B. E. Sakar et al.

Spoken Arabic Digits

|Spoken Arabic digits from 44 male and 44 female speakers.

|Time-series of mel-frequency cepstrum coefficients.

|8,800

|Text

|Classification

|2010

|{{cite book |last1=Hammami |first1=Nacereddine |last2=Bedda |first2=Mouldi |chapter=Improved tree model for arabic speech recognition |title=2010 3rd International Conference on Computer Science and Information Technology |date=July 2010 |pages=521–526 |doi=10.1109/ICCSIT.2010.5563892 |isbn=978-1-4244-5537-9 }}Maaten, Laurens. "[https://lvdmaaten.github.io/publications/papers/ICML_2011.pdf Learning discriminative fisher kernels]." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.

|M. Bedda et al.

ISOLET Dataset

|Spoken letter names.

|Features extracted from sounds.

|7797

|Text

|Classification

|1994

|Cole, Ronald, and Mark Fanty. "[https://www.aclweb.org/anthology/H90-1075 Spoken letter recognition]." Proc. Third DARPA Speech and Natural Language Workshop. 1990.{{cite journal | last1 = Chapelle | first1 = Olivier | last2 = Sindhwani | first2 = Vikas | last3 = Keerthi | first3 = Sathiya S. | year = 2008 | title = Optimization techniques for semi-supervised support vector machines | url =http://www.jmlr.org/papers/volume9/chapelle08a/chapelle08a.pdf | journal = The Journal of Machine Learning Research | volume = 9 | pages = 203–233 }}

|R. Cole et al.

Japanese Vowels Dataset

|Nine male speakers uttered two Japanese vowels successively.

|12-degree linear prediction analysis applied to obtain a discrete-time series with 12 cepstrum coefficients.

|640

|Text

|Classification

|1999

|{{cite journal |last1=Kudo |first1=Mineichi |last2=Toyama |first2=Jun |last3=Shimbo |first3=Masaru |title=Multidimensional curve classification using passing-through regions |journal=Pattern Recognition Letters |date=November 1999 |volume=20 |issue=11–13 |pages=1103–1111 |doi=10.1016/s0167-8655(99)00077-x |bibcode=1999PaReL..20.1103K }}{{cite journal |last1=Jaeger |first1=Herbert |last2=Lukoševičius |first2=Mantas |last3=Popovici |first3=Dan |last4=Siewert |first4=Udo |title=Optimization and applications of echo state networks with leaky- integrator neurons |journal=Neural Networks |date=April 2007 |volume=20 |issue=3 |pages=335–352 |doi=10.1016/j.neunet.2007.04.016 |pmid=17517495 }}

|M. Kudo et al.

Parkinson's Telemonitoring Dataset

|Multiple recordings of people with and without Parkinson's Disease.

|Sound features extracted.

|5875

|Text

|Classification

|2009

|{{cite journal |last1=Tsanas |first1=A. |last2=Little |first2=M.A. |last3=McSharry |first3=P.E. |last4=Ramig |first4=L.O. |title=Accurate Telemonitoring of Parkinson's Disease Progression by Noninvasive Speech Tests |journal=IEEE Transactions on Biomedical Engineering |date=April 2010 |volume=57 |issue=4 |pages=884–893 |doi=10.1109/tbme.2009.2036000 |pmid=19932995 }}{{cite journal | last1 = Clifford | first1 = Gari D. | last2 = Clifton | first2 = David | year = 2012 | title = Wireless technology in disease management and medicine | journal = Annual Review of Medicine | volume = 63 | pages = 479–492 | doi=10.1146/annurev-med-051210-114650| pmid = 22053737 }}

|A. Tsanas et al.

TIMIT

|Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences.

|Speech is lexically and phonemically transcribed.

|6300

|Text

|Speech recognition, classification.

|1986

|{{cite journal | last1 = Zue | first1 = Victor | last2 = Seneff | first2 = Stephanie | last3 = Glass | first3 = James | year = 1990 | title = Speech database development at MIT: TIMIT and beyond | journal = Speech Communication | volume = 9 | issue = 4| pages = 351–356 | doi=10.1016/0167-6393(90)90010-7}}{{cite book |doi=10.1109/ICASSP.1993.319349 |chapter=MMI training for continuous phoneme recognition on the TIMIT database |title=IEEE International Conference on Acoustics Speech and Signal Processing |date=1993 |last1=Kapadia |first1=S. |last2=Valtchev |first2=V. |last3=Young |first3=S.J. |pages=491–494 vol.2 |isbn=0-7803-0946-4 }}

|J. Garofolo et al.

Arabic Speech Corpus

|A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level.

|Speech is orthographically and phonetically transcribed with stress marks.

|~1900

|Text, WAV

|Speech Synthesis, Speech Recognition, Corpus Alignment, Speech Therapy, Education.

|2016

|{{cite thesis |last=Halabi |first=Nawar |year=2016 |title=Modern Standard Arabic Phonetics for Speech Synthesis |url=http://en.arabicspeechcorpus.com/Nawar%20Halabi%20PhD%20Thesis%20Revised.pdf |type=PhD Thesis |publisher=University of Southampton, School of Electronics and Computer Science}}

|N. Halabi

Common Voice

|A public domain database of crowdsourced data across a wide range of dialects.

|Validation by other users.

|English: 1,118 hours

|MP3 with corresponding text files

|Speech recognition

|June 2017 (December 2019)

|{{cite arXiv | last1=Ardila | first1=Rosana | last2=Branson | first2=Megan | last3=Davis | first3=Kelly | last4=Henretty | first4=Michael | last5=Kohler | first5=Michael | last6=Meyer | first6=Josh | last7=Morais | first7=Reuben | last8=Saunders | first8=Lindsay | last9=Tyers | first9=Francis M. | last10=Weber | first10=Gregor | title=Common Voice: A Massively-Multilingual Speech Corpus | date=Dec 13, 2019 | class=cs.CL | eprint=1912.06670v2 }}

|Mozilla

LJSpeech

|A single-speaker corpus of English public-domain audiobook recordings, split into short clips at punctuation marks.

|Quality check, normalized transcription alongside the original.

|13,100

|CSV, WAV

|Speech synthesis

|2017

|{{Cite web |title=The LJ Speech Dataset |url=https://keithito.com/LJ-Speech-Dataset |access-date=2022-04-13 |website=keithito.com}}

|Keith Ito, Linda Johnson

Arabic Speech Commands Dataset

|Collected from 30 contributors and grouped into 40 keywords.

|Raw WAV files

|12,000

|WAV, CSV

|Speech recognition, keyword spotting

|2021

|{{cite journal |last1=Ghandoura |first1=Abdulkader |last2=Hjabo |first2=Farouk |last3=Al Dakkak |first3=Oumayma |title=Building and benchmarking an Arabic Speech Commands dataset for small-footprint keyword spotting |journal=Engineering Applications of Artificial Intelligence |date=June 2021 |volume=102 |pages=104267 |doi=10.1016/j.engappai.2021.104267 }}

|Abdulkader Ghandoura

= Music =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Geographic Origin of Music Data Set

|Audio features of music samples from different locations.

|Audio features extracted using MARSYAS software.

|1,059

|Text

|Geographic classification, clustering

|2014

|{{cite book |doi=10.1109/ICDM.2014.73 |chapter=Predicting the Geographical Origin of Music |title=2014 IEEE International Conference on Data Mining |date=2014 |last1=Zhou |first1=Fang |last2=Claire |first2=Q. |last3=King |first3=Ross D. |pages=1115–1120 |isbn=978-1-4799-4302-9 }}{{cite journal | last1 = Saccenti | first1 = Edoardo | last2 = Camacho | first2 = José | year = 2015 | title = On the use of the observation-wise k-fold operation in PCA cross-validation | journal = Journal of Chemometrics | volume = 29 | issue = 8| pages = 467–478 | doi=10.1002/cem.2726| hdl = 10481/55302 | s2cid = 62248957 | hdl-access = free }}

|F. Zhou et al.

Million Song Dataset

|Audio features from one million different songs.

|Audio features extracted.

|1M

|Text

|Classification, clustering

|2011

|Bertin-Mahieux, Thierry, et al. "The million song dataset." ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, 24–28 October 2011, Miami, Florida. University of Miami, 2011.{{cite journal | last1 = Henaff | first1 = Mikael | display-authors = et al | title = Unsupervised learning of sparse features for scalable audio classification | url =https://archives.ismir.net/ismir2011/paper/000128.pdf | journal = ISMIR | volume = 11 | year = 2011 }}

|T. Bertin-Mahieux et al.

MUSDB18

|Multi-track popular music recordings

|Raw audio

|150

|MP4, WAV

|Source Separation

|2017

|{{Cite book | doi=10.5281/zenodo.1117372| title=MUSDB18 – a corpus for music separation | year=2017 | last1=Rafii | first1=Zafar | chapter=Music }}

|Z. Rafii et al.

Free Music Archive

|Creative Commons-licensed audio from 100k songs (343 days, 1 TiB), with a hierarchy of 161 genres, metadata, user data, and free-form text.

|Raw audio and audio features.

|106,574

|Text, MP3

|Classification, recommendation

|2017

|{{cite arXiv|last1=Defferrard|first1=Michaël|last2=Benzi|first2=Kirell|last3=Vandergheynst|first3=Pierre|last4=Bresson|first4=Xavier|date=6 December 2016|title=FMA: A Dataset For Music Analysis|eprint=1612.01840|class=cs.SD}}

|M. Defferrard et al.

Bach Choral Harmony Dataset

|Bach chorale chords.

|Audio features extracted.

|5665

|Text

|Classification

|2014

|{{cite journal | last1 = Esposito | first1 = Roberto | last2 = Radicioni | first2 = Daniele P. | year = 2009 | title = Carpediem: Optimizing the viterbi algorithm and applications to supervised sequential learning | url =http://www.jmlr.org/papers/volume10/esposito09a/esposito09a.pdf | journal = The Journal of Machine Learning Research | volume = 10 | pages = 1851–1880 }}{{cite journal | last1 = Sourati | first1 = Jamshid | display-authors = et al | year = 2016 | title = Classification Active Learning Based on Mutual Information | journal = Entropy | volume = 18 | issue = 2| page = 51 | doi=10.3390/e18020051| bibcode = 2016Entrp..18...51S | doi-access = free }}

|D. Radicioni et al.

= Other sounds =

style="width: 100%" class="wikitable sortable"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

UrbanSound

|Labeled recordings of sounds such as air conditioners, car horns and children playing.

|Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file.

|1,059

|Sound (WAV)

|Classification

|2014

|Salamon, Justin; Jacoby, Christopher; Bello, Juan Pablo. "[https://www.researchgate.net/profile/Justin_Salamon/publication/267269056_A_Dataset_and_Taxonomy_for_Urban_Sound_Research/links/544936af0cf2f63880810a84/A-Dataset-and-Taxonomy-for-Urban-Sound-Research.pdf A dataset and taxonomy for urban sound research]." Proceedings of the ACM International Conference on Multimedia. ACM, 2014.{{cite arXiv |eprint=1502.00141|last1=Lagrange|first1=Mathieu|title=An evaluation framework for event detection using a morphological model of acoustic scenes| last2=Lafay|first2=Grégoire|last3=Rossignol|first3=Mathias|last4=Benetos|first4=Emmanouil|last5=Roebel|first5=Axel|class=stat.ML|year=2015}}

|J. Salamon et al.

{{anchor|AudioSet}}AudioSet

|10-second sound snippets from YouTube videos, and an ontology of over 500 labels.

|128-d PCA'd VGG-ish features every 1 second.

|2,084,320

|Text (CSV) and TensorFlow Record files

|Classification

|2017

|Gemmeke, Jort F., et al. "Audio Set: An ontology and human-labeled dataset for audio events." IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2017.

|J. Gemmeke et al., Google

Bird Audio Detection challenge

|Audio from environmental monitoring stations, plus crowdsourced recordings

|17,000+

|Classification

|2016 (2018)

|{{cite news |title=Watch out, birders: Artificial intelligence has learned to spot birds from their songs |url=https://www.science.org/content/article/watch-out-birders-artificial-intelligence-has-learned-spot-birds-their-songs |access-date=22 July 2018 |work=Science {{!}} AAAS |date=18 July 2018 |language=en}}{{cite web |title=Bird Audio Detection challenge |url=http://machine-listening.eecs.qmul.ac.uk/bird-audio-detection-challenge/ |website=Machine Listening Lab at Queen Mary University |access-date=22 July 2018 |date=3 May 2016}}

|Queen Mary University and IEEE Signal Processing Society

WSJ0 Hipster Ambient Mixtures

|Audio from WSJ0 mixed with noise recorded in the San Francisco Bay Area

|Noise clips matched to WSJ0 clips

|28,000

|Sound (WAV)

|Audio source separation

|2019

|{{cite arXiv | eprint=1907.01160 | last1=Wichern | first1=Gordon | last2=Antognini | first2=Joe | last3=Flynn | first3=Michael | author4=Licheng Richard Zhu | last5=McQuinn | first5=Emmett | last6=Crow | first6=Dwight | last7=Manilow | first7=Ethan | author8=Jonathan Le Roux | title=WHAM!: Extending Speech Separation to Noisy Environments | year=2019 | class=cs.SD }}

|Wichern, G., et al., Whisper and MERL

Clotho

|4,981 audio samples, each 15 to 30 seconds long, with five different captions of eight to 20 words per sample.

|24,905

|Sound (WAV) and text (CSV)

|Automated audio captioning

|2020

|Drossos, K., Lipping, S., and Virtanen, T. "Clotho: An Audio Captioning Dataset." IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2020. Drossos, K., Lipping, S., and Virtanen, T. (2019). Clotho dataset (Version 1.0) [Data set]. Zenodo. [http://doi.org/10.5281/zenodo.3490684 http://doi.org/10.5281/zenodo.3490684]

|K. Drossos, S. Lipping, and T. Virtanen

Signal data

Datasets containing electrical signal information that requires signal processing before further analysis.

= Electrical =

style="width: 100%" class="wikitable sortable"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Witty Worm Dataset

|Dataset detailing the spread of the Witty worm and the infected computers.

|Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers.

|55,909 IP addresses

|Text

|Classification

|2004

|The CAIDA UCSD Dataset on the Witty Worm – 19–24 March 2004, http://www.caida.org/data/passive/witty_worm_dataset.xml {{cite journal |last1=Chen |first1=Zesheng |last2=Ji |first2=Chuanyi |title=Optimal worm-scanning method using vulnerable-host distributions |journal=International Journal of Security and Networks |date=2007 |volume=2 |issue=1/2 |pages=71 |doi=10.1504/IJSN.2007.012826 }}

|Center for Applied Internet Data Analysis

Cuff-Less Blood Pressure Estimation Dataset

|Cleaned vital signals from human patients which can be used to estimate blood pressure.

|125 Hz vital signs have been cleaned.

|12,000

|Text

|Classification, regression

|2015

|{{cite book |doi=10.1109/ISCAS.2015.7168806 |chapter=Cuff-less high-accuracy calibration-free blood pressure estimation using pulse transit time |title=2015 IEEE International Symposium on Circuits and Systems (ISCAS) |date=2015 |last1=Kachuee |first1=Mohamad |last2=Kiani |first2=Mohammad Mahdi |last3=Mohammadzade |first3=Hoda |last4=Shabany |first4=Mahdi |pages=1006–1009 |isbn=978-1-4799-8391-9 }}{{cite journal |last1=Goldberger |first1=Ary L. |last2=Amaral |first2=Luis A. N. |last3=Glass |first3=Leon |last4=Hausdorff |first4=Jeffrey M. |last5=Ivanov |first5=Plamen Ch. |last6=Mark |first6=Roger G. |last7=Mietus |first7=Joseph E. |last8=Moody |first8=George B. |last9=Peng |first9=Chung-Kang |last10=Stanley |first10=H. Eugene |title=PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals |journal=Circulation |date=13 June 2000 |volume=101 |issue=23 |pages=E215-20 |doi=10.1161/01.CIR.101.23.e215 |pmid=10851218 }}

|M. Kachuee et al.

Gas Sensor Array Drift Dataset

|Measurements from 16 chemical sensors utilized in simulations for drift compensation.

|Extensive number of features given.

|13,910

|Text

|Classification

|2012

|{{cite journal | last1 = Vergara | first1 = Alexander | display-authors = et al | year = 2012 | title = Chemical gas sensor drift compensation using classifier ensembles | journal = Sensors and Actuators B: Chemical | volume = 166 | pages = 320–329 | doi=10.1016/j.snb.2012.01.074| bibcode = 2012SeAcB.166..320V }}{{cite journal | last1 = Korotcenkov | first1 = G. | last2 = Cho | first2 = B. K. | year = 2014 | title = Engineering approaches to improvement of conductometric gas sensor parameters. Part 2: Decrease of dissipated (consumable) power and improvement stability and reliability | journal = Sensors and Actuators B: Chemical | volume = 198 | pages = 316–341 | doi=10.1016/j.snb.2014.03.069| bibcode = 2014SeAcB.198..316K }}

|A. Vergara

Servo Dataset

|Data covering the nonlinear relationships observed in a servo-amplifier circuit.

|Levels of various components as a function of other components are given.

|167

|Text

|Regression

|1993

|{{cite journal | last1 = Quinlan | first1 = John R | title = Learning with continuous classes | url =https://sci2s.ugr.es/keel/pdf/algorithm/congreso/1992-Quinlan-AI.pdf | journal = 5th Australian Joint Conference on Artificial Intelligence | volume = 92 | year = 1992 }}{{cite journal | last1 = Merz | first1 = Christopher J. | last2 = Pazzani | first2 = Michael J. | year = 1999 | title = A principal components approach to combining regression estimates | journal = Machine Learning | volume = 36 | issue = 1–2| pages = 9–32 | doi=10.1023/a:1007507221352| doi-access = free }}

|K. Ullrich

UJIIndoorLoc-Mag Dataset

|Indoor localization database to test indoor positioning systems. Data is magnetic field based.

|Train and test splits given.

|40,000

|Text

|Classification, regression, clustering

|2015

|Torres-Sospedra, Joaquin, et al. "UJIIndoorLoc-Mag: A new database for magnetic field-based localization problems." Indoor Positioning and Indoor Navigation (IPIN), 2015 International Conference on. IEEE, 2015. Berkvens, Rafael, Maarten Weyn, and Herbert Peremans. "[https://www.researchgate.net/profile/Raf_Berkvens/publication/284154212_Mean_Mutual_Information_of_Probabilistic_Wi-Fi_Localization/links/564c6b7508aeab8ed5e92fcb.pdf Mean Mutual Information of Probabilistic Wi-Fi Localization]." Indoor Positioning and Indoor Navigation (IPIN), 2015 International Conference on. Banff, Canada: IPIN. 2015.

|D. Rambla et al.

Sensorless Drive Diagnosis Dataset

|Electrical signals from motors with defective components.

|Statistical features extracted.

|58,508

|Text

|Classification

|2015

|Paschke, Fabian, et al. "Sensorlose Zustandsüberwachung an Synchronmotoren." Proceedings. 23. Workshop Computational Intelligence, Dortmund, 5.-6. Dezember 2013. KIT Scientific Publishing, 2013. Lessmeier, Christian, et al. "[https://www.researchgate.net/profile/Olaf_Enge-Rosenblatt/publication/264441239_Data_Acquisition_and_Signal_Analysis_from_Measured_Motor_Currents_for_Defect_Detection_in_Electromechanical_Drive_Systems/links/53df97e90cf2a768e49bb3b9.pdf Data Acquisition and Signal Analysis from Measured Motor Currents for Defect Detection in Electromechanical Drive Systems]."

|M. Bator

= Motion-tracking =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Wearable Computing: Classification of Body Postures and Movements (PUC-Rio)

|People performing five standard actions while wearing motion trackers.

|None.

|165,632

|Text

|Classification

|2013

|Ugulino, Wallace, et al. "[http://groupware.secondlab.inf.puc-rio.br/public/papers/2012.Ugulino.WearableComputing.HAR.Classifier.RIBBON.pdf Wearable computing: Accelerometers’ data classification of body postures and movements] {{Webarchive|url=https://web.archive.org/web/20200925222906/http://groupware.secondlab.inf.puc-rio.br/public/papers/2012.Ugulino.WearableComputing.HAR.Classifier.RIBBON.pdf |date=25 September 2020 }}." Advances in Artificial Intelligence-SBIA 2012. Springer Berlin Heidelberg, 2012. 52–61.{{cite journal | last1 = Schneider | first1 = Jan | display-authors = et al | year = 2015 | title = Augmenting the senses: a review on sensor-based learning support | journal = Sensors | volume = 15 | issue = 2| pages = 4097–4133 | doi=10.3390/s150204097| pmid = 25679313 | pmc = 4367401 | bibcode = 2015Senso..15.4097S | doi-access = free }}

|Pontifical Catholic University of Rio de Janeiro

Gesture Phase Segmentation Dataset

|Features extracted from video of people doing various gestures.

|Features extracted aim at studying gesture phase segmentation.

|9900

|Text

|Classification, clustering

|2014

|Madeo, Renata CB, Clodoaldo AM Lima, and Sarajane M. Peres. "[https://tarjomefa.com/wp-content/uploads/2016/11/5781-English.pdf Gesture unit segmentation using support vector machines: segmenting gestures from rest positions]." Proceedings of the 28th Annual ACM Symposium on Applied Computing. ACM, 2013.{{cite journal | last1 = Lun | first1 = Roanna | last2 = Zhao | first2 = Wenbing | year = 2015 | title = A survey of applications and human motion recognition with Microsoft Kinect | url = https://engagedscholarship.csuohio.edu/cgi/viewcontent.cgi?article=1417&context=enece_facpub| journal = International Journal of Pattern Recognition and Artificial Intelligence | volume = 29 | issue = 5| page = 1555008 | doi=10.1142/s0218001415550083}}

|R. Madeo et al.

Vicon Physical Action Data Set

|10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker.

|Many parameters recorded by 3D tracker.

|3000

|Text

|Classification

|2011

|{{cite book |doi=10.1109/ROBIO.2007.4522190 |chapter=Action classification of 3D human models using dynamic ANNs for mobile robot surveillance |title=2007 IEEE International Conference on Robotics and Biomimetics (ROBIO) |date=2007 |last1=Theodoridis |first1=Theodoros |last2=Huosheng Hu |pages=371–376 |isbn=978-1-4244-1761-2 }}{{cite book |doi=10.1109/ICICISYS.2009.5357690 |chapter=3D human action recognition and style transformation using resilient backpropagation neural networks |title=2009 IEEE International Conference on Intelligent Computing and Intelligent Systems |date=2009 |last1=Etemad |first1=Seyed Ali |last2=Arya |first2=Ali |pages=296–301 |isbn=978-1-4244-4754-1 }}

|T. Theodoridis

Daily and Sports Activities Dataset

|Motion sensor data for 19 daily and sports activities.

|Many sensors given, no preprocessing done on signals.

|9120

|Text

|Classification

|2013

|{{cite journal | last1 = Altun | first1 = Kerem | last2 = Barshan | first2 = Billur | last3 = Tunçel | first3 = Orkun | year = 2010 | title = Comparative study on classifying human activities with miniature inertial and magnetic sensors | journal = Pattern Recognition | volume = 43 | issue = 10| pages = 3605–3620 | doi=10.1016/j.patcog.2010.04.019| bibcode = 2010PatRe..43.3605A | hdl = 11693/11947 | hdl-access = free }}{{cite journal | last1 = Nathan | first1 = Ran |author-link1=Ran Nathan | display-authors = et al | year = 2012 | title = Using tri-axial acceleration data to identify behavioral modes of free-ranging animals: general concepts and tools illustrated for griffon vultures | journal = The Journal of Experimental Biology | volume = 215 | issue = 6| pages = 986–996 | doi=10.1242/jeb.058602| pmid = 22357592 | pmc = 3284320 | bibcode = 2012JExpB.215..986N }}

|B. Barshan et al.

Human Activity Recognition Using Smartphones Dataset

|Gyroscope and accelerometer data from people wearing smartphones and performing normal actions.

|Actions performed are labeled, all signals preprocessed for noise.

|10,299

|Text

|Classification

|2012

|Anguita, Davide, et al. "[https://upcommons.upc.edu/bitstream/handle/2117/101769/IWAAL2012.pdf Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine]." Ambient assisted living and home care. Springer Berlin Heidelberg, 2012. 216–223.{{cite journal | last1 = Su | first1 = Xing | last2 = Tong | first2 = Hanghang | last3 = Ji | first3 = Ping | year = 2014 | title = Activity recognition with smartphone sensors | journal = Tsinghua Science and Technology | volume = 19 | issue = 3| pages = 235–249 | doi=10.1109/tst.2014.6838194| s2cid = 62751498 }}

|J. Reyes-Ortiz et al.

Australian Sign Language Signs

|Australian sign language signs captured by motion-tracking gloves.

|None.

|2565

|Text

|Classification

|2002

|Kadous, Mohammed Waleed. [https://pdfs.semanticscholar.org/4bad/c3f0ad169ed9ec7d073375e9b168fa9f6c8f.pdf Temporal classification: Extending the classification paradigm to multivariate time series]. Diss. The University of New South Wales, 2002. Graves, Alex, et al. "[https://mediatum.ub.tum.de/doc/1292048/file.pdf Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks]." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.

|M. Kadous

Weight Lifting Exercises monitored with Inertial Measurement Units

|Five variations of the biceps curl exercise monitored with IMUs.

|Some statistics calculated from raw data.

|39,242

|Text

|Classification

|2013

|Velloso, Eduardo, et al. "[https://www.perceptualui.org/publications/velloso13_ah.pdf Qualitative activity recognition of weight lifting exercises]." Proceedings of the 4th Augmented Human International Conference. ACM, 2013. Mortazavi, Bobak Jack, et al. "[http://www.thehabitslab.com/assets/papers/28.pdf Determining the single best axis for exercise repetition recognition and counting on smartwatches] {{Webarchive|url=https://web.archive.org/web/20211104043511/https://www.thehabitslab.com/assets/papers/28.pdf |date=4 November 2021 }}." Wearable and Implantable Body Sensor Networks (BSN), 2014 11th International Conference on. IEEE, 2014.

|W. Ugulino et al.

sEMG for Basic Hand movements Dataset

|Two databases of surface electromyographic signals of 6 hand movements.

|None.

|3000

|Text

|Classification

|2014

|Sapsanis, Christos, et al. "[https://www.researchgate.net/profile/Christos_Sapsanis/publication/257602303_Improving_EMG_based_classification_of_basic_hand_movements_using_EMD/links/56dfb7fd08ae979addef64a2/Improving-EMG-based-classification-of-basic-hand-movements-using-EMD.pdf Improving EMG based Classification of basic hand movements using EMD]." Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 2013.{{cite journal | last1 = Andrianesis | first1 = Konstantinos | last2 = Tzes | first2 = Anthony | year = 2015 | title = Development and control of a multifunctional prosthetic hand with shape memory alloy actuators | journal = Journal of Intelligent & Robotic Systems | volume = 78 | issue = 2| pages = 257–289 | doi=10.1007/s10846-014-0061-6| s2cid = 207174078 }}

|C. Sapsanis et al.

REALDISP Activity Recognition Dataset

|Data for evaluating techniques that deal with the effects of sensor displacement in wearable activity recognition.

|None.

|1419

|Text

|Classification

|2014

|{{cite journal | last1 = Banos | first1 = Oresti | display-authors = et al | year = 2014 | title = Dealing with the effects of sensor displacement in wearable activity recognition | journal = Sensors | volume = 14 | issue = 6| pages = 9995–10023 | doi=10.3390/s140609995| pmid = 24915181 | pmc=4118358| bibcode = 2014Senso..14.9995B | doi-access = free }}

|O. Banos et al.

Heterogeneity Activity Recognition Dataset

|Data from multiple different smart devices for humans performing various activities.

|None.

|43,930,257

|Text

|Classification, clustering

|2015

|{{cite book |doi=10.1145/2809695.2809718 |chapter=Smart Devices are Different: Assessing and Mitigating Mobile Sensing Heterogeneities for Activity Recognition |title=Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems |date=2015 |last1=Stisen |first1=Allan |last2=Blunck |first2=Henrik |last3=Bhattacharya |first3=Sourav |last4=Prentow |first4=Thor Siiger |last5=Kjærgaard |first5=Mikkel Baun |last6=Dey |first6=Anind |last7=Sonne |first7=Tobias |last8=Jensen |first8=Mads Møller |pages=127–140 |isbn=978-1-4503-3631-4 }}{{cite book |doi=10.1109/PERCOMW.2016.7457169 |chapter=From smart to deep: Robust activity recognition on smartwatches using deep learning |title=2016 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops) |date=2016 |last1=Bhattacharya |first1=Sourav |last2=Lane |first2=Nicholas D. |pages=1–6 |isbn=978-1-5090-1941-0 |chapter-url=https://discovery.ucl.ac.uk/id/eprint/1503672/ }}

|A. Stisen et al.

Indoor User Movement Prediction from RSS Data

|Temporal wireless network data that can be used to track the movement of people in an office.

|None.

|13,197

|Text

|Classification

|2016

|{{cite journal | last1 = Bacciu | first1 = Davide | display-authors = et al | year = 2014 | title = An experimental characterization of reservoir computing in ambient assisted living applications | journal = Neural Computing and Applications | volume = 24 | issue = 6| pages = 1451–1464 | doi=10.1007/s00521-013-1364-4| hdl = 11568/237959 | s2cid = 14124013 | hdl-access = free }}{{cite book |doi=10.1007/978-3-642-41043-7_3 |chapter=Multisensor Data Fusion for Activity Recognition Based on Reservoir Computing |title=Evaluating AAL Systems Through Competitive Benchmarking |series=Communications in Computer and Information Science |date=2013 |last1=Palumbo |first1=Filippo |last2=Barsocchi |first2=Paolo |last3=Gallicchio |first3=Claudio |last4=Chessa |first4=Stefano |last5=Micheli |first5=Alessio |volume=386 |pages=24–35 |isbn=978-3-642-41042-0 }}

|D. Bacciu

PAMAP2 Physical Activity Monitoring Dataset

|18 different types of physical activities performed by 9 subjects wearing 3 IMUs.

|None.

|3,850,505

|Text

|Classification

|2012

|{{cite book |doi=10.1109/ISWC.2012.13 |chapter=Introducing a New Benchmarked Dataset for Activity Monitoring |title=2012 16th International Symposium on Wearable Computers |date=2012 |last1=Reiss |first1=Attila |last2=Stricker |first2=Didier |pages=108–109 |isbn=978-0-7695-4697-1 }}

|A. Reiss

OPPORTUNITY Activity Recognition Dataset

|Data from wearable, object, and ambient sensors, devised to benchmark human activity recognition algorithms.

|None.

|2551

|Text

|Classification

|2012

|{{cite book |doi=10.1109/WOWMOM.2009.5282442 |chapter=OPPORTUNITY: Towards opportunistic activity and context recognition systems |title=2009 IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks & Workshops |date=2009 |last1=Roggen |first1=Daniel |last2=Forster |first2=Kilian |last3=Calatroni |first3=Alberto |last4=Holleczek |first4=Thomas |last5=Fang |first5=Yu |last6=Troster |first6=Gerhard |last7=Ferscha |first7=Alois |last8=Holzmann |first8=Clemens |last9=Riener |first9=Andreas |last10=Lukowicz |first10=Paul |last11=Pirkl |first11=Gerald |last12=Bannach |first12=David |last13=Kunze |first13=Kai |last14=Chavarriaga |first14=Ricardo |last15=Millan |first15=Jose del R. |pages=1–6 |isbn=978-1-4244-4440-3 |chapter-url=http://infoscience.epfl.ch/record/138648 }}Kurz, Marc, et al. "[https://www.researchgate.net/profile/Marc_Kurz/publication/220271166_Dynamic_Quantification_of_Activity_Recognition_Capabilities_in_Opportunistic_Systems/links/09e4150f66b480c97a000000/Dynamic-Quantification-of-Activity-Recognition-Capabilities-in-Opportunistic-Systems.pdf Dynamic quantification of activity recognition capabilities in opportunistic systems]." Vehicular Technology Conference (VTC Spring), 2011 IEEE 73rd. IEEE, 2011.

|D. Roggen et al.

Real World Activity Recognition Dataset

|Human Activity Recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors.

|None.

|3,150,000 (per sensor)

|Text

|Classification

|2016

|{{cite book |doi=10.1109/PERCOM.2016.7456521 |chapter=On-body localization of wearable devices: An investigation of position-aware activity recognition |title=2016 IEEE International Conference on Pervasive Computing and Communications (PerCom) |date=2016 |last1=Sztyler |first1=Timo |last2=Stuckenschmidt |first2=Heiner |pages=1–9 |isbn=978-1-4673-8779-8 }}

|T. Sztyler et al.

Toronto Rehab Stroke Pose Dataset

|3D human pose estimates (Kinect) of stroke patients and healthy participants performing a set of tasks using a stroke rehabilitation robot.

|None.

|10 healthy participants and 9 stroke survivors (3500–6000 frames per person)

|CSV

|Classification

|2017

|{{cite journal |last1=Zhi |first1=Ying Xuan |last2=Lukasik |first2=Michelle |last3=Li |first3=Michael H. |last4=Dolatabadi |first4=Elham |last5=Wang |first5=Rosalie H. |last6=Taati |first6=Babak |title=Automatic Detection of Compensation During Robotic Stroke Rehabilitation Therapy |journal=IEEE Journal of Translational Engineering in Health and Medicine |date=2018 |volume=6 |pages=1–7 |doi=10.1109/JTEHM.2017.2780836 |pmid=29404226 |pmc=5788403 }}{{cite book |doi=10.1145/3154862.3154925 |chapter=The toronto rehab stroke pose dataset to detect compensation during stroke rehabilitation therapy |title=Proceedings of the 11th EAI International Conference on Pervasive Computing Technologies for Healthcare |date=2017 |last1=Dolatabadi |first1=Elham |last2=Zhi |first2=Ying Xuan |last3=Ye |first3=Bing |last4=Coahran |first4=Marge |last5=Lupinacci |first5=Giorgia |last6=Mihailidis |first6=Alex |last7=Wang |first7=Rosalie |last8=Taati |first8=Babak |pages=375–381 |isbn=978-1-4503-6363-1 }}{{Cite web|url=https://www.kaggle.com/derekdb/toronto-robot-stroke-posture-dataset|title=Toronto Rehab Stroke Pose Dataset}}

|E. Dolatabadi et al.

Corpus of Social Touch (CoST)

|7805 gesture captures of 14 different social touch gestures performed by 31 subjects. The gestures were performed in three variations: gentle, normal and rough, on a pressure sensor grid wrapped around a mannequin arm.

|Touch gestures performed are segmented and labeled.

|7805 gesture captures

|CSV

|Classification

|2016

|{{cite journal |last1=Jung |first1=Merel M. |last2=Poel |first2=Mannes |last3=Poppe |first3=Ronald |last4=Heylen |first4=Dirk K. J. |title=Automatic recognition of touch gestures in the corpus of social touch |journal=Journal on Multimodal User Interfaces |date=March 2017 |volume=11 |issue=1 |pages=81–96 |doi=10.1007/s12193-016-0232-9 }}{{Cite journal|date=2016-06-01|title=Corpus of Social Touch (CoST)|url=https://data.4tu.nl/articles/dataset/Corpus_of_Social_Touch_CoST_/12696869|language=en|doi=10.4121/uuid:5ef62345-3b3e-479c-8e1d-c922748c9b29|last1=Jung|first1=M.M. (Merel)|publisher=University of Twente}}

|M. Jung et al.

= Other signals =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Wine Dataset

|Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.

|13 properties of each wine are given

|178

|Text

|Classification, regression

|1991

|Aeberhard, S., D. Coomans, and O. De Vel. "Comparison of classifiers in high dimensional settings." Dept. Math. Statist., James Cook Univ., North Queensland, Australia, Tech. Rep 92-02 (1992). Basu, Sugato. "[http://www.aaai.org/Papers/AAAI/2004/AAAI04-138.pdf Semi-supervised clustering with limited background knowledge]." AAAI. 2004.

|M. Forina et al.

Combined Cycle Power Plant Data Set

|Data from various sensors within a power plant running for 6 years.

|None

|9568

|Text

|Regression

|2014

|{{cite journal | last1 = Tüfekci | first1 = Pınar | year = 2014 | title = Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods | journal = International Journal of Electrical Power & Energy Systems | volume = 60 | pages = 126–140 | doi=10.1016/j.ijepes.2014.02.027| bibcode = 2014IJEPE..60..126T | url = https://hal.science/hal-04823875 }}Kaya, Heysem, Pınar Tüfekci, and Fikret S. Gürgen. "Local and global learning methods for predicting power of a combined gas & steam turbine." International conference on emerging trends in computer and electronics engineering (ICETCEE'2012), Dubai. 2012.

|P. Tufekci et al.

Physical data

Datasets from physical systems.

= High-energy physics =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

HIGGS Dataset

|Monte Carlo simulations of particle accelerator collisions.

|28 features of each collision are given.

|11M

|Text

|Classification

|2014

|{{cite journal | last1 = Baldi | first1 = Pierre | last2 = Sadowski | first2 = Peter | last3 = Whiteson | first3 = Daniel | year = 2014| title = Searching for exotic particles in high-energy physics with deep learning | journal = Nature Communications | volume = 5 | page = 4308 | bibcode = 2014NatCo...5.4308B | doi = 10.1038/ncomms5308 | pmid = 24986233 | arxiv = 1402.4735 | s2cid = 195953 }}{{cite journal | last1 = Baldi | first1 = Pierre | last2 = Sadowski | first2 = Peter | last3 = Whiteson | first3 = Daniel | year = 2015 | title = Enhanced Higgs Boson to τ+ τ− Search with Deep Learning | journal = Physical Review Letters | volume = 114 | issue = 11| page = 111801 | doi=10.1103/physrevlett.114.111801| pmid = 25839260 | bibcode = 2015PhRvL.114k1801B | arxiv = 1410.3469 | s2cid = 2339142 }}{{Cite journal|url=https://higgsml.lal.in2p3.fr/|title=The Higgs Machine Learning Challenge|journal=Journal of Physics: Conference Series|volume=664|issue=7|pages=072015|bibcode=2015JPhCS.664g2015A|last1=Adam-Bourdarios|first1=C.|last2=Cowan|first2=G.|last3=Germain-Renaud|first3=C.|last4=Guyon|first4=I.|last5=Kégl|first5=B.|last6=Rousseau|first6=D.|year=2015|doi=10.1088/1742-6596/664/7/072015|doi-access=free}}

|D. Whiteson

HEPMASS Dataset

|Monte Carlo simulations of particle accelerator collisions. The goal is to separate signal from noise.

|28 features of each collision are given.

|10,500,000

|Text

|Classification

|2016

|{{cite journal | arxiv=1601.07913 | doi=10.1140/epjc/s10052-016-4099-4 | title=Parameterized neural networks for high-energy physics | year=2016 | last1=Baldi | first1=Pierre | last2=Cranmer | first2=Kyle | last3=Faucett | first3=Taylor | last4=Sadowski | first4=Peter | last5=Whiteson | first5=Daniel | journal=The European Physical Journal C | volume=76 | issue=5 | page=235 | bibcode=2016EPJC...76..235B | s2cid=254108545 }}

|D. Whiteson

= Systems =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Yacht Hydrodynamics Dataset

|Yacht performance based on dimensions.

|Six features are given for each yacht.

|308

|Text

|Regression

|2013

|{{cite journal | last1 = Ortigosa | first1 = I. | last2 = Lopez | first2 = R. | last3 = Garcia | first3 = J. | title = A neural networks approach to residuary resistance of sailing yachts prediction | journal = Proceedings of the International Conference on Marine Engineering MARINE | volume = 2007 }} Gerritsma, J., R. Onnink, and A. Versluis. Geometry, resistance and stability of the delft systematic yacht hull series. Delft University of Technology, 1981.

|R. Lopez

Robot Execution Failures Dataset

|Five datasets centered on robot failures to execute common tasks.

|Integer valued features such as torque and other sensor measurements.

|463

|Text

|Classification

|1999

|Liu, Huan, and Hiroshi Motoda. [https://books.google.com/books?id=zi_0EdWW5fYC Feature extraction, construction and selection: A data mining perspective]. Springer Science & Business Media, 1998.

|L. Seabra et al.

Pittsburgh Bridges Dataset

|Design description is given in terms of several properties of various bridges.

|Various bridge features are given.

|108

|Text

|Classification

|1990

|Reich, Yoram. Converging to Ideal Design Knowledge by Learning. [Carnegie Mellon University], Engineering Design Research Center, 1989.{{Cite book|chapter-url=https://link.springer.com/chapter/10.1007/978-3-540-48247-5_11|doi = 10.1007/978-3-540-48247-5_11|chapter = Experiments in Meta-level Learning with ILP|title = Principles of Data Mining and Knowledge Discovery|series = Lecture Notes in Computer Science|year = 1999|last1 = Todorovski|first1 = Ljupčo|last2 = Džeroski|first2 = Sašo|volume = 1704|pages = 98–106|isbn = 978-3-540-66490-1| s2cid=39382993 }}

|Y. Reich et al.

Automobile Dataset

|Data about automobiles, their insurance risk, and their normalized losses.

|Car features extracted.

|205

|Text

|Regression

|1987

|Wang, Yong. [http://www.cs.waikato.ac.nz/~ml/publications/2000/thesis.pdf A new approach to fitting linear models in high dimensional spaces]. Diss. The University of Waikato, 2000.{{cite journal | last1 = Kibler | first1 = Dennis | last2 = Aha | first2 = David W. | last3 = Albert | first3 = Marc K. | year = 1989 | title = Instance-based prediction of real-valued attributes | url = https://escholarship.org/uc/item/68f860zb| journal = Computational Intelligence | volume = 5 | issue = 2| pages = 51–57 | doi=10.1111/j.1467-8640.1989.tb00315.x| s2cid = 40800413 }}

|J. Schlimmer et al.

Auto MPG Dataset

|MPG data for cars.

|Eight features of each car given.

|398

|Text

|Regression

|1993

|{{cite book |doi=10.1007/3-540-36175-8_49 |chapter=Electricity Based External Similarity of Categorical Attributes |title=Advances in Knowledge Discovery and Data Mining |series=Lecture Notes in Computer Science |date=2003 |last1=Palmer |first1=Christopher R. |last2=Faloutsos |first2=Christos |volume=2637 |pages=486–500 |isbn=978-3-540-04760-5 }}

|Carnegie Mellon University

Energy Efficiency Dataset

|Heating and cooling requirements given as a function of building parameters.

|Building parameters given.

|768

|Text

|Classification, regression

|2012

|{{cite journal | last1 = Tsanas | first1 = Athanasios | last2 = Xifara | first2 = Angeliki | year = 2012 | title = Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools | journal = Energy and Buildings | volume = 49 | pages = 560–567 | doi=10.1016/j.enbuild.2012.03.003| bibcode = 2012EneBu..49..560T }}{{cite journal | last1 = De Wilde | first1 = Pieter | year = 2014 | title = The gap between predicted and measured energy performance of buildings: A framework for investigation | journal = Automation in Construction | volume = 41 | pages = 40–49 | doi=10.1016/j.autcon.2014.02.009}}

|A. Xifara et al.

Airfoil Self-Noise Dataset

|A series of aerodynamic and acoustic tests of two- and three-dimensional airfoil blade sections.

|Data about frequency, angle of attack, etc., are given.

|1503

|Text

|Regression

|2014

|Brooks, Thomas F., D. Stuart Pope, and Michael A. Marcolini. [https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19890016302.pdf Airfoil self-noise and prediction]. Vol. 1218. National Aeronautics and Space Administration, Office of Management, Scientific and Technical Information Division, 1989.

|R. Lopez

Challenger USA Space Shuttle O-Ring Dataset

|Attempt to predict O-ring problems given past Challenger data.

|Several features of each flight, such as launch temperature, are given.

|23

|Text

|Regression

|1993

|Draper, David. "[http://www2.denizyuret.com/ref/draper/assessment-and-propagation.pdf Assessment and propagation of model uncertainty]." Journal of the Royal Statistical Society, Series B (Methodological) (1995): 45–97.{{cite journal | last1 = Lavine | first1 = Michael | year = 1991 | title = Problems in extrapolation illustrated with space shuttle O-ring data | journal = Journal of the American Statistical Association | volume = 86 | issue = 416| pages = 919–921 | doi=10.1080/01621459.1991.10475132}}

|D. Draper et al.

Statlog (Shuttle) Dataset

|NASA space shuttle datasets.

|Nine features given.

|58,000

|Text

|Classification

|2002

|{{cite book |doi=10.1109/ICDM.2002.1184032 |chapter=Concept tree based clustering visualization with shaded similarity matrices |title=2002 IEEE International Conference on Data Mining, 2002. Proceedings. |date=2002 |last1=Wang |first1=J. |last2=Yu |first2=B. |last3=Gasser |first3=L. |pages=697–700 |isbn=0-7695-1754-4 }}

|NASA

= Astronomy =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Volcanoes on Venus – JARtool experiment Dataset

|Venus images returned by the Magellan spacecraft.

|Images are labeled by humans.

|not given

|Images

|Classification

|1991

|{{cite journal |last1=Pettengill |first1=Gordon H. |last2=Ford |first2=Peter G. |last3=Johnson |first3=William T. K. |last4=Raney |first4=R. Keith |last5=Soderblom |first5=Laurence A. |title=Magellan: Radar Performance and Data Products |journal=Science |date=12 April 1991 |volume=252 |issue=5003 |pages=260–265 |doi=10.1126/science.252.5003.260 |pmid=17769272 |bibcode=1991Sci...252..260P }}{{cite journal | last1 = Aharonian | first1 = F. | display-authors = et al | year = 2008 | title = Energy spectrum of cosmic-ray electrons at TeV energies | journal = Physical Review Letters | volume = 101 | issue = 26| page = 261104 | bibcode = 2008PhRvL.101z1104A | doi = 10.1103/PhysRevLett.101.261104 | pmid = 19437632 | arxiv = 0811.3894 | hdl = 2440/51450 | s2cid = 41850528 }}

|M. Burl

MAGIC Gamma Telescope Dataset

|Monte Carlo generated high-energy gamma particle events.

|Numerous features extracted from the simulations.

|19,020

|Text

|Classification

|2007

|{{cite journal | last1 = Bock | first1 = R. K. | display-authors = et al | year = 2004 | title = Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope | journal = Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment | volume = 516 | issue = 2| pages = 511–528 | doi=10.1016/j.nima.2003.08.157| bibcode = 2004NIMPA.516..511B }}

|R. Bock

Solar Flare Dataset

|Measurements of the number of certain types of solar flare events occurring in a 24-hour period.

|Many solar flare-specific features are given.

|1389

|Text

|Regression, classification

|1989

|{{cite journal | last1 = Li | first1 = Jinyan | display-authors = et al | year = 2004 | title = Deeps: A new instance-based lazy discovery and classification system | journal = Machine Learning | volume = 54 | issue = 2| pages = 99–124 | doi=10.1023/b:mach.0000011804.08528.7d| doi-access = free }}

|G. Bradshaw

CAMELS Multifield Dataset

|2D maps and 3D grids from thousands of N-body and state-of-the-art hydrodynamic simulations spanning a broad range of cosmological and astrophysical parameter values.

|Each map and grid has 6 associated cosmological and astrophysical parameters.

|405,000 2D maps and 405,000 3D grids

|2D maps and 3D grids

|Regression

|2021

|{{cite journal|last1=Villaescusa-Navarro|first1=Francisco|display-authors=et al|title=The CAMELS Multifield Data Set: Learning the Universe's Fundamental Parameters with Artificial Intelligence|journal=The Astrophysical Journal Supplement Series|year=2022|volume=259|issue=2|page=61|doi=10.3847/1538-4365/ac5ab0|arxiv=2109.10915|bibcode=2022ApJS..259...61V|s2cid=237604997 |doi-access=free }}

|Francisco Villaescusa-Navarro et al.

= Earth science =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Volcanoes of the World

|Volcanic eruption data for all known volcanic events on Earth.

|Details such as region, subregion, tectonic setting, dominant rock type are given.

|1535

|Text

|Regression, classification

|2013

|Siebert, Lee, and Tom Simkin. "Volcanoes of the world: an illustrated catalog of Holocene volcanoes and their eruptions." (2014).

|E. Venzke et al.

Seismic-bumps Dataset

|Seismic activities from a coal mine.

|Seismic activity was classified as hazardous or not.

|2584

|Text

|Classification

|2013

|{{cite journal | last1 = Sikora | first1 = Marek | last2 = Wróbel | first2 = Łukasz | year = 2010 | title = Application of rule induction algorithms for analysis of data collected by seismic hazard monitoring systems in coal mines | url = https://www.infona.pl/resource/bwmeta1.element.baztech-article-BPZ5-0008-0008| journal = Archives of Mining Sciences | volume = 55 | issue = 1| pages = 91–114 }}{{cite book |doi=10.1007/978-1-4471-2760-4_10 |chapter=Rough Natural Hazards Monitoring |title=Rough Sets: Selected Methods and Applications in Management and Engineering |series=Advanced Information and Knowledge Processing |date=2012 |last1=Sikora |first1=Marek |last2=Sikora |first2=Beata |pages=163–179 |isbn=978-1-4471-2759-8 }}

|M. Sikora et al.

CAMELS-US

|Catchment hydrology dataset with hydrometeorological timeseries and various attributes

|see Reference

|671

|CSV, Text, Shapefile

|Regression

|2017

|{{cite journal |last1=Addor |first1=Nans |last2=Newman |first2=Andrew J. |last3=Mizukami |first3=Naoki |last4=Clark |first4=Martyn P. |title=The CAMELS data set: catchment attributes and meteorology for large-sample studies |journal=Hydrology and Earth System Sciences |date=20 October 2017 |volume=21 |issue=10 |pages=5293–5313 |doi=10.5194/hess-21-5293-2017 |doi-access=free |bibcode=2017HESS...21.5293A }}{{cite journal |last1=Newman |first1=A. J. |last2=Clark |first2=M. P. |last3=Sampson |first3=K. |last4=Wood |first4=A. |last5=Hay |first5=L. E. |last6=Bock |first6=A. |last7=Viger |first7=R. J. |last8=Blodgett |first8=D. |last9=Brekke |first9=L. |last10=Arnold |first10=J. R. |last11=Hopson |first11=T. |last12=Duan |first12=Q. |title=Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance |journal=Hydrology and Earth System Sciences |date=14 January 2015 |volume=19 |issue=1 |pages=209–223 |doi=10.5194/hess-19-209-2015 |doi-access=free |bibcode=2015HESS...19..209N }}

|N. Addor et al. / A. Newman et al.

CAMELS-Chile

|Catchment hydrology dataset with hydrometeorological timeseries and various attributes

|see Reference

|516

|CSV, Text, Shapefile

|Regression

|2018

|{{cite journal |last1=Alvarez-Garreton |first1=Camila |last2=Mendoza |first2=Pablo A. |last3=Boisier |first3=Juan Pablo |last4=Addor |first4=Nans |last5=Galleguillos |first5=Mauricio |last6=Zambrano-Bigiarini |first6=Mauricio |last7=Lara |first7=Antonio |last8=Puelma |first8=Cristóbal |last9=Cortes |first9=Gonzalo |last10=Garreaud |first10=Rene |last11=McPhee |first11=James |last12=Ayala |first12=Alvaro |title=The CAMELS-CL dataset: catchment attributes and meteorology for large sample studies – Chile dataset |journal=Hydrology and Earth System Sciences |date=13 November 2018 |volume=22 |issue=11 |pages=5817–5846 |doi=10.5194/hess-22-5817-2018 |doi-access=free |bibcode=2018HESS...22.5817A }}

|C. Alvarez-Garreton et al.

CAMELS-Brazil

|Catchment hydrology dataset with hydrometeorological timeseries and various attributes

|see Reference

|897

|CSV, Text, Shapefile

|Regression

|2020

|{{cite journal |last1=Chagas |first1=Vinícius B. P. |last2=Chaffe |first2=Pedro L. B. |last3=Addor |first3=Nans |last4=Fan |first4=Fernando M. |last5=Fleischmann |first5=Ayan S. |last6=Paiva |first6=Rodrigo C. D. |last7=Siqueira |first7=Vinícius A. |title=CAMELS-BR: hydrometeorological time series and landscape attributes for 897 catchments in Brazil |journal=Earth System Science Data |date=8 September 2020 |volume=12 |issue=3 |pages=2075–2096 |doi=10.5194/essd-12-2075-2020 |doi-access=free |bibcode=2020ESSD...12.2075C }}

|V. Chagas et al.

CAMELS-GB

|Catchment hydrology dataset with hydrometeorological timeseries and various attributes

|see Reference

|671

|CSV, Text, Shapefile

|Regression

|2020

|{{cite journal |last1=Coxon |first1=Gemma |last2=Addor |first2=Nans |last3=Bloomfield |first3=John P. |last4=Freer |first4=Jim |last5=Fry |first5=Matt |last6=Hannaford |first6=Jamie |last7=Howden |first7=Nicholas J. K. |last8=Lane |first8=Rosanna |last9=Lewis |first9=Melinda |last10=Robinson |first10=Emma L. |last11=Wagener |first11=Thorsten |last12=Woods |first12=Ross |title=CAMELS-GB: hydrometeorological time series and landscape attributes for 671 catchments in Great Britain |journal=Earth System Science Data |date=12 October 2020 |volume=12 |issue=4 |pages=2459–2483 |doi=10.5194/essd-12-2459-2020 |doi-access=free |bibcode=2020ESSD...12.2459C }}

|G. Coxon et al.

CAMELS-Australia

|Catchment hydrology dataset with hydrometeorological timeseries and various attributes

|see Reference

|222

|CSV, Text, Shapefile

|Regression

|2021

|{{cite journal |last1=Fowler |first1=Keirnan J. A. |last2=Acharya |first2=Suwash Chandra |last3=Addor |first3=Nans |last4=Chou |first4=Chihchung |last5=Peel |first5=Murray C. |title=CAMELS-AUS: hydrometeorological time series and landscape attributes for 222 catchments in Australia |journal=Earth System Science Data |date=6 August 2021 |volume=13 |issue=8 |pages=3847–3867 |doi=10.5194/essd-13-3847-2021 |doi-access=free |bibcode=2021ESSD...13.3847F }}

|K. Fowler et al.

LamaH-CE

|Catchment hydrology dataset with hydrometeorological timeseries and various attributes

|see Reference

|859

|CSV, Text, Shapefile

|Regression

|2021

|{{cite journal |last1=Klingler |first1=Christoph |last2=Schulz |first2=Karsten |last3=Herrnegger |first3=Mathew |title=LamaH-CE: LArge-SaMple DAta for Hydrology and Environmental Sciences for Central Europe |journal=Earth System Science Data |date=16 September 2021 |volume=13 |issue=9 |pages=4529–4565 |doi=10.5194/essd-13-4529-2021 |doi-access=free |bibcode=2021ESSD...13.4529K }}

|C. Klingler et al.

= Other physical =

class="wikitable sortable sort-under" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Concrete Compressive Strength Dataset

|Dataset of concrete properties and compressive strength.

|Nine features are given for each sample.

|1030

|Text

|Regression

|2007

|{{cite journal | last1 = Yeh | first1 = I–C | year = 1998 | title = Modeling of strength of high-performance concrete using artificial neural networks | journal = Cement and Concrete Research | volume = 28 | issue = 12| pages = 1797–1808 | doi=10.1016/s0008-8846(98)00165-3}}
{{cite journal | last1 = Zarandi | first1 = MH Fazel | display-authors = et al | year = 2008 | title = Fuzzy polynomial neural networks for approximation of the compressive strength of concrete | journal = Applied Soft Computing | volume = 8 | issue = 1| pages = 488–498 | doi=10.1016/j.asoc.2007.02.010| bibcode = 2008ApSoC...8...79S }}

|I. Yeh

Concrete Slump Test Dataset

|Concrete slump flow given in terms of properties.

|Features of concrete given such as fly ash, water, etc.

|103

|Text

|Regression

|2009

|Yeh, I. "Modeling slump of concrete with fly ash and superplasticizer." Computers and Concrete 5.6 (2008): 559–572.
{{cite journal | last1 = Gencel | first1 = Osman | display-authors = et al | year = 2011 | title = Comparison of artificial neural networks and general linear model approaches for the analysis of abrasive wear of concrete | journal = Construction and Building Materials | volume = 25 | issue = 8| pages = 3486–3494 | doi=10.1016/j.conbuildmat.2011.03.040}}

|I. Yeh

Musk Dataset

|Predict if a molecule, given the features, will be a musk or a non-musk.

|168 features given for each molecule.

|6598

|Text

|Classification

|1994

|Dietterich, Thomas G., et al. "[http://papers.nips.cc/paper/781-a-comparison-of-dynamic-reposing-and-tangent-distance-for-drug-activity-prediction.pdf A comparison of dynamic reposing and tangent distance for drug activity prediction] {{Webarchive|url=https://web.archive.org/web/20191207012717/https://papers.nips.cc/paper/781-a-comparison-of-dynamic-reposing-and-tangent-distance-for-drug-activity-prediction.pdf |date=7 December 2019 }}." Advances in Neural Information Processing Systems (1994): 216–216.

|Arris Pharmaceutical Corp.

Steel Plates Faults Dataset

|Steel plates of 7 different types.

|27 features given for each sample.

|1941

|Text

|Classification

|2010

|{{cite book |doi=10.1007/978-1-4614-4223-3_5 |chapter=Meta Net: A New Meta-Classifier Family |title=Data Mining Applications Using Artificial Adaptive Systems |date=2013 |last1=Buscema |first1=Massimo |last2=Tastle |first2=William J. |last3=Terzi |first3=Stefano |pages=141–182 |isbn=978-1-4614-4222-6 }}

|Semeion Research Center

Noble Metal Monometallic Nanoparticles Datasets

|Processing and structural features of monometallic nanoparticles, labels being formation energy.

|85–182 features given for each sample.

|425 to 4000

|CSV

|Regression

|2017 to 2023

|Barnard, Amanda; Sun, Baichuan; Motevalli Soumehsaraei, Ben; & Opletal, George (2019): Silver Nanoparticle Data Set. v3. CSIRO. Data Collection. [https://data.csiro.au/collection/csiro:23472 https://doi.org/10.25919/5d22d20bc543e]
Barnard, Amanda; Sun, Baichuan; & Opletal, George (2019): Platinum Nanoparticle Data Set. v2. CSIRO. Data Collection. [https://data.csiro.au/collection/csiro:36491 https://doi.org/10.25919/5d3958d9bf5f7]
Barnard, Amanda; & Opletal, George (2019): Gold Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://data.csiro.au/collection/csiro:40669 https://doi.org/10.25919/5d395ef9a4291]
Barnard, Amanda; & Opletal, George (2019): Ruthenium Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://data.csiro.au/collection/csiro:42601 https://doi.org/10.25919/5e30b8fa67484]
Barnard, Amanda; & Opletal, George (2019): Copper Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://data.csiro.au/collection/csiro:42598 https://doi.org/10.25919/5e30ba386311f]
Barnard, Amanda; & Opletal, George (2023): Palladium Nanoparticle Data Set. v2. CSIRO. Data Collection. [https://doi.org/10.25919/epxd-8p61 https://doi.org/10.25919/epxd-8p61]

|A. Barnard and G. Opletal

Noble Metal Bimetallic Nanoparticles Datasets

|Processing and structural features of bimetallic nanoparticles, labels being formation energy.

|922 features given for each sample.

|138147 to 162770

|CSV

|Regression

|2023

|Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): AuCo Nanoparticle Data Set. v2. CSIRO. Data Collection. [https://doi.org/10.25919/7h3x-1343 https://doi.org/10.25919/7h3x-1343]
Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PtCo Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/jzh8-rd31 https://doi.org/10.25919/jzh8-rd31]
Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PtAu Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/tdnv-jp30 https://doi.org/10.25919/tdnv-jp30]
Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PdPt Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/qced-2e85 https://doi.org/10.25919/qced-2e85]
Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PdCo Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/az9t-vr97 https://doi.org/10.25919/az9t-vr97]
Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): CoPt Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/0bs4-sn79 https://doi.org/10.25919/0bs4-sn79]
Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): CoPd Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/em3a-9a89 https://doi.org/10.25919/em3a-9a89]
Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): CoAu Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/991j-hg07 https://doi.org/10.25919/991j-hg07]
Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): AuPt Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/7zh9-3f67 https://doi.org/10.25919/7zh9-3f67]
Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PtPd Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/9sz9-3a85 https://doi.org/10.25919/9sz9-3a85]
Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): PdAu Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/6ajg-1275 https://doi.org/10.25919/6ajg-1275]
Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): AuPd Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/v0r5-sw08 https://doi.org/10.25919/v0r5-sw08]

|J. Ting et al.

AuPdPt Trimetallic Nanoparticles Dataset

|Processing and structural features of AuPdPt nanoparticles, labels being formation energy.

|1958 features given for each sample.

|48136

|CSV

|Regression

|2023

|Lu, Kaihan; Ting, Jonathan; Barnard, Amanda; & Opletal, George (2023): AuPdPt Nanoparticle Data Set. v1. CSIRO. Data Collection. [https://doi.org/10.25919/psvw-am47 https://doi.org/10.25919/psvw-am47]

|K. Lu et al.

Biological data

Datasets from biological systems.

= Human =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Age Dataset

|A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people. Public domain.

|A five-step method to infer birth and death years, gender, and occupation from community-submitted data to all language versions of the Wikipedia project.

|1,223,009

|Text

|Regression, Classification

|2022

|Paper{{cite journal | last1 = Amoradnejad | first1 = Issa | last2 = Amoradnejad | first2 = Rahimberdi | display-authors = et al | year = 2022 | title = Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people | journal = Workshop Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM) | volume = 3 | pages = 1–4 | publisher = ICWSM | doi=10.36190/2022.82 | s2cid = 249668669 | url=http://workshop-proceedings.icwsm.org/abstract?id=2022_82 }}

Dataset{{cite web |title=Age Dataset |website=GitHub |date=7 June 2022 |url=https://github.com/Moradnejad/AgeDataset}}

|Amoradnejad et al.

Synthetic Fundus Dataset{{cite web |title=Synthetic Fundus Dataset |url=http://math.unipa.it/cvalenti/fundus/ |access-date=22 February 2023 |archive-date=29 November 2021 |archive-url=https://web.archive.org/web/20211129155047/http://math.unipa.it/cvalenti/fundus/ |url-status=dead }}

|Photorealistic retinal images and vessel segmentations. Public domain.

|2500 images of 1500×1152 pixels, useful for segmentation and classification of veins and arteries on a single background.

|2500

|Images

|Classification, Segmentation

|2020

|{{cite journal | last1 = Lo Castro | first1 = Dario | display-authors = et al | year = 2020 | title = A visual framework to create photorealistic retinal vessels for diagnosis purposes | journal = Journal of Biomedical Informatics | volume = 108 | page = 103490 | doi=10.1016/j.jbi.2020.103490 | pmid = 32640292 | s2cid = 220429697 }}

|C. Valenti et al.

EEG Database

|Study to examine EEG correlates of genetic predisposition to alcoholism.

|Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second.

|122

|Text

|Classification

|1999

|{{cite journal | last = Ingber | first = Lester | year = 1997 | title = Statistical mechanics of neocortical interactions: Canonical momenta indicators of electroencephalography | journal = Physical Review E | volume = 55 | issue = 4| pages = 4578–4593| bibcode = 1997PhRvE..55.4578I| doi = 10.1103/PhysRevE.55.4578| arxiv = physics/0001052| s2cid = 6390999 }}

|H. Begleiter

P300 Interface Dataset

|Data from nine subjects collected using a P300-based brain-computer interface for disabled subjects.

|Split into four sessions for each subject. MATLAB code given.

|1,224

|Text

|Classification

|2008

|{{cite journal |last1=Hoffmann |first1=Ulrich |last2=Vesin |first2=Jean-Marc |last3=Ebrahimi |first3=Touradj |last4=Diserens |first4=Karin |title=An efficient P300-based brain–computer interface for disabled subjects |journal=Journal of Neuroscience Methods |date=January 2008 |volume=167 |issue=1 |pages=115–125 |doi=10.1016/j.jneumeth.2007.03.005 |pmid=17445904 |url=http://infoscience.epfl.ch/record/101093 }}{{cite journal |last1=Donchin |first1=Emanuel |first2=Kevin M. |last2=Spencer |first3=Ranjith |last3=Wijesinghe |title=The mental prosthesis: assessing the speed of a P300-based brain-computer interface |journal=IEEE Transactions on Rehabilitation Engineering |volume=8 |issue=2 |year=2000 |pages=174–179 |pmid=10896179 |doi=10.1109/86.847808|s2cid=84043 }}

|U. Hoffmann et al.

Heart Disease Data Set

|Attributes of patients with and without heart disease.

|75 attributes given for each patient with some missing values.

|303

|Text

|Classification

|1988

|{{cite journal | last1 = Detrano | first1 = Robert | display-authors = et al | year = 1989 | title = International application of a new probability algorithm for the diagnosis of coronary artery disease | journal = The American Journal of Cardiology | volume = 64 | issue = 5| pages = 304–310 | doi=10.1016/0002-9149(89)90524-9| pmid = 2756873 }}{{cite journal | last1 = Bradley | first1 = Andrew P | year = 1997 | title = The use of the area under the ROC curve in the evaluation of machine learning algorithms | url = http://espace.library.uq.edu.au/view/UQ:8925/pr-t.pdf| journal = Pattern Recognition | volume = 30 | issue = 7| pages = 1145–1159 | doi=10.1016/s0031-3203(96)00142-2| bibcode = 1997PatRe..30.1145B | s2cid = 13806304 }}

|A. Janosi et al.

Breast Cancer Wisconsin (Diagnostic) Dataset

|Dataset of features of breast masses. A physician's diagnosis is given for each.

|10 features for each sample are given.

|569

|Text

|Classification

|1995

|{{cite book |doi=10.1117/12.148698 |chapter=Nuclear feature extraction for breast tumor diagnosis |title=Biomedical Image Processing and Biomedical Visualization |date=1993 |editor-last1=Acharya |editor-first1=Raj S. |last1=Street |first1=W. N. |last2=Wolberg |first2=W. H. |last3=Mangasarian |first3=O. L. |volume=1905 |pages=861–870 |editor-first2=Dmitry B. |editor-last2=Goldgof }}{{cite report |last1=Demir |first1=Cigdem |last2=Yener |first2=Bülent |title=Automated cancer diagnosis based on histopathological images : a systematic survey |date=2005 |url=https://www.cs.rpi.edu/research/pdf/05-09.pdf |s2cid=8952443 }}

|W. Wolberg et al.

National Survey on Drug Use and Health

|Large scale survey on health and drug use in the United States.

|None.

|55,268

|Text

|Classification, regression

|2012

|Substance Abuse and Mental Health Services Administration. "Results from the 2010 National Survey on Drug Use and Health: Summary of National Findings." NSDUH Series H-41, HHS Publication No. (SMA) 11-4658. Rockville, MD: Substance Abuse and Mental Health Services Administration, 2011.

|United States Department of Health and Human Services

Lung Cancer Dataset

|Lung cancer dataset without attribute definitions

|56 features are given for each case

|32

|Text

|Classification

|1992

|{{cite journal | last1 = Hong | first1 = Zi-Quan | last2 = Yang | first2 = Jing-Yu | year = 1991 | title = Optimal discriminant plane for a small number of samples and design method of classifier on the plane | journal = Pattern Recognition | volume = 24 | issue = 4| pages = 317–324 | doi=10.1016/0031-3203(91)90074-f| bibcode = 1991PatRe..24..317H }}{{cite book |doi=10.1007/978-3-540-45160-0_25 |chapter=Using Rules to Analyse Bio-medical Data: A Comparison between C4.5 and PCL |title=Advances in Web-Age Information Management |series=Lecture Notes in Computer Science |date=2003 |last1=Li |first1=Jinyan |last2=Wong |first2=Limsoon |volume=2762 |pages=254–265 |isbn=978-3-540-40715-7 }}

|Z. Hong et al.

Arrhythmia Dataset

|Data for a group of patients, of which some have cardiac arrhythmia.

|276 features for each instance.

|452

|Text

|Classification

|1998

|{{cite book |doi=10.1109/CIC.1997.647926 |chapter=A supervised machine learning algorithm for arrhythmia analysis |title=Computers in Cardiology 1997 |date=1997 |last1=Guvenir |first1=H.A. |last2=Acar |first2=B. |last3=Demiroz |first3=G. |last4=Cekin |first4=A. |pages=433–436 |hdl=11693/27699 |isbn=0-7803-4445-6 }}{{cite book |last1=Lagus |first1=Krista |last2=Alhoniemi |first2=Esa |last3=Seppä |first3=Jeremias |last4=Honkela |first4=Antti |last5=Wagner |first5=Paul |chapter=Independent Variable Group Analysis in Learning Compact Representations for Data |title=International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05), Helsinki, Finland, June 15-17, 2005 |date=2005 |pages=49–56 |chapter-url=http://research.ics.aalto.fi/events/AKRR05/papers/akrr05lagus.pdf }}

|H. A. Guvenir et al.

Diabetes 130-US hospitals for years 1999–2008 Dataset

|9 years of readmission data across 130 US hospitals for patients with diabetes.

|Many features of each readmission are given.

|100,000

|Text

|Classification, clustering

|2014

|Strack, Beata, et al. "[http://downloads.hindawi.com/journals/bmri/2014/781670.pdf Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records]." BioMed Research International 2014; 2014{{cite journal | last1 = Rubin | first1 = Daniel J | year = 2015 | title = Hospital readmission of patients with diabetes | journal = Current Diabetes Reports | volume = 15 | issue = 4| pages = 1–9 | doi=10.1007/s11892-015-0584-7| pmid = 25712258 | s2cid = 3908599 }}

|J. Clore et al.

Diabetic Retinopathy Debrecen Dataset

|Features extracted from images of eyes with and without diabetic retinopathy.

|Features extracted and conditions diagnosed.

|1151

|Text

|Classification

|2014

|{{cite journal | last1 = Antal | first1 = Bálint | last2 = Hajdu | first2 = András | year = 2014 | title = An ensemble-based system for automatic screening of diabetic retinopathy | journal = Knowledge-Based Systems | volume = 60 | issue = 2014| pages = 20–27 | doi=10.1016/j.knosys.2013.12.023| arxiv = 1410.8576 | bibcode = 2014arXiv1410.8576A | s2cid = 13984326 }}{{cite arXiv |eprint=1505.04424|last1=Haloi|first1=Mrinal|title=Improved Microaneurysm Detection using Deep Neural Networks|class=cs.CV|year=2015}}

|B. Antal et al.

Diabetic Retinopathy Messidor Dataset

|Methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology (MESSIDOR)

|Retinopathy grade and risk of macular edema are given for each image.

|1200

|Images, Text

|Classification, Segmentation

|2008

|{{Cite web|url=http://www.adcis.net/en/Download-Third-Party/Messidor.htmldownload.php|title=ADCIS Download Third Party: Messidor Database|last=ELIE|first=Guillaume PATRY, Gervais GAUTHIER, Bruno LAY, Julien ROGER, Damien|website=adcis.net|language=en|access-date=2018-02-25}}{{cite journal |last1=Decencière |first1=Etienne |last2=Zhang |first2=Xiwei |last3=Cazuguel |first3=Guy |last4=Lay |first4=Bruno |last5=Cochener |first5=Béatrice |last6=Trone |first6=Caroline |last7=Gain |first7=Philippe |last8=Ordonez |first8=Richard |last9=Massin |first9=Pascale |last10=Erginay |first10=Ali |last11=Charton |first11=Béatrice |last12=Klein |first12=Jean-Claude |title=Feedback on a Publicly Distributed Image Database: The Messidor Database |journal=Image Analysis & Stereology |date=26 August 2014 |volume=33 |issue=3 |pages=231 |doi=10.5566/ias.1155 }}

|Messidor Project

Liver Disorders Dataset

|Data for people with liver disorders.

|Seven biological features given for each patient.

|345

|Text

|Classification

|1990

|{{cite journal |last1=Bagirov |first1=A. M. |last2=Rubinov |first2=A. M. |last3=Soukhoroukova |first3=N. V. |last4=Yearwood |first4=J. |title=Unsupervised and supervised data classification via nonsmooth and global optimization |journal=Top |date=June 2003 |volume=11 |issue=1 |pages=1–75 |doi=10.1007/bf02578945 |url=https://figshare.com/articles/journal_contribution/26292709 }}{{cite book |doi=10.1145/1015330.1015409 |chapter=A fast iterative algorithm for fisher discriminant using heterogeneous kernels |last1=Fung |first1=Glenn |last2=Dundar |first2=Murat |last3=Bi |first3=Jinbo |last4=Rao |first4=Bharat |page=40 |editor1-last=Greiner |editor1-first=Russell |editor2-last=Schuurmans |editor2-first=Dale |title=Proceedings of the Twenty-first International Conference on Machine Learning |date=2004 |publisher=ACM |isbn=978-1-58113-838-2 }}

|Bupa Medical Research Ltd.

Thyroid Disease Dataset

|10 databases of thyroid disease patient data.

|None.

|7200

|Text

|Classification

|1987

|{{cite book |last1=Quinlan |first1=J. R. |last2=Compton |first2=P. J. |last3=Horn |first3=K. A. |last4=Lazarus |first4=L. |chapter=Inductive knowledge acquisition: a case study |pages=137–156 |editor1-last=Quinlan |editor1-first=John Ross |title=Applications of Expert Systems: Based on the Proceedings of the Second Australian Conference |date=1987 |publisher=Turing Institute Press |isbn=978-0-201-17449-6 }}{{cite journal |doi=10.1109/tkde.2004.11 |title=NeC4.5: Neural ensemble based C4.5 |date=2004 |last1=Zhi-Hua Zhou |last2=Yuan Jiang |journal=IEEE Transactions on Knowledge and Data Engineering |volume=16 |issue=6 |pages=770–773 }}

|R. Quinlan

Mesothelioma Dataset

|Mesothelioma patient data.

|Large number of features, including asbestos exposure, are given.

|324

|Text

|Classification

|2016

|{{cite journal | last1 = Er | first1 = Orhan | display-authors = et al | year = 2012 | title = An approach based on probabilistic neural network for diagnosis of Mesothelioma's disease | journal = Computers & Electrical Engineering | volume = 38 | issue = 1| pages = 75–81 | doi=10.1016/j.compeleceng.2011.09.001}}{{cite journal |last1=Er |first1=Orhan |last2=Tanrikulu |first2=A. Çetin |last3=Abakay |first3=Abdurrahman |title=Use of artificial intelligence techniques for diagnosis of malignant pleural mesothelioma |journal=Dicle Medical Journal / Dicle Tip Dergisi |date=10 May 2015 |volume=42 |issue=1 |doi=10.5798/diclemedj.0921.2015.01.0520 |doi-broken-date=23 November 2024 }}

|A. Tanrikulu et al.

Parkinson's Vision-Based Pose Estimation Dataset

|2D human pose estimates of Parkinson's patients performing a variety of tasks.

|Camera shake has been removed from trajectories.

|134

|Text

|Classification, regression

|2017

|{{cite journal|last1=Li|first1=Michael H.|last2=Mestre|first2=Tiago A.|last3=Fox|first3=Susan H.|last4=Taati|first4=Babak|date=2017-07-25|title=Vision-Based Assessment of Parkinsonism and Levodopa-Induced Dyskinesia with Deep Learning Pose Estimation|journal=Journal of Neuroengineering and Rehabilitation|volume=15|issue=1|pages=97|arxiv=1707.09416|doi=10.1186/s12984-018-0446-z|pmid=30400914|pmc=6219082|bibcode=2017arXiv170709416L |doi-access=free }}{{cite journal |last1=Li |first1=Michael H. |last2=Mestre |first2=Tiago A. |last3=Fox |first3=Susan H. |last4=Taati |first4=Babak |title=Automated assessment of levodopa-induced dyskinesia: Evaluating the responsiveness of video-based features |journal=Parkinsonism & Related Disorders |date=August 2018 |volume=53 |pages=42–45 |doi=10.1016/j.parkreldis.2018.04.036 |pmid=29748112 }}{{Cite web|url=https://www.kaggle.com/limi44/parkinsons-visionbased-pose-estimation-dataset/home|title=Parkinson's Vision-Based Pose Estimation Dataset {{!}} Kaggle|website=kaggle.com|access-date=2018-08-22}}

|M. Li et al.

KEGG Metabolic Reaction Network (Undirected) Dataset

|Network of metabolic pathways. A reaction network and a relation network are given.

|Detailed features for each network node and pathway are given.

|65,554

|Text

|Classification, clustering, regression

|2011

|{{cite journal|last1=Shannon|first1=Paul|display-authors=etal|year=2003|title=Cytoscape: a software environment for integrated models of biomolecular interaction networks|journal=Genome Research |volume=13 |issue=11 |pages=2498–2504 |doi=10.1101/gr.1239303 |pmid=14597658 |pmc=403769}}

|M. Naeem et al.

Modified Human Sperm Morphology Analysis Dataset (MHSMA)

|Human sperm images from 235 patients with male factor infertility, labeled for normal or abnormal sperm acrosome, head, vacuole, and tail.

|Cropped around single sperm head. Magnification normalized. Training, validation, and test set splits created.

|1,540

|.npy files

|Classification

|2019

|{{cite journal |last1=Javadi |first1=Soroush |last2=Mirroshandel |first2=Seyed Abolghasem |title=A novel deep learning method for automatic assessment of human sperm images |journal=Computers in Biology and Medicine |date=June 2019 |volume=109 |pages=182–194 |doi=10.1016/j.compbiomed.2019.04.030 |pmid=31059902 }}{{cite web|url=https://github.com/soroushj/mhsma-dataset|title=soroushj/mhsma-dataset: MHSMA: The Modified Human Sperm Morphology Analysis Dataset|website=github.com|access-date=2019-05-03}}

|S. Javadi and S.A. Mirroshandel

= Animal =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Abalone Dataset

|Physical measurements of abalone. Weather patterns and location are also given.

|None.

|4177

|Text

|Regression

|1995

|Clark, David, Zoltan Schreter, and Anthony Adams. "A quantitative comparison of dystal and backpropagation." Proceedings of 1996 Australian Conference on Neural Networks. 1996.

|Marine Research Laboratories – Taroona

Zoo Dataset

|Artificial dataset covering 7 classes of animals.

|Animals are classed into 7 categories and features are given for each.

|101

|Text

|Classification

|1990

|Jiang, Yuan, and Zhi-Hua Zhou. "[https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/isnn04a.pdf Editing training data for kNN classifiers with neural network ensemble]." Advances in Neural Networks–ISNN 2004. Springer Berlin Heidelberg, 2004. 356–361.

|R. Forsyth

Demospongiae Dataset

|Data about marine sponges.

|503 sponges in the Demosponge class are described by various features.

|503

|Text

|Classification

|2010

|{{cite book |doi=10.1007/978-3-642-02998-1_18 |chapter=On Similarity Measures Based on a Refinement Lattice |title=Case-Based Reasoning Research and Development |series=Lecture Notes in Computer Science |date=2009 |last1=Ontañón |first1=Santiago |last2=Plaza |first2=Enric |volume=5650 |pages=240–255 |isbn=978-3-642-02997-4 }}

|E. Armengol et al.

Farm animals data

|PLF data inventory (cows, pigs; location, acceleration, etc.).

|Labeled datasets.

|List is constantly updated

|Text

|Classification

|2020

|{{Cite web|url=https://github.com/Animal-Data-Inventory/PLFDataInventory|title=PLF data inventory|website=GitHub|date=5 November 2021}}

|V. Bloch

Splice-junction Gene Sequences Dataset

|Primate splice-junction gene sequences (DNA) with associated imperfect domain theory.

|None.

|3190

|Text

|Classification

|1992

|G. Towell et al.

Mice Protein Expression Dataset

|Expression levels of 77 proteins measured in the cerebral cortex of mice.

|None.

|1080

|Text

|Classification, Clustering

|2015

|{{cite journal | last1 = Higuera | first1 = Clara | last2 = Gardiner | first2 = Katheleen J. | last3 = Cios | first3 = Krzysztof J. | year = 2015 | title = Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome | journal = PLOS ONE | volume = 10 | issue = 6| page = e0129126 | doi=10.1371/journal.pone.0129126| pmid = 26111164 | pmc = 4482027 | bibcode = 2015PLoSO..1029126H | doi-access = free }}{{cite journal | last1 = Ahmed | first1 = Md Mahiuddin | display-authors = et al | year = 2015 | title = Protein dynamics associated with failed and rescued learning in the Ts65Dn mouse model of Down syndrome | journal = PLOS ONE | volume = 10 | issue = 3| page = e0119491 | doi=10.1371/journal.pone.0119491| pmid = 25793384 | pmc = 4368539 | bibcode = 2015PLoSO..1019491A | doi-access = free }}

|C. Higuera et al.

= Fungi =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

UCI Mushroom Dataset

|Mushroom attributes and classification.

|Many properties of each mushroom are given.

|8124

|Text

|Classification

|1987

|{{cite journal|last1=Langley|first1=PAT|year=2014|title=Trading off simplicity and coverage in incremental concept learning|url=https://www.westmont.edu/~iba/pubs/hillary-paper.pdf|journal=Machine Learning Proceedings|volume=1988|page=73|access-date=6 August 2019|archive-date=6 August 2019|archive-url=https://web.archive.org/web/20190806184005/https://www.westmont.edu/~iba/pubs/hillary-paper.pdf|url-status=dead}}

|J. Schlimmer

Secondary Mushroom Dataset

|Mushroom attributes and classification

|Simulated data from larger and more realistic primary mushroom entries. Fully reproducible.

|61069

|Text

|Classification

|2020

|{{Cite web|title=Mushroom Data Set 2020|url=https://mushroom.mathematik.uni-marburg.de/|access-date=2021-04-06|website=mushroom.mathematik.uni-marburg.de}}{{cite journal |last1=Wagner |first1=Dennis |last2=Heider |first2=Dominik |last3=Hattab |first3=Georges |title=Mushroom data creation, curation, and simulation to support classification tasks |journal=Scientific Reports |date=14 April 2021 |volume=11 |issue=1 |page=8134 |doi=10.1038/s41598-021-87602-3 |pmid=33854157 |pmc=8046754 |bibcode=2021NatSR..11.8134W }}

|D. Wagner et al.

= Plant =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Forest Fires Dataset

|Forest fires and their properties.

|13 features of each fire are extracted.

|517

|Text

|Regression

|2008

|Cortez, Paulo, and Aníbal de Jesus Raimundo Morais. "A data mining approach to predict forest fires using meteorological data." (2007).{{cite journal | last1 = Farquad | first1 = M. A. H. | last2 = Ravi | first2 = V. | last3 = Raju | first3 = S. Bapi | year = 2010 | title = Support vector regression based hybrid rule extraction methods for forecasting | journal = Expert Systems with Applications | volume = 37 | issue = 8| pages = 5577–5589 | doi=10.1016/j.eswa.2010.02.055}}

|P. Cortez et al.

Iris Dataset

|Three types of iris plants are described by 4 different attributes.

|None.

|150

|Text

|Classification

|1936

|{{cite journal | last1 = Fisher | first1 = Ronald A | year = 1936 | title = The use of multiple measurements in taxonomic problems | journal = Annals of Eugenics | volume = 7 | issue = 2| pages = 179–188 | doi=10.1111/j.1469-1809.1936.tb02137.x| hdl = 2440/15227 | hdl-access = free }}Ghahramani, Zoubin, and Michael I. Jordan. "[http://papers.nips.cc/paper/767-supervised-learning-from-incomplete-data-via-an-em-approach.pdf Supervised learning from incomplete data via an EM approach] {{Webarchive|url=https://web.archive.org/web/20170422085455/http://papers.nips.cc/paper/767-supervised-learning-from-incomplete-data-via-an-em-approach.pdf |date=22 April 2017 }}." Advances in neural information processing systems 6. 1994.

|R. Fisher

Plant Species Leaves Dataset

|Sixteen leaf samples for each of one hundred plant species.

|Shape descriptor, fine-scale margin, and texture histograms are given.

|1600

|Text

|Classification

|2012

|{{cite book |doi=10.2316/P.2013.798-098 |chapter=Plant Leaf Classification using Probabilistic Integration of Shape, Texture and Margin Features |title=Computer Graphics and Imaging / 798: Signal Processing, Pattern Recognition and Applications |date=2013 |last1=Mallah |first1=Charles |last2=Cope |first2=James |last3=Orwell |first3=James |isbn=978-0-88986-944-8 }}{{cite book |doi=10.1109/ICME.2012.130 |chapter=Leaf Shape Descriptor for Tree Species Identification |title=2012 IEEE International Conference on Multimedia and Expo |date=2012 |last1=Yahiaoui |first1=Itheri |last2=Mzoughi |first2=Olfa |last3=Boujemaa |first3=Nozha |pages=254–259 |isbn=978-1-4673-1659-0 }}

|J. Cope et al.

Soybean Dataset

|Database of diseased soybean plants.

|35 features for each plant are given. Plants are classified into 19 categories.

|307

|Text

|Classification

|1988

|{{cite book |doi=10.1016/B978-0-934613-64-4.50018-9 |chapter=Using Weighted Networks to Represent Classification Knowledge in Noisy Domains |title=Machine Learning Proceedings 1988 |date=1988 |last1=Tan |first1=Ming |last2=Eshelman |first2=Larry |pages=121–134 |isbn=978-0-934613-64-4 }}

|R. Michalski et al.

Seeds Dataset

|Measurements of geometrical properties of kernels belonging to three different varieties of wheat.

|None.

|210

|Text

|Classification, clustering

|2012

|Charytanowicz, Małgorzata, et al. "[http://home.agh.edu.pl/~kulpi/publ/Charytanowicz_Niewczas_Kulczycki_Kowalski_Lukasik_Zak_-_Information_Technologies_in_Biomedicine_-_2010.pdf Complete gradient clustering algorithm for features analysis of x-ray images]." Information technologies in biomedicine. Springer Berlin Heidelberg, 2010. 15–24.{{cite journal | last1 = Sanchez | first1 = Mauricio A. | display-authors = et al | year = 2014 | title = Fuzzy granular gravitational clustering algorithm for multivariate data | journal = Information Sciences | volume = 279 | pages = 498–511 | doi=10.1016/j.ins.2014.04.005}}

|Charytanowicz et al.

Covertype Dataset

|Data for predicting forest cover type strictly from cartographic variables.

|Many geographical features given.

|581,012

|Text

|Classification

|1998

|{{cite journal |last1=Blackard |first1=Jock A. |last2=Dean |first2=Denis J. |title=Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables |journal=Computers and Electronics in Agriculture |date=December 1999 |volume=24 |issue=3 |pages=131–151 |doi=10.1016/s0168-1699(99)00046-0 |bibcode=1999CEAgr..24..131B }}{{cite book |last1=Fürnkranz |first1=Johannes |chapter=Round Robin Rule Learning |pages=146–153 |chapter-url=https://ofai.at/papers/oefai-tr-2001-02.pdf |editor1-last=Danyluk |editor1-first=Andrea Pohoreckyj |editor2-last=Brodley |editor2-first=Carla E. |title=Machine Learning: Proceedings of the Eighteenth International Conference (ICML 2001) : Williams College, June 28-July 1, 2001 |date=2001 |publisher=Morgan Kaufmann Publishers |isbn=978-1-55860-778-1 }}

|J. Blackard et al.

Abscisic Acid Signaling Network Dataset

|Data for a plant signaling network. The goal is to determine the set of rules that governs the network.

|None.

|300

|Text

|Causal-discovery

|2008

|{{cite journal | last1 = Li | first1 = Song | last2 = Assmann | first2 = Sarah M. | last3 = Albert | first3 = Réka | year = 2006 | title = Predicting essential components of signal transduction networks: a dynamic model of guard cell abscisic acid signaling | journal = PLOS Biol | volume = 4 | issue = 10| page = e312 | doi=10.1371/journal.pbio.0040312| pmid = 16968132 | pmc = 1564158 | bibcode = 2006q.bio....10012L | arxiv = q-bio/0610012 | doi-access = free }}

|J. Jenkens et al.

Folio Dataset

|20 photos of leaves for each of 32 species.

|None.

|637

|Images, text

|Classification, clustering

|2015

|{{cite journal | last1 = Munisami | first1 = Trishen | display-authors = et al | year = 2015 | title = Plant Leaf Recognition Using Shape Features and Colour Histogram with K-nearest Neighbour Classifiers | journal = Procedia Computer Science | volume = 58 | pages = 740–747 | doi=10.1016/j.procs.2015.08.095| doi-access = free }}{{cite journal | last1 = Li | first1 = Bai | year = 2016 | title = Atomic potential matching: An evolutionary target recognition approach based on edge features | journal = Optik | volume = 127 | issue = 5| pages = 3162–3168 | doi=10.1016/j.ijleo.2015.11.186| bibcode = 2016Optik.127.3162L }}

|T. Munisami et al.

Oxford Flower Dataset

|17 category dataset of flowers.

|Train/test splits and labeled images.

|1360

|Images, text

|Classification

|2006

|Razavian, Ali, et al. "[https://www.cv-foundation.org/openaccess/content_cvpr_workshops_2014/W15/papers/Razavian_CNN_Features_Off-the-Shelf_2014_CVPR_paper.pdf CNN features off-the-shelf: an astounding baseline for recognition]." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2014. Nilsback, Maria-Elena, and Andrew Zisserman. "[http://www.robots.ox.ac.uk/~men/papers/nilsback_cvpr06.pdf A visual vocabulary for flower classification]." Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. IEEE, 2006.

|M-E Nilsback et al.

Plant Seedlings Dataset

|12 category dataset of plant seedlings.

|Labelled images and segmented images.

|5544

|Images

|Classification, detection

|2017

|{{cite arXiv | last1 = Giselsson | first1 = Thomas M. | display-authors = et al | year = 2017 | title = A Public Image Database for Benchmark of Plant Seedling Classification Algorithms | eprint= 1711.05458 | class = cs.CV }}

|Giselsson et al.

Fruits-360

|Database with images of 194 fruits, vegetables, nuts and seeds.

|100x100 pixels, white background.

|132739

|Images (jpg)

|Classification

|2017–2025

|{{cite web|last1 = Oltean| first1 = Mihai | year = 2017 | title = Fruits-360 dataset| website = GitHub | url = https://www.github.com/fruits-360}}

|Mihai Oltean

Weed-ID.App

|Database with 1,025 species, 13,500+ images, and 120,000+ characteristics

|Varying size and background. Labeled by PhD botanist.

|13,500

|Images, text

|Classification

|1999–2024

|{{cite web|last1 = Old| first1 = Richard | year = 2024 | title = Weed-ID.App dataset | url = https://weed-id.app}}

|Richard Old

CottonWeedDet3 Dataset

|A 3-class weed detection dataset for cotton cropping systems

|3 species of weeds.

|848

|Images

|Classification

|2022

|{{cite journal |last1=Rahman |first1=Abdur |last2=Lu |first2=Yuzhen |last3=Wang |first3=Haifeng |title=Performance evaluation of deep learning object detectors for weed detection for cotton |journal=Smart Agricultural Technology |date=February 2023 |volume=3 |pages=100126 |doi=10.1016/j.atech.2022.100126 }}

|Rahman et al.

= Microbe =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Ecoli Dataset

|Protein localization sites.

|Various features of the protein localization sites are given.

|336

|Text

|Classification

|1996

|{{cite journal | last1 = Nakai | first1 = Kenta | last2 = Kanehisa | first2 = Minoru | year = 1991 | title = Expert system for predicting protein localization sites in gram-negative bacteria | journal = Proteins: Structure, Function, and Bioinformatics | volume = 11 | issue = 2| pages = 95–110 | doi=10.1002/prot.340110203| pmid = 1946347 | s2cid = 27606447 }}Ling, Charles X., et al. "[https://cling.csd.uwo.ca/cs860/ICML04-Ling.pdf Decision trees with minimal costs]." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.

|K. Nakai et al.

MicroMass Dataset

|Identification of microorganisms from mass-spectrometry data.

|Various mass spectrometer features.

|931

|Text

|Classification

|2013

|{{cite journal |last1=Mahé |first1=Pierre |last2=Arsac |first2=Maud |last3=Chatellier |first3=Sonia |last4=Monnin |first4=Valérie |last5=Perrot |first5=Nadine |last6=Mailler |first6=Sandrine |last7=Girard |first7=Victoria |last8=Ramjeet |first8=Mahendrasingh |last9=Surre |first9=Jérémy |last10=Lacroix |first10=Bruno |last11=van Belkum |first11=Alex |last12=Veyrieras |first12=Jean-Baptiste |title=Automatic identification of mixed bacterial species fingerprints in a MALDI-TOF mass-spectrum |journal=Bioinformatics |date=May 2014 |volume=30 |issue=9 |pages=1280–1286 |doi=10.1093/bioinformatics/btu022 |pmid=24443381 }}{{cite journal | last1 = Barbano | first1 = Duane | display-authors = et al | year = 2015 | title = Rapid characterization of microalgae and microalgae mixtures using matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) | journal = PLOS ONE | volume = 10 | issue = 8| page = e0135337 | doi=10.1371/journal.pone.0135337| pmid = 26271045 | pmc = 4536233 | bibcode = 2015PLoSO..1035337B | doi-access = free }}

|P. Mahé et al.

Yeast Dataset

|Prediction of cellular localization sites of proteins.

|Eight features given per instance.

|1484

|Text

|Classification

|1996

|{{cite journal | last1 = Horton | first1 = Paul | last2 = Nakai | first2 = Kenta | year = 1996 | title = A probabilistic classification system for predicting the cellular localization sites of proteins | url = https://www.aaai.org/Papers/ISMB/1996/ISMB96-012.pdf | journal = ISMB-96 Proceedings | volume = 4 | pages = 109–15 | pmid = 8877510 | access-date = 6 August 2019 | archive-date = 4 November 2021 | archive-url = https://web.archive.org/web/20211104042943/https://www.aaai.org/Papers/ISMB/1996/ISMB96-012.pdf | url-status = dead }}{{cite journal | last1 = Allwein | first1 = Erin L. | last2 = Schapire | first2 = Robert E. | last3 = Singer | first3 = Yoram | year = 2001 | title = Reducing multiclass to binary: A unifying approach for margin classifiers | url = http://www.jmlr.org/papers/volume1/allwein00a/allwein00a.pdf| journal = The Journal of Machine Learning Research | volume = 1 | pages = 113–141 }}

|K. Nakai et al.

= Drug discovery =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Tox21 Dataset

|Prediction of outcome of biological assays.

|Chemical descriptors of molecules are given.

|12707

|Text

|Classification

|2016

|{{cite journal | last1 = Mayr | first1 = Andreas | last2 = Klambauer | first2 = Guenter | last3 = Unterthiner | first3 = Thomas | last4 = Hochreiter | first4 = Sepp | year = 2016 | title = DeepTox: Toxicity Prediction Using Deep Learning | url = http://bioinf.jku.at/research/DeepTox/tox21.html | journal = Frontiers in Environmental Science | volume = 3 | page = 80 | doi=10.3389/fenvs.2015.00080| doi-access = free | bibcode = 2016FrEnS...3...80M }}

|A. Mayr et al.

Anomaly data

style="width: 100%" class="wikitable sortable"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Numenta Anomaly Benchmark (NAB)

|Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.

|None

|50+ files

|CSV

|Anomaly detection

|2016 (continually updated)

|{{cite book |last1=Lavin |first1=Alexander |last2=Ahmad |first2=Subutai |title=2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA) |chapter=Evaluating Real-Time Anomaly Detection Algorithms -- the Numenta Anomaly Benchmark |arxiv=1510.03336 |pages=38–44 |date=12 October 2015 |doi=10.1109/ICMLA.2015.141 |isbn=978-1-5090-0287-0 |s2cid=6842305 }}

|Numenta

Skoltech Anomaly Benchmark (SKAB)

|Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed.

|Two sets of labels are provided: one for outlier detection (point anomalies) and one for changepoint detection (collective anomalies).

|30+ files (v0.9)

|CSV

|Anomaly detection

|2020 (continually updated)

{{cite web |author1=Iurii D. Katser |author2=Vyacheslav O. Kozitsin |title=SKAB GitHub repository |website=GitHub |url=https://github.com/waico/skab |access-date=12 January 2021}}

{{cite journal |author1=Iurii D. Katser |author2=Vyacheslav O. Kozitsin |title=Skoltech Anomaly Benchmark (SKAB) |publisher=Kaggle |year=2020 |doi=10.34740/KAGGLE/DSV/1693952 |url=https://www.kaggle.com/yuriykatser/skoltech-anomaly-benchmark-skab |access-date=12 January 2021}}

|Iurii D. Katser and Vyacheslav O. Kozitsin

On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study

|Most data files are adapted from UCI Machine Learning Repository data, some are collected from the literature.

|Treated for missing values; numerical attributes only; different percentages of anomalies; labels.

|1000+ files

|ARFF

|Anomaly detection

|2016 (possibly updated with new datasets and/or results)

{{cite journal |last1=Campos |first1=Guilherme O. |last2=Zimek |first2=Arthur |last3=Sander |first3=Jörg |last4=Campello |first4=Ricardo J. G. B. |last5=Micenková |first5=Barbora |last6=Schubert |first6=Erich |last7=Assent |first7=Ira |last8=Houle |first8=Michael E. |title=On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study |journal=Data Mining and Knowledge Discovery |date=July 2016 |volume=30 |issue=4 |pages=891–927 |doi=10.1007/s10618-015-0444-8 }}

|Campos et al.
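
As a rough illustration of the ordered, timestamped, single-valued CSV layout described for NAB above, the following Python sketch parses such data and flags values that lie far from the mean. The column names `timestamp` and `value`, the sample rows, and the z-score rule are assumptions made for illustration; this is a naive baseline, not the scoring method of any benchmark listed here.

```python
import csv
import io
import statistics

# Hypothetical sample in an assumed "timestamp,value" layout.
SAMPLE_CSV = """timestamp,value
2020-01-01 00:00:00,10.0
2020-01-01 00:05:00,10.0
2020-01-01 00:10:00,10.0
2020-01-01 00:15:00,10.0
2020-01-01 00:20:00,10.0
2020-01-01 00:25:00,10.0
2020-01-01 00:30:00,100.0
2020-01-01 00:35:00,10.0
2020-01-01 00:40:00,10.0
2020-01-01 00:45:00,10.0
"""


def load_series(text):
    """Parse CSV text into a list of (timestamp, value) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["timestamp"], float(row["value"])) for row in reader]


def flag_outliers(series, threshold=3.0):
    """Return timestamps whose value is more than `threshold`
    population standard deviations away from the mean."""
    values = [value for _, value in series]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # a constant series has no outliers under this rule
    return [ts for ts, value in series if abs(value - mean) / stdev > threshold]


if __name__ == "__main__":
    series = load_series(SAMPLE_CSV)
    print(flag_outliers(series, threshold=2.0))
```

A global z-score ignores temporal structure, so it is only a starting point; the benchmarks above are intended for evaluating detectors that take ordering and context into account.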

Question answering data

This section includes datasets that deal with structured data.

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

DBpedia Neural Question Answering (DBNQA) Dataset

|A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base.

|This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts.

|894,499

|Question-query pairs

|Question Answering

|2018

|Ann-Kathrin Hartmann, Tommaso Soru, Edgard Marx. [https://www.researchgate.net/publication/324482598_Generating_a_Large_Dataset_for_Neural_Question_Answering_over_the_DBpedia_Knowledge_Base Generating a Large Dataset for Neural Question Answering over the DBpedia Knowledge Base]. 2018.{{cite report |type=Preprint |last1=Soru |first1=Tommaso |last2=Marx |first2=Edgard |last3=Moussallem |first3=Diego |last4=Publio |first4=Gustavo |last5=Valdestilhas |first5=André |last6=Esteves |first6=Diego |last7=Neto |first7=Ciro Baron |title=SPARQL as a Foreign Language |date=2017 |arxiv=1708.07624 }}

|Hartmann, Soru, and Marx et al.

Vietnamese Question Answering Dataset (UIT-ViQuAD)

|A large collection of Vietnamese questions for evaluating MRC models.

|This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.

|23,074

|Question-answer pairs

|Question Answering

|2020

|Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen. [https://www.aclweb.org/anthology/2020.coling-main.233.pdf A Vietnamese Dataset for Evaluating Machine Reading Comprehension]. COLING 2020.

|Nguyen et al.

Vietnamese Multiple-Choice Machine Reading Comprehension Corpus (ViMMRC)

|A collection of Vietnamese multiple-choice questions for evaluating MRC models.

|This corpus includes 2,783 Vietnamese multiple-choice questions.

|2,783

|Question-answer pairs

|Question Answering/Machine Reading Comprehension

|2020

|{{cite journal |last1=Nguyen |first1=Kiet Van |last2=Tran |first2=Khiem Vinh |last3=Luu |first3=Son T. |last4=Nguyen |first4=Anh Gia-Tuan |last5=Nguyen |first5=Ngan Luu-Thuy |title=Enhancing Lexical-Based Approach With External Knowledge for Vietnamese Multiple-Choice Machine Reading Comprehension |journal=IEEE Access |date=2020 |volume=8 |pages=201404–201417 |doi=10.1109/ACCESS.2020.3035701 |bibcode=2020IEEEA...8t1404N }}

|Nguyen et al.

Open-Domain Question Answering Goes Conversational via Question Rewriting

|A dataset for end-to-end open-domain conversational question answering.

|This dataset includes 14,000 conversations with 81,000 question-answer pairs.

|Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source

Further details are provided in the [https://github.com/apple/ml-qrecc project's GitHub repository] and respective [https://huggingface.co/datasets/svakulenk0/qrecc Hugging Face dataset card].

|Question Answering

|2021

|{{cite arXiv | eprint=2010.04898 | last1=Anantha | first1=Raviteja | last2=Vakulenko | first2=Svitlana | last3=Tu | first3=Zhucheng | last4=Longpre | first4=Shayne | last5=Pulman | first5=Stephen | last6=Chappidi | first6=Srinivas | title=Open-Domain Question Answering Goes Conversational via Question Rewriting | year=2020 | class=cs.IR }}

|Anantha and Vakulenko et al.

UnifiedQA

|Question-answer data

|Processed dataset

|Question Answering

|2020

|{{Cite journal |last1=Khashabi |first1=Daniel |last2=Min |first2=Sewon |last3=Khot |first3=Tushar |last4=Sabharwal |first4=Ashish |last5=Tafjord |first5=Oyvind |last6=Clark |first6=Peter |last7=Hajishirzi |first7=Hannaneh |date=November 2020 |title=UNIFIEDQA: Crossing Format Boundaries with a Single QA System |url=https://aclanthology.org/2020.findings-emnlp.171 |journal=Findings of the Association for Computational Linguistics: EMNLP 2020 |location=Online |publisher=Association for Computational Linguistics |pages=1896–1907 |doi=10.18653/v1/2020.findings-emnlp.171|arxiv=2005.00700 |s2cid=218487109 }}

|Khashabi et al.
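
The conversational format listed above for the question-rewriting dataset (fields `Context`, `Question`, `Rewrite`, `Answer`, `Answer_URL`, `Conversation_no`, `Turn_no`, `Conversation_source`) can be handled with a few lines of Python. The sketch below groups records by conversation and orders them by turn. The field names come from the table; the record values and the value types are hypothetical placeholders, not actual dataset content.

```python
from collections import defaultdict

# Hypothetical records using the field names listed in the table.
SAMPLE_RECORDS = [
    {"Conversation_no": 1, "Turn_no": 2,
     "Context": ["first question", "first answer"],
     "Question": "follow-up question", "Rewrite": "self-contained follow-up",
     "Answer": "second answer", "Answer_URL": "", "Conversation_source": "demo"},
    {"Conversation_no": 1, "Turn_no": 1,
     "Context": [],
     "Question": "first question", "Rewrite": "first question",
     "Answer": "first answer", "Answer_URL": "", "Conversation_source": "demo"},
    {"Conversation_no": 2, "Turn_no": 1,
     "Context": [],
     "Question": "another question", "Rewrite": "another question",
     "Answer": "another answer", "Answer_URL": "", "Conversation_source": "demo"},
]


def group_by_conversation(records):
    """Group records by Conversation_no and sort each group by Turn_no."""
    conversations = defaultdict(list)
    for record in records:
        conversations[record["Conversation_no"]].append(record)
    for turns in conversations.values():
        turns.sort(key=lambda record: record["Turn_no"])
    return dict(conversations)


if __name__ == "__main__":
    for number, turns in group_by_conversation(SAMPLE_RECORDS).items():
        print(number, [turn["Rewrite"] for turn in turns])
```

Keeping each conversation's turns in order makes it straightforward to compare the original `Question` with its context-independent `Rewrite` at every turn.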

Dialog or instruction prompted data

This section includes datasets that contain multi-turn text with at least two actors, a "user" and an "agent". The user makes requests of the agent, which carries them out.
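
As a minimal sketch of the user/agent structure described here, the following Python snippet pairs each user request with the agent reply that follows it. The `speaker` and `text` keys and the example dialog are hypothetical; each dataset in the table below defines its own schema.

```python
# Hypothetical multi-turn dialog; the key names are assumptions for illustration.
DIALOG = [
    {"speaker": "user", "text": "Book a table for two at 7 pm."},
    {"speaker": "agent", "text": "Your table for two is booked for 7 pm."},
    {"speaker": "user", "text": "Also order a taxi."},
    {"speaker": "agent", "text": "A taxi has been ordered."},
]


def user_agent_pairs(turns):
    """Pair each user request with the next agent reply."""
    pairs = []
    pending_request = None
    for turn in turns:
        if turn["speaker"] == "user":
            pending_request = turn["text"]
        elif turn["speaker"] == "agent" and pending_request is not None:
            pairs.append((pending_request, turn["text"]))
            pending_request = None
    return pairs


if __name__ == "__main__":
    for request, reply in user_agent_pairs(DIALOG):
        print(f"{request!r} -> {reply!r}")
```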

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Taskmaster

|"The Taskmaster corpus consists of THREE datasets, Taskmaster-1 (TM-1), Taskmaster-2 (TM-2), and Taskmaster-3 (TM-3), comprising over 55,000 spoken and written task-oriented dialogs in over a dozen domains."{{Citation |title=Taskmaster |date=2022-12-17 |url=https://github.com/google-research-datasets/Taskmaster |publisher=Google Research Datasets |access-date=2023-01-07}}

|Taskmaster-1: goal-oriented conversational dataset. It includes 13,215 task-based dialogs comprising six domains.

Taskmaster-2: 17,289 dialogs in seven domains (restaurants, food ordering, movies, hotels, flights, music and sports).

Taskmaster-3: 23,757 movie ticketing dialogs.

|Taskmaster-1 and Taskmaster-2: conversation id, utterances, Instruction id

Taskmaster-3: conversation id, utterances, vertical, scenario, instructions.

For further details check the [https://github.com/google-research-datasets/Taskmaster project's GitHub repository] or the Hugging Face dataset cards ([https://huggingface.co/datasets/taskmaster1 taskmaster-1], [https://huggingface.co/datasets/taskmaster2 taskmaster-2], [https://huggingface.co/datasets/taskmaster3 taskmaster-3]).

|Dialog/Instruction prompted

|2019

|{{Cite arXiv |last1=Byrne |first1=Bill |last2=Krishnamoorthi |first2=Karthik |last3=Sankar |first3=Chinnadhurai |last4=Neelakantan |first4=Arvind |last5=Duckworth |first5=Daniel |last6=Yavuz |first6=Semih |last7=Goodrich |first7=Ben |last8=Dubey |first8=Amit |last9=Cedilnik |first9=Andy |last10=Kim |first10=Kyu-Young |date=2019-09-01 |title=Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset |class=cs.CL |eprint=1909.05358 }}

|Byrne and Krishnamoorthi et al.

DrRepair

|A labeled dataset for program repair.

|Pre-processed data

|Check format details in the [https://worksheets.codalab.org/worksheets/0x01838644724a433c932bef4cb5c42fbd project's worksheet].

|Dialog/Instruction prompted

|2020

|{{Cite journal |last1=Yasunaga |first1=Michihiro |last2=Liang |first2=Percy |date=2020-11-21 |title=Graph-based, Self-Supervised Program Repair from Diagnostic Feedback |url=https://proceedings.mlr.press/v119/yasunaga20a.html |journal=International Conference on Machine Learning |language=en |publisher=PMLR |pages=10799–10808|arxiv=2005.10636 }}

|Yasunaga et al.

Natural Instructions v2

|Large dataset that covers a wide range of reasoning abilities.

|Each task consists of input/output pairs and a task definition.

Further information is provided in the [https://github.com/allenai/natural-instructions GitHub repository] of the project and the [https://huggingface.co/datasets/Muennighoff/natural-instructions Hugging Face data card].

|Input/Output and task definition

|2022

|{{Cite arXiv |last1=Wang |first1=Yizhong |last2=Mishra |first2=Swaroop |last3=Alipoormolabashi |first3=Pegah |last4=Kordi |first4=Yeganeh |last5=Mirzaei |first5=Amirreza |last6=Arunkumar |first6=Anjana |last7=Ashok |first7=Arjun |last8=Dhanasekaran |first8=Arut Selvan |last9=Naik |first9=Atharva |last10=Stap |first10=David |last11=Pathak |first11=Eshaan |last12=Karamanolakis |first12=Giannis |last13=Lai |first13=Haizhi Gary |last14=Purohit |first14=Ishan |last15=Mondal |first15=Ishani |date=2022-10-24 |title=Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks |class=cs.CL |eprint=2204.07705 }}

|Wang et al.

LAMBADA

|"LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word."{{Citation |last1=Paperno |first1=Denis |title=The LAMBADA dataset |date=2016-08-07 |url=https://zenodo.org/record/2630551 |access-date=2023-01-07 |last2=Kruszewski |first2=Germán |last3=Lazaridou |first3=Angeliki |last4=Pham |first4=Quan Ngoc |last5=Bernardi |first5=Raffaella |last6=Pezzelle |first6=Sandro |last7=Baroni |first7=Marco |last8=Boleda |first8=Gemma |last9=Fernández |first9=Raquel|doi=10.5281/zenodo.2630551 }}

|Information about this dataset's format is available in the [https://huggingface.co/datasets/lambada HuggingFace dataset card] and the [https://zenodo.org/record/2630551#.Y7uPquzMKNi project's website].

The dataset can be downloaded [https://zenodo.org/record/2630551/files/lambada-dataset.tar.gz here], and the rejected data [https://zenodo.org/record/2630551/files/rejected-data1.tar.gz here].

|2016

|{{Cite journal |last1=Paperno |first1=Denis |last2=Kruszewski |first2=Germán |last3=Lazaridou |first3=Angeliki |last4=Pham |first4=Ngoc Quan |last5=Bernardi |first5=Raffaella |last6=Pezzelle |first6=Sandro |last7=Baroni |first7=Marco |last8=Boleda |first8=Gemma |last9=Fernández |first9=Raquel |date=August 2016 |title=The LAMBADA dataset: Word prediction requiring a broad discourse context |url=https://aclanthology.org/P16-1144 |journal=Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |location=Berlin, Germany |publisher=Association for Computational Linguistics |pages=1525–1534 |doi=10.18653/v1/P16-1144|hdl=10230/32702 |s2cid=2381275 }}

|Paperno et al.

FLAN

|A re-preprocessed version of the FLAN dataset, with updates made since the original release, is available on [https://huggingface.co/datasets/Muennighoff/flan Hugging Face]:

[https://huggingface.co/datasets/Muennighoff/flan/tree/main/test test data]
[https://huggingface.co/datasets/Muennighoff/flan/tree/main/train train data]
[https://huggingface.co/datasets/Muennighoff/flan/tree/main/validation validation data]

The scripts to process the data are available in the GitHub repo mentioned in the paper: https://github.com/google-research/FLAN/tree/main/flan.

Another [https://github.com/Muennighoff/FLAN FLAN GitHub repo] was created as well. This is the one associated with the dataset card in Hugging Face.

|2021

|{{Cite report |type=Preprint |last1=Wei |first1=Jason |last2=Bosma |first2=Maarten |last3=Zhao |first3=Vincent |last4=Guu |first4=Kelvin |last5=Yu |first5=Adams Wei |last6=Lester |first6=Brian |last7=Du |first7=Nan |last8=Dai |first8=Andrew M. |last9=Le |first9=Quoc V. |date=2022-02-10 |title=Finetuned Language Models are Zero-Shot Learners |arxiv=2109.01652 |url=https://openreview.net/forum?id=gEZrGCozdqR |language=en}}

|Wei et al.
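
The last-word prediction task described for LAMBADA above can be sketched in Python as follows. The passages, the whitespace-based split, and the `predict` callback are assumptions made for illustration; they do not reproduce the dataset's own tokenization or any published evaluation script.

```python
def split_passage(passage):
    """Split a passage into (context, target), where target is the last word."""
    context, _, target = passage.rstrip().rpartition(" ")
    return context, target


def last_word_accuracy(passages, predict):
    """Fraction of passages whose last word is predicted exactly.

    `predict` is any callable mapping a context string to a single word.
    """
    if not passages:
        return 0.0
    correct = 0
    for passage in passages:
        context, target = split_passage(passage)
        if predict(context) == target:
            correct += 1
    return correct / len(passages)


if __name__ == "__main__":
    # Hypothetical passages, not taken from the dataset.
    passages = [
        "She whistled twice and waited by the gate for her dog",
        "He searched every pocket but could not find the key",
    ]
    always_dog = lambda context: "dog"  # trivial stand-in for a language model
    print(last_word_accuracy(passages, always_dog))
```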

Cybersecurity

class="wikitable sortable sort-under"

!Dataset Name

!Brief description

!Preprocessing

!Instances

!Format

!Default Task

!Created (updated)

!Reference

!Creator

MITRE ATT&CK

|ATT&CK is a globally accessible knowledge base of adversary tactics and techniques.

|Data can be downloaded from these two GitHub repositories: [https://github.com/mitre-attack/attack-stix-data/archive/refs/heads/master.zip version 2.1] and [https://github.com/mitre/cti/archive/refs/heads/master.zip version 2.0]

|{{Cite web |title=Working with ATT&CK {{!}} MITRE ATT&CK® |url=https://attack.mitre.org/resources/working-with-attack/ |access-date=2023-01-14 |website=attack.mitre.org}}

|MITRE ATT&CK

CAPEC

|Common Attack Pattern Enumeration and Classification

|Data can be downloaded from [https://capec.mitre.org/data/archive/capec_latest.zip CAPEC's website]:

[https://capec.mitre.org/data/csv/1000.csv.zip Mechanisms of Attack]

[https://capec.mitre.org/data/csv/3000.csv.zip Domains of Attack]

|{{Cite web |title=CAPEC - Common Attack Pattern Enumeration and Classification (CAPEC™) |url=https://capec.mitre.org/ |access-date=2023-01-14 |website=capec.mitre.org}}

|CAPEC

CVE

|CVE is a list of publicly disclosed cybersecurity vulnerabilities that is free to search, use, and incorporate into products and services.

|Data can be downloaded from: [https://cve.mitre.org/data/downloads/allitems.csv Allitems]

|{{Cite web |title=CVE - Home |url=https://cve.mitre.org/cve/ |access-date=2023-01-14 |website=cve.mitre.org}}

|CVE

CWE

|Common Weakness Enumeration data.

|Data can be downloaded from:

[https://cwe.mitre.org/data/csv/699.csv.zip Software Development]

[https://cwe.mitre.org/1194.csv.zip Hardware Design]{{Dead link|date=September 2023 |bot=InternetArchiveBot |fix-attempted=yes }}

[https://cwe.mitre.org/data/csv/1000.csv.zip Research Concepts]

|{{Cite web |title=CWE - Common Weakness Enumeration |url=https://cwe.mitre.org/index.html |access-date=2023-01-14 |website=cwe.mitre.org}}

|CWE

MalwareTextDB

|Annotated database of malware texts.

|The [https://github.com/statnlp-research/statnlp-datasets/tree/master/dataset GitHub repository of the project] contains the data to download.

|{{Cite journal |last1=Lim |first1=Swee Kiat |last2=Muis |first2=Aldrian Obaja |last3=Lu |first3=Wei |last4=Ong |first4=Chen Hui |date=July 2017 |title=MalwareTextDB: A Database for Annotated Malware Articles |url=https://aclanthology.org/P17-1143 |journal=Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |location=Vancouver, Canada |publisher=Association for Computational Linguistics |pages=1557–1567 |doi=10.18653/v1/P17-1143|s2cid=7816596 }}

|Lim et al.

USENIX Security Symposium proceedings

|Collection of security proceedings from USENIX Security Symposium – technical sessions from 1995 to 2022.

|This data is not pre-processed.

|[https://www.usenix.org/legacy/publications/library/proceedings/security95/index.html 1995], [https://www.usenix.org/legacy/publications/library/proceedings/sec96/ 1996], [https://www.usenix.org/legacy/publications/library/proceedings/ana97/technical.html 1997], [https://www.usenix.org/legacy/publications/library/proceedings/sec98/technical.html 1998], [https://www.usenix.org/legacy/events/sec99/technical.html 1999], [https://www.usenix.org/legacy/events/sec2000/tech.html 2000], [https://www.usenix.org/legacy/events/sec2001/tech.html 2001], [https://www.usenix.org/legacy/publications/library/proceedings/sec02/tech.html 2002], [https://www.usenix.org/legacy/publications/library/proceedings/sec03/tech.html 2003], [https://www.usenix.org/legacy/events/sec04/tech/ 2004], [https://www.usenix.org/legacy/events/sec05/tech/ 2005], [https://www.usenix.org/legacy/events/sec06/tech/ 2006], [https://www.usenix.org/legacy/events/sec07/tech/ 2007], [https://www.usenix.org/legacy/events/sec08/tech/#wed 2008],

[https://www.usenix.org/legacy/events/sec09/tech/ 2009], [https://www.usenix.org/legacy/events/sec10/tech/ 2010],

[https://static.usenix.org/event/sec11/tech/ 2011], [https://www.usenix.org/conference/usenixsecurity12/technical-sessions 2012], [https://www.usenix.org/conference/usenixsecurity13/technical-sessions 2013], [https://www.usenix.org/conference/usenixsecurity14/technical-sessions 2014], [https://www.usenix.org/conference/usenixsecurity15/technical-sessions 2015], [https://www.usenix.org/conference/usenixsecurity16/technical-sessions 2016], [https://www.usenix.org/conference/usenixsecurity17/technical-sessions 2017], [https://www.usenix.org/conference/usenixsecurity18/technical-sessions 2018], [https://www.usenix.org/conference/usenixsecurity19/technical-sessions 2019], [https://www.usenix.org/conference/usenixsecurity20/technical-sessions 2020], [https://www.usenix.org/conference/usenixsecurity21/technical-sessions 2021], [https://www.usenix.org/conference/usenixsecurity22/technical-sessions 2022].

|{{Cite web |title=USENIX |url=https://www.usenix.org/ |access-date=2023-01-19 |website=USENIX |language=en}}

|USENIX Security Symposium

APTNotes

|Collection of public documents, whitepapers and articles about APT campaigns. All the documents are publicly available data.

|This data is not pre-processed.

|The [https://github.com/aptnotes/data GitHub repository] of the project contains a file with links to the data stored in Box.

Data files can also be downloaded [https://github.com/ameza13/APTNotesData/ here].

|{{Cite web |title=APTnotes {{!}} Read the Docs |url=https://readthedocs.org/projects/aptnotes/ |access-date=2023-01-19 |website=readthedocs.org}}

|APT Notes

arXiv Cryptography and Security papers

|Collection of articles about cybersecurity

|This data is not pre-processed.

|All articles available [https://github.com/ameza13/Cryptography-and-Security here].

|{{Cite web |title=Cryptography and Security authors/titles recent submissions |url=https://arxiv.org/list/cs.CR/recent |access-date=2023-01-19 |website=arxiv.org |language=en}}

|arXiv

Security eBooks for free

|Small collection of publicly available security eBooks and security presentations.

|This data is not pre-processed.

|{{Cite web |title=Holistic Info-Sec for Web Developers - Fascicle 0 |url=https://f0.holisticinfosecforwebdevelopers.com/ |access-date=2023-01-20 |website=f0.holisticinfosecforwebdevelopers.com}}
{{Cite web |title=Holistic Info-Sec for Web Developers - Fascicle 1 |url=https://f1.holisticinfosecforwebdevelopers.com/ |access-date=2023-01-20 |website=f1.holisticinfosecforwebdevelopers.com}}
{{Cite web |last=Vincent |first=Adam |title=Web Services Hacking and Hardening |url=https://owasp.org/www-pdf-archive/Web_Services_Hacking_and_Hardening.pdf |website=owasp.org}}
{{Cite web |last=McCray |first=Joe |title=Advanced SQL Injection |url=https://defcon.org/images/defcon-17/dc-17-presentations/defcon-17-joseph_mccray-adv_sql_injection.pdf |website=defcon.org}}
{{Cite web |last=Shah |first=Shreeraj |title=Blind SQL injection discovery & exploitation technique |url=https://blueinfy.com/wp/blindsql.pdf |website=blueinfy.com}}
{{Cite web |last=Palcer |first=C. C. |title=Ethical hacking |url=https://blueinfy.com/wp/blindsql.pdf |website=textfiles}}
{{Cite web |title=Hacking Secrets Revealed - Information and Instructional Guide |url=https://www.onlinepot.org/security/HackersSecrets.pdf}}
{{Cite web |last=Park |first=Alexis |title=Hack any website |url=https://defcon.org/images/defcon-11/dc-11-presentations/dc-11-Gentil/dc-11-gentil.pdf}}
{{Cite web |last1=Cerrudo |first1=Cesar |last2=Martinez Fayo |first2=Esteban |title=Hacking Databases for Owning your Data |url=https://www.blackhat.com/presentations/bh-europe-07/Cerrudo/Whitepaper/bh-eu-07-cerrudo-WP-up.pdf |website=blackhat}}
{{Cite web |last=O'Connor |first=Tj. |title=Violent Python-A Cookbook for Hackers, Forensic Analysts, Penetration Testers and Security Engineers |url=https://github.com/reconSF/python/blob/master/Syngress.Violent.Python.a.Cookbook.for.Hackers.2013.pdf |website=GitHub}}
{{Cite web |last=Grand |first=Joe |title=Hardware Reverse Engineering: Access, Analyze, & Defeat |url=https://media.blackhat.com/bh-dc-11/Grand/BlackHat_DC_2011_Grand-Workshop.pdf |website=blackhat}}
{{Cite web |last=Chang |first=Jason V. |title=Computer Hacking: Making the Case for National Reporting Requirement |url=https://cyber.harvard.edu/sites/cyber.law.harvard.edu/files/ComputerHacking.pdf |website=cyber.harvard.edu}}

National Cyber Security strategy repository

|Repository of worldwide strategy documents about cybersecurity.

|This data is not pre-processed.

|{{Cite web |title=National Cybersecurity Strategies Repository |url=https://www.itu.int:443/en/ITU-D/Cybersecurity/Pages/National-Strategies-repository.aspx |access-date=2023-01-20 |website=ITU |language=en-US}}

Cyber Security Natural Language Processing

|Data about cybersecurity strategies from more than 75 countries.

|Tokenization and removal of frequent, uninformative words.

|{{Citation |last=Chen |first=Yanlin |title=Cyber Security Natural Language Processing |date=2022-08-31 |url=https://github.com/Ychen463/Cyber |access-date=2023-01-20}}

|Yanlin Chen, Yunjian Wei, Yifan Yu, Wen Xue, Xianya Qin

APT Reports collection

|Sample of APT reports, malware, technology, and intelligence collection

|Raw and tokenized data available.

|All data is available in this [https://github.com/blackorbird/APT_REPORT GitHub] repository.

|{{Citation needed| reason=deleted twitter link|date=October 2023}}

|blackorbird

Offensive Language Identification Dataset (OLID)

|Data available in the [https://sites.google.com/site/offensevalsharedtask/olid project's website].

Data is also available [https://github.com/ameza13/OLIDdataset here].

|{{Cite arXiv |last1=Zampieri |first1=Marcos |last2=Malmasi |first2=Shervin |last3=Nakov |first3=Preslav |last4=Rosenthal |first4=Sara |last5=Farra |first5=Noura |last6=Kumar |first6=Ritesh |date=2019-04-16 |title=Predicting the Type and Target of Offensive Posts in Social Media |class=cs.CL |eprint=1902.09666 }}

|Zampieri et al.

Cyber reports from the National Cyber Security Centre

|This data is not pre-processed.

|[https://www.ncsc.gov.uk/section/keep-up-to-date/threat-reports Threat reports], [https://www.ncsc.gov.uk/section/keep-up-to-date/reports-advisories reports and advisories], [https://www.ncsc.gov.uk/section/keep-up-to-date/ncsc-news news], [https://www.ncsc.gov.uk/section/keep-up-to-date/ncsc-blog blog posts], [https://www.ncsc.gov.uk/section/keep-up-to-date/all-speeches speeches].

[https://github.com/bee3202/cybersecurity-reports-ncsc Alternate list of reports].

|{{Cite web |title=Threat reports |url=https://www.ncsc.gov.uk/section/keep-up-to-date/threat-reports |access-date=2023-01-20 |website=www.ncsc.gov.uk |language=en}}

APT reports by Kaspersky

|This data is not pre-processed.

|{{Cite web |title=Category: APT reports {{!}} Securelist |url=https://securelist.com/category/apt-reports/ |access-date=2023-01-23 |website=securelist.com}}

The cyberwire

|This data is not pre-processed.

|[https://thecyberwire.com/newsletters Newsletters], [https://thecyberwire.com/podcasts podcasts], and [https://thecyberwire.com/stories stories].

|{{Cite web |title=Your Cybersecurity News Connection - Cyber News {{!}} CyberWire |url=https://thecyberwire.com/ |access-date=2023-01-23 |website=The CyberWire}}

Databreaches news

|This data is not pre-processed.

|[https://www.databreaches.net/news/ News], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/DATABREACHES.md list of news from Aug 2022 to Feb 2023]

|{{Cite web |title=News |date=21 August 2016 |url=https://www.databreaches.net/news/ |access-date=2023-01-23 |language=en-US}}

Cybernews

|This data is not pre-processed.

|[https://cybernews.com/news/ News], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/CYBERNEWS.md curated list of news]

|{{Cite web |title=Cybernews |url=https://cybernews.com/ |website=Cybernews}}

Bleepingcomputer

|This data is not pre-processed.

|[https://www.bleepingcomputer.com/ News]

|{{Cite web |title=BleepingComputer |url=https://www.bleepingcomputer.com/ |access-date=2023-01-23 |website=BleepingComputer |language=en-us}}

Therecord

|This data is not pre-processed.

|[https://therecord.media/news/cybercrime/ Cybercrime news]

|{{Cite web |title=Homepage |url=https://therecord.media/ |access-date=2023-01-23 |website=The Record from Recorded Future News |language=en}}

Hackread

|This data is not pre-processed.

|[https://www.hackread.com/hacking-news/ Hacking news]

|{{Cite web |date=2022-01-08 |title=HackRead {{!}} Latest Cyber Crime - InfoSec- Tech - Hacking News |url=https://www.hackread.com/ |access-date=2023-01-23 |language=en-US}}

Securelist

|This data is not pre-processed.

|[https://securelist.com/category/apt-reports/ APT reports], [https://securelist.com/category/archive/ archive], [https://securelist.com/category/ddos-reports/ DDOS reports], [https://securelist.com/category/incidents/ incidents], [https://securelist.com/category/kaspersky-security-bulletin/ Kaspersky security bulletin], [https://securelist.com/category/industrial-threats/ industrial threats], [https://securelist.com/category/malware-reports/ malware-reports], [https://securelist.com/category/opinions/ opinions], [https://securelist.com/category/publications/ publications], [https://securelist.com/category/research/ research], and [https://securelist.com/category/sas/ SAS].

|{{Cite web |title=Securelist {{!}} Kaspersky's threat research and reports |url=https://securelist.com/ |access-date=2023-01-31 |website=securelist.com}}

Stucco project

|The Stucco project collects data not typically integrated into security systems.

|This data is not pre-processed

|[https://stucco.github.io/data/ Project's website with data information], [https://github.com/bee3202/cybersecurity-data-sources reviewed source with links to data sources]

|{{Cite book |last1=Harshaw |first1=Christopher R. |last2=Bridges |first2=Robert A. |last3=Iannacone |first3=Michael D. |last4=Reed |first4=Joel W. |last5=Goodall |first5=John R. |title=Proceedings of the 11th Annual Cyber and Information Security Research Conference |chapter=GraphPrints |date=2016-04-05 |chapter-url=https://doi.org/10.1145/2897795.2897806 |series=CISRC '16 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=1–4 |doi=10.1145/2897795.2897806 |isbn=978-1-4503-3752-6}}

Farsightsecurity

|Website with technical information, reports, and more about security topics.

|This data is not pre-processed

|[https://www.farsightsecurity.com/technical/ Technical information], [https://www.farsightsecurity.com/research/ research], [https://www.farsightsecurity.com/reports/ reports].

|{{Cite web |title=Farsight Security, cyber security intelligence solutions |url=https://www.farsightsecurity.com/ |access-date=2023-02-13 |website=Farsight Security |language=en}}

Schneier

|Website with academic papers about security topics.

|This data is not pre-processed

|[https://www.schneier.com/academic/ Papers per category], [https://www.schneier.com/academic/archive/ papers archive by date].

|{{Cite web |title=Schneier on Security |url=https://www.schneier.com/ |access-date=2023-02-13 |website=www.schneier.com |language=en-US}}

Trendmicro

|Website with research, news, and perspectives about security topics.

|This data is not pre-processed

|[https://github.com/bee3202/cybersecurity-data-sources/blob/main/TRENDMICRO.md Reviewed list of Trendmicro research, news, and perspectives].

|{{Cite web |title=#1 in Cloud Security & Endpoint Cybersecurity |url=https://www.trendmicro.com/en_us/business.html |access-date=2023-02-13 |website=Trend Micro |language=en-US}}

The Hacker News

|News about cybersecurity topics.

|This data is not pre-processed

|[https://thehackernews.com/search/label/data%20breach data breaches], [https://thehackernews.com/search/label/Cyber%20Attack cyberattacks], [https://thehackernews.com/search/label/Vulnerability vulnerabilities], [https://thehackernews.com/search/label/Malware malware news].

|{{Cite web |title=The Hacker News {{!}} #1 Trusted Cybersecurity News Site |url=https://thehackernews.com/ |access-date=2023-02-13 |website=The Hacker News |language=en}}

Krebsonsecurity

|Security news and investigation

|This data is not pre-processed

|[https://github.com/bee3202/cybersecurity-data-sources/blob/main/krebsonsecurity.md curated list of news]

|{{Cite web |title=Krebs on Security – In-depth security news and investigation |url=https://krebsonsecurity.com/ |access-date=2023-02-25 |language=en-US}}

Mitre Defend

|Matrix of Defend artifacts

|JSON files

|{{Cite web |title=MITRE D3FEND Knowledge Graph |url=https://d3fend.mitre.org/ |access-date=2023-03-31 |website=d3fend.mitre.org |language=en}}

Mitre Atlas

|Mitre Atlas is a knowledge base of adversary tactics, techniques, and case studies for machine learning (ML) systems based on real-world observations.

|This data is not pre-processed

|{{Cite web |title=MITRE {{!}} ATLAS™ |url=https://atlas.mitre.org/ |access-date=2023-03-31 |website=atlas.mitre.org}}

Mitre Engage

|MITRE Engage is a framework for planning and discussing adversary engagement operations.

|This data is not pre-processed

|{{Cite web |title=MITRE Engage™ {{!}} An Adversary Engagement Framework from MITRE |url=https://engage.mitre.org/ |access-date=2023-04-01 |language=en-US}}

Hacking Tutorials

|This data is not pre-processed

|{{Cite web |title=Hacking Tutorials - The best Step-by-Step Hacking Tutorials |url=https://www.hackingtutorials.org/ |access-date=2023-04-01 |website=Hacking Tutorials |language=en-US}}

Climate and sustainability

class="wikitable sortable"

!Dataset Name

!Brief description

!Preprocessing

!Instances

!Format

!Default Task

!Created (updated)

!Reference

!Creator

TCFD reports

|Database of company reports that include TCFD-related disclosures.

|This data is not pre-processed

|[https://www.tcfdhub.org/reports Direct link to reports], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/TCFDreports.md Curated list of reports]

|{{Cite web |title=TCFD Knowledge Hub |url=https://www.tcfdhub.org/ |access-date=2023-02-03 |website=TCFD Knowledge Hub |language=en}}

|TCFD Knowledge Hub

Corporate Social Responsibility Reports

|A listing of responsibility reports on the internet.

|This data is not pre-processed

|[https://github.com/bee3202/cybersecurity-data-sources/blob/main/RESPONSABILITYREPORTS.md Curated list of reports]

|{{Cite web |title=ResponsibilityReports.com |url=https://www.responsibilityreports.com/ |access-date=2023-02-03 |website=www.responsibilityreports.com}}

|ResponsibilityReports

The Intergovernmental Panel on Climate Change (IPCC)

|A collection of comprehensive assessment reports about knowledge on climate change, its causes, potential impacts and response options

|This data is not pre-processed

|[https://www.ipcc.ch/reports/ Reports], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/IPCC.md Curated list of reports]

|{{Cite web |title=About — IPCC |url=https://www.ipcc.ch/about/ |access-date=2023-02-20}}

|IPCC

Alliance for Research on Corporate Sustainability

|This data is not pre-processed

|[https://github.com/bee3202/cybersecurity-data-sources/blob/main/arcs.md Curated list of blog posts]

|{{Cite web |title=Alliance for Research on Corporate Sustainability {{!}} ARCS serves as a vehicle for advancing rigorous academic research on corporate sustainability issues. |url=https://corporate-sustainability.org/ |access-date=2023-03-02 |website=corporate-sustainability.org}}

|ARCS

ESG corpus: Knowledge Hub of the Accounting for Sustainability

|This data is not pre-processed

|[https://www.accountingforsustainability.org/content/a4s/corporate/en/knowledge-hub.html?tab1=guides Guides], [https://www.accountingforsustainability.org/content/a4s/corporate/en/knowledge-hub.html?tab1=case-studies case studies], [https://www.accountingforsustainability.org/content/a4s/corporate/en/knowledge-hub.html?tab1=blogs blogs], and [https://www.accountingforsustainability.org/content/a4s/corporate/en/knowledge-hub.html?tab1=reports reports & surveys].

|{{cite book |doi=10.5121/csit.2022.120616 |chapter=ESGBERT: Language Model to Help with Classification Tasks Related to Companies' Environmental, Social, and Governance Practices |title=Embedded Systems and Applications |date=2022 |last1=Mehra |first1=Srishti |last2=Louka |first2=Robert |last3=Zhang |first3=Yixun |pages=183–190 |isbn=978-1-925953-65-7 }}

|Mehra et al.

CLIMATE-FEVER

|A dataset adopting the FEVER methodology that consists of 1,535 real-world claims about climate change collected on the internet.

|Each claim is accompanied by five manually annotated evidence sentences retrieved from the English Wikipedia that support, refute, or do not give enough information to validate the claim, for a total of 7,675 claim-evidence pairs.{{Creative Commons text attribution notice|url=https://www.tensorflow.org/datasets/community_catalog/huggingface/climate_fever|cc=by4}}

|[https://huggingface.co/datasets/climate_fever Dataset HF card], and project's [https://github.com/tdiggelm/climate-fever-dataset GitHub repository].

|{{Cite arXiv |last1=Diggelmann |first1=Thomas |last2=Boyd-Graber |first2=Jordan |last3=Bulian |first3=Jannis |last4=Ciaramita |first4=Massimiliano |last5=Leippold |first5=Markus |date=2021-01-02 |title=CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims |class=cs.CL |eprint=2012.00614}}

|Diggelmann et al.

Climate News dataset

|A dataset for NLP and climate change media researchers

|The dataset is made up of a number of data artifacts (JSON, JSONL & CSV text files & SQLite database)

|[http://www.climate-news-db.com/ Climate news DB], Project's [https://github.com/ADGEfficiency/climate-news-db GitHub repository]

|{{Cite web |title=climate-news-db |url=http://www.climate-news-db.com/ |access-date=2023-02-03 |website=www.climate-news-db.com}}

|ADGEfficiency

Climatext

|Climatext is a dataset for sentence-based climate change topic detection.

|[https://huggingface.co/datasets/mwong/climatetext-evidence-related-evaluation/tree/main/data HF dataset]

|{{Cite web |title=Climatext |url=http://www.sustainablefinance.uzh.ch/en/research/climate-fever/climatext.html |access-date=2023-02-19 |website=www.sustainablefinance.uzh.ch |language=en}}

|University of Zurich

GreenBiz

|Collection of articles and news about climate and sustainability

|This data is not pre-processed

|[https://github.com/bee3202/cybersecurity-data-sources/blob/main/climate-tech.md Curated list of climate articles], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/sustainability-strategy.md Curated list of sustainability articles]

|{{Cite web |title=Greenbiz |url=https://www.greenbiz.com/ |access-date=2023-03-02 |website=www.greenbiz.com}}

Top research pre-prints in climate and sustainability

|List of pre-prints from researchers on the Reuters Hot List

|This data is not pre-processed

|[https://github.com/bee3202/climate/blob/main/preprints_app_dimentions_ai.md Curated list of pre-prints]

|{{Cite news |title=Explore the @Reuters Hot List of 1,000 top climate scientists |language=en |work=Reuters |url=https://www.reuters.com/investigates/special-report/climate-change-scientists-list/ |access-date=2023-03-22}}

|Maurice Tamman

ARCS

|This data is not pre-processed

|[https://github.com/bee3202/climate/blob/main/arcs.md Curated list of corporate sustainability blogs]

|{{Cite web |title=Blogs {{!}} Alliance for Research on Corporate Sustainability |url=https://corporate-sustainability.org/blogs/ |access-date=2023-03-27 |website=corporate-sustainability.org}}

GreenBiz

|Website with articles about climate and sustainability

|This data is not pre-processed

|{{Cite web |title=Greenbiz |url=https://www.greenbiz.com/ |access-date=2023-03-29 |website=www.greenbiz.com}}

|GreenBiz

CSRWIRE

|This data is not pre-processed

|[https://github.com/bee3202/climate/blob/main/csrwire_all.md Curated list of articles]

|{{Cite web |title=CSR News |url=https://www.csrwire.com/press_releases |access-date=2023-03-29 |website=www.csrwire.com |language=en}}

|CSRWIRE

CDP

|Articles about [https://www.cdp.net/en/climate climate], [https://www.cdp.net/en/water water], and [https://www.cdp.net/en/forests forests]

|This data is not pre-processed

|{{Cite web |title=CDP Homepage |url=https://www.cdp.net/en |access-date=2023-03-29 |website=www.cdp.net |language=en}}

|CDP

Code data

class="wikitable sortable"

!Dataset Name

!Brief description

!Preprocessing

!Instances

!Format

!Default Task

!Created (updated)

!Reference

!Creator

The Stack

|A 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages.

|Filtered through license detection and deduplication.

|6 TB, 51.76B files (prior to deduplication); 3 TB, 5.28B files (after). 358 programming languages.

|Parquet

|Language modeling, autocompletion, program synthesis.

|2022

|{{cite arXiv |last1=de Vries |first1=Harm |title=The Stack: 3 TB of permissively licensed source code |date=2022 |class=cs.CL |eprint=2211.15533 }}{{cite web |title=The Stack Dedup |url=https://huggingface.co/datasets/bigcode/the-stack-dedup |website=Huggingface |access-date=29 August 2023}}

|D. Kocetkov, R. Li, L. Ben Allal, L. von Werra, H. de Vries

LEMUR Neural Network Dataset

|A structured repository of standardized neural network models designed to facilitate AutoML tasks and model analysis with LLMs.

|Filtered through license detection and deduplication.

|PyTorch models.

|Python scripts.

|Image classification, object detection, image segmentation, and natural language processing.

|2024

|{{cite arXiv |last1= Goodarzi |first1=Arash Torabi |title=LEMUR Neural Network Dataset: Towards Seamless AutoML |date=2025 |class=cs.CL |eprint=2504.10552 }}

|A. Goodarzi, R. Kochnev, W. Khalid, F. Qin, T. Uzun, Y. Dhameliya, Y. Kathiriya, Z. Bentyn, D. Ignatov, R. Timofte

GitHub repositories

|This data is not pre-processed

|Curated list of repositories from GitHub: [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.61.md 61] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.62.md 62] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.63.md 63] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.64.md 64] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.65.md 65] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.66.md 66] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.67.md 67] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.68.md 68] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.69.md 69] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.70.md 70] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.71.md 71], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.72.md 72], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.73.md 73], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.74.md 74], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.75.md 75], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.76.md 76], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.77.md 77] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others_main.md 101]

IBM Public GitHub repositories

|This data is not pre-processed

|[https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODEPUBLIC.md Curated list of repositories] from GitHub

Red Hat Public GitHub repositories

|This data is not pre-processed

|[https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODERHPUBLIC.md Curated list of repositories] from GitHub

StackExchange Public Archive.org files

|This data is not pre-processed

|[https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODESEPPUBLIC.md Curated list of files] from [https://archive.org/ Archive.org]

GitLab Public repositories

|This data is not pre-processed

|Curated list of repositories from GitLab: [https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODELABPUBLIC.md 1] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODELABPUBLIC2.md 2]

Ansible Collections public repositories

|This data is not pre-processed

|[https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEANSIBLEPUBLIC.0.md Curated list of repositories] from GitHub.

CodeParrot GitHub Code Dataset

|This data is not pre-processed

|Curated list of repositories from Hugging Face: [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/GITHUBCODE_RAWPUBLIC.md 1] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/GITHUBCODE_CLEANPUBLIC.md 2] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINV2NEARDEDUP.md 3] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINV2NEARDEDUPVALID.md 4] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINNEARDEDUPLICATIONPUBLIC.md 5] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINNEARDEDUPLICATIONPUBLICVALID.md 6] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINMOREPUBLIC.md 7] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINMOREVALIDPUBLIC.md 8] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_CLEANTRAINPUBLIC.md 9] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_CLEANVALIDPUBLIC.md 10]

OKD

|The Community Distribution of Kubernetes that powers Red Hat OpenShift

|This data is not pre-processed

|[https://github.com/orgs/okd-project/repositories List of GitHub repositories of the project]

OpenShift

|The developer and operations friendly Kubernetes distro

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_openshift.md List of GitHub repositories of the project]

Kubernetes

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_kubernetes.md List of GitHub repositories of the project]

Red Hat Developer

|GitHub home of the Red Hat Developer program

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_developer.md List of GitHub repositories of the project]

Red Hat Workshops

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_workshops.md List of GitHub repositories of the project]

Kubernetes SIGs

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_kubernetes_sigs.md List of GitHub repositories of the project]

Konveyor

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_konveyor.md List of GitHub repositories of the project]

Red Hat Marketplace

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_marketplace.md List of GitHub repositories of the project]

Red Hat blog

|This data is not pre-processed

|{{Cite web |title=Hybrid cloud blog |url=https://content.cloud.redhat.com/blog |access-date=2023-04-09 |website=content.cloud.redhat.com |language=en-us}}

Kubernetes io

|This data is not pre-processed

|{{Cite web |title=Production-Grade Container Orchestration |url=https://kubernetes.io/ |access-date=2023-04-09 |website=Kubernetes |language=en}}

Docs OpenShift

|This data is not pre-processed

|{{Cite web |title=Home {{!}} Official Red Hat OpenShift Documentation |url=https://docs.openshift.com/ |access-date=2023-04-09 |website=docs.openshift.com}}

cncf io

|This data is not pre-processed

|{{Cite web |title=Cloud Native Computing Foundation |url=https://www.cncf.io/ |access-date=2023-04-09 |website=Cloud Native Computing Foundation |language=en-US}}

Kubernetes presentations

|List of publicly available Kubernetes presentations

|This data is not pre-processed

|[https://github.com/bee3202/kubernetes_presentations/archive/refs/heads/main.zip data link]

Red Hat Open Innovation Labs

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_open_innovation_labs.md List of GitHub repositories of the project]

Red Hat Demos

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_RedHatDemos.md List of GitHub repositories of the project]

Red Hat OpenShift Online

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_openshift-online.md List of GitHub repositories of the project]

Software Collections

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_software_collections.md List of GitHub repositories of the project]

Red Hat Insights

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_insights.md List of GitHub repositories of the project]

Red Hat Government

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_government.md List of GitHub repositories of the project]

Red Hat Consulting

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_consulting.md List of GitHub repositories of the project]

Red Hat Communities of Practice

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_communities_of_practice.md List of GitHub repositories of the project]

Red Hat Partner Tech

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_partner_tech.md List of GitHub repositories of the project]

Red Hat Documentation

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_documentation.md List of GitHub repositories of the project]

IBM

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_IBM.md List of GitHub repositories of the project]

IBM Cloud

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_IBM_cloud.md List of GitHub repositories of the project]

Build Lab Team

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_build_lab_team.md List of GitHub repositories of the project]

Terraform IBM Modules

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_terraform-ibm-modules.md List of GitHub repositories of the project]

Cloud Schematics

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_Cloud-Schematics.md List of GitHub repositories of the project]

OCP Power Demos

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_ocp-power-demos.md List of GitHub repositories of the project]

IBM App Modernization

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_IBMAppModernization.md List of GitHub repositories of the project]

Kubernetes OperatorHub

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_k8s-operatorhub.md List of GitHub repositories of the project]

Cloud Native Computing Foundation (CNCF)

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_cncf.md List of GitHub repositories of the project]

Operator Framework

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_operator-framework.md List of GitHub repositories of the project]

|{{Citation |title=CNCF Community Presentations |date=2023-04-11 |url=https://github.com/cncf/presentations/blob/2ff57e4d78f6d70bb1fd5daf81e76f04a54c8520/kubernetes/README.md |access-date=2023-04-11 |publisher=Cloud Native Computing Foundation (CNCF)}}

GitHub repositories referenced in artifacthub.io

|This data is not pre-processed

|[https://github.com/bee3202/artifacthub_packages/blob/main/artifacthub_git_repos.md List of GitHub repositories in artifacthub.io]

Red Hat Communities of Practice

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_cop.md List of GitHub repositories of the project]

Red Hat partner

|This data is not pre-processed

|[https://github.com/redhat-partner-tech?tab=repositories List of GitHub repositories of the project]

IBM Repositories

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/p_ibm_repositories.md List of GitHub repositories for the project]

Build Lab Team

|This data is not pre-processed

|[https://github.com/orgs/ibm-build-lab/repositories List of GitHub repositories for the project]

Operator Framework

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/operator_framework.md List of GitHub repositories for the project]

GitHub repositories

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/individual_git_repos.md List of GitHub repositories for the project]

Red Hat

|This data is not pre-processed

|[https://www.redhat.com/en List of GitHub repositories of the project]

Kubernetes Patterns

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/kubernetes-patterns.md List of GitHub repositories of the project]

Kubernetes Deployment & Security Patterns

|This data is not pre-processed

|[https://resources.linuxfoundation.org/LF+Projects/CNCF/TheNewStack_Book2_KubernetesDeploymentAndSecurityPatterns.pdf List of GitHub repositories of the project]

Kubernetes for Full-Stack Developers

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/kubernetes-fs-developers.md List of GitHub repositories of the project]

Load Balancer Cloudwatch Metrics

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/load-balancer-cloudwatch-metrics.md GitHub repository of the project]

Dynatrace

|This data is not pre-processed

|[https://docs.dynatrace.com/docs/observe-and-explore/metrics-classic/built-in-metrics Built-in metrics documentation]

AIOps Challenge 2020 Data

|This data is not pre-processed

|[https://github.com/NetManAIOps/AIOps-Challenge-2020-Data GitHub repository of the project]

Loghub

|This data is not pre-processed

|[https://github.com/logpai/loghub List of repositories]

HTML Pages

|This data is not pre-processed

|[https://github.com/bee3202/open-shift-repos/blob/main/html_pages.md List of HTML pages]

OpenShift ebooks

|This data is not pre-processed

|{{Cite web |title=Red Hat - We make open source technologies for the enterprise |url=https://www.redhat.com/en |access-date=2023-05-01 |website=www.redhat.com |language=en}}

Kubernetes ebooks

|This data is not pre-processed

|[https://www.redhat.com/rhdc/managed-files/cm-oreilly-kubernetes-patterns-ebook-f19824-201910-en_1.pdf Kubernetes Patterns], [https://resources.linuxfoundation.org/LF+Projects/CNCF/TheNewStack_Book2_KubernetesDeploymentAndSecurityPatterns.pdf Kubernetes Deployment], [https://assets.digitalocean.com/books/kubernetes-for-full-stack-developers.pdf Kubernetes for Full-Stack Developers]

Kubernetes for Full-Stack Developers

|This data is not pre-processed

|[https://assets.digitalocean.com/books/kubernetes-for-full-stack-developers.pdf Kubernetes for Full-Stack Developers]

List of public and licensed GitHub repositories

|This data is not pre-processed

|[https://github.com/bee3202/code_dataset/tree/main/licensed_batch_A List of repositories]

Multivariate data

= Financial =

style="width: 100%" class="wikitable sortable"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Dow Jones Index

|Weekly data of stocks from the first and second quarters of 2011.

|Includes calculated values such as percentage change and lags.

|750

|Comma separated values

|Classification, regression, time series

|2014

|{{cite book |doi=10.1007/978-3-642-39712-7_3 |chapter=Dynamic-Radius Species-Conserving Genetic Algorithm for the Financial Forecasting of Dow Jones Index Stocks |title=Machine Learning and Data Mining in Pattern Recognition |series=Lecture Notes in Computer Science |date=2013 |last1=Brown |first1=Michael Scott |last2=Pelosi |first2=Michael J. |last3=Dirska |first3=Henry |volume=7988 |pages=27–41 |isbn=978-3-642-39711-0 }}{{cite journal | last1 = Shen | first1 = Kao-Yi | last2 = Tzeng | first2 = Gwo-Hshiung | year = 2015 | title = Fuzzy Inference-Enhanced VC-DRSA Model for Technical Analysis: Investment Decision Aid | journal = International Journal of Fuzzy Systems | volume = 17 | issue = 3| pages = 375–389 | doi=10.1007/s40815-015-0058-8| s2cid = 68241024 }}

|M. Brown et al.

Statlog (Australian Credit Approval)

|Credit card applications either accepted or rejected and attributes about the application.

|Attribute names are removed as well as identifying information. Factors have been relabeled.

|690

|Comma separated values

|Classification

|1987

|{{cite journal |last1=Quinlan |first1=J.R. |title=Simplifying decision trees |journal=International Journal of Man-Machine Studies |date=September 1987 |volume=27 |issue=3 |pages=221–234 |doi=10.1016/s0020-7373(87)80053-6 |hdl=1721.1/6453 }}{{cite journal | last1 = Hamers | first1 = Bart | last2 = Suykens | first2 = Johan AK | last3 = De Moor | first3 = Bart | year = 2003 | title = Coupled transductive ensemble learning of kernel models | url =http://ftp.esat.kuleuven.be/pub/SISTA/hamers/BH_clm.pdf | journal = Journal of Machine Learning Research | volume = 1 | pages = 1–48 }}

|R. Quinlan

eBay auction data

|Auction data for various items on eBay.com, covering auctions of various lengths.

|Contains all bids, bidderID, bid times, and opening prices.

|~ 550

|Text

|Regression, classification

|2012

|{{cite journal |last1=Shmueli |first1=Galit |last2=Russo |first2=Ralph P. |last3=Jank |first3=Wolfgang |title=The BARISTA: A model for bid arrivals in online auctions |journal=The Annals of Applied Statistics |date=December 2007 |volume=1 |issue=2 |doi=10.1214/07-AOAS117 }}{{cite journal |last1=Peng |first1=Jie |last2=Müller |first2=Hans-Georg |title=Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions |journal=The Annals of Applied Statistics |date=September 2008 |volume=2 |issue=3 |doi=10.1214/08-AOAS172 }}

|G. Shmueli et al.

Statlog (German Credit Data)

|Binary credit classification into "good" or "bad" with many features

|Various financial features of each person are given.

|1,000

|Text

|Classification

|1994

|{{cite book |doi=10.1145/967900.968104 |chapter=Genetic Programming for data classification: Partitioning the search space |title=Proceedings of the 2004 ACM symposium on Applied computing |date=2004 |last1=Eggermont |first1=Jeroen |last2=Kok |first2=Joost N. |last3=Kosters |first3=Walter A. |pages=1001–1005 |isbn=978-1-58113-812-2 }}

|H. Hofmann

Bank Marketing Dataset

|Data from a large marketing campaign carried out by a large bank.

|Many attributes of the clients contacted are given. Whether the client subscribed to a term deposit is also given.

|45,211

|Text

|Classification

|2012

|{{cite journal | last1 = Moro | first1 = Sérgio | last2 = Cortez | first2 = Paulo | last3 = Rita | first3 = Paulo | year = 2014 | title = A data-driven approach to predict the success of bank telemarketing | journal = Decision Support Systems | volume = 62 | pages = 22–31 | doi=10.1016/j.dss.2014.03.001| hdl = 10071/9499 | s2cid = 14181100 | hdl-access = free }}{{cite arXiv |eprint=1411.5653|last1= Payne|first1= Richard D.|title= Bayesian Big Data Classification: A Review with Complements|last2= Mallick|first2= Bani K.|class= stat.ME|year= 2014}}

|S. Moro et al.

Istanbul Stock Exchange Dataset

|Several stock indexes tracked for almost two years.

|None.

|536

|Text

|Classification, regression

|2013

|{{cite journal | last1 = Akbilgic | first1 = Oguz | last2 = Bozdogan | first2 = Hamparsum | last3 = Balaban | first3 = M. Erdal | year = 2014 | title = A novel Hybrid RBF Neural Networks model as a forecaster | journal = Statistics and Computing | volume = 24 | issue = 3| pages = 365–375 | doi=10.1007/s11222-013-9375-7| s2cid = 17764829 }}{{cite journal |last1=Jabin |first1=Suraiya |title=Stock Market Prediction using Feed-forward Artificial Neural Network |journal=International Journal of Computer Applications |date=20 August 2014 |volume=99 |issue=9 |pages=4–8 |doi=10.5120/17399-7959 |bibcode=2014IJCA...99i...4J }}

|O. Akbilgic

Default of Credit Card Clients

|Credit default data for Taiwanese credit card clients.

|Various features about each account are given.

|30,000

|Text

|Classification

|2016

|{{cite journal | last1 = Yeh | first1 = I-Cheng | last2 = Che-hui | first2 = Lien | year = 2009 | title = The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients | journal = Expert Systems with Applications | volume = 36 | issue = 2| pages = 2473–2480 | doi=10.1016/j.eswa.2007.12.020| s2cid = 15696161 }}{{cite journal | last1 = Lin | first1 = Shu Ling | year = 2009 | title = A new two-stage hybrid approach of credit risk in banking industry | journal = Expert Systems with Applications | volume = 36 | issue = 4| pages = 8333–8341 | doi=10.1016/j.eswa.2008.10.015}}

|I. Yeh

[https://github.com/yumoxu/stocknet-dataset StockNet]

|Stock movement prediction from tweets and historical stock prices

|None

|Text

|NLP

|2018

|{{cite book |doi=10.18653/v1/P18-1183 |chapter=Stock Movement Prediction from Tweets and Historical Prices |title=Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |date=2018 |last1=Xu |first1=Yumo |last2=Cohen |first2=Shay B. |pages=1970–1979 }}

|Yumo Xu and Shay B. Cohen

= Weather =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Cloud DataSet

|Data about 1024 different clouds.

|Image features extracted.

|1024

|Text

|Classification, clustering

|1989

|{{cite journal | last1 = Pelckmans | first1 = Kristiaan | display-authors = et al | year = 2005 | title = The differogram: Non-parametric noise variance estimation and its use for model selection | journal = Neurocomputing | volume = 69 | issue = 1| pages = 100–122 | doi=10.1016/j.neucom.2005.02.015}}

|P. Collard

El Nino Dataset

|Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific.

|12 weather attributes are measured at each buoy.

|178080

|Text

|Regression

|1999

|{{cite journal |last1=Bay |first1=Stephen D. |last2=Kibler |first2=Dennis |last3=Pazzani |first3=Michael J. |last4=Smyth |first4=Padhraic |title=The UCI KDD archive of large data sets for data mining research and experimentation |journal=ACM SIGKDD Explorations Newsletter |date=December 2000 |volume=2 |issue=2 |pages=81–85 |doi=10.1145/380995.381030 }}

|Pacific Marine Environmental Laboratory

Greenhouse Gas Observing Network Dataset

|Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather.

|None.

|2921

|Text

|Regression

|2015

|{{cite journal | last1 = Lucas | first1 = D. D. | display-authors = et al | year = 2015 | title = Designing optimal greenhouse gas observing networks that consider performance and cost | journal = Geoscientific Instrumentation, Methods and Data Systems | volume = 4 | issue = 1| page = 121 | doi=10.5194/gi-4-121-2015| bibcode = 2015GI......4..121L | doi-access = free }}

|D. Lucas

Atmospheric {{CO2}} from Continuous Air Samples at Mauna Loa Observatory

|Continuous air samples in Hawaii, USA. 44 years of records.

|None.

|44 years

|Text

|Regression

|2001

|{{cite journal | last1 = Pales | first1 = Jack C. | last2 = Keeling | first2 = Charles D. | year = 1965 | title = The concentration of atmospheric carbon dioxide in Hawaii | journal = Journal of Geophysical Research | volume = 70 | issue = 24| pages = 6053–6076 | doi=10.1029/jz070i024p06053 | bibcode=1965JGR....70.6053P}}

|Mauna Loa Observatory

Ionosphere Dataset

|Radar data from the ionosphere. Task is to classify into good and bad radar returns.

|Many radar features given.

|351

|Text

|Classification

|1989

|Sigillito, Vincent G., et al. "Classification of radar returns from the ionosphere using neural networks." Johns Hopkins APL Technical Digest 10.3 (1989): 262–266.

|Johns Hopkins University

Ozone Level Detection Dataset

|Two ground ozone level datasets.

|Many features given, including weather conditions at time of measurement.

|2536

|Text

|Classification

|2008

|{{cite journal |last1=Zhang |first1=Kun |last2=Fan |first2=Wei |title=Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond |journal=Knowledge and Information Systems |date=March 2008 |volume=14 |issue=3 |pages=299–326 |doi=10.1007/s10115-007-0095-1 }}{{cite journal |last1=Reich |first1=Brian J. |last2=Fuentes |first2=Montserrat |last3=Dunson |first3=David B. |title=Bayesian Spatial Quantile Regression |journal=Journal of the American Statistical Association |date=March 2011 |volume=106 |issue=493 |pages=6–20 |doi=10.1198/jasa.2010.ap09237 |pmid=23459794 |pmc=3583387 }}

|K. Zhang et al.

= Census =

style="width: 100%" class="wikitable sortable"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Adult Dataset

|Census data from 1994 containing demographic features of adults and their income.

|Cleaned and anonymized.

|48,842

|Comma separated values

|Classification

|1996

|{{cite journal | last1 = Kohavi | first1 = Ron | title = Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid | journal = KDD | volume = 96 | year = 1996 }}

|United States Census Bureau

Census-Income (KDD)

|Weighted census data from the 1994 and 1995 Current Population Surveys.

|Split into training and test sets.

|299,285

|Comma separated values

|Classification

|2000

|Oza, Nikunj C., and Stuart Russell. "Experimental comparisons of online and batch versions of bagging and boosting." Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001.{{cite journal |last1=Bay |first1=Stephen D. |title=Multivariate Discretization for Set Mining |journal=Knowledge and Information Systems |date=November 2001 |volume=3 |issue=4 |pages=491–512 |doi=10.1007/pl00011680 }}

|United States Census Bureau

IPUMS Census Database

|Census data from the Los Angeles and Long Beach areas.

|None

|256,932

|Text

|Classification, regression

|1999

|{{cite journal | last1 = Ruggles | first1 = Steven | year = 1995 | title = Sample designs and sampling errors | journal = Historical Methods| volume = 28 | issue = 1| pages = 40–46 | doi=10.1080/01615440.1995.9955312}}

|IPUMS

US Census Data 1990

|Partial data from 1990 US census.

|Results randomized and useful attributes selected.

|2,458,285

|Text

|Classification, regression

|1990

|Meek, Christopher, Bo Thiesson, and David Heckerman. "[https://www.microsoft.com/en-us/research/wp-content/uploads/2001/01/lc-aistats.pdf The Learning Curve Method Applied to Clustering]." AISTATS. 2001.

|United States Census Bureau

= Transit =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Bike Sharing Dataset

|Hourly and daily count of rental bikes in a large city.

|Many features, including weather, length of trip, etc., are given.

|17,389

|Text

|Regression

|2013

|{{cite journal | last1 = Fanaee-T | first1 = Hadi | last2 = Gama | first2 = Joao | year = 2013| title = Event labeling combining ensemble detectors and background knowledge | url = http://repositorio.inesctec.pt/handle/123456789/3506| journal = Progress in Artificial Intelligence | volume = 2 | issue = 2–3| pages = 113–127 | doi = 10.1007/s13748-013-0040-3 | s2cid = 3345087 }}{{cite book |doi=10.1109/CIVTS.2014.7009473 |chapter=Predicting bikeshare system usage up to one day ahead |title=2014 IEEE Symposium on Computational Intelligence in Vehicles and Transportation Systems (CIVTS) |date=2014 |last1=Giot |first1=Romain |last2=Cherrier |first2=Raphael |pages=22–29 |isbn=978-1-4799-4497-2 |url=https://hal.archives-ouvertes.fr/hal-01065983/file/paper_final.pdf }}

|H. Fanaee-T

New York City Taxi Trip Data

|Trip data for yellow and green taxis in New York City.

|Gives pick up and drop off locations, fares, and other details of trips.

|6 years

|Text

|Classification, clustering

|2015

|{{cite journal | last1 = Zhan | first1 = Xianyuan | display-authors = et al | year = 2013 | title = Urban link travel time estimation using large-scale taxi data with partial information | journal = Transportation Research Part C: Emerging Technologies | volume = 33 | pages = 37–49 | doi=10.1016/j.trc.2013.04.001| bibcode = 2013TRPC...33...37Z }}

|New York City Taxi and Limousine Commission

Taxi Service Trajectory ECML PKDD

|Trajectories of all taxis in a large city.

|Many features given, including start and stop points.

|1,710,671

|Text

|Clustering, causal-discovery

|2015

|{{cite journal | last1 = Moreira-Matias | first1 = Luis | display-authors = et al | year = 2013 | title = Predicting taxi–passenger demand using streaming data | url = http://repositorio.inesctec.pt/handle/123456789/5356| journal = IEEE Transactions on Intelligent Transportation Systems| volume = 14 | issue = 3| pages = 1393–1402 | doi=10.1109/tits.2013.2262376| s2cid = 14764358 }}{{cite journal | last1 = Hwang | first1 = Ren-Hung | last2 = Hsueh | first2 = Yu-Ling | last3 = Chen | first3 = Yu-Ting | year = 2015 | title = An effective taxi recommender system based on a spatio-temporal factor analysis model | journal = Information Sciences | volume = 314 | pages = 28–40 | doi=10.1016/j.ins.2015.03.068}}

|M. Ferreira et al.

METR-LA

|Speed readings from loop detectors on the highways of Los Angeles County.

|Average speed in 5-minute timesteps.

|7,094,304 from 207 sensors and 34,272 timesteps

|Comma separated values

|Regression, Forecasting

|2014

|H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi. "Big data and its technical challenges." Communications of the ACM 57(7):86–94, July 2014.

|Jagadish et al.

PeMS

|Speed, flow, occupancy and other metrics from loop detectors and other sensors on the freeways of the State of California, USA.

|Metrics are usually aggregated by averaging into 5-minute timesteps.

|39,000 individual detectors, each containing years of timeseries

|Comma separated values

|Regression, Forecasting, Nowcasting, Interpolation

|(updated realtime)

|[http://pems.dot.ca.gov/ Caltrans PeMS]

|California Department of Transportation
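
The instance count given for METR-LA in the table above can be checked from the sensor and timestep counts in the same row. The following is a minimal sketch of that arithmetic; the variable names are illustrative, and the figures are taken only from the table, not from the dataset files.

```python
# Figures as listed in the METR-LA row of the table.
SENSORS = 207
TIMESTEPS = 34_272
STEP_MINUTES = 5

# One reading per sensor per timestep.
readings = SENSORS * TIMESTEPS

# Number of 5-minute timesteps in one day, and the span they cover.
steps_per_day = 24 * 60 // STEP_MINUTES
days_covered = TIMESTEPS / steps_per_day

print(readings)       # 7094304
print(steps_per_day)  # 288
print(days_covered)   # 119.0
```

Under these figures the series spans 119 days of 5-minute averages, which matches the 7,094,304 instances listed.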

= Internet =

class="wikitable sortable" style="width: 100%"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Webpages from Common Crawl 2012

|Large collection of webpages and how they are connected via hyperlinks

|None.

|3.5B

|Text

|Clustering, classification

|2013

|Meusel, Robert, et al. "[https://www.nowpublishers.com/article/OpenAccessDownload/JWS-0003 The Graph Structure in the Web—Analyzed on Different Aggregation Levels]." The Journal of Web Science 1.1 (2015).

|V. Granville

Internet Advertisements Dataset

|Dataset for predicting if a given image is an advertisement or not.

|Features encode geometry of ads and phrases occurring in the URL.

|3279

|Text

|Classification

|1998

|{{cite book |doi=10.1145/301136.301186 |chapter=Learning to remove Internet advertisements |title=Proceedings of the third annual conference on Autonomous Agents |date=1999 |last1=Kushmerick |first1=Nicholas |pages=175–181 |isbn=978-1-58113-066-9 }}{{cite book |doi=10.1145/956750.956812 |chapter=Experiments with random projections for machine learning |title=Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining |date=2003 |last1=Fradkin |first1=Dmitriy |last2=Madigan |first2=David |pages=517–522 |isbn=978-1-58113-737-8 }}

|N. Kushmerick

Internet Usage Dataset

|General demographics of internet users.

|None.

|10,104

|Text

|Classification, clustering

|1999

|This data was used in the American Statistical Association Statistical Graphics and Computing Sections 1999 Data Exposition.

|D. Cook

URL Dataset

|120 days of URL data from a large conference.

|Many features of each URL are given.

|2,396,130

|Text

|Classification

|2009

|{{cite book |doi=10.1145/1553374.1553462 |chapter=Identifying suspicious URLs: An application of large-scale online learning |title=Proceedings of the 26th Annual International Conference on Machine Learning |date=2009 |last1=Ma |first1=Justin |last2=Saul |first2=Lawrence K. |last3=Savage |first3=Stefan |last4=Voelker |first4=Geoffrey M. |pages=681–688 |isbn=978-1-60558-516-1 }}{{cite book |doi=10.1109/SP.2011.24 |chapter=Click Trajectories: End-to-End Analysis of the Spam Value Chain |title=2011 IEEE Symposium on Security and Privacy |date=2011 |last1=Levchenko |first1=K. |last2=Pitsillidis |first2=A. |last3=Chachra |first3=N. |last4=Enright |first4=B. |last5=Felegyhazi |first5=M. |last6=Grier |first6=C. |last7=Halvorson |first7=T. |last8=Kanich |first8=C. |last9=Kreibich |first9=C. |last10=He Liu |last11=McCoy |first11=D. |last12=Weaver |first12=N. |last13=Paxson |first13=V. |last14=Voelker |first14=G. M. |last15=Savage |first15=S. |pages=431–446 |isbn=978-0-7695-4402-1 }}

|J. Ma

Phishing Websites Dataset

|Dataset of phishing websites.

|Many features of each site are given.

|2456

|Text

|Classification

|2015

|Mohammad, Rami M., Fadi Thabtah, and Lee McCluskey. "[http://eprints.hud.ac.uk/16229/1/The_7th_ICITST_2012_Conference_-An_Assessment_of_Features_Related_to_Phishing_Websites_using_an_Automated_Technique.pdf An assessment of features related to phishing websites using an automated technique]." Internet Technology And Secured Transactions, 2012 International Conference for. IEEE, 2012.

|R. Mustafa et al.

Online Retail Dataset

|Online transactions for a UK online retailer.

|Details of each transaction given.

|541,909

|Text

|Classification, clustering

|2015

|{{cite book |doi=10.1145/2640087.2644161 |chapter=Clustering Experiments on Big Transaction Data for Market Segmentation |title=Proceedings of the 2014 International Conference on Big Data Science and Computing |date=2014 |last1=Singh |first1=Ashishkumar |last2=Rumantir |first2=Grace |last3=South |first3=Annie |last4=Bethwaite |first4=Blair |pages=1–7 |isbn=978-1-4503-2891-3 }}

|D. Chen

Freebase Simple Topic Dump

|Freebase is an online effort to structure all human knowledge.

|Topics from Freebase have been extracted.

|large

|Text

|Classification, clustering

|2011

|{{cite book |doi=10.1145/1376616.1376746 |chapter=Freebase: A collaboratively created graph database for structuring human knowledge |title=Proceedings of the 2008 ACM SIGMOD international conference on Management of data |date=2008 |last1=Bollacker |first1=Kurt |last2=Evans |first2=Colin |last3=Paritosh |first3=Praveen |last4=Sturge |first4=Tim |last5=Taylor |first5=Jamie |pages=1247–1250 |isbn=978-1-60558-102-6 }}Mintz, Mike, et al. "[https://www.aclweb.org/anthology/P09-1113 Distant supervision for relation extraction without labeled data]." Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2. Association for Computational Linguistics, 2009.

|Freebase

Farm Ads Dataset

|The text of farm ads from websites. Binary approval or disapproval by content owners is given.

|SVMlight sparse vectors of text words in ads calculated.

|4143

|Text

|Classification

|2011

|{{cite book |doi=10.1145/2020408.2020553 |chapter=Active learning using on-line algorithms |title=Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining |date=2011 |last1=Mesterharm |first1=Chris |last2=Pazzani |first2=Michael J. |pages=850–858 |isbn=978-1-4503-0813-7 }}{{cite journal | last1 = Wang | first1 = Shusen | last2 = Zhang | first2 = Zhihua | year = 2013 | title = Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling | url = http://www.jmlr.org/papers/volume14/wang13c/wang13c.pdf| journal = The Journal of Machine Learning Research | volume = 14 | issue = 1| pages = 2729–2769 | arxiv = 1303.4207 | bibcode = 2013arXiv1303.4207W }}

|C. Mesterharm et al.

The Pile

|A collection assembled from several large datasets of diverse, unstructured text.

|Various (removing HTML and JavaScript from websites, removing duplicated sentences)

|825 GiB English text

|JSON Lines{{Cite web |title=The Pile |url=https://pile.eleuther.ai/ |access-date=2022-04-14 |website=pile.eleuther.ai}}{{Cite web |title=JSON Lines |url=https://jsonlines.org/ |access-date=2022-04-14 |website=jsonlines.org}}

|Natural Language Processing, Text Prediction

|2021

|{{Cite arXiv |eprint=2101.00027 |class=cs.CL |first1=Leo |last1=Gao |first2=Stella |last2=Biderman |title=The Pile: An 800GB Dataset of Diverse Text for Language Modeling |date=2020-12-31 |last3=Black |first3=Sid |last4=Golding |first4=Laurence |last5=Hoppe |first5=Travis |last6=Foster |first6=Charles |last7=Phang |first7=Jason |last8=He |first8=Horace |last9=Thite |first9=Anish |last10=Nabeshima |first10=Noa |last11=Presser |first11=Shawn}}

|Gao et al.

OSCAR

|Large collection of monolingual corpora extracted from web data (Common Crawl dumps) covering 150+ languages

|Various (filtering, language classification, adult-content detection and other labelling)

|3.4 TB English text, 1.4 TB Chinese text, 1.1 TB Russian text, 595 MB German text, 431 MB French text, and data for 150+ languages (figures for version 23.01)

|JSON Lines{{Cite web |title=OSCAR |url=https://oscar-project.org/ |access-date=2023-08-12 |website=oscar-project.org}}

|Natural Language Processing, Text Prediction

|2021

|Ortiz Suarez, Pedro, et al. "[https://inria.hal.science/hal-02148693v1/file/Asynchronous_Pipeline_for_Processing_Huge_Corpora_on_Medium_to_Low_Resource_Infrastructures.pdf Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures]." CMLC-7, 2019. Abadji, Julien, et al. "[https://aclanthology.org/2022.lrec-1.463.pdf Towards a Cleaner Document-Oriented Multilingual Crawled Corpus]." LREC, 2022.

|Ortiz Suarez, Abadji, Sagot et al.

OpenWebText

|An open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes.

|Extracted non-HTML content, deduplicated, and tokenized.

|8,013,769 Documents, 38GB

|Text

|Natural Language Processing, Text Prediction

|2019

|{{Cite web |last=Cohen |first=Vanya |title=OpenWebTextCorpus |url=https://skylion007.github.io/OpenWebTextCorpus/ |access-date=2023-01-09 |website=OpenWebTextCorpus |language=en}}{{Cite web |title=openwebtext · Datasets at Hugging Face |url=https://huggingface.co/datasets/openwebtext |access-date=2023-01-09 |website=huggingface.co|date=16 November 2022 }}

|A. Gokaslan, V. Cohen

ROOTS

|A well-documented and representative multilingual dataset with the explicit goal of doing good for and by the people whose data was collected.

|Extracted non-HTML content, cleaned out UI and ads, deduplicated, removed PII, and tokenized.

|1.6 TB, 59 languages.

|Parquet

|Natural Language Processing, Text Prediction

|2022

|{{Cite arXiv |last=Saulnier |first=Lucile |title=The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset |date=2023 |class=cs.CL |eprint=2303.03915 |language=en}}{{Cite web |title=BigScience Data · Datasets at Hugging Face |url=https://huggingface.co/bigscience-data |access-date=2023-08-29 |website=huggingface.co|date=29 August 2023 }}

|H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. Villanova del Moral, T. Le Scao
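
Several corpora in the table above (The Pile, OSCAR) list JSON Lines as their format, in which each line of a file is one self-contained JSON value. The sketch below reads such a stream with Python's standard library only. The `text` and `meta` field names and the two sample records are illustrative assumptions, not the documented schema of any corpus listed here.

```python
import io
import json


def read_json_lines(stream):
    """Yield one parsed JSON value per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if not line:
            # Skip blank lines instead of failing on them.
            continue
        yield json.loads(line)


# Hypothetical in-memory sample standing in for a corpus file.
sample = io.StringIO(
    '{"text": "First document.", "meta": {"source": "example-a"}}\n'
    '{"text": "Second document.", "meta": {"source": "example-b"}}\n'
)

records = list(read_json_lines(sample))
for record in records:
    print(record["meta"]["source"], len(record["text"]))
```

Because records are parsed one line at a time, the same generator can be pointed at an open file handle and will not load a multi-gigabyte corpus into memory at once.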

= Games =

style="width: 100%" class="wikitable sortable"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Poker Hand Dataset

|5 card hands from a standard 52 card deck.

|Attributes of each hand are given, including the Poker hands formed by the cards it contains.

|1,025,010

|Text

|Regression, classification

|2007

|{{cite journal |last1=Cattral |first1=Robert |first2=Franz |last2=Oppacher |first3=Dwight |last3=Deugo |title=Evolutionary data mining with automatic rule generalization |journal=Recent Advances in Computers, Computing and Communications |year=2002 |pages=296–300 |s2cid=18625415 }}

|R. Cattral

Connect-4 Dataset

|Contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced.

|None.

|67,557

|Text

|Classification

|1995

|{{cite journal |last1=Burton |first1=Ariel N. |last2=Kelly |first2=Paul H.J. |title=Performance prediction of paging workloads using lightweight tracing |journal=Future Generation Computer Systems |date=August 2006 |volume=22 |issue=7 |pages=784–793 |doi=10.1016/j.future.2006.02.003 }}

|J. Tromp

Chess (King-Rook vs. King) Dataset

|Endgame Database for White King and Rook against Black King.

|None.

|28,056

|Text

|Classification

|1994

|{{cite book |doi=10.1093/oso/9780198538509.003.0012 |chapter=Learning Optimal Chess Strategies |title=Machine Intelligence 13 |date=1994 |last1=Bain |first1=M. |last2=Muggleton |first2=S. |pages=291–309 |isbn=978-0-19-853850-9 }}{{cite book |doi=10.1007/978-3-662-12405-5_15 |chapter=Learning Efficient Classification Procedures and Their Application to Chess End Games |title=Machine Learning |date=1983 |last1=Quinlan |first1=J. Ross |pages=463–482 |isbn=978-3-662-12407-9 }}

|M. Bain et al.

Chess (King-Rook vs. King-Pawn) Dataset

|King+Rook versus King+Pawn on a7.

|None.

|3196

|Text

|Classification

|1989

|{{cite book |last=Shapiro |first=Alen D. |title=Structured induction in expert systems |publisher=Addison-Wesley Longman Publishing Co., Inc. |year=1987}}

|R. Holte

Tic-Tac-Toe Endgame Dataset

|Binary classification for win conditions in tic-tac-toe.

|None.

|958

|Text

|Classification

|1991

|{{cite journal | last1 = Matheus | first1 = Christopher J. | last2 = Rendell | first2 = Larry A. | title = Constructive Induction on Decision Trees | url =https://www.ijcai.org/Proceedings/89-1/Papers/103.pdf | journal = IJCAI | volume = 89 | year = 1989 |s2cid=11018089 }}

|D. Aha
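
The 958 instances listed for the Tic-Tac-Toe Endgame Dataset can be reproduced by enumerating every distinct board at which a game ends. The sketch below assumes that X always moves first and that a game stops as soon as one player completes a line or the board is full. The `x`/`o`/`b` cell encoding is chosen for illustration and is not taken from the table.

```python
# Index triples for the three rows, three columns and two diagonals.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),
    (0, 3, 6), (1, 4, 7), (2, 5, 8),
    (0, 4, 8), (2, 4, 6),
]


def winner(board):
    """Return 'x' or 'o' if that player has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != "b" and board[a] == board[b] == board[c]:
            return board[a]
    return None


def terminal_boards():
    """Collect every distinct end-of-game board, with X moving first."""
    seen = set()
    finished = set()

    def play(board, player):
        if board in seen:
            return
        seen.add(board)
        if winner(board) or "b" not in board:
            finished.add(board)
            return
        for i, cell in enumerate(board):
            if cell == "b":
                nxt = board[:i] + player + board[i + 1:]
                play(nxt, "o" if player == "x" else "x")

    play("b" * 9, "x")
    return finished


boards = terminal_boards()
x_wins = sum(1 for b in boards if winner(b) == "x")
print(len(boards), x_wins)  # 958 626
```

Under these assumptions the enumeration yields 958 boards, of which 626 are wins for X; the remaining 332 are wins for O or draws.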

= Other multivariate =

style="width: 100%" class="wikitable sortable"

! scope="col" style="width: 15%;" | Dataset Name

! scope="col" style="width: 18%;" | Brief description

! scope="col" style="width: 18%;" | Preprocessing

! scope="col" style="width: 6%;" | Instances

! scope="col" style="width: 7%;" | Format

! scope="col" style="width: 7%;" | Default Task

! scope="col" style="width: 6%;" | Created (updated)

! scope="col" style="width: 6%;" | Reference

! scope="col" style="width: 11%;" | Creator

Housing Data Set

|Median home values of Boston with associated home and neighborhood attributes.

|None.

|506

|Text

|Regression

|1993

|Belsley, David A., Edwin Kuh, and Roy E. Welsch. Regression diagnostics: Identifying influential data and sources of collinearity. Vol. 571. John Wiley & Sons, 2005.

|D. Harrison et al.

The Getty Vocabularies

|Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials.

|None.

|large

|Text

|Classification

|2015

|{{cite journal | last1 = Ruotsalo | first1 = Tuukka | last2 = Aroyo | first2 = Lora | last3 = Schreiber | first3 = Guus | year = 2009 | title = Knowledge-based linguistic annotation of digital cultural heritage collections | url = http://dare.ubvu.vu.nl/bitstream/handle/1871/24407/243319.pdf?sequence=3 | journal = IEEE Intelligent Systems | volume = 24 | issue = 2 | pages = 64–75 | doi = 10.1109/MIS.2009.32 | hdl = 1871.1/9f6091aa-9596-46a9-9251-f11edeeb28b7 | s2cid = 6667472 | access-date = 6 December 2018 | archive-date = 16 August 2017 | archive-url = https://web.archive.org/web/20170816023938/http://dare.ubvu.vu.nl/bitstream/handle/1871/24407/243319.pdf?sequence=3 | url-status = dead }}

|Getty Center

Yahoo! Front Page Today Module User Click Log

|User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page.

|Conjoint analysis with a bilinear model.

|45,811,883 user visits

|Text

|Regression, clustering

|2009

|{{cite book |doi=10.1145/1935826.1935878 |chapter=Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms |title=Proceedings of the fourth ACM international conference on Web search and data mining |date=2011 |last1=Li |first1=Lihong |last2=Chu |first2=Wei |last3=Langford |first3=John |last4=Wang |first4=Xuanhui |pages=297–306 |arxiv=1003.5956 |isbn=978-1-4503-0493-1 }}{{cite book |doi=10.1109/DeSE.2010.40 |chapter=A Proactive Personalized Mobile News Recommendation System |title=2010 Developments in E-systems Engineering |date=2010 |last1=Yeung |first1=Kam Fung |last2=Yang |first2=Yanyan |pages=207–212 |isbn=978-1-4244-8044-9 }}

|Chu et al.

British Oceanographic Data Centre

|Biological, chemical, physical and geophysical data for oceans. 22K variables tracked.

|Various.

|22K variables, many instances

|Text

|Regression, clustering

|2015

|{{cite journal | last1 = Gass | first1 = Susan E. | last2 = Roberts | first2 = J. Murray | year = 2006 | title = The occurrence of the cold-water coral Lophelia pertusa (Scleractinia) on oil and gas platforms in the North Sea: colony growth, recruitment and environmental controls on distribution | journal = Marine Pollution Bulletin | volume = 52 | issue = 5| pages = 549–559 | doi=10.1016/j.marpolbul.2005.10.002| pmid = 16300800 | bibcode = 2006MarPB..52..549G }}

|British Oceanographic Data Centre

Congressional Voting Records Dataset

|Voting data for all USA representatives on 16 issues.

|Beyond the raw voting data, various other features are provided.

|435

|Text

|Classification

|1987

|{{cite journal |last1=Gionis |first1=Aristides |last2=Mannila |first2=Heikki |last3=Tsaparas |first3=Panayiotis |title=Clustering aggregation |journal=ACM Transactions on Knowledge Discovery from Data |date=March 2007 |volume=1 |issue=1 |pages=4 |doi=10.1145/1217299.1217303 }}

|J. Schlimmer

Entree Chicago Recommendation Dataset

|Record of user interactions with Entree Chicago recommendation system.

|Each user's interactions with the app are recorded in detail.

|50,672

|Text

|Regression, recommendation

|2000

|Obradovic, Zoran, and Slobodan Vucetic. Challenges in Scientific Data Mining: Heterogeneous, Biased, and Large Samples. Technical Report, Center for Information Science and Technology, Temple University, 2004.

|R. Burke

Insurance Company Benchmark (COIL 2000)

|Information on customers of an insurance company.

|Many features of each customer and the services they use.

|9,000

|Text

|Regression, classification

|2000

|{{cite journal | last1 = Van Der Putten | first1 = Peter | last2 = van Someren | first2 = Maarten | year = 2000 | title = CoIL challenge 2000: The insurance company case | journal = Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report | volume = 9 | pages = 1–43 }}{{cite journal | last1 = Mao | first1 = K. Z. | year = 2002 | title = RBF neural network center selection based on Fisher ratio class separability measure | journal = IEEE Transactions on Neural Networks| volume = 13 | issue = 5| pages = 1211–1217 | doi=10.1109/tnn.2002.1031953| pmid = 18244518 }}

|P. van der Putten

Nursery Dataset

|Data from applicants to nursery schools.

|Data about applicant's family and various other factors included.

|12,960

|Text

|Classification

|1997

|{{cite journal | last1 = Olave | first1 = Manuel | last2 = Rajkovic | first2 = Vladislav | last3 = Bohanec | first3 = Marko | year = 1989 | title = An application for admission in public school systems | url = http://kt.ijs.si/MarkoBohanec/pub/Nursery89.pdf | journal = Expert Systems in Public Administration | volume = 1 | pages = 145–160 }}{{cite arXiv | eprint=1212.2472 | last1=Lizotte | first1=Daniel J. | last2=Madani | first2=Omid | last3=Greiner | first3=Russell | title=Budgeted Learning of Naive-Bayes Classifiers | year=2012 | class=cs.LG }}

|V. Rajkovic et al.

University Dataset

|Data describing attributes of a large number of universities.

|None.

|285

|Text

|Clustering, classification

|1988

|{{cite report |last1=Lebowitz |first1=Michael |title=Concept Learning in a Rich Input Domain: Generalization-Based Memory |date=1984 |doi=10.7916/D8KP8990 |doi-access=free }}

|S. Sounders et al.

Blood Transfusion Service Center Dataset

|Data from a blood transfusion service center. Gives data on donors' return rate, frequency, etc.

|None.

|748

|Text

|Classification

|2008

|{{cite journal | last1 = Yeh | first1 = I-Cheng | last2 = Yang | first2 = King-Jang | last3 = Ting | first3 = Tao-Ming | year = 2009 | title = Knowledge discovery on RFM model using Bernoulli sequence | journal = Expert Systems with Applications | volume = 36 | issue = 3| pages = 5866–5871 | doi=10.1016/j.eswa.2008.07.018}}{{cite journal | last1 = Lee | first1 = Wen-Chen | last2 = Cheng | first2 = Bor-Wen | year = 2011 | title = An intelligent system for improving performance of blood donation | url = http://www.airitilibrary.com/Publication/alDetailedMesh?docid=10220690-201104-201105050019-201105050019-173-185| journal = Journal of Quality Vol | volume = 18 | issue = 2| page = 173 }}

|I. Yeh

Record Linkage Comparison Patterns Dataset

|Large dataset of records. Task is to link relevant records together.

|Blocking procedure applied to select only certain record pairs.

|5,749,132

|Text

|Classification

|2011

|Schmidtmann, Irene, et al. "[http://www.krebsregister-nrw.de/fileadmin/user_upload/dokumente/Evaluation/EKR_NRW_Evaluation_Abschlussbericht_2009-06-11.pdf Evaluation des Krebsregisters NRW Schwerpunkt Record Linkage] {{Webarchive|url=https://web.archive.org/web/20181206102339/http://www.krebsregister-nrw.de/fileadmin/user_upload/dokumente/Evaluation/EKR_NRW_Evaluation_Abschlussbericht_2009-06-11.pdf |date=6 December 2018 }}." Abschlußbericht vom 11 (2009).{{cite journal | last1 = Sariyar | first1 = Murat | last2 = Borg | first2 = Andreas | last3 = Pommerening | first3 = Klaus | year = 2011 | title = Controlling false match rates in record linkage using extreme value theory | journal = Journal of Biomedical Informatics | volume = 44 | issue = 4| pages = 648–654 | doi=10.1016/j.jbi.2011.02.008| pmid = 21352952 }}

|University of Mainz

Nomao Dataset

|Nomao collects data about places from many different sources. Task is to detect items that describe the same place.

|Duplicates labeled.

|34,465

|Text

|Classification

|2012

|{{cite book |last1=Candillier |first1=Laurent |last2=Lemaire |first2=Vincent |chapter=Active learning in the real-world design and analysis of the Nomao challenge |title=The 2013 International Joint Conference on Neural Networks (IJCNN) |date=August 2013 |volume=8 |pages=1–8 |doi=10.1109/IJCNN.2013.6706908 |isbn=978-1-4673-6129-3 }}{{cite thesis |last1=Garrido Marquez |first1=Ivan |title=A domain adaptation method for text classification based on self-adjusted training approach |date=2013 |url=https://inaoe.repositorioinstitucional.mx/jspui/handle/1009/230 }}{{page needed|date=September 2024}}

|Nomao Labs

Movie Dataset

|Data for 10,000 movies.

|Several features for each movie are given.

|10,000

|Text

|Clustering, classification

|1999

|Nagesh, Harsha S., Sanjay Goil, and Alok N. Choudhary. "Adaptive Grids for Clustering Massive Data Sets." SDM. 2001.

|G. Wiederhold

Open University Learning Analytics Dataset

|Information about students and their interactions with a virtual learning environment.

|None.

|~ 30,000

|Text

|Classification, clustering, regression

|2015

|Kuzilek, Jakub, et al. "[http://oro.open.ac.uk/42529/1/__userdata_documents4_ctb44_Desktop_analysing-at-risk-students-at-open-university.pdf OU Analyse: analysing at-risk students at The Open University]." Learning Analytics Review (2015): 1–16. Siemens, George, et al. [https://www.solaresearch.org/core/open-learning-analytics-an-integrated-modularized-platform/ Open Learning Analytics: an integrated & modularized platform]. Diss. Open University Press, 2011.

|J. Kuzilek et al.

Mobile phone records

|Telecommunications activity and interactions

|Aggregation per geographical grid cells and every 15 minutes.

|large

|Text

|Classification, Clustering, Regression

|2015

|{{cite journal |last1=Barlacchi |first1=Gianni |last2=De Nadai |first2=Marco |last3=Larcher |first3=Roberto |last4=Casella |first4=Antonio |last5=Chitic |first5=Cristiana |last6=Torrisi |first6=Giovanni |last7=Antonelli |first7=Fabrizio |last8=Vespignani |first8=Alessandro |last9=Pentland |first9=Alex |last10=Lepri |first10=Bruno |title=A multi-source dataset of urban life in the city of Milan and the Province of Trentino |journal=Scientific Data |date=27 October 2015 |volume=2 |issue=1 |doi=10.1038/sdata.2015.55 |pmid=26528394 |pmc=4622222 |bibcode=2015NatSD...250055B }}

|G. Barlacchi et al.

Curated repositories of datasets

Because datasets come in myriad formats and can be difficult to use, considerable work has gone into curating them and standardizing their formats to make them easier to use in machine learning research.

OpenML:{{cite journal | vauthors = Vanschoren J, van Rijn JN, Bischl B, Torgo L | year = 2013 | title = OpenML: networked science in machine learning | journal = SIGKDD Explorations | volume = 15 | issue = 2 | pages = 49–60 | doi = 10.1145/2641190.2641198 | arxiv = 1407.7722 | s2cid = 4977460 }} Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
PMLB:{{cite journal | vauthors = Olson RS, La Cava W, Orzechowski P, Urbanowicz RJ, Moore JH | year = 2017 | title = PMLB: a large benchmark suite for machine learning evaluation and comparison | journal = BioData Mining | volume = 10 | issue = 1 | pages = 36 | doi = 10.1186/s13040-017-0154-4 | pmid = 29238404 | pmc = 5725843 | bibcode = 2017arXiv170300512O | arxiv = 1703.00512 | doi-access = free }} A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
Metatext NLP: https://metatext.io/datasets A community-maintained web repository containing nearly 1,000 benchmark datasets. Covers many tasks, from classification to question answering, and many languages, including English, Portuguese, and Arabic.
Appen: Off The Shelf and Open Source Datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.{{cite web |title=Off The Shelf Datasets |url=https://appen.com/off-the-shelf-datasets/ |website=appen.com |publisher=Appen |access-date=30 December 2020}}{{cite web |title=Open Source Datasets |url=https://appen.com/resources/datasets/ |website=appen.com |publisher=Appen |access-date=30 December 2020}}
