List of datasets for machine-learning research#GLUE
{{Short description|none}}
{{Dynamic list|multiple=yes}}
{{Use dmy dates|date=September 2017}}
{{machine learning bar|Related articles}}
These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.{{cite web|url = https://edge.org/response-detail/26587|title = Datasets Over Algorithms|publisher = Edge.com|access-date = 8 January 2016|last = Wissner-Gross|first = A.}} High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.{{cite journal |last1=Weiss |first1=G. M. |last2=Provost |first2=F. |title=Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction |journal=Journal of Artificial Intelligence Research |date=October 2003 |volume=19 |pages=315–354 |doi=10.1613/jair.1199 }}{{cite book |last1=Abney |first1=Steven |title=Semisupervised Learning for Computational Linguistics |date=2007 |publisher=CRC Press |isbn=978-1-4200-1080-0 }}{{page needed|date=September 2024}}{{cite book |doi=10.1007/978-3-642-23808-6_39 |chapter=Active Learning with Evolving Streaming Data |title=Machine Learning and Knowledge Discovery in Databases |series=Lecture Notes in Computer Science |date=2011 |last1=Žliobaitė |first1=Indrė |last2=Bifet |first2=Albert |last3=Pfahringer |first3=Bernhard |last4=Holmes |first4=Geoff |volume=6913 |pages=597–612 |isbn=978-3-642-23807-9 }}
Many organizations, including governments, publish and share their datasets. Datasets are classified by license as open data or non-open data.
Datasets from governmental bodies are presented in List of open government data sites. They are hosted on open data portals, where they can be searched, deposited and accessed through interfaces such as Open API. Portals typically let users sort and filter datasets by the types and subtypes listed below.
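As a rough illustration of programmatic access to a data portal, the sketch below builds a search request for a CKAN-style portal and summarizes a response. It assumes the portal exposes CKAN's `package_search` action; the portal address and the sample response are hypothetical and hand-written, and no network request is made.

```python
import json
from urllib.parse import urlencode


def build_search_url(portal_base, query, rows=5):
    """Build a CKAN package_search URL (nothing is sent over the network)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_base.rstrip('/')}/api/3/action/package_search?{params}"


def summarize_results(response_text):
    """Return (name, license_id) pairs from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("portal reported an unsuccessful search")
    return [(d.get("name"), d.get("license_id"))
            for d in payload["result"]["results"]]


# Hypothetical portal address and sample response, for illustration only.
url = build_search_url("https://demo.example.org/", "air quality", rows=2)
sample = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "air-quality-2020", "license_id": "cc-by"}],
    },
})

print(url)
print(summarize_results(sample))
```

Keeping URL construction separate from response parsing makes it possible to check both steps offline before pointing the code at a real portal.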
List of sorting used for datasets
class="wikitable"
!Type !Subtypes |
Specific category
|Finance, Economics, Commerce, Societal, Health, Academy, Sports, Food, Agriculture, Travel, Geospatial, Political, Consumer, Transport, Logistics, Environmental, Real-Estate, Legal, Entertainment, Energy, Hospitality |
Scope
|Supranational Union, National, Subnational, Municipality, Urban, Rural |
Language |
Type |
Usage |
File-Formats |
Licenses
|Creative-Commons, GPL, Other Non-Open data licenses |
Last-Updated
|Last-Hour, Last-Day, Last-Week, Last-Month, Last-Year |
File-Size
|Minimum, Maximum, Range |
[https://docs.openml.org/#dataset-status Status]
|Verified, In-Preparation, Deactivated (or Deprecated) |
Number of records
|100s, 1000s, 10000s, 100000s, Millions |
Number of variables
|Less than 10, 10s, 100s, 1000s, 10000s |
Services
|Individual, Aggregation |
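The sorting criteria in the table above can be thought of as filters over dataset metadata. The following is a minimal sketch of that idea; the `DatasetRecord` fields, the `filter_catalog` helper and the sample catalog are hypothetical and do not correspond to any particular portal's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    category: str     # e.g. "Health", "Finance"
    license: str      # e.g. "Creative-Commons", "GPL"
    status: str       # e.g. "Verified", "In-Preparation", "Deactivated"
    n_records: int
    n_variables: int


def filter_catalog(records, category=None, licenses=None, status=None,
                   min_records=0, max_variables=None):
    """Keep only the records that satisfy every criterion that was supplied."""
    selected = []
    for rec in records:
        if category is not None and rec.category != category:
            continue
        if licenses is not None and rec.license not in licenses:
            continue
        if status is not None and rec.status != status:
            continue
        if rec.n_records < min_records:
            continue
        if max_variables is not None and rec.n_variables > max_variables:
            continue
        selected.append(rec)
    return selected


# Hypothetical catalog entries, for illustration only.
catalog = [
    DatasetRecord("clinic-visits", "Health", "Creative-Commons", "Verified", 120_000, 14),
    DatasetRecord("crop-yields", "Agriculture", "GPL", "Verified", 8_500, 9),
    DatasetRecord("hospital-beds", "Health", "Other", "In-Preparation", 900, 6),
]

matches = filter_catalog(
    catalog,
    category="Health",
    licenses={"Creative-Commons", "GPL"},
    status="Verified",
    min_records=1000,
)
print([rec.name for rec in matches])
```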
Data portals are classified by their type of license. Portals based on open-source licenses are known as open data portals and are used by many government organizations and academic institutions.
List of open data portals
{{see also|Open data portal}}
class="wikitable"
!Portal-name !License !List of installations of the portal !Typical usages |
Comprehensive Knowledge Archive Network (CKAN)
|AGPL |https://ckan.github.io/ckan-instances/ https://github.com/sebneu/ckan_instances/blob/master/instances.csv |Data repository for government or non-profit organisations, Data Management Solution for Research Institutes |
[https://getdkan.org/ DKAN]
|GPL |https://getdkan.org/community |Data repository for government or non-profit organisations, Data Management Solution for Research Institutes |
Dataverse
|https://dataverse.org/installations https://dataverse.org/metrics |Data Management Solution for Research Institutes |
DSpace
|BSD |https://registry.lyrasis.org/ |Data Management Solution for Research Institutes |
[https://www.openml.org/ OpenML]
|BSD |https://www.openml.org/search?type=data&sort=runs&status=active |Data Management Solution to share datasets, algorithms, and experiment results through APIs. |
List of portals suitable for multiple types of applications
{{see also|machine learning}}
Some data portals list a wide variety of dataset subtypes that pertain to many machine learning applications.
class="wikitable"
Academic Torrents
|https://academictorrents.com |
Amazon Datasets
|https://registry.opendata.aws/ |
Awesome Public Datasets Collection
|https://github.com/awesomedata/awesome-public-datasets |
data.world
|https://data.world/datasets/machine-learning |
Datahub – Core Datasets
|https://datahub.io/docs/core-data |
DataONE
|https://www.dataone.org/ |
DataPortals
|https://dataportals.org/ |
Datasetlist.com
|https://www.datasetlist.com |
Global Open Data Index – Open Knowledge Foundation
|https://okfn.org/ {{Webarchive|url=https://web.archive.org/web/20200525213547/https://index.okfn.org/ |date=25 May 2020 }} |
Google Dataset Search
|https://datasetsearch.research.google.com/ |
Hugging Face
|https://huggingface.co/docs/datasets/ |
IBM's Data Asset Exchange
|https://developer.ibm.com/exchanges/data/ |
Jupyter – Tutorial Data
|https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html |
Kaggle
|https://www.kaggle.com/datasets |
Machine learning datasets
|https://macgence.com/data-sets-and-cataloges/ |
Major Smart Cities with Open Data
|https://rlist.io/l/major-smart-cities-with-open-data-portals |
Microsoft Datasets
|https://msropendata.com/datasets |
Open Data Inception
|https://opendatainception.io/ |
Opendatasoft
|https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en |
OpenDOAR
|https://v2.sherpa.ac.uk/opendoar/ |
OpenML
|https://www.openml.org/search?type=data |
Papers with Code
|https://paperswithcode.com/datasets |
Penn Machine Learning Benchmarks
|https://github.com/EpistasisLab/pmlb/tree/master/datasets |
Public APIs
|https://github.com/public-apis/public-apis |
Registry of Open Access Repositories
|http://roar.eprints.org/ |
REgistry of REsearch Data REpositories
|https://www.re3data.org/ |
UCI Machine Learning Repository
|http://mlr.cs.umass.edu/ml/ {{Webarchive|url=https://web.archive.org/web/20200626215834/http://mlr.cs.umass.edu/ml/ |date=26 June 2020 }} |
Speech Dataset
|https://www.shaip.com/offerings/speech-data-catalog/ |
Visual Data Discovery
|https://visualdata.io/discovery |
List of portals suitable for a specific subtype of applications
{{see also|Machine learning}}
Data portals suited to a specific subtype of machine learning application are listed in the following sections.
Image data
{{Main|List of datasets in computer vision and image processing}}
Text data
These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.
= Reviews =
= News articles =
class="wikitable sortable" style="width: 100%"
! scope="col" style="width: 15%;" | Dataset Name ! scope="col" style="width: 18%;" | Brief description ! scope="col" style="width: 18%;" | Preprocessing ! scope="col" style="width: 6%;" | Instances ! scope="col" style="width: 7%;" | Format ! scope="col" style="width: 7%;" | Default Task ! scope="col" style="width: 6%;" | Created (updated) ! scope="col" style="width: 6%;" | Reference ! scope="col" style="width: 11%;" | Creator |
NYSK Dataset
|English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. |Filtered and presented in XML format. |10,421 |XML, text |Sentiment analysis, topic extraction |2013 |Dermouche, M. et al. |
The Reuters Corpus Volume 1
|Large corpus of Reuters news stories in English. |Fine-grain categorization and topic codes. |810,000 |Text |Classification, clustering, summarization |2002 |
The Reuters Corpus Volume 2
|Large corpus of Reuters news stories in multiple languages. |Fine-grain categorization and topic codes. |487,000 |Text |Classification, clustering, summarization |2005 |
Thomson Reuters Text Research Collection
|Large corpus of news stories. |Details not described. |1,800,370 |Text |Classification, clustering, summarization |2009 |T. Rose et al. |
Saudi Newspapers Corpus
|31,030 Arabic newspaper articles. |Metadata extracted. |31,030 |JSON |Summarization, clustering |2015 |M. Alhagri |
RE3D (Relationship and Entity Extraction Evaluation Dataset)
|Entity and Relation marked data from various news and government sources. Sponsored by Dstl |Filtered, categorisation using Baleen types |not known |JSON |Classification, Entity and Relation recognition |2017 |{{Cite web | url=https://github.com/dstl/re3d | title=Relationship and Entity Extraction Evaluation Dataset: Dstl/re3d| website=GitHub| date=2018-12-17}} |Dstl |
Examiner Spam Clickbait Catalogue
|Clickbait, spam, crowd-sourced headlines from 2010 to 2015 |Publish date and headlines |3,089,781 |CSV |Clustering, Events, Sentiment |2016 |R. Kulkarni |
ABC Australia News Corpus
|Entire news corpus of ABC Australia from 2003 to 2019 |Publish date and headlines |1,186,018 |CSV |Clustering, Events, Sentiment |2020 |{{Cite web | url=https://www.kaggle.com/therohk/million-headlines | title=A Million News Headlines}} |R. Kulkarni |
Worldwide News – Aggregate of 20K Feeds
|One week snapshot of all online headlines in 20+ languages |Publish time, URL and headlines |1,398,431 |CSV |Clustering, Events, Language Detection |2018 |R. Kulkarni |
Reuters News Wire Headline
|11 Years of timestamped events published on the news-wire |Publish time, Headline Text |16,121,310 |CSV |NLP, Computational Linguistics, Events |2018 |R. Kulkarni |
The Irish Times Ireland News Corpus
|24 Years of Ireland News from 1996 to 2019 |Publish time, Headline Category and Text |1,484,340 |CSV |NLP, Computational Linguistics, Events |2020 |R. Kulkarni |
News Headlines Dataset for Sarcasm Detection
|High quality dataset with Sarcastic and Non-sarcastic news headlines. |Clean, normalized text |26,709 |JSON |NLP, Classification, Linguistics |2018 |Rishabh Misra |
= Messages =
= Twitter and tweets =
= Dialogues =
style="width: 100%" class="wikitable sortable"
! scope="col" style="width: 15%;" | Dataset Name ! scope="col" style="width: 18%;" | Brief description ! scope="col" style="width: 18%;" | Preprocessing ! scope="col" style="width: 6%;" | Instances ! scope="col" style="width: 7%;" | Format ! scope="col" style="width: 7%;" | Default Task ! scope="col" style="width: 6%;" | Created (updated) ! scope="col" style="width: 6%;" | Reference ! scope="col" style="width: 11%;" | Creator |
NPS Chat Corpus
|Posts from age-specific online chat rooms. |Hand privacy masked, tagged for part of speech and dialogue-act. |~ 500,000 |XML |NLP, programming, linguistics |2007 |Forsyth, E., Lin, J., & Martell, C. |
Twitter Triple Corpus
|A-B-A triples extracted from Twitter. | |4,232 |Text |NLP |2016 |Sordoni, A. et al. |
UseNet Corpus
|UseNet forum postings. |Anonymized e-mails and URLs. Omitted documents with lengths <500 words or >500,000 words, or that were <90% English. |7 billion |Text | |2011 |Shaoul, C., & Westbury C. |
NUS SMS Corpus
|SMS messages collected between two users, with timing analysis. | |~ 10,000 |XML |NLP |2011 |KAN, M |
Reddit All Comments Corpus
|All Reddit comments (as of 2015). | |~ 1.7 billion |JSON |NLP, research |2015 |Stuck_In_the_Matrix |
Ubuntu Dialogue Corpus
|Dialogues extracted from Ubuntu chat stream on IRC. | |930 thousand dialogues, 7.1 million utterances |CSV |Dialogue Systems Research |2015 |Lowe, R. et al. |
Dialog State Tracking Challenge
|The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems. |Transcription of spoken dialogs with labelling |DSTC2 contains ~3.2k calls – DSTC3 contains ~2.3k calls |JSON |Dialogue state tracking |2014 |Henderson, Matthew and Thomson, Blaise and Williams, Jason D |
= Legal =
class="wikitable sortable" style="width: 100%"
! scope="col" style="width: 15%;" | Dataset Name ! scope="col" style="width: 18%;" | Brief description ! scope="col" style="width: 18%;" | Preprocessing ! scope="col" style="width: 6%;" | Instances ! scope="col" style="width: 7%;" | Format ! scope="col" style="width: 7%;" | Default Task ! scope="col" style="width: 6%;" | Created (updated) ! scope="col" style="width: 6%;" | Reference ! scope="col" style="width: 11%;" | Creator |
FreeLaw
|Filtered data from Court Listener, part of the FreeLaw project. |Cleaned and normalized text |4,940,710 |JSON |NLP, linguistics |2020 |T. Hoppe |
Pile of Law
|Corpus of legal and administrative data |Cleaned, normalized, and privatized |~50,000,000 |JSON |NLP, linguistics, sentiment |2022 |{{Cite book |last1=Zheng |first1=Lucia |last2=Guha |first2=Neel |last3=Anderson |first3=Brandon R. |last4=Henderson |first4=Peter |last5=Ho |first5=Daniel E. |title=Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law |chapter=When does pretraining help? |date=2021-06-21 |chapter-url=http://dx.doi.org/10.1145/3462757.3466088 |pages=159–168 |location=New York, NY, USA |publisher=ACM |doi=10.1145/3462757.3466088|isbn=9781450385268 |s2cid=233296302 }}{{Cite web |title=pile-of-law/pile-of-law · Datasets at Hugging Face |url=https://huggingface.co/datasets/pile-of-law/pile-of-law |access-date=2023-01-11 |website=huggingface.co|date=4 July 2022 }} |L. Zheng; N. Guha; B. Anderson; P. Henderson; D. Ho |
Caselaw Access Project
|All official, book-published state and federal United States case law: every volume or case designated as an official report of decisions by a court within the United States. |Cleaned and normalized text |~10,000 |JSON |NLP, linguistics |2022 |A. Aizman; S. Chapman; J. Cushman; K. Dulin; H. Eidolon; et al. |
= Other text =
Sound data
These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.
= Speech =
= Music =
= Other sounds =
Signal data
Datasets containing electric signal information requiring some sort of signal processing for further analysis.
= Electrical =
= Motion-tracking =
= Other signals =
Physical data
Datasets from physical systems.
= High-energy physics =
= Systems =
= Astronomy =
class="wikitable sortable" style="width: 100%"
! scope="col" style="width: 15%;" | Dataset Name ! scope="col" style="width: 18%;" | Brief description ! scope="col" style="width: 18%;" | Preprocessing ! scope="col" style="width: 6%;" | Instances ! scope="col" style="width: 7%;" | Format ! scope="col" style="width: 7%;" | Default Task ! scope="col" style="width: 6%;" | Created (updated) ! scope="col" style="width: 6%;" | Reference ! scope="col" style="width: 11%;" | Creator |
Volcanoes on Venus – JARtool experiment Dataset
|Venus images returned by the Magellan spacecraft. |Images are labeled by humans. |not given |Images |Classification |1991 |{{cite journal |last1=Pettengill |first1=Gordon H. |last2=Ford |first2=Peter G. |last3=Johnson |first3=William T. K. |last4=Raney |first4=R. Keith |last5=Soderblom |first5=Laurence A. |title=Magellan: Radar Performance and Data Products |journal=Science |date=12 April 1991 |volume=252 |issue=5003 |pages=260–265 |doi=10.1126/science.252.5003.260 |pmid=17769272 |bibcode=1991Sci...252..260P }}{{cite journal | last1 = Aharonian | first1 = F. | display-authors = et al | year = 2008 | title = Energy spectrum of cosmic-ray electrons at TeV energies | journal = Physical Review Letters | volume = 101 | issue = 26| page = 261104 | bibcode = 2008PhRvL.101z1104A | doi = 10.1103/PhysRevLett.101.261104 | pmid = 19437632 | arxiv = 0811.3894 | hdl = 2440/51450 | s2cid = 41850528 }} |M. Burl |
MAGIC Gamma Telescope Dataset
|Monte Carlo generated high-energy gamma particle events. |Numerous features extracted from the simulations. |19,020 |Text |Classification |2007 |{{cite journal | last1 = Bock | first1 = R. K. | display-authors = et al | year = 2004 | title = Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope | journal = Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment | volume = 516 | issue = 2| pages = 511–528 | doi=10.1016/j.nima.2003.08.157| bibcode = 2004NIMPA.516..511B }} |R. Bock |
Solar Flare Dataset
|Measurements of the number of certain types of solar flare events occurring in a 24-hour period. |Many solar flare-specific features are given. |1389 |Text |Regression, classification |1989 |G. Bradshaw |
CAMELS Multifield Dataset
|2D maps and 3D grids from thousands of N-body and state-of-the-art hydrodynamic simulations spanning a broad range of values of the cosmological and astrophysical parameters |Each map and grid has 6 cosmological and astrophysical parameters associated with it |405,000 2D maps and 405,000 3D grids |2D maps and 3D grids |Regression |2021 |Francisco Villaescusa-Navarro et al. |
= Earth science =
class="wikitable sortable" style="width: 100%"
! scope="col" style="width: 15%;" | Dataset Name ! scope="col" style="width: 18%;" | Brief description ! scope="col" style="width: 18%;" | Preprocessing ! scope="col" style="width: 6%;" | Instances ! scope="col" style="width: 7%;" | Format ! scope="col" style="width: 7%;" | Default Task ! scope="col" style="width: 6%;" | Created (updated) ! scope="col" style="width: 6%;" | Reference ! scope="col" style="width: 11%;" | Creator |
Volcanoes of the World
|Volcanic eruption data for all known volcanic events on Earth. |Details such as region, subregion, tectonic setting, dominant rock type are given. |1535 |Text |Regression, classification |2013 |E. Venzke et al. |
Seismic-bumps Dataset
|Seismic activities from a coal mine. |Seismic activity was classified as hazardous or not. |2584 |Text |Classification |2013 |{{cite journal | last1 = Sikora | first1 = Marek | last2 = Wróbel | first2 = Łukasz | year = 2010 | title = Application of rule induction algorithms for analysis of data collected by seismic hazard monitoring systems in coal mines | url = https://www.infona.pl/resource/bwmeta1.element.baztech-article-BPZ5-0008-0008| journal = Archives of Mining Sciences | volume = 55 | issue = 1| pages = 91–114 }}{{cite book |doi=10.1007/978-1-4471-2760-4_10 |chapter=Rough Natural Hazards Monitoring |title=Rough Sets: Selected Methods and Applications in Management and Engineering |series=Advanced Information and Knowledge Processing |date=2012 |last1=Sikora |first1=Marek |last2=Sikora |first2=Beata |pages=163–179 |isbn=978-1-4471-2759-8 }} |M. Sikora et al. |
CAMELS-US
|Catchment hydrology dataset with hydrometeorological timeseries and various attributes |see Reference |671 |CSV, Text, Shapefile |Regression |2017 |{{cite journal |last1=Addor |first1=Nans |last2=Newman |first2=Andrew J. |last3=Mizukami |first3=Naoki |last4=Clark |first4=Martyn P. |title=The CAMELS data set: catchment attributes and meteorology for large-sample studies |journal=Hydrology and Earth System Sciences |date=20 October 2017 |volume=21 |issue=10 |pages=5293–5313 |doi=10.5194/hess-21-5293-2017 |doi-access=free |bibcode=2017HESS...21.5293A }}{{cite journal |last1=Newman |first1=A. J. |last2=Clark |first2=M. P. |last3=Sampson |first3=K. |last4=Wood |first4=A. |last5=Hay |first5=L. E. |last6=Bock |first6=A. |last7=Viger |first7=R. J. |last8=Blodgett |first8=D. |last9=Brekke |first9=L. |last10=Arnold |first10=J. R. |last11=Hopson |first11=T. |last12=Duan |first12=Q. |title=Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance |journal=Hydrology and Earth System Sciences |date=14 January 2015 |volume=19 |issue=1 |pages=209–223 |doi=10.5194/hess-19-209-2015 |doi-access=free |bibcode=2015HESS...19..209N }} |N. Addor et al. / A. Newman et al. |
CAMELS-Chile
|Catchment hydrology dataset with hydrometeorological timeseries and various attributes |see Reference |516 |CSV, Text, Shapefile |Regression |2018 |C. Alvarez-Garreton et al. |
CAMELS-Brazil
|Catchment hydrology dataset with hydrometeorological timeseries and various attributes |see Reference |897 |CSV, Text, Shapefile |Regression |2020 |V. Chagas et al. |
CAMELS-GB
|Catchment hydrology dataset with hydrometeorological timeseries and various attributes |see Reference |671 |CSV, Text, Shapefile |Regression |2020 |G. Coxon et al. |
CAMELS-Australia
|Catchment hydrology dataset with hydrometeorological timeseries and various attributes |see Reference |222 |CSV, Text, Shapefile |Regression |2021 |K. Fowler et al. |
LamaH-CE
|Catchment hydrology dataset with hydrometeorological timeseries and various attributes |see Reference |859 |CSV, Text, Shapefile |Regression |2021 |C. Klingler et al. |
= Other physical =
{{sort-under}}
Biological data
Datasets from biological systems.
= Human =
= Animal =
class="wikitable sortable" style="width: 100%"
! scope="col" style="width: 15%;" | Dataset Name ! scope="col" style="width: 18%;" | Brief description ! scope="col" style="width: 18%;" | Preprocessing ! scope="col" style="width: 6%;" | Instances ! scope="col" style="width: 7%;" | Format ! scope="col" style="width: 7%;" | Default Task ! scope="col" style="width: 6%;" | Created (updated) ! scope="col" style="width: 6%;" | Reference ! scope="col" style="width: 11%;" | Creator |
Abalone Dataset
|Physical measurements of Abalone. Weather patterns and location are also given. |None. |4177 |Text |Regression |1995 |Marine Research Laboratories – Taroona |
Zoo Dataset
|Artificial dataset covering 7 classes of animals. |Animals are classed into 7 categories and features are given for each. |101 |Text |Classification |1990 |R. Forsyth |
Demospongiae Dataset
|Data about marine sponges. |503 sponges in the Demosponge class are described by various features. |503 |Text |Classification |2010 |E. Armengol et al. |
Farm animals data
|PLF data inventory (cows, pigs; location, acceleration, etc.). |Labeled datasets. |List is constantly updated |Text |Classification |2020 |{{Cite web|url=https://github.com/Animal-Data-Inventory/PLFDataInventory|title=PLF data inventory|website=GitHub|date=5 November 2021}} |V. Bloch |
Splice-junction Gene Sequences Dataset
|Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. |None. |3190 |Text |Classification |1992 |G. Towell et al. |
Mice Protein Expression Dataset
|Expression levels of 77 proteins measured in the cerebral cortex of mice. |None. |1080 |Text |Classification, Clustering |2015 |{{cite journal | last1 = Higuera | first1 = Clara | last2 = Gardiner | first2 = Katheleen J. | last3 = Cios | first3 = Krzysztof J. | year = 2015 | title = Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome | journal = PLOS ONE | volume = 10 | issue = 6| page = e0129126 | doi=10.1371/journal.pone.0129126| pmid = 26111164 | pmc = 4482027 | bibcode = 2015PLoSO..1029126H | doi-access = free }}{{cite journal | last1 = Ahmed | first1 = Md Mahiuddin | display-authors = et al | year = 2015 | title = Protein dynamics associated with failed and rescued learning in the Ts65Dn mouse model of Down syndrome | journal = PLOS ONE | volume = 10 | issue = 3| page = e0119491 | doi=10.1371/journal.pone.0119491| pmid = 25793384 | pmc = 4368539 | bibcode = 2015PLoSO..1019491A | doi-access = free }} |C. Higuera et al. |
= Fungi =
class="wikitable sortable" style="width: 100%"
! scope="col" style="width: 15%;" | Dataset Name ! scope="col" style="width: 18%;" | Brief description ! scope="col" style="width: 18%;" | Preprocessing ! scope="col" style="width: 6%;" | Instances ! scope="col" style="width: 7%;" | Format ! scope="col" style="width: 7%;" | Default Task ! scope="col" style="width: 6%;" | Created (updated) ! scope="col" style="width: 6%;" | Reference ! scope="col" style="width: 11%;" | Creator |
UCI Mushroom Dataset
|Mushroom attributes and classification. |Many properties of each mushroom are given. |8124 |Text |Classification |1987 |J. Schlimmer |
Secondary Mushroom Dataset
|Mushroom attributes and classification |Simulated data from larger and more realistic primary mushroom entries. Fully reproducible. |61069 |Text |Classification |2020 |{{Cite web|title=Mushroom Data Set 2020|url=https://mushroom.mathematik.uni-marburg.de/|access-date=2021-04-06|website=mushroom.mathematik.uni-marburg.de}}{{cite journal |last1=Wagner |first1=Dennis |last2=Heider |first2=Dominik |last3=Hattab |first3=Georges |title=Mushroom data creation, curation, and simulation to support classification tasks |journal=Scientific Reports |date=14 April 2021 |volume=11 |issue=1 |page=8134 |doi=10.1038/s41598-021-87602-3 |pmid=33854157 |pmc=8046754 |bibcode=2021NatSR..11.8134W }} |D. Wagner et al. |
= Plant =
= Microbe =
= Drug discovery =
class="wikitable sortable" style="width: 100%"
! scope="col" style="width: 15%;" | Dataset Name ! scope="col" style="width: 18%;" | Brief description ! scope="col" style="width: 18%;" | Preprocessing ! scope="col" style="width: 6%;" | Instances ! scope="col" style="width: 7%;" | Format ! scope="col" style="width: 7%;" | Default Task ! scope="col" style="width: 6%;" | Created (updated) ! scope="col" style="width: 6%;" | Reference ! scope="col" style="width: 11%;" | Creator |
Tox21 Dataset
|Prediction of outcome of biological assays. |Chemical descriptors of molecules are given. |12707 |Text |Classification |2016 |A. Mayr et al. |
Anomaly data
style="width: 100%" class="wikitable sortable"
! scope="col" style="width: 15%;" | Dataset Name ! scope="col" style="width: 18%;" | Brief description ! scope="col" style="width: 18%;" | Preprocessing ! scope="col" style="width: 6%;" | Instances ! scope="col" style="width: 7%;" | Format ! scope="col" style="width: 7%;" | Default Task ! scope="col" style="width: 6%;" | Created (updated) ! scope="col" style="width: 6%;" | Reference ! scope="col" style="width: 11%;" | Creator |
Numenta Anomaly Benchmark (NAB)
|Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted. |None |50+ files |CSV |2016 (continually updated) |Numenta |
Skoltech Anomaly Benchmark (SKAB)
|Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed. |There are two markups for Outlier detection (point anomalies) and Changepoint detection (collective anomalies) problems |30+ files (v0.9) |CSV |2020 (continually updated) | {{cite web |author1=Iurii D. Katser |author2=Vyacheslav O. Kozitsin |title=SKAB GitHub repository |website=GitHub |url=https://github.com/waico/skab |access-date=12 January 2021}} |Iurii D. Katser and Vyacheslav O. Kozitsin |
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study
|Most data files are adapted from UCI Machine Learning Repository data, some are collected from the literature. |treated for missing values, numerical attributes only, different percentages of anomalies, labels |1000+ files |ARFF |2016 (possibly updated with new datasets and/or results) | |Campos et al. |
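To make the notion of a timestamped, single-valued metric file concrete, the sketch below reads a small CSV and flags points that lie far from the mean. The column names `timestamp` and `value` and the sample data are assumptions for illustration, and the z-score rule is only a simple baseline, not the scoring method used by any of the benchmarks listed above.

```python
import csv
import io
import statistics


def flag_outliers(csv_text, threshold=2.0):
    """Return timestamps whose value lies more than `threshold`
    population standard deviations away from the mean."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(row["value"]) for row in rows]
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # a constant series has no outliers under this rule
    return [row["timestamp"]
            for row, v in zip(rows, values)
            if abs(v - mean) / stdev > threshold]


# Hand-written sample data with assumed column names.
sample_csv = """timestamp,value
2020-01-01 00:00:00,10
2020-01-01 01:00:00,11
2020-01-01 02:00:00,9
2020-01-01 03:00:00,10
2020-01-01 04:00:00,50
2020-01-01 05:00:00,10
2020-01-01 06:00:00,11
2020-01-01 07:00:00,9
"""

flagged = flag_outliers(sample_csv)
print(flagged)
```

A global z-score ignores temporal ordering, so it would miss collective anomalies such as changepoints; it is shown here only because it is the shortest rule that operates on this file layout.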
Question answering data
This section includes datasets that deal with structured data.
class="wikitable sortable" style="width: 100%"
! scope="col" style="width: 15%;" | Dataset Name ! scope="col" style="width: 18%;" | Brief description ! scope="col" style="width: 18%;" | Preprocessing ! scope="col" style="width: 6%;" | Instances ! scope="col" style="width: 7%;" | Format ! scope="col" style="width: 7%;" | Default Task ! scope="col" style="width: 6%;" | Created (updated) ! scope="col" style="width: 6%;" | Reference ! scope="col" style="width: 11%;" | Creator |
DBpedia Neural Question Answering (DBNQA) Dataset
|A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base. |This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts. |894,499 |Question-query pairs |Question Answering |2018 |Ann-Kathrin Hartmann, Tommaso Soru, Edgard Marx. [https://www.researchgate.net/publication/324482598_Generating_a_Large_Dataset_for_Neural_Question_Answering_over_the_DBpedia_Knowledge_Base Generating a Large Dataset for Neural Question Answering over the DBpedia Knowledge Base]. 2018.{{cite report |type=Preprint |last1=Soru |first1=Tommaso |last2=Marx |first2=Edgard |last3=Moussallem |first3=Diego |last4=Publio |first4=Gustavo |last5=Valdestilhas |first5=André |last6=Esteves |first6=Diego |last7=Neto |first7=Ciro Baron |title=SPARQL as a Foreign Language |date=2017 |arxiv=1708.07624 }} |Hartmann, Soru, and Marx et al. |
Vietnamese Question Answering Dataset (UIT-ViQuAD)
|A large collection of Vietnamese questions for evaluating MRC models. |This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. |23,074 |Question-answer pairs |Question Answering |2020 |Nguyen et al. |
Vietnamese Multiple-Choice Machine Reading Comprehension Corpus (ViMMRC)
|A collection of Vietnamese multiple-choice questions for evaluating MRC models. |This corpus includes 2,783 Vietnamese multiple-choice questions. |2,783 |Question-answer pairs |Question Answering/Machine Reading Comprehension |2020 |Nguyen et al. |
Open-Domain Question Answering Goes Conversational via Question Rewriting
|An end-to-end open-domain question answering dataset. |This dataset includes 14,000 conversations with 81,000 question-answer pairs. | |Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source. Further details are provided in the [https://github.com/apple/ml-qrecc project's GitHub repository] and respective [https://huggingface.co/datasets/svakulenk0/qrecc Hugging Face dataset card]. |Question Answering |2021 |Anantha and Vakulenko et al. |
UnifiedQA
|Question-answer data |Processed dataset | | |Question Answering |2020 |Khashabi et al. |
Dialog or instruction prompted data
This section includes datasets that contain multi-turn text with at least two actors, a "user" and an "agent". The user makes requests of the agent, which carries them out.
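The user/agent structure described here can be sketched as a small data model. The class and field names below are hypothetical and do not reproduce the schema of any dataset in this section; the example only shows how consecutive user requests and agent replies might be paired for training.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Turn:
    speaker: str  # "user" or "agent"
    text: str


@dataclass
class Dialog:
    dialog_id: str
    turns: List[Turn] = field(default_factory=list)

    def request_response_pairs(self) -> List[Tuple[str, str]]:
        """Pair each user turn with the agent turn that directly follows it."""
        pairs = []
        for prev, cur in zip(self.turns, self.turns[1:]):
            if prev.speaker == "user" and cur.speaker == "agent":
                pairs.append((prev.text, cur.text))
        return pairs


# Hand-written example dialog, for illustration only.
dialog = Dialog("demo-001", [
    Turn("user", "Book two tickets for the 7 pm show."),
    Turn("agent", "Two tickets for 7 pm are booked."),
    Turn("user", "Send the receipt to my email."),
    Turn("agent", "The receipt has been sent."),
])

pairs = dialog.request_response_pairs()
print(len(pairs))
```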
class="wikitable sortable" style="width: 100%"
! scope="col" style="width: 15%;" | Dataset Name ! scope="col" style="width: 18%;" | Brief description ! scope="col" style="width: 18%;" | Preprocessing ! scope="col" style="width: 6%;" | Instances ! scope="col" style="width: 7%;" | Format ! scope="col" style="width: 7%;" | Default Task ! scope="col" style="width: 6%;" | Created (updated) ! scope="col" style="width: 6%;" | Reference ! scope="col" style="width: 11%;" | Creator |
Taskmaster
|"The Taskmaster corpus consists of THREE datasets, Taskmaster-1 (TM-1), Taskmaster-2 (TM-2), and Taskmaster-3 (TM-3), comprising over 55,000 spoken and written task-oriented dialogs in over a dozen domains."{{Citation |title=Taskmaster |date=2022-12-17 |url=https://github.com/google-research-datasets/Taskmaster |publisher=Google Research Datasets |access-date=2023-01-07}} |Taskmaster-1: goal-oriented conversational dataset. It includes 13,215 task-based dialogs comprising six domains. Taskmaster-2: 17,289 dialogs in the seven domains (restaurants, food ordering, movies, hotels, flights, music and sports). Taskmaster-3: 23,757 movie ticketing dialogs. | |Taskmaster-1 and Taskmaster-2: conversation id, utterances, Instruction id Taskmaster-3: conversation id, utterances, vertical, scenario, instructions. For further details check the [https://github.com/google-research-datasets/Taskmaster project's GitHub repository] or the Hugging Face dataset cards ([https://huggingface.co/datasets/taskmaster1 taskmaster-1], [https://huggingface.co/datasets/taskmaster2 taskmaster-2], [https://huggingface.co/datasets/taskmaster3 taskmaster-3]). |Dialog/Instruction prompted |2019 |Byrne and Krishnamoorthi et al. |
DrRepair
|A labeled dataset for program repair. |Pre-processed data | |Check format details in the [https://worksheets.codalab.org/worksheets/0x01838644724a433c932bef4cb5c42fbd project's worksheet]. |Dialog/Instruction prompted |2020 |Yasunaga and Liang |
Natural Instructions v2
|Large dataset that covers a wide range of reasoning abilities | | |Each task consists of input/output pairs and a task definition. Further information is provided in the [https://github.com/allenai/natural-instructions GitHub repository] of the project and the [https://huggingface.co/datasets/Muennighoff/natural-instructions Hugging Face data card]. |Input/Output and task definition |2022 |Wang et al. |
LAMBADA
|"LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word."{{Citation |last1=Paperno |first1=Denis |title=The LAMBADA dataset |date=2016-08-07 |url=https://zenodo.org/record/2630551 |access-date=2023-01-07 |last2=Kruszewski |first2=Germán |last3=Lazaridou |first3=Angeliki |last4=Pham |first4=Quan Ngoc |last5=Bernardi |first5=Raffaella |last6=Pezzelle |first6=Sandro |last7=Baroni |first7=Marco |last8=Boleda |first8=Gemma |last9=Fernández |first9=Raquel|doi=10.5281/zenodo.2630551 }} | | |Information about this dataset's format is available in the [https://huggingface.co/datasets/lambada Hugging Face dataset card] and the [https://zenodo.org/record/2630551#.Y7uPquzMKNi project's website]. The dataset can be downloaded [https://zenodo.org/record/2630551/files/lambada-dataset.tar.gz here], and the rejected data [https://zenodo.org/record/2630551/files/rejected-data1.tar.gz here]. | |2016 |Paperno et al. |
FLAN
| |A re-preprocessed version of the FLAN dataset, with updates made since the original release, is available on [https://huggingface.co/datasets/Muennighoff/flan Hugging Face].
The scripts to process the data are available in the GitHub repository mentioned in the paper: https://github.com/google-research/FLAN/tree/main/flan. Another [https://github.com/Muennighoff/FLAN FLAN GitHub repo] was also created; it is the one associated with the Hugging Face dataset card. | | | |2021 |Wei et al. |
Cybersecurity
{{sort-under}}
class="wikitable sortable sort-under"
!Dataset Name !Brief description !Preprocessing !Instances !Format !Default Task !Created (updated) !Reference !Creator |
MITRE ATT&CK
|MITRE ATT&CK is a globally accessible knowledge base of adversary tactics and techniques. | | |Data can be downloaded from these two GitHub repositories: [https://github.com/mitre-attack/attack-stix-data/archive/refs/heads/master.zip version 2.1] and [https://github.com/mitre/cti/archive/refs/heads/master.zip version 2.0] | | |MITRE ATT&CK |
CAPEC
|Common Attack Pattern Enumeration and Classification | | |Data can be downloaded from [https://capec.mitre.org/data/archive/capec_latest.zip CAPEC's website]: [https://capec.mitre.org/data/csv/1000.csv.zip Mechanisms of Attack] [https://capec.mitre.org/data/csv/3000.csv.zip Domains of Attack] | | |CAPEC |
CVE
|CVE is a list of publicly disclosed cybersecurity vulnerabilities that is free to search, use, and incorporate into products and services. | | |Data can be downloaded from: [https://cve.mitre.org/data/downloads/allitems.csv Allitems] | | |CVE |
CWE
|Common Weakness Enumeration data. | | |Data can be downloaded from: [https://cwe.mitre.org/data/csv/699.csv.zip Software Development] [https://cwe.mitre.org/1194.csv.zip Hardware Design]{{Dead link|date=September 2023 |bot=InternetArchiveBot |fix-attempted=yes }}[https://cwe.mitre.org/data/csv/1000.csv.zip Research Concepts] | | |CWE |
MalwareTextDB
|Annotated database of malware texts. | | |The [https://github.com/statnlp-research/statnlp-datasets/tree/master/dataset GitHub repository of the project] contains the data to download. | | |Lim et al. |
USENIX Security Symposium proceedings
|Collection of security proceedings from USENIX Security Symposium – technical sessions from 1995 to 2022. |This data is not pre-processed. | |[https://www.usenix.org/legacy/publications/library/proceedings/security95/index.html 1995], [https://www.usenix.org/legacy/publications/library/proceedings/sec96/ 1996], [https://www.usenix.org/legacy/publications/library/proceedings/ana97/technical.html 1997], [https://www.usenix.org/legacy/publications/library/proceedings/sec98/technical.html 1998], [https://www.usenix.org/legacy/events/sec99/technical.html 1999], [https://www.usenix.org/legacy/events/sec2000/tech.html 2000], [https://www.usenix.org/legacy/events/sec2001/tech.html 2001], [https://www.usenix.org/legacy/publications/library/proceedings/sec02/tech.html 2002], [https://www.usenix.org/legacy/publications/library/proceedings/sec03/tech.html 2003], [https://www.usenix.org/legacy/events/sec04/tech/ 2004], [https://www.usenix.org/legacy/events/sec05/tech/ 2005], [https://www.usenix.org/legacy/events/sec06/tech/ 2006], [https://www.usenix.org/legacy/events/sec07/tech/ 2007], [https://www.usenix.org/legacy/events/sec08/tech/#wed 2008], [https://www.usenix.org/legacy/events/sec09/tech/ 2009], [https://www.usenix.org/legacy/events/sec10/tech/ 2010] [https://static.usenix.org/event/sec11/tech/ 2011], [https://www.usenix.org/conference/usenixsecurity12/technical-sessions 2012], [https://www.usenix.org/conference/usenixsecurity13/technical-sessions 2013], [https://www.usenix.org/conference/usenixsecurity14/technical-sessions 2014], [https://www.usenix.org/conference/usenixsecurity15/technical-sessions 2015], [https://www.usenix.org/conference/usenixsecurity16/technical-sessions 2016], [https://www.usenix.org/conference/usenixsecurity17/technical-sessions 2017], [https://www.usenix.org/conference/usenixsecurity18/technical-sessions 2018], [https://www.usenix.org/conference/usenixsecurity19/technical-sessions 2019], 
[https://www.usenix.org/conference/usenixsecurity20/technical-sessions 2020], [https://www.usenix.org/conference/usenixsecurity21/technical-sessions 2021], [https://www.usenix.org/conference/usenixsecurity22/technical-sessions 2022]. | | |USENIX Security Symposium |
APTNotes
|Collection of public documents, whitepapers and articles about APT campaigns. All the documents are publicly available data. |This data is not pre-processed. | |The [https://github.com/aptnotes/data GitHub repository] of the project contains a file with links to the data stored in Box. Data files can also be downloaded [https://github.com/ameza13/APTNotesData/ here]. | | |APT Notes |
arXiv Cryptography and Security papers
|Collection of articles about cybersecurity |This data is not pre-processed. | |All articles available [https://github.com/ameza13/Cryptography-and-Security here]. | | |arXiv |
Security eBooks for free
|Small collection of publicly available security eBooks and security presentations. |This data is not pre-processed. | | | | |{{Cite web |title=Holistic Info-Sec for Web Developers - Fascicle 0 |url=https://f0.holisticinfosecforwebdevelopers.com/ |access-date=2023-01-20 |website=f0.holisticinfosecforwebdevelopers.com}} | |
National Cyber Security strategy repository
|Repository of worldwide strategy documents about cybersecurity. |This data is not pre-processed. | | | | | |
Cyber Security Natural Language Processing
|Data about cybersecurity strategies from more than 75 countries. |Tokenization and removal of frequent, uninformative words. | | | | |Yanlin Chen, Yunjian Wei, Yifan Yu, Wen Xue, Xianya Qin |
APT Reports collection
|Sample of APT reports, malware, technology, and intelligence collection |Raw and tokenized data available. | |All data is available in this [https://github.com/blackorbird/APT_REPORT GitHub] repository. | | |{{Citation needed| reason=deleted twitter link|date=October 2023}} |blackorbird |
Offensive Language Identification Dataset (OLID)
| | | |Data available in the [https://sites.google.com/site/offensevalsharedtask/olid project's website]. Data is also available [https://github.com/ameza13/OLIDdataset here]. | | |Zampieri et al. |
Cyber reports from the National Cyber Security Centre
| |This data is not pre-processed. | |[https://www.ncsc.gov.uk/section/keep-up-to-date/threat-reports Threat reports], [https://www.ncsc.gov.uk/section/keep-up-to-date/reports-advisories reports and advisories], [https://www.ncsc.gov.uk/section/keep-up-to-date/ncsc-news news], [https://www.ncsc.gov.uk/section/keep-up-to-date/ncsc-blog blog posts], [https://www.ncsc.gov.uk/section/keep-up-to-date/all-speeches speeches]. [https://github.com/bee3202/cybersecurity-reports-ncsc Alternate list of reports]. | | | |
APT reports by Kaspersky
| |This data is not pre-processed. | | | | | |
The cyberwire
| |This data is not pre-processed. | |[https://thecyberwire.com/newsletters Newsletters], [https://thecyberwire.com/podcasts podcasts], and [https://thecyberwire.com/stories stories]. | | | |
Databreaches news
| |This data is not pre-processed. | |[https://www.databreaches.net/news/ News], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/DATABREACHES.md list of news from Aug 2022 to Feb 2023] | | | |
Cybernews
| |This data is not pre-processed. | |[https://cybernews.com/news/ News], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/CYBERNEWS.md curated list of news] | | |{{Cite web |title=Cybernews |url=https://cybernews.com/ |website=Cybernews}} | |
Bleepingcomputer
| |This data is not pre-processed. | |[https://www.bleepingcomputer.com/ News] | | | |
Therecord
| |This data is not pre-processed. | |[https://therecord.media/news/cybercrime/ Cybercrime news] | | | |
Hackread
| |This data is not pre-processed. | |[https://www.hackread.com/hacking-news/ Hacking news] | | | |
Securelist
| |This data is not pre-processed. | |[https://securelist.com/category/apt-reports/ APT reports], [https://securelist.com/category/archive/ archive], [https://securelist.com/category/ddos-reports/ DDOS reports], [https://securelist.com/category/incidents/ incidents], [https://securelist.com/category/kaspersky-security-bulletin/ Kaspersky security bulletin], [https://securelist.com/category/industrial-threats/ industrial threats], [https://securelist.com/category/malware-reports/ malware-reports], [https://securelist.com/category/opinions/ opinions], [https://securelist.com/category/publications/ publications], [https://securelist.com/category/research/ research], and [https://securelist.com/category/sas/ SAS]. | | | |
Stucco project
|The Stucco project collects data not typically integrated into security systems. |This data is not pre-processed | |[https://stucco.github.io/data/ Project's website with data information], [https://github.com/bee3202/cybersecurity-data-sources Reviewed source with links to data sources] | | | |
Farsightsecurity
|Website with technical information, reports, and more about security topics. |This data is not pre-processed | |[https://www.farsightsecurity.com/technical/ Technical information], [https://www.farsightsecurity.com/research/ research], [https://www.farsightsecurity.com/reports/ reports]. | | | |
Schneier
|Website with academic papers about security topics. |This data is not pre-processed | |[https://www.schneier.com/academic/ Papers per category], [https://www.schneier.com/academic/archive/ papers archive by date]. | | | |
Trendmicro
|Website with research, news, and perspectives about security topics. |This data is not pre-processed | |[https://github.com/bee3202/cybersecurity-data-sources/blob/main/TRENDMICRO.md Reviewed list of Trendmicro research, news, and perspectives]. | | | |
The Hacker News
|News about cybersecurity topics. |This data is not pre-processed | |[https://thehackernews.com/search/label/data%20breach data breaches], [https://thehackernews.com/search/label/Cyber%20Attack cyberattacks], [https://thehackernews.com/search/label/Vulnerability vulnerabilities], [https://thehackernews.com/search/label/Malware malware news]. | | | |
Krebsonsecurity
|Security news and investigation |This data is not pre-processed | |[https://github.com/bee3202/cybersecurity-data-sources/blob/main/krebsonsecurity.md curated list of news] | | | |
MITRE D3FEND
|Matrix of D3FEND artifacts | | |JSON files | | | |
MITRE ATLAS
|MITRE ATLAS is a knowledge base of adversary tactics, techniques, and case studies for machine learning (ML) systems based on real-world observations. |This data is not pre-processed | | | | | |
MITRE Engage
|MITRE Engage is a framework for planning and discussing adversary engagement operations that empowers you to engage your adversaries and achieve your cybersecurity goals. |This data is not pre-processed | | | | | |
Hacking Tutorials
| |This data is not pre-processed | | | | | |
Climate and sustainability
class="wikitable sortable"
!Dataset Name !Brief description !Preprocessing !Instances !Format !Default Task !Created (updated) !Reference !Creator |
TCFD reports
|Database of company reports that include TCFD-related disclosures. |This data is not pre-processed | |[https://www.tcfdhub.org/reports Direct link to reports], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/TCFDreports.md Curated list of reports] | | |TCFD Knowledge Hub |
Corporate Social Responsibility Reports
|A listing of responsibility reports on the internet. |This data is not pre-processed | |[https://github.com/bee3202/cybersecurity-data-sources/blob/main/RESPONSABILITYREPORTS.md Curated list of reports] | | |ResponsibilityReports |
The Intergovernmental Panel on Climate Change (IPCC)
|A collection of comprehensive assessment reports about knowledge on climate change, its causes, potential impacts and response options |This data is not pre-processed | |[https://www.ipcc.ch/reports/ Reports], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/IPCC.md Curated list of reports] | | |{{Cite web |title=About — IPCC |url=https://www.ipcc.ch/about/ |access-date=2023-02-20}} |IPCC |
Alliance for Research on Corporate Sustainability
| |This data is not pre-processed | |[https://github.com/bee3202/cybersecurity-data-sources/blob/main/arcs.md Curated list of blog posts] | | |ARCS |
ESG corpus: Knowledge Hub of the Accounting for Sustainability
| |This data is not pre-processed | |[https://www.accountingforsustainability.org/content/a4s/corporate/en/knowledge-hub.html?tab1=guides Guides], [https://www.accountingforsustainability.org/content/a4s/corporate/en/knowledge-hub.html?tab1=case-studies case studies], [https://www.accountingforsustainability.org/content/a4s/corporate/en/knowledge-hub.html?tab1=blogs blogs], and [https://www.accountingforsustainability.org/content/a4s/corporate/en/knowledge-hub.html?tab1=reports reports & surveys]. | | |Mehra et al. |
CLIMATE-FEVER
|A dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate change collected on the internet. |Each claim is accompanied by five manually annotated evidence sentences retrieved from the English Wikipedia that support, refute or do not give enough information to validate the claim, totalling 7,675 claim-evidence pairs.{{Creative Commons text attribution notice|url=https://www.tensorflow.org/datasets/community_catalog/huggingface/climate_fever|cc=by4}} | |[https://huggingface.co/datasets/climate_fever Dataset HF card], and project's [https://github.com/tdiggelm/climate-fever-dataset GitHub repository]. | | |Diggelmann et al. |
Climate News dataset
|A dataset for NLP and climate change media researchers |The dataset is made up of a number of data artifacts (JSON, JSONL & CSV text files & SQLite database) | |[http://www.climate-news-db.com/ Climate news DB], Project's [https://github.com/ADGEfficiency/climate-news-db GitHub repository] | | |ADGEfficiency |
Climatext
|Climatext is a dataset for sentence-based climate change topic detection. | | |[https://huggingface.co/datasets/mwong/climatetext-evidence-related-evaluation/tree/main/data HF dataset] | | |University of Zurich |
GreenBiz
|Collection of articles and news about climate and sustainability |This data is not pre-processed | |[https://github.com/bee3202/cybersecurity-data-sources/blob/main/climate-tech.md Curated list of climate articles], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/sustainability-strategy.md Curated list of sustainability articles] | | | |
Top research pre-prints in climate and sustainability
|List of pre-prints from researchers on the Reuters Hot List |This data is not pre-processed | |[https://github.com/bee3202/climate/blob/main/preprints_app_dimentions_ai.md Curated list of pre-prints] | | |Maurice Tamman |
ARCS
| |This data is not pre-processed | |[https://github.com/bee3202/climate/blob/main/arcs.md Curated list of corporate sustainability blogs] | | | |
GreenBiz
|Website with articles about climate and sustainability |This data is not pre-processed | | | | |GreenBiz |
CSRWIRE
| |This data is not pre-processed | |[https://github.com/bee3202/climate/blob/main/csrwire_all.md Curated list of articles] | | |CSRWIRE |
CDP
|Articles about [https://www.cdp.net/en/climate climate], [https://www.cdp.net/en/water water], and [https://www.cdp.net/en/forests forests] |This data is not pre-processed | | | | |CDP |
Code data
class="wikitable sortable"
!Dataset Name !Brief description !Preprocessing !Instances !Format !Default Task !Created (updated) !Reference !Creator |
The Stack
|A 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. |Filtered through license detection and deduplication. |6 TB, 51.76B files (prior to deduplication); 3 TB, 5.28B files (after). 358 programming languages. |Parquet |Language modeling, autocompletion, program synthesis. |2022 |{{cite arXiv |last1=de Vries |first1=Harm |title=The Stack: 3 TB of permissively licensed source code |date=2022 |class=cs.CL |eprint=2211.15533 }}{{cite web |title=The Stack Dedup |url=https://huggingface.co/datasets/bigcode/the-stack-dedup |website=Huggingface |access-date=29 August 2023}} |D. Kocetkov, R. Li, L. Ben Allal, L. von Werra, H. de Vries |
LEMUR Neural Network Dataset
|A structured repository of standardized neural network models designed to facilitate AutoML tasks and model analysis with LLMs |Filtered through license detection and deduplication. |PyTorch models. |Python scripts. |Image classification, object detection, image segmentation, and natural language processing. |2024 |A. Goodarzi, R. Kochnev, W. Khalid, F. Qin, T. Uzun, Y. Dhameliya, Y. Kathiriya, Z. Bentyn, D. Ignatov, R. Timofte |
GitHub repositories
| |This data is not pre-processed | |Curated list of repositories from GitHub: [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.61.md 61], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.62.md 62], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.63.md 63], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.64.md 64], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.65.md 65], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.66.md 66], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.67.md 67], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.68.md 68], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.69.md 69], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.70.md 70], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.71.md 71], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.72.md 72], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.73.md 73], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.74.md 74], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.75.md 75], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.76.md 76], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others.77.md 77], [https://github.com/bee3202/cybersecurity-data-sources/blob/main/git_others_main.md 101] | | | | |
IBM Public GitHub repositories
| |This data is not pre-processed | |[https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODEPUBLIC.md Curated list of repositories] from GitHub | | | | |
Red Hat Public GitHub repositories
| |This data is not pre-processed | |[https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODERHPUBLIC.md Curated list of repositories] from GitHub | | | | |
StackExchange Public Archive.org files
| |This data is not pre-processed | |[https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODESEPPUBLIC.md Curated list of files] from [https://archive.org/ Archive.org] | | | | |
GitLab Public repositories
| |This data is not pre-processed | |Curated list of repositories from GitLab: [https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODELABPUBLIC.md 1] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/CODELABPUBLIC2.md 2] | | | | |
Ansible Collections public repositories
| |This data is not pre-processed | |[https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEANSIBLEPUBLIC.0.md Curated list of repositories] from GitHub. | | | | |
CodeParrot GitHub Code Dataset
| |This data is not pre-processed | |Curated list of repositories from Hugging Face: [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/GITHUBCODE_RAWPUBLIC.md 1] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/GITHUBCODE_CLEANPUBLIC.md 2] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINV2NEARDEDUP.md 3] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINV2NEARDEDUPVALID.md 4] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINNEARDEDUPLICATIONPUBLIC.md 5] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINNEARDEDUPLICATIONPUBLICVALID.md 6] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINMOREPUBLIC.md 7] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_TRAINMOREVALIDPUBLIC.md 8] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_CLEANTRAINPUBLIC.md 9] [https://github.com/bee3202/cybersecurity-data-sources/blob/main/code/CODEPARROT_CLEANVALIDPUBLIC.md 10] | | | | |
OKD
|The Community Distribution of Kubernetes that powers Red Hat OpenShift |This data is not pre-processed | |[https://github.com/orgs/okd-project/repositories List of GitHub repositories of the project] | | | | |
OpenShift
|The developer and operations friendly Kubernetes distro | | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_openshift.md List of GitHub repositories of the project] | | | | |
Kubernetes
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_kubernetes.md List of GitHub repositories of the project] | | | | |
Red Hat Developer
|GitHub home of the Red Hat Developer program |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_developer.md List of GitHub repositories of the project] | | | | |
Red Hat Workshops
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_workshops.md List of GitHub repositories of the project] | | | | |
Kubernetes SIGs
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_kubernetes_sigs.md List of GitHub repositories of the project] | | | | |
Konveyor
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_konveyor.md List of GitHub repositories of the project] | | | | |
Red Hat Marketplace
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_marketplace.md List of GitHub repositories of the project] | | | | |
Red Hat blog
| |This data is not pre-processed | | | | | |
Kubernetes io
| |This data is not pre-processed | | | | | |
Docs OpenShift
| |This data is not pre-processed | | | | | |
cncf io
| |This data is not pre-processed | | | | | |
Kubernetes presentations
|List of publicly available Kubernetes presentations |This data is not pre-processed | |[https://github.com/bee3202/kubernetes_presentations/archive/refs/heads/main.zip data link] | | | |
Red Hat Open Innovation Labs
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_open_innovation_labs.md List of GitHub repositories of the project] | | | | |
Red Hat Demos
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_RedHatDemos.md List of GitHub repositories of the project] | | | | |
Red Hat OpenShift Online
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_openshift-online.md List of GitHub repositories of the project] | | | | |
Software Collections
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_software_collections.md List of GitHub repositories of the project] | | | | |
Red Hat Insights
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_insights.md List of GitHub repositories of the project] | | | | |
Red Hat Government
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_government.md List of GitHub repositories of the project] | | | | |
Red Hat Consulting
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_consulting.md List of GitHub repositories of the project] | | | | |
Red Hat Communities of Practice
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_communities_of_practice.md List of GitHub repositories of the project] | | | | |
Red Hat Partner Tech
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_partner_tech.md List of GitHub repositories of the project] | | | | |
Red Hat Documentation
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_documentation.md List of GitHub repositories of the project] | | | | |
IBM
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_IBM.md List of GitHub repositories of the project] | | | | |
IBM Cloud
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_IBM_cloud.md List of GitHub repositories of the project] | | | | |
Build Lab Team
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_build_lab_team.md List of GitHub repositories of the project] | | | | |
Terraform IBM Modules
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_terraform-ibm-modules.md List of GitHub repositories of the project] | | | | |
Cloud Schematics
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_Cloud-Schematics.md List of GitHub repositories of the project] | | | | |
OCP Power Demos
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_ocp-power-demos.md List of GitHub repositories of the project] | | | | |
IBM App Modernization
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_IBMAppModernization.md List of GitHub repositories of the project] | | | | |
Kubernetes OperatorHub
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_k8s-operatorhub.md List of GitHub repositories of the project] | | | | |
Cloud Native Computing Foundation (CNCF)
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_cncf.md List of GitHub repositories of the project] | | | | |
Operator Framework
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_operator-framework.md List of GitHub repositories of the project] | | | |
GitHub repositories referenced in artifacthub.io
| |This data is not pre-processed | |[https://github.com/bee3202/artifacthub_packages/blob/main/artifacthub_git_repos.md List of GitHub repositories in artifacthub.io] | | | | |
Red Hat Communities of Practice
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/pages_redhat_cop.md List of GitHub repositories of the project] | | | | |
Red Hat partner
| |This data is not pre-processed | |[https://github.com/redhat-partner-tech?tab=repositories List of GitHub repositories of the project] | | | | |
IBM Repositories
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/p_ibm_repositories.md List of GitHub repositories for the project] | | | | |
Build Lab Team
| |This data is not pre-processed | |[https://github.com/orgs/ibm-build-lab/repositories List of GitHub repositories for the project] | | | | |
Operator Framework
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/operator_framework.md List of GitHub repositories for the project] | | | | |
GitHub repositories
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/individual_git_repos.md List of GitHub repositories for the project] | | | | |
Red Hat
| |This data is not pre-processed | |[https://www.redhat.com/en List of GitHub repositories of the project] | | | | |
Kubernetes Patterns
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/kubernetes-patterns.md List of GitHub repositories of the project] | | | | |
Kubernetes Deployment & Security Patterns
| |This data is not pre-processed | |[https://resources.linuxfoundation.org/LF+Projects/CNCF/TheNewStack_Book2_KubernetesDeploymentAndSecurityPatterns.pdf List of GitHub repositories of the project] | | | | |
Kubernetes for Full-Stack Developers
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/kubernetes-fs-developers.md List of GitHub repositories of the project] | | | | |
Load Balancer Cloudwatch Metrics
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/load-balancer-cloudwatch-metrics.md GitHub repository of the project] | | | | |
Dynatrace
| |This data is not pre-processed | |[https://docs.dynatrace.com/docs/observe-and-explore/metrics-classic/built-in-metrics Built-in metrics documentation] | | | | |
AIOps Challenge 2020 Data
| |This data is not pre-processed | |[https://github.com/NetManAIOps/AIOps-Challenge-2020-Data GitHub repository of the project] | | | | |
Loghub
| |This data is not pre-processed | |[https://github.com/logpai/loghub List of repositories] | | | | |
HTML Pages
| |This data is not pre-processed | |[https://github.com/bee3202/open-shift-repos/blob/main/html_pages.md List of HTML pages] | | | | |
OpenShift ebooks
| |This data is not pre-processed | | | | | |
Kubernetes ebooks
| |This data is not pre-processed | |[https://www.redhat.com/rhdc/managed-files/cm-oreilly-kubernetes-patterns-ebook-f19824-201910-en_1.pdf Kubernetes Patterns], [https://resources.linuxfoundation.org/LF+Projects/CNCF/TheNewStack_Book2_KubernetesDeploymentAndSecurityPatterns.pdf Kubernetes Deployment], [https://assets.digitalocean.com/books/kubernetes-for-full-stack-developers.pdf Kubernetes for Full-Stack Developers] | | | | |
Kubernetes for Full-Stack Developers
| |This data is not pre-processed | |[https://assets.digitalocean.com/books/kubernetes-for-full-stack-developers.pdf Kubernetes for Full-Stack Developers] | | | | |
List of public and licensed GitHub repositories
| |This data is not pre-processed | |[https://github.com/bee3202/code_dataset/tree/main/licensed_batch_A List of repositories] | | | | |
Multivariate data
= Financial =
= Weather =
! Dataset Name !! Brief description !! Preprocessing !! Instances !! Format !! Default Task !! Created (updated) !! Reference !! Creator
Cloud DataSet
|Data about 1024 different clouds. |Image features extracted. |1024 |Text |Classification, clustering |1989 |P. Collard |
El Nino Dataset
|Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. |12 weather attributes are measured at each buoy. |178080 |Text |Regression |1999 |
Greenhouse Gas Observing Network Dataset
|Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather. |None. |2921 |Text |Regression |2015 |D. Lucas |
Atmospheric {{CO2}} from Continuous Air Samples at Mauna Loa Observatory
|Continuous air samples in Hawaii, USA. 44 years of records. |None. |44 years |Text |Regression |2001 |
Ionosphere Dataset
|Radar data from the ionosphere. Task is to classify into good and bad radar returns. |Many radar features given. |351 |Text |Classification |1989 |Sigillito, Vincent G., et al. "Classification of radar returns from the ionosphere using neural networks." Johns Hopkins APL Technical Digest 10.3 (1989): 262–266. |
Ozone Level Detection Dataset
|Two ground ozone level datasets. |Many features given, including weather conditions at time of measurement. |2536 |Text |Classification |2008 |{{cite journal |last1=Zhang |first1=Kun |last2=Fan |first2=Wei |title=Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond |journal=Knowledge and Information Systems |date=March 2008 |volume=14 |issue=3 |pages=299–326 |doi=10.1007/s10115-007-0095-1 }}{{cite journal |last1=Reich |first1=Brian J. |last2=Fuentes |first2=Montserrat |last3=Dunson |first3=David B. |title=Bayesian Spatial Quantile Regression |journal=Journal of the American Statistical Association |date=March 2011 |volume=106 |issue=493 |pages=6–20 |doi=10.1198/jasa.2010.ap09237 |pmid=23459794 |pmc=3583387 }} |K. Zhang et al. |
= Census =
! Dataset Name !! Brief description !! Preprocessing !! Instances !! Format !! Default Task !! Created (updated) !! Reference !! Creator
Adult Dataset
|Census data from 1994 containing demographic features of adults and their income. |Cleaned and anonymized. |48,842 |Comma separated values |Classification |1996 |United States Census Bureau |
Census-Income (KDD)
|Weighted census data from the 1994 and 1995 Current Population Surveys. |Split into training and test sets. |299,285 |Comma separated values |Classification |2000 |Oza, Nikunj C., and Stuart Russell. "Experimental comparisons of online and batch versions of bagging and boosting." Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001.{{cite journal |last1=Bay |first1=Stephen D. |title=Multivariate Discretization for Set Mining |journal=Knowledge and Information Systems |date=November 2001 |volume=3 |issue=4 |pages=491–512 |doi=10.1007/pl00011680 }} |
IPUMS Census Database
|Census data from the Los Angeles and Long Beach areas. |None |256,932 |Text |Classification, regression |1999 |
US Census Data 1990
|Partial data from 1990 US census. |Results randomized and useful attributes selected. |2,458,285 |Text |Classification, regression |1990 |
= Transit =
! Dataset Name !! Brief description !! Preprocessing !! Instances !! Format !! Default Task !! Created (updated) !! Reference !! Creator
Bike Sharing Dataset
|Hourly and daily count of rental bikes in a large city. |Many features, including weather, length of trip, etc., are given. |17,389 |Text |Regression |2013 |{{cite journal | last1 = Fanaee-T | first1 = Hadi | last2 = Gama | first2 = Joao | year = 2013| title = Event labeling combining ensemble detectors and background knowledge | url = http://repositorio.inesctec.pt/handle/123456789/3506| journal = Progress in Artificial Intelligence | volume = 2 | issue = 2–3| pages = 113–127 | doi = 10.1007/s13748-013-0040-3 | s2cid = 3345087 }}{{cite book |doi=10.1109/CIVTS.2014.7009473 |chapter=Predicting bikeshare system usage up to one day ahead |title=2014 IEEE Symposium on Computational Intelligence in Vehicles and Transportation Systems (CIVTS) |date=2014 |last1=Giot |first1=Romain |last2=Cherrier |first2=Raphael |pages=22–29 |isbn=978-1-4799-4497-2 |url=https://hal.archives-ouvertes.fr/hal-01065983/file/paper_final.pdf }} |H. Fanaee-T |
New York City Taxi Trip Data
|Trip data for yellow and green taxis in New York City. |Gives pick-up and drop-off locations, fares, and other details of trips. |6 years |Text |Classification, clustering |2015 |
Taxi Service Trajectory ECML PKDD
|Trajectories of all taxis in a large city. |Many features given, including start and stop points. |1,710,671 |Text |Clustering, causal-discovery |2015 |{{cite journal | last1 = Moreira-Matias | first1 = Luis | display-authors = et al | year = 2013 | title = Predicting taxi–passenger demand using streaming data | url = http://repositorio.inesctec.pt/handle/123456789/5356| journal = IEEE Transactions on Intelligent Transportation Systems| volume = 14 | issue = 3| pages = 1393–1402 | doi=10.1109/tits.2013.2262376| s2cid = 14764358 }}{{cite journal | last1 = Hwang | first1 = Ren-Hung | last2 = Hsueh | first2 = Yu-Ling | last3 = Chen | first3 = Yu-Ting | year = 2015 | title = An effective taxi recommender system based on a spatio-temporal factor analysis model | journal = Information Sciences | volume = 314 | pages = 28–40 | doi=10.1016/j.ins.2015.03.068}} |M. Ferreira et al. |
METR-LA
|Speeds from loop detectors on the highways of Los Angeles County. |Average speed in 5-minute timesteps. |7,094,304 (207 sensors × 34,272 timesteps) |Comma separated values |Regression, Forecasting |2014 |H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi. Big data and its technical challenges. Commun. ACM, 57(7):86–94, July 2014. |Jagadish et al. |
PeMS
|Speed, flow, occupancy and other metrics from loop detectors and other sensors on the freeways of the State of California, U.S.A. |Metrics usually aggregated by averaging into 5-minute timesteps. |39,000 individual detectors, each containing years of time series |Comma separated values |Regression, Forecasting, Nowcasting, Interpolation |(updated in real time) |[http://pems.dot.ca.gov/ Caltrans PeMS] |California Department of Transportation |
= Internet =
= Games =
! Dataset Name !! Brief description !! Preprocessing !! Instances !! Format !! Default Task !! Created (updated) !! Reference !! Creator
Poker Hand Dataset
|Five-card hands from a standard 52-card deck. |Attributes of each hand are given, including the poker hand formed by the cards it contains. |1,025,010 |Text |Regression, classification |2007 |R. Cattral |
Connect-4 Dataset
|Contains all legal 8-ply positions in the game of Connect-4 in which neither player has won yet, and in which the next move is not forced. |None. |67,557 |Text |Classification |1995 |J. Tromp |
Chess (King-Rook vs. King) Dataset
|Endgame Database for White King and Rook against Black King. |None. |28,056 |Text |Classification |1994 |{{cite book |doi=10.1093/oso/9780198538509.003.0012 |chapter=Learning Optimal Chess Strategies |title=Machine Intelligence 13 |date=1994 |last1=Bain |first1=M. |last2=Muggleton |first2=S. |pages=291–309 |isbn=978-0-19-853850-9 }}{{cite book |doi=10.1007/978-3-662-12405-5_15 |chapter=Learning Efficient Classification Procedures and Their Application to Chess End Games |title=Machine Learning |date=1983 |last1=Quinlan |first1=J. Ross |pages=463–482 |isbn=978-3-662-12407-9 }} |M. Bain et al. |
Chess (King-Rook vs. King-Pawn) Dataset
|King+Rook versus King+Pawn on a7. |None. |3196 |Text |Classification |1989 |R. Holte |
Tic-Tac-Toe Endgame Dataset
|Binary classification for win conditions in tic-tac-toe. |None. |958 |Text |Classification |1991 |D. Aha |
= Other multivariate =
Curated repositories of datasets
Because datasets come in myriad formats and can be difficult to use, considerable work has gone into curating them and standardizing their formats so that they are easier to use in machine learning research.
- OpenML:{{cite journal | vauthors = Vanschoren J, van Rijn JN, Bischl B, Torgo L | year = 2013 | title = OpenML: networked science in machine learning | journal = SIGKDD Explorations | volume = 15 | issue = 2 | pages = 49–60 | doi = 10.1145/2641190.2641198 | arxiv = 1407.7722 | s2cid = 4977460 }} Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
- PMLB:{{cite journal | vauthors = Olson RS, La Cava W, Orzechowski P, Urbanowicz RJ, Moore JH | year = 2017 | title = PMLB: a large benchmark suite for machine learning evaluation and comparison | journal = BioData Mining | volume = 10 | issue = 1 | pages = 36 | doi = 10.1186/s13040-017-0154-4 | pmid = 29238404 | pmc = 5725843 | bibcode = 2017arXiv170300512O | arxiv = 1703.00512 | doi-access = free }} A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
- Metatext NLP: https://metatext.io/datasets Web repository maintained by the community, containing nearly 1,000 benchmark datasets. Covers many tasks, from classification to question answering (QA), in languages including English, Portuguese, and Arabic.
- Appen: Off-the-shelf and open-source datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.{{cite web |title=Off The Shelf Datasets |url=https://appen.com/off-the-shelf-datasets/ |website=appen.com |publisher=Appen |access-date=30 December 2020}}{{cite web |title=Open Source Datasets |url=https://appen.com/resources/datasets/ |website=appen.com |publisher=Appen |access-date=30 December 2020}}