Scikit-learn
{{Short description|Python library for machine learning}}
{{lowercase title}}
{{Infobox software
| name = scikit-learn
| logo = Scikit learn logo small.svg
| screenshot =
| caption =
| collapsible =
| author = David Cournapeau
| developer = Google Summer of Code project
| released = {{Start date and age|2007|06|df=yes}}
| latest release version = {{wikidata|property|reference|P348}}
| latest release date = {{start date and age|{{wikidata|qualifier|P348|P577}}}}
| latest preview version =
| latest preview date =
| programming language = Python, Cython, C and C++{{wikidata|reference|P277}}
| operating system = Linux, macOS, Windows
| platform =
| size =
| language =
| genre = Library for machine learning
| license = New BSD License
| website = {{URL|https://scikit-learn.org/}}
}}
{{Portal|Free and open-source software}}
scikit-learn (formerly scikits.learn and also known as sklearn) is a free and open-source machine learning library for the Python programming language.{{cite journal
|author1=Fabian Pedregosa
|author2=Gaël Varoquaux
|author3=Alexandre Gramfort
|author4=Vincent Michel
|author5=Bertrand Thirion
|author6=Olivier Grisel
|author7=Mathieu Blondel
|author8=Peter Prettenhofer
|author9=Ron Weiss
|author10=Vincent Dubourg
|author11=Jake Vanderplas
|author12=Alexandre Passos
|author13=David Cournapeau
|author14=Matthieu Perrot
|author15=Édouard Duchesnay
|title=scikit-learn: Machine Learning in Python
|journal=Journal of Machine Learning Research
|year=2011
|volume=12
|pages=2825–2830
|arxiv=1201.0490
|bibcode=2011JMLR...12.2825P
|url=http://jmlr.org/papers/v12/pedregosa11a.html
}}
It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is a NumFOCUS fiscally sponsored project.{{cite web|title=NumFOCUS Sponsored Projects|url=https://numfocus.org/sponsored-projects|publisher=NumFOCUS|access-date=2021-10-25}}
Overview
The scikit-learn project started as scikits.learn, a Google Summer of Code project by French data scientist David Cournapeau. The name of the project derives from its role as a "scientific toolkit for machine learning", originally developed and distributed as a third-party extension to SciPy.{{cite web
|url=https://scikits.appspot.com/scikit-learn
|title=scikit-learn
|last1=Dreijer
|first1=Janto
}} The original codebase was later rewritten by other developers.{{Who|date=March 2025}} In 2010, contributors Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, from the French Institute for Research in Computer Science and Automation in Saclay, France, took leadership of the project and released the first public version of the library on February 1, 2010.{{cite web|url=https://scikit-learn.org/stable/about.html#history|title=About us — scikit-learn 0.20.1 documentation|website=scikit-learn.org}} In November 2012, scikit-learn as well as scikit-image were described as two of the "well-maintained and popular" {{As of|2012|11|alt=scikits libraries}}.{{cite book
|author=Eli Bressert
|title=SciPy and NumPy: an overview for developers
|publisher=O'Reilly
|date=2012
|url=https://books.google.com/books?id=fLKTuJqQLVEC&pg=PA43
|page=43
|isbn=978-1-4493-6162-4
}} In 2019, GitHub reported that scikit-learn was one of the most popular machine learning libraries on its platform.{{Cite web|url=https://github.blog/2019-01-24-the-state-of-the-octoverse-machine-learning/|title=The State of the Octoverse: machine learning|date=2019-01-24|website=The GitHub Blog|publisher=GitHub|language=en-US|access-date=2019-10-17}}
Features
- Large catalogue of well-established machine learning algorithms and data pre-processing methods (e.g. for feature engineering)
- Utility functions for common data-science tasks, such as splitting data into training and test sets, cross-validation and grid search
- Consistent interface for training and applying machine learning models ({{code|estimator.fit()|python}} and {{code|estimator.predict()|python}}), which third-party libraries can also implement
- Declarative way of structuring a data science process (the {{Code|Pipeline|Python}}), including data pre-processing and model fitting
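As a minimal sketch of how these features fit together, the following script splits a dataset, wraps pre-processing and a model in a {{Code|Pipeline|Python}}, and tunes one hyperparameter by cross-validated grid search. The Iris dataset, the step names {{code|"scale"|python}} and {{code|"model"|python}}, and the candidate values of {{code|C|python}} are illustrative choices, not part of the library's design.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load a small bundled dataset and hold out a quarter of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Declare pre-processing and model fitting as named pipeline steps.
# The step names are arbitrary labels chosen for this example.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Grid search with 5-fold cross-validation over the regularization strength.
# Parameters of a step are addressed as "<step name>__<parameter>".
search = GridSearchCV(pipeline, param_grid={"model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# The fitted search object exposes the same predict/score interface.
predictions = search.predict(X_test)
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Because the scaler is part of the pipeline, it is refitted on the training portion of each cross-validation fold, so the held-out fold does not influence the scaling.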
Examples
Fitting a random forest classifier:
>>> from sklearn.ensemble import RandomForestClassifier
>>> classifier = RandomForestClassifier(random_state=0)
>>> X = [[ 1,  2,  3],  # 2 samples, 3 features
...      [11, 12, 13]]
>>> y = [0, 1]  # classes of each sample
>>> classifier.fit(X, y)
RandomForestClassifier(random_state=0)
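Once fitted, the same estimator can be used for prediction. The following self-contained sketch repeats the fit above in script form and then calls {{code|predict()|python}}; the two unseen samples are made-up values chosen only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Same toy data as in the fitting example: 2 samples, 3 features.
X = [[1, 2, 3],
     [11, 12, 13]]
y = [0, 1]  # class label of each sample

classifier = RandomForestClassifier(random_state=0)
classifier.fit(X, y)

# Predict the classes of the training samples.
train_predictions = classifier.predict(X)

# Predict the classes of new, unseen samples (illustrative values).
new_predictions = classifier.predict([[4, 5, 6], [14, 15, 16]])
print(train_predictions, new_predictions)
```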
Implementation
scikit-learn is largely written in Python, and uses NumPy extensively for high-performance linear algebra and array operations. Furthermore, some core algorithms are written in Cython to improve performance. Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic regression and linear support vector machines by a similar wrapper around LIBLINEAR. In such cases, extending these methods with Python may not be possible.
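The wrapped solvers are reached through ordinary estimator classes. The sketch below assumes a recent scikit-learn release, in which {{code|SVC|python}} is backed by LIBSVM, {{code|LinearSVC|python}} by LIBLINEAR, and {{code|LogisticRegression|python}} uses LIBLINEAR only when {{code|1=solver="liblinear"|2=python}} is requested explicitly. The toy data are made up for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

# Two well-separated clusters, placed symmetrically around the origin.
X = [[-3, -3], [-3, -2], [-2, -3], [-2, -2],
     [2, 2], [2, 3], [3, 2], [3, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

estimators = {
    "SVC (LIBSVM)": SVC(),
    "LinearSVC (LIBLINEAR)": LinearSVC(),
    "LogisticRegression (LIBLINEAR)": LogisticRegression(solver="liblinear"),
}

# All three share the same fit/score interface, whatever the backend.
scores = {}
for name, estimator in estimators.items():
    estimator.fit(X, y)
    scores[name] = estimator.score(X, y)  # accuracy on the training data

print(scores)
```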
scikit-learn integrates with many other Python libraries, such as Matplotlib and Plotly for plotting, NumPy for array vectorization, pandas for data frames, and SciPy.
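The NumPy integration can be illustrated with a minimal sketch: estimators accept NumPy arrays as input and expose fitted parameters and predictions as NumPy arrays. The synthetic data below, generated from the line y = 2x + 1, are an assumption made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Build a single-feature design matrix and noise-free targets on y = 2x + 1.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

model = LinearRegression().fit(X, y)

# Fitted coefficients and predictions come back as NumPy arrays.
predictions = model.predict(np.array([[10.0], [11.0]]))
print(model.coef_, model.intercept_, predictions)
```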
History
scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later that year, Matthieu Brucher joined the project and started to use it as part of his thesis work. In 2010, INRIA, the French Institute for Research in Computer Science and Automation, became involved, and the first public release (v0.1 beta) was published on February 1, 2010.
Awards
- 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize{{Cite web |title=The 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize : scikit-learn , a success story for machine learning free software {{!}} Inria |url=https://www.inria.fr/en/2019-inria-french-academy-sciences-dassault-systemes-innovation-prize-scikit-learn-success-story |access-date=2025-03-19 |website=www.inria.fr}}
- 2022 Open Science Award for Open Source Research Software{{Cite web |last=Badolato |first=Anne-Marie |date=2022-02-07 |title=Open Science Awards for Open Source Research Software |url=https://www.ouvrirlascience.fr/open-science-free-software-award-ceremony/ |access-date=2025-03-19 |website=Ouvrir la Science |language=en}}
References
{{Reflist|30em}}
External links
- {{Official website|https://scikit-learn.org/}}
- {{GitHub|https://github.com/scikit-learn}}
{{SciPy ecosystem}}
{{differentiable computing}}
Category:Data mining and machine learning software
Category:Free statistical software