Discretization of continuous features
In statistics and machine learning, discretization is the process of converting continuous attributes, features, or variables into discrete or nominal ones by partitioning their range into intervals. This can be useful when creating probability mass functions (formally, in density estimation). It is a form of discretization in general and also of binning, as in making a histogram. Whenever continuous data is discretized, some discretization error is introduced. The goal is to reduce that error to a level considered negligible for the modeling purposes at hand.
Typically, data is discretized into K partitions of equal width (equal intervals) or into partitions that each contain K% of the total data (equal frequencies).{{Cite journal
| last1 = Clarke | first1 = E. J.
| last2 = Barton | first2 = B. A.
| doi = 10.1002/(SICI)1098-111X(200001)15:1<61::AID-INT4>3.0.CO;2-O
| title = Entropy and MDL discretization of continuous variables for Bayesian belief networks
| journal = International Journal of Intelligent Systems
| volume = 15
| pages = 61–92
| year = 2000
| pmid =
| pmc =
| url=http://sci2s.ugr.es/keel/pdf/specific/articulo/IJIS00.pdf |accessdate=2008-07-10
}}
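The two unsupervised schemes above can be illustrated with a minimal Python sketch. The function names (`equal_width_edges`, `equal_frequency_edges`, `assign_bins`) and the sample data are hypothetical, and the equal-frequency variant here assumes k bins that each hold roughly 1/k of the data, with cut points taken directly from the sorted values.

```python
from bisect import bisect_right

def equal_width_edges(values, k):
    """Return the k-1 interior cut points splitting [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_edges(values, k):
    """Return k-1 interior cut points so each bin holds roughly len(values)/k points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def assign_bins(values, edges):
    """Map each value to a bin index in 0..len(edges); a value equal to an edge goes to the upper bin."""
    return [bisect_right(edges, v) for v in values]

# Illustrative data with one outlier (100.0).
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]
width_bins = assign_bins(data, equal_width_edges(data, 3))
freq_bins = assign_bins(data, equal_frequency_edges(data, 3))
print(width_bins)  # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(freq_bins)   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With this data the outlier stretches the equal-width bins so that eight of the nine values fall into the first bin and the middle bin stays empty, whereas the equal-frequency bins each receive three values.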
Mechanisms for discretizing continuous data include Fayyad & Irani's MDL method, which uses mutual information to recursively define the best bins.Fayyad, Usama M.; Irani, Keki B. (1993). {{cite web|hdl=2014/35171 | url = https://www.ijcai.org/Proceedings/93-2/Papers/022.pdf | title = Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning| date = 29 July 2023 }}. Proc. 13th Int. Joint Conf. on Artificial Intelligence (Q334 .I571 1993), pp. 1022–1027. Other methods include CAIM, CACC, and Ameva, among many others.Dougherty, J.; Kohavi, R.; Sahami, M. (1995). "[http://robotics.stanford.edu/users/sahami/papers-dir/disc.pdf Supervised and Unsupervised Discretization of Continuous Features]". In A. Prieditis & S. J. Russell, eds. Work. Morgan Kaufmann, pp. 194–202.
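The recursive, entropy-based idea behind the MDL method can be sketched in Python as follows. This is an illustrative reading of the approach, not a reference implementation: the function names are hypothetical, the information gain of a cut is used as the mutual information between the binary split and the class label, and the stopping rule is the MDL acceptance test as commonly stated for Fayyad & Irani's method (accept a cut only if its gain exceeds (log2(N − 1) + Δ) / N).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """pairs: (value, label) tuples sorted by value. Return (gain, index) of the best cut, or None."""
    n = len(pairs)
    labels = [y for _, y in pairs]
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point exists between two equal values
        left, right = labels[:i], labels[i:]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, i)
    return best

def mdl_accepts(pairs, gain, i):
    """MDL acceptance test (assumed form): gain must exceed (log2(N - 1) + delta) / N."""
    n = len(pairs)
    labels = [y for _, y in pairs]
    left, right = labels[:i], labels[i:]
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (log2(n - 1) + delta) / n

def mdlp_cut_points(values, labels):
    """Recursively split the sorted data, keeping only cuts that pass the MDL test."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    cuts = []

    def recurse(segment):
        found = best_split(segment) if len(segment) >= 2 else None
        if found is None:
            return
        gain, i = found
        if not mdl_accepts(segment, gain, i):
            return
        # Place the cut midway between the two neighbouring values.
        cuts.append((segment[i - 1][0] + segment[i][0]) / 2)
        recurse(segment[:i])
        recurse(segment[i:])

    recurse(pairs)
    return sorted(cuts)

separable = mdlp_cut_points([1, 2, 3, 4, 10, 11, 12, 13], ["a"] * 4 + ["b"] * 4)
mixed = mdlp_cut_points([1, 2, 3, 4], ["a", "b", "a", "b"])
print(separable)  # [7.0]
print(mixed)      # []
```

In the first call the classes are cleanly separated, so a single cut at 7.0 is accepted and the pure sub-intervals are not split further. In the second call the labels alternate, no cut passes the MDL test, and the attribute is left as one interval.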
Many machine learning algorithms are known to produce better models when continuous attributes are discretized.{{cite journal| first1=S. |last1=Kotsiantis |first2= D| last2= Kanellopoulos |title=Discretization Techniques: A recent survey|journal= GESTS International Transactions on Computer Science and Engineering |volume=32 |issue=1 |year=2006 |pages= 47–58|citeseerx = 10.1.1.109.3084}}
Software
This is a partial list of software that implements the MDL algorithm.
- [https://gforge.inria.fr/projects/discretize4crf discretize4crf] tool designed to work with popular CRF implementations (C++)
- [https://cran.r-project.org/web/packages/discretization/discretization.pdf mdlp] in the R package discretization
- [https://cran.r-project.org/web/packages/RWeka/RWeka.pdf Discretize] in the R package RWeka