Quantification (machine learning)
{{Short description|Supervised learning task of estimating class prevalence values}}
In machine learning and data mining, quantification (variously called learning to quantify, or supervised prevalence estimation, or class prior estimation) is the task of using supervised learning in order to train models (quantifiers) that estimate the relative frequencies (also known as prevalence values) of the classes of interest in a sample of unlabelled data items.{{cite journal
|title=A review on quantification learning
|author1=Pablo González
|author2=Alberto Castaño
|author3=Nitesh Chawla
|author4=Juan José del Coz
|journal=ACM Computing Surveys
|volume=50
|number=5
|pages=74:1–74:40
|year=2017
|doi=10.1145/3117807 |hdl=10651/45313
|s2cid=38185871
|hdl-access=free
}}{{cite book
|title=Learning to Quantify
|author1=Andrea Esuli
|author2=Alessandro Fabris
|author3=Alejandro Moreo
|author4=Fabrizio Sebastiani
|series=The Information Retrieval Series
|publisher=Springer Nature
|location=Cham, CH
|year=2023
|volume=47
|doi=10.1007/978-3-031-20467-8
|isbn=978-3-031-20466-1
|url=https://link.springer.com/content/pdf/10.1007/978-3-031-20467-8.pdf
|s2cid=257560090
}}
For instance, in a sample of 100,000 unlabelled tweets known to express opinions about a certain political candidate, a quantifier may be used to estimate the percentage of these tweets which belong to class 'Positive' (i.e., which manifest a positive stance towards this candidate), and to do the same for classes 'Neutral' and 'Negative'.
Quantification may also be viewed as the task of training predictors that estimate a (discrete) probability distribution, i.e., that generate a predicted distribution that approximates the unknown true distribution of the items across the classes of interest. Quantification is different from classification, since the goal of classification is to predict the class labels of individual data items, while the goal of quantification is to predict the class prevalence values of sets of data items. Quantification is also different from regression, since in regression the training data items have real-valued labels, while in quantification the training data items have class labels.
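As an illustration of the distinction just drawn, the following minimal sketch (in Python, with invented numbers) shows the kind of output a quantifier returns for the tweet example above: a vector of estimated prevalence values for the classes of interest, rather than one label per tweet.
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical output of a trained quantifier on 100,000 unlabelled tweets:
# one prevalence estimate per class, non-negative and summing to 1.
classes = ["Positive", "Neutral", "Negative"]
estimated_prevalences = np.array([0.21, 0.48, 0.31])

assert np.isclose(estimated_prevalences.sum(), 1.0)
for c, p in zip(classes, estimated_prevalences):
    print(f"Estimated fraction of '{c}' tweets: {p:.2f}")

# By contrast, a classifier would return one label per tweet, i.e., an array
# of 100,000 values drawn from `classes`, rather than a distribution.
</syntaxhighlight>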
It has been shown in multiple research works{{cite journal
|title= Quantifying counts and costs via classification
|author1=George Forman
|journal=Data Mining and Knowledge Discovery
|volume=17
|number=2
|pages=164–206
|year=2008
|doi=10.1007/s10618-008-0097-y
|s2cid=1435935
}}{{cite book
|author1=Antonio Bella
|author2=Cèsar Ferri
|author3=José Hernández-Orallo
|author4=María José Ramírez-Quintana
|title=2010 IEEE International Conference on Data Mining
|chapter=Quantification via Probability Estimators
|pages=737–742
|year=2010
|doi=10.1109/icdm.2010.75
|isbn=978-1-4244-9131-5
|s2cid=9670485
}}
{{cite journal
|title=Quantification-oriented learning based on reliable classifiers
|author1=José Barranquero
|author2=Jorge Díez
|author3=Juan José del Coz
|journal=Pattern Recognition
|volume=48
|number=2
|pages=591–604
|year=2015
|doi=10.1016/j.patcog.2014.07.032
|bibcode=2015PatRe..48..591B
|hdl=10651/30611
|hdl-access=free
}}
{{cite journal
|title=Optimizing text quantifiers for multivariate loss functions
|author1=Andrea Esuli
|author2=Fabrizio Sebastiani
|journal=ACM Transactions on Knowledge Discovery from Data
|volume=9
|number=4
|pages=Article 27
|year=2015
|doi=10.1145/2700406
|arxiv=1502.05491
|s2cid=16824608
}}
{{cite journal
|title=From classification to quantification in tweet sentiment analysis
|author1=Wei Gao
|author2=Fabrizio Sebastiani
|journal=Social Network Analysis and Mining
|volume=6
|number=19
|pages=1–22
|year=2016
|doi=10.1007/s13278-016-0327-z
|s2cid=15631612
|url=https://ink.library.smu.edu.sg/sis_research/4547
}}
that performing quantification by classifying all unlabelled instances and then counting the instances that have been attributed to each class (the 'classify and count' method) usually leads to suboptimal quantification accuracy. This suboptimality may be seen as a direct consequence of 'Vapnik's principle', which states:
{{Blockquote|text=If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem.{{cite book
|title= Statistical learning theory
|author1=Vladimir Vapnik
|publisher=Wiley
|year=1998
|location=New York, US
}}}}
In this case, the problem to be solved directly is quantification, while the more general intermediate problem is classification. As a result of the suboptimality of the 'classify and count' method, quantification has evolved as a task in its own right, different (in goals, methods, techniques, and evaluation measures) from classification.
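The following self-contained sketch illustrates the difference between 'classify and count' and one simple correction of it, adjusted classify and count (ACC), on synthetic data under prior probability shift; the data-generating process, the choice of classifier, and the estimation of the true-positive and false-positive rates on the training set are illustrative assumptions, not part of any specific published protocol.
<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, pos_prev):
    """Synthetic 1-D data: positives ~ N(1, 1), negatives ~ N(-1, 1)."""
    y = (rng.random(n) < pos_prev).astype(int)
    x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0).reshape(-1, 1)
    return x, y

# Training set with 50% positives; test set with a shifted prevalence of 10%.
X_tr, y_tr = sample(5000, 0.5)
X_te, y_te = sample(5000, 0.1)

clf = LogisticRegression().fit(X_tr, y_tr)

# Classify and count (CC): fraction of items classified as positive.
cc = clf.predict(X_te).mean()

# Adjusted classify and count (ACC): correct CC using the classifier's
# true-positive and false-positive rates, here estimated on the training set
# (cross-validation would be the more careful choice).
pred_tr = clf.predict(X_tr)
tpr = pred_tr[y_tr == 1].mean()
fpr = pred_tr[y_tr == 0].mean()
acc = np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0)

print(f"true prevalence {y_te.mean():.3f}, CC {cc:.3f}, ACC {acc:.3f}")
</syntaxhighlight>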
Quantification tasks
The main variants of quantification, according to the characteristics of the set of classes used, are:
- Binary quantification, corresponding to the case in which there are only two classes and each data item belongs to exactly one of them;
- Single-label multiclass quantification, corresponding to the case in which there are more than two classes and each data item belongs to exactly one of them;
- Multi-label multiclass quantification, corresponding to the case in which there are two or more classes and each data item can belong to zero, one, or several classes at the same time;
- Ordinal quantification, corresponding to the single-label multiclass case in which a total order is defined on the set of classes.
- Regression quantification, a task which stands to 'standard' quantification as regression stands to classification. Strictly speaking, this task is not a quantification task as defined above (since the individual items do not have class labels but are labelled by real values), but has enough commonalities with other quantification tasks to be considered one of them.
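To make the differences among these settings concrete, the following sketch (with invented prevalence values) shows the form that the predicted class prevalences take in each case.
<syntaxhighlight lang="python">
import numpy as np

# Binary: two classes, prevalences sum to 1, so one number suffices.
binary_prev = {"Positive": 0.3, "Negative": 0.7}

# Single-label multiclass: one prevalence per class, still summing to 1,
# because every item belongs to exactly one class.
single_label_prev = {"Positive": 0.2, "Neutral": 0.5, "Negative": 0.3}
assert np.isclose(sum(single_label_prev.values()), 1.0)

# Multi-label multiclass: one prevalence per class, each in [0, 1], but the
# values need not sum to 1, because items may belong to several classes.
multi_label_prev = {"Sports": 0.4, "Politics": 0.25, "Economy": 0.45}

# Ordinal: as in the single-label case, but a total order is defined on the
# classes (e.g., 1 star < 2 stars < ... < 5 stars), which order-aware
# evaluation measures take into account.
ordinal_prev = {"1 star": 0.1, "2 stars": 0.2, "3 stars": 0.4,
                "4 stars": 0.2, "5 stars": 0.1}
</syntaxhighlight>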
Most known quantification methods address the binary case or the single-label multiclass case, and only a few of them address the multi-label, ordinal, and regression cases.
Binary-only methods include the Mixture Model (MM) method and the HDy method.{{cite journal
|author1=Víctor González-Castro
|author2=Rocío Alaiz-Rodríguez
|author3=Enrique Alegre
|title=Class distribution estimation based on the Hellinger distance
|journal=Information Sciences
|volume=218
|pages=146–164
|year=2013
|doi=10.1016/j.ins.2012.05.028
}}
Methods that can deal with both the binary case and the single-label multiclass case include probabilistic classify and count (PCC), adjusted classify and count (ACC), probabilistic adjusted classify and count (PACC), and the Saerens-Latinne-Decaestecker EM-based method (SLD).
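The following sketch illustrates two of these aggregative methods on top of the posterior probabilities returned by a generic probabilistic classifier: probabilistic classify and count (PCC), which simply averages the posteriors, and the Saerens-Latinne-Decaestecker method (SLD), which iteratively rescales them via expectation maximization; the example posteriors, training priors, and stopping criterion are illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np

def pcc(posteriors):
    """Probabilistic classify and count: average the posterior probabilities
    that the classifier assigns to each class over the unlabelled sample."""
    return posteriors.mean(axis=0)

def sld(posteriors, train_priors, n_iter=100, tol=1e-6):
    """Saerens-Latinne-Decaestecker EM: re-weight the posteriors by the ratio
    between the current prevalence estimate and the training priors, then
    re-estimate the prevalences from the re-weighted posteriors, and repeat."""
    prev = train_priors.copy()
    for _ in range(n_iter):
        weighted = posteriors * (prev / train_priors)     # E-step (unnormalized)
        weighted /= weighted.sum(axis=1, keepdims=True)   # renormalize per item
        new_prev = weighted.mean(axis=0)                  # M-step
        converged = np.max(np.abs(new_prev - prev)) < tol
        prev = new_prev
        if converged:
            break
    return prev

# Made-up posteriors for 4 items and 3 classes, with training priors (0.5, 0.3, 0.2).
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.4, 0.4, 0.2],
                       [0.1, 0.6, 0.3],
                       [0.2, 0.3, 0.5]])
train_priors = np.array([0.5, 0.3, 0.2])
print("PCC estimate:", pcc(posteriors))
print("SLD estimate:", sld(posteriors, train_priors))
</syntaxhighlight>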
Methods for multi-label quantification include regression-based quantification (RQ) and label powerset-based quantification (LPQ).{{cite journal
|author1=Alejandro Moreo
|author2=Manuel Francisco
|author3=Fabrizio Sebastiani
|title=Multi-label quantification
|journal=ACM Transactions on Knowledge Discovery from Data
|year=2024
|volume=18
|issue=1
|doi=10.1145/3606264
|arxiv=2211.08063
}}
Methods for the ordinal case include Ordinal Quantification Tree (OQT),{{cite book
|author1=Giovanni Da San Martino
|author2=Wei Gao
|author3=Fabrizio Sebastiani
|title=Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
|chapter=Ordinal Text Quantification
|date=2016
|pages=937–940
|year=2016
|doi=10.1145/2911451.2914749
|isbn=9781450340694
|s2cid=8102324
|chapter-url=https://ink.library.smu.edu.sg/sis_research/4569
}} and ordinal versions of the above-mentioned ACC, PACC, and SLD methods.{{cite journal
|author1=Mirko Bunse
|author2=Alejandro Moreo
|author3=Fabrizio Sebastiani
|author4=Martin Senz
|title=Ordinal quantification through regularization
|journal=Data Mining and Knowledge Discovery
|year=2024
|volume=38
|issue=6
|pages=4076–4121
}}
Methods for the regression case include Regress and splice and Adjusted regress and sum.{{cite journal
|author1=Antonio Bella
|author2=Cèsar Ferri
|author3=José Hernández-Orallo
|author4=María José Ramírez-Quintana
|title=Aggregative quantification for regression
|journal=Data Mining and Knowledge Discovery
|year=2014
|volume=28
|issue=2
|pages=475–518
|doi=10.1007/s10618-013-0308-z
|hdl=10251/49300
|hdl-access=free
}}
Evaluation measures for quantification
Several evaluation measures can be used for evaluating the error of a quantification method.
Since quantification consists of generating a predicted probability distribution that estimates a true probability distribution, these evaluation measures compare two probability distributions. Most evaluation measures for quantification belong to the class of divergences. Evaluation measures for binary quantification and single-label multiclass quantification are{{cite journal
|title= Evaluation measures for quantification: An axiomatic approach
|author1=Fabrizio Sebastiani
|journal=Information Retrieval Journal
|volume=23
|number=3
|pages=255–288
|year=2020
|doi=10.1007/s10791-019-09363-y
|arxiv=1809.01991
|s2cid=52170301
}}
- Absolute Error
- Squared Error
- Relative Absolute Error
- Kullback–Leibler Divergence
- Pearson Divergence
Evaluation measures for ordinal quantification are
- Normalized Match Distance (a particular case of the Earth Mover's Distance)
- Root Normalized Order-Aware Distance
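As a concrete illustration of some of the measures listed above, the following sketch computes absolute error, relative absolute error, and Kullback–Leibler divergence between a true and a predicted distribution; the prevalence vectors are invented, and the smoothing of zero prevalences that relative absolute error and Kullback–Leibler divergence typically require is omitted for brevity.
<syntaxhighlight lang="python">
import numpy as np

def absolute_error(p_true, p_pred):
    """Mean absolute difference between true and predicted prevalences."""
    return np.abs(p_true - p_pred).mean()

def relative_absolute_error(p_true, p_pred):
    """Absolute differences divided by the true prevalences (assumes strictly
    positive true prevalences; in practice, smoothing is applied)."""
    return (np.abs(p_true - p_pred) / p_true).mean()

def kl_divergence(p_true, p_pred):
    """Kullback-Leibler divergence of the predicted distribution from the
    true one (assumes strictly positive prevalences)."""
    return np.sum(p_true * np.log(p_true / p_pred))

p_true = np.array([0.30, 0.50, 0.20])
p_pred = np.array([0.35, 0.45, 0.20])
print("AE: ", absolute_error(p_true, p_pred))
print("RAE:", relative_absolute_error(p_true, p_pred))
print("KLD:", kl_divergence(p_true, p_pred))
</syntaxhighlight>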
Applications
Quantification is of special interest in fields such as the social sciences,{{cite journal
|title= A method of automated nonparametric content analysis for social science
|author1= Daniel J. Hopkins
|author2= Gary King
|journal=American Journal of Political Science
|volume=54
|number=1
|pages=229–247
|year=2010
|doi=10.1111/j.1540-5907.2009.00428.x
|s2cid= 1177676
}}
{{cite journal
|title=Verbal autopsy methods with multiple causes of death
|author1=Gary King
|author2=Ying Lu
|journal=Statistical Science
|volume=23
|number=1
|pages=78–91
|year=2008
|doi=10.1214/07-sts247
|arxiv=0808.0645
|s2cid=4084198
}}
market research, and ecological modelling,{{cite journal
|title= Validation methods for plankton image classification systems
|author1=Pablo González
|author2=Eva Álvarez
|author3=Jorge Díez
|author4=Ángel López-Urrutia
|author5=Juan J. del Coz
|journal=Limnology and Oceanography: Methods
|volume=15
|pages=221–237
|year=2017
|issue=3
|doi=10.1002/lom3.10151
|bibcode=2017LimOM..15..221G
|s2cid=59438870
|url=https://epic.awi.de/id/eprint/47173/1/Gonzalez_et_al_2017_LandOMeths.pdf
}}
since these fields are inherently concerned with aggregate data. However, quantification is also useful as a building block for solving other downstream tasks, such as improving the accuracy of classifiers on out-of-distribution data,{{cite journal
|title=Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure
|author1=Marco Saerens
|author2=Patrice Latinne
|author3=Christine Decaestecker
|journal=Neural Computation
|volume=14
|number=1
|pages=21–41
|year=2002
|doi=10.1162/089976602753284446
|pmid=11747533
|s2cid=18254013
|url=https://dipot.ulb.ac.be/dspace/bitstream/2013/68391/1/Decaestecker_NeuralComp02.pdf
}}
{{cite journal
|title=Word sense disambiguation with distribution estimation
|author1=Yee Seng Chan
|author2=Hwee Tou Ng
|journal=Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005)
|location=Edinburgh, UK
|pages=1010–1015
|year=2005
|url=https://dl.acm.org/doi/10.5555/1642293.1642455
}}
measuring classifier bias,{{cite journal
|title= Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach
|author1=Alessandro Fabris
|author2=Andrea Esuli
|author3=Alejandro Moreo
|author4=Fabrizio Sebastiani
|journal=Journal of Artificial Intelligence Research
|year=2023
|volume=76
|url=https://www.jair.org/index.php/jair/article/view/14033/26912.pdf
|pages=1117–1180
|doi=10.1613/jair.1.14033
|arxiv=2109.08549
|s2cid=247315416
}}
and estimating the accuracy of classifiers on out-of-distribution data.{{cite journal
|title= A simple method for classifier accuracy prediction under prior probability shift
|author1=Lorenzo Volpi
|author2=Alejandro Moreo
|author3=Fabrizio Sebastiani
|journal=Proceedings of the 27th International Conference on Discovery Science (DS 2024)
|location=Pisa, IT
|pages=267–283
|year=2024
|url=https://link.springer.com/chapter/10.1007/978-3-031-78980-9_17
}}
Resources
- LQ 2021: the 1st International Workshop on Learning to Quantify{{cite web
|url=https://cikmlq2021.github.io/
|title=LQ 2021: the 1st International Workshop on Learning to Quantify
}}
- LQ 2022: the 2nd International Workshop on Learning to Quantify{{cite web
|url=https://lq-2022.github.io/
|title=LQ 2022: the 2nd International Workshop on Learning to Quantify
}}
- LQ 2023: the 3rd International Workshop on Learning to Quantify{{cite web
|url=https://lq-2023.github.io/
|title=LQ 2023: the 3rd International Workshop on Learning to Quantify
}}
- LQ 2024: the 4th International Workshop on Learning to Quantify{{cite web
|url=https://lq-2024.github.io/
|title=LQ 2024: the 4th International Workshop on Learning to Quantify
}}
- LeQua 2022: the 1st Data Challenge on Learning to Quantify {{cite web
|url=https://lequa2022.github.io/
|title=LeQua 2022: A Data Challenge on Learning to Quantify
}}
- LeQua 2024: the 2nd Data Challenge on Learning to Quantify {{cite web
|url=https://lequa2024.github.io/
|title=LeQua 2024: A Data Challenge on Learning to Quantify
}}
- QuaPy: An open-source Python-based software library for quantification (a usage sketch is given after this list){{cite web
|url=https://github.com/HLT-ISTI/QuaPy
|title=QuaPy: A Python-Based Framework for Quantification
|website=GitHub
|date=23 November 2021
}}
- QuantificationLib: A Python library for quantification and prevalence estimation{{cite web
|url=https://aicgijon.github.io/quantificationlib/
|title=QuantificationLib: A Python library for quantification and prevalence estimation
|website=GitHub
|date=8 April 2024
}}
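As a pointer to how the libraries listed above are typically used, the following sketch shows a train-and-quantify workflow with QuaPy; the module, class, and dataset names follow the library's public documentation but may differ across versions, so they should be read as assumptions rather than as a definitive API reference.
<syntaxhighlight lang="python">
# pip install quapy scikit-learn   (assumed installation route)
import quapy as qp
from quapy.method.aggregative import ACC        # adjusted classify and count
from sklearn.linear_model import LogisticRegression

# Load one of the sentiment datasets bundled with the library
# (the dataset name and loader arguments are assumptions).
dataset = qp.datasets.fetch_reviews('imdb', tfidf=True)

# Wrap a standard scikit-learn classifier into a quantifier and train it.
quantifier = ACC(LogisticRegression())
quantifier.fit(dataset.training)

# Estimate class prevalences on the unlabelled test instances.
estimated_prevalences = quantifier.quantify(dataset.test.instances)
print(estimated_prevalences)
</syntaxhighlight>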
References
{{reflist}}