Coreset
{{Short description|Computational geometry concept}}
In computational geometry, a coreset of an input set is a subset of points, such that solving a problem on the coreset provably yields similar results as solving the problem on the entire point set, for some given family of problems.{{Citation |last1=Jubran |first1=Ibrahim |title=Introduction to Coresets: Accurate Coresets |date=2019-10-19 |url=https://arxiv.org/abs/1910.08707 |access-date=2025-02-22 |arxiv=1910.08707 |last2=Maalouf |first2=Alaa |last3=Feldman |first3=Dan}} Coresets are commonly used in Mathematical optimization, Cluster analysis and Range Queries to reduce computational complexity while maintaining high accuracy. They allow algorithms to operate efficiently on large datasets by replacing the original data with a significantly smaller representative subset.{{Citation |last=Feldman |first=Dan |title=Introduction to Core-sets: an Updated Survey |date=2020 |url=https://arxiv.org/abs/2011.09384 |access-date=2025-02-22 |arxiv=2011.09384}}
Many natural geometric optimization problems have coresets that approximate an optimal solution to within a factor of {{math|1 + ε}}, that can be found quickly (in linear time or near-linear time), and that have size bounded by a function of {{math|1/ε}} independent of the input size, where {{math|ε}} is an arbitrary positive number. When this is the case, one obtains a linear-time or near-linear time approximation scheme, based on the idea of finding a coreset and then applying an exact optimization algorithm to the coreset. Regardless of how slow the exact optimization algorithm is, for any fixed choice of {{math|ε}}, the running time of this approximation scheme will be {{math|O(1)}} plus the time to find the coreset.{{citation
| last1 = Agarwal | first1 = Pankaj K. | author1-link = Pankaj K. Agarwal
| last2 = Har-Peled | first2 = Sariel | author2-link = Sariel Har-Peled
| last3 = Varadarajan | first3 = Kasturi R.
| editor1-last = Goodman | editor1-first = Jacob E. | editor1-link = Jacob E. Goodman
| editor2-last = Pach | editor2-first = János | editor2-link = János Pach
| editor3-last = Welzl | editor3-first = Emo | editor3-link = Emo Welzl
| contribution = Geometric approximation via coresets
| contribution-url = http://sarielhp.org/p/04/survey/
| mr = 2178310
| pages = 1–30
| publisher = Cambridge Univ. Press, Cambridge
| series = Mathematical Sciences Research Institute Publications
| title = Combinatorial and Computational Geometry
| volume = 52
| year = 2005}}.{{cite book |first=Frank |last=Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016
| publisher=Springer |isbn=978-3-319-21903-5 | pages=259–272
|chapter=10. Fast approximate optimization in high dimensions with core-sets and fast dimension reduction | url=https://www.springer.com/gp/book/9783319219028
|chapter-url=https://www.researchgate.net/publication/313162957 }}
Definition
A coreset is a subset of a point set , possibly with associated weights, that preserves an optimization cost function within a factor of , where is some user defined approximation parameter. Formally, for an optimization problem with some cost function COST, a coreset satisfies the following inequality:{{Cite book |last1=Frahling |first1=Gereon |last2=Sohler |first2=Christian |chapter=Coresets in dynamic geometric data streams |date=2005-05-22 |title=Proceedings of the thirty-seventh annual ACM symposium on Theory of computing |chapter-url=https://dl.acm.org/doi/10.1145/1060590.1060622 |series=STOC '05 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=209–217 |doi=10.1145/1060590.1060622 |isbn=978-1-58113-960-0}}
COST COST (1 + ) COST
Applications
Coresets are used in a variety of problems, a few key examples include:{{Cite web |last=Nijsten |first=Sam |date=2024-07-10 |title=Range-Centric Coresets in Dynamic Geometric Streams |url=https://research.tue.nl/en/studentTheses/range-centric-coresets-in-dynamic-geometric-streams |access-date=2025-02-21 |website=Research Portal Eindhoven University of Technology}}
- Clustering: Approximating solutions for K-means clustering, K-medians clustering and K-center clustering while significantly reducing computation.
- Range Queries: Speeding up spatial searches in Geographic Information Systems or large databases by efficiently summarizing data.
- Machine Learning: Enhancing performance in Hyperparameter optimization by working with a smaller representative set.