Count sketch

{{short description|Method of a dimension reduction}}

{{Machine learning bar}}

Count sketch is a type of dimensionality reduction that is particularly efficient in statistics, machine learning and algorithms.Faisal M. Algashaam; Kien Nguyen; Mohamed Alkanhal; Vinod Chandran; Wageeh Boles. "Multispectral Periocular Classification WithMultimodal Compact Multi-Linear Pooling" [1]. IEEE Access, Vol. 5. 2017.{{Cite web |last1=Ahle |first1=Thomas |last2=Knudsen |first2=Jakob |date=2019-09-03 |title=Almost Optimal Tensor Sketch |url=https://www.researchgate.net/publication/335617805 |access-date=2020-07-11 |website=ResearchGate}}

It was invented by Moses Charikar, Kevin Chen and Martin Farach-Colton{{sfn | Charikar | Chen | Farach-Colton | 2004 | pp=}} in an effort to speed up the AMS Sketch by Alon, Matias and Szegedy for approximating the frequency moments of streamsAlon, Noga, Yossi Matias, and Mario Szegedy. "The space complexity of approximating the frequency moments." Journal of Computer and system sciences 58.1 (1999): 137-147. (these calculations require counting of the number of occurrences for the distinct elements of the stream).

The sketch is nearly identical{{cn|reason=see the talk page|date=September 2023}} to the Feature hashing algorithm by John Moody,Moody, John. "Fast learning in multi-resolution hierarchies." Advances in neural information processing systems. 1989. but differs in its use of hash functions with low dependence, which makes it more practical.

In order to still have a high probability of success, the median trick is used to aggregate multiple count sketches, rather than the mean.

These properties allow use for explicit kernel methods, bilinear pooling in neural networks and is a cornerstone in many numerical linear algebra algorithms.Woodruff, David P. "Sketching as a Tool for Numerical Linear Algebra." Theoretical Computer Science 10.1-2 (2014): 1–157.

Intuitive explanation

The inventors of this data structure offer the following iterative explanation of its operation:{{sfn | Charikar | Chen | Farach-Colton | 2004 | pp=}}

  • at the simplest level, the output of a single hash function {{mvar|s}} mapping stream elements {{mvar|q}} into {+1, -1} is feeding a single up/down counter {{mvar|C}}. After a single pass over the data, the frequency n(q) of a stream element {{mvar|q}} can be approximated, although extremely poorly, by the expected value \bold E[C \cdot s(q)];
  • a straightforward way to improve the variance of the previous estimate is to use an array of different hash functions s_i, each connected to its own counter C_i. For each element {{mvar|q}}, the \bold E[C_i \cdot s_i(q)]= n(q) still holds, so averaging across the {{mvar|i}} range will tighten the approximation;
  • the previous construct still has a major deficiency: if a lower-frequency-but-still-important output element {{mvar|a}} exhibits a hash collision with a high-frequency element, n(a) estimate can be significantly affected. Avoiding this requires reducing the frequency of collision counter updates between any two distinct elements. This is achieved by replacing each C_i in the previous construct with an array of {{mvar|m}} counters (making the counter set into a two-dimensional matrix C_{i,j}), with index {{mvar|j}} of a particular counter to be incremented/decremented selected via another set of hash functions h_i that map element {{mvar|q}} into the range {1..{{mvar|m}}}. Since \bold E[C_{i, h_i(q)} \cdot s_i(q)] = n(q), averaging across all values of {{mvar|i}} will work.

Mathematical definition

1. For constants w and t (to be defined later) independently choose d=2t+1 random hash functions

h_1, \dots, h_d and s_1,\dots,s_d such that

h_i : [n] \to [w] and

s_i : [n] \to \{\pm 1\}.

It is necessary that the hash families from which h_i and s_i are chosen be pairwise independent.

2. For each item q_i in the stream, add s_j(q_i) to the h_j(q_i)th bucket of the jth hash.

At the end of this process, one has wd sums (C_{ij}) where

:C_{i,j} = \sum_{h_i(k)=j}s_i(k).

To estimate the count of qs one computes the following value:

:r_q = \text{median}_{i=1}^d\, s_i(q)\cdot C_{i, h_i(q)}.

The values s_i(q)\cdot C_{i, h_i(q)} are unbiased estimates of how many times q has appeared in the stream.

The estimate r_q has variance O(\mathrm{min}\{m_1^2/w^2, m_2^2/w\}), where

m_1 is the length of the stream and m_2^2 is \sum_q (\sum_i [q_i=q])^2.Larsen, Kasper Green, Rasmus Pagh, and Jakub Tětek. "CountSketches, Feature Hashing and the Median of Three." International Conference on Machine Learning. PMLR, 2021.

Furthermore, r_q is guaranteed to never be more than 2m_2/\sqrt{w} off from the true value, with probability 1-e^{-O(t)}.

=Vector formulation=

Alternatively Count-Sketch can be seen as a linear mapping with a non-linear reconstruction function.

Let M^{(i\in[d])}\in\{-1,0,1\}^{w \times n}, be a collection of d=2t+1 matrices, defined by

:M^{(i)}_{h_i(j),j} = s_i(j)

for j\in[w] and 0 everywhere else.

Then a vector v\in\mathbb{R}^n is sketched by C^{(i)} = M^{(i)} v \in \mathbb{R}^w.

To reconstruct v we take v^*_j = \text{median}_i C^{(i)}_j s_i(j).

This gives the same guarantees as stated above, if we take m_1=\|v\|_1 and m_2=\|v\|_2.

Relation to Tensor sketch

The count sketch projection of the outer product of two vectors is equivalent to the convolution of two component count sketches.

The count sketch computes a vector convolution

C^{(1)}x \ast C^{(2)}x^T, where C^{(1)} and C^{(2)} are independent count sketch matrices.

Pham and Pagh{{cite conference

| title = Fast and scalable polynomial kernels via explicit feature maps

| last1 = Ninh

| first1 = Pham

| first2 = Rasmus

| last2 = Pagh | author2-link = Rasmus Pagh

| date = 2013

| publisher = Association for Computing Machinery

| conference = SIGKDD international conference on Knowledge discovery and data mining

|doi = 10.1145/2487575.2487591}}

show that this equals C(x \otimes x^T) – a count sketch C of the outer product of vectors, where \otimes denotes Kronecker product.

The fast Fourier transform can be used to do fast convolution of count sketches.

By using the face-splitting product{{Cite journal|last=Slyusar|first=V. I. |title=End products in matrices in radar applications |url=http://slyusar.kiev.ua/en/IZV_1998_3.pdf|journal=Radioelectronics and Communications Systems |year=1998 |volume=41 |issue=3|pages=50–53}}{{Cite journal|last=Slyusar|first=V. I.|date=1997-05-20|title=Analytical model of the digital antenna array on a basis of face-splitting matrix products. |url=http://slyusar.kiev.ua/ICATT97.pdf|journal=Proc. ICATT-97, Kyiv|pages=108–109}}{{Cite journal|last=Slyusar|first=V. I.|date=March 13, 1998|title=A Family of Face Products of Matrices and its Properties|url=http://slyusar.kiev.ua/FACE.pdf|journal=Cybernetics and Systems Analysis C/C of Kibernetika I Sistemnyi Analiz.- 1999.|volume=35|issue=3|pages=379–384|doi=10.1007/BF02733426|s2cid=119661450 }} such structures can be computed much faster than normal matrices.

See also

References

{{Reflist}}

Further reading

  • {{cite journal | last=Charikar | first=Moses | last2=Chen | first2=Kevin | last3=Farach-Colton | first3=Martin | title=Finding frequent items in data streams | journal=Theoretical Computer Science | publisher=Elsevier BV | volume=312 | issue=1 | year=2004 | issn=0304-3975 | doi=10.1016/s0304-3975(03)00400-6 | pages=3–15 | url=https://people.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf}}
  • Faisal M. Algashaam; Kien Nguyen; Mohamed Alkanhal; Vinod Chandran; Wageeh Boles. "Multispectral Periocular Classification WithMultimodal Compact Multi-Linear Pooling" [https://ieeexplore.ieee.org/document/7990127]. IEEE Access, Vol. 5. 2017.
  • {{Cite web |last1=Ahle |first1=Thomas |last2=Knudsen |first2=Jakob |date=2019-09-03 |title=Almost Optimal Tensor Sketch |url=https://www.researchgate.net/publication/335617805 |access-date=2020-07-11 |website=ResearchGate}}

Category:Dimension reduction

Category:Hash-based data structures

Category:Probabilistic data structures