Tensor sketch
{{short description|Algorithm for reducing the dimension of tensors}}
{{Machine learning bar}}
In statistics, machine learning and algorithms, a '''tensor sketch''' is a type of dimensionality reduction that is particularly efficient when applied to vectors that have tensor structure.<ref>{{Cite web |title=Low-rank Tucker decomposition of large tensors using: Tensor Sketch |url=https://amath.colorado.edu/faculty/becker/TensorSketch.pdf |website=amath.colorado.edu |publisher=University of Colorado Boulder |location=Boulder, Colorado}}</ref><ref>{{Cite web |last1=Ahle |first1=Thomas |last2=Knudsen |first2=Jakob |date=2019-09-03 |title=Almost Optimal Tensor Sketch |url=https://www.researchgate.net/publication/335617805 |access-date=2020-07-11 |website=ResearchGate}}</ref> Such a sketch can be used to speed up explicit kernel methods and bilinear pooling in neural networks, and it is a cornerstone in many numerical linear algebra algorithms.<ref>Woodruff, David P. "[https://arxiv.org/pdf/1411.4357 Sketching as a Tool for Numerical Linear Algebra] {{Webarchive|url=https://web.archive.org/web/20221022171734/https://arxiv.org/pdf/1411.4357.pdf |date=2022-10-22 }}." Theoretical Computer Science 10.1-2 (2014): 1–157.</ref>
== Mathematical definition ==
Mathematically, a dimensionality reduction or sketching matrix is a matrix <math>M \in \mathbb{R}^{k \times d}</math>, where <math>k < d</math>, such that for any vector <math>x \in \mathbb{R}^d</math>
:<math>\big| \|Mx\|_2 - \|x\|_2 \big| < \varepsilon \|x\|_2</math>
with high probability.
In other words, <math>M</math> preserves the norm of vectors up to a small relative error.
A tensor sketch has the extra property that if <math>x = y \otimes z</math> for some vectors <math>y \in \mathbb{R}^{d_1}</math> and <math>z \in \mathbb{R}^{d_2}</math> with <math>d_1 d_2 = d</math>, then <math>Mx</math> can be computed faster than if <math>y \otimes z</math> first had to be formed explicitly.
The speedup is achieved by first rewriting <math>M(y \otimes z) = M^{(1)} y \circ M^{(2)} z</math>, where <math>\circ</math> denotes the element-wise (Hadamard) product and <math>M^{(1)}, M^{(2)}</math> are smaller matrices associated with <math>M</math>.
Each of <math>M^{(1)} y</math> and <math>M^{(2)} z</math> can be computed in time <math>O(k d_1)</math> and <math>O(k d_2)</math> respectively, so the total time is <math>O(k(d_1 + d_2))</math> rather than <math>O(k d_1 d_2) = O(kd)</math>.
For higher-order tensors, such as <math>x = y \otimes z \otimes t</math>, the savings are even greater.
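The norm condition can be checked numerically. The following is a minimal NumPy illustration (the dimensions and the choice of a scaled Gaussian matrix are arbitrary and not taken from the cited references):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 500  # original and reduced dimension (arbitrary illustrative values)

# A dense Gaussian sketching matrix, scaled so that E[||Mx||^2] = ||x||^2.
M = rng.standard_normal((k, d)) / np.sqrt(k)

x = rng.standard_normal(d)
print(np.linalg.norm(M @ x) / np.linalg.norm(x))  # close to 1 with high probability
</syntaxhighlight>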
== History ==
The term tensor sketch was coined in 2013<ref>{{cite conference
| title = Fast and scalable polynomial kernels via explicit feature maps
| last1 = Ninh
| first1 = Pham
| first2 = Rasmus
| last2 = Pagh | author2-link = Rasmus Pagh
| date = 2013
| publisher = Association for Computing Machinery
| conference = SIGKDD international conference on Knowledge discovery and data mining
|doi = 10.1145/2487575.2487591}}</ref>
describing a technique by Rasmus Pagh<ref>
{{cite journal
| title = Compressed matrix multiplication
| first1 = Rasmus
| last1 = Pagh | author1-link = Rasmus Pagh
| date = 2013
| publisher = Association for Computing Machinery
| journal = ACM Transactions on Computation Theory
| volume = 5
| issue = 3
| pages = 1–17
|doi = 10.1145/2493252.2493254| arxiv = 1108.1320
| s2cid = 47560654
}}</ref>
from the same year.
Originally it was understood in terms of the fast Fourier transform, which is used to perform fast convolution of count sketches.
Later work generalized it to a much larger class of dimensionality reductions via tensor random embeddings.
Tensor random embeddings were introduced in 2010 in a paper<ref>Kasiviswanathan, Shiva Prasad, et al. "[https://www.osti.gov/servlets/purl/990798 The price of privately releasing contingency tables and the spectra of random matrices with correlated rows] {{Webarchive|url=https://web.archive.org/web/20221022171727/https://www.osti.gov/biblio/990798 |date=2022-10-22 }}." Proceedings of the forty-second ACM symposium on Theory of computing. 2010.</ref> on differential privacy, and were first analyzed by Rudelson et al. in 2012 in the context of sparse recovery.<ref>Rudelson, Mark, and Shuheng Zhou. "[http://proceedings.mlr.press/v23/rudelson12/rudelson12.pdf Reconstruction from anisotropic random measurements] {{Webarchive|url=https://web.archive.org/web/20221017120610/http://proceedings.mlr.press/v23/rudelson12/rudelson12.pdf |date=2022-10-17 }}." Conference on Learning Theory. 2012.</ref>
Avron et al. were the first to study the subspace embedding properties of tensor sketches, particularly focused on applications to polynomial kernels.
In this context, the sketch is required not only to preserve the norm of each individual vector with a certain probability, but also to preserve the norm of all vectors in each individual linear subspace.
This is a much stronger property, and it requires larger sketch sizes, but it allows the kernel methods to be used very broadly, as explored in the book by David Woodruff.
== Tensor random projections ==
The face-splitting product of two matrices is defined as the tensor products of their corresponding rows (it was proposed by V. Slyusar<ref>Anna Esteve, Eva Boj & Josep Fortiana (2009): Interaction Terms in Distance-Based Regression, Communications in Statistics – Theory and Methods, 38:19, P. 3501 [http://dx.doi.org/10.1080/03610920802592860] {{Webarchive|url=https://web.archive.org/web/20210426020635/http://dx.doi.org/10.1080/03610920802592860|date=2021-04-26}}</ref> in 1996<ref>{{Cite journal |last=Slyusar |first=V. I. |year=1998 |title=End products in matrices in radar applications. |url=http://slyusar.kiev.ua/en/IZV_1998_3.pdf|journal=Radioelectronics and Communications Systems |volume=41 |issue=3|pages=50–53}}</ref><ref>{{Cite journal|last=Slyusar|first=V. I.|date=1997-05-20|title=Analytical model of the digital antenna array on a basis of face-splitting matrix products. |url=http://slyusar.kiev.ua/ICATT97.pdf|journal=Proc. ICATT-97, Kyiv|pages=108–109}}</ref><ref>{{Cite journal|last=Slyusar|first=V. I.|date=1997-09-15|title=New operations of matrices product for applications of radars|url=http://slyusar.kiev.ua/DIPED_1997.pdf|journal=Proc. Direct and Inverse Problems of Electromagnetic and Acoustic Wave Theory (DIPED-97), Lviv.|pages=73–74}}</ref><ref>{{Cite journal|last=Slyusar|first=V. I.|date=March 13, 1998|title=A Family of Face Products of Matrices and its Properties|url=http://slyusar.kiev.ua/FACE.pdf|journal=Cybernetics and Systems Analysis C/C of Kibernetika I Sistemnyi Analiz. – 1999.|volume=35|issue=3|pages=379–384|doi=10.1007/BF02733426|s2cid=119661450 }}</ref><ref>{{Cite journal|last=Slyusar|first=V. I.|date=2003|title=Generalized face-products of matrices in models of digital antenna arrays with nonidentical channels|url=http://slyusar.kiev.ua/en/IZV_2003_10.pdf|journal=Radioelectronics and Communications Systems|volume=46|issue=10|pages=9–17}}</ref> for radar and digital antenna array applications).
More directly, let <math>\mathbf{C} \in \mathbb{R}^{3 \times 3}</math> and <math>\mathbf{D} \in \mathbb{R}^{3 \times 3}</math> be two matrices.
Then the face-splitting product is
:<math>
\mathbf{C} \bull \mathbf{D}
=
\left[
\begin{array} { c }
\mathbf{C}_1 \otimes \mathbf{D}_1\\\hline
\mathbf{C}_2 \otimes \mathbf{D}_2\\\hline
\mathbf{C}_3 \otimes \mathbf{D}_3\\
\end{array}
\right]
=
\left[
\begin{array} { c c c c c c c c c }
\mathbf{C}_{1,1}\mathbf{D}_{1,1} & \mathbf{C}_{1,1}\mathbf{D}_{1,2} & \mathbf{C}_{1,1}\mathbf{D}_{1,3} & \mathbf{C}_{1,2}\mathbf{D}_{1,1} & \mathbf{C}_{1,2}\mathbf{D}_{1,2} & \mathbf{C}_{1,2}\mathbf{D}_{1,3} & \mathbf{C}_{1,3}\mathbf{D}_{1,1} & \mathbf{C}_{1,3}\mathbf{D}_{1,2} & \mathbf{C}_{1,3}\mathbf{D}_{1,3} \\\hline
\mathbf{C}_{2,1}\mathbf{D}_{2,1} & \mathbf{C}_{2,1}\mathbf{D}_{2,2} & \mathbf{C}_{2,1}\mathbf{D}_{2,3} & \mathbf{C}_{2,2}\mathbf{D}_{2,1} & \mathbf{C}_{2,2}\mathbf{D}_{2,2} & \mathbf{C}_{2,2}\mathbf{D}_{2,3} & \mathbf{C}_{2,3}\mathbf{D}_{2,1} & \mathbf{C}_{2,3}\mathbf{D}_{2,2} & \mathbf{C}_{2,3}\mathbf{D}_{2,3} \\\hline
\mathbf{C}_{3,1}\mathbf{D}_{3,1} & \mathbf{C}_{3,1}\mathbf{D}_{3,2} & \mathbf{C}_{3,1}\mathbf{D}_{3,3} & \mathbf{C}_{3,2}\mathbf{D}_{3,1} & \mathbf{C}_{3,2}\mathbf{D}_{3,2} & \mathbf{C}_{3,2}\mathbf{D}_{3,3} & \mathbf{C}_{3,3}\mathbf{D}_{3,1} & \mathbf{C}_{3,3}\mathbf{D}_{3,2} & \mathbf{C}_{3,3}\mathbf{D}_{3,3}
\end{array}
\right].
</math>
The reason this product is useful is the following identity:
:<math>
(\mathbf{C} \bull \mathbf{D})(x \otimes y)
= \mathbf{C}x \circ \mathbf{D}y
= \left[
\begin{array} { c }
(\mathbf{C}x)_1 (\mathbf{D} y)_1 \\
(\mathbf{C}x)_2 (\mathbf{D} y)_2 \\
\vdots
\end{array}\right],
</math>
where <math>\circ</math> denotes the element-wise (Hadamard) product.
Since this operation can be computed in linear time, <math>\mathbf{C} \bull \mathbf{D}</math> can be multiplied with vectors that have tensor structure much faster than an ordinary matrix of the same size.
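The identity is easy to verify numerically. The following minimal NumPy sketch is an illustration only; the helper name <code>face_split</code> and the matrix sizes are arbitrary choices, not taken from the cited sources:

<syntaxhighlight lang="python">
import numpy as np

def face_split(C, D):
    # Face-splitting product: row i is the Kronecker product of row i of C and row i of D.
    return np.einsum('ij,ik->ijk', C, D).reshape(C.shape[0], -1)

rng = np.random.default_rng(0)
C = rng.integers(-3, 4, size=(3, 3)).astype(float)
D = rng.integers(-3, 4, size=(3, 3)).astype(float)
x, y = rng.standard_normal(3), rng.standard_normal(3)

lhs = face_split(C, D) @ np.kron(x, y)  # uses the 9-dimensional vector x ⊗ y
rhs = (C @ x) * (D @ y)                 # only 3-dimensional matrix-vector products
print(np.allclose(lhs, rhs))            # True
</syntaxhighlight>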
=== Construction with fast Fourier transform ===
The tensor sketch of Pham and Pagh computes <math>C^{(1)} x \ast C^{(2)} y</math>, where <math>C^{(1)}</math> and <math>C^{(2)}</math> are independent count sketch matrices and <math>\ast</math> is vector convolution.
They show that, amazingly, this equals <math>C(x \otimes y)</math> – a count sketch of the tensor product!
It turns out that this relation can be seen in terms of the face-splitting product as
:<math>C(x \otimes y) = C^{(1)} x \ast C^{(2)} y = \mathcal{F}^{-1}(\mathcal{F} C^{(1)} x \,\circ\, \mathcal{F} C^{(2)} y),</math>
where <math>\mathcal{F}</math> is the Fourier transform matrix, since convolution corresponds to element-wise multiplication in the frequency domain.
Since <math>\mathcal{F}</math> is orthonormal (up to scaling), applying <math>\mathcal{F}^{-1}</math> does not change the norm of the vector and can be disregarded when analysing the sketch.
What's left is that <math>C</math> corresponds to the face-splitting product <math>\mathcal{F} C^{(1)} \bull \mathcal{F} C^{(2)}</math>, up to the final inverse transform.
On the other hand, by the identity of the previous section,
:<math>(\mathcal{F} C^{(1)} \bull \mathcal{F} C^{(2)})(x \otimes y) = \mathcal{F} C^{(1)} x \,\circ\, \mathcal{F} C^{(2)} y,</math>
so the sketch can be computed from two small count sketches and fast Fourier transforms, without ever forming <math>x \otimes y</math>.
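This construction can be illustrated with a small NumPy sketch. It follows the description above (two count sketches combined by FFT-based convolution), but the dimensions, hash tables and helper names are arbitrary illustrative choices:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 8  # input dimension and sketch size (illustrative)

# Two independent count sketches: random hash buckets and random signs.
h1, h2 = rng.integers(0, k, d), rng.integers(0, k, d)
s1, s2 = rng.choice([-1.0, 1.0], d), rng.choice([-1.0, 1.0], d)

def count_sketch(x, h, s):
    out = np.zeros(k)
    np.add.at(out, h, s * x)  # out[h[i]] += s[i] * x[i]
    return out

x, y = rng.standard_normal(d), rng.standard_normal(d)

# Fast route: circular convolution of the two small sketches via the FFT.
fast = np.fft.irfft(np.fft.rfft(count_sketch(x, h1, s1)) *
                    np.fft.rfft(count_sketch(y, h2, s2)), k)

# Slow route: a count sketch of the full tensor product x ⊗ y, using the
# derived hash (h1[i] + h2[j]) mod k and sign s1[i] * s2[j].
slow = np.zeros(k)
for i in range(d):
    for j in range(d):
        slow[(h1[i] + h2[j]) % k] += s1[i] * s2[j] * x[i] * y[j]

print(np.allclose(fast, slow))  # True
</syntaxhighlight>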
=== Application to general matrices ===
The problem with the original tensor sketch algorithm was that it used count sketch matrices, which are not always very good dimensionality reductions.
In 2020 it was shown that any matrices with sufficiently random, independent rows suffice to create a tensor sketch.
This allows using matrices with stronger guarantees, such as real Gaussian Johnson–Lindenstrauss matrices.
In particular, we get the following type of theorem:
:Consider a matrix <math>T</math> with i.i.d. rows that are sufficiently random, in the sense that <math>\mathrm{E}[(T_i x)^2] = \|x\|_2^2</math> and the higher moments of <math>T_i x</math> are bounded, and let <math>T^{(1)}, \dots, T^{(c)}</math> be independent copies of <math>T</math>.
: Then the face-splitting product <math>T^{(1)} \bull \cdots \bull T^{(c)}</math> preserves the norm of any fixed vector up to a factor <math>1 \pm \varepsilon</math> with probability <math>1 - \delta</math>,
: provided the number of rows is large enough; the required number of rows grows polynomially in <math>1/\varepsilon</math> and <math>\log(1/\delta)</math>, but exponentially in the order <math>c</math>.
In particular, if the entries of <math>T</math> are independent <math>\pm 1</math> values, this recovers a variant of the original tensor sketch.
The paper also shows that some dependency of this exponential type on <math>c</math> is unavoidable for constructions based on tensoring independent sketches.
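As a numerical illustration of this (a sketch with arbitrarily chosen dimensions, not the construction of the paper verbatim), the face-splitting product of two independent Gaussian matrices approximately preserves the norm of a tensor-structured vector:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
d, k = 100, 2000  # illustrative dimensions

# Independent Gaussian matrices; the scaling makes E[||sketch||^2] = ||x ⊗ y||^2.
T1 = rng.standard_normal((k, d)) / np.sqrt(k)
T2 = rng.standard_normal((k, d))

x, y = rng.standard_normal(d), rng.standard_normal(d)

sketch = (T1 @ x) * (T2 @ y)  # tensor sketch of x ⊗ y, never forming it explicitly
print(np.linalg.norm(sketch) / (np.linalg.norm(x) * np.linalg.norm(y)))  # close to 1
</syntaxhighlight>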
== Variations ==
=== Recursive construction ===
Because of the exponential dependency on <math>c</math> in tensor sketches based on the face-splitting product, a different approach was developed in 2020 which applies the sketching matrices recursively:
:<math>
M(x \otimes y \otimes \cdots)
= M^{(1)}(x \otimes M^{(2)}(y \otimes \cdots))
</math>
We can achieve such an <math>M</math> by letting
:<math>M = M^{(1)}(I_d \otimes M^{(2)}(I_d \otimes \cdots)),</math>
where each <math>M^{(i)}</math> is itself an order-2 tensor sketch, so that it can be applied to the tensor product of one new factor and an already sketched vector without expanding it.
With this method, we only apply the general tensor sketch method to order-2 tensors, which avoids the exponential dependency in the number of rows.
It can be proved that combining <math>c</math> dimensionality reductions in this way increases the overall distortion <math>\varepsilon</math> only by a factor on the order of <math>\sqrt{c}</math>.
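The recursion can be illustrated for an order-3 tensor product, using face-splitting products of Gaussian matrices as the order-2 sketches <math>M^{(i)}</math>. The helper names and dimensions below are arbitrary illustrative choices:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 16  # illustrative dimensions

def order2_sketch(k, d1, d2):
    # An order-2 tensor sketch M with M(u ⊗ v) = (A u) * (B v).
    A = rng.standard_normal((k, d1)) / np.sqrt(k)
    B = rng.standard_normal((k, d2))
    return A, B

def apply2(AB, u, v):
    A, B = AB
    return (A @ u) * (B @ v)  # never forms u ⊗ v

M2 = order2_sketch(k, d, d)  # sketches y ⊗ z down to k dimensions
M1 = order2_sketch(k, d, k)  # sketches x ⊗ (sketch of y ⊗ z)

x, y, z = (rng.standard_normal(d) for _ in range(3))

# Recursive evaluation: M(x ⊗ y ⊗ z) = M1(x ⊗ M2(y ⊗ z)).
recursive = apply2(M1, x, apply2(M2, y, z))

# Sanity check against the explicit matrix M = M1 (I_d ⊗ M2) applied to x ⊗ y ⊗ z.
def face_split(A, B):
    return np.einsum('ij,ik->ijk', A, B).reshape(A.shape[0], -1)

M2_mat = face_split(*M2)                     # k × d^2
M1_mat = face_split(*M1)                     # k × (d·k)
M_mat = M1_mat @ np.kron(np.eye(d), M2_mat)  # k × d^3
explicit = M_mat @ np.kron(x, np.kron(y, z))
print(np.allclose(recursive, explicit))      # True
</syntaxhighlight>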
=== Fast constructions ===
The fast Johnson–Lindenstrauss transform is a dimensionality reduction matrix that can be applied to vectors faster than an arbitrary dense matrix.
Given a general matrix <math>M \in \mathbb{R}^{k \times d}</math>, computing the matrix-vector product <math>Mx</math> takes <math>O(kd)</math> time.
The Fast Johnson–Lindenstrauss Transform (FJLT)<ref>{{cite encyclopedia
| last1 = Ailon | first1 = Nir | last2 = Chazelle | first2 = Bernard
| chapter = Approximate nearest neighbors and the fast Johnson–Lindenstrauss transform
| title = Proceedings of the 38th Annual ACM Symposium on Theory of Computing
| year = 2006
| mr = 2277181
| doi = 10.1145/1132516.1132597
| pages = 557–563
| publisher = ACM Press
| location = New York
| isbn = 1-59593-134-1| s2cid = 490517 }}</ref>
was introduced by Ailon and Chazelle in 2006.
A version of this method takes
:<math>M = \mathrm{SHD},</math>
where
* <math>D</math> is a diagonal matrix where each diagonal entry <math>D_{i,i}</math> is <math>\pm1</math> independently. The matrix-vector multiplication <math>Dx</math> can be computed in <math>O(d)</math> time.
* <math>H</math> is a Hadamard matrix, which allows matrix-vector multiplication in time <math>O(d\log d)</math>.
* <math>S</math> is a <math>k\times d</math> sampling matrix which is all zeros, except for a single 1 in each row.
If the diagonal matrix is replaced by one which has a tensor product of independent <math>\pm1</math> values on the diagonal, instead of fully independent values, it becomes possible to compute <math>\mathrm{SHD}(x \otimes y)</math> quickly.
For an example of this, let <math>\sigma, \rho \in \{-1, +1\}^2</math> be two independent <math>\pm1</math> vectors and let <math>x, y \in \mathbb{R}^2</math>.
We can then split up <math>\mathrm{SHD}(x \otimes y)</math> as follows:
:<math>
\begin{align}
&\operatorname{SHD}(x\otimes y)
\\
&\quad=
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0
\end{bmatrix}
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1
\end{bmatrix}
\begin{bmatrix}
\sigma_1 \rho_1 & 0 & 0 & 0 \\
0 & \sigma_1 \rho_2 & 0 & 0 \\
0 & 0 & \sigma_2 \rho_1 & 0 \\
0 & 0 & 0 & \sigma_2 \rho_2 \\
\end{bmatrix}
\begin{bmatrix}
x_1y_1 \\
x_1y_2 \\
x_2y_1 \\
x_2y_2
\end{bmatrix}
\\[5pt]
&\quad=
\left(
\begin{bmatrix}
1 & 0 \\
0 & 1 \\
1 & 0
\end{bmatrix}
\bullet
\begin{bmatrix}
1 & 0 \\
1 & 0 \\
0 & 1
\end{bmatrix}
\right)
\left(
\begin{bmatrix}
1 & 1 \\
1 & -1
\end{bmatrix}
\otimes
\begin{bmatrix}
1 & 1 \\
1 & -1
\end{bmatrix}
\right)
\left(
\begin{bmatrix}
\sigma_1 & 0 \\
0 & \sigma_2 \\
\end{bmatrix}
\otimes
\begin{bmatrix}
\rho_1 & 0 \\
0 & \rho_2 \\
\end{bmatrix}
\right)
\left(
\begin{bmatrix}
x_1 \\
x_2
\end{bmatrix}
\otimes
\begin{bmatrix}
y_1 \\
y_2
\end{bmatrix}
\right)
\\[5pt]
&\quad=
\left(
\begin{bmatrix}
1 & 0 \\
0 & 1 \\
1 & 0
\end{bmatrix}
\bullet
\begin{bmatrix}
1 & 0 \\
1 & 0 \\
0 & 1
\end{bmatrix}
\right)
\left(
\begin{bmatrix}
1 & 1 \\
1 & -1
\end{bmatrix}
\begin{bmatrix}
\sigma_1 & 0 \\
0 & \sigma_2 \\
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2
\end{bmatrix}
\,\otimes\,
\begin{bmatrix}
1 & 1 \\
1 & -1
\end{bmatrix}
\begin{bmatrix}
\rho_1 & 0 \\
0 & \rho_2 \\
\end{bmatrix}
\begin{bmatrix}
y_1 \\
y_2
\end{bmatrix}
\right)
\\[5pt]
&\quad=
\begin{bmatrix}
1 & 0 \\
0 & 1 \\
1 & 0
\end{bmatrix}
\begin{bmatrix}
1 & 1 \\
1 & -1
\end{bmatrix}
\begin{bmatrix}
\sigma_1 & 0 \\
0 & \sigma_2 \\
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2
\end{bmatrix}
\,\circ\,
\begin{bmatrix}
1 & 0 \\
1 & 0 \\
0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 1 \\
1 & -1
\end{bmatrix}
\begin{bmatrix}
\rho_1 & 0 \\
0 & \rho_2 \\
\end{bmatrix}
\begin{bmatrix}
y_1 \\
y_2
\end{bmatrix}
.
\end{align}
</math>
In other words, <math>\mathrm{SHD}(x \otimes y)</math> factors into the element-wise product of two smaller transforms of the same type, one applied to <math>x</math> and one applied to <math>y</math>, each of which can be computed in time <math>O(d \log d)</math>.
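The worked example can be reproduced numerically. The following minimal NumPy sketch (illustrative only) verifies that applying <math>\mathrm{SHD}</math> to <math>x \otimes y</math> gives the same result as the element-wise product of the two smaller transforms:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

H2 = np.array([[1.0, 1.0], [1.0, -1.0]])  # 2×2 Hadamard matrix
H4 = np.kron(H2, H2)                      # 4×4 Hadamard matrix

S1 = np.array([[1.0, 0], [0, 1], [1, 0]])  # small sampling matrices, one 1 per row
S2 = np.array([[1.0, 0], [1, 0], [0, 1]])
S = np.einsum('ij,ik->ijk', S1, S2).reshape(3, 4)  # face-splitting product S1 • S2

sigma = rng.choice([-1.0, 1.0], 2)  # random signs for x
rho = rng.choice([-1.0, 1.0], 2)    # random signs for y
D = np.diag(np.kron(sigma, rho))    # tensor-structured diagonal

x, y = rng.standard_normal(2), rng.standard_normal(2)

full = S @ H4 @ D @ np.kron(x, y)  # apply SHD to the 4-dimensional vector x ⊗ y
split = (S1 @ H2 @ np.diag(sigma) @ x) * (S2 @ H2 @ np.diag(rho) @ y)
print(np.allclose(full, split))    # True
</syntaxhighlight>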
The same approach can be extended to compute higher-degree products, such as <math>x \otimes y \otimes z</math>.
Ahle et al. show that if the sampling matrix <math>S</math> keeps sufficiently many rows, this tensorized fast Johnson–Lindenstrauss transform is still a valid dimensionality reduction for tensor-structured vectors.
Jin et al.,<ref>Jin, Ruhui, Tamara G. Kolda, and Rachel Ward. "Faster Johnson–Lindenstrauss Transforms via Kronecker Products." arXiv preprint arXiv:1909.04801 (2019).</ref> the same year, showed a similar result for the more general class of matrices with the restricted isometry property (RIP), which includes the subsampled Hadamard matrices.
They showed that these matrices allow splitting into tensors, provided the number of rows is increased by a factor that depends on the order of the tensor product.
In the case <math>c = 2</math>, the two lines of work give similar guarantees.
These fast constructions can again be combined with the recursion approach mentioned above, giving the fastest overall tensor sketch.
== Data aware sketching ==
It is also possible to do so-called "data aware" tensor sketching.
Instead of multiplying the data by a random matrix, the data points are sampled independently with a certain probability depending on the norm of the point.<ref>{{cite conference
| title = Fast and Guaranteed Tensor Decomposition via Sketching
| first1 = Yining
| last1 = Wang
| first2 = Hsiao-Yu
| last2 = Tung
| first3 = Alexander
| last3 = Smola
| first4 = Anima
| last4 = Anandkumar
| arxiv = 1506.04448
| conference = Advances in Neural Information Processing Systems 28 (NIPS 2015)
}}</ref>
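The general idea (not the specific algorithm of the cited paper) can be illustrated by sampling rows of a data matrix with probability proportional to their squared norm and reweighting, which gives an unbiased estimate of <math>A^T A</math> from a small subset of the rows:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5000, 20, 400  # number of points, dimension, sample size (illustrative)

A = rng.standard_normal((n, d)) * rng.gamma(2.0, size=(n, 1))  # rows with uneven norms

# Sample rows with probability proportional to their squared norm and reweight.
p = (A ** 2).sum(axis=1)
p /= p.sum()
idx = rng.choice(n, size=m, p=p)
sample = A[idx] / np.sqrt(m * p[idx])[:, None]

err = np.linalg.norm(sample.T @ sample - A.T @ A) / np.linalg.norm(A.T @ A)
print(err)  # small relative error that shrinks as m grows
</syntaxhighlight>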
== Applications ==
=== Explicit polynomial kernels ===
Kernel methods are popular in machine learning as they give the algorithm designer the freedom to design a "feature space" in which to measure the similarity of their data points.
A simple kernel-based binary classifier is based on the following computation:
:<math>\hat{y}(\mathbf{x'}) = \sgn \sum_{i=1}^n y_i k(\mathbf{x}_i, \mathbf{x'}),</math>
where <math>\mathbf{x}_i \in \mathbb{R}^d</math> are the data points, <math>y_i \in \{-1, +1\}</math> is the label of the <math>i</math>th point, and <math>\hat{y}(\mathbf{x'})</math> is the predicted class of the new point <math>\mathbf{x'}</math>.
The function <math>k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}</math> is the kernel.
Typical examples are the radial basis function kernel, <math>k(x, x') = \exp(-\|x - x'\|_2^2)</math>, and polynomial kernels such as <math>k(x, x') = (1 + \langle x, x' \rangle)^2</math>.
When used this way, the kernel method is called "implicit".
Sometimes it is faster to do an "explicit" kernel method, in which a pair of functions <math>f, g : \mathbb{R}^d \to \mathbb{R}^D</math> is computed, such that <math>k(\mathbf{x}, \mathbf{x'}) = \langle f(\mathbf{x}), g(\mathbf{x'}) \rangle</math>.
This allows the above computation to be expressed as
:<math>
\hat{y}(\mathbf{x'})
= \sgn \sum_{i=1}^n y_i \langle f(\mathbf{x}_i), g(\mathbf{x'})\rangle
= \sgn \left\langle\left(\sum_{i=1}^n y_i f(\mathbf{x}_i)\right), g(\mathbf{x'})\right\rangle,
</math>
where the value <math>\sum_{i=1}^n y_i f(\mathbf{x}_i)</math> can be precomputed, so that classifying a new point <math>\mathbf{x'}</math> only requires computing <math>g(\mathbf{x'})</math> and a single inner product.
The problem with this method is that the feature space can be very large. That is, <math>D</math> can be much larger than <math>d</math>.
For example, for the polynomial kernel <math>k(x, x') = \langle x, x' \rangle^3</math> we have <math>f(x) = g(x) = x \otimes x \otimes x</math> and <math>D = d^3</math>, where <math>\otimes</math> is the tensor product.
If <math>d</math> is large, <math>d^3</math> quickly becomes unmanageable, and the explicit method loses its advantage over the implicit one.
The idea of tensor sketch is that we can compute approximate functions <math>f', g' : \mathbb{R}^d \to \mathbb{R}^t</math>, where <math>t</math> is much smaller than <math>D</math>, which still satisfy <math>\langle f'(\mathbf{x}), g'(\mathbf{x'}) \rangle \approx k(\mathbf{x}, \mathbf{x'})</math>.
This method was shown in 2020<ref>{{cite conference
| title = Oblivious Sketching of High-Degree Polynomial Kernels
| first1 = Thomas
| last1 = Ahle
| first2 = Michael
| last2 = Kapralov
| first3 = Jakob
| last3 = Knudsen
| first4 = Rasmus
| last4 = Pagh | author4-link = Rasmus Pagh
| first5 = Ameya
| last5 = Velingker
| first6 = David
| last6 = Woodruff
| first7 = Amir
| last7 = Zandieh
| date = 2020
| publisher = Association for Computing Machinery
| conference = ACM-SIAM Symposium on Discrete Algorithms
|doi = 10.1137/1.9781611975994.9| doi-access = free
| arxiv = 1909.01410
}}</ref> to work even for high-degree polynomials and radial basis function kernels.
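As an illustration (a minimal sketch, not the algorithm of the cited paper verbatim), the degree-2 polynomial kernel <math>\langle \mathbf{x}, \mathbf{x'} \rangle^2</math> can be approximated by applying the FFT-based tensor sketch above to <math>x \otimes x</math>; all dimensions and names below are arbitrary:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 2000  # input dimension and sketch size (illustrative)

# Two independent count sketches, shared by all data points.
h = [rng.integers(0, k, d) for _ in range(2)]
s = [rng.choice([-1.0, 1.0], d) for _ in range(2)]

def tensor_sketch(x):
    # Approximate feature map for the degree-2 polynomial kernel <x, x'>^2.
    transforms = []
    for hi, si in zip(h, s):
        cs = np.zeros(k)
        np.add.at(cs, hi, si * x)          # count sketch of x
        transforms.append(np.fft.rfft(cs))
    return np.fft.irfft(transforms[0] * transforms[1], k)  # sketch of x ⊗ x

x = rng.standard_normal(d)
y = x + 0.1 * rng.standard_normal(d)  # a second, correlated data point

exact = np.dot(x, y) ** 2             # <x ⊗ x, y ⊗ y> = <x, y>^2
approx = np.dot(tensor_sketch(x), tensor_sketch(y))
print(exact, approx)                  # typically agree to within a few percent
</syntaxhighlight>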
=== Compressed matrix multiplication ===
Assume we have two large datasets, represented as matrices <math>X, Y \in \mathbb{R}^{n \times d}</math>, and we want to find the rows <math>i, j</math> with the largest inner products <math>\langle X_i, Y_j \rangle</math>.
We could compute <math>Z = X Y^T \in \mathbb{R}^{n \times n}</math> and simply inspect all <math>n^2</math> entries.
However, this would take at least <math>n^2</math> time, and possibly closer to <math>n^2 d</math> using standard matrix multiplication techniques.
The idea of compressed matrix multiplication is the general identity
:<math>X Y^T = \sum_{c=1}^d X_c \otimes Y_c,</math>
where <math>X_c</math> and <math>Y_c</math> denote the <math>c</math>th columns of <math>X</math> and <math>Y</math>, and <math>\otimes</math> is the outer (tensor) product.
Since we can compute a (linear) sketch of each <math>X_c \otimes Y_c</math> efficiently, we can sum those sketches up and obtain a compressed approximation of <math>X Y^T</math> in which large entries can still be identified.
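The following NumPy sketch (an illustration with arbitrary sizes, not the algorithm of the paper verbatim) computes such a compressed product from column count sketches and checks it against a sketch of the full product; it also shows how a large entry of <math>X Y^T</math> can be estimated from the compressed vector:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 30, 10, 512  # rows, columns, sketch size (illustrative)
X, Y = rng.standard_normal((n, d)), rng.standard_normal((n, d))
X[0, :] = 5.0          # plant one pair of rows with a large inner product
Y[0, :] = 5.0

h1, h2 = rng.integers(0, k, n), rng.integers(0, k, n)
s1, s2 = rng.choice([-1.0, 1.0], n), rng.choice([-1.0, 1.0], n)

def count_sketch(v, h, s):
    out = np.zeros(k)
    np.add.at(out, h, s * v)
    return out

# Compressed product: sum over columns c of the tensor sketch of the outer
# product X_c ⊗ Y_c, computed with FFTs and never forming X Y^T.
compressed = np.zeros(k)
for c in range(d):
    compressed += np.fft.irfft(np.fft.rfft(count_sketch(X[:, c], h1, s1)) *
                               np.fft.rfft(count_sketch(Y[:, c], h2, s2)), k)

# The same sketch computed directly from the full n × n product.
Z = X @ Y.T
direct = np.zeros(k)
for i in range(n):
    for j in range(n):
        direct[(h1[i] + h2[j]) % k] += s1[i] * s2[j] * Z[i, j]
print(np.allclose(compressed, direct))  # True

# A single entry of X Y^T can be estimated from the compressed vector alone.
print(Z[0, 0], s1[0] * s2[0] * compressed[(h1[0] + h2[0]) % k])  # 250 and a rough estimate of it
</syntaxhighlight>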
=== Compact multilinear pooling ===
[[File:Multimodal Compact Multilinear Pooling.png|thumb|Multimodal compact multilinear pooling.]]
Bilinear pooling is the technique of taking two input vectors, <math>x, y</math>, from different sources, and using the tensor product <math>x \otimes y</math> as the combined feature vector.
In<ref>Gao, Yang, et al. "[https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gao_Compact_Bilinear_Pooling_CVPR_2016_paper.pdf Compact bilinear pooling] {{Webarchive|url=https://web.archive.org/web/20220120115421/https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gao_Compact_Bilinear_Pooling_CVPR_2016_paper.pdf |date=2022-01-20 }}." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.</ref> the authors considered using tensor sketch to reduce the number of variables needed.
In 2017 another paper<ref>Algashaam, Faisal M., et al. "[https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7990127 Multispectral periocular classification with multimodal compact multi-linear pooling]." IEEE Access 5 (2017): 14572–14578.</ref> took the FFT of the input features before combining them using the element-wise product.
This again corresponds to the original tensor sketch.
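A minimal NumPy sketch of such a pooling step (illustrative only; it is not the exact layer used in either paper, and all sizes are arbitrary): the two feature vectors are count-sketched, transformed with the FFT, multiplied element-wise and transformed back.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k = 512, 512, 4096  # feature dimensions and pooled dimension (illustrative)

# Fixed random hashes and signs, chosen once when the pooling layer is created.
h1, h2 = rng.integers(0, k, d1), rng.integers(0, k, d2)
s1, s2 = rng.choice([-1.0, 1.0], d1), rng.choice([-1.0, 1.0], d2)

def count_sketch(v, h, s):
    out = np.zeros(k)
    np.add.at(out, h, s * v)
    return out

def compact_bilinear_pool(x, y):
    # Approximates the flattened outer product x ⊗ y in only k dimensions.
    return np.fft.irfft(np.fft.rfft(count_sketch(x, h1, s1)) *
                        np.fft.rfft(count_sketch(y, h2, s2)), k)

# Two feature vectors, e.g. produced by two different networks or modalities.
x, y = rng.standard_normal(d1), rng.standard_normal(d2)
phi = compact_bilinear_pool(x, y)
print(phi.shape)  # (4096,) instead of the 262144-dimensional x ⊗ y
</syntaxhighlight>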
== References ==
{{Reflist}}
== Further reading ==
* {{Cite web |last1=Ahle |first1=Thomas |last2=Knudsen |first2=Jakob |date=2019-09-03 |title=Almost Optimal Tensor Sketch |url=https://www.researchgate.net/publication/335617805 |access-date=2020-07-11 |website=ResearchGate}}
* {{Cite journal|last=Slyusar|first=V. I. |title=End products in matrices in radar applications. |url=http://slyusar.kiev.ua/en/IZV_1998_3.pdf|journal=Radioelectronics and Communications Systems |year=1998 |volume=41 |issue=3|pages=50–53}}
* {{Cite journal|last=Slyusar|first=V. I.|date=1997-05-20|title=Analytical model of the digital antenna array on a basis of face-splitting matrix products. |url=http://slyusar.kiev.ua/ICATT97.pdf|journal=Proc. ICATT-97, Kyiv|pages=108–109}}
* {{Cite journal|last=Slyusar|first=V. I.|date=1997-09-15|title=New operations of matrices product for applications of radars|url=http://slyusar.kiev.ua/DIPED_1997.pdf|journal=Proc. Direct and Inverse Problems of Electromagnetic and Acoustic Wave Theory (DIPED-97), Lviv.|pages=73–74}}
* {{Cite journal|last=Slyusar|first=V. I.|date=March 13, 1998|title=A Family of Face Products of Matrices and its Properties|url=http://slyusar.kiev.ua/FACE.pdf|journal=Cybernetics and Systems Analysis C/C of Kibernetika I Sistemnyi Analiz. – 1999.|volume=35|issue=3|pages=379–384|doi=10.1007/BF02733426|s2cid=119661450 }}