Gated recurrent unit
{{short description|Memory unit used in neural networks}}
{{Machine learning}}
Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.{{Cite arXiv |last1=Cho |first1=Kyunghyun |last2=van Merrienboer |first2=Bart |last3=Bahdanau |first3=Dzmitry |last4=Bougares |first4=Fethi |last5=Schwenk |first5=Holger |last6=Bengio |first6=Yoshua |date=2014 |title=Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation |class=cs.CL |eprint=1406.1078}} The GRU is like a long short-term memory (LSTM) with a gating mechanism to input or forget certain features,{{Cite book | author = Felix Gers | author2 = Jürgen Schmidhuber | author3 = Fred Cummins | title = 9th International Conference on Artificial Neural Networks: ICANN '99 | chapter = Learning to forget: Continual prediction with LSTM | volume = 1999 | pages = 850–855 | year = 1999| url = https://ieeexplore.ieee.org/document/818041| author2-link = Jürgen Schmidhuber | author-link = Felix Gers | doi = 10.1049/cp:19991218 | isbn = 0-85296-721-7 }} but lacks a context vector or output gate, resulting in fewer parameters than LSTM.{{cite web |url=http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/ |title=Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano – WildML |newspaper=Wildml.com |date=2015-10-27 |access-date=May 18, 2016 |archive-date=2021-11-10 |archive-url=https://web.archive.org/web/20211110112626/http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/ |url-status=dead }}
GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of LSTM.{{cite journal |arxiv=1803.10225 |title=Light Gated Recurrent Units for Speech Recognition |last1=Ravanelli |first1=Mirco|last2=Brakel |first2=Philemon |last3=Omologo |first3=Maurizio |last4=Bengio |first4=Yoshua |author4-link=Yoshua Bengio |journal=IEEE Transactions on Emerging Topics in Computational Intelligence |year= 2018|volume=2 |issue=2 |pages=92–102 |doi=10.1109/TETCI.2017.2762739 |s2cid=4402991 }}{{cite journal |arxiv= 1803.01686 |title=On extended long short-term memory and dependent bidirectional recurrent neural network |last1=Su |first1=Yuahang |last2=Kuo |first2=Jay |journal=Neurocomputing |year= 2019|volume=356 |pages=151–161 |doi=10.1016/j.neucom.2019.04.044 |s2cid=3675055 }} GRUs showed that gating is indeed helpful in general, and Bengio's team came to no concrete conclusion on which of the two gating units was better.{{cite arXiv |eprint=1412.3555|title=Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling|last1=Chung |first1=Junyoung |last2=Gulcehre |first2=Caglar |last3=Cho |first3=KyungHyun |last4=Bengio |first4=Yoshua |class=cs.NE |year=2014 }} {{citation |title=Are GRU cells more specific and LSTM cells more sensitive in motive classification of text? |last1=Gruber |first1=N.|last2=Jockisch |first2=A. |year=2020 |journal=Frontiers in Artificial Intelligence |volume=3 |page=40 | doi = 10.3389/frai.2020.00040|pmid=33733157 |pmc=7861254 |s2cid=220252321 |doi-access=free }}
Architecture
There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, and a simplified form called minimal gated unit.{{cite arXiv |eprint=1412.3555|title=Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling|last1=Chung |first1=Junyoung |last2=Gulcehre |first2=Caglar |last3=Cho |first3=KyungHyun |last4=Bengio |first4=Yoshua |class=cs.NE |year=2014 }}
The operator <math>\odot</math> denotes the Hadamard product in the following.
= Fully gated unit =
File:Gradient Recurrent Unit.svg
Initially, for <math>t = 0</math>, the output vector is <math>h_0 = 0</math>.
:<math>\begin{align}
z_t &= \sigma(W_{z} x_t + U_{z} h_{t-1} + b_z) \\
r_t &= \sigma(W_{r} x_t + U_{r} h_{t-1} + b_r) \\
\hat{h}_t &= \phi(W_{h} x_t + U_{h} (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1-z_t) \odot h_{t-1} + z_t \odot \hat{h}_t
\end{align}</math>
Variables (<math>d</math> denotes the number of input features and <math>e</math> the number of output features):
- <math>x_t \in \mathbb{R}^{d}</math>: input vector
- <math>h_t \in \mathbb{R}^{e}</math>: output vector
- <math>\hat{h}_t \in \mathbb{R}^{e}</math>: candidate activation vector
- <math>z_t \in (0,1)^{e}</math>: update gate vector
- <math>r_t \in (0,1)^{e}</math>: reset gate vector
- <math>W \in \mathbb{R}^{e \times d}</math>, <math>U \in \mathbb{R}^{e \times e}</math> and <math>b \in \mathbb{R}^{e}</math>: parameter matrices and vector which need to be learned during training
- <math>\sigma</math>: the original is a logistic (sigmoid) function.
- <math>\phi</math>: the original is a hyperbolic tangent.
Alternative activation functions are possible, provided that <math>\sigma(x) \in [0, 1]</math>.
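For illustration, a single forward step of the fully gated unit can be sketched in NumPy as follows; the function and parameter names are chosen for this example only and do not refer to any particular library.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h):
    """One forward step of a fully gated GRU, following the equations above.

    x_t has shape (d,), h_prev has shape (e,); each W has shape (e, d),
    each U has shape (e, e), and each b has shape (e,).
    """
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)            # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)            # reset gate
    h_hat = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)  # candidate activation
    return (1.0 - z_t) * h_prev + z_t * h_hat                # new hidden state h_t

# Example usage with random (untrained) parameters.
d, e = 4, 3
rng = np.random.default_rng(0)
shapes = [(e, d), (e, e), (e,)] * 3          # W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h
params = [rng.standard_normal(s) for s in shapes]
h = np.zeros(e)                              # h_0 = 0
for x in rng.standard_normal((5, d)):        # a sequence of five input vectors
    h = gru_step(x, h, *params)
</syntaxhighlight>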
File:Gated Recurrent Unit, type 1.svg
File:Gradient Recurrent Unit, type 2.svg
File:Gradient Recurrent Unit, type 3.svg
Alternate forms can be created by changing <math>z_t</math> and <math>r_t</math>{{cite arXiv|last1=Dey|first1=Rahul|last2=Salem|first2=Fathi M.|date=2017-01-20|title=Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks|eprint=1701.05923 |class=cs.NE}} (a code sketch of these gate variants follows the list):
- Type 1: each gate depends only on the previous hidden state and the bias.
- :<math>\begin{align}
z_t &= \sigma(U_{z} h_{t-1} + b_z) \\
r_t &= \sigma(U_{r} h_{t-1} + b_r)
\end{align}</math>
- Type 2: each gate depends only on the previous hidden state.
- :<math>\begin{align}
z_t &= \sigma(U_{z} h_{t-1}) \\
r_t &= \sigma(U_{r} h_{t-1})
\end{align}</math>
- Type 3: each gate is computed using only the bias.
- :<math>\begin{align}
z_t &= \sigma(b_z) \\
r_t &= \sigma(b_r)
\end{align}</math>
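The sketch below illustrates how only the gate computations change between the three variants, while the candidate activation and the state update remain as in the fully gated unit; the helper names are again chosen for this example only.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Only the update/reset gate computations differ; the rest of the GRU step is unchanged.
def gates_type1(h_prev, U_z, b_z, U_r, b_r):   # previous hidden state and bias
    return sigmoid(U_z @ h_prev + b_z), sigmoid(U_r @ h_prev + b_r)

def gates_type2(h_prev, U_z, U_r):             # previous hidden state only
    return sigmoid(U_z @ h_prev), sigmoid(U_r @ h_prev)

def gates_type3(b_z, b_r):                     # bias only
    return sigmoid(b_z), sigmoid(b_r)
</syntaxhighlight>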
= Minimal gated unit =
The minimal gated unit (MGU) is similar to the fully gated unit, except that the update and reset gate vectors are merged into a single forget gate. This also implies that the equation for the output vector must be changed:{{cite arXiv|last1=Heck|first1=Joel|last2=Salem|first2=Fathi M.|date=2017-01-12|title=Simplified Minimal Gated Unit Variations for Recurrent Neural Networks|eprint=1701.03452 |class=cs.NE}}
:<math>\begin{align}
f_t &= \sigma(W_{f} x_t + U_{f} h_{t-1} + b_f) \\
\hat{h}_t &= \phi(W_{h} x_t + U_{h} (f_t \odot h_{t-1}) + b_h) \\
h_t &= (1-f_t) \odot h_{t-1} + f_t \odot \hat{h}_t
\end{align}</math>
Variables:
- <math>x_t</math>: input vector
- <math>h_t</math>: output vector
- <math>\hat{h}_t</math>: candidate activation vector
- <math>f_t</math>: forget vector
- <math>W</math>, <math>U</math> and <math>b</math>: parameter matrices and vector
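A corresponding NumPy sketch of one MGU forward step, analogous to the GRU sketch above (names are illustrative only):

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgu_step(x_t, h_prev, W_f, U_f, b_f, W_h, U_h, b_h):
    """One forward step of a minimal gated unit: a single forget gate
    takes the place of the GRU's update and reset gates."""
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)            # forget gate
    h_hat = np.tanh(W_h @ x_t + U_h @ (f_t * h_prev) + b_h)  # candidate activation
    return (1.0 - f_t) * h_prev + f_t * h_hat                # new hidden state h_t
</syntaxhighlight>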
= Light gated recurrent unit =
The light gated recurrent unit (LiGRU) removes the reset gate altogether, replaces tanh with the ReLU activation, and applies batch normalization (BN):
:<math>\begin{align}
z_t &= \sigma(\operatorname{BN}(W_z x_t) + U_z h_{t-1}) \\
\tilde{h}_t &= \operatorname{ReLU}(\operatorname{BN}(W_h x_t) + U_h h_{t-1}) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{align}</math>
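Because batch normalization is computed over a minibatch, a sketch of one LiGRU step operates on batched inputs. The following NumPy example uses training-mode batch statistics and omits the learned scale and shift of full batch normalization; all names are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_norm(a, eps=1e-5):
    # Normalize each feature over the batch dimension (training-mode statistics;
    # the learned scale and shift of full batch normalization are omitted here).
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

def ligru_step(X_t, H_prev, W_z, U_z, W_h, U_h):
    """One LiGRU step on a minibatch: X_t has shape (batch, d), H_prev has shape (batch, e)."""
    z_t = sigmoid(batch_norm(X_t @ W_z.T) + H_prev @ U_z.T)             # update gate
    h_cand = np.maximum(0.0, batch_norm(X_t @ W_h.T) + H_prev @ U_h.T)  # ReLU candidate, no reset gate
    return z_t * H_prev + (1.0 - z_t) * h_cand                          # new hidden state
</syntaxhighlight>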
LiGRU has been studied from a Bayesian perspective.{{cite conference |url=https://ieeexplore.ieee.org/document/9414259 |title=A Bayesian Interpretation of the Light Gated Recurrent Unit |last1=Bittar |first1=Alexandre |last2=Garner |first2=Philip N. |date=May 2021 |publisher=IEEE |book-title=ICASSP 2021 |pages=2965–2969 |location=Toronto, ON, Canada |conference=2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |doi=10.1109/ICASSP39728.2021.9414259}} This analysis yielded a variant called the light Bayesian recurrent unit (LiBRU), which showed slight improvements over the LiGRU on speech recognition tasks.