Gated recurrent unit
{{short description|Memory unit used in neural networks}}
{{Machine learning}}
Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.{{Cite arXiv |last1=Cho |first1=Kyunghyun |last2=van Merrienboer |first2=Bart |last3=Bahdanau |first3=Dzmitry |last4=Bougares |first4=Fethi |last5=Schwenk |first5=Holger |last6=Bengio |first6=Yoshua |date=2014 |title=Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation |class=cs.CL |eprint=1406.1078}} The GRU is like a long short-term memory (LSTM) with a gating mechanism to input or forget certain features,{{Cite book | author = Felix Gers | author2 = Jürgen Schmidhuber | author3 = Fred Cummins | title = 9th International Conference on Artificial Neural Networks: ICANN '99 | chapter = Learning to forget: Continual prediction with LSTM | volume = 1999 | pages = 850–855 | year = 1999| url = https://ieeexplore.ieee.org/document/818041| author2-link = Jürgen Schmidhuber | author-link = Felix Gers | doi = 10.1049/cp:19991218 | isbn = 0-85296-721-7 }} but lacks a context vector or output gate, resulting in fewer parameters than LSTM.{{cite web |url=http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/ |title=Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano – WildML |newspaper=Wildml.com |date=2015-10-27 |access-date=May 18, 2016 |archive-date=2021-11-10 |archive-url=https://web.archive.org/web/20211110112626/http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/ |url-status=dead }}
GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of LSTM.{{cite journal |arxiv=1803.10225 |title=Light Gated Recurrent Units for Speech Recognition |last1=Ravanelli |first1=Mirco|last2=Brakel |first2=Philemon |last3=Omologo |first3=Maurizio |last4=Bengio |first4=Yoshua |author4-link=Yoshua Bengio |journal=IEEE Transactions on Emerging Topics in Computational Intelligence |year= 2018|volume=2 |issue=2 |pages=92–102 |doi=10.1109/TETCI.2017.2762739 |s2cid=4402991 }}{{cite journal |arxiv= 1803.01686 |title=On extended long short-term memory and dependent bidirectional recurrent neural network |last1=Su |first1=Yuahang |last2=Kuo |first2=Jay |journal=Neurocomputing |year= 2019|volume=356 |pages=151–161 |doi=10.1016/j.neucom.2019.04.044 |s2cid=3675055 }} GRUs showed that gating is indeed helpful in general, and Bengio's team came to no concrete conclusion on which of the two gating units was better.{{cite arXiv |eprint=1412.3555|title=Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling|last1=Chung |first1=Junyoung |last2=Gulcehre |first2=Caglar |last3=Cho |first3=KyungHyun |last4=Bengio |first4=Yoshua |class=cs.NE |year=2014 }} {{citation |title=Are GRU cells more specific and LSTM cells more sensitive in motive classification of text? |last1=Gruber |first1=N.|last2=Jockisch |first2=A. |year=2020 |journal=Frontiers in Artificial Intelligence |volume=3 |page=40 | doi = 10.3389/frai.2020.00040|pmid=33733157 |pmc=7861254 |s2cid=220252321 |doi-access=free }}
Architecture
There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, and a simplified form called minimal gated unit.{{cite arXiv |eprint=1412.3555|title=Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling|last1=Chung |first1=Junyoung |last2=Gulcehre |first2=Caglar |last3=Cho |first3=KyungHyun |last4=Bengio |first4=Yoshua |class=cs.NE |year=2014 }}
The operator <math>\odot</math> denotes the Hadamard product in the following.
= Fully gated unit =
File:Gradient Recurrent Unit.svg
Initially, for <math>t = 0</math>, the output vector is <math>h_0 = 0</math>.
:<math>\begin{align}
z_t &= \sigma(W_{z} x_t + U_{z} h_{t-1} + b_z) \\
r_t &= \sigma(W_{r} x_t + U_{r} h_{t-1} + b_r) \\
\hat{h}_t &= \phi(W_{h} x_t + U_{h} (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1-z_t) \odot h_{t-1} + z_t \odot \hat{h}_t
\end{align}</math>
Variables (<math>d</math> denotes the number of input features and <math>e</math> the number of output features):
- <math>x_t \in \mathbb{R}^{d}</math>: input vector
- <math>h_t \in \mathbb{R}^{e}</math>: output vector
- <math>\hat{h}_t \in \mathbb{R}^{e}</math>: candidate activation vector
- <math>z_t \in (0,1)^{e}</math>: update gate vector
- <math>r_t \in (0,1)^{e}</math>: reset gate vector
- <math>W \in \mathbb{R}^{e \times d}</math>, <math>U \in \mathbb{R}^{e \times e}</math> and <math>b \in \mathbb{R}^{e}</math>: parameter matrices and vector which need to be learned during training
- <math>\sigma</math>: the original is a logistic (sigmoid) function.
- <math>\phi</math>: the original is a hyperbolic tangent.
Alternative activation functions are possible, provided that <math>\sigma(x) \in [0, 1]</math>.
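For illustration, a single forward step of the fully gated unit can be sketched in NumPy as follows; the function and parameter names are chosen for this example only and do not refer to any particular library.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h):
    """One forward step of a fully gated GRU, following the equations above.

    x_t has shape (d,), h_prev has shape (e,); each W has shape (e, d),
    each U has shape (e, e), and each b has shape (e,).
    """
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)            # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)            # reset gate
    h_hat = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)  # candidate activation
    return (1.0 - z_t) * h_prev + z_t * h_hat                # new hidden state h_t

# Example usage with random (untrained) parameters.
d, e = 4, 3
rng = np.random.default_rng(0)
shapes = [(e, d), (e, e), (e,)] * 3          # W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h
params = [rng.standard_normal(s) for s in shapes]
h = np.zeros(e)                              # h_0 = 0
for x in rng.standard_normal((5, d)):        # a sequence of five input vectors
    h = gru_step(x, h, *params)
</syntaxhighlight>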
File:Gated Recurrent Unit, type 1.svg
File:Gradient Recurrent Unit, type 2.svg
File:Gradient Recurrent Unit, type 3.svg
Alternate forms can be created by changing <math>z_t</math> and <math>r_t</math>{{cite arXiv|last1=Dey|first1=Rahul|last2=Salem|first2=Fathi M.|date=2017-01-20|title=Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks|eprint=1701.05923 |class=cs.NE}} (a code sketch of these gate variants follows the list):
- Type 1: each gate depends only on the previous hidden state and the bias.
- :<math>\begin{align}
z_t &= \sigma(U_{z} h_{t-1} + b_z) \\
r_t &= \sigma(U_{r} h_{t-1} + b_r)
\end{align}</math>
- Type 2: each gate depends only on the previous hidden state.
- :<math>\begin{align}
z_t &= \sigma(U_{z} h_{t-1}) \\
r_t &= \sigma(U_{r} h_{t-1})
\end{align}</math>
- Type 3: each gate is computed using only the bias.
- :<math>\begin{align}
z_t &= \sigma(b_z) \\
r_t &= \sigma(b_r)
\end{align}</math>
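The sketch below illustrates how only the gate computations change between the three variants, while the candidate activation and the state update remain as in the fully gated unit; the helper names are again chosen for this example only.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Only the update/reset gate computations differ; the rest of the GRU step is unchanged.
def gates_type1(h_prev, U_z, b_z, U_r, b_r):   # previous hidden state and bias
    return sigmoid(U_z @ h_prev + b_z), sigmoid(U_r @ h_prev + b_r)

def gates_type2(h_prev, U_z, U_r):             # previous hidden state only
    return sigmoid(U_z @ h_prev), sigmoid(U_r @ h_prev)

def gates_type3(b_z, b_r):                     # bias only
    return sigmoid(b_z), sigmoid(b_r)
</syntaxhighlight>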
= Minimal gated unit =
The minimal gated unit (MGU) is similar to the fully gated unit, except that the update and reset gate vectors are merged into a single forget gate. This also implies that the equation for the output vector must be changed:{{cite arXiv|last1=Heck|first1=Joel|last2=Salem|first2=Fathi M.|date=2017-01-12|title=Simplified Minimal Gated Unit Variations for Recurrent Neural Networks|eprint=1701.03452 |class=cs.NE}}
:<math>\begin{align}
f_t &= \sigma(W_{f} x_t + U_{f} h_{t-1} + b_f) \\
\hat{h}_t &= \phi(W_{h} x_t + U_{h} (f_t \odot h_{t-1}) + b_h) \\
h_t &= (1-f_t) \odot h_{t-1} + f_t \odot \hat{h}_t
\end{align}</math>
Variables:
- <math>x_t</math>: input vector
- <math>h_t</math>: output vector
- <math>\hat{h}_t</math>: candidate activation vector
- <math>f_t</math>: forget vector
- <math>W</math>, <math>U</math> and <math>b</math>: parameter matrices and vector
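A corresponding NumPy sketch of one MGU forward step, analogous to the GRU sketch above (names are illustrative only):

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgu_step(x_t, h_prev, W_f, U_f, b_f, W_h, U_h, b_h):
    """One forward step of a minimal gated unit: a single forget gate
    takes the place of the GRU's update and reset gates."""
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)            # forget gate
    h_hat = np.tanh(W_h @ x_t + U_h @ (f_t * h_prev) + b_h)  # candidate activation
    return (1.0 - f_t) * h_prev + f_t * h_hat                # new hidden state h_t
</syntaxhighlight>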
= Light gated recurrent unit =
The light gated recurrent unit (LiGRU) removes the reset gate altogether, replaces tanh with the ReLU activation, and applies batch normalization (BN):
:<math>\begin{align}
z_t &= \sigma(\operatorname{BN}(W_z x_t) + U_z h_{t-1}) \\
\tilde{h}_t &= \operatorname{ReLU}(\operatorname{BN}(W_h x_t) + U_h h_{t-1}) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{align}</math>
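Because batch normalization is computed over a minibatch, a sketch of one LiGRU step operates on batched inputs. The following NumPy example uses training-mode batch statistics and omits the learned scale and shift of full batch normalization; all names are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_norm(a, eps=1e-5):
    # Normalize each feature over the batch dimension (training-mode statistics;
    # the learned scale and shift of full batch normalization are omitted here).
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

def ligru_step(X_t, H_prev, W_z, U_z, W_h, U_h):
    """One LiGRU step on a minibatch: X_t has shape (batch, d), H_prev has shape (batch, e)."""
    z_t = sigmoid(batch_norm(X_t @ W_z.T) + H_prev @ U_z.T)             # update gate
    h_cand = np.maximum(0.0, batch_norm(X_t @ W_h.T) + H_prev @ U_h.T)  # ReLU candidate, no reset gate
    return z_t * H_prev + (1.0 - z_t) * h_cand                          # new hidden state
</syntaxhighlight>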
LiGRU has been studied from a Bayesian perspective.{{cite conference |url=https://ieeexplore.ieee.org/document/9414259 |title=A Bayesian Interpretation of the Light Gated Recurrent Unit |last1=Bittar |first1=Alexandre |last2=Garner |first2=Philip N. |date=May 2021 |publisher=IEEE |book-title=ICASSP 2021 |pages=2965–2969 |location=Toronto, ON, Canada |conference=2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |doi=10.1109/ICASSP39728.2021.9414259}} This analysis yielded a variant called the light Bayesian recurrent unit (LiBRU), which showed slight improvements over the LiGRU on speech recognition tasks.