Directed information


Directed information is an information theory measure that quantifies the information flow from the random string X^n = (X_1,X_2,\dots,X_n) to the random string Y^n = (Y_1,Y_2,\dots,Y_n). The term directed information was coined by James Massey and is defined as{{cite conference |last1=Massey |first1=James |title=Causality, Feedback And Directed Information |book-title=Proceedings 1990 International Symposium on Information Theory and its Applications, Waikiki, Hawaii, Nov. 27-30, 1990 |date=1990}}

:I(X^n\to Y^n) \triangleq \sum_{i=1}^n I(X^i;Y_i|Y^{i-1})

where I(X^{i};Y_i|Y^{i-1}) is the conditional mutual information I(X_1,X_2,\dots,X_{i};Y_i|Y_1,Y_2,\dots,Y_{i-1}).
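
As a concrete illustration, the sum above can be evaluated directly from a joint probability mass function by writing each conditional mutual information as a combination of joint entropies, I(X^i;Y_i|Y^{i-1}) = H(X^i,Y^{i-1}) + H(Y^i) - H(X^i,Y^i) - H(Y^{i-1}). The following Python sketch does this for small alphabets; the array layout, the helper names (entropy, directed_information) and the binary noisy-channel example are illustrative choices, not part of the definition.

<syntaxhighlight lang="python">
import numpy as np

def entropy(q):
    """Shannon entropy in bits of a probability array (zero entries are skipped)."""
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def directed_information(joint, n):
    """I(X^n -> Y^n) from a joint pmf whose axes are ordered (x_1..x_n, y_1..y_n)."""
    def marginal(kx, ky):
        # marginal distribution of (X^kx, Y^ky): sum out all remaining axes
        drop = tuple(ax for ax in range(2 * n) if not (ax < kx or n <= ax < n + ky))
        return joint.sum(axis=drop)
    di = 0.0
    for i in range(1, n + 1):
        # I(X^i ; Y_i | Y^{i-1}) = H(X^i,Y^{i-1}) + H(Y^i) - H(X^i,Y^i) - H(Y^{i-1})
        di += (entropy(marginal(i, i - 1)) + entropy(marginal(0, i))
               - entropy(marginal(i, i)) - entropy(marginal(0, i - 1)))
    return di

# Toy example (n = 2): X_1, X_2 i.i.d. uniform bits, Y_i a noisy copy of X_i (crossover 0.1)
p = np.zeros((2, 2, 2, 2))                         # axes: x1, x2, y1, y2
for x1, x2, y1, y2 in np.ndindex(2, 2, 2, 2):
    p[x1, x2, y1, y2] = 0.25 * (0.9 if y1 == x1 else 0.1) * (0.9 if y2 == x2 else 0.1)
print(directed_information(p, 2))                  # about 1.06 bits, i.e. 2*(1 - H_b(0.1))
</syntaxhighlight>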

Directed information has applications to problems where causality plays an important role such as the capacity of channels with feedback,{{cite thesis |type=Doctoral |last=Kramer |first=Gerhard |date=1998 |title=Directed information for channels with feedback |doi=10.3929/ethz-a-001988524 |publisher=ETH Zurich |hdl=20.500.11850/143796 |language=en}}{{cite thesis |type=Doctoral |last=Tatikonda |first=Sekhar Chandra |date=2000 |title=Control under communication constraints |url=https://dspace.mit.edu/handle/1721.1/16755 |publisher=Massachusetts Institute of Technology|hdl=1721.1/16755 }}{{cite journal |last1=Permuter |first1=Haim Henry|last2=Weissman|first2=Tsachy|last3=Goldsmith |first3=Andrea J.|title=Finite State Channels With Time-Invariant Deterministic Feedback|journal=IEEE Transactions on Information Theory |date=February 2009 |volume=55|issue=2 |pages=644–662 |doi=10.1109/TIT.2008.2009849|arxiv=cs/0608070|s2cid=13178}} capacity of discrete memoryless networks,{{cite journal |last1=Kramer |first1=G. |title=Capacity results for the discrete memoryless network|journal=IEEE Transactions on Information Theory|date=January 2003|volume=49|issue=1 |pages=4–21|doi=10.1109/TIT.2002.806135}} capacity of networks with in-block memory,{{cite journal |last1=Kramer |first1=Gerhard |title=Information Networks With In-Block Memory |journal=IEEE Transactions on Information Theory |date=April 2014 |volume=60 |issue=4 |pages=2105–2120 |doi=10.1109/TIT.2014.2303120|arxiv=1206.5389 |s2cid=16382644 }} gambling with causal side information,{{cite journal|last1=Permuter|first1=Haim H.|last2=Kim |first2=Young-Han|last3=Weissman |first3=Tsachy|title=Interpretations of Directed Information in Portfolio Theory, Data Compression, and Hypothesis Testing|journal=IEEE Transactions on Information Theory|date=June 2011|volume=57|issue=6 |pages=3248–3259|doi=10.1109/TIT.2011.2136270|arxiv=0912.4872 |s2cid=11722596}} compression with causal side information,{{cite journal|last1=Simeone|first1=Osvaldo |last2=Permuter|first2=Haim Henri|title=Source Coding When the Side Information May Be Delayed |journal=IEEE Transactions on Information Theory|date=June 2013 |volume=59|issue=6|pages=3607–3618 |doi=10.1109/TIT.2013.2248192|arxiv=1109.1293|s2cid=3211485}} real-time control communication settings,{{Cite journal |last=Sabag |first=Oron |last2=Tian |first2=Peida |last3=Kostina |first3=Victoria |last4=Hassibi |first4=Babak |date=September 2023 |title=Reducing the LQG Cost With Minimal Communication |url=https://ieeexplore.ieee.org/document/9942300/ |journal=IEEE Transactions on Automatic Control |volume=68 |issue=9 |pages=5258–5270 |doi=10.1109/TAC.2022.3220511 |issn=0018-9286|arxiv=2109.12246 }}{{cite journal |last1=Charalambous|first1=Charalambos D. |last2=Stavrou |first2=Photios A.|title=Directed Information on Abstract Spaces: Properties and Variational Equalities |journal=IEEE Transactions on Information Theory|date=August 2016|volume=62|issue=11|pages=6019–6052 |doi=10.1109/TIT.2016.2604846 |arxiv=1302.3971|s2cid=8107565}}{{cite journal |last1=Tanaka |first1=Takashi |last2=Esfahani |first2=Peyman Mohajerin |last3=Mitter |first3=Sanjoy K. 
|title=LQG Control With Minimum Directed Information: Semidefinite Programming Approach |journal=IEEE Transactions on Automatic Control |date=January 2018 |volume=63 |issue=1 |pages=37–52 |doi=10.1109/TAC.2017.2709618 |s2cid=1401958 |arxiv=1510.04214 |url=http://resolver.tudelft.nl/uuid:d9db1c11-fbfd-4c0c-b66f-f341b49fa61a}} and statistical physics.{{cite journal |last1=Vinkler |first1=Dror A |last2=Permuter |first2=Haim H |last3=Merhav |first3=Neri |title=Analogy between gambling and measurement-based work extraction |journal=Journal of Statistical Mechanics: Theory and Experiment |date=20 April 2016 |volume=2016 |issue=4 |page=043403 |doi=10.1088/1742-5468/2016/04/043403 |arxiv=1404.6788 |bibcode=2016JSMTE..04.3403V |s2cid=124719237}}

Causal conditioning

The essence of directed information is causal conditioning. The probability of x^n causally conditioned on y^n is defined as

:P(x^n||y^n) \triangleq \prod_{i=1}^n P(x_i|x^{i-1},y^{i}).

This is similar to the chain rule for conventional conditioning P(x^n|y^n) = \prod_{i=1}^n P(x_i|x^{i-1},y^{n}), except that one conditions only on the "past" and "present" symbols y^{i} rather than on the entire sequence y^{n}. To condition on "past" symbols only, one can introduce a delay by prepending a constant symbol:

:P(x^n||(0,y^{n-1})) \triangleq \prod_{i=1}^n P(x_i|x^{i-1},y^{i-1}).

It is common to abuse notation by writing P(x^n||y^{n-1}) for this expression, although formally all strings should have the same number of symbols.

One may also condition on multiple strings: P(x^n||y^n,z^n) \triangleq \prod_{i=1}^n P(x_i|x^{i-1},y^{i},z^{i}).
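
For small alphabets, these causally conditioned probabilities can be read off a joint distribution directly. The Python sketch below is a minimal illustration, assuming the joint pmf is stored as an array with axes (x_1,\dots,x_n,y_1,\dots,y_n); the function name causal_prob and its delay argument are illustrative, not standard notation.

<syntaxhighlight lang="python">
import numpy as np

def causal_prob(joint, x_seq, y_seq, delay=0):
    """P(x^n || y^n) for delay = 0, or P(x^n || (0, y^{n-1})) for delay = 1,
    evaluated from a joint pmf whose axes are ordered (x_1..x_n, y_1..y_n)."""
    n = len(x_seq)
    prob = 1.0
    for i in range(1, n + 1):
        k = i - delay                                # number of y-symbols conditioned on
        # marginal of (X^i, Y^k)
        keep = set(range(i)) | set(range(n, n + k))
        m = joint.sum(axis=tuple(ax for ax in range(2 * n) if ax not in keep))
        num = m[tuple(x_seq[:i]) + tuple(y_seq[:k])]                      # P(x^i, y^k)
        den = m.sum(axis=i - 1)[tuple(x_seq[:i - 1]) + tuple(y_seq[:k])]  # P(x^{i-1}, y^k)
        prob *= num / den                            # P(x_i | x^{i-1}, y^k)
    return prob

# Example with n = 2 and an arbitrary joint pmf over binary strings
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2)); p /= p.sum()           # axes: x1, x2, y1, y2
print(causal_prob(p, (0, 1), (1, 0)))                # P(x^2 || y^2)
print(causal_prob(p, (0, 1), (1, 0), delay=1))       # P(x^2 || (0, y^1))
</syntaxhighlight>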

=Causally conditioned entropy=

The causally conditioned entropy is defined as:

:H(X^n || Y^n)=\mathbf E\left[ -\log {P(X^n||Y^n)} \right]=\sum_{i=1}^n H(X_{i}|X^{i-1},Y^{i})

Similarly, one may causally condition on multiple strings and write

:H(X^n || Y^n,Z^n)=\mathbf E\left[ -\log {P(X^n||Y^n,Z^n)} \right].
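
Since H(X_i|X^{i-1},Y^{i}) = H(X^i,Y^i) - H(X^{i-1},Y^i), the causally conditioned entropy can be computed by telescoping joint entropies. The sketch below assumes the same array layout as above and, as a sanity check, confirms that H(X^n||Y^n) is never smaller than the ordinary conditional entropy H(X^n|Y^n), since causal conditioning uses only part of the side information.

<syntaxhighlight lang="python">
import numpy as np

def entropy(q):
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def causal_entropy(joint, n):
    """H(X^n || Y^n) = sum_i H(X_i | X^{i-1}, Y^i), from a joint pmf whose
    axes are ordered (x_1..x_n, y_1..y_n)."""
    def joint_entropy(kx, ky):
        # entropy of the marginal (X^kx, Y^ky)
        drop = tuple(ax for ax in range(2 * n) if not (ax < kx or n <= ax < n + ky))
        return entropy(joint.sum(axis=drop))
    # H(X_i | X^{i-1}, Y^i) = H(X^i, Y^i) - H(X^{i-1}, Y^i)
    return sum(joint_entropy(i, i) - joint_entropy(i - 1, i) for i in range(1, n + 1))

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2)); p /= p.sum()           # arbitrary joint P(x1, x2, y1, y2)
h_cc = causal_entropy(p, 2)                          # H(X^2 || Y^2)
h_c = entropy(p) - entropy(p.sum(axis=(0, 1)))       # H(X^2 | Y^2) = H(X^2,Y^2) - H(Y^2)
print(h_cc >= h_c)                                   # True: causal conditioning uses less side information
</syntaxhighlight>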

Properties

A decomposition rule for causal conditioning is

:P(x^n, y^n) = P(x^n||y^{n-1}) P(y^n||x^n).

This rule shows that the product of any pair of causally conditioned distributions P(x^n||y^{n-1}) and P(y^n||x^n) yields a valid joint distribution P(x^n, y^n).

The causal conditioning probability P(y^n||x^n) = \prod_{i=1}^n P(y_i|y^{i-1},x^{i}) is, for each x^n, a probability vector over y^n, i.e.,

:P(y^n||x^n)\geq 0 \quad\text{for all } (x^n,y^n) \quad\text{and}\quad \sum_{y^n} P(y^n||x^n)=1 \quad\text{for all } x^n.
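
The decomposition rule can be checked numerically for small alphabets. The sketch below verifies it for n = 2 and an arbitrary joint distribution; the binary alphabets and the array layout are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2)); p /= p.sum()           # arbitrary joint P(x1, x2, y1, y2)

for x1, x2, y1, y2 in np.ndindex(2, 2, 2, 2):
    # P(x^2 || y^1) = P(x1) * P(x2 | x1, y1)          (causal conditioning with a delay)
    px_cc = (p.sum(axis=(1, 2, 3))[x1]
             * p.sum(axis=3)[x1, x2, y1] / p.sum(axis=(1, 3))[x1, y1])
    # P(y^2 || x^2) = P(y1 | x1) * P(y2 | x1, x2, y1)
    py_cc = (p.sum(axis=(1, 3))[x1, y1] / p.sum(axis=(1, 2, 3))[x1]
             * p[x1, x2, y1, y2] / p.sum(axis=3)[x1, x2, y1])
    assert np.isclose(px_cc * py_cc, p[x1, x2, y1, y2])
print("decomposition verified")
</syntaxhighlight>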

Directed information can be written in terms of causal conditioning:

:I(X^n\to Y^n)=\mathbf E\left[ \log \frac{P(Y^n||X^n)}{P(Y^n)} \right] = H(Y^n)- H(Y^n || X^n).

The relation generalizes to three strings: the directed information flowing from X^n to Y^n causally conditioned on Z^n is

:I(X^n\to Y^n || Z^n) = H(Y^n || Z^n)- H(Y^n || X^n, Z^n).
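
This identity can be verified numerically by expanding both sides into joint entropies. The sketch below checks it for n = 2 and a randomly chosen joint distribution over binary strings (an illustrative choice).

<syntaxhighlight lang="python">
import numpy as np

def entropy(q):
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2)); p /= p.sum()           # arbitrary joint P(x1, x2, y1, y2)
H = entropy

# Left side: I(X^2 -> Y^2) = I(X1;Y1) + I(X1,X2;Y2|Y1)
lhs = (H(p.sum(axis=(1, 2, 3))) + H(p.sum(axis=(0, 1, 3))) - H(p.sum(axis=(1, 3)))
       + H(p.sum(axis=3)) + H(p.sum(axis=(0, 1))) - H(p) - H(p.sum(axis=(0, 1, 3))))

# Right side: H(Y^2) - H(Y^2 || X^2), with H(Y^2||X^2) = H(Y1|X1) + H(Y2|X1,X2,Y1)
h_y_cc_x = (H(p.sum(axis=(1, 3))) - H(p.sum(axis=(1, 2, 3)))
            + H(p) - H(p.sum(axis=3)))
rhs = H(p.sum(axis=(0, 1))) - h_y_cc_x

print(np.isclose(lhs, rhs))                          # True
</syntaxhighlight>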

=Conservation law of information=

This law, established by James Massey and his son Peter Massey,{{cite book |last1=Massey |first1=J.L. |last2=Massey |first2=P.C. |title=Proceedings. International Symposium on Information Theory, 2005. ISIT 2005 |chapter=Conservation of mutual and directed information |date=September 2005 |pages=157–158 |doi=10.1109/ISIT.2005.1523313|isbn=0-7803-9151-9 |s2cid=38053218 }} gives intuition by relating directed information and mutual information. The law states that for any X^n, Y^n, the following equality holds:

:I(X^n;Y^n)= I(X^n \to Y^n)+I(Y^{n-1} \to X^n).

Two alternative forms of this law are{{cite journal |last1=Amblard |first1=Pierre-Olivier |last2=Michel |first2=Olivier |title=The Relation between Granger Causality and Directed Information Theory: A Review |journal=Entropy |date=28 December 2012 |volume=15 |issue=1 |pages=113–143 |doi=10.3390/e15010113 |arxiv=1211.3169 |bibcode=2012Entrp..15..113A |doi-access=free }}

:I(X^n;Y^n) = I(X^n \to Y^n) + I(Y^n \to X^n) - I(X^n \leftrightarrow Y^n)

:I(X^n;Y^n) = I(X^{n-1} \to Y^n) + I(Y^{n-1} \to X^n) + I(X^n \leftrightarrow Y^n)

where I(X^n \leftrightarrow Y^n) = \sum_{i=1}^n I(X_i ; Y_i | X^{i-1}, Y^{i-1}).
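
The conservation law can likewise be checked numerically. The sketch below verifies I(X^n;Y^n)= I(X^n \to Y^n)+I(Y^{n-1} \to X^n) for n = 2 and an arbitrary binary joint distribution; in this case I(Y^{1} \to X^2) reduces to the single term I(Y_1;X_2|X_1).

<syntaxhighlight lang="python">
import numpy as np

def entropy(q):
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2)); p /= p.sum()           # arbitrary joint P(x1, x2, y1, y2)
H = entropy

mi = H(p.sum(axis=(2, 3))) + H(p.sum(axis=(0, 1))) - H(p)      # I(X^2;Y^2)

# I(X^2 -> Y^2) = I(X1;Y1) + I(X1,X2;Y2|Y1)
di_xy = (H(p.sum(axis=(1, 2, 3))) + H(p.sum(axis=(0, 1, 3))) - H(p.sum(axis=(1, 3)))
         + H(p.sum(axis=3)) + H(p.sum(axis=(0, 1))) - H(p) - H(p.sum(axis=(0, 1, 3))))

# I(Y^1 -> X^2) = I(Y_1; X_2 | X_1)
di_yx_delayed = (H(p.sum(axis=(1, 3))) + H(p.sum(axis=(2, 3)))
                 - H(p.sum(axis=3)) - H(p.sum(axis=(1, 2, 3))))

print(np.isclose(mi, di_xy + di_yx_delayed))                   # True
</syntaxhighlight>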

Estimation and optimization

Estimating and optimizing the directed information is challenging because it is a sum of n terms, where n may be large. In many cases, one is interested in the limiting average, that is, the normalized directed information as n grows to infinity, which is termed a multi-letter expression.

=Estimation=

Estimating directed information from samples is a hard problem since the directed information expression does not depend on the samples themselves but on the joint distribution \{P(x_i,y_i|x^{i-1},y^{i-1})\}_{i=1}^n, which may be unknown. There are several algorithms based on context tree weighting,{{cite journal |last1=Jiao |first1=Jiantao |last2=Permuter |first2=Haim H. |last3=Zhao |first3=Lei |last4=Kim |first4=Young-Han |last5=Weissman |first5=Tsachy |title=Universal Estimation of Directed Information |journal= IEEE Transactions on Information Theory|date=October 2013 |volume=59 |issue=10 |pages=6220–6242 |doi=10.1109/TIT.2013.2267934 |arxiv=1201.2334 |s2cid=10855063 }} empirical parametric distributions,{{cite journal |last1=Quinn |first1=Christopher J. |last2=Kiyavash |first2=Negar |last3=Coleman |first3=Todd P. |title=Directed Information Graphs |journal= IEEE Transactions on Information Theory|date=December 2015 |volume=61 |issue=12 |pages=6887–6909 |doi=10.1109/TIT.2015.2478440|arxiv=1204.2003 |s2cid=3121664}} and long short-term memory networks.
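
As a simple baseline, and not one of the estimators cited above, one can truncate the conditioning to a fixed memory and plug in empirical counts. The sketch below estimates the directed information rate of a pair of binary stationary sequences with memory one; the delayed-copy example and the function name plugin_di_rate are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

def entropy(q):
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def plugin_di_rate(x, y):
    """Toy plug-in estimate (bits/symbol) of the directed information rate that
    truncates the conditioning to one past symbol, i.e. it estimates
    I(X_{i-1}, X_i ; Y_i | Y_{i-1}) from empirical counts.  This is only a crude
    baseline, not the context-tree-weighting or parametric estimators cited above."""
    c = np.zeros((2, 2, 2, 2))                       # axes: x_{i-1}, y_{i-1}, x_i, y_i
    for i in range(1, len(x)):
        c[x[i - 1], y[i - 1], x[i], y[i]] += 1
    p = c / c.sum()
    return (entropy(p.sum(axis=3)) + entropy(p.sum(axis=(0, 2)))
            - entropy(p) - entropy(p.sum(axis=(0, 2, 3))))

# Example: Y is a noisy, one-step-delayed copy of an i.i.d. binary sequence X
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 200000)
y = np.roll(x, 1) ^ (rng.random(200000) < 0.1)
print(plugin_di_rate(x, y))   # roughly 1 - H_b(0.1), about 0.53 bits/symbol: information flows from X to Y
print(plugin_di_rate(y, x))   # about 0: essentially no flow in the reverse direction
</syntaxhighlight>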

=Optimization=

Maximizing directed information is a fundamental problem in information theory. For example, given the channel distributions \{P(y_i|x^{i},y^{i-1})\}_{i=1}^n, the objective might be to maximize I(X^n\to Y^n) over the channel input distributions \{P(x_i|x^{i-1},y^{i-1})\}_{i=1}^n.

There are algorithms to optimize the directed information based on the Blahut-Arimoto,{{cite journal |last1=Naiss |first1=Iddo |last2=Permuter |first2=Haim H. |title=Extension of the Blahut–Arimoto Algorithm for Maximizing Directed Information |journal= IEEE Transactions on Information Theory|date=January 2013 |volume=59 |issue=1 |pages=204–222 |doi=10.1109/TIT.2012.2214202 |s2cid=3115749 |arxiv=1012.5071}} Markov decision process,{{cite journal |last1=Permuter |first1=Haim |last2=Cuff |first2=Paul |last3=Van Roy |first3=Benjamin |last4=Weissman |first4=Tsachy |title=Capacity of the Trapdoor Channel With Feedback |journal= IEEE Transactions on Information Theory|date=July 2008 |volume=54 |issue=7 |pages=3150–3165 |doi=10.1109/TIT.2008.924681|arxiv=cs/0610047 |s2cid=1265}}{{cite journal |last1=Elishco |first1=Ohad |last2=Permuter |first2=Haim |title=Capacity and Coding for the Ising Channel With Feedback |journal= IEEE Transactions on Information Theory|date=September 2014 |volume=60 |issue=9 |pages=5138–5149 |doi=10.1109/TIT.2014.2331951 |arxiv=1205.4674 |s2cid=9761759}}{{cite journal |last1=Sabag |first1=Oron |last2=Permuter |first2=Haim H. |last3=Kashyap |first3=Navin |title=The Feedback Capacity of the Binary Erasure Channel With a No-Consecutive-Ones Input Constraint |journal= IEEE Transactions on Information Theory|date=January 2016 |volume=62 |issue=1 |pages=8–22 |doi=10.1109/TIT.2015.2495239 |s2cid=476381}}{{cite journal |last1=Peled |first1=Ori |last2=Sabag |first2=Oron |last3=Permuter |first3=Haim H. |title=Feedback Capacity and Coding for the $(0,k)$ -RLL Input-Constrained BEC |journal= IEEE Transactions on Information Theory|date=July 2019 |volume=65 |issue=7 |pages=4097–4114 |doi=10.1109/TIT.2019.2903252 |arxiv=1712.02690 |s2cid=86582654}}{{cite journal |last1=Shemuel |first1=Eli |last2=Sabag |first2=Oron |last3=Permuter |first3=Haim H. 
|title=Finite-State Channels With Feedback and State Known at the Encoder |journal=IEEE Transactions on Information Theory |date=March 2024 |volume=70 |issue=3 |pages=1610–1628 |doi=10.1109/TIT.2023.3336939|arxiv=2212.12886 }} recurrent neural networks,{{cite book |last1=Aharoni |first1=Ziv |last2=Tsur |first2=Dor |last3=Goldfeld |first3=Ziv |last4=Permuter |first4=Haim Henry |title=2020 IEEE International Symposium on Information Theory (ISIT) |chapter=Capacity of Continuous Channels with Memory via Directed Information Neural Estimator |arxiv=2003.04179 |date=June 2020 |pages=2014–2019 |doi=10.1109/ISIT44484.2020.9174109 |isbn=978-1-7281-6432-8 |s2cid=212634151}} reinforcement learning,{{cite arXiv |last1=Aharoni |first1=Ziv |last2=Sabag |first2=Oron |last3=Permuter |first3=Haim Henri |title=Reinforcement Learning Evaluation and Solution for the Feedback Capacity of the Ising Channel with Large Alphabet |date=18 August 2020 |class=cs.IT |eprint=2008.07983}} and graphical methods (the Q-graphs).{{cite journal |last1=Sabag |first1=Oron |last2=Permuter |first2=Haim Henry |last3=Pfister |first3=Henry |title=A Single-Letter Upper Bound on the Feedback Capacity of Unifilar Finite-State Channels |journal= IEEE Transactions on Information Theory|date= March 2017 |volume=63 |issue=3 |pages=1392–1409|doi=10.1109/TIT.2016.2636851 |arxiv=1604.01878 |s2cid=3259603 }}{{cite journal |last1=Sabag |first1=Oron |last2= Huleihel |first2= Bashar |last3=Permuter |first3=Haim Henry |title= Graph-Based Encoders and their Performance for Finite-State Channels with Feedback |journal= IEEE Transactions on Communications|date= 2020 |volume= 68 |issue=4 |pages=2106–2117 |doi=10.1109/TCOMM.2020.2965454 |arxiv=1907.08063 |s2cid=197544824}}

For the Blahut-Arimoto algorithm, the main idea is to start from the last mutual information term in the directed information sum and proceed backward. For the Markov decision process approach, the main idea is to transform the optimization into an infinite-horizon average-reward Markov decision process. For a recurrent neural network, the main idea is to model the input distribution with a recurrent neural network and optimize its parameters by gradient descent. For reinforcement learning, the main idea is to solve the Markov decision process formulation of the capacity using reinforcement learning tools, which makes it possible to handle large or even continuous alphabets.
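
As a toy illustration of the optimization problem itself, rather than of the specific algorithms above, the sketch below maximizes I(X^2 \to Y^2) over the input distributions P(x_1) and P(x_2|x_1,y_1) for a memoryless binary symmetric channel by crude random search. For a memoryless channel feedback cannot increase the maximum, which provides a convenient sanity check on the result.

<syntaxhighlight lang="python">
import numpy as np

def entropy(q):
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def di2(joint):
    """I(X^2 -> Y^2) for a joint pmf with axes (x1, x2, y1, y2)."""
    p_x1y1 = joint.sum(axis=(1, 3))
    p_yy = joint.sum(axis=(0, 1))
    term1 = entropy(p_x1y1.sum(axis=1)) + entropy(p_x1y1.sum(axis=0)) - entropy(p_x1y1)  # I(X1;Y1)
    term2 = (entropy(joint.sum(axis=3)) + entropy(p_yy)
             - entropy(joint) - entropy(p_yy.sum(axis=1)))                               # I(X1,X2;Y2|Y1)
    return term1 + term2

def joint_from_policy(a, b, eps=0.1):
    """P(x1, x2, y1, y2) induced by the input policy a = P(X1=1), b[x1, y1] = P(X2=1 | x1, y1)
    over a memoryless binary symmetric channel with crossover probability eps."""
    p = np.zeros((2, 2, 2, 2))
    for x1, x2, y1, y2 in np.ndindex(2, 2, 2, 2):
        p_x1 = a if x1 else 1 - a
        p_y1 = eps if y1 != x1 else 1 - eps
        p_x2 = b[x1, y1] if x2 else 1 - b[x1, y1]
        p_y2 = eps if y2 != x2 else 1 - eps
        p[x1, x2, y1, y2] = p_x1 * p_y1 * p_x2 * p_y2
    return p

rng = np.random.default_rng(0)
best = 0.0
for _ in range(5000):                                # crude random search over the input policy
    best = max(best, di2(joint_from_policy(rng.random(), rng.random((2, 2)))))
print(best)   # approaches 2*(1 - H_b(0.1)), about 1.06 bits: feedback cannot help a memoryless channel
</syntaxhighlight>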

Marko's theory of bidirectional communication

Massey's directed information was motivated by Marko's early work (1966) on developing a theory of bidirectional communication.{{cite journal |last1=Marko |first1=Hans |title=Die Theorie der bidirektionalen Kommunikation und ihre Anwendung auf die Nachrichtenübermittlung zwischen Menschen (Subjektive Information) |journal=Kybernetik |date=1 September 1966 |volume=3 |issue=3 |pages=128–136 |doi=10.1007/BF00288922 |pmid=5920460 |s2cid=33275199 |url=https://link.springer.com/article/10.1007/BF00288922 |language=de |issn=1432-0770|url-access=subscription }}{{cite journal |last1=Marko |first1=H. |title=The Bidirectional Communication Theory--A Generalization of Information Theory |journal=IEEE Transactions on Communications |date=December 1973 |volume=21 |issue=12 |pages=1345–1351 |doi=10.1109/TCOM.1973.1091610|s2cid=51664185 }} Marko's definition of directed transinformation differs slightly from Massey's in that, at time n, one conditions on past symbols X^{n-1},Y^{n-1} only and one takes limits:

:T_{12} = \lim_{n \to \infty} \mathbf E\left[ -\log \frac{P(X_{n}|X^{n-1})}{P(X_{n}|X^{n-1},Y^{n-1})} \right] \quad\text{and}\quad T_{21} = \lim_{n \to \infty} \mathbf E\left[ -\log \frac{P(Y_{n}|Y^{n-1})}{P(Y_{n}|Y^{n-1},X^{n-1})} \right].

Marko defined several other quantities, including:

  • Total information: H_{1} = \lim_{n \to \infty} \mathbf E\left[ -\log P(X_{n}|X^{n-1}) \right] and H_{2} = \lim_{n \to \infty} \mathbf E\left[ -\log P(Y_{n}|Y^{n-1}) \right]
  • Free information: F_{1} = \lim_{n \to \infty} \mathbf E\left[ -\log P(X_{n}|X^{n-1},Y^{n-1}) \right] and F_{2} = \lim_{n \to \infty} \mathbf E\left[ -\log P(Y_{n}|Y^{n-1},X^{n-1}) \right]
  • Coincidence: K = \lim_{n \to \infty} \mathbf E\left[ -\log \frac{P(X_{n}|X^{n-1}) P(Y_{n}|Y^{n-1})}{P(X_{n},Y_{n}|X^{n-1},Y^{n-1})} \right].

The total information is usually called an entropy rate. Marko showed the following relations for the problems he was interested in:

  • K = T_{12}+T_{21}
  • H_{1} = T_{12}+F_{1} and H_{2} = T_{21}+F_{2}

He also defined quantities he called residual entropies:

  • R_{1} = H_{1}-K = F_{1}-T_{21}
  • R_{2} = H_{2}-K = F_{2}-T_{12}

and developed the conservation law F_{1}+F_{2} = R_{1}+R_{2}+K = H_{1}+H_{2}-K and several bounds.
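
For example, the conservation law follows directly from the relations above: substituting F_{1} = H_{1}-T_{12} and F_{2} = H_{2}-T_{21} and then using K = T_{12}+T_{21} gives

:F_{1}+F_{2} = H_{1}+H_{2}-(T_{12}+T_{21}) = H_{1}+H_{2}-K = (H_{1}-K)+(H_{2}-K)+K = R_{1}+R_{2}+K.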

Relation to transfer entropy

Directed information is related to transfer entropy, which is a truncated version of Marko's directed transinformation T_{21}.

The transfer entropy at time i and with memory d is

:T_{X \to Y} = I(X_{i-d},\dots,X_{i-1} ; Y_i | Y_{i-d},\dots,Y_{i-1}),

where one does not include the present symbol X_i or the past symbols X^{i-d-1},Y^{i-d-1} before time i-d.

Transfer entropy usually assumes stationarity, i.e., T_{X \to Y} does not depend on the time i.
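
A plug-in estimate of the transfer entropy with memory d follows the same pattern as the estimation sketch above; the binary alphabets, the helper name transfer_entropy, and the delayed-copy example are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

def entropy(q):
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def transfer_entropy(x, y, d=1):
    """Plug-in estimate of T_{X->Y} = I(X_{i-d..i-1}; Y_i | Y_{i-d..i-1}) for binary
    stationary sequences; note that the present symbol X_i is excluded."""
    c = np.zeros((2,) * (2 * d + 1))                 # axes: X_{i-d..i-1}, Y_{i-d..i-1}, Y_i
    for i in range(d, len(x)):
        c[tuple(x[i - d:i]) + tuple(y[i - d:i]) + (y[i],)] += 1
    p = c / c.sum()
    # I(A;B|C) = H(A,C) + H(C,B) - H(A,C,B) - H(C), with A = X-past, C = Y-past, B = Y_i
    return (entropy(p.sum(axis=2 * d)) + entropy(p.sum(axis=tuple(range(d))))
            - entropy(p) - entropy(p.sum(axis=tuple(range(d)) + (2 * d,))))

# Same toy example as in the estimation sketch: Y is a noisy, one-step-delayed copy of X
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 200000)
y = np.roll(x, 1) ^ (rng.random(200000) < 0.1)
print(transfer_entropy(x, y))   # about 0.53 bits: X drives Y
print(transfer_entropy(y, x))   # about 0: Y does not drive X
</syntaxhighlight>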

References