# Thermodynamics of the binary symmetric channel

- Evgeny Verbitskiy

**8**:2

https://doi.org/10.1186/s40736-015-0021-5

© Verbitskiy. 2016

**Received: **27 August 2015

**Accepted: **28 December 2015

**Published: **14 March 2016

## Abstract

We study a hidden Markov process which is the result of a transmission of the binary symmetric Markov source over the memoryless binary symmetric channel. This process has been studied extensively in Information Theory and is often used as a benchmark case for the so-called denoising algorithms. Exploiting the link between this process and the 1D Random Field Ising Model (RFIM), we are able to identify the Gibbs potential of the resulting Hidden Markov process. Moreover, we obtain a stronger bound on the memory decay rate. We conclude with a discussion on implications of our results for the development of denoising algorithms.

### Keywords

Hidden Markov models, Gibbs states, Thermodynamic formalism, Denoising

### Mathematics Subject Classification

37D35 82B20 82B20

## Introduction

Let \(\{X_{n}\}\) be a stationary two-state Markov chain with values {±1}, and \(\mathbb {P}(X_{n+1}\ne X_{n})=p\), where 0<*p*<1; denote by \(P=(p_{x,x'}) =\left (\begin {array}{cc} 1-p & p \\ p& 1-p\end {array}\right)\) the corresponding transition probability matrix. Note that \(\pi =\left (\frac {1}{2},\frac {1}{2}\right)\) is the stationary initial distribution for this chain.

The binary symmetric channel is modelled by a sequence \(\{Z_{n}\}\) of independent identically distributed random variables, independent of \(\{X_{n}\}\), with

\[\mathbb {P}_{Z}(Z_{n}=-1)=\varepsilon,\qquad \mathbb {P}_{Z}(Z_{n}=1)=1-\varepsilon,\]

and the output process is given by

\[Y_{n}=X_{n}\cdot Z_{n} \tag{1.1}\]

for all *n*. The process \(\{Y_{n}\}\) is a Hidden Markov process, because \(Y_{n}\in \{-1,1\}\) is chosen independently for any *n* from an *emission* distribution \(\pi _{X_{n}}\) on {−1,1}: \(\pi _{1}=(\varepsilon,1-\varepsilon)\) and \(\pi _{-1}=(1-\varepsilon,\varepsilon)\). More generally, Hidden Markov processes are random functions of discrete-time Markov chains, where the value \(Y_{n}\) is chosen according to a distribution which depends on the value \(X_{n}=x_{n}\) of the underlying Markov chain, independently for any *n*. The applications of Hidden Markov processes include automatic character and speech recognition, information theory, statistics and bioinformatics, see [4, 13]. The particular example (1.1) we consider in the present paper is probably one of the simplest examples, and is often used as a benchmark for testing algorithms.

In particular, this example has been studied rather extensively in connection with the computation of the entropy of the output process \(\{Y_{n}\}\), see e.g. [8–10, 12].
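As an illustration of the model, the following minimal sketch samples the hidden chain and the channel output. It assumes that the chain switches state with probability *p*, that the noise flips a symbol with probability *ε*, and that the output is the product of the two; the function name `simulate_bsc_output` and the `seed` parameter are hypothetical and not part of the paper.

```python
import random


def simulate_bsc_output(n, p, eps, seed=0):
    """Sample (x, y): x is the symmetric +/-1 Markov chain, y = x * z."""
    rng = random.Random(seed)
    x = [rng.choice((-1, 1))]          # stationary start: uniform on {-1, +1}
    for _ in range(n - 1):
        flip = rng.random() < p        # switch state with probability p
        x.append(-x[-1] if flip else x[-1])
    # channel noise: z = -1 (a bit flip) with probability eps, independently
    z = [-1 if rng.random() < eps else 1 for _ in range(n)]
    y = [xi * zi for xi, zi in zip(x, z)]
    return x, y
```

With `eps = 0` the output coincides with the hidden chain, which gives a quick sanity check of the sampler.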

The law of the output process \(\{Y_{n}\}\) is the push-forward of \(\mathbb {P}\times \mathbb {P}_{Z}\) under \(\psi : \{-1,1\}^{\mathbb {Z}}\times \{-1,1\}^{\mathbb {Z}}\to \{-1,1\}^{\mathbb {Z}}\), with \(\psi ((x_{n},z_{n}))=x_{n}\cdot z_{n}\). We write \(\mathbb {Q}=(\mathbb {P}\times \mathbb {P}_{Z})\circ \psi ^{-1}\). For every *m*≤*n*, and \({y_{m}^{n}}:=(y_{m},\ldots, y_{n})\in \{-1,1\}^{n-m+1}\), the measure of the corresponding cylindric set is given by

\[\mathbb {Q}({y_{m}^{n}}) =\sum _{{x_{m}^{n}}\in \{-1,1\}^{n-m+1}}\mathbb {P}({x_{m}^{n}})\prod _{i=m}^{n}\mathbb {P}_{Z}(Z_{i}=x_{i}y_{i}) =\sum _{{x_{m}^{n}}}\frac {1}{2}\prod _{i=m}^{n-1}p_{x_{i},x_{i+1}}\prod _{i=m}^{n}\varepsilon ^{\frac {1-x_{i}y_{i}}{2}}(1-\varepsilon)^{\frac {1+x_{i}y_{i}}{2}}. \tag{1.2}\]

If \(p=\frac {1}{2}\), then \(\{X_{n}\}\) is a sequence of independent identically distributed random variables with \(\mathbb {P}(X_{n}=-1)=\mathbb {P}(X_{n}=+1)=\frac {1}{2}\), and \(\{Y_{n}\}\) has the same distribution. If \(\varepsilon =\frac {1}{2}\), then the formula above implies that \(\{Y_{n}\}\) is a sequence of independent random variables with \(\mathbb {Q}\left (Y_{n}=-1\right)=\mathbb {Q}\left (Y_{n}=+1\right)=\frac {1}{2}\).
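The cylinder probabilities can be checked numerically for short words. The sketch below sums over all hidden paths directly, assuming the uniform stationary start, transition probabilities 1−*p* (stay) and *p* (switch), and a noise symbol equal to the product of input and output. The function name `cylinder_prob` is hypothetical, and the cost is exponential in the word length, so it is only meant for sanity checks.

```python
from itertools import product


def cylinder_prob(y, p, eps):
    """Q(y_m^n) by direct summation over all hidden paths x (exponential cost)."""
    total = 0.0
    for x in product((-1, 1), repeat=len(y)):
        weight = 0.5                                   # stationary law (1/2, 1/2)
        for a, b in zip(x, x[1:]):
            weight *= (1 - p) if a == b else p         # transition probability
        for xi, yi in zip(x, y):
            weight *= (1 - eps) if xi == yi else eps   # P(Z_i = x_i * y_i)
        total += weight
    return total
```

For `p = 0.5` or `eps = 0.5` every word of length *k* receives probability 2^(−*k*), in agreement with the two degenerate cases described above.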

The paper is organized as follows. In Section 2 we exploit methods of Statistical Mechanics to derive expressions for probabilities of cylindric events (1.2). In Section 3, analyzing the derived expressions, we show that the measure \(\mathbb {Q}\) has nice thermodynamic properties; in particular, it falls into the class of *g*-measures with exponential decay of variation (memory decay). We also obtain a novel estimate of the decay rate, which is stronger than estimates derived in previous works. In Section 4 we study two-sided conditional probabilities and show that \(\mathbb {Q}\) is in fact a Gibbs state in the sense of Statistical Mechanics. We also discuss the well-known denoising algorithm DUDE, and suggest that the Gibbs property of \(\mathbb {Q}\) explains why DUDE performs so well in this particular example. Furthermore, we argue that the development of denoising algorithms relying on thermodynamic Gibbs ideas can result in superior performance.

## Random field Ising model

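It has been observed that the probability of a cylindric event \(\{Y_{m}=y_{m},\ldots,Y_{n}=y_{n}\}\), *m*≤*n*, can be expressed via a partition function of a random field Ising model. We exploit this observation further. Assume \(p,\varepsilon \in (0,1)\), and put

\[J=\frac {1}{2}\log \frac {1-p}{p},\qquad K=\frac {1}{2}\log \frac {1-\varepsilon }{\varepsilon }.\]

Then for every \((y_{m},\ldots,y_{n})\in \{-1,1\}^{n-m+1}\), the expression for the cylinder probability (1.2) can be rewritten as

\[\mathbb {Q}({y_{m}^{n}})=\frac {1}{2}\,\frac {1}{(2\cosh J)^{n-m}(2\cosh K)^{n-m+1}}\sum _{{x_{m}^{n}}}\exp \left (J\sum _{i=m}^{n-1}x_{i}x_{i+1}+K\sum _{i=m}^{n}x_{i}y_{i}\right), \tag{2.1}\]

where the sum is taken over all \((x_{m},\ldots,x_{n})\in \{-1,1\}^{n-m+1}\): it is the partition function of the Ising model with coupling *J* and external fields determined by the \(y_{i}\)’s. Applying the recursive method of [14], the partition function can be evaluated in the following fashion [1]. Consider the following functions

\[A(w)=\frac {1}{2}\log \frac {\cosh (w+J)}{\cosh (w-J)},\qquad B(w)=\frac {1}{2}\log \bigl (4\cosh (w+J)\cosh (w-J)\bigr).\]

If *s*=±1, then for all \(w\in \mathbb {R}\)

\[\sum _{x=\pm 1}\exp \bigl (x(w+sJ)\bigr)=2\cosh (w+sJ)=\exp \bigl (sA(w)+B(w)\bigr).\]

Suppose *m*<*n*, \({y_{m}^{n}}\in \{-1,1\}^{n-m+1}\), then summing over the right-most spin \(x_{n}\) gives

\[\sum _{{x_{m}^{n}}}\exp \left (J\sum _{i=m}^{n-1}x_{i}x_{i+1}+K\sum _{i=m}^{n}x_{i}y_{i}\right) =\sum _{x_{m}^{n-1}}\exp \left (J\sum _{i=m}^{n-2}x_{i}x_{i+1}+K\sum _{i=m}^{n-2}x_{i}y_{i}+x_{n-1}\left (Ky_{n-1}+A\left (w_{n}^{(n)}\right)\right)+B\left (w_{n}^{(n)}\right)\right),\]

where \(w_{n}^{(n)}=Ky_{n}\). The new sum has the same form, but instead of \(Ky_{n-1}\), we now have \(w_{n-1}^{(n)}=Ky_{n-1}+A\left (w_{n}^{(n)}\right)\). Continuing the summation over the remaining right-most *x*-spins, one gets

\[\sum _{{x_{m}^{n}}}\exp \left (J\sum _{i=m}^{n-1}x_{i}x_{i+1}+K\sum _{i=m}^{n}x_{i}y_{i}\right) =2\cosh \left (w_{m}^{(n)}\right)\exp \left (\sum _{i=m+1}^{n}B\left (w_{i}^{(n)}\right)\right),\qquad w_{i}^{(n)}=Ky_{i}+A\left (w_{i+1}^{(n)}\right),\quad m\le i<n.\]

Since *A*(0)=0, we can define \(w_{n+1}^{(n)}=0\), so that the recursion \(w_{i}^{(n)}=Ky_{i}+A\left (w_{i+1}^{(n)}\right)\) holds for all \(m\le i\le n\), and

\[\mathbb {Q}({y_{m}^{n}})=\frac {\cosh \left (w_{m}^{(n)}\right)}{(2\cosh J)^{n-m}(2\cosh K)^{n-m+1}}\exp \left (\sum _{i=m+1}^{n}B\left (w_{i}^{(n)}\right)\right). \tag{2.2}\]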
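The recursive evaluation of the partition function can be compared with a brute-force sum for short words. The sketch below is a numerical check under stated assumptions, not the author's code: it takes \(A(w)=\frac12\log\frac{\cosh(w+J)}{\cosh(w-J)}\) and \(B(w)=\frac12\log(4\cosh(w+J)\cosh(w-J))\), which is one choice satisfying \(2\cosh(w+sJ)=\exp(sA(w)+B(w))\) for *s*=±1. The function names are hypothetical.

```python
from itertools import product
from math import cosh, exp, log


def partition_brute(y, J, K):
    """Sum of exp(J * sum x_i x_{i+1} + K * sum x_i y_i) over all spin words x."""
    total = 0.0
    for x in product((-1, 1), repeat=len(y)):
        energy = J * sum(a * b for a, b in zip(x, x[1:]))
        energy += K * sum(xi * yi for xi, yi in zip(x, y))
        total += exp(energy)
    return total


def partition_recursive(y, J, K):
    """Same quantity, summing out spins from the right (linear cost)."""
    A = lambda w: 0.5 * log(cosh(w + J) / cosh(w - J))
    B = lambda w: 0.5 * log(4 * cosh(w + J) * cosh(w - J))
    w = K * y[-1]                  # effective field on the right-most spin
    log_factor = 0.0
    for yi in reversed(y[:-1]):
        log_factor += B(w)         # factor left behind by the spin just summed out
        w = K * yi + A(w)          # effective field on the next spin to the left
    return 2 * cosh(w) * exp(log_factor)
```

Both functions agree up to rounding for every word of moderate length, while the recursive version needs only one pass over the data.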

## Thermodynamic formalism

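Let \(\Omega =A^{\mathbb {Z}_{+}}\), where *A* is a finite alphabet, be the space of one-sided infinite sequences \(\boldsymbol {\omega }=(\omega _{0},\omega _{1},\ldots)\) in alphabet *A* (\(\omega _{i}\in A\) for all *i*). We equip *Ω* with a metric compatible with the product topology, for example \(d(\boldsymbol {\omega },\boldsymbol {\tilde {\omega }})=2^{-k}\), where *k* is the first index with \(\omega _{k}\ne \tilde {\omega }_{k}\). Denote by \(S:\Omega \to \Omega \) the left shift: \((S\boldsymbol {\omega })_{i}=\omega _{i+1}\) for all \(i\in \mathbb {Z}_{+}\). A Borel probability measure \(\mathbb {P}\) on *Ω* is translation invariant if \(\mathbb {P}(S^{-1}C)=\mathbb {P}(C)\) for every Borel set \(C\subseteq \Omega \).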

Let us recall the following well-known definitions:

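**Definition 3.1.**

Suppose \(\mathbb {P}\) is a fully supported translation invariant measure on \(\Omega =A^{\mathbb Z_{+}}\), where *A* is a finite alphabet.

- The measure \(\mathbb {P}\) is called a *g*-measure, if for some positive continuous function \(g:\Omega \to (0,1)\) satisfying the normalization condition
  \[\sum _{a\in A}g(a,\omega _{1},\omega _{2},\ldots)=1\]
  for all \(\boldsymbol {\omega }=(\omega _{0},\omega _{1},\ldots)\in \Omega \), one has
  \[\mathbb {P}(\omega _{0}\,|\,\omega _{1},\omega _{2},\ldots)=g(\boldsymbol {\omega })\]
  for \(\mathbb {P}\)-almost all \(\boldsymbol {\omega }\in \Omega \).
- The measure \(\mathbb {P}\) is called a Bowen-Gibbs measure for a continuous potential \(\phi :\Omega \to \mathbb {R}\), if there exist constants \(P\in \mathbb {R}\) and *C*≥1 such that for all \(\boldsymbol {\omega }\in \Omega \) and every \(n\in \mathbb N\)
  \[\frac {1}{C}\le \frac {\mathbb {P}\left (\{\boldsymbol {\tilde {\omega }}:\ \tilde {\omega }_{i}=\omega _{i},\ 0\le i\le n-1\}\right)}{\exp \left (\sum _{k=0}^{n-1}\phi (S^{k}\boldsymbol {\omega })-nP\right)}\le C.\]
- The measure \(\mathbb {P}\) is called an equilibrium state for a continuous potential \(\phi :\Omega \to \mathbb {R}\), if
  \[h(\mathbb {P})+\int \phi \,d\mathbb {P}=\sup _{\tilde {\mathbb {P}}\in \mathcal M_{1}^{*}(\Omega)}\left [h(\tilde {\mathbb {P}})+\int \phi \,d\tilde {\mathbb {P}}\right ],\]
  where \(h(\mathbb {P})\) is the Kolmogorov-Sinai entropy of \(\mathbb {P}\) and the supremum is taken over the set \(\mathcal M_{1}^{*}(\Omega)\) of all translation invariant Borel probability measures on *Ω*.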

It is known that every *g*-measure \(\mathbb {P}\) is also an equilibrium state for log*g*; and that every Bowen-Gibbs measure \(\mathbb {P}\) for potential *ϕ* is an equilibrium state for *ϕ* as well.

**Theorem 3.1.**

Suppose \(p,\varepsilon \in (0,1)\). Then the measure \(\mathbb {Q}=\mathbb {Q}_{p,\varepsilon }\) on \(\{-1,1\}^{\mathbb {Z}_{+}}\) (c.f., (2.2)) is a *g*-measure. Moreover, the corresponding function *g* has *exponential decay of variation*: define the *n*-th variation of *g* by

\(\rho (p,\varepsilon)=0\) if \(p=\frac {1}{2}\) or \(\varepsilon =\frac {1}{2}\); for all \(p\ne \frac {1}{2}\)

Finally, the measure \(\mathbb {Q}\) is also a Bowen-Gibbs measure for a Hölder continuous potential \(\phi :\{-1,1\}^{\mathbb {Z}_{+}}\to \mathbb R\).

Results of this type are known for hidden Markov processes \(\{Y_{n}\}\), where the underlying Markov chain \(\{X_{n}\}\) has a strictly positive transition probability matrix *P*, see [15] for a review of several results of this nature. However, the example considered in the present paper is rather exceptional, since one is able to identify the *g*-function and the Gibbs potential *ϕ* *explicitly*. Another interesting question is the estimate of the decay rate *ρ*. In [15] a number of previously known estimates of the rate of exponential decay in (3.3) have been compared; the best known estimate for *ρ* does not depend on *ε*, and in fact, it was conjectured in [15] that the estimate could be improved, e.g., by incorporating dependence on *ε*. The proof of Theorem 3.1 shows that this is indeed the case and one obtains the sharper estimate (3.3).

Suppose \(p,\varepsilon \in (0,1)\), and \(p\ne \frac {1}{2}\) and \(\varepsilon \ne \frac {1}{2}\). Fix \(\boldsymbol {y}=(y_{0},y_{1},\ldots)\in \{-1,1\}^{\mathbb {Z}_{+}}\). For every \(n\in \mathbb {Z}_{+}\), define the sequence \(w^{(n)}_{i}=w^{(n)}_{i}(\boldsymbol {y})\), \(i\in \mathbb {Z}_{+}\), as follows:

\[w_{i}^{(n)}=0\quad \text {for } i>n,\]

and, using the maps \(F_{-1}\), \(F_{1}:\mathbb {R}\to \mathbb {R}\), given by

\[F_{s}(w)=Ks+A(w),\qquad s=\pm 1,\]

for *i*≤*n*,

\[w_{i}^{(n)}=F_{y_{i}}\left (w_{i+1}^{(n)}\right).\]

As we will show, the maps \(F_{-1}\), \(F_{1}\) are strict contractions, and as a result, for every *i*, the sequence \(\left \{w_{i}^{(n)}\right \}\) converges as *n*→*∞*, in fact with exponential speed.
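The exponential convergence can be observed numerically. The sketch below iterates \(w\mapsto Ky_{i}+A(w)\) from the right end of a truncated word, assuming \(A(w)=\frac12\log\frac{\cosh(w+J)}{\cosh(w-J)}\), \(J=\frac12\log\frac{1-p}{p}\) and \(K=\frac12\log\frac{1-\varepsilon}{\varepsilon}\) (these explicit forms are assumptions of the example). The function name `w0_truncated` is hypothetical.

```python
from math import cosh, log


def w0_truncated(y, n, J, K):
    """w_0^{(n)}(y): start from w_{n+1}^{(n)} = 0 and apply w_i = K*y_i + A(w_{i+1})."""
    A = lambda w: 0.5 * log(cosh(w + J) / cosh(w - J))
    w = 0.0
    for i in range(n, -1, -1):     # i = n, n-1, ..., 0
        w = K * y[i] + A(w)
    return w
```

Under these assumptions the derivative of *A* is bounded by |1−2*p*| in absolute value, so the gap between successive truncations \(w_{0}^{(n+1)}\) and \(w_{0}^{(n)}\) is at most \(|1-2p|^{n+1}|K|\). This is only the crude rate; it does not reflect any sharper estimate.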

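**Lemma 3.2.**

For every \(i\in \mathbb {Z}_{+}\) the limit \(w_{i}(\boldsymbol {y})=\lim _{n\to \infty }w_{i}^{(n)}(\boldsymbol {y})\) exists, and there exist \(\varrho \in (0,1)\) and *C*>0, both independent of \(\boldsymbol {y}\), such that

\[\left |w_{i}^{(n)}(\boldsymbol {y})-w_{i}(\boldsymbol {y})\right |\le C\varrho ^{\,n-i}\]

for all *n*≥*i*. Furthermore, \(w_{i}(\boldsymbol {y})=w_{0}(S^{i}\boldsymbol {y})\) for all \(i\in \mathbb {Z}_{+}\) and \(\boldsymbol {y}\), and \(w_{0}:\{-1,1\}^{\mathbb {Z}_{+}}\to \mathbb {R}\) (and hence every \(w_{i}\)) is Hölder continuous:

\[\left |w_{0}(\boldsymbol {y})-w_{0}(\boldsymbol {\tilde {y}})\right |\le C'\,d(\boldsymbol {y},\boldsymbol {\tilde {y}})^{\theta }\]

for some \(C'\), *θ*>0 and all \( \boldsymbol {y},\boldsymbol {\tilde {y}}\in \{-1,1\}^{\mathbb {Z}_{+}}\).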

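*Proof.*

Since \(|A'(w)|=\frac {|\sinh (2J)|}{\cosh (2w)+\cosh (2J)}\le \tanh |J|=|1-2p|<1\) for all *w*, both maps \(F_{-1}\), \(F_{1}\) are contractions with rate \(\varrho =|1-2p|\). Moreover, \(|A(w)|\le |J|\), and hence \(\left |w_{i}^{(n)}\right |\le |K|+|J|\) for all *i* and *n*. Suppose *i*≤*n*≤*m*. Then \(w_{i}^{(n)}=F_{y_{i}}\circ \cdots \circ F_{y_{n}}(0)\) and \(w_{i}^{(m)}=F_{y_{i}}\circ \cdots \circ F_{y_{n}}\left (w_{n+1}^{(m)}\right)\), so that

\[\left |w_{i}^{(n)}-w_{i}^{(m)}\right |\le \varrho ^{\,n-i+1}\left |w_{n+1}^{(m)}\right |\le (|K|+|J|)\,\varrho ^{\,n-i+1}\]

for all *i*≤*n*≤*m*. Hence \(\left \{w_{i}^{(n)}\right \}\) is a Cauchy sequence, the limit \(w_{i}\) exists, and letting *m*→*∞* one obtains the required bound for all *n*≥*i*. Since \(w_{i}^{(n)}(\boldsymbol {y})=w_{0}^{(n-i)}(S^{i}\boldsymbol {y})\), one has \(w_{i}(\boldsymbol {y})=w_{0}(S^{i}\boldsymbol {y})\). Finally, if \(\boldsymbol {y}\) and \(\boldsymbol {\tilde {y}}\) agree in coordinates 0,…,*k*, then \(w_{0}^{(k)}(\boldsymbol {y})=w_{0}^{(k)}(\boldsymbol {\tilde {y}})\), and therefore \(|w_{0}(\boldsymbol {y})-w_{0}(\boldsymbol {\tilde {y}})|\le 2C\varrho ^{\,k}\), which gives Hölder continuity.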

The estimate of a contraction rate in the Lemma above can be improved. If \(p=\frac {1}{2}\), then *A*(*w*)≡0, and hence \(w_{i}^{(n)}\equiv Ky_{i}\) for all *n*≥*i*. We may assume that \(p\ne \frac {1}{2}\). Let us also assume that \(\varepsilon \ne \frac {1}{2}\), i.e., *K*≠0.

Put \(\alpha =(1-p)^{2}+p^{2}\), so that \(1-\alpha =2p(1-p)\). Let *Δ*>0 be sufficiently small so that for all \(w\in [-\Delta,\Delta ]\) one has \(\cosh (2K+2A(w))>\cosh (K)\), and hence for all \(w\in [-\Delta,\Delta ]\)

For \(w\notin [-\Delta,\Delta ]\), one has

Even sharper bounds can be achieved by studying the minimum of the denominator in (3.8) or higher iterates of the *F*’s.

*Proof of Theorem 3.1.*

The cases \(p=\frac {1}{2}\) or \(\varepsilon =\frac {1}{2}\) are obvious: in these cases, \(\mathbb {Q}\) is the Bernoulli \(\left (\frac {1}{2}, \frac {1}{2}\right)\)-measure on \(\{-1,1\}^{\mathbb {Z}_{+}}\), and hence *ρ*(*p*,*ε*)=0. Thus we will assume that \(p,\varepsilon \ne \frac {1}{2}\).

To show that \(\mathbb {Q}\) is a *g*-measure it is sufficient to show that conditional probabilities \(\mathbb {Q}\left (y_{0}|{y_{1}^{n}}\right)\) converge uniformly as *n*→*∞*. Given that

as *n*→*∞*, we obtain uniform convergence of conditional probabilities, and hence, \(\mathbb {Q}\) is a *g*-measure with *g* given by

Since \(w_{0}\), \(w_{1}=w_{0}\circ S\) are Hölder continuous functions satisfying (3.9), and cosh, exp, and *B* are smooth functions, we can conclude that *g* is also Hölder continuous with the same decay of variation:

for some \(C_{3}>0\) (\(C_{4}=C_{2}C_{3}\), c.f., (3.9)), and hence

Since \(w_{1}(\boldsymbol {y})=w_{0}(S\boldsymbol {y})\), one has

Since every *g*-measure is also an equilibrium state for log *g*, we conclude that \(\mathbb {Q}\) is an equilibrium state for

The potential log *g* differs from *ϕ* by the sum of a coboundary (\(\log h(\boldsymbol {y})-\log h(S\boldsymbol {y})\)) and a constant (\(-\log \lambda _{J,K}\)). Two potentials whose difference is of such a form have identical sets of equilibrium states. The reason is that for any translation invariant measure \(\mathbb {Q}'\) one has

and hence every equilibrium state for log *g* is an equilibrium state for *ϕ* as well. Thus \(\mathbb {Q}\) is also an equilibrium state for *ϕ*.

An equilibrium state for a Hölder continuous potential *ϕ* is also a Bowen-Gibbs measure [3]. In our particular case, a direct proof of the Bowen-Gibbs property for \(\mathbb {Q}\) is straightforward. Indeed, using the result of (2.2) and the notation introduced above, for every \(\boldsymbol {y}=(y_{0},y_{1},\ldots)\) one has

With \(P=\log \lambda _{J,K}\), it remains to bound the resulting ratio (uniformly in *n* and \(\boldsymbol {y}=(y_{0},y_{1},\ldots)\)) from below and above by some positive constants \(\underline C,\overline C\), respectively. Indeed, since \(p,\varepsilon \in (0,1)\), \(I=[-|K|-|J|,|K|+|J|]\) is a finite interval, and by the result of the previous Lemma, \(w_{i}^{(n)}(\boldsymbol {y})\in I\) for all *i* and *n*. Using (3.5), one readily checks that the following choice of constants suffices:

We complete this section with a curious continued fraction representation of the *g*-function (3.11).

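**Proposition 3.3.**

For every \(\boldsymbol {y}\in \{-1,1\}^{\mathbb {Z}_{+}}\) one has \(g(\boldsymbol {y})=\frac {1}{2}+\frac {1}{2}z_{1}\), where for all *i*≥1

\[z_{i}=(1-2p)\,y_{i-1}y_{i}\left (1-\frac {4\varepsilon (1-\varepsilon)}{1+z_{i+1}}\right).\]

*Proof.*

Since for all \(w\in \mathbb {R}\)

\[\tanh (A(w))=\tanh (J)\tanh (w)=(1-2p)\tanh (w),\]

if we let \(z_{i}=(1-2p)(1-2\varepsilon)\,y_{i-1}\tanh (w_{i})\), \(i\in \mathbb N\), then the addition formula for tanh applied to \(w_{i}=Ky_{i}+A(w_{i+1})\), together with \(\tanh (K)=1-2\varepsilon \), gives

\[z_{i}=(1-2p)\,y_{i-1}y_{i}\,\frac {(1-2\varepsilon)^{2}+z_{i+1}}{1+z_{i+1}}=(1-2p)\,y_{i-1}y_{i}\left (1-\frac {4\varepsilon (1-\varepsilon)}{1+z_{i+1}}\right).\]

Since \(g(\boldsymbol {y}) =\frac {1}{2} +\frac {1}{2}z_{1}\), we obtain the continued fraction expansion (3.12). □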
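The identity \(g(\boldsymbol{y})=\frac12+\frac12 z_{1}\) can be tested against conditional probabilities computed by brute force. The sketch below truncates the sequence at the end of a finite word and assumes \(w_{i}=Ky_{i}+A(w_{i+1})\) with \(A(w)=\frac12\log\frac{\cosh(w+J)}{\cosh(w-J)}\), \(J=\frac12\log\frac{1-p}{p}\), \(K=\frac12\log\frac{1-\varepsilon}{\varepsilon}\) (assumed explicit forms). The function names are hypothetical, and the word must contain at least two symbols.

```python
from itertools import product
from math import cosh, log, tanh


def z_values(y, p, eps):
    """[z_1, ..., z_n] with z_i = (1-2p)(1-2eps) * y_{i-1} * tanh(w_i), n = len(y) - 1."""
    J = 0.5 * log((1 - p) / p)
    K = 0.5 * log((1 - eps) / eps)
    A = lambda w: 0.5 * log(cosh(w + J) / cosh(w - J))
    w, zs = 0.0, []
    for i in range(len(y) - 1, 0, -1):       # i = n, n-1, ..., 1
        w = K * y[i] + A(w)                   # w_i = K*y_i + A(w_{i+1})
        zs.append((1 - 2 * p) * (1 - 2 * eps) * y[i - 1] * tanh(w))
    return zs[::-1]


def g_value(y, p, eps):
    """Truncated g-function: 1/2 + z_1 / 2."""
    return 0.5 + 0.5 * z_values(y, p, eps)[0]


def cond_prob_brute(y, p, eps):
    """Q(y_0 | y_1, ..., y_n) as a ratio of two brute-force cylinder probabilities."""
    def q(word):
        if not word:
            return 1.0
        total = 0.0
        for x in product((-1, 1), repeat=len(word)):
            weight = 0.5
            for a, b in zip(x, x[1:]):
                weight *= (1 - p) if a == b else p
            for xi, yi in zip(x, word):
                weight *= (1 - eps) if xi == yi else eps
            total += weight
        return total
    return q(y) / q(y[1:])
```

For a finite word the two computations agree up to rounding, so `g_value` gives the conditional probability of the first symbol given the rest in linear time.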

## Two-sided conditional probabilities and denoising

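The two-sided conditional probabilities \(\mathbb {Q}\left (y_{0}\,|\,y_{-m}^{-1},{y_{1}^{n}}\right)\) can be evaluated in a similar fashion for all *m*,*n*>0 and \(\boldsymbol {y}=(\ldots,y_{-1},y_{0},y_{1},\ldots)\in \{-1,1\}^{\mathbb {Z}}\). Indeed, the corresponding partition functions can be evaluated by first summing over spins on the right: \(x_{n},\ldots,x_{1}\), and then summing over spins on the left: \(x_{-m},\ldots,x_{-1}\). One has

Therefore,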

### 4.1 Denoising

Suppose we observe \(\{y_{n}\}\), *n*=1,…,*N*, given by (1.1), i.e., \(y_{n}=x_{n}\cdot z_{n}\), where \(\{x_{n}\}\) is some unknown realisation of the Markov chain, and \(\{z_{n}\}\) is an unknown realisation of the Bernoulli sequence \(\{Z_{n}\}\). The natural question is, given the observed data \(y^{N}=(y_{1},\ldots,y_{N})\), what is the optimal choice of \(\hat X_{n}=\hat X_{n}\left (y^{N}\right)\), the estimate of \(X_{n}\), such that the empirical zero-one loss (bit error rate)

\[\frac {1}{N}\sum _{n=1}^{N}\mathbb {1}\left [\hat X_{n}\ne x_{n}\right ]\]

is minimal.

If the parameters of the Markov chain (i.e., *P*) and of the channel (i.e., *Π*) are known, the conditional probabilities \(\mathbb {P}\left [X_{n}= x\,|\,Y^{N}=y^{N}\right ]\) can be found using the backward-forward algorithm. Namely, one has

\(N\setminus n=\{1,\ldots,N\}\setminus \{n\}\), as follows

where *Π* is the emission matrix, and \(\pi _{-1}\), \(\pi _{1}\) are the columns of *Π*:

where \(m\left (a_{-k_{N}}^{-1},c,b_{1}^{k_{N}}\right)\) is the number of occurrences of the word \(a_{-k_{N}}^{-1}cb_{1}^{k_{N}}\) in the observed sequence \(y^{N}=(y_{1},\ldots,y_{N})\); the length of the right and left contexts is set to \(k_{N}=c\log N\), *c*>0.
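To make the role of the context counts concrete, here is a minimal two-pass sketch of a DUDE-style denoiser for ±1 data. It is an illustration under stated assumptions rather than the reference implementation of [16]: the flip threshold \(2\varepsilon(1-\varepsilon)/(\varepsilon^{2}+(1-\varepsilon)^{2})\) is the form the rule takes for a binary symmetric channel with *ε*<1/2 under zero-one loss, symbols closer than *k* to either end are left unchanged, and the function name `dude_bsc` is hypothetical.

```python
from collections import Counter


def dude_bsc(y, eps, k):
    """Two-pass DUDE-style denoiser for +/-1 data sent through a BSC(eps), eps < 1/2."""
    n = len(y)
    counts = Counter()
    # Pass 1: count every word of length 2k+1 (left context, centre, right context).
    for i in range(k, n - k):
        counts[tuple(y[i - k:i + k + 1])] += 1
    threshold_num = 2 * eps * (1 - eps)
    threshold_den = eps ** 2 + (1 - eps) ** 2
    out = list(y)
    # Pass 2: flip y_i when its context occurs much more often with the other symbol.
    for i in range(k, n - k):
        left, right = tuple(y[i - k:i]), tuple(y[i + 1:i + k + 1])
        same = counts[left + (y[i],) + right]
        other = counts[left + (-y[i],) + right]
        if same * threshold_den < other * threshold_num:
            out[i] = -y[i]
    return out
```

For example, a constant sequence with a single flipped symbol is restored: the isolated symbol occurs once in its context while the opposite symbol occurs many times, so the count ratio falls below the threshold. The comparison is written with cross-multiplication to avoid dividing by a zero count.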

The DUDE has shown excellent performance in a number of test cases. In particular, in the case of the binary memoryless channel and the symmetric Markov chain considered in this paper, its performance is comparable to that of the backward-forward algorithm (4.1), which requires full knowledge of the source distribution, while DUDE is completely oblivious in that respect. In our opinion, the excellent performance of DUDE in this case is partially due to the fact that \(\mathbb {Q}\) is a Gibbs measure, admitting smooth two-sided conditional probabilities, which are well approximated by (4.3) and thus can be estimated from the data. It will be interesting to evaluate performance in cases when the output measure is not Gibbs.
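For comparison, when *p* and *ε* are known, the posterior marginals used by the backward-forward algorithm can be computed as in the following sketch. It assumes the uniform stationary start, switch probability *p* and flip probability *ε*; the function name `posterior_marginals` is hypothetical, and the messages are renormalised at each step only for numerical stability.

```python
def posterior_marginals(y, p, eps):
    """P[X_n = +1 | y_1..y_N] for each n, by normalised forward and backward passes."""
    states = (-1, 1)
    trans = lambda a, b: (1 - p) if a == b else p
    emit = lambda x, obs: (1 - eps) if x == obs else eps
    n = len(y)
    # forward[i][s] is proportional to P(y_1..y_i, X_i = s)
    forward = [{s: 0.5 * emit(s, y[0]) for s in states}]
    for i in range(1, n):
        prev = forward[-1]
        cur = {s: emit(s, y[i]) * sum(prev[r] * trans(r, s) for r in states)
               for s in states}
        norm = sum(cur.values())
        forward.append({s: v / norm for s, v in cur.items()})
    # backward[i][s] is proportional to P(y_{i+1}..y_N | X_i = s)
    backward = [dict.fromkeys(states, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        nxt = backward[i + 1]
        cur = {s: sum(trans(s, r) * emit(r, y[i + 1]) * nxt[r] for r in states)
               for s in states}
        norm = sum(cur.values())
        backward[i] = {s: v / norm for s, v in cur.items()}
    post = []
    for f, b in zip(forward, backward):
        num = f[1] * b[1]
        post.append(num / (num + f[-1] * b[-1]))
    return post
```

Choosing \(\hat X_{n}=+1\) whenever the returned value exceeds 1/2 (and −1 otherwise) minimises the expected bit error rate for known parameters.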

Nevertheless, the BFP model seems to perform extremely well [17].

Bit error rates for *ε*=0.2 and various values of *p* (smaller rates are better).

| *p* | Gibbs | DUDE |
|---|---|---|
| 0.05 | 5.30 % | 5.58 % |
| 0.10 | 9.91 % | 10.48 % |
| 0.15 | 13.20 % | 13.77 % |
| 0.20 | 18.34 % | 18.77 % |

One could also try to estimate the Gibbsian potential directly, e.g., using the estimation procedure proposed in [5]. This method showed promising performance in experiments on language classification and authorship attribution. In conclusion, let us also mention that the direct two-sided Gibbs modeling of stochastic processes opens possibilities for applying semi-parametric statistical procedures, as opposed to the universal (parameter free) approach of DUDE.

## Declarations

### Acknowledgments

Part of the work described in this paper was completed during the author’s visit to the Institute of Mathematics for Industry, Kyushu University. The author is grateful for the hospitality during his stay and the support of the World Premier International Researcher Invitation Program.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

1. Behn, U, Zagrebnov, VA: One-dimensional Markovian-field Ising model: physical properties and characteristics of the discrete stochastic mapping. J. Phys. A. 21(9), 2151–2165 (1988). ISSN:0305-4470, MR952930 (89j:82024).
2. Berghout, S, Verbitskiy, E: On bi-directional modeling of information sources (2015).
3. Bowen, R: Some systems with unique equilibrium states. Math. Systems Theory. 8(3), 193–202 (1974/75). ISSN:0025-5661, MR0399413 (53 #3257).
4. Ephraim, Y, Merhav, N: Hidden Markov processes, Special issue on Shannon theory: perspective, trends, and applications. IEEE Trans. Inform. Theory. 48(6), 1518–1569 (2002). ISSN:0018-9448, MR1909472 (2003f:94024), doi:10.1109/TIT.2002.1003838.
5. Ermolaev, V, Verbitskiy, E: Thermodynamic Gibbs Formalism and Information Theory. In: The Impact of Applications on Mathematics (M. Wakayama et al, ed.), Mathematics for Industry, pp. 349–362. Springer Japan (2014).
6. Fernandez, F, Viola, A, Weinberger, MJ: Efficient Algorithms for Constructing Optimal Bi-directional Context Sets. Data Compression Conference (DCC), 179–188 (2010). http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5453460. doi:10.1109/DCC.2010.23.
7. Fernández, R, Ferrari, PA, Galves, A: Coupling renewal and perfect simulation of chains of infinite order, Ubatuba (2001).
8. Hochwald, BM, Jelenković, PR: State learning and mixing in entropy of hidden Markov processes and the Gilbert-Elliott channel. IEEE Trans. Inform. Theory. 45(1), 128–138 (1999). ISSN:0018-9448, MR1677853 (99k:94028).
9. Jacquet, P, Seroussi, G, Szpankowski, W: On the entropy of a hidden Markov process. Theoret Comput. Sci. 395(2–3), 203–219 (2008).
10. Ordentlich, E, Weissman, T: Approximations for the entropy rate of a hidden Markov process. In: *Proceedings International Symposium on Information Theory 2005*, pp. 2198–2202 (2005). http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1523737. doi:10.1109/ISIT.2005.1523737.
11. Ordentlich, E, Weinberger, MJ, Weissman, T: Multi-directional context sets with applications to universal denoising and compression. In: *ISIT 2005. Proceedings International Symposium on Information Theory*, pp. 1270–1274 (2005). http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1523546. doi:10.1109/ISIT.2005.1523546.
12. Pollicott, M: Computing entropy rates for hidden Markov processes. In: Entropy of hidden Markov processes and connections to dynamical systems, London Math. Soc. Lecture Note Ser, pp. 223–245 (2011). Cambridge Univ. Press, Cambridge MR2866670 (2012i:37010).
13. Rabiner, LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 77, 257–286 (1989).
14. Ruján, P: Calculation of the free energy of Ising systems by a recursion method. Physica A: Stat Theoretical Phys. 91(3–4), 549–562 (1978).
15. Verbitskiy, EA: Thermodynamics of Hidden Markov Processes. In: Markus, B, Petersen, K, Weissman, T (eds.) *Papers from the Banff International Research Station Workshop on Entropy of Hidden Markov Processes and Connections to Dynamical Systems*, pp. 258–272. London Mathematical Society, Lecture Note Series (2011). http://dx.doi.org/10.1017/CBO9780511819407.010.
16. Weissman, T, Ordentlich, E, Seroussi, G, Verdú, S, Weinberger, MJ: Universal discrete denoising: known channel. IEEE Trans. Inform. Theory. 51(1), 5–28 (2005). ISSN:0018-9448, MR2234569 (2008h:94036).
17. Yu, J, Verdú, S: Schemes for bidirectional modeling of discrete stationary sources. IEEE Trans. Inform. Theory. 52(11), 4789–4807 (2006). ISSN:0018-9448, MR2300356 (2007m:94144).
18. Zuk, O, Domany, E, Kanter, I, Aizenman, M: From Finite-System Entropy to Entropy Rate for a Hidden Markov Process. IEEE Sig Proc. Letters. 13(9), 517–520 (2006).