# Thermodynamics of the binary symmetric channel

## Abstract

We study a hidden Markov process which is the result of a transmission of the binary symmetric Markov source over the memoryless binary symmetric channel. This process has been studied extensively in Information Theory and is often used as a benchmark case for the so-called denoising algorithms. Exploiting the link between this process and the 1D Random Field Ising Model (RFIM), we are able to identify the Gibbs potential of the resulting Hidden Markov process. Moreover, we obtain a stronger bound on the memory decay rate. We conclude with a discussion on implications of our results for the development of denoising algorithms.

## Introduction

We study the transmission of a binary symmetric Markov source over the memoryless binary symmetric channel. More specifically, let $$\{X_{n}\}$$ be a stationary two-state Markov chain with values in $$\{-1,1\}$$, and

$$\mathbb{P}(X_{n+1}\ne X_{n}) =p,$$

where $$0<p<1$$; denote by $$P=(p_{x,x'}) =\left (\begin {array}{cc} 1-p & p \\ p& 1-p\end {array}\right)$$ the corresponding transition probability matrix. Note that $$\pi =\left (\frac {1}{2},\frac {1}{2}\right)$$ is the stationary initial distribution for this chain.

The binary symmetric channel will be modelled as a sequence of independent Bernoulli random variables $$\{Z_{n}\}$$ with

$$\mathbb{P}_{Z}(Z_{n}=-1)=\varepsilon,\quad \mathbb{P}_{Z}(Z_{n}=1)=1-\varepsilon,\quad \varepsilon \in (0,1).$$

Finally, put

$$Y_{n}=X_{n}\cdot Z_{n}$$
(1.1)

for all $$n$$. The process $$\{Y_{n}\}$$ is a Hidden Markov process, because $$Y_{n}\in \{-1,1\}$$ is chosen independently for each $$n$$ from an emission distribution $$\pi _{X_{n}}$$ on $$\{-1,1\}$$: $$\pi_{1}=(\varepsilon,1-\varepsilon)$$ and $$\pi_{-1}=(1-\varepsilon,\varepsilon)$$. More generally, Hidden Markov processes are random functions of discrete-time Markov chains, where the value $$Y_{n}$$ is chosen according to a distribution that depends on the value $$X_{n}=x_{n}$$ of the underlying Markov chain, independently for each $$n$$. The applications of Hidden Markov processes include automatic character and speech recognition, information theory, statistics and bioinformatics, see [4, 13]. The particular example (1.1) considered in the present paper is probably one of the simplest, and it is often used as a benchmark for testing algorithms.
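As an illustration of definition (1.1), the following minimal sketch samples the source and the channel output. The function name `simulate_bsc`, the seed, and the parameter values are assumptions made for this example only; they do not come from the paper.

```python
import random

def simulate_bsc(n, p, eps, seed=0):
    """Sample (x, y) of length n: x is the Markov source, y = x * z the channel output."""
    rng = random.Random(seed)
    x = [rng.choice((-1, 1))]                 # stationary start: uniform on {-1, +1}
    for _ in range(n - 1):
        flip = rng.random() < p               # X_{k+1} != X_k with probability p
        x.append(-x[-1] if flip else x[-1])
    # Z_k = -1 with probability eps (a bit flip), +1 otherwise
    z = [-1 if rng.random() < eps else 1 for _ in range(n)]
    y = [xk * zk for xk, zk in zip(x, z)]
    return x, y

x, y = simulate_bsc(100_000, p=0.2, eps=0.1)
flip_rate = sum(a != b for a, b in zip(x, x[1:])) / (len(x) - 1)
error_rate = sum(a != b for a, b in zip(x, y)) / len(x)
print(flip_rate, error_rate)                  # empirical estimates of p and eps
```

The two printed frequencies should be close to the chosen `p` and `eps`, since a source flip occurs with probability $$p$$ and an output differs from its input exactly when $$Z_{n}=-1$$.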

In particular, this example has been studied rather extensively in connection with the computation of the entropy of the output process $$\{Y_{n}\}$$, see e.g., [8–10, 12].

The law $$\mathbb {Q}$$ of the process $$\{Y_{n}\}$$ is the push-forward of $$\mathbb {P}\times \mathbb {P}_{Z}$$ under $$\psi : \{-1,1\}^{\mathbb {Z}}\times \{-1,1\}^{\mathbb {Z}}\to \{-1,1\}^{\mathbb {Z}}$$, with $$\psi((x_{n},z_{n}))=(x_{n}\cdot z_{n})$$. We write $$\mathbb {Q}=(\mathbb {P}\times \mathbb {P}_{Z})\circ \psi ^{-1}$$. For every $$m\le n$$, and $${y_{m}^{n}}:=(y_{m},\ldots, y_{n})\in \{-1,1\}^{n-m+1}$$, the measure of the corresponding cylindric set is given by

\begin{aligned} \mathbb{Q}\!\left({y_{m}^{n}}\right)&:=\mathbb{Q}(Y_{m}=y_{m},\ldots,Y_{n}=y_{n})\\ & =\!\sum_{{x_{m}^{n}},{z_{m}^{n}}\in\{-1,1\}^{n-m+1}}\! \mathbb{P}\left({x_{m}^{n}}\right)\mathbb{P}_{Z}\left({z_{m}^{n}}\right)\! \prod_{k=m}^{n}\! \mathbb I\left[ y_{k}\,=\,x_{k}\cdot z_{k}\right]\\ &=\!\sum_{{x_{m}^{n}}\in\{-1,1\}^{n-m+1}} \frac{1}{2} \prod_{i=m}^{n-1} p_{x_{i},x_{i+1}} \cdot \varepsilon^{\#\left\{i\in [m,n]: x_{i} y_{i}=-1\right\}}\\&\qquad\times(1-\varepsilon)^{\#\left\{i\in [m,n]: x_{i} y_{i}=1\right\}}. \end{aligned}
(1.2)

Two particular cases are easy to analyze. If $$p=\frac {1}{2}$$, then $$\{X_{n}\}$$ is a sequence of independent identically distributed random variables with $$\mathbb {P}(X_{n}=-1)=\mathbb {P}(X_{n}=+1)=\frac {1}{2}$$, and $$\{Y_{n}\}$$ has the same distribution. If $$\varepsilon =\frac {1}{2}$$, then the formula above implies that

\begin{aligned} \mathbb{Q}\left({y_{m}^{n}}\right) &= \sum_{{x_{m}^{n}}\in\{-1,1\}^{n-m+1}} \frac{1}{2} \prod_{i=m}^{n-1} p_{x_{i},x_{i+1}} \left(\frac{1}{2}\right)^{n-m+1} \\&= \left(\frac{1}{2}\right)^{n-m+1}, \end{aligned}

and hence again, $$\{Y_{n}\}$$ is a sequence of independent random variables with $$\mathbb {Q}\left (Y_{n}=-1\right)=\mathbb {Q}\left (Y_{n}=+1\right)=\frac {1}{2}$$.
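Formula (1.2) and the two degenerate cases can be checked by direct enumeration for short words. The sketch below is illustrative only: the helper name `cylinder_prob` and the test word are assumptions, and the cost grows as $$2^{n-m+1}$$, so it is useful only for small lengths.

```python
from itertools import product

def cylinder_prob(y, p, eps):
    """Q(y_m, ..., y_n) by direct summation over hidden paths x, as in (1.2)."""
    total = 0.0
    for x in product((-1, 1), repeat=len(y)):
        weight = 0.5                                   # stationary initial law
        for a, b in zip(x, x[1:]):
            weight *= p if a != b else 1 - p           # Markov transitions
        for a, b in zip(x, y):
            weight *= eps if a * b == -1 else 1 - eps  # channel emissions
        total += weight
    return total

y = (1, 1, -1, 1, -1)
print(cylinder_prob(y, p=0.2, eps=0.5), 0.5 ** 5)      # case eps = 1/2
print(cylinder_prob(y, p=0.5, eps=0.1), 0.5 ** 5)      # case p = 1/2
# The cylinder probabilities of all words of a fixed length add up to one.
print(sum(cylinder_prob(v, 0.2, 0.1) for v in product((-1, 1), repeat=5)))
```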

The paper is organized as follows. In Section 2 we exploit methods of Statistical Mechanics to derive expressions for probabilities of cylindric events (1.2). In Section 3, analyzing the derived expressions, we show that the measure $$\mathbb {Q}$$ has nice thermodynamic properties; in particular, it falls into the class of g-measures with exponential decay of variation (memory decay). We also obtain a novel estimate of the decay rate, which is stronger than estimates derived in previous works. In Section 4 we study two-sided conditional probabilities and show that $$\mathbb {Q}$$ is in fact a Gibbs state in the sense of Statistical Mechanics. We also discuss the well-known denoising algorithm DUDE, and suggest that the Gibbs property of $$\mathbb {Q}$$ explains why DUDE performs so well in this particular example. Furthermore, we argue that the development of denoising algorithms relying on thermodynamic Gibbs ideas can result in superior performance.

## Random field Ising model

It was observed in [18] that the probability $$\mathbb {Q}(y_{m},\ldots,y_{n})$$ of a cylindric event $$\{Y_{m}=y_{m},\ldots,Y_{n}=y_{n}\}$$, $$m\le n$$, can be expressed via a partition function of a random field Ising model. We exploit this observation further. Assume $$p,\varepsilon\in(0,1)$$, and put

$$J=\frac{1}{2}\log\frac{1-p}{p},\quad K=\frac{1}{2}\log \frac {1-\varepsilon}{\varepsilon}.$$

Then for any $$(y_{m},\ldots,y_{n})\in\{-1,1\}^{n-m+1}$$, the expression for the cylinder probability (1.2) can be rewritten as

\begin{aligned} \mathbb{Q}(y_{m},\ldots,y_{n})=&\frac{c_{J}}{\lambda_{J,K}^{n-m+1}} \sum_{{x_{m}^{n}}\in\{-1,1\}^{n-m+1}}\\&\exp\left(J\sum_{i=m}^{n-1} x_{i}x_{i+1} +K\sum_{i=m}^{n} x_{i} y_{i}\right), \end{aligned}

where

\begin{aligned} c_{J} &= {\cosh(J)},\quad \lambda_{J,K} =2\left(\cosh(J+K)+ \cosh(J-K)\right)\\&=4\cosh(J)\cosh(K). \\ \end{aligned}

The non-trivial part of the cylinder probability, the sum over all hidden configurations $$(x_{m},\ldots,x_{n})$$,

$$\mathsf{Z}_{m,n}\left({y_{m}^{n}}\right):= \sum_{{x_{m}^{n}}\in\{-1,1\}^{n-m+1}}\exp\left(J\sum_{i=m}^{n-1} x_{i}x_{i+1} +K\sum_{i=m}^{n} x_{i} y_{i}\right)$$

is in fact the partition function of the Ising model with the signs of the external random fields given by the $$y_{i}$$'s. Applying the recursive method of [14], the partition function can be evaluated in the following fashion [1]. Consider the following functions

\begin{aligned} A(w) &=\frac{1}{2} \log \frac{\cosh(w+J)}{\cosh(w-J)},\\ B(w) &=\frac{1}{2} \log\left[4\cdot{\cosh(w+J)}{\cosh(w-J)}\right]\\&= \frac{1}{2} \log \left[e^{2w}+e^{-2w}+e^{2J}+e^{-2J}\right]. \end{aligned}

One readily checks that if s=±1, then for all $$w\in \mathbb {R}$$

$$\exp\left(s A(w)+B(w)\right) =2\cosh(w+s J).$$
(2.1)

Now the partition function can be evaluated by summing over the right-most spin. Namely, suppose $$m<n$$ and $${y_{m}^{n}}\in \{-1,1\}^{n-m+1}$$; then

\begin{aligned} \mathsf{Z}_{m,n}\left({y_{m}^{n}}\right) &=\sum_{x_{m}^{n-1}\in\{-1,1\}^{n-m}} \exp\left(J\sum_{i=m}^{n-2} x_{i}x_{i+1}+ K\sum_{i=m}^{n-1} x_{i} y_{i}\right)\\&\quad\times\sum_{x_{n}\in\{-1,1\}} e^{x_{n}(Jx_{n-1}+Ky_{n})}\\ &=\sum_{x_{m}^{n-1}\in\{-1,1\}^{n-m}} \exp\left(J\sum_{i=m}^{n-2} x_{i}x_{i+1}+ K\sum_{i=m}^{n-1} x_{i} y_{i}\right)\\&\quad\times 2\cosh(Jx_{n-1}+Ky_{n})\\ &=\sum_{x_{m}^{n-1}\in\{-1,1\}^{n-m}} \exp\left(J\sum_{i=m}^{n-2} x_{i}x_{i+1}+ K\sum_{i=m}^{n-1} x_{i} y_{i}\right)\\&\quad\times \exp\left(x_{n-1}A\left(w_{n}^{(n)}\right)+B\left(w_{n}^{(n)}\right)\right), \end{aligned}

where

$$w_{n}^{(n)} = Ky_{n}.$$

Hence,

\begin{aligned} \mathsf{Z}_{m,n}\left({y_{m}^{n}}\right) &=\sum_{x_{m}^{n-1}\in\{-1,1\}^{n-m}} \exp\Bigg(J\sum_{i=m}^{n-2} x_{i}x_{i+1}+ K\sum_{i=m}^{n-2} x_{i} y_{i}\\&\quad +x_{n-1}\Big(\underbrace{K y_{n-1}+A\big(w_{n}^{(n)}\big)}_{w_{n-1}^{(n)}}\Big)\Bigg) \times\exp\left(B\left(w_{n}^{(n)}\right)\right), \end{aligned}

and thus the new sum has exactly the same form, but instead of $$Ky_{n-1}$$, we now have $$w_{n-1}^{(n)}=Ky_{n-1}+A\left (w_{n}^{(n)}\right)$$. Continuing the summation over the remaining right-most $$x$$-spins, one gets

\begin{aligned} \mathsf{Z}_{m,n}\left({y_{m}^{n}}\right)&=2\cosh\left(w_{m}^{(n)}\right)\exp\left(\sum_{i=m+1}^{n} B\left(w_{i}^{(n)}\right)\right), \end{aligned}

where

$$w_{i}^{(n)}=Ky_{i}+A\left(w_{i+1}^{(n)}\right)\quad\text{for every}~i<n,$$

equivalently, since A(0)=0, we can define

$$w_{i}^{(n)}=0 \quad\forall i>n, \ \text{and }\ w_{i}^{(n)}=Ky_{i}+A\left(w_{i+1}^{(n)}\right)\quad\forall i\le n.$$

Therefore, we obtain the following expressions for the cylinder and conditional probabilities

\begin{aligned} \mathbb{Q}\left({y_{0}^{n}}\right) &=\frac {2c_{J}}{\lambda^{n+1}_{J,K}} \cosh\left(w_{0}^{(n)}\right)\exp\left(\sum_{i=1}^{n} B\left(w_{i}^{(n)}\right)\right),\\ \mathbb{Q}\left(y_{0}|{y_{1}^{n}}\right) &=\frac 1{\lambda_{J,K}}\frac{\cosh\left(w_{0}^{(n)}\right)\exp\left(B\left(w_{1}^{(n)}\right)\right)}{\cosh\left(w_{1}^{(n)}\right)}. \end{aligned}
(2.2)
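The recursion of this section replaces the exponential sum over hidden configurations by a single right-to-left pass. The following sketch, with hypothetical function names and arbitrary parameter values, evaluates $$\mathbb{Q}=c_{J}\lambda_{J,K}^{-(n-m+1)}\,\mathsf{Z}_{m,n}$$ using $$\mathsf{Z}_{m,n}=2\cosh\big(w_{m}^{(n)}\big)\exp\big(\sum_{i=m+1}^{n}B\big(w_{i}^{(n)}\big)\big)$$ and compares the result with brute-force enumeration of (1.2).

```python
import math
from itertools import product

def A(w, J):
    return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))

def B(w, J):
    return 0.5 * math.log(4.0 * math.cosh(w + J) * math.cosh(w - J))

def cylinder_prob_recursive(y, p, eps):
    """Q(y) = c_J / lambda^N * Z(y), with Z evaluated by the right-to-left recursion."""
    J = 0.5 * math.log((1 - p) / p)
    K = 0.5 * math.log((1 - eps) / eps)
    w = K * y[-1]                        # w_n^{(n)} = K y_n
    log_z = 0.0
    for yi in reversed(y[:-1]):          # i = n-1, ..., m
        log_z += B(w, J)                 # collect B(w_{i+1}^{(n)})
        w = K * yi + A(w, J)             # w_i^{(n)} = K y_i + A(w_{i+1}^{(n)})
    log_z += math.log(2.0 * math.cosh(w))
    log_lam = math.log(4.0 * math.cosh(J) * math.cosh(K))
    return math.cosh(J) * math.exp(log_z - len(y) * log_lam)

def cylinder_prob_brute(y, p, eps):
    """Direct summation over all hidden paths, as in (1.2)."""
    total = 0.0
    for x in product((-1, 1), repeat=len(y)):
        weight = 0.5
        for a, b in zip(x, x[1:]):
            weight *= p if a != b else 1 - p
        for a, b in zip(x, y):
            weight *= eps if a != b else 1 - eps
        total += weight
    return total

p, eps = 0.2, 0.1
max_gap = max(
    abs(cylinder_prob_recursive(y, p, eps) - cylinder_prob_brute(y, p, eps))
    for y in product((-1, 1), repeat=6)
)
print(max_gap)    # largest discrepancy over all words of length 6
```

The recursive evaluation costs a number of operations linear in the word length, whereas the brute-force sum has $$2^{n-m+1}$$ terms.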

## Thermodynamic formalism

Let $$\Omega =A^{\mathbb Z_{+}}$$, where $$A$$ is a finite alphabet, be the space of one-sided infinite sequences $$\boldsymbol{\omega}=(\omega_{0},\omega_{1},\ldots)$$ in the alphabet $$A$$ ($$\omega_{i}\in A$$ for all $$i$$). We equip $$\Omega$$ with the metric

$$d(\boldsymbol{\omega},\boldsymbol{\tilde{\omega}})=2^{-k(\boldsymbol{\omega},\boldsymbol{\tilde{\omega}})},$$

where $$k\left (\boldsymbol {\omega },\boldsymbol {\tilde {\omega }}\right)=1$$ if $$\omega _{0}\ne \tilde \omega _{0}$$, and $$k\left (\boldsymbol {\omega },\boldsymbol {\tilde {\omega }}\right) = \max \{k\in \mathbb N: \omega _{i}=\tilde \omega _{i}\ \quad \forall i=0,\ldots,k-1\}$$, otherwise. Denote by $$S:\Omega\to\Omega$$ the left shift:

$$(S\boldsymbol{\omega})_{i} = \omega_{i+1}~\text{for all}~i\in\mathbb{Z}_{+}.$$

A Borel probability measure $$\mathbb {P}$$ is translation invariant if

$$\mathbb{P}(S^{-1}C) =\mathbb{P}(C)$$

for any Borel event $$C\subseteq\Omega$$.

Let us recall the following well-known definitions:

### Definition 3.1.

Suppose $$\mathbb {P}$$ is a fully supported translation invariant measure on $$\Omega =A^{\mathbb Z_{+}}$$, where A is a finite alphabet.

(i) The measure $$\mathbb {P}$$ is called a g-measure, if for some positive continuous function $$g:\Omega\to(0,1)$$ satisfying the normalization condition

$$\sum_{{\bar\omega}_{0}\in A} g\left({\bar\omega}_{0},\omega_{1},\omega_{2},\ldots\right)=1$$

for all $$\boldsymbol{\omega}=(\omega_{0},\omega_{1},\ldots)\in\Omega$$, one has

$$\mathbb{P}(\omega_{0}|\omega_{1},\omega_{2},\ldots) = g(\omega_{0},\omega_{1},\ldots)$$

for $$\mathbb {P}$$-a.a. $$\boldsymbol{\omega}\in\Omega$$.

(ii) The measure $$\mathbb {P}$$ is Bowen-Gibbs for a continuous potential $$\phi :\Omega \to \mathbb {R}$$, if there exist constants $$P\in \mathbb {R}$$ and $$C\ge 1$$ such that for all $$\boldsymbol{\omega}\in\Omega$$ and every $$n\in \mathbb N$$

$$\frac{1}{C} \le \frac{\mathbb{P}(\{\tilde{\boldsymbol{\omega}}\in\Omega:\, \, {\tilde\omega}_{0}=\omega_{0},\ldots {\tilde\omega}_{n-1}=\omega_{n-1}\})} {\exp\left((S_{n}\phi)(\boldsymbol{\omega})-nP \right)}\le C,$$

where $$(S_{n}\phi)(\boldsymbol {\omega })=\sum _{k=0}^{n-1} \phi (S^{k} \boldsymbol {\omega })$$.

(iii) The measure $$\mathbb {P}$$ is called an equilibrium state for a continuous potential $$\phi :\Omega \to \mathbb {R}$$, if $$\mathbb {P}$$ attains the maximum of the following functional

$$h(\mathbb{P})+\int \phi \, d\mathbb{P} = \sup_{\tilde{\mathbb{P}}\in\mathcal{M}_{1}^{\ast}(\Omega)} \left[h(\tilde{\mathbb{P}})+\int \phi \, d\tilde{\mathbb{P}} \right],$$
(3.1)

where $$h(\mathbb {P})$$ is the Kolmogorov-Sinai entropy of $$\mathbb {P}$$ and the supremum is taken over the set $$\mathcal M_{1}^{*}(\Omega)$$ of all translation invariant Borel probability measures on Ω.

It is known that every g-measure $$\mathbb {P}$$ is also an equilibrium state for $$\log g$$, and that every Bowen-Gibbs measure $$\mathbb {P}$$ for a potential $$\phi$$ is an equilibrium state for $$\phi$$ as well.

### Theorem 3.1.

Suppose $$p,\varepsilon\in(0,1)$$. Then the measure $$\mathbb {Q}=\mathbb {Q}_{p,\varepsilon }$$ on $$\{-1,1\}^{\mathbb {Z}_{+}}$$ (cf. (2.2)) is a g-measure. Moreover, the corresponding function $$g$$ has exponential decay of variation: define the $$n$$-th variation of $$g$$ by

$$\text{var}_{n}(g) := \sup_{\boldsymbol{y},\boldsymbol{\tilde{y}}: y_{0}^{n-1}={\tilde y}_{0}^{n-1}} \bigl| g(\boldsymbol{y})-g(\boldsymbol{\tilde{y}})\bigr|,\quad n\ge 1,$$

then

$$\rho(p,\varepsilon) =\limsup_{n\to\infty} \left(\text{var}_{n}(g)\right)^{\frac 1n} <1.$$
(3.2)

Furthermore, ρ(p,ε)=0 if $$p=\frac {1}{2}$$ or $$\varepsilon =\frac {1}{2}$$; for all $$p\ne \frac {1}{2}$$

$$\rho(p,\varepsilon)< |1-2p|.$$
(3.3)

Finally, the measure $$\mathbb {Q}$$ is also a Bowen-Gibbs measure for a Hölder continuous potential $$\phi :\{-1,1\}^{\mathbb {Z}_{+}}\to \mathbb R$$.

The result of Theorem 3.1 is actually true in much greater generality: namely, for distributions of functions of Markov chains $$\{Y_{n}\}$$, where the underlying Markov chain $$\{X_{n}\}$$ has a strictly positive transition probability matrix $$P$$; see [15] for a review of several results of this nature. However, the example considered in the present paper is rather exceptional, since one is able to identify the g-function and the Gibbs potential $$\phi$$ explicitly. Another interesting question is the estimate of the decay rate $$\rho$$. In [15] a number of previously known estimates of the rate of exponential decay in (3.3) have been compared; the best known estimate for $$\rho$$

$$\rho\le |1-2p|$$

is due to [8] and [7]. Quite surprisingly, this estimate does not depend on $$\varepsilon$$, and in fact, it was conjectured in [15] that the estimate could be improved, e.g., by incorporating dependence on $$\varepsilon$$. The proof of Theorem 3.1 shows that this is indeed the case, and one obtains the sharper estimate (3.3).

Let us start by introducing some notation and proving a technical result. Suppose $$p,\varepsilon\in(0,1)$$, $$p\ne \frac {1}{2}$$ and $$\varepsilon \ne \frac {1}{2}$$. Fix $$\boldsymbol {y}=(y_{0},y_{1},\ldots)\in \{-1,1\}^{\mathbb {Z}_{+}}$$. For every $$n\in \mathbb {Z}_{+}$$, define the sequence $$w^{(n)}_{i}=w^{(n)}_{i}(\boldsymbol {y})$$, $$i\in \mathbb {Z}_{+}$$, as follows:

\begin{aligned} w^{(n)}_{i}&=0,\text{for every }i\ge n+1,\\ w_{i}^{(n)}&=Ky_{i}+A\left(w_{i+1}^{(n)}\right),\text{ for }i\le n. \end{aligned}

If we introduce maps $$F_{-1},F_{1}:\mathbb {R}\to \mathbb {R}$$, given by

$$F_{-1}(w) = -K+A(w),\quad F_{1}(w) = K+A(w),$$

then for $$i\le n$$,

\begin{aligned} w_{i}^{(n)}=w_{i}^{(n)}(\boldsymbol{y}) &= F_{y_{i}}(F_{y_{i+1}}(\cdots (F_{y_{n}}(0))\cdots))\\&=F_{y_{i}}\circ F_{y_{i+1}}\circ\cdots F_{y_{n}}(0). \end{aligned}
(3.4)

As we will show, the maps $$F_{-1}$$, $$F_{1}$$ are strict contractions, and as a result, for every $$i$$, the sequence $$\left \{w_{i}^{(n)}\right \}$$ converges as $$n\to\infty$$; in fact, with exponential speed.

### Lemma 3.2.

For every $$i\in \mathbb {Z}_{+}$$ and all $$\boldsymbol {y}\in \{-1,1\}^{\mathbb {Z}_{+}}$$ one has

$${\lim}_{n\to\infty} w_{i}^{(n)}(\boldsymbol{y}) =:w_{i}(\boldsymbol{y}).$$

Moreover, there exist constants $$\varrho\in(0,1)$$ and $$C>0$$, both independent of $$\boldsymbol{y}$$, such that

$$\left|w_{i}^{(n)}(\boldsymbol{y})-w_{i}(\boldsymbol{y})\right| \le C\varrho^{n-i}$$
(3.5)

for all $$n\ge i$$. Furthermore, $$w_{i}(\boldsymbol{y})=w_{0}(S^{i}\boldsymbol{y})$$ for all $$i\in \mathbb {Z}_{+}$$ and $$\boldsymbol{y}$$, and $$w_{0}:\{-1,1\}^{\mathbb {Z}_{+}}\to \mathbb {R}$$ (and hence every $$w_{i}$$) is Hölder continuous:

$$| w_{0}(\boldsymbol{y})-w_{0}(\boldsymbol{\tilde{y}})|\le C' \left(d(\boldsymbol{y},\boldsymbol{\tilde{y}})\right)^{\theta}$$

for some $$C',\theta>0$$ and all $$\boldsymbol {y},\boldsymbol {\tilde {y}}\in \{-1,1\}^{\mathbb {Z}_{+}}$$.

### Proof.

Suppose $$i\le n\le m$$. Then

\begin{aligned} \left|w_{i}^{(n)}-w_{i}^{(m)}\right| &= \left| A\left(w_{i+1}^{(n)}\right)-A\left(w_{i+1}^{(m)}\right)\right|\\& \le\left| w_{i+1}^{(n)}-w_{i+1}^{(m)}\right| \cdot \sup_{w}\left|\frac {dA}{dw} \right|. \end{aligned}

One has

$$\frac {dA}{dw}=\frac {\sinh(2J)}{\cosh(2J)+\cosh(2w)},$$

and hence

$$\varrho:=\sup_{w} \left|\frac {dA}{dw}\right|=\left|\frac {\sinh(2J)}{\cosh(2J)+1}\right| = |1-2p|<1.$$
(3.6)

Combined with the fact that for all $$i\in \mathbb {Z}_{+}$$

\begin{aligned} \left|w_{i}^{(m)}\right|&=\left|Ky_{i}\,+\,A\left(w_{i+1}^{(m)}\right)\right| \le |K| +|\text{arctanh}(1-2p)|\\& \le |K|+|J|=:C_{1}, \end{aligned}

one therefore concludes that for $$i\le n\le m$$

\begin{aligned} \left|w_{i}^{(n)}-w_{i}^{(m)}\right|&\le \varrho^{n-i+1} \left|w_{n+1}^{(n)}-w_{n+1}^{(m)}\right| = \varrho^{n-i+1} \left|w_{n+1}^{(m)}\right| \\&\le C_{1}\varrho^{n-i+1}. \end{aligned}

Hence, $${\lim }_{n\to \infty } w_{i}^{(n)}=:w_{i}$$ exists and

\begin{aligned} \left|w_{i}^{(n)}-w_{i}\right|&\le \sum_{m=n}^{\infty} \left|w_{i}^{(m)}-w_{i}^{(m+1)}\right|\le C_{1}\sum_{m=n}^{\infty} \varrho^{m-i+1}\\&= \frac {C_{1}}{1-\varrho} \varrho^{n-i+1}=:C\varrho^{n-i+1}. \end{aligned}

From representation (3.4) it is clear that for $$n\ge i$$,

$$w_{i}^{(n)}(\boldsymbol{y}) = w_{0}^{(n-i)}(S^{i} \boldsymbol{y}),$$

and hence $$w_{i}(\boldsymbol{y})=w_{0}(S^{i}\boldsymbol{y})$$.

Suppose $$\boldsymbol {y}=(y_{0},y_{1},\cdots),\boldsymbol {\tilde {y}}=(\tilde y_{0},\tilde y_{1},\cdots)\in \{-1,1\}^{\mathbb {Z}_{+}}$$ are such that $$d(\boldsymbol {y},\boldsymbol {\tilde {y}})=2^{-k}$$ for some $$k\in \mathbb N$$, i.e., $$y_{0}= \tilde y_{0}$$, …, $$y_{k-1}=\tilde y_{k-1}$$. Then

\begin{aligned} |w_{0}(\boldsymbol{y})-w_{0}(\boldsymbol{\tilde{y}})| &=\left| F_{y_{0}}\circ\ldots\circ F_{y_{k-1}}(w_{k}(\boldsymbol{y}))- F_{y_{0}}\circ\ldots\circ F_{y_{k-1}}(w_{k}(\boldsymbol{\tilde{y}}))\right|\\ &\le \sup_{w\in \mathbb{R}} \left| (F_{y_{0}}\circ\ldots\circ F_{y_{k-1}})'(w) \right|\cdot |w_{k}(\boldsymbol{y})-w_{k}(\boldsymbol{\tilde{y}})|\\ &\le \left(\sup_{w\in \mathbb{R}} |A'(w)|\right)^{k} \cdot (2C_{1})= 2C_{1}\varrho^{k}, \end{aligned}

and hence $$w_{0}:\{-1,1\}^{\mathbb {Z}_{+}}\to \mathbb {R}$$ is Hölder continuous. □
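The exponential convergence established in the proof can be observed numerically. The sketch below is an illustration under assumed parameter values and a pseudo-random sequence $$\boldsymbol{y}$$; the helper name `w0_truncated` is hypothetical, and a long truncation ($$n=199$$) stands in for the limit $$w_{0}(\boldsymbol{y})$$. The errors are compared with the bound $$C\varrho^{n+1}$$, where $$\varrho=|1-2p|$$ and $$C=(|K|+|J|)/(1-\varrho)$$ as in the proof.

```python
import math
import random

def A(w, J):
    return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))

def w0_truncated(y, n, J, K):
    """w_0^{(n)}(y): start from w_{n+1}^{(n)} = 0 and iterate down to i = 0."""
    w = 0.0
    for i in range(n, -1, -1):               # i = n, n-1, ..., 0
        w = K * y[i] + A(w, J)
    return w

p, eps = 0.2, 0.1
J = 0.5 * math.log((1 - p) / p)
K = 0.5 * math.log((1 - eps) / eps)
rho = abs(1 - 2 * p)                         # contraction rate from (3.6)
C = (abs(K) + abs(J)) / (1 - rho)            # constant from the proof

rng = random.Random(1)
y = [rng.choice((-1, 1)) for _ in range(200)]
w_limit = w0_truncated(y, 199, J, K)         # numerical proxy for w_0(y)
errors = [abs(w0_truncated(y, n, J, K) - w_limit) for n in range(30)]
for n in (0, 5, 10, 20):
    print(n, errors[n], C * rho ** (n + 1))  # observed error next to the bound
```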

The estimate of the contraction rate in the Lemma above can be improved. If $$p=\frac {1}{2}$$, then $$A(w)\equiv 0$$, and hence $$w_{i}^{(n)}\equiv Ky_{i}$$ for all $$i\le n$$. We may therefore assume that $$p\ne \frac {1}{2}$$. Let us also assume that $$\varepsilon \ne \frac {1}{2}$$, i.e., $$K\ne 0$$.

Let us now consider second iterations:

\begin{aligned} \left|w_{i}^{(n)}-w_{i}^{(m)}\right| &=\left| A\left(w_{i+1}^{(n)}\right)-A\left(w_{i+1}^{(m)}\right)\right|=\left| A\left(Ky_{i+1}+A\left (w_{i+2}^{(n)}\right)\right)\right.\\&\left.\quad-\, A\left(Ky_{i+1}+A\left(w_{i+2}^{(m)}\right)\right)\right|\\ &\le \left(\sup_{w} |A'(K+A(w)) A'(w)|\right)\left|w_{i+2}^{(n)}-w_{i+2}^{(m)}\right|\\& =:\rho^{(2)} \left|w_{i+2}^{(n)}-w_{i+2}^{(m)}\right|. \end{aligned}

We are going to show that for $$p\ne \frac {1}{2}$$ and all $$\varepsilon \ne \frac {1}{2}$$, one has

$$\rho^{(2)}=\sup_{w} |A'(K+A(w)) A'(w)| <(1-2p)^{2}.$$
(3.7)

Firstly, note that

\begin{aligned} &|A'(K+A(w)) A'(w)|\\ &=\frac{\sinh^{2}(2J)}{(\cosh(2J)+\cosh(2w))(\cosh(2J)+\cosh(2K+2A(w)))}\\ & = \frac {(1-2p)^{2}} {\left(\alpha+(1-\alpha)\cosh(2K +2A(w))\right) \cdot \left(\alpha+(1-\alpha)\cosh(2w)\right)}, \end{aligned}
(3.8)

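where $$\alpha=(1-p)^{2}+p^{2}$$ and $$1-\alpha=2p(1-p)$$. Let $$\Delta>0$$ be sufficiently small so that for all $$w\in[-\Delta,\Delta]$$ one has $$\cosh(2K+2A(w))>\cosh(K)$$; such a $$\Delta$$ exists because $$A$$ is continuous, $$A(0)=0$$ and $$K\ne 0$$. Hence for all $$w\in[-\Delta,\Delta]$$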

\begin{aligned} |A'(K+A(w)) A'(w)| \le \frac {(1-2p)^{2}} {\left(\alpha+(1-\alpha)\cosh(K)\right)\cdot 1}<(1-2p)^{2}. \end{aligned}

For $$w\notin[-\Delta,\Delta]$$, one has $$\cosh(2w)>\cosh(\Delta)$$, and therefore

\begin{aligned} |A'(K+A(w)) A'(w)| &\le \frac {(1-2p)^{2}} {1\cdot \left(\alpha+(1-\alpha)\cosh(\Delta)\right)}\\&<(1-2p)^{2}. \end{aligned}

Hence,

\begin{aligned} \rho^{(2)} &\le\max\left\{\frac {(1-2p)^{2}} {\alpha+(1-\alpha)\cosh(K)}, \frac {(1-2p)^{2}} {\alpha+(1-\alpha)\cosh(\Delta)}\right \}\\&<(1-2p)^{2}, \end{aligned}

and thus (3.5) holds for $$\bar \varrho =\sqrt {\rho ^{(2)}}<|1-2p|$$ and some constant $$\tilde C>0$$. In particular, we are now able to conclude that

\begin{aligned} &|w_{0}(\boldsymbol{y})-w_{0}(\boldsymbol{\tilde{y}})| \le C_{2} \bar\varrho^{k(\boldsymbol{y},\boldsymbol{\tilde{y}})}= C_{2} \left(d(\boldsymbol{y},\boldsymbol{\tilde{y}})\right)^{\theta},\\&\theta =-\log_{2}\, \bar\varrho>0. \end{aligned}
(3.9)

Even sharper bounds can be achieved by studying the minimum of the denominator in (3.8) or higher iterates of the maps $$F_{\pm 1}$$.
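To get a feeling for the size of the improvement in (3.7), one can approximate $$\rho^{(2)}$$ on a grid. This is only a rough numerical sketch: the function name `rho2_on_grid`, the grid range and resolution, and the parameter values are assumptions, and a grid maximum is a lower approximation of the true supremum.

```python
import math

def A(w, J):
    return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))

def dA(w, J):
    """Derivative of A, i.e. sinh(2J) / (cosh(2J) + cosh(2w))."""
    return math.sinh(2 * J) / (math.cosh(2 * J) + math.cosh(2 * w))

def rho2_on_grid(p, eps, half_width=10.0, steps=20001):
    """Grid approximation of sup_w |A'(K + A(w)) A'(w)| from (3.7)."""
    J = 0.5 * math.log((1 - p) / p)
    K = 0.5 * math.log((1 - eps) / eps)
    best = 0.0
    for k in range(steps):
        w = -half_width + 2.0 * half_width * k / (steps - 1)
        best = max(best, abs(dA(K + A(w, J), J) * dA(w, J)))
    return best

p = 0.2
for eps in (0.05, 0.1, 0.3, 0.45):
    rate = math.sqrt(rho2_on_grid(p, eps))
    print(eps, rate, abs(1 - 2 * p))     # improved rate next to |1 - 2p|
```

In this experiment the approximate rate $$\sqrt{\rho^{(2)}}$$ stays below $$|1-2p|$$ for every tested $$\varepsilon\ne\frac{1}{2}$$, in line with (3.7).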

### Proof of Theorem 3.1.

The cases $$p=\frac {1}{2}$$ or $$\varepsilon =\frac {1}{2}$$ are obvious: in these cases, $$\mathbb {Q}$$ is the Bernoulli $$\left (\frac {1}{2}, \frac {1}{2}\right)$$-measure on $$\{-1,1\}^{\mathbb {Z}_{+}}$$, and hence ρ(p,ε)=0. Thus we will assume that $$p,\varepsilon \ne \frac {1}{2}$$.

To show that $$\mathbb {Q}$$ is a g-measure it is sufficient to show that the conditional probabilities $$\mathbb {Q}\left (y_{0}|{y_{1}^{n}}\right)$$ converge uniformly as $$n\to\infty$$. Given that

$$\mathbb{Q}\left(y_{0}|{y_{1}^{n}}\right)=\frac 1{\lambda_{J,K}}\frac{\cosh\left(w_{0}^{(n)}\right)\exp\left(B\left(w_{1}^{(n)}\right)\right)}{\cosh\left(w_{1}^{(n)}\right)},$$
(3.10)

and using the result of Lemma 3.2, $$w_{i}^{(n)}(\boldsymbol {y}) \rightrightarrows w_{i}(\boldsymbol {y})$$ as $$n\to\infty$$, we obtain uniform convergence of the conditional probabilities, and hence $$\mathbb {Q}$$ is a g-measure with $$g$$ given by

$$g(\boldsymbol{y}) =\frac 1{\lambda_{J,K}}\frac{\cosh(w_{0}(\boldsymbol{y}))\exp\left(B(w_{1}(\boldsymbol{y}))\right)}{\cosh\left(w_{1}(\boldsymbol{y})\right)}.$$
(3.11)
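In practice $$g$$ can be approximated by the finite-$$n$$ expression (3.10). The following sketch uses a hypothetical helper `g_truncated` and an arbitrary test sequence; it checks the normalization condition from Definition 3.1 and shows how quickly the values stabilise as $$n$$ grows.

```python
import math

def A(w, J):
    return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))

def B(w, J):
    return 0.5 * math.log(4.0 * math.cosh(w + J) * math.cosh(w - J))

def g_truncated(y, n, p, eps):
    """Q(y_0 | y_1^n) from (3.10), used as an approximation of g(y)."""
    J = 0.5 * math.log((1 - p) / p)
    K = 0.5 * math.log((1 - eps) / eps)
    w1 = 0.0
    for i in range(n, 0, -1):                # i = n, ..., 1 gives w_1^{(n)}
        w1 = K * y[i] + A(w1, J)
    w0 = K * y[0] + A(w1, J)                 # w_0^{(n)}
    lam = 4.0 * math.cosh(J) * math.cosh(K)
    return math.cosh(w0) * math.exp(B(w1, J)) / (lam * math.cosh(w1))

p, eps = 0.2, 0.1
y = [1, -1, -1, 1, 1, 1, -1, 1] * 10         # an arbitrary sequence of length 80
y_flipped = [-y[0]] + y[1:]                  # same future, opposite y_0
for n in (1, 5, 20, 60):
    value = g_truncated(y, n, p, eps)
    print(n, value, value + g_truncated(y_flipped, n, p, eps))
```

The second printed column should stabilise after a few steps, and the third column should equal one up to rounding, reflecting the normalization of $$g$$.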

Taking into account that $$w_{0}$$ and $$w_{1}=w_{0}\circ S$$ are Hölder continuous functions satisfying (3.9), and that $$\cosh$$, $$\exp$$, and $$B$$ are smooth functions, we can conclude that $$g$$ is also Hölder continuous with the same decay of variation:

\begin{aligned} \mathsf{var}_{n}(g)&=\sup_{\boldsymbol{y},\boldsymbol{\tilde{y}}: y_{0}^{n-1}=\tilde y_{0}^{n-1}}|g(\boldsymbol{y})-g(\boldsymbol{\tilde{y}})|\\& \le C_{3} |w_{1}(\boldsymbol{y})-w_{1}(\boldsymbol{\tilde{y}})| \le C_{4}\bar{\varrho}^{n-1}, \end{aligned}

for some $$C_{3}>0$$ ($$C_{4}=C_{2}C_{3}$$, cf. (3.9)), and hence

$$\rho(p,\varepsilon)=\limsup_{n\to\infty} \left(\textsf{var}_{n}(g)\right)^{\frac 1n} \le \bar\varrho<|1-2p|.$$

Let us introduce the following functions: for $$\boldsymbol {y}\in \{-1,1\}^{\mathbb {Z}_{+}}$$, put

$$\phi(\boldsymbol{y}) = B(w_{0}(\boldsymbol{y})),\quad h(\boldsymbol{y}) =\cosh(w_{0}(\boldsymbol{y}))\exp\left(-B(w_{0}(\boldsymbol{y}))\right).$$

Taking into account that $$w_{1}(\boldsymbol{y})=w_{0}(S\boldsymbol{y})$$, one has

$$g(\boldsymbol{y}) =\frac {e^{\phi(\boldsymbol{y})}}{\lambda_{J,K}} \frac {h(\boldsymbol{y})} {h(S\boldsymbol{y})}.$$

Since every g-measure is also an equilibrium state for $$\log g$$, we conclude that $$\mathbb {Q}$$ is an equilibrium state for

$$\tilde \phi(\boldsymbol{y})=\phi(\boldsymbol{y}) +\log h(\boldsymbol{y})-\log h(S\boldsymbol{y}) -\log \lambda_{J,K}.$$

The difference $$\tilde \phi (\boldsymbol {y})-\phi (\boldsymbol {y})$$ has a very special form: it is the sum of a so-called coboundary $$\log h(\boldsymbol{y})-\log h(S\boldsymbol{y})$$ and a constant $$-\log\lambda_{J,K}$$. Two potentials whose difference is of such a form have identical sets of equilibrium states. The reason is that for any translation invariant measure $$\mathbb {Q}'$$ one has

\begin{aligned} &\int\left(\log h(\boldsymbol{y})-\log h(S\boldsymbol{y})-\log\lambda_{J,K}\right)d\mathbb{Q}' =-\log \lambda_{J,K}\\&\qquad=\text{const}. \end{aligned}

Therefore, if $$\mathbb {Q}'$$ achieves the maximum in the right-hand side of (3.1) for $$\tilde \phi$$, then $$\mathbb {Q}'$$ achieves the maximum for $$\phi$$ as well. Thus $$\mathbb {Q}$$ is also an equilibrium state for

$$\phi(\boldsymbol{y}) = B(w_{0}(\boldsymbol{y}))= \frac 12\log\left[ 4\sinh^{2}(w_{0}(\boldsymbol{y}))+ \frac 1{p(1-p)}\right].$$

Any equilibrium measure for a Hölder continuous potential $$\phi$$ is also a Bowen-Gibbs measure [3]. In our particular case, a direct proof of the Bowen-Gibbs property for $$\mathbb {Q}$$ is straightforward. Indeed, using (2.2) and the notation introduced above, for every $$\boldsymbol{y}=(y_{0},y_{1},\ldots)$$ one has

\begin{aligned} \mathbb{Q}\left({y_{0}^{n}}\right) &= \frac {2c_{J}}{\lambda^{n+1}_{J,K}} \exp\left(\sum_{i=1}^{n}B\left(w_{i}^{(n)}(\boldsymbol{y})\right)\right)\cosh\left(w_{0}^{(n)}(\boldsymbol{y})\right)\\ &=\frac {2c_{J}\cdot\cosh\left(w_{0}^{(n)}(\boldsymbol{y})\right)}{\exp(B(w_{0}(\boldsymbol{y})))} \exp\left(\sum_{i=1}^{n} \left[B\left(w_{i}^{(n)}(\boldsymbol{y})\right)- B(w_{i}(\boldsymbol{y}))\right]\right)\\ &\quad\times\exp\left(\sum_{i=0}^{n}B(w_{i}(\boldsymbol{y}))-(n+1)\log\lambda_{J,K}\right). \end{aligned}

Therefore, for $$P=\log\lambda_{J,K}$$,

\begin{aligned} \frac {\mathbb{Q}\left({y_{0}^{n}}\right)}{\exp\left((S_{n+1}\phi)(\boldsymbol{y})-(n+1)P \right)}&= \frac {2c_{J}\cdot\cosh\left(w_{0}^{(n)}(\boldsymbol{y})\right)}{\exp(B(w_{0}(\boldsymbol{y})))}\\&\quad\times \exp\left(\sum_{i=1}^{n} \left[B\left(w_{i}^{(n)}(\boldsymbol{y})\right)- B(w_{i}(\boldsymbol{y}))\right]\right). \end{aligned}

It only remains to demonstrate that the right-hand side is uniformly bounded (both in $$n$$ and $$\boldsymbol{y}=(y_{0},y_{1},\ldots)$$) from below and above by some positive constants $$\underline C,\overline C$$, respectively. Indeed, since $$p,\varepsilon\in(0,1)$$, $$I=[-|K|-|J|,|K|+|J|]$$ is a finite interval and, by the proof of Lemma 3.2, $$w_{i}^{(n)}(\boldsymbol {y})\in I$$ for all $$i$$ and $$n$$. Using (3.5), one readily checks that the following choice of constants suffices:

\begin{aligned} \overline C &= 2c_{J}\frac {\sup_{w\in I} \cosh(w)}{\inf_{w\in I} \exp(B(w))} \exp\left(\frac C{1-\varrho} \sup_{w\in I} \left|\frac {dB}{dw}\right| \right)<\infty,\\ \underline C&=2c_{J}\frac {\inf_{w\in I} \cosh(w)}{\sup_{w\in I} \exp(B(w))} \exp\left(-\frac C{1-\varrho} \sup_{w\in I} \left|\frac {dB}{dw}\right| \right)>0. \end{aligned}

We complete this section with a curious continued fraction representation of the g-function (3.11).

### Proposition 3.3.

For every $$\boldsymbol {y} =(y_{0},y_{1},\ldots)\in \{-1,1\}^{\mathbb {Z}_{+}}$$, one has

$$2g(\boldsymbol{y})= a_{1} -\frac{b_{1}}{a_{2}-\frac{b_{2}}{a_{3}-\frac{b_{3}}{a_{4}-\ldots}}}$$
(3.12)

where for i≥1

$$q_{i}=(1-2p)y_{i-1} y_{i},\quad a_{i}= 1+q_{i}, \quad b_{i}= 4\varepsilon (1-\varepsilon) q_{i}.$$

### Proof.

Using elementary transformations, one can show that for every $$\boldsymbol {y}=(y_{0},y_{1},\ldots)\in \{-1,1\}^{\mathbb {Z}_{+}}$$ one has

\begin{aligned} g(\boldsymbol{y})&= \frac 1{\lambda_{J,K}} \frac {\cosh(w_{0}(\boldsymbol{y}))}{\cosh(w_{1}(\boldsymbol{y}))} \exp\Bigl(B(w_{1}(\boldsymbol{y}))\Bigr)\\ &=\frac{1}{2} +\frac{1}{2} (1-2p)(1-2\varepsilon)y_{0}\tanh(w_{1}(\boldsymbol{y})).\\ \end{aligned}
(3.13)

Since for all $$w\in \mathbb {R}$$

\begin{aligned} \tanh(A(w)) =\tanh(J)\tanh(w)=(1-2p)\tanh(w), \end{aligned}

for every $$i\in \mathbb {Z}_{+}$$, one has

\begin{aligned} \tanh(w_{i})&=\frac {\tanh(Ky_{i})+\tanh(A(w_{i+1}))}{1+\tanh(Ky_{i})\cdot\tanh(A(w_{i+1}))} \\ &=\frac{(1-2\varepsilon)y_{i}+(1-2p)\tanh(w_{i+1})}{1+(1-2\varepsilon)(1-2p)y_{i}\tanh(w_{i+1})}\\ &=y_{i}\frac{(1-2\varepsilon)+(1-2p)y_{i}\tanh(w_{i+1})}{1+(1-2\varepsilon)(1-2p)y_{i}\tanh(w_{i+1})}. \end{aligned}

Therefore, if we let $$z_{i}=(1-2p)(1-2\varepsilon)y_{i-1}\tanh(w_{i})$$, $$i\in \mathbb N$$, then

\begin{aligned} z_{i}&=(1-2p)y_{i-1}y_{i} -\frac {4\varepsilon(1-\varepsilon)(1-2p)y_{i-1}y_{i}} {1+z_{i+1}}\\& =q_{i}-\frac{b_{i}}{1+z_{i+1}}. \end{aligned}

Since $$g(\boldsymbol {y}) =\frac {1}{2} +\frac {1}{2}z_{1}$$, we obtain the continued fraction expansion (3.12). □
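The continued fraction (3.12) is straightforward to evaluate bottom-up. In the sketch below (hypothetical function names, arbitrary parameters) the fraction is cut off after `depth` levels by setting $$z_{\text{depth}+1}=0$$, which corresponds to $$w_{\text{depth}+1}=0$$; the result is compared with formula (3.13) evaluated with the matching truncated field $$w_{1}^{(\text{depth})}$$.

```python
import math

def g_continued_fraction(y, p, eps, depth):
    """Evaluate (3.12) truncated after `depth` levels and return g(y)."""
    tail = 1.0                                   # 1 + z_{depth+1} with z_{depth+1} = 0
    for i in range(depth, 0, -1):
        q = (1 - 2 * p) * y[i - 1] * y[i]        # q_i
        # tail_i = a_i - b_i / tail_{i+1}, where tail_i = 1 + z_i
        tail = (1 + q) - 4 * eps * (1 - eps) * q / tail
    return 0.5 * tail

def g_from_fields(y, p, eps, depth):
    """Formula (3.13) with w_1 replaced by the truncation w_1^{(depth)}."""
    J = 0.5 * math.log((1 - p) / p)
    K = 0.5 * math.log((1 - eps) / eps)
    w1 = 0.0
    for i in range(depth, 0, -1):
        w1 = K * y[i] + 0.5 * math.log(math.cosh(w1 + J) / math.cosh(w1 - J))
    return 0.5 + 0.5 * (1 - 2 * p) * (1 - 2 * eps) * y[0] * math.tanh(w1)

p, eps = 0.2, 0.1
y = [1, 1, -1, 1, -1, -1, 1, 1] * 5              # arbitrary sequence of length 40
for depth in (1, 2, 5, 30):
    print(depth, g_continued_fraction(y, p, eps, depth), g_from_fields(y, p, eps, depth))
```

One appeal of the continued fraction form is that it involves only $$p$$, $$\varepsilon$$ and products of neighbouring symbols, with no hyperbolic functions.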

## Two-sided conditional probabilities and denoising

In the previous section we established that $$\mathbb {Q}$$ is a Bowen-Gibbs measure. The notion of a Gibbs measure originates in Statistical Mechanics, and is not equivalent to the Bowen-Gibbs definition. In Statistical Mechanics, one is interested in two-sided conditional probabilities

$$\mathbb{Q}\left(y_{0}|y_{-m}^{-1},{y_{1}^{n}}\right) \text{or } \mathbb{Q}(y_{0}|y_{<0},y_{>0}) := \mathbb{Q}\left(y_{0}|y_{-\infty}^{-1},y_{1}^{\infty}\right).$$

The method of Section 2 can be used to evaluate conditional probabilities $$\mathbb {Q}\left (y_{0}|y_{-m}^{-1},{y_{1}^{n}}\right)$$, m,n > 0 for $$\boldsymbol {y}=(\ldots,y_{-1},y_{0},y_{1},\ldots)\in \{-1,1\}^{\mathbb {Z}}$$. Indeed,

$$\mathbb{Q}\left(y_{0}|y_{-m}^{-1},{y_{1}^{n}}\right)= \frac{\mathbb{Q}\left(y_{-m}^{-1},y_{0},{y_{1}^{n}}\right)}{\mathbb{Q}\left(y_{-m}^{-1},y_{0},{y_{1}^{n}}\right)+\mathbb{Q}\left(y_{-m}^{-1},\bar y_{0},{y_{1}^{n}}\right)},$$

where $$\bar y_{0}=-y_{0}$$. We can evaluate

\begin{aligned} &\mathbb{Q}(y_{-m},\ldots,y_{-1},y_{0},y_{1},\ldots,y_{n})\\ & \qquad\quad\!\!\! = \frac {c_{J}}{\lambda_{J,K}^{n+m+1}} \sum_{x_{-m}^{n}\in\{-1,1\}^{n+m+1}} \exp\!\left(\! J\sum_{i=-m}^{n-1} x_{i}x_{i+1} +K\sum_{i=-m}^{n} x_{i} y_{i}\!\right)\\ &\qquad\quad\!\!\! =\frac {c_{J}}{\lambda_{J,K}^{n+m+1}} {\mathsf Z}_{-m,n}\left(y_{-m}^{n}\right), \end{aligned}

by first summing over spins on the right: $$x_{n},\ldots,x_{1}$$, and then summing over spins on the left: $$x_{-m},\ldots,x_{-1}$$. One has

\begin{aligned} {\mathsf Z}_{-m,n}\left(y_{-m}^{n}\right) &=\sum_{x_{-m},\ldots,x_{0}}\exp\left(J\sum_{i=-m}^{-1} x_{i}x_{i+1}\,+\, K\sum_{i=-m}^{0} x_{i} y_{i}+x_{0}A\left(w_{1}^{(n)}\right)\right)\\ &\quad\exp\left(\sum_{i=1}^{n} B\left(w_{i}^{(n)}\right)\right)\\ &=\exp\left(\sum_{j=-m}^{-1} B\left(w_{j}^{(-m)}\right)\right)2\cosh\left(w_{0}^{(-m,n)}\right)\\&\quad \exp\left(\sum_{i=1}^{n} B\left(w_{i}^{(n)}\right)\right)\\ \end{aligned}

where now $$w_{-m}^{(-m)}=Ky_{-m}$$,

$$w_{j+1}^{(-m)} = Ky_{j+1} +A\left(w_{j}^{(-m)}\right),\quad j=-m,\ldots,-2,$$

and

$$w_{0}^{(-m,n)} = Ky_{0}+A\left(w_{-1}^{(-m)}\right)+A\left(w_{1}^{(n)}\right).$$

Therefore,

\begin{aligned} \mathbb{Q}\left(y_{0}|y_{-m}^{-1},{y_{1}^{n}}\right)&= \frac{{\mathsf Z}_{-m,n}\left(y_{-m}^{-1},y_{0},{y_{1}^{n}}\right)}{{\mathsf Z}_{-m,n}\left(y_{-m}^{-1},y_{0},{y_{1}^{n}}\right)+{\mathsf Z}_{-m,n}\left(y_{-m}^{-1},\bar y_{0},{y_{1}^{n}}\right)}\\ &= \frac { \cosh\left(Ky_{0}+A\left(w_{-1}^{(-m)}\right)+A\left(w_{1}^{(n)}\right)\right)} { \cosh\left(Ky_{0}+A\left(w_{-1}^{(-m)}\right)+A\left(w_{1}^{(n)}\right)\right)+{ \cosh\left(-Ky_{0}+A\left(w_{-1}^{(-m)}\right)+A\left(w_{1}^{(n)}\right)\right)}.} \end{aligned}

Again, given this expression, one easily establishes uniform convergence and existence of the limits,

$$\mathbb{Q}\left(y_{0}|y_{-\infty}^{-1},y_{1}^{\infty}\right)={\lim}_{m,n\to\infty} \mathbb{Q}\left(y_{0}|y_{-m}^{-1},{y_{1}^{n}}\right).$$

Thus the two-sided conditional probabilities are also regular, cf. Theorem 3.1.
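The two-sided formula can be sketched in code as one pass from each side, followed by a comparison with the ratio of brute-force cylinder probabilities. The function names and the test word are assumptions of this illustration.

```python
import math
from itertools import product

def A(w, J):
    return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))

def two_sided_conditional(y_left, y0, y_right, p, eps):
    """Q(y_0 | y_{-m}^{-1}, y_1^n); y_left = (y_{-m},...,y_{-1}), y_right = (y_1,...,y_n)."""
    J = 0.5 * math.log((1 - p) / p)
    K = 0.5 * math.log((1 - eps) / eps)
    w_right = 0.0
    for yi in reversed(y_right):                 # y_n, ..., y_1 gives w_1^{(n)}
        w_right = K * yi + A(w_right, J)
    w_left = 0.0
    for yi in y_left:                            # y_{-m}, ..., y_{-1} gives w_{-1}^{(-m)}
        w_left = K * yi + A(w_left, J)
    field = A(w_left, J) + A(w_right, J)         # effective field acting on x_0
    num = math.cosh(K * y0 + field)
    return num / (num + math.cosh(-K * y0 + field))

def cylinder_prob_brute(y, p, eps):
    """Direct summation over all hidden paths, as in (1.2)."""
    total = 0.0
    for x in product((-1, 1), repeat=len(y)):
        weight = 0.5
        for a, b in zip(x, x[1:]):
            weight *= p if a != b else 1 - p
        for a, b in zip(x, y):
            weight *= eps if a != b else 1 - eps
        total += weight
    return total

y_left, y0, y_right = (1, -1, -1), 1, (1, 1, -1, 1)
q_same = cylinder_prob_brute(y_left + (y0,) + y_right, 0.2, 0.1)
q_flip = cylinder_prob_brute(y_left + (-y0,) + y_right, 0.2, 0.1)
print(two_sided_conditional(y_left, y0, y_right, 0.2, 0.1), q_same / (q_same + q_flip))
```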

### 4.1 Denoising

Reconstruction of signals corrupted by noise during the transmission is one of the classical problems in Information Theory. Suppose we observe a sequence $$\{y_{n}\}$$, $$n=1,\ldots,N$$, given by (1.1), i.e.,

$$y_{n}=x_{n}\cdot z_{n},$$

where $$\{x_{n}\}$$ is some unknown realisation of the Markov chain, and $$\{z_{n}\}$$ is an unknown realisation of the Bernoulli sequence $$\{Z_{n}\}$$. The natural question is, given the observed data $$y^{N}=(y_{1},\ldots,y_{N})$$, what is the optimal choice of $$\hat X_{n}=\hat X_{n}\left (y^{N}\right)$$, the estimate of $$X_{n}$$, such that the empirical zero-one loss (bit error rate)

$$L_{N}=\frac 1N\sum_{n=1}^{N} \mathbb{I}\left[ \hat X_{n}\ne x_{n} \right]$$

is minimal. The corresponding standard maximum a posteriori probability (MAP) estimator (denoiser) is given by

\begin{aligned} \hat X_{n} &=\hat X_{n}\left(y^{N}\right)\\ &=\left\{\begin{array}{ll} -1,& \text{if }\mathbb{P}\left[X_{n}= -1\,|\,Y^{N}=y^{N}\right]\ge \mathbb{P}\left[X_{n}= 1\,|\,Y^{N}=y^{N}\right],\\ +1,& \text{if }\mathbb{P}\left[X_{n}= -1\,|\,Y^{N}=y^{N}\right]< \mathbb{P}\left[X_{n}= 1\,|\,Y^{N}=y^{N}\right], \end{array}\right. \\ n&=1,\ldots,N. \end{aligned}

If the parameters of the Markov chain (i.e., $$P$$) and of the channel (i.e., $$\Pi$$) are known, the conditional probabilities $$\mathbb {P}\left [X_{n}= x\,|\,Y^{N}=y^{N}\right ]$$ can be found using the forward-backward algorithm. Namely, one has

$$\mathbb{P}\left[X_{n}= x\,|\,Y^{N}=y^{N}\right] = \frac {\alpha_{n}(x)\beta_{n}(x)}{ \sum_{\tilde{x}\in \mathcal{A}} \alpha_{n}(\tilde{x})\beta_{n}(\tilde{x})}$$
(4.1)

where

\begin{aligned} &\alpha_{n}(x)=\mathbb{P}\left[Y_{1}^{n}={y_{1}^{n}}, X_{n}=x\right],\\ &\beta_{n}(x)=\mathbb{P}\left[Y_{n+1}^{N}=y_{n+1}^{N}|X_{n}=x\right] \end{aligned}

are the so-called forward and backward variables, satisfying simple recurrence relations:

\begin{aligned} \alpha_{n+1}(x)&= \sum_{\tilde{x}\in \mathcal{A}} \alpha_{n}(\tilde{x})\ P_{\tilde{x},x}\ \Pi_{x,y_{n+1}},\ n=1,\ldots, N-1,\\&\quad\text{with } \alpha_{1}(x) = \mathbb{P}(X_{1}=x)\Pi_{x,y_{1}},\\ \beta_{n}(x)&=\sum_{\tilde{x}\in \mathcal{A}} \beta_{n+1}(\tilde{x})\ P_{x,\tilde{x}}\ \Pi_{\tilde{x},y_{n+1}},\ n=1,\ldots, N-1,\\& \quad\text{with } \beta_{N}(x) =1. \end{aligned}
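The recursions above translate almost line by line into code. The following is a minimal sketch, not a reference implementation: the function names `simulate`, `posterior_marginals` and `map_denoise` are hypothetical, states and symbols are stored as indices 0 (for −1) and 1 (for +1), and the forward and backward variables are rescaled at every step to avoid underflow, which is harmless because the scaling factors cancel in (4.1).

```python
import numpy as np

def simulate(N, p, eps, rng):
    """Draw (x, y) from model (1.1): symmetric Markov source through the channel."""
    x = np.empty(N, dtype=int)
    x[0] = rng.choice([-1, 1])                    # stationary start (1/2, 1/2)
    flips = rng.random(N - 1) < p                 # True where the chain changes state
    for n in range(1, N):
        x[n] = -x[n - 1] if flips[n - 1] else x[n - 1]
    z = np.where(rng.random(N) < eps, -1, 1)      # channel noise Z_n
    return x, x * z

def posterior_marginals(y, p, eps):
    """Rows are P[X_n = . | Y^N = y^N], computed as in (4.1)."""
    y_idx = (np.asarray(y, dtype=int) + 1) // 2   # map -1 -> 0, +1 -> 1
    N = len(y_idx)
    P = np.array([[1 - p, p], [p, 1 - p]])             # transition matrix
    Pi = np.array([[1 - eps, eps], [eps, 1 - eps]])    # emission matrix Pi[x, y]
    alpha = np.zeros((N, 2))
    beta = np.ones((N, 2))
    alpha[0] = 0.5 * Pi[:, y_idx[0]]
    alpha[0] /= alpha[0].sum()
    for n in range(N - 1):                        # forward recursion
        alpha[n + 1] = (alpha[n] @ P) * Pi[:, y_idx[n + 1]]
        alpha[n + 1] /= alpha[n + 1].sum()        # rescaling cancels in (4.1)
    for n in range(N - 2, -1, -1):                # backward recursion
        beta[n] = P @ (beta[n + 1] * Pi[:, y_idx[n + 1]])
        beta[n] /= beta[n].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

def map_denoise(y, p, eps):
    """MAP estimate of each X_n; ties go to -1 as in the definition above."""
    post = posterior_marginals(y, p, eps)
    return np.where(post[:, 0] >= post[:, 1], -1, 1)

# Usage: compare the bit error rate before and after denoising.
rng = np.random.default_rng(0)
x, y = simulate(5000, p=0.1, eps=0.2, rng=rng)
x_hat = map_denoise(y, p=0.1, eps=0.2)
print("raw bit error rate:     ", np.mean(y != x))
print("denoised bit error rate:", np.mean(x_hat != x))
```

The parameter values in the usage example are illustrative only; the denoiser needs the true p and ε, which is exactly the knowledge the universal schemes discussed below try to do without.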

The key observation of [16] is that the probability distribution $$\mathbb {P}\left [X_{n}=\cdot \ | Y^{N}=y^{N}\right ]$$, viewed as a column vector, can be expressed in terms of the two-sided conditional probabilities $$\mathbb {Q}\left [Y_{n}=\cdot \ | Y^{N\setminus n} = y^{N\setminus n}\right ]$$, with $$N\setminus n=\{1,\ldots,N\}\setminus \{n\}$$, as follows

\begin{aligned} &\mathbb{P}\left[X_{n}=\cdot\ | Y^{N}=y^{N}\right]\\ &= \frac {\pi_{y_{n}}\odot \Pi^{-1} \mathbb{Q}\left[Y_{n}=\cdot\ | Y^{N\setminus n} = y^{N\setminus n}\right]}{ \langle\pi_{y_{n}}\odot \Pi^{-1} \mathbb{Q}\left[Y_{n}=\cdot\ | Y^{N\setminus n} = y^{N\setminus n}\right], \mathbf 1 \rangle}, \end{aligned}
(4.2)

where Π is the emission matrix (invertible for ε≠1/2), and $$\pi_{-1},\pi_{1}$$ are the columns of Π:

\begin{aligned} &\Pi=\left[\begin{array}{cc} 1-\varepsilon& \varepsilon\\ \varepsilon &1-\varepsilon \end{array}\right], \quad\pi_{-1}=\left[\begin{array}{c} 1-\varepsilon\\ \varepsilon \end{array}\right], \\&\pi_{1}=\left[\begin{array}{c} \varepsilon\\ 1-\varepsilon \end{array}\right],\quad \Pi^{-1}=\frac{1}{1-2\varepsilon} \left[\begin{array}{cc} 1-\varepsilon & -\varepsilon\\ -\varepsilon &1-\varepsilon \end{array}\right], \end{aligned}

and $$\odot$$ is the componentwise product of vectors of equal length,

$$u \odot v = (u_{1}\cdot v_{1},\ldots, u_{d}\cdot v_{d}).$$
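Identity (4.2) can be checked numerically on a short sequence. The sketch below is an illustration under stated assumptions rather than part of the original argument: the helper names are hypothetical, symbols are stored as indices 0 (for −1) and 1 (for +1), and all probabilities are obtained by brute-force summation over hidden paths, which is feasible only for small N.

```python
import itertools
import numpy as np

def joint_weight(n, y_idx, p, eps):
    """w[x] = P[X_n = x, Y^N = y^N], summing over all hidden paths (small N only)."""
    P = np.array([[1 - p, p], [p, 1 - p]])
    Pi = np.array([[1 - eps, eps], [eps, 1 - eps]])
    N = len(y_idx)
    w = np.zeros(2)
    for xs in itertools.product((0, 1), repeat=N):
        prob = 0.5 * Pi[xs[0], y_idx[0]]
        for k in range(1, N):
            prob *= P[xs[k - 1], xs[k]] * Pi[xs[k], y_idx[k]]
        w[xs[n]] += prob
    return w

def posterior_direct(n, y_idx, p, eps):
    """Left-hand side of (4.2): P[X_n = . | Y^N = y^N]."""
    w = joint_weight(n, y_idx, p, eps)
    return w / w.sum()

def posterior_via_two_sided(n, y_idx, p, eps):
    """Right-hand side of (4.2), built from Q[Y_n = . | y^{N \\ n}]."""
    Pi = np.array([[1 - eps, eps], [eps, 1 - eps]])
    q = np.zeros(2)
    for c in (0, 1):                       # replace y_n by each possible symbol c
        y_mod = list(y_idx)
        y_mod[n] = c
        q[c] = joint_weight(n, y_mod, p, eps).sum()
    q /= q.sum()                           # two-sided conditional probabilities
    v = Pi[:, y_idx[n]] * np.linalg.solve(Pi, q)   # pi_{y_n} (.) Pi^{-1} q
    return v / v.sum()

y_idx = [1, 0, 0, 1, 1, 0]
for n in range(len(y_idx)):
    lhs = posterior_direct(n, y_idx, p=0.2, eps=0.1)
    rhs = posterior_via_two_sided(n, y_idx, p=0.2, eps=0.1)
    print(n, np.allclose(lhs, rhs))
```

The call to `np.linalg.solve` applies the inverse of Π without forming it; it requires ε≠1/2, in line with the factor 1/(1−2ε) in the explicit inverse.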

Expression (4.2) opens the possibility of constructing denoisers when the parameters of the underlying Markov chain are unknown; we continue to assume that the channel is known. Indeed, the two-sided conditional probabilities $$\mathbb {Q}\left [Y_{n}=\cdot \ | Y^{N\setminus n} = y^{N\setminus n}\right ]$$ can be estimated from the data. The Discrete Universal Denoiser algorithm (DUDE) [16] estimates the conditional probabilities

\begin{aligned} &\mathbb{Q}\left(Y_{n} = c\,|\, Y_{n-k_{N}}^{n-1}= a_{-k_{N}}^{-1}, Y_{n+1}^{n+k_{N}}= b_{1}^{k_{N}}\right)\\&\quad= \frac {m\left(a_{-k_{N}}^{-1},c,b_{1}^{k_{N}}\right)}{\sum_{\bar c} m\left(a_{-k_{N}}^{-1},\bar c,b_{1}^{k_{N}}\right)} \end{aligned}
(4.3)

where $$m\left (a_{-k_{N}}^{-1},c,b_{1}^{k_{N}}\right)$$ is the number of occurrences of the word $$a_{-k_{N}}^{-1}cb_{1}^{k_{N}}$$ in the observed sequence $$y^{N}=(y_{1},\ldots,y_{N})$$; the length of the right and left contexts is set to $$k_{N}=c\log N$$, c>0.
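The counting estimate (4.3) and its use in (4.2) can be sketched as follows. This is a simplified illustration and not the reference implementation of [16]: the function names are hypothetical, the context length k is passed in directly instead of being set to c log N, symbols closer than k to either end of the sequence are left as observed, and ties are resolved in favour of −1 as in the MAP rule above.

```python
from collections import Counter
import numpy as np

def context_counts(y, k):
    """m(a, c, b): occurrences of the word a c b, with |a| = |b| = k, in y."""
    y = tuple(int(v) for v in y)
    counts = Counter()
    for n in range(k, len(y) - k):
        counts[(y[n - k:n], y[n], y[n + 1:n + k + 1])] += 1
    return counts

def q_hat(counts, a, b):
    """Estimate (4.3) as a vector indexed by c = -1, +1 (context must occur in y)."""
    m = np.array([counts[(a, -1, b)], counts[(a, 1, b)]], dtype=float)
    return m / m.sum()

def dude_style_denoise(y, k, eps):
    """Plug the empirical two-sided probabilities into (4.2) and take the argmax."""
    y = tuple(int(v) for v in y)
    Pi = np.array([[1 - eps, eps], [eps, 1 - eps]])   # known channel
    counts = context_counts(y, k)
    x_hat = list(y)                  # symbols without a full context stay as observed
    for n in range(k, len(y) - k):
        q = q_hat(counts, y[n - k:n], y[n + 1:n + k + 1])
        # Unnormalised numerator of (4.2); entries may be negative for empirical q.
        v = Pi[:, (y[n] + 1) // 2] * np.linalg.solve(Pi, q)
        x_hat[n] = -1 if v[0] >= v[1] else 1
    return x_hat

# Usage: two isolated flips in an otherwise constant sequence are removed.
noisy = [1] * 50
noisy[10] = noisy[30] = -1
print(dude_style_denoise(noisy, k=1, eps=0.1) == [1] * 50)
```

Since the estimated probabilities need not be images of a genuine distribution under Π, the vector on the right-hand side of (4.2) may have negative entries; the sketch therefore compares the unnormalised components instead of normalising first.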

The DUDE has shown excellent performance in a number of test cases. In particular, in the case of the binary memoryless channel and the symmetric Markov chain considered in this paper, its performance is comparable to that of the backward-forward algorithm (4.1), which requires full knowledge of the source distribution, while DUDE is completely oblivious in that respect. In our opinion, the excellent performance of DUDE in this case is partially due to the fact that $$\mathbb {Q}$$ is a Gibbs measure, admitting smooth two-sided conditional probabilities, which are well approximated by (4.3) and thus can be estimated from the data. It would be interesting to evaluate the performance in cases when the output measure is not Gibbs.

The invention of DUDE sparked great interest in two-sided approaches to information-theoretic problems. It turns out that, although efficient algorithms for the estimation of one-sided models exist, the analogous two-sided problem is substantially more difficult. As alternatives to (4.3), other methods to estimate two-sided conditional probabilities have been suggested, e.g., [6, 11, 17]. For example, Yu and Verdú [17] proposed a Backward-Forward Product (BFP) model:

$$\mathbb{\widetilde{Q}}(y_{0}|y_{<0},y_{>0}) \propto \mathbb{\widetilde{Q}}(y_{0}|y_{<0}) \mathbb{\widetilde{Q}}(y_{0}|y_{>0}),$$

where the one-sided conditional probabilities $$\mathbb {\widetilde {Q}}(y_{0}|y_{<0})$$, $$\mathbb {\widetilde {Q}}(y_{0}|y_{>0})$$ can be estimated using standard one-sided algorithms. Note that in our model,

$$\begin{array}{@{}rcl@{}} &&\frac{\mathbb{\widetilde{Q}}\left(y_{0}| y_{<0}\right) \mathbb{\widetilde{Q}}\left(y_{0}| y_{>0}\right)} {\mathbb{\widetilde{Q}}(y_{0}| y_{<0})\mathbb{\widetilde{Q}}(y_{0}| y_{>0}) +\mathbb{\widetilde{Q}}(\bar y_{0}| y_{<0})\mathbb{\widetilde{Q}}(\bar y_{0}| y_{>0})}\\ &&=\frac { \cosh(Ky_{0} +A(w_{-1}))\cosh(Ky_{0} +A(w_{1}))} { \cosh(Ky_{0} +A(w_{-1}))\cosh(Ky_{0} +A(w_{1}))+ \cosh(-Ky_{0} +A(w_{-1}))\cosh(-Ky_{0} +A(w_{1}))} \end{array}$$

in general does not coincide with

\begin{aligned} &\frac { \cosh\left(Ky_{0}+A(w_{-1})+A(w_{1})\right)} {\cosh\left(Ky_{0}+A(w_{-1})+A(w_{1})\right)+ \cosh\left(-Ky_{0}+A(w_{-1})+A(w_{1})\right)}\\&\quad=\mathbb{Q}(y_{0}|y_{<0},y_{>0}). \end{aligned}

Nevertheless, the BFP model seems to perform extremely well [17].
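The gap between the BFP expression and the exact two-sided conditional probability is easy to inspect numerically. In the sketch below the function names are hypothetical, and K, A(w₋₁) and A(w₁) are treated as free real parameters with illustrative values; the values they actually take in the model are determined by the definitions given earlier in the paper.

```python
import numpy as np

def bfp_probability(y0, K, a_left, a_right):
    """BFP expression: normalised product of the two one-sided cosh terms."""
    num = np.cosh(K * y0 + a_left) * np.cosh(K * y0 + a_right)
    alt = np.cosh(-K * y0 + a_left) * np.cosh(-K * y0 + a_right)
    return num / (num + alt)

def exact_probability(y0, K, a_left, a_right):
    """Exact two-sided conditional probability Q(y_0 | y_<0, y_>0)."""
    num = np.cosh(K * y0 + a_left + a_right)
    alt = np.cosh(-K * y0 + a_left + a_right)
    return num / (num + alt)

# Illustrative parameter values (assumptions, not taken from the paper).
K, a_left, a_right = 0.5, 0.3, 0.7
for y0 in (-1, 1):
    print(y0, bfp_probability(y0, K, a_left, a_right),
          exact_probability(y0, K, a_left, a_right))
```

When one of the two one-sided terms vanishes, say A(w₁)=0, the factor cosh(Ky₀) is the same for y₀=±1 and cancels, so the two expressions agree; a discrepancy appears only when both contexts carry information.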

Among other alternatives, let us mention the possibility of extending standard one-sided algorithms to produce algorithms for estimating two-sided conditional probabilities from data. This approach is investigated in a joint work with S. Berghout, where the denoising performance of the resulting Gibbsian models is evaluated. The Gibbsian algorithm performs better than DUDE: bit error rates are given in the table below for noise level ε=0.2 and various values of p (smaller rates are better).

| p | Gibbs | DUDE |
|------|---------|---------|
| 0.05 | 5.30 % | 5.58 % |
| 0.10 | 9.91 % | 10.48 % |
| 0.15 | 13.20 % | 13.77 % |
| 0.20 | 18.34 % | 18.77 % |

One could also try to estimate the Gibbsian potential directly, e.g., using the estimation procedure proposed in [5]. This method showed promising performance in experiments on language classification and authorship attribution. In conclusion, let us also mention that the direct two-sided Gibbs modeling of stochastic processes opens possibilities for applying semi-parametric statistical procedures, as opposed to the universal (parameter free) approach of DUDE.

## References

1. Behn, U, Zagrebnov, VA: One-dimensional Markovian-field Ising model: physical properties and characteristics of the discrete stochastic mapping. J. Phys. A. 21(9), 2151–2165 (1988). ISSN:0305-4470, MR952930 (89j:82024).

2. Berghout, S, Verbitskiy, E: On bi-directional modeling of information sources (2015).

3. Bowen, R: Some systems with unique equilibrium states. Math. Systems Theory. 8(3), 193–202 (1974/75). ISSN:0025-5661, MR0399413 (53 #3257).

4. Ephraim, Y, Merhav, N: Hidden Markov processes, Special issue on Shannon theory: perspective, trends, and applications. IEEE Trans. Inform. Theory. 48(6), 1518–1569 (2002). ISSN:0018-9448, MR1909472 (2003f:94024), doi:10.1109/TIT.2002.1003838.

5. Ermolaev, V, Verbitskiy, E: Thermodynamic Gibbs Formalism and Information Theory. In: The Impact of Applications on Mathematics (M. Wakayama et al, ed.), Mathematics for Industry, pp. 349–362. Springer Japan (2014).

6. Fernandez, F, Viola, A, Weinberger, MJ: Efficient Algorithms for Constructing Optimal Bi-directional Context Sets. Data Compression Conference (DCC), 179–188 (2010). http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5453460. doi:10.1109/DCC.2010.23.

7. Fernández, R, Ferrari, PA, Galves, A: Coupling renewal and perfect simulation of chains of infinite order, Ubatuba (2001).

8. Hochwald, BM, Jelenković, PR: State learning and mixing in entropy of hidden Markov processes and the Gilbert-Elliott channel. IEEE Trans. Inform. Theory. 45(1), 128–138 (1999). ISSN:0018-9448, MR1677853 (99k:94028).

9. Jacquet, P, Seroussi, G, Szpankowski, W: On the entropy of a hidden Markov process. Theoret. Comput. Sci. 395(2–3), 203–219 (2008).

10. Ordentlich, E, Weissman, T: Approximations for the entropy rate of a hidden Markov process. In: Proceedings International Symposium on Information Theory 2005, pp. 2198–2202 (2005). http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1523737. doi:10.1109/ISIT.2005.1523737.

11. Ordentlich, E, Weinberger, MJ, Weissman, T: Multi-directional context sets with applications to universal denoising and compression. In: ISIT 2005. Proceedings International Symposium on Information Theory, pp. 1270–1274 (2005). http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1523546. doi:10.1109/ISIT.2005.1523546.

12. Pollicott, M: Computing entropy rates for hidden Markov processes. In: Entropy of hidden Markov processes and connections to dynamical systems, London Math. Soc. Lecture Note Ser, pp. 223–245. Cambridge Univ. Press, Cambridge (2011). MR2866670 (2012i:37010).

13. Rabiner, LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE. 77, 257–286 (1989).

14. Ruján, P: Calculation of the free energy of Ising systems by a recursion method. Physica A: Stat. Theoretical Phys. 91(3–4), 549–562 (1978).

15. Verbitskiy, EA: Thermodynamics of Hidden Markov Processes. In: Markus, B, Petersen, K, Weissman, T (eds.) Papers from the Banff International Research Station Workshop on Entropy of Hidden Markov Processes and Connections to Dynamical Systems, pp. 258–272. London Mathematical Society, Lecture Note Series (2011). http://dx.doi.org/10.1017/CBO9780511819407.010.

16. Weissman, T, Ordentlich, E, Seroussi, G, Verdú, S, Weinberger, MJ: Universal discrete denoising: known channel. IEEE Trans. Inform. Theory. 51(1), 5–28 (2005). ISSN:0018-9448, MR2234569 (2008h:94036).

17. Yu, J, Verdú, S: Schemes for bidirectional modeling of discrete stationary sources. IEEE Trans. Inform. Theory. 52(11), 4789–4807 (2006). ISSN:0018-9448, MR2300356 (2007m:94144).

18. Zuk, O, Domany, E, Kanter, I, Aizenman, M: From Finite-System Entropy to Entropy Rate for a Hidden Markov Process. IEEE Sig. Proc. Letters. 13(9), 517–520 (2006).

## Acknowledgments

Part of the work described in this paper was completed during the author’s visit to the Institute of Mathematics for Industry, Kyushu University. The author is grateful for the hospitality during his stay and the support of the World Premier International Researcher Invitation Program.

## Author information

### Corresponding author

Correspondence to Evgeny Verbitskiy.

Verbitskiy, E. Thermodynamics of the binary symmetric channel. Pac. J. Math. Ind. 8, 2 (2016). https://doi.org/10.1186/s40736-015-0021-5

### Keywords

• Hidden Markov models
• Gibbs states
• Thermodynamic formalism
• Denoising

• 37D35
• 82B20