Thermodynamics of the Binary Symmetric Channel

We study a hidden Markov process obtained by transmitting a binary symmetric Markov source over a memoryless binary symmetric channel. This process has been studied extensively in Information Theory and is often used as a benchmark case for denoising algorithms. Exploiting the link between this process and the 1D Random Field Ising Model (RFIM), we identify the Gibbs potential of the resulting hidden Markov process. Moreover, we obtain a stronger bound on the memory decay rate. We conclude with a discussion of the implications of our results for the development of denoising algorithms.


Introduction
We study the binary symmetric Markov source over the memoryless binary symmetric channel. More specifically, let $\{X_n\}$ be a stationary two-state Markov chain with values in $\{\pm 1\}$, and $\mathbb{P}(X_{n+1} \ne X_n) = p$, where $0 < p < 1$; denote by
$$P = (p_{x,x'}) = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}$$
the corresponding transition probability matrix. Note that $\pi = \left(\tfrac12, \tfrac12\right)$ is the stationary initial distribution for this chain.
The binary symmetric channel will be modelled as a sequence of independent Bernoulli random variables $\{Z_n\}$ with
$$\mathbb{P}(Z_n = -1) = \varepsilon, \qquad \mathbb{P}(Z_n = 1) = 1 - \varepsilon.$$
Finally, put
$$Y_n = X_n \cdot Z_n \qquad (1.1)$$
for all $n$. The process $\{Y_n\}$ is a Hidden Markov process, because $Y_n \in \{-1, 1\}$ is chosen independently for any $n$ from an emission distribution $\pi_{X_n}$ on $\{-1, 1\}$: $\pi_1 = (\varepsilon, 1 - \varepsilon)$ and $\pi_{-1} = (1 - \varepsilon, \varepsilon)$.

More generally, Hidden Markov Processes are random functions of discrete-time Markov chains, where the value $Y_n$ is chosen according to a distribution which depends on the value $X_n = x_n$ of the underlying Markov chain, independently for any $n$. The applications of Hidden Markov processes include automatic character and speech recognition, information theory, statistics and bioinformatics, see [4,13]. The particular example (1.1) we consider in the present paper is probably one of the simplest examples, and is often used as a benchmark for testing algorithms. In particular, this example has been studied rather extensively in connection with the computation of the entropy of the output process $\{Y_n\}$, see e.g. [8-10, 12].
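The construction of the output process can be illustrated by a short simulation. The following is a minimal sketch, assuming the conventions $\mathbb{P}(X_{n+1} \ne X_n) = p$ and $\mathbb{P}(Z_n = -1) = \varepsilon$; the function name `simulate_bsc` and the chosen parameter values are hypothetical and serve only as an illustration.

```python
import random

def simulate_bsc(n, p, eps, seed=0):
    """Return (x, y): a +-1 Markov path x and its noisy observation y = x * z."""
    rng = random.Random(seed)
    x = [rng.choice((-1, 1))]              # stationary start: pi = (1/2, 1/2)
    for _ in range(n - 1):
        flip = rng.random() < p            # assumed convention: P(X_{n+1} != X_n) = p
        x.append(-x[-1] if flip else x[-1])
    z = [-1 if rng.random() < eps else 1 for _ in range(n)]   # P(Z_n = -1) = eps
    y = [xi * zi for xi, zi in zip(x, z)]  # Y_n = X_n * Z_n, equation (1.1)
    return x, y

x, y = simulate_bsc(20000, p=0.1, eps=0.2)
flip_rate = sum(a != b for a, b in zip(x, x[1:])) / (len(x) - 1)
noise_rate = sum(a != b for a, b in zip(x, y)) / len(x)
print(f"empirical flip rate {flip_rate:.3f}, empirical noise rate {noise_rate:.3f}")
```

For a long path the two empirical rates should be close to $p$ and $\varepsilon$, respectively.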
The law $Q$ of the process $\{Y_n\}$ is the push-forward of the product of the law of $\{X_n\}$ and the law of $\{Z_n\}$ under the coordinatewise product map $(x, z) \mapsto (x_n z_n)_n$. For every $m \le n$, and $y_m^n := (y_m, \ldots, y_n) \in \{-1, 1\}^{n-m+1}$, the measure of the corresponding cylindric set is given by
$$Q(y_m^n) = \sum_{x_m^n \in \{-1,1\}^{n-m+1}} \pi(x_m) \prod_{i=m}^{n-1} p_{x_i, x_{i+1}} \prod_{i=m}^{n} \pi_{x_i}(y_i), \qquad (1.2)$$
where $\pi_{x_i}(y_i) = 1 - \varepsilon$ if $y_i = x_i$ and $\pi_{x_i}(y_i) = \varepsilon$ otherwise.

Two particular cases are easy to analyze. If $p = \tfrac12$, then $\{X_n\}$ is a sequence of independent identically distributed random variables with $\mathbb{P}(X_n = -1) = \mathbb{P}(X_n = +1) = \tfrac12$, and $\{Y_n\}$ has the same distribution. If $\varepsilon = \tfrac12$, then the formula above implies that $Q(y_m^n) = 2^{-(n-m+1)}$, and hence again, $\{Y_n\}$ is a sequence of independent symmetric $\pm 1$ random variables.

The paper is organized as follows. In Section 2 we exploit methods of Statistical Mechanics to derive expressions for probabilities of cylindric events (1.2). In Section 3, analyzing the derived expressions, we show that the measure $Q$ has nice thermodynamic properties; in particular, it falls into the class of g-measures with exponential decay of variation (memory decay). We also obtain a novel estimate of the decay rate, which is stronger than estimates derived in previous works. In Section 4 we study two-sided conditional probabilities and show that $Q$ is in fact a Gibbs state in the sense of Statistical Mechanics. We also discuss the well-known denoising algorithm DUDE, and suggest that the Gibbs property of $Q$ explains why DUDE performs so well in this particular example. Furthermore, we argue that the development of denoising algorithms relying on thermodynamic Gibbs ideas can result in superior performance.
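For short words the cylinder probability can be evaluated by direct summation over all hidden paths, which also makes the two trivial cases easy to check. The sketch below assumes the reading of (1.2) in which a hidden path has weight $\pi(x_m)$ times transition probabilities times channel probabilities; the helper name `cylinder_prob` is hypothetical.

```python
from itertools import product

def cylinder_prob(y, p, eps):
    """Brute-force evaluation of Q(y_m, ..., y_n) by summing over hidden paths."""
    total = 0.0
    for x in product((-1, 1), repeat=len(y)):
        weight = 0.5                                   # pi(x_m) = 1/2
        for a, b in zip(x, x[1:]):
            weight *= p if a != b else 1 - p           # transition p_{x_i, x_{i+1}}
        for xi, yi in zip(x, y):
            weight *= eps if xi != yi else 1 - eps     # channel flips with prob. eps
        total += weight
    return total

y = (1, -1, -1, 1)
print(cylinder_prob(y, p=0.5, eps=0.2))   # i.i.d. source
print(cylinder_prob(y, p=0.3, eps=0.5))   # pure-noise channel
print(cylinder_prob(y, p=0.3, eps=0.2))   # generic case
```

The cost grows as $2^{n-m+1}$, so this is only a reference implementation for checking faster methods on short words.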

Random field Ising model
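It was observed in [18] that the probability $Q(y_m, \ldots, y_n)$ of a cylindric event $Y_m = y_m, \ldots, Y_n = y_n$, $m \le n$, can be expressed via a partition function of a random field Ising model. We exploit this observation further. Assume $p, \varepsilon \in (0, 1)$, and put
$$J = \frac12 \log\frac{1-p}{p}, \qquad K = \frac12 \log\frac{1-\varepsilon}{\varepsilon}.$$
Then $p_{x,x'} = e^{J x x'}/(2\cosh J)$ and $\pi_x(y) = e^{K x y}/(2\cosh K)$, so for any $(y_m, \ldots, y_n) \in \{-1, 1\}^{n-m+1}$ the expression for the cylinder probability (1.2) can be rewritten as
$$Q(y_m^n) = \frac{1}{2\,(2\cosh J)^{n-m}(2\cosh K)^{n-m+1}}\, Z_{m,n}(y_m^n),$$
where
$$Z_{m,n}(y_m^n) = \sum_{x_m^n \in \{-1,1\}^{n-m+1}} \exp\Big( J \sum_{i=m}^{n-1} x_i x_{i+1} + K \sum_{i=m}^{n} x_i y_i \Big).$$
The non-trivial part of the cylinder probability, the sum $Z_{m,n}(y_m^n)$ over all hidden configurations $(x_m, \ldots, x_n)$, is in fact the partition function of the Ising model with the signs of the external random fields given by the $y_i$'s.

Applying the recursive method of [14], the partition function can be evaluated in the following fashion [1]. Consider the functions
$$A(w) = \frac12 \log\frac{\cosh(w+J)}{\cosh(w-J)}, \qquad B(w) = \frac12 \log\big(4\cosh(w+J)\cosh(w-J)\big).$$
One readily checks that if $s = \pm 1$, then for all $w \in \mathbb{R}$
$$\exp\big(sA(w) + B(w)\big) = 2\cosh(w + sJ). \qquad (2.1)$$
Now the partition function can be evaluated by summing over the right-most spin. Namely, suppose $m < n$ and $y_m^n \in \{-1, 1\}^{n-m+1}$; then
$$\sum_{x_n = \pm 1} \exp\big(J x_{n-1} x_n + K x_n y_n\big) = 2\cosh(K y_n + J x_{n-1}) = \exp\big(x_{n-1} A(K y_n) + B(K y_n)\big).$$
Hence,
$$Z_{m,n}(y_m^n) = e^{B(K y_n)} \sum_{x_m^{n-1}} \exp\Big( J \sum_{i=m}^{n-2} x_i x_{i+1} + K \sum_{i=m}^{n-2} x_i y_i + x_{n-1}\big(K y_{n-1} + A(K y_n)\big) \Big),$$
and thus the new sum has exactly the same form, but instead of $K y_{n-1}$ we now have $w_{n-1}^{(n)} = K y_{n-1} + A(K y_n)$. Continuing the summation over the remaining right-most $x$-spins, one gets
$$Z_{m,n}(y_m^n) = 2\cosh\big(w_m^{(n)}\big) \exp\Big( \sum_{i=m+1}^{n} B\big(w_i^{(n)}\big) \Big),$$
where $w_n^{(n)} = K y_n$ and, for every $i < n$, $w_i^{(n)} = K y_i + A\big(w_{i+1}^{(n)}\big)$; equivalently, since $A(0) = 0$, we can define $w_{n+1}^{(n)} = 0$ and $w_i^{(n)} = K y_i + A\big(w_{i+1}^{(n)}\big)$ for all $i \le n$.

Therefore, writing $\lambda_{J,K} = 4\cosh J \cosh K$, we obtain the following expressions for the cylinder and conditional probabilities:
$$Q(y_m^n) = \frac{\cosh\big(w_m^{(n)}\big)}{2\cosh K \cdot \lambda_{J,K}^{\,n-m}} \exp\Big( \sum_{i=m+1}^{n} B\big(w_i^{(n)}\big) \Big), \qquad Q(y_m \mid y_{m+1}^n) = \frac{\cosh\big(w_m^{(n)}\big)}{\lambda_{J,K}\cosh\big(w_{m+1}^{(n)}\big)} \exp\Big( B\big(w_{m+1}^{(n)}\big) \Big). \qquad (2.2)$$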
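The recursive evaluation can be checked numerically against direct summation. The sketch below is an illustration under stated assumptions: it takes $J = \tfrac12\log\frac{1-p}{p}$, $K = \tfrac12\log\frac{1-\varepsilon}{\varepsilon}$, the recursion $w_i = K y_i + A(w_{i+1})$ started from the boundary value $0$, and the normalising constant $\lambda_{J,K} = 4\cosh J\cosh K$. All function names are hypothetical.

```python
import math
from itertools import product

def couplings(p, eps):
    """Assumed Ising parameters: J = 1/2 log((1-p)/p), K = 1/2 log((1-eps)/eps)."""
    return 0.5 * math.log((1 - p) / p), 0.5 * math.log((1 - eps) / eps)

def A(w, J):
    return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))

def B(w, J):
    return 0.5 * math.log(4 * math.cosh(w + J) * math.cosh(w - J))

def effective_fields(y, J, K):
    """w_i = K*y_i + A(w_{i+1}), started from the boundary value w_{n+1} = 0."""
    w = [0.0] * (len(y) + 1)
    for i in range(len(y) - 1, -1, -1):
        w[i] = K * y[i] + A(w[i + 1], J)
    return w[:-1]

def cylinder_prob_recursive(y, p, eps):
    """Cylinder probability from the one-pass recursion (linear time in len(y))."""
    J, K = couplings(p, eps)
    w = effective_fields(y, J, K)
    lam = 4 * math.cosh(J) * math.cosh(K)
    log_q = (math.log(math.cosh(w[0])) + sum(B(wi, J) for wi in w[1:])
             - (len(y) - 1) * math.log(lam) - math.log(2 * math.cosh(K)))
    return math.exp(log_q)

def cylinder_prob_brute(y, p, eps):
    """Reference value: explicit sum over all hidden paths (exponential time)."""
    total = 0.0
    for x in product((-1, 1), repeat=len(y)):
        weight = 0.5
        for a, b in zip(x, x[1:]):
            weight *= p if a != b else 1 - p
        for xi, yi in zip(x, y):
            weight *= eps if xi != yi else 1 - eps
        total += weight
    return total

y = (1, 1, -1, 1, -1, -1)
print(cylinder_prob_recursive(y, 0.3, 0.2), cylinder_prob_brute(y, 0.3, 0.2))
```

The two printed values should agree up to floating-point rounding; the recursive version needs a single backward pass over the word.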

Thermodynamic formalism
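Let $\Omega = A^{\mathbb{Z}_+}$, where $A$ is a finite alphabet, be the space of one-sided infinite sequences $\omega = (\omega_0, \omega_1, \ldots)$ in the alphabet $A$ ($\omega_i \in A$ for all $i$). We equip $\Omega$ with the metric
$$d(\omega, \tilde\omega) = 2^{-k(\omega, \tilde\omega)},$$
where $k(\omega, \tilde\omega) = 1$ if $\omega_0 \ne \tilde\omega_0$, and $k(\omega, \tilde\omega) = \max\{k \in \mathbb{N} : \omega_i = \tilde\omega_i \ \forall i = 0, \ldots, k-1\}$ otherwise. Denote by $S : \Omega \to \Omega$ the left shift: $(S\omega)_i = \omega_{i+1}$ for all $i \in \mathbb{Z}_+$. A Borel probability measure $\mathbb{P}$ is translation invariant if $\mathbb{P}(S^{-1}C) = \mathbb{P}(C)$ for every Borel set $C$.

Let us recall the following well-known definitions.

Definition 3.1. Suppose $\mathbb{P}$ is a fully supported translation invariant measure on $\Omega = A^{\mathbb{Z}_+}$, where $A$ is a finite alphabet.

(i) The measure $\mathbb{P}$ is called a g-measure if there exists a positive continuous function $g : \Omega \to (0, 1)$ with $\sum_{a \in A} g(a, \omega_1, \omega_2, \ldots) = 1$ for all $\omega$, such that $\mathbb{P}(\omega_0 \mid \omega_1, \omega_2, \ldots) = g(\omega)$ for $\mathbb{P}$-almost all $\omega$.

(ii) The measure $\mathbb{P}$ is Bowen-Gibbs for a continuous potential $\varphi : \Omega \to \mathbb{R}$, if there exist constants $P \in \mathbb{R}$ and $C \ge 1$ such that for all $\omega \in \Omega$ and every $n \in \mathbb{N}$
$$\frac{1}{C} \le \frac{\mathbb{P}\big(\{\tilde\omega : \tilde\omega_0^{n-1} = \omega_0^{n-1}\}\big)}{\exp\big(S_n\varphi(\omega) - nP\big)} \le C,$$
where $S_n\varphi(\omega) = \sum_{k=0}^{n-1} \varphi(S^k\omega)$.

(iii) The measure $\mathbb{P}$ is called an equilibrium state for a continuous potential $\varphi : \Omega \to \mathbb{R}$, if $\mathbb{P}$ attains the maximum of the following functional:
$$h(\mathbb{P}) + \int \varphi\, d\mathbb{P} = \sup_{\tilde{\mathbb{P}} \in \mathcal{M}_1^*(\Omega)} \Big[ h(\tilde{\mathbb{P}}) + \int \varphi\, d\tilde{\mathbb{P}} \Big], \qquad (3.1)$$
where $h(\mathbb{P})$ is the Kolmogorov-Sinai entropy of $\mathbb{P}$ and the supremum is taken over the set $\mathcal{M}_1^*(\Omega)$ of all translation invariant Borel probability measures on $\Omega$.

It is known that every g-measure $\mathbb{P}$ is also an equilibrium state for $\log g$, and that every Bowen-Gibbs measure $\mathbb{P}$ for a potential $\varphi$ is an equilibrium state for $\varphi$ as well.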
Theorem 3.1. Suppose $p, \varepsilon \in (0, 1)$. Then the measure $Q$ (the law of $\{Y_n\}$ on $\{-1, 1\}^{\mathbb{Z}_+}$) is a g-measure. Moreover, the corresponding function $g$ has exponential decay of variation: define the $n$-th variation of $g$ by
$$\mathrm{var}_n(g) = \sup\big\{ |g(y) - g(\tilde y)| : y_i = \tilde y_i,\ i = 0, \ldots, n-1 \big\};$$
then there exist constants $C > 0$ and $\rho \in (0, 1)$ such that
$$\mathrm{var}_n(g) \le C\rho^n \quad \text{for all } n, \qquad (3.3)$$
where $\rho < |1 - 2p|$ whenever $p \ne \tfrac12$ and $\varepsilon \ne \tfrac12$. Finally, the measure $Q$ is also a Bowen-Gibbs measure for a Hölder continuous potential $\varphi : \{-1, 1\}^{\mathbb{Z}_+} \to \mathbb{R}$.

The result of Theorem 3.1 is actually true in much greater generality: namely, for distributions of functions of Markov chains $\{Y_n\}$, where the underlying Markov chain $\{X_n\}$ has a strictly positive transition probability matrix $P$, see [15] for a review of several results of this nature. However, the example considered in the present paper is rather exceptional, since one is able to identify the g-function and the Gibbs potential $\varphi$ explicitly.

Another interesting question is the estimate of the decay rate $\rho$. In [15] a number of previously known estimates of the rate of exponential decay in (3.3) have been compared; the best known estimate, $\rho \le |1 - 2p|$, is due to [8] and [7]. Quite surprisingly, this estimate does not depend on $\varepsilon$, and in fact it was conjectured in [15] that the estimate could be improved, e.g., by incorporating dependence on $\varepsilon$. The proof of Theorem 3.1 shows that this is indeed the case and one obtains a sharper estimate (3.3).
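Let us start by introducing some notation and proving a technical result. Suppose $p, \varepsilon \in (0, 1)$, and $p \ne \tfrac12$ and $\varepsilon \ne \tfrac12$. Fix $y = (y_0, y_1, \ldots) \in \{-1, 1\}^{\mathbb{Z}_+}$. For every $n \in \mathbb{Z}_+$, define the sequence $w_i^{(n)} = w_i^{(n)}(y)$, $i \in \mathbb{Z}_+$, as follows:
$$w_i^{(n)} = 0 \ \text{ for } i > n, \qquad w_i^{(n)} = K y_i + A\big(w_{i+1}^{(n)}\big) \ \text{ for } i \le n.$$
If we introduce maps $F_{-1}, F_1 : \mathbb{R} \to \mathbb{R}$, given by $F_s(w) = sK + A(w)$, $s = \pm 1$, then
$$w_i^{(n)} = F_{y_i} \circ F_{y_{i+1}} \circ \cdots \circ F_{y_n}(0). \qquad (3.4)$$
As we will show, the maps $F_{-1}, F_1$ are strict contractions, and as a result, for every $i$, the sequence $w_i^{(n)}$ converges as $n \to \infty$; in fact, with exponential speed.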

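Lemma 3.2. For every $i \in \mathbb{Z}_+$ and all $y$, the limit $w_i(y) = \lim_{n\to\infty} w_i^{(n)}(y)$ exists. Moreover, there exist constants $\varrho \in (0, 1)$ and $C > 0$, both independent of $y$, such that
$$\big| w_i^{(n)}(y) - w_i(y) \big| \le C \varrho^{\,n-i} \qquad (3.5)$$
for all $n \ge i$. Furthermore, $w_i(y) = w_0(S^i y)$ for all $i \in \mathbb{Z}_+$ and $y$, and $w_0 : \{-1, 1\}^{\mathbb{Z}_+} \to \mathbb{R}$ (and hence every $w_i$) is Hölder continuous.

Proof. One has
$$A'(w) = \frac12\big(\tanh(w+J) - \tanh(w-J)\big) = \frac{\sinh(2J)}{\cosh(2w) + \cosh(2J)},$$
and hence $\sup_w |F_s'(w)| = |A'(0)| = \tanh|J| = |1 - 2p| < 1$. Moreover, since $|A(w)| \le |J|$ for all $w$, one has $w_i^{(n)}(y) \in [-|K| - |J|, |K| + |J|]$ for all $i$ and $n$. Combined with the fact that for all $i \le n$ the values $w_i^{(n+1)}$ and $w_i^{(n)}$ are obtained by applying the same composition $F_{y_i} \circ \cdots \circ F_{y_n}$ to the points $K y_{n+1}$ and $0$, respectively, this gives
$$\big| w_i^{(n+1)} - w_i^{(n)} \big| \le |K|\, \varrho^{\,n-i+1}, \qquad \varrho = |1 - 2p|.$$
Hence, $\lim_{n\to\infty} w_i^{(n)} =: w_i(y)$ exists and (3.5) holds with $C = |K|\varrho/(1-\varrho)$. From representation (3.4) it is clear that $w_i^{(n)}(y) = w_0^{(n-i)}(S^i y)$ for $n \ge i$, and hence $w_i(y) = w_0(S^i y)$. Suppose $y = (y_0, y_1, \ldots)$ and $\tilde y = (\tilde y_0, \tilde y_1, \ldots)$ satisfy $y_j = \tilde y_j$ for $j = 0, \ldots, n$; then $w_0^{(n)}(y) = w_0^{(n)}(\tilde y)$, so $|w_0(y) - w_0(\tilde y)| \le 2C\varrho^{\,n}$, and hence $w_0 : \{-1, 1\}^{\mathbb{Z}_+} \to \mathbb{R}$ is Hölder continuous.

The estimate of the contraction rate in the Lemma above can be improved. If $p = \tfrac12$, then $A(w) \equiv 0$, and hence $w_i^{(n)} \equiv K y_i$ for all $n \ge i$. We may therefore assume that $p \ne \tfrac12$. Let us also assume that $\varepsilon \ne \tfrac12$, i.e., $K \ne 0$.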
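The exponential convergence of the truncated fields can be observed numerically. The sketch below assumes the maps $F_s(w) = sK + A(w)$ with $A(w) = \tfrac12\log\frac{\cosh(w+J)}{\cosh(w-J)}$, and compares successive differences $|w_0^{(n+1)} - w_0^{(n)}|$ with the bound $|K|\,|1-2p|^{n+1}$ that follows from the Lipschitz constant $\tanh|J| = |1-2p|$ of $A$. The function name `w0_truncated` and the random test word are hypothetical.

```python
import math
import random

def A(w, J):
    return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))

def w0_truncated(y, n, J, K):
    """w_0^{(n)}(y): apply F_{y_n}, ..., F_{y_0} to 0, where F_s(w) = s*K + A(w)."""
    w = 0.0
    for i in range(n, -1, -1):
        w = K * y[i] + A(w, J)
    return w

p, eps = 0.3, 0.2
J = 0.5 * math.log((1 - p) / p)
K = 0.5 * math.log((1 - eps) / eps)
rho = abs(1 - 2 * p)                     # Lipschitz constant of A, equal to tanh|J|

rng = random.Random(1)
y = [rng.choice((-1, 1)) for _ in range(32)]
w = [w0_truncated(y, n, J, K) for n in range(21)]
gaps = [abs(w[n + 1] - w[n]) for n in range(20)]
for n in (0, 5, 10, 15):
    print(f"n={n:2d}  gap={gaps[n]:.3e}  bound={abs(K) * rho ** (n + 1):.3e}")
```

In such experiments the observed gaps are typically well below the bound, which is in line with the remark that the contraction rate $|1-2p|$ is not optimal.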
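Let us now consider second iterations: for $s, s' \in \{-1, 1\}$,
$$F_s \circ F_{s'}(w) = sK + A\big(s'K + A(w)\big).$$
We are going to show that for $p \ne \tfrac12$ and all $\varepsilon \ne \tfrac12$, one has
$$\rho^{(2)} := \Big( \sup_{s, s'}\, \sup_{w \in \mathbb{R}} \big| (F_s \circ F_{s'})'(w) \big| \Big)^{1/2} < |1 - 2p|. \qquad (3.7)$$
Firstly, note that
$$(F_s \circ F_{s'})'(w) = A'\big(s'K + A(w)\big)\, A'(w) = \frac{\sinh^2(2J)}{\big(\cosh(2s'K + 2A(w)) + \cosh(2J)\big)\big(\cosh(2w) + \cosh(2J)\big)}, \qquad (3.8)$$
where we used $A'(u) = \sinh(2J)/(\cosh(2u) + \cosh(2J))$. By symmetry it suffices to consider $s' = 1$. Let $\delta > 0$ be sufficiently small so that for all $w \in [-\delta, \delta]$ one has $\cosh(2K + 2A(w)) > \cosh(K)$, and hence for all $w \in [-\delta, \delta]$
$$\big| (F_s \circ F_{s'})'(w) \big| \le \frac{\sinh^2(2J)}{\big(\cosh(K) + \cosh(2J)\big)\big(1 + \cosh(2J)\big)} < \tanh^2(J).$$
For $w \notin [-\delta, \delta]$, one has $\cosh(2w) \ge \cosh(2\delta) > 1$. Hence,
$$\big| (F_s \circ F_{s'})'(w) \big| \le \frac{\sinh^2(2J)}{\big(1 + \cosh(2J)\big)\big(\cosh(2\delta) + \cosh(2J)\big)} < \tanh^2(J),$$
and thus (3.5) holds for $\bar\varrho = \rho^{(2)} < |1 - 2p|$ and some constant $C > 0$. In particular, we are now able to conclude that
$$\mathrm{var}_n(w_0) \le C_2\, \bar\varrho^{\,n}, \qquad \mathrm{var}_n(w_0 \circ S) \le C_2\, \bar\varrho^{\,n} \qquad (3.9)$$
for some constant $C_2 > 0$ and all $n$, where $\mathrm{var}_n(f) = \sup\{|f(y) - f(\tilde y)| : y_i = \tilde y_i,\ i = 0, \ldots, n-1\}$. Even sharper bounds can be achieved by studying the minimum of the denominator in (3.8) or higher iterates of the $F$'s.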
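The improvement obtained from second iterates can be explored numerically. The sketch below assumes $F_s(w) = sK + A(w)$ and estimates $\big(\sup |(F_s \circ F_{s'})'|\big)^{1/2}$ by a grid search over $w$, using the chain rule $(F_s \circ F_{s'})'(w) = A'(s'K + A(w))\,A'(w)$. A grid maximum only approximates the supremum from below, so the numbers are indicative rather than rigorous; the function name and grid parameters are hypothetical.

```python
import math

def second_iterate_rate(p, eps, w_max=10.0, steps=20001):
    """Grid estimate of sqrt(sup |d/dw F_s(F_s'(w))|) over w in [-w_max, w_max]."""
    J = 0.5 * math.log((1 - p) / p)
    K = 0.5 * math.log((1 - eps) / eps)

    def A(w):
        return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))

    def dA(w):
        # derivative of A: (tanh(w + J) - tanh(w - J)) / 2
        return 0.5 * (math.tanh(w + J) - math.tanh(w - J))

    best = 0.0
    for k in range(steps):
        w = -w_max + 2 * w_max * k / (steps - 1)
        for s in (-1, 1):                      # s plays the role of the inner sign s'
            best = max(best, abs(dA(s * K + A(w)) * dA(w)))
    return math.sqrt(best)

p = 0.3
for eps in (0.5, 0.4, 0.2, 0.05):
    print(f"eps={eps:4.2f}  estimated rate={second_iterate_rate(p, eps):.4f}  "
          f"|1-2p|={abs(1 - 2 * p):.4f}")
```

For $\varepsilon = \tfrac12$ (that is, $K = 0$) the estimate coincides with $|1 - 2p|$, while for $\varepsilon \ne \tfrac12$ it falls strictly below it, consistent with the dependence on $\varepsilon$ discussed above.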
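To show that $Q$ is a g-measure it is sufficient to show that the conditional probabilities $Q(y_0 \mid y_1^n)$ converge uniformly as $n \to \infty$. Given that, by (2.2),
$$Q(y_0 \mid y_1^n) = \frac{\cosh\big(w_0^{(n)}(y)\big)}{\lambda_{J,K}\cosh\big(w_1^{(n)}(y)\big)} \exp\Big( B\big(w_1^{(n)}(y)\big) \Big), \qquad \lambda_{J,K} = 4\cosh J\cosh K, \qquad (3.10)$$
and using the result of Lemma 3.2, $w_i^{(n)}(y) \rightrightarrows w_i(y)$ as $n \to \infty$, we obtain uniform convergence of the conditional probabilities, and hence $Q$ is a g-measure with $g$ given by
$$g(y) = \frac{\cosh\big(w_0(y)\big)}{\lambda_{J,K}\cosh\big(w_1(y)\big)} \exp\Big( B\big(w_1(y)\big) \Big). \qquad (3.11)$$
Taking into account that $w_0$, $w_1 = w_0 \circ S$ are Hölder continuous functions satisfying (3.9), and that $\cosh$, $\exp$, and $B$ are smooth functions, we can conclude that $g$ is also Hölder continuous with the same decay of variation:
$$\mathrm{var}_n(g) \le C_3 \max\{\mathrm{var}_n(w_0), \mathrm{var}_n(w_1)\} \le C_4\, \bar\varrho^{\,n}$$
for some $C_3 > 0$ ($C_4 = C_2 C_3$, c.f. (3.9)), and hence the estimate (3.3) holds.

Let us introduce the following functions: for $y \in \{-1, 1\}^{\mathbb{Z}_+}$, put
$$\varphi(y) = B\big(w_0(y)\big), \qquad h(y) = \cosh\big(w_0(y)\big) \exp\big(-B(w_0(y))\big).$$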
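Taking into account that $w_1(y) = w_0(Sy)$, one has
$$\log g(y) = \varphi(y) + \log h(y) - \log h(Sy) - \log\lambda_{J,K}.$$
Since every g-measure is also an equilibrium state for $\log g$, we conclude that $Q$ is an equilibrium state for
$$\tilde\varphi(y) = \varphi(y) + \log h(y) - \log h(Sy) - \log\lambda_{J,K}.$$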
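The g-function can be approximated from a finite prefix of $y$ by replacing $w_0, w_1$ with their truncated versions. The sketch below assumes $g(y) = \cosh(w_0)\,e^{B(w_1)} / (\lambda_{J,K}\cosh(w_1))$ with $\lambda_{J,K} = 4\cosh J\cosh K$ and the backward recursion $w_i = K y_i + A(w_{i+1})$; the function name `g_approx` is hypothetical. It checks the normalisation $g(+1, y_1, \ldots) + g(-1, y_1, \ldots) = 1$ expected of a g-function.

```python
import math

def g_approx(y, p, eps):
    """Approximate g(y) = Q(y_0 | y_1, y_2, ...) from a finite prefix y."""
    J = 0.5 * math.log((1 - p) / p)
    K = 0.5 * math.log((1 - eps) / eps)
    A = lambda w: 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))
    B = lambda w: 0.5 * math.log(4 * math.cosh(w + J) * math.cosh(w - J))
    w_next, w = 0.0, 0.0                    # boundary value beyond the prefix
    for yi in reversed(y):                  # backward recursion w_i = K*y_i + A(w_{i+1})
        w_next, w = w, K * yi + A(w)
    w0, w1 = w, w_next                      # truncated versions of w_0(y), w_1(y)
    lam = 4 * math.cosh(J) * math.cosh(K)
    return math.cosh(w0) * math.exp(B(w1)) / (lam * math.cosh(w1))

tail = (1, -1, -1, 1, 1, -1, 1, 1, -1, -1) * 3
g_plus = g_approx((1,) + tail, p=0.3, eps=0.2)
g_minus = g_approx((-1,) + tail, p=0.3, eps=0.2)
print(g_plus, g_minus, g_plus + g_minus)
```

Because the truncated expression equals the finite conditional probability $Q(y_0 \mid y_1^n)$, the normalisation holds exactly for every prefix length, not only in the limit.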
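The difference $\tilde\varphi(y) - \varphi(y)$ has a very special form: it is a sum of a so-called coboundary ($\log h(y) - \log h(Sy)$) and a constant ($-\log\lambda_{J,K}$). Two potentials whose difference is of such a form have identical sets of equilibrium states. The reason is that for any translation invariant measure $\tilde{\mathbb{P}}$
$$\int \tilde\varphi\, d\tilde{\mathbb{P}} = \int \varphi\, d\tilde{\mathbb{P}} - \log\lambda_{J,K},$$
since the integral of a coboundary with respect to an invariant measure vanishes. Therefore, if $Q$ achieves the maximum in the right-hand side of (3.1) for $\tilde\varphi$, then $Q$ achieves the maximum for $\varphi$ as well. Thus $Q$ is also an equilibrium state for
$$\varphi(y) = B\big(w_0(y)\big).$$

Any equilibrium measure for a Hölder continuous potential $\varphi$ is also a Bowen-Gibbs measure [3]. In our particular case, a direct proof of the Bowen-Gibbs property for $Q$ is straightforward. Indeed, using the result of (2.2) and the notation introduced above, for every $y = (y_0, y_1, \ldots)$ one has
$$Q(y_0^n) = \frac{\cosh\big(w_0^{(n)}(y)\big)}{2\cosh K \cdot \lambda_{J,K}^{\,n}} \exp\Big( \sum_{i=1}^{n} B\big(w_i^{(n)}(y)\big) \Big).$$
Therefore, for $P = \log\lambda_{J,K}$,
$$\frac{Q(y_0^n)}{\exp\big(S_{n+1}\varphi(y) - (n+1)P\big)} = 2\cosh J \cdot \cosh\big(w_0^{(n)}(y)\big)\, e^{-B(w_0(y))} \exp\Big( \sum_{i=1}^{n} \big[ B\big(w_i^{(n)}(y)\big) - B\big(w_i(y)\big) \big] \Big).$$
It only remains to demonstrate that the right-hand side is uniformly bounded (both in $n$ and $y = (y_0, y_1, \ldots)$) from below and above by some positive constants $\underline{C}$, $\overline{C}$, respectively. Indeed, since $\varepsilon \in (0, 1)$, $I = [-|K| - |J|, |K| + |J|]$ is a finite interval, and by the result of the previous Lemma, $w_i^{(n)}(y) \in I$ for all $i$ and $n$. Using (3.5) and the bound $|B'(w)| = \tfrac12|\tanh(w+J) + \tanh(w-J)| \le 1$, one readily checks that, for instance, the following choice of constants suffices:
$$\underline{C} = 2\cosh J \cdot \exp\Big( -\max_{w \in I} B(w) - \frac{C}{1-\varrho} \Big), \qquad \overline{C} = 2\cosh J \cdot \cosh(|K| + |J|) \exp\Big( -\min_{w \in I} B(w) + \frac{C}{1-\varrho} \Big),$$
with $C$ and $\varrho$ as in (3.5).

We complete this section with a curious continued fraction representation of the g-function (3.11).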

Two-sided conditional probabilities and denoising
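In the previous section we established that $Q$ is a Bowen-Gibbs measure. The notion of a Gibbs measure originates in Statistical Mechanics, and is not equivalent to the Bowen-Gibbs definition. In Statistical Mechanics, one is interested in two-sided conditional probabilities
$$Q\big(y_0 \mid y_{-m}^{-1}, y_1^{n}\big), \qquad m, n \ge 1,$$
and their limits as $m, n \to \infty$. The method of Section 2 can be used to evaluate the conditional probabilities
$$Q\big(y_0 \mid y_{-m}^{-1}, y_1^{n}\big) = \frac{Q\big(y_{-m}^{-1}, y_0, y_1^n\big)}{Q\big(y_{-m}^{-1}, y_0, y_1^n\big) + Q\big(y_{-m}^{-1}, \bar y_0, y_1^n\big)},$$
where $\bar y_0 = -y_0$. We can evaluate the partition function by first summing over spins on the right, $x_n, \ldots, x_1$, and then summing over spins on the left, $x_{-m}, \ldots, x_{-1}$. One has
$$Z_{-m,n}\big(y_{-m}^{n}\big) = 2\cosh\Big( K y_0 + A\big(w_1^{(n)}\big) + A\big(w_{-1}^{(-m)}\big) \Big) \exp\Big( \sum_{i=1}^{n} B\big(w_i^{(n)}\big) + \sum_{i=-m}^{-1} B\big(w_i^{(-m)}\big) \Big),$$
where now $w_n^{(n)} = K y_n$, $w_i^{(n)} = K y_i + A\big(w_{i+1}^{(n)}\big)$ for $1 \le i < n$, and $w_{-m}^{(-m)} = K y_{-m}$, $w_i^{(-m)} = K y_i + A\big(w_{i-1}^{(-m)}\big)$ for $-m < i \le -1$. Therefore, with $a_{m,n} = A\big(w_1^{(n)}\big) + A\big(w_{-1}^{(-m)}\big)$,
$$Q\big(y_0 \mid y_{-m}^{-1}, y_1^{n}\big) = \frac{\cosh(K y_0 + a_{m,n})}{2\cosh K \cosh a_{m,n}} = \frac12\big( 1 + y_0 \tanh K \tanh a_{m,n} \big).$$
Again, given this expression, one easily establishes uniform convergence and existence of the limits
$$\lim_{m,n\to\infty} Q\big(y_0 \mid y_{-m}^{-1}, y_1^{n}\big).$$
Thus the two-sided conditional probabilities are also regular, c.f. Theorem 3.1.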
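The two-sided conditional probabilities can be checked against a brute-force ratio of cylinder probabilities. The sketch below assumes that summing out the spins on each side produces an effective field $K y_0 + a$ on the central spin, with $a$ the sum of the $A$-images of the right and left recursions, so that $Q(y_0 \mid \cdot) = \tfrac12(1 + y_0\tanh K\tanh a)$. The function names and the variable `v` for the left recursion are hypothetical.

```python
import math
from itertools import product

def A(w, J):
    return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))

def two_sided_cond(y_left, y0, y_right, p, eps):
    """Q(Y_0 = y0 | y_left = (y_-m..y_-1), y_right = (y_1..y_n)) via two recursions."""
    J = 0.5 * math.log((1 - p) / p)
    K = 0.5 * math.log((1 - eps) / eps)
    w = 0.0
    for yi in reversed(y_right):      # ends with the field built from y_1, ..., y_n
        w = K * yi + A(w, J)
    v = 0.0
    for yi in y_left:                 # ends with the field built from y_-m, ..., y_-1
        v = K * yi + A(v, J)
    a = A(w, J) + A(v, J)             # total field passed on to the central spin
    return 0.5 * (1 + y0 * math.tanh(K) * math.tanh(a))

def cylinder_prob_brute(y, p, eps):
    """Reference value: explicit sum over all hidden paths."""
    total = 0.0
    for x in product((-1, 1), repeat=len(y)):
        weight = 0.5
        for s, t in zip(x, x[1:]):
            weight *= p if s != t else 1 - p
        for xi, yi in zip(x, y):
            weight *= eps if xi != yi else 1 - eps
        total += weight
    return total

left, right = (1, -1), (1, 1, -1)
num = cylinder_prob_brute(left + (1,) + right, 0.3, 0.2)
den = num + cylinder_prob_brute(left + (-1,) + right, 0.3, 0.2)
print(two_sided_cond(left, 1, right, 0.3, 0.2), num / den)
```

The closed form costs one pass over each context, whereas the brute-force ratio is exponential in the total context length.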

Denoising
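Reconstruction of signals corrupted by noise during transmission is one of the classical problems in Information Theory. Suppose we observe a sequence $\{y_n\}$, $n = 1, \ldots, N$, given by (1.1), i.e.,
$$y_n = x_n \cdot z_n,$$
where $\{x_n\}$ is some unknown realisation of the Markov chain, and $\{z_n\}$ is an unknown realisation of the Bernoulli sequence $\{Z_n\}$. The natural question is: given the observed data $y^N = (y_1, \ldots, y_N)$, what is the optimal choice of $\hat X_n = \hat X_n(y^N)$, the estimate of $X_n$, such that the empirical zero-one loss (bit error rate)
$$\frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\big[\hat X_n \ne x_n\big]$$
is minimal? The corresponding standard maximum a posteriori probability (MAP) estimator (denoiser) is given by
$$\hat X_n(y^N) = \arg\max_{x \in \{-1, 1\}} \mathbb{P}\big(X_n = x \mid Y^N = y^N\big).$$

In case the parameters of the Markov chain (i.e., $P$) and of the channel (i.e., $\Pi$) are known, the conditional probabilities $\mathbb{P}(X_n = x \mid Y^N = y^N)$ can be found using the backward-forward algorithm. Namely, one has
$$\mathbb{P}\big(X_n = x \mid Y^N = y^N\big) = \frac{\alpha_n(x)\beta_n(x)}{\sum_{x'} \alpha_n(x')\beta_n(x')}, \qquad (4.1)$$
where
$$\alpha_n(x) = \mathbb{P}\big(Y_1^n = y_1^n, X_n = x\big), \qquad \beta_n(x) = \mathbb{P}\big(Y_{n+1}^N = y_{n+1}^N \mid X_n = x\big)$$
are the so-called forward and backward variables, satisfying simple recurrence relations:
$$\alpha_{n+1}(x) = \sum_{x'} \alpha_n(x')\, p_{x',x}\, \Pi_{x, y_{n+1}}, \qquad \beta_n(x) = \sum_{x'} p_{x,x'}\, \Pi_{x', y_{n+1}}\, \beta_{n+1}(x'),$$
with $\alpha_1(x) = \mathbb{P}(X_1 = x)\, \Pi_{x, y_1}$ and $\beta_N(x) = 1$.

The key observation of [16] is that the probability distribution $\mathbb{P}(X_n = \cdot \mid Y^N = y^N)$, viewed as a column vector, can be expressed in terms of the two-sided conditional probabilities $Q(Y_n = \cdot \mid Y^{N \setminus n} = y^{N \setminus n})$, with $N \setminus n = \{1, \ldots, N\} \setminus \{n\}$, as follows:
$$\mathbb{P}\big(X_n = \cdot \mid Y^N = y^N\big) = \frac{\pi_{y_n} \odot \Pi^{-T} Q\big(Y_n = \cdot \mid Y^{N \setminus n} = y^{N \setminus n}\big)}{Q\big(Y_n = y_n \mid Y^{N \setminus n} = y^{N \setminus n}\big)}, \qquad (4.2)$$
where
$$\Pi = (\Pi_{x,y}) = \begin{pmatrix} 1 - \varepsilon & \varepsilon \\ \varepsilon & 1 - \varepsilon \end{pmatrix}$$
is the emission matrix, $\pi_{-1}$, $\pi_1$ are the columns of $\Pi$, and $\odot$ is the componentwise product of vectors of equal lengths, $(u \odot v)_i = u_i v_i$.

Expression (4.2) opens a possibility of constructing denoisers when the parameters of the underlying Markov chain are unknown; we continue to assume that the channel remains known. Indeed, the two-sided conditional probabilities $Q(Y_n = \cdot \mid Y^{N \setminus n} = y^{N \setminus n})$ could be estimated from the data. The Discrete Universal Denoiser algorithm (DUDE) [16] estimates the conditional probabilities as
$$\hat Q\big(Y_n = c \mid Y_{n-k_N}^{n-1} = a_{-k_N}^{-1},\ Y_{n+1}^{n+k_N} = b_1^{k_N}\big) = \frac{m\big(a_{-k_N}^{-1}\, c\, b_1^{k_N}\big)}{\sum_{c'} m\big(a_{-k_N}^{-1}\, c'\, b_1^{k_N}\big)}, \qquad (4.3)$$
where $m\big(a_{-k_N}^{-1}\, c\, b_1^{k_N}\big)$ is the number of occurrences of the word $a_{-k_N}^{-1}\, c\, b_1^{k_N}$ in the observed sequence $y^N = (y_1, \ldots, y_N)$; the length $k_N$ of the right and left contexts is chosen as a function of $N$.

The DUDE has shown excellent performance in a number of test cases. In particular, in the case of the binary memoryless channel and the symmetric Markov chain considered in this paper, its performance is comparable to that of the backward-forward algorithm (4.1), which requires full knowledge of the source distribution, while DUDE is completely oblivious in that respect. In our opinion, the excellent performance of DUDE in this case is partially due to the fact that $Q$ is a Gibbs measure, admitting smooth two-sided conditional probabilities, which are well approximated by (4.3) and thus can be estimated from the data. It will be interesting to evaluate the performance in cases when the output measure is not Gibbs.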
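The two denoising strategies discussed above can be compared in a small simulation. The sketch below is an illustration only, not the reference DUDE implementation: it implements the backward-forward MAP rule with per-step normalisation, and a count-based plug-in rule that estimates two-sided conditional probabilities from context counts and maximises $\Pi_{x, y_n} \cdot (\Pi^{-T} m)(x)$. It assumes $\varepsilon < \tfrac12$, a fixed context length `k`, and leaves the first and last `k` symbols unchanged; all names and parameter values are hypothetical.

```python
import random
from collections import Counter

S = (-1, 1)  # state and output alphabet

def simulate(n, p, eps, seed=0):
    rng = random.Random(seed)
    x = [rng.choice(S)]
    for _ in range(n - 1):
        x.append(-x[-1] if rng.random() < p else x[-1])
    y = [-xi if rng.random() < eps else xi for xi in x]
    return x, y

def normalise(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

def forward_backward_map(y, p, eps):
    """MAP estimate of each X_n; alpha and beta are rescaled to avoid underflow."""
    T = {(a, b): (1 - p if a == b else p) for a in S for b in S}
    E = {(a, o): (1 - eps if a == o else eps) for a in S for o in S}
    n = len(y)
    alpha = [normalise({a: 0.5 * E[a, y[0]] for a in S})]
    for t in range(1, n):
        prev = alpha[-1]
        alpha.append(normalise({a: E[a, y[t]] * sum(prev[u] * T[u, a] for u in S) for a in S}))
    beta = [dict.fromkeys(S, 1.0)]                 # beta at the last position
    for t in range(n - 1, 0, -1):                  # beta at t-1 from beta at t
        nxt = beta[-1]
        beta.append(normalise({a: sum(T[a, u] * E[u, y[t]] * nxt[u] for u in S) for a in S}))
    beta.reverse()
    return [max(S, key=lambda a: alpha[t][a] * beta[t][a]) for t in range(n)]

def plug_in_denoiser(y, eps, k=2):
    """Count-based two-sided estimates plugged into the posterior rule (eps < 1/2)."""
    n = len(y)
    counts = Counter()
    for t in range(k, n - k):
        counts[tuple(y[t - k:t]), y[t], tuple(y[t + 1:t + k + 1])] += 1
    out = list(y)
    for t in range(k, n - k):
        left, right = tuple(y[t - k:t]), tuple(y[t + 1:t + k + 1])
        m = {c: counts[left, c, right] for c in S}
        score = {}
        for x in S:
            inverted = (1 - eps) * m[x] - eps * m[-x]   # (Pi^{-T} m)(x) up to a positive factor
            likelihood = 1 - eps if x == y[t] else eps  # Pi_{x, y_t}
            score[x] = likelihood * inverted
        out[t] = max(S, key=score.get)
    return out

p, eps = 0.1, 0.2
x, y = simulate(20000, p, eps)
error_rate = lambda est: sum(a != b for a, b in zip(x, est)) / len(x)
err_raw = error_rate(y)
err_fb = error_rate(forward_backward_map(y, p, eps))
err_plug = error_rate(plug_in_denoiser(y, eps, k=2))
print(f"raw {err_raw:.4f}  backward-forward {err_fb:.4f}  plug-in {err_plug:.4f}")
```

The backward-forward rule uses both $p$ and $\varepsilon$, while the plug-in rule uses only $\varepsilon$ and the observed sequence, which is the sense in which it is oblivious to the source distribution.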
The invention of DUDE sparked great interest in two-sided approaches to information-theoretic problems. It turns out that, despite the fact that efficient algorithms for the estimation of one-sided models exist, the analogous two-sided problem is substantially more difficult. As alternatives to (4.3), other methods to estimate two-sided conditional probabilities have been suggested, e.g., [6,11,17]. For example, Yu and Verdú [17] proposed the so-called BFP model, which seems to perform extremely well [17].
Among other alternatives, let us mention the possibility of extending standard one-sided algorithms to produce algorithms for estimating two-sided conditional probabilities from data. This approach is investigated in a joint work with S. Berghout, where the denoising performance of the resulting Gibbsian models is evaluated. The Gibbsian algorithm performs better than DUDE: bit error rates are given in the table below for noise level $\varepsilon = 0.2$ and various values of $p$ (smaller rates are better).
p       Gibbs      DUDE
0.05    5.30 %     5.58 %
0.10    9.91 %     10.48 %
0.15    13.20 %    13.77 %
0.20    18.34 %    18.77 %

One could also try to estimate the Gibbsian potential directly, e.g., using the estimation procedure proposed in [5]. This method showed promising performance in experiments on language classification and authorship attribution. In conclusion, let us also mention that the direct two-sided Gibbs modeling of stochastic processes opens possibilities for applying semi-parametric statistical procedures, as opposed to the universal (parameter free) approach of DUDE.