A computable lower bound for the KL from Hammersley-Chapman-Robbins inequality

I first read of this bound from:

Nishiyama, T., 2019. A New Lower Bound for Kullback-Leibler Divergence Based on Hammersley-Chapman-Robbins Bound. arXiv:1907.00288 [cs, math, stat].

The following notes are not very polished, but present the general idea. I hope they are useful.

The strategy of the paper is quite nice; first, Nishiyama shows that if we have two distributions \(P,Q\) and define the mixture \(R_t(x):=P(x)+t(Q(x)-P(x)), t\in [0,1]\), then:

\[\frac{d}{dt}D_a(P|R_t)=\frac{1-a}{t}D_a(P|R_t)+\frac{1+a}{t}D_{a+1}(P|R_t),\]

for any \(a\) and \(D_a\) being the alpha-divergence. Setting \(a=1\) leaves us with the KL divergence and \(\chi^2\), which recovers the nice identity:

\[\frac{d}{dt}KL(P|R_t)=\frac{1}{t}\chi^2(P|R_t).\]

Now, fix a function \(f\) for which \(E_Q, E_P, V_P, V_{R_t}\) (expectations and variances) are finite and \(V_{R_t}>0\) for all \(t\in [0,1]\). Then, applying the HCR inequality gives:

\[\frac{d}{dt}KL(P|R_t)\geq \frac{(E_{R_t}-E_P)^2}{tV_{R_t}}.\]

Integrating the above in \([0,1]\) gives the result of the paper as \(\int_0^1 KL(P|R_t)dt=KL(P|Q).\)

We can also show that

\[KL(P|Q)=\int_0^1 \frac{\chi^2(P|R_t)}{t}dt=\int_0^1 t\int_{\Omega}\frac{(Q-P)^2}{P+t(Q-P)}dxdt.\]

Assuming everything exists, we can exchange the integrals (ala. Fubini) and then expand the \(t\) function around \(t=1\) to introduce the chi-square divergence:

\[KL(P|Q)=\chi^2(P|Q)+...\]

Actual lower bound

The lower bound is:

\[KL(P|Q)\geq \int_0^1\frac{(E_{R_t}-E_P)^2}{tV_{R_t}}dt,\]

where the integral depends on \(E_P, E_Q, V_P, V_Q\) and can be computed analytically and written as a sum of logarithms (by using partial fractions).

Tighter lower-bounds

Starting from the equality:

\[KL(P|Q)=\int_0^1 \frac{\chi^2(P|R_t)}{t}dt\]

we can derive tighter lower bounds for the KL. First, we write the variational representation of \(\chi^2\)

\[\chi^2(P|R_t)=\sup_{h}\left \{ 2E_P[h]-E_Q[h^2]-1\right \}.\]

Suppressing the family of functions leads to lower bounds of chi-square and thus to lower bounds of the KL. For example, the HCR bound can be derived by considering first degree polynomials.




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Bounds on joint probabilities - Part I
  • Quick notes on the "Enhanced Cauchy Schwarz inequality and some of its statistical applications
  • Rewriting the Kullback-Leibler with an integral transform
  • Stein and the Normal Distribution
  • Control variates and hypothesis testing