Kostis Gourgoulias
My website, served with Jekyll.
http://kgourgou.me
Stein and the Normal Distribution
<p>Hello and happy 2021!</p>
<p>Inspired by this
<a href="https://twitter.com/docmilanfar/status/1312936010393640961?s=20">tweet</a>, I
wanted to understand the basics of Stein’s characterization of the Normal
distribution.</p>
<p>With Stein’s idea, we can identify the distribution of a random variable by
checking that it satisfies some condition in expectation. For example, for the
standard normal, we have that \(X\sim N(0,1)\) if and only if, for all \(f\) with
\(\mathbb{E}[|f'(X)|]<\infty\), we have:</p>
\[\mathbb{E}[Xf(X)-f'(X)]=0.\]
<p>The operator \(Af:=xf-f'\) is then called the Stein operator and we can rewrite
the result with this operator as: \(\mathbb{E}_{P}[Af]=0\) for all \(f\in C^1_b\) iff
\(P\) is \(N(0,1)\).</p>
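<p>As a quick sanity check, we can estimate \(\mathbb{E}[Xf(X)-f'(X)]\) with Monte Carlo; the test function \(f=\tanh\) below is an arbitrary choice:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # samples from N(0,1)

# Test function f = tanh, with f'(x) = 1 - tanh(x)^2; both are bounded.
stein = x * np.tanh(x) - (1.0 - np.tanh(x) ** 2)

# For N(0,1) samples, the Stein operator has mean zero up to Monte Carlo error.
print(stein.mean())
```

<p>Repeating the experiment with samples from another distribution, e.g., a shifted normal, produces a visibly nonzero mean.</p>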
<p>This operator is not unique, as we can always add terms with zero expectation under \(P\); see
\(Bf:=xf-f'+x\) which satisfies</p>
\[\mathbb{E}[Bf]=\mathbb{E}[Af]+\mathbb{E}[x]=\mathbb{E}[Af].\]
<p>How can we show \(A\) characterises the standard Normal? A key identity is that
the probability density function (PDF) of the \(N(0,1)\) satisfies \(P'+xP=0\). So,
if \(P\) is indeed the PDF of the standard normal, then applying integration by
parts to \(E_{P}[f']\) and using the differential equation gives us Stein’s
formula.</p>
<p>Now, if \(P\) is the PDF of any other distribution and it satisfies Stein’s
formula for all \(f\in C^1_b\), then integration by parts on \(E_{P}[f']\) leads us
back to \(P'+xP=0\). That ODE is separable with solution \(P\propto
\exp(-x^2/2)\), which, up to the normalisation, is the PDF of the standard
normal!</p>
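<p>Explicitly, separating variables:</p>
\[\frac{dP}{P}=-x\,dx \quad\Rightarrow\quad \log P=-\frac{x^2}{2}+C \quad\Rightarrow\quad P\propto e^{-x^2/2}.\]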
<p>This strategy of deriving an ODE for the density function, getting its weak form
by multiplying by a smooth function \(f\) and integrating, can be repeated to get
<a href="https://en.wikipedia.org/wiki/Stein%27s_method#The_Stein_operator">Stein
operators</a>
for other distributions, e.g., the exponential, etc.</p>
Fri, 22 Jan 2021 00:00:00 +0000
http://kgourgou.me//Stein-and-the-Normal-Distribution/
Rediscovering a function from samples
<p>A friend shared this nice problem with me. Suppose you have a fixed function, \(f\), and a family of probability distributions defined by, say, probability density functions, \(P_t\), \(t \in A\). If we know \(E(t)=E_{P_t}[f]\) for every \(t\in A\), can we recover \(f\)?</p>
<p>Clearly the answer depends on both how rigid \(f\) is and the family \(P_t\). We can cast this problem as a functional analysis problem by defining \(k(x,t)=P_t(x)\) to be a kernel and the expectation to be an integral transform. Then the question becomes: is there an inverse kernel, say, \(k^{-1}(x,t)\), such that
\(\int k^{-1}(x,t)E(t)dt=f(x)?\)
When does that exist and when is it unique? Hints can be taken from the Laplace transform, i.e., \(P_t(x)\propto e^{-tx}\); up to a normalizing constant, this is just the exponential distribution. In general, though, this can be a hard problem.</p>
<h2 id="fredholm-equations">Fredholm equations</h2>
<p>If we know \(E(t)=E_{P_t}[f]\) for every \(t\in A\), can we recover \(f\)? Formally, we have the equation:
\(E(t)=\int k(x,t)f(x)dx,\)
with appropriate limits for the integral. This equation is called a <a href="https://en.wikipedia.org/wiki/Fredholm_integral_equation">“Fredholm Equation of the first kind”</a> and is closely studied in functional analysis and signal processing.</p>
<h2 id="practical-stuff">Practical stuff</h2>
<p>If we assume the existence of an inverse kernel, how can we approximate it? One idea, which is also a fairly standard approach, is to fix a set of orthonormal basis functions, expand everything in terms of them, and then truncate to arrive at a linear algebra problem.</p>
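<p>A minimal numerical sketch of that idea: discretise the Fredholm equation on a grid, so that the integral becomes a matrix, and solve the resulting (ill-conditioned) linear system with Tikhonov regularisation. The exponential kernel and the test function below are arbitrary choices:</p>

```python
import numpy as np

# Grids for x (where f lives) and t (where the data E(t) lives).
nx, nt = 200, 100
x = np.linspace(0.0, 1.0, nx)
t = np.linspace(0.1, 5.0, nt)
dx = x[1] - x[0]

# Exponential kernel k(x, t) = exp(-t x): essentially a discretised Laplace transform.
K = np.exp(-np.outer(t, x)) * dx   # (nt, nx); each row approximates an integral over x

f_true = np.sin(2 * np.pi * x)     # arbitrary test function
E = K @ f_true                     # synthetic "expectations" E(t)

# Tikhonov-regularised least squares: argmin_f ||K f - E||^2 + lam ||f||^2.
lam = 1e-8
f_hat = np.linalg.solve(K.T @ K + lam * np.eye(nx), K.T @ E)

residual = np.linalg.norm(K @ f_hat - E)
```

<p>The forward residual is tiny, but because the problem is ill-posed, the recovered \(f\) can still differ noticeably from the true one; the regularisation strength trades off this instability against fidelity to the data.</p>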
Mon, 28 Dec 2020 00:00:00 +0000
http://kgourgou.me//Rediscovering-a-function-from-samples/
A computable lower bound for the KL from Hammersley-Chapman-Robbins inequality
<p>I first read of this bound in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Nishiyama, T., 2019. A New Lower Bound for Kullback-Leibler Divergence Based on Hammersley-Chapman-Robbins Bound. arXiv:1907.00288 [cs, math, stat].
</code></pre></div></div>
<p>The following notes are not very polished, but present the general idea. I hope they are useful.</p>
<p>The strategy of the paper is quite nice; first, Nishiyama shows that if we have two distributions \(P,Q\) and define the mixture \(R_t(x):=P(x)+t(Q(x)-P(x)), t\in [0,1]\), then:</p>
\[\frac{d}{dt}D_a(P|R_t)=\frac{1-a}{t}D_a(P|R_t)+\frac{1+a}{t}D_{a+1}(P|R_t),\]
<p>for any \(a\) and \(D_a\) being the alpha-divergence. Setting \(a=1\) leaves us with the KL divergence and \(\chi^2\), which recovers the nice identity:</p>
\[\frac{d}{dt}KL(P|R_t)=\frac{1}{t}\chi^2(P|R_t).\]
<p>Now, fix a function \(f\) for which \(E_Q, E_P, V_P, V_{R_t}\) (expectations and variances) are finite and \(V_{R_t}>0\) for all \(t\in [0,1]\). Then, applying the HCR inequality gives:</p>
\[\frac{d}{dt}KL(P|R_t)\geq \frac{(E_{R_t}-E_P)^2}{tV_{R_t}}.\]
<p>Integrating the above over \([0,1]\) gives the result of the paper: since \(R_0=P\) and \(R_1=Q\),
\(\int_0^1 \frac{d}{dt}KL(P|R_t)\,dt=KL(P|Q)-KL(P|P)=KL(P|Q).\)</p>
<p>We can also show that</p>
\[KL(P|Q)=\int_0^1 \frac{\chi^2(P|R_t)}{t}dt=\int_0^1 t\int_{\Omega}\frac{(Q-P)^2}{P+t(Q-P)}dxdt.\]
<p>Assuming everything exists, we can exchange the order of integration (via Fubini) and then expand in \(t\) around \(t=1\) to introduce the chi-square divergence:</p>
\[KL(P|Q)=\chi^2(P|Q)+...\]
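<p>The identity \(KL(P|Q)=\int_0^1 \chi^2(P|R_t)/t\,dt\) is also easy to check numerically; here with two arbitrary discrete distributions:</p>

```python
import numpy as np

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])

kl = np.sum(P * np.log(P / Q))

# chi^2(P|R_t)/t = t * sum((Q - P)^2 / R_t), which is well defined down to t = 0.
ts = np.linspace(0.0, 1.0, 10_001)
integrand = np.array([t * np.sum((Q - P) ** 2 / (P + t * (Q - P))) for t in ts])

# Trapezoidal rule over t.
kl_from_chi2 = np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(ts))
```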
<h2 id="actual-lower-bound">Actual lower bound</h2>
<p>The lower bound is:</p>
\[KL(P|Q)\geq \int_0^1\frac{(E_{R_t}-E_P)^2}{tV_{R_t}}dt,\]
<p>where the integral depends on \(E_P, E_Q, V_P, V_Q\) and can be computed analytically and written as a sum of logarithms (by using partial fractions).</p>
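<p>With \(f(x)=x\), the bound only needs the means and variances of the mixture; here is a quick numerical comparison against the exact KL of two Gaussians (an arbitrary example of mine):</p>

```python
import numpy as np

# P = N(mu_p, s_p^2), Q = N(mu_q, s_q^2).
mu_p, s_p = 0.0, 1.0
mu_q, s_q = 1.0, 1.0

# Exact KL(P|Q) between Gaussians.
kl = np.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q)**2) / (2 * s_q**2) - 0.5

def mixture_var(t):
    """Variance of R_t = (1 - t) P + t Q, from its first two moments."""
    m = (1 - t) * mu_p + t * mu_q
    second = (1 - t) * (s_p**2 + mu_p**2) + t * (s_q**2 + mu_q**2)
    return second - m**2

# Integrand (E_{R_t} - E_P)^2 / (t V_{R_t}) = t (mu_q - mu_p)^2 / V_{R_t}.
ts = np.linspace(0.0, 1.0, 10_001)
vals = np.array([t * (mu_q - mu_p)**2 / mixture_var(t) for t in ts])
bound = np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(ts))

assert 0.0 < bound <= kl   # a genuine (and here fairly tight) lower bound
```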
<h2 id="tighter-lower-bounds">Tighter lower bounds</h2>
<p>Starting from the equality:</p>
\[KL(P|Q)=\int_0^1 \frac{\chi^2(P|R_t)}{t}dt\]
<p>we can derive tighter lower bounds for the KL. First, we write the variational representation of \(\chi^2\)</p>
\[\chi^2(P|R_t)=\sup_{h}\left \{ 2E_P[h]-E_{R_t}[h^2]-1\right \}.\]
<p>Restricting the family of functions \(h\) leads to lower bounds on the chi-square and thus to lower bounds on the KL.
For example, the HCR bound can be derived by restricting to first-degree polynomials.</p>
Mon, 28 Dec 2020 00:00:00 +0000
http://kgourgou.me//A-computable-lower-bound-for-the-KL-from-Hammersley-Chapman-Robbins-inequality/
Rewriting the Kullback-Leibler with an integral transform
<p>I recently read this nice integral representation of the logarithm in 1912.05812v1 by Neri Merhav and Igal Sason. Most of the ideas in this post are from there. The transform is:
\(\log(x)=\int_{0}^{\infty}\frac{e^{-t}-e^{-tx}}{t}dt.\)</p>
<p>This can be shown by using \(\log(x)=\int_{1}^{x}1/t\cdot dt\) along with \(\frac{1}{x}=\int_{0}^{\infty}e^{-xu}du\). It all becomes more interesting when we take an expectation though:</p>
\[\mathbb{E}[\log(X)]=\int_{0}^{\infty}\frac{e^{-t}-\mathbb{E}[e^{-tX}]}{t}dt\]
<p>This expresses the expectation of the logarithm of a positive random variable in terms of the moment generating function of \(X\), evaluated at negative arguments. When that transform has a closed form, the representation may be simpler to work with. It is not obvious that it helps with computation, but it does suggest that we can use tools like concentration bounds for expectations of logarithms.</p>
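<p>The representation itself is easy to verify numerically by truncating the integral at a finite upper limit:</p>

```python
import numpy as np

def log_via_transform(x, t_max=100.0, n=200_001):
    """Approximate log(x) via the Frullani-type integral on a log-spaced grid."""
    # The integrand tends to (x - 1) as t -> 0 and decays like e^{-t}/t for large t.
    t = np.geomspace(1e-10, t_max, n)
    y = (np.exp(-t) - np.exp(-t * x)) / t
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(t))

for x in (0.5, 3.0, 10.0):
    assert abs(log_via_transform(x) - np.log(x)) < 1e-3
```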
<p>It’s fun to apply the same idea to the KL.</p>
\[R(Q|P)=\int Q(x) \log \frac{Q(x)}{P(x)}dx.\]
<p>For simplicity, let \(w(x):=\frac{Q(x)}{P(x)}\), so that \(R(Q|P)=\mathbb{E}_Q[\log w]\). Then, from the integral representation above:</p>
\[\mathbb{E}_Q[\log w]=\int_{0}^{\infty}\frac{e^{-t}-\mathbb{E}_Q[e^{-tw}]}{t}dt.\]
<p>The last equation is a different way to express the KL divergence. It’s not particularly useful as is (to my eyes), as the MGF of \(w\) is a tough cookie to compute. However, with a lower bound to the MGF we could get an upper bound to the KL that is not trivial.</p>
Mon, 20 Jan 2020 00:00:00 +0000
http://kgourgou.me//Rewriting-the-Kullback-Leibler-with-an-integral-transform/
The simplest Bayesian optimization example
<p>After a really interesting paper-discussion session, <a href="https://github.com/kgourgou/baeysian-opt-for-fun">I decided to implement</a> Bayesian optimization with the <code class="language-plaintext highlighter-rouge">expected-improvement</code> acquisition function. I’ve already fixed a few bugs, but can’t promise it is bug-free. It will probably work as long as you stick to my examples.</p>
Sun, 10 Nov 2019 00:00:00 +0000
http://kgourgou.me//The-simplest-Bayesian-optimization-example/
Research
<p>Just a short post on past and current research.</p>
<p>A theme of my research has been the various uncertainties in the data and how they affect the rest of the modeling process, e.g., the selection of a parametric vs non-parametric family for fitting, the metrics used, etc. What mathematical tools do we need to carry out inference confidently when almost every component of the model is noisy?</p>
<p>The titles of a few papers I have published are below. You can find a more up-to-date list on <a href="https://scholar.google.com/citations?user=V1S7npsAAAAJ&hl=en">Google Scholar</a> which also includes a list of patents I have authored with colleagues from Babylon Health.</p>
<p><code class="language-plaintext highlighter-rouge">2019</code>
<em>Tuning the semantic consistency of active medical diagnosis: a walk on the semantic simplex</em>, with A. Buchard, A. Navarro, et al. Presented at the Stanford Symposium “Frontiers of AI-assisted care”.</p>
<p><code class="language-plaintext highlighter-rouge">2018</code>
<em>Universal Marginalizer for Amortised Inference and Embedding of Generative Models</em>, with R. Walecki, A. Buchard, et al. Submitted to AISTATS. arXiv: 1811.04727</p>
<p><code class="language-plaintext highlighter-rouge">2017</code>
<em>A Universal Marginalizer for Amortized Inference in Generative Models</em>, NeurIPS workshop on Advances in Approximate Bayesian Inference, 2017, with L. Douglas, I. Zarov, et al. arXiv: 1711.00695.</p>
<p><code class="language-plaintext highlighter-rouge">2017</code>
<em>Information criteria for quantifying loss of reversibility in parallelized KMC</em>, with M. Katsoulakis, L. Rey-Bellet. Accepted at the Journal of Computational Physics 328, 438-454.</p>
<p><code class="language-plaintext highlighter-rouge">2017</code>
<em>How biased is your model? Concentration Inequalities, Information and Model Bias</em>, with M. Katsoulakis, L. Rey-Bellet and J. Wang. Accepted at the IEEE Transactions on Information Theory.</p>
<p><code class="language-plaintext highlighter-rouge">2016</code>
<em>Information metrics for long-time errors in splitting schemes for stochastic dynamics and parallel Kinetic Monte Carlo</em>, with M. Katsoulakis and L. Rey-Bellet. Accepted at the SIAM Journal on Scientific Computing 38 (6), A3808-A3832.</p>
Wed, 04 Sep 2019 00:00:00 +0000
http://kgourgou.me//Research/
Bounds on joint probabilities - Part I
<p>Here are some notes on bounding joint probability distributions. Enjoy! This was
converted from \(\LaTeX\) with pandoc, so typos, missing figures, etc., are to be expected.</p>
<p>Consider the binary random variables \(X_1, \ldots, X_n\) following the
distribution \(P\). For some collection of values, say,
\(x_1, \ldots, x_n\), we are interested in computing
\(P(X_1=x_1,\ldots, X_n=x_n)\).</p>
<p>There is a rich literature on bounding joint probabilities, say, \(P(X_1,X_2,X_3)\), if one has
knowledge of the marginals, \(P(X_i),\) \(i=1,2,3\), of the pairwise joints \(P(X_{i},X_j)\),
\(i\neq j\), or of the moments of the marginal distributions. Some
examples of such inequalities follow below.</p>
<p>When the bounds only use \(P(X_i)\), we will say that they utilize
<em>first-order</em> information. Similarly, if \(P(X_i, X_j)\) are used in the
bounds, they are of second-order, then third-order, etc.</p>
<h2 id="bonferroni-inequalities">Bonferroni inequalities</h2>
<p>We start with a classical result, inspired by the inclusion-exclusion
formula, known as the <em>Bonferroni</em>
inequalities [@galambos1977bonferroni]. The notation \(X^c\) corresponds
to the negation of the \(X\) variable, i.e., if \(X=x\), \(X^c=1-x\) for
\(x\in \{0,1\}\). First, we define:</p>
\[\begin{aligned}
S_1&:=\sum_{i}P(X_i^c),\\
S_k&:=\sum_{1\leq i_1< \ldots < i_k\leq n} P(X_{i_1}^c,\ldots, X_{i_k}^c).\end{aligned}\]
<p>Then, for every odd \(k\) in \(\{1,\ldots, n\}\):</p>
\[\begin{aligned}
P(X_1,\ldots, X_n)&\geq 1 -\sum_{j=1}^{k} (-1)^{j-1}S_j.
\end{aligned}\]
<p>We can also get an upper bound for every even \(k\):</p>
\[\begin{aligned}
P(X_1,\ldots, X_n)&\leq 1 -\sum_{j=1}^{k} (-1)^{j-1}S_j.\end{aligned}\]
<p>By the inclusion-exclusion formula, the inequalities become equalities
when \(k=n\). Thus, the inequalities can be made sharper by including more
marginals. However, the upper (and lower) bounds don’t necessarily
become sharper monotonically as \(k\) increases; see work
by [@schwager1984bonferroni]. Also, although the inequalities are valid
for all \(k\), they can be uninformative, that is, smaller than zero or
greater than one.</p>
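<p>A quick sanity check of the Bonferroni bounds, with an arbitrary joint distribution over three binary variables:</p>

```python
import itertools

# A fixed joint distribution over (X1, X2, X3); probabilities sum to one.
joint = {
    (0, 0, 0): 0.05, (0, 0, 1): 0.10, (0, 1, 0): 0.05, (0, 1, 1): 0.15,
    (1, 0, 0): 0.10, (1, 0, 1): 0.05, (1, 1, 0): 0.10, (1, 1, 1): 0.40,
}
n = 3

def prob_zero(indices):
    """Marginal P(X_i = 0 for all i in indices)."""
    return sum(p for x, p in joint.items() if all(x[i] == 0 for i in indices))

S = {k: sum(prob_zero(c) for c in itertools.combinations(range(n), k))
     for k in range(1, n + 1)}

def bonferroni(k):
    """1 - sum_{j<=k} (-1)^{j-1} S_j: a lower bound for odd k, upper for even k."""
    return 1 - sum((-1) ** (j - 1) * S[j] for j in range(1, k + 1))

target = joint[(1, 1, 1)]
assert bonferroni(1) <= target <= bonferroni(2)   # odd k: lower, even k: upper
assert abs(bonferroni(3) - target) < 1e-12        # k = n recovers the joint exactly
```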
<h2 id="frechet-bounds">Fréchet bounds</h2>
<p>An alternative upper bound for the joint is the Fréchet-type bound:</p>
\[\begin{aligned}
\label{eq:frechet}
P(X_1,\ldots, X_n)\leq \min_{i}P(X_i).
\end{aligned}\]
<p>This can be
simply derived by observing that, for any \(i\),</p>
\[P(X_1,\ldots, X_n)=P(X_1,\ldots,X_{i-1},X_{i+1},\ldots, X_n|X_i)P(X_i)\leq P(X_i)\]
<p>and then picking the tightest bound. We can also include terms like
\(P(X_i,X_j)\) in the upper bound, if known, to get an even tighter bound.
As an upper bound, this may be more suitable than the Bonferroni bound;
it is always a valid probability and can be tight when dealing with rare
events. Like the Bonferroni bound, this is distribution-independent.</p>
<p>Now, if all we know about the \(X_i\) are the \(P(X_i)\), then the tightest
bounds we can get are:</p>
\[\begin{aligned}
\label{eq:frechet-first}
\max\{0,1-\sum_i(1-P(X_i))\} \leq P(X_1,\ldots, X_n)\leq \min_{i} P(X_i).\end{aligned}\]
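<p>A tiny worked example of this first-order bracket, with two binary variables:</p>

```python
import numpy as np

# A known joint for two binary variables: rows index X1, columns index X2.
joint = np.array([[0.2, 0.1],    # P(X1=0, X2=0), P(X1=0, X2=1)
                  [0.1, 0.6]])   # P(X1=1, X2=0), P(X1=1, X2=1)

p1 = joint[1, :].sum()   # P(X1 = 1)
p2 = joint[:, 1].sum()   # P(X2 = 1)
target = joint[1, 1]     # P(X1 = 1, X2 = 1)

lower = max(0.0, 1.0 - ((1.0 - p1) + (1.0 - p2)))
upper = min(p1, p2)
assert lower <= target <= upper
```

<p>Here the bracket is \([0.4,\, 0.7]\) around the true value \(0.6\).</p>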
<p>The lower bound comes from the first Bonferroni lower bound. However, it
can be further sharpened by adding second-order information, that is,
some of the \(P(X_i^c,
X_j^c)\), as discussed by [@hochbergsome]. One example of such a
sharpening is known as the <em>Kounias</em> inequality:</p>
\[\begin{aligned}
\label{eq:kounias}
1-\sum_{i}(1-P(X_i))+\max_j \sum_{i\neq j}P(X_i^c,X_j^c)\leq P(X_1,\ldots, X_n).\end{aligned}\]
<p>This can be further sharpened by replacing the max term by</p>
\[\sum_{i,j:(i,j)\in T} P(X_i^c, X_j^c),\]
<p>where \(T\) is the maximal
spanning tree, i.e., the tree that maximizes the sum of the
probabilities. The new bound then is:</p>
\[\begin{aligned}
\label{eq:wolfe}
1-\sum_{i}(1-P(X_i))+\sum_{i,j:(i,j)\in T} P(X_i^c, X_j^c)\leq P(X_1,\ldots, X_n).\end{aligned}\]
<p>This bound was first derived in work by [@hunter1976upper] and has been
subsequently generalized to work with more events via the construction
of multi-trees; see work by [@bukszar2001upper].</p>
<h2 id="multiplicative-bounds">Multiplicative bounds</h2>
<p>In some cases, multiplicative bounds, that is,</p>
\[P(X_1,\ldots X_n)\geq P(X_1)\ldots P(X_n),\]
<p>may also be applicable when the random variables show positive association; see work
by [@esary1967association] for details on that. Those bounds are easier
to apply and often tighter but may not always be correct as they are
distribution dependent. Especially for Bernoulli variables, Theorem 4
in [@esary1967association] shows that association of the
\(X_1,\ldots, X_n\) implies only that</p>
\[\begin{aligned}
P(X_1=1,\ldots, X_n=1)&\geq P(X_1=1)\ldots P(X_n=1),\\
P(X_1=0,\ldots, X_n=0)&\geq P(X_1=0)\ldots P(X_n=0).\end{aligned}\]
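<p>A small Monte Carlo illustration, inducing positive association through a shared latent factor (an arbitrary construction of mine):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Three positively associated Bernoullis: each thresholds a shared latent Gaussian.
z = rng.standard_normal((n, 1))
eps = rng.standard_normal((n, 3))
x = 0.8 * z + 0.6 * eps > 0.0

p = x.mean(axis=0)                   # marginals P(X_i = 1), about 0.5 each
all_ones = x.all(axis=1).mean()
all_zeros = (~x).all(axis=1).mean()

# The two multiplicative lower bounds for associated Bernoulli variables.
assert all_ones >= p.prod()
assert all_zeros >= (1 - p).prod()
```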
Fri, 29 Jun 2018 00:00:00 +0000
http://kgourgou.me//Bounds-on-joint-probabilities/
New manuscript: how biased is your model?
<p>A few days ago, my co-authors, Prof. Katsoulakis, Prof. Rey-Bellet, and PhD candidate Jie Wang, and I pushed to arXiv our latest manuscript, titled “How biased is your model? Concentration Inequalities, Information and Model Bias”.</p>
<p><strong>Abstract</strong>:
We derive tight and computable bounds on the bias of statistical estimators, or more generally of quantities of interest, when evaluated on a baseline model P rather than on the typically unknown true model Q. Our proposed method combines the scalable information inequality derived by P. Dupuis, K. Chowdhary, the authors and their collaborators together with classical concentration inequalities (such as Bennett’s and Hoeffding-Azuma inequalities). Our bounds are expressed in terms of the Kullback-Leibler divergence R(Q||P) of model Q with respect to P and the moment generating function for the statistical estimator under P. Furthermore, concentration inequalities, i.e., bounds on moment generating functions, provide tight and computationally inexpensive model bias bounds for quantities of interest. Finally, they allow us to derive rigorous confidence bands for statistical estimators that account for model bias and are valid for an arbitrary amount of data.</p>
<p>You can find the full manuscript <a href="https://arxiv.org/abs/1706.10260">here</a>.</p>
Sat, 08 Jul 2017 00:00:00 +0000
http://kgourgou.me//New-paper-on-arxiv/
PhD defense is scheduled
<p>My PhD defense is scheduled!</p>
<p>Date: March 24, 2017.
Time: 10:00 AM.
Place: LGRT 1634.</p>
<p>The title of the thesis is “Information Metrics for Predictive Modeling and
Machine Learning”.</p>
<p>Feel free to join if you are curious!</p>
Thu, 09 Feb 2017 00:00:00 +0000
http://kgourgou.me//PhD-defense-scheduled/
http://kgourgou.me//PhD-defense-scheduled/