Kostis Gourgoulias

Quick notes on the “Enhanced Cauchy Schwarz inequality and some of its statistical applications

2024-04-11T00:00:00+00:00

I briefly read this nice paper by Sergio Scarlatti.

In this, the author proves an intermediate term between the classical bounds of the CS inequality.

Classical CS inequality

First things first, for all \(x,y\) in some Hilbert space \(H\) with inner product \((.,.)\) we have:

\[|(x,y)|\leq \|x\|\|y\|.\]

The author shares a nice proof of CS that I’m not sure if I’ve seen before (but looks neat). We define the matrix \(C=C(x,y)\) with \(c_{ij}=\frac{1}{\sqrt{2}}(x_iy_j-x_jy_i),\ i,j=1,\ldots,n\). Then, its second norm is

\[\|C\|_2=\sqrt{\sum_{ij}c_{ij}^2}.\]

If we substitute the definition of \(c_{ij}\) to the above and carry out the algebra, we have \(\|C\|^2_2=\|x\|^2\|y\|^2-(x,y)^2,\) which proves CS as the norm is non-negative.

Enhanced CS

Now, suppose \(V\subseteq H\) is some closed subspace of \(H\). Then if \(P\) is the orthogonal projection onto \(V\) (i.e., \(PH=V\)), the author defines:

\[D(x,y|P):=\|Px\|\cdot \|Py\|+\|P^{\perp}x\|\cdot \|P^{\perp}y\|,\]

for all \(x,y\in H\) and where \(P^{\perp}\) is the projection on the orthogonal complement of \(V\). Then, \(|(x,y)|\leq D(x,y|P)\leq \|x\|\|y\|.\)

I like inequalities with free terms! We can pick \(P\) depending on the problem and the bounds would adapt correspondingly. If \(P\) is a trivial projection to \(H\) or its complement, we recover usual CS.

Proof: It’s a short argument.

\[|(x,y)|=|(Px,Py)+(P^{\perp}x, P^{\perp}y)|\leq |(Px,Py)|+|(P^{\perp}x, P^{\perp}y)|\leq \|Px\|\|Py\|+\|P^{\perp}x\|\|P^{\perp}y\|,\]

where we used bilinearity of inner product, triangle inequality, and CS inequality for each term. Now, if \(a=\|Px\|, b=\|Py\|, c=\|P^{\perp}x\|, d=\|P^{\perp}y\|\), the author shows that

\[ab+cd=\sqrt{(ab+cd)^2}\leq \sqrt{(ab+cd)^2+(ac-bd)^2}=\sqrt{(a^2+c^2)(d^2+b^2)},\]

which is important because with the definitions of \(a,b,c,d\) above, \(a^2+c^2=\|x\|^2\) and similarly for \(y\).

The author shows this with an appeal to algebra (as shown above), but you will notice a simpler way; this bound is just CS but applied to the vectors \(u=(a,c)\) and \(v=(b,d)\). \(\square\)

Going beyond \(P\)

What if we have more than one subspace? Suppose \(V, U\subseteq H\) with corresponding projections \(P,Q\), then \(x=Px+P^{\perp}x\) and similarly for \(y\), which gives

\[\begin{align} |(x,y)|&\leq |(Px,Qy)|+|(Px, Q^{\perp}y)|+|(P^{\perp}x,Qy)|+|(P^{\perp}x, Q^{\perp}y)|\\ &\leq (\|Px\|+\|P^{\perp}x\|)(\|Qy\|+\|Q^{\perp}y\|), \end{align}\]

Again, by bilinearity, triangle ineq., and CS.

The last bound is not as good; setting \(P=Q\) there does not recover the previous results (note \(\|x\|\leq \|Px\|+\|P^{\perp}x\|\)). That’s because some terms would be cancelled from the first bound but are not cancelled from the second bound. For example, \((Px, P^{\perp}y)=0\) regardless of \(x,y\), and so on.

References

Scarlatti, S., 2024. Enhanced Cauchy Schwarz inequality and some of its statistical applications. arXiv preprint arXiv:2403.13964.

A brief analysis of automerger data, feat. SLERP and DARE-TIES LLM merging

2024-04-05T00:00:00+00:00

An article I wrote as a community blog post at HuggingFace.

Comparing what different operations do to the H_0 homology of a point cloud.

2024-02-16T00:00:00+00:00

I’ve recently become interested in persistent homology and using its statistics to understand how different operations change the shape of data manifolds.

The math are really interesting (and are for another post), but I like to have some visuals that I can share with interested parties when I do presentations, so I created a notebook for this.

The theoretical minimum

The densities below are kernel-density-estimates (aka., probability densities) of the “death times” of the \(H_0\) homology for the \(X\) point cloud (the one created with make_classification below). But what does “death time” mean here?

Persistent homology (PH) is all about understanding the aspects of the shape of a manifold from sampled points (aka, a point cloud). In this note, we are looking at only one attribute that is captured by PH, the connected components of the manifold. PH looks at the point cloud at various scales, from the scale of the individual points to the scale of the entire dataset. As PH works through the different scales, it identifies when connected components get created (birth) and when they merge (death, for some of them). As every point on its own initially constitutes a connected component, the birth times are all equal to zero.

The density plots below are tracking the death times and how those change as we manipulate the point cloud. For each case, I’m showing both the death times as they are (Normalised=False) and what happens if we normalise by the maximum finite persistence time. Normalising them makes the death times invariant to point cloud scaling (as you will see below).

The plots

N_FEAT = 50
X, _ = datasets.make_classification(
    n_classes=2,
    n_samples=100,
    n_features=N_FEAT,
    n_redundant=0,
    n_informative=N_FEAT,
    random_state= 0,
    n_clusters_per_class=4,
    class_sep=1.0
)

# change X to have mean 2 and std 3
X = 2 + 3 * X

# contraction mapping
Xcontr = X / 2

# expansion mapping
Xexp = X * 2

Things are as expected up to this point. A few more interesting operations follow.

# generate a random affine contraction matrix A
A = np.random.rand(N_FEAT, N_FEAT)
A = A  / (np.linalg.norm(A)+1e-10)
b = np.random.rand(N_FEAT)

Xaff = X @ A + b

# map to a lower dimension
Xlow = X[:, :2]

# map to lower dimensions with a random affine map
A = np.random.rand(N_FEAT, 10)
b = np.random.rand(10)


Xaff = X @ A + b

# same affine transformation but with a relu function applied to the output

Xaff_relu = np.maximum(0, X @ A + b)

# two layer neural network with relu activation
N_OUTPUT = 10

A1 = np.random.rand(N_FEAT, 20)
b1 = np.random.rand(20)
A2 = np.random.rand(20, N_OUTPUT)
b2 = np.random.rand(N_OUTPUT)

Xnn = X @ A1 + b1
Xnn = np.maximum(0, Xnn)
Xnn = Xnn @ A2 + b2

# layernorm

Xlayernorm = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

SetFit and Integrated Gradients

2023-07-30T00:00:00+00:00

I’m a fan of both SetFit and integrated gradients, so I wrote this tiny library to combine them. The code is available here under MIT license for further hacking by others. I’m fixing bugs, but otherwise not actively maintaining it.

What is SetFit ?

It’s a language-model-tuning paradigm for few-shot learning with language models without using prompts. It relies on contrastive learning and the authors published a really nice library that makes the method plug & play with sentence-transformers.

I have been contributing some time to that library on Github as well.

What are integrated gradients ?

Integrated gradients (IG) are a method for attributing the output of a neural network to its inputs. It was first developed to explain the output of image classifiers, but it can be used for any model that takes a vector as input.

It occured to me by building this that there are various places one could perturb to apply IG and that the perturbation path probably also matters a lot.

Control variates and hypothesis testing

2023-01-01T00:00:00+00:00

I haven’t looked at this in a while, so thought I would revise a bit.

To start with, I have a variable, say, \(Y\sim P\). Given data from this variable, \(y_i\), we can estimate the mean \(E[Y]\approx S_N=\frac{1}{N}\sum_{i=1}^{N}y_i.\) The estimator we all know and use. It’s unbiased, but it may have large variance, which means that for a fixed N, most random sums could fall away from \(E[Y]\). What can we do to improve on this?

One idea is to introduce a new variable, \(X\). Then, we adjust the data as

\[y'_i=y_i+a(x_i-E[x_i]),\]

where \(a\) is some parameter that we can pick later. What’s the advantage of this? First, this adjustment doesn’t change the mean: \(E[Y']=E[Y]+a(E[X]-E[X])=E[Y].\) So an estimator based on \(Y'\) is unbiased. Not bad.

What is the variance of this new estimator?

\[\mathrm{Var}[Y'] = \mathrm{Var}[Y] + a^2\mathrm{Var}[X] + 2\mathrm{Cov}[Y,aX].\]

Now the first two terms are non-negative. However, the last one can be negative as long as \(X\) and \(Y\) have negative correlation. For instance, picking \(X=-Y\),

\[\mathrm{Var}[Y'] = \mathrm{Var}[Y] + \mathrm{Var}[Y] - 2\mathrm{Cov}[Y,Y] = 0,\]

which is the best possible variance. Realistically, we are back to where we started if we set \(X=-Y\), however if we have other variables that are close to \(-Y\), then this idea can get quite useful (and dramatically reduce variance).

In hypothesis testing

Why would all this be useful for hypothesis testing?

To conduct it, we first split the population to two groups, traditionally called the “treatment” and “control”. We have intervened on the treatment group in some way, say via a variable \(Z\), and we wish to understand whether the magnitude of \(E[Y_T-Y_C]\) (this is also called the average treatment effect (ATE)) is due to \(Z\) or random chance.

Suppose that the difference is indeed due to \(Z\). If we use data to estimate \(E[Y_{T}-Y_{C}]\), the estimator can have variance and that will affect how small an effect we can confidently separate from random chance.

We are now thinking about the “power “ of a hypothesis test, and there are at least three ways to improve our situation from here:

Give up on the small effect size and go for a larger one.
Get more \(Y_T, Y_C\) data for the estimator, i.e., make the groups larger.
Use other covariates, \(X\), to reduce the variance of the ATE estimator. Only covariates that are independent of the way the groups are split can be used, for example pre-experiment data.

Now we can discuss two ways to increase the power of the test by using 3.

CUPED and CUPAC

You can find implementations of CUPED and CUPAC in this notebook.

CUPED stands for “Controlled-Experiment using Pre-experiment data”; see Deng, Xu, Kohavi, Walker, 2013.

At its core, CUPED is a proposal for how to use pre-experiment data with control variates to reduce the variance of the ATE estimator. The authors propose using any covariates that we have before the experiment took place, \(X_i\), \(i=1,\ldots, n\), as well as past values of \(Y\) to fit a linear model: \(\hat{Y}=X\beta +\epsilon\)

Then, we can use the linear model to get predictions for the \(Y\) variables in the control and experiment groups and adjust the true values as \(Y'_T=Y_{T}-(\hat{Y}_T-E[\hat{Y}_T]),\) and similarly for \(Y_C\). Instead of the original ATE, we will then construct an estimator for \(E[Y'_T-Y'_C]\) which, because of our control variates method, will have smaller variance!

CUPAC, aka., Control Using Predictions As Covariates, introduced by DoorDash’s engineering, Li, Tang, and Bauman, takes this one step further: there’s nothing special about using a linear model. One can use a more expressive ML model, get a closer fit to \(Y\), and reduce variance further.

Stein and the Normal Distribution

2021-01-22T00:00:00+00:00

Hello and happy 2021!

Inspired from this tweet, I wanted to understand the basics of Stein’s characterization of the Normal distribution.

With Stein’s idea, we can identify the distribution of a random variable by checking that it satisfies some condition in expectation. For example, for the standard normal, we have that \(X\sim N(0,1)\) if and only if for all \(f\) with \(E[f']<\infty\) we have:

\[\mathbb{E}[xf(x)-f'(x)]=0.\]

The operator \(Af:=xf-f'\) is then called the Stein operator and we can rewrite the result with this operator as: \(\mathbb{E}_{P}[Af]=0\) for all \(f\in C^1_b\) iff \(P\) is \(N(0,1)\).

This operator is not unique as we can always add P-measure zero parts; see \(Bf:=xf-f'+x\) which satisfies

\[\mathbb{E}[Bf]=\mathbb{E}[Af]+\mathbb{E}[x]=\mathbb{E}[Af].\]

How can we show \(A\) characterises the standard Normal? A key identity is that the probability density function (PDF) of the \(N(0,1)\) satisfies \(P'+xP=0\). So, if \(P\) is indeed the PDF of the standard normal, then applying integration by parts to \(E_{P}[f']\) and using the differential equation gives us Stein’s formula.

Now, if \(P\) is the PDF of any other distribution and it satisfies Stein’s formula for all \(f\in C^1_b\), then integration by parts on \(E_{P}[f']\) leads us back to the \(P'+xP=0\). That ODE is separable with solution \(P\propto \exp(-x^2/2)\), which, up to the normalisation, is the PDF of the standard normal!

This strategy of deriving an ODE for the density function, getting its weak form by multiplying with a smooth function \(f\) and integrating can be repeated to get Stein operators for other distributions, e.g., the exponential, etc.

A computable lower bound for the KL from Hammersley-Chapman-Robbins inequality

2020-12-28T00:00:00+00:00

I first read of this bound from:

Nishiyama, T., 2019. A New Lower Bound for Kullback-Leibler Divergence Based on Hammersley-Chapman-Robbins Bound. arXiv:1907.00288 [cs, math, stat].

The following notes are not very polished, but present the general idea. I hope they are useful.

The strategy of the paper is quite nice; first, Nishiyama shows that if we have two distributions \(P,Q\) and define the mixture \(R_t(x):=P(x)+t(Q(x)-P(x)), t\in [0,1]\), then:

\[\frac{d}{dt}D_a(P|R_t)=\frac{1-a}{t}D_a(P|R_t)+\frac{1+a}{t}D_{a+1}(P|R_t),\]

for any \(a\) and \(D_a\) being the alpha-divergence. Setting \(a=1\) leaves us with the KL divergence and \(\chi^2\), which recovers the nice identity:

\[\frac{d}{dt}KL(P|R_t)=\frac{1}{t}\chi^2(P|R_t).\]

Now, fix a function \(f\) for which \(E_Q, E_P, V_P, V_{R_t}\) (expectations and variances) are finite and \(V_{R_t}>0\) for all \(t\in [0,1]\). Then, applying the HCR inequality gives:

\[\frac{d}{dt}KL(P|R_t)\geq \frac{(E_{R_t}-E_P)^2}{tV_{R_t}}.\]

Integrating the above in \([0,1]\) gives the result of the paper as \(\int_0^1 KL(P|R_t)dt=KL(P|Q).\)

We can also show that

\[KL(P|Q)=\int_0^1 \frac{\chi^2(P|R_t)}{t}dt=\int_0^1 t\int_{\Omega}\frac{(Q-P)^2}{P+t(Q-P)}dxdt.\]

Assuming everything exists, we can exchange the integrals (ala. Fubini) and then expand the \(t\) function around \(t=1\) to introduce the chi-square divergence:

\[KL(P|Q)=\chi^2(P|Q)+...\]

Actual lower bound

The lower bound is:

\[KL(P|Q)\geq \int_0^1\frac{(E_{R_t}-E_P)^2}{tV_{R_t}}dt,\]

where the integral depends on \(E_P, E_Q, V_P, V_Q\) and can be computed analytically and written as a sum of logarithms (by using partial fractions).

Tighter lower-bounds

Starting from the equality:

\[KL(P|Q)=\int_0^1 \frac{\chi^2(P|R_t)}{t}dt\]

we can derive tighter lower bounds for the KL. First, we write the variational representation of \(\chi^2\)

\[\chi^2(P|R_t)=\sup_{h}\left \{ 2E_P[h]-E_Q[h^2]-1\right \}.\]

Suppressing the family of functions leads to lower bounds of chi-square and thus to lower bounds of the KL. For example, the HCR bound can be derived by considering first degree polynomials.

Rediscovering a function from samples

2020-12-28T00:00:00+00:00

A friend shared this nice problem with me. Suppose you have a fixed function, \(f\), and a family of probability distributions, defined by, say, prob. density functions, \(P_t\), \(t \in A\). If we know \(E_t=E_{P_t}[f]\) for every \(t\in A\), can we recover \(f\)?

Clearly the answer depends on both how rigid \(f\) is and the family \(P_t\). We can cast this problem as a functional analysis problem by defining \(k(x,t)=P_t(x)\) to be a kernel and the expectation to be an integral transform. Then the question becomes: is there an inverse kernel, say, \(k^{-1}(x,t)\), such that \(\int k^{-1}(x,t)E(t)dt=f(x)?\) When does that exist and when is it unique? Hints can be taken from the Laplace transform, i.e., \(P_t(x)\propto e^{-tx}\) - up to a normalizing constant this is the just the exponential distribution. In general, this can be a hard problem though.

Fredholm equations

If we know \(E(t)=E_{P_t}[f]\) for every \(t\in A\), can we recover \(f\)? Formally, we have the equation: \(E(t)=\int k(x,t)f(x)dx,\) with appropriate limits for the integral. This equation is called a “Fredholm Equation of the first kind” and is closely studied in functional analysis and signal processing.

Practical stuff

If we assume the existence of an inverse kernel, how can we approximate it? One idea — which is also kind of a standard approach — is to fix a set of orthonormal basis functions, describe everything in terms of them, and then resolve them to arrive to a linear algebra problem.

Rewriting the Kullback-Leibler with an integral transform

2020-01-20T00:00:00+00:00

I recently read this nice integral representation of the logarithm in 1912.05812v1 by Neri Merhan and Igar Sason. Most ideas in this post are from there. The transform is: \(\log(x)=\int_{0}^{\infty}\frac{e^{-t}-e^{-tx}}{t}dt.\)

This can be shown by using \(\log(x)=\int_{1}^{x}1/t\cdot dt\) along with \(\frac{1}{x}=\int_{0}^{\infty}e^{-xu}du\). It all becomes more interesting when we take an expectation though:

\[\mathbb{E}[\log(X)]=\int_{0}^{\infty}\frac{e^{-t}-\mathbb{E}[e^{-tX}]}{t}dt\]

This allows us to express the expectation of the logarithm in terms of the moment generating function of \(X\). For instance, if \(X\) is normal, then such a representation will probably be simpler. It is not obvious that it will help with computation, but it does suggest that we can use stuff like concentration bounds for expectations of logarithms.

It’s fun to apply the same idea to the KL.

\[R(Q|P)=\int Q(x) \log \frac{Q(x)}{P(x)}dx.\]

For simplicity, let \(w(x):=\log\frac{Q(x)}{P(x)}\). Then, from (2):

\[\mathbb{E}_Q[\log w]=\int_{0}^{\infty}\frac{e^{-t}-\mathbb{E}_Q[e^{-tw}]}{t}dt.\]

The last equation is a different way to express the KL divergence. It’s not particularly useful as is (to my eyes), as the MGF of \(w\) is a tough cookie to compute. However, with a lower bound to the MGF we could get an upper bound to the KL that is not trivial.

The simplest Bayesian optimization example

2019-11-10T00:00:00+00:00

After a really interesting paper discussion session, I decided to implement Bayesian-opt. with the expected-improvement acquisition function. I’ve already fixed a few bugs, but can’t promise it is bug-free. It will probably work as long as you stick to my examples.