Parts of these notes are based on the new book “Deep Learning” by Bishop & Bishop^{1}, which I recommend. The notes also include some comments from me (and, of course, all mistakes / typos are my own).
Let’s get started. We have a matrix \(X\in\mathbb{R}^{n\times m}\) which represents a sequence of \(n\) \(m\)-dimensional vectors. For the language modeling example, each one of those could be the embedding of a single token.
Let’s suppose that we are in the business of modeling, so we would like to map \(X\) to a \(Y\in \mathbb{R}^{n\times m}\) such that each \(m\)-dimensional vector in \(Y\) contains information from all the \(X_j\) (\(X_j\) being the \(j\)-th row of \(X\), i.e., the \(j\)-th vector in the sequence). Perhaps the simplest way is \(Y_i=\sum_jA_{ij}X_j,\) where we assume that \(A_{ij}\in [0,1]\) for all \(i,j\). Restricting \(A_{ij}\) such that \(\sum_{j}A_{ij}=1\) for all \(i\) has some nice properties: we can now think of \(Y_i\) as a weighted mean of the \(X_j\), and we just get to decide how much of each \(X_j\) to use.
Following this recipe further, we can pick \(A_{ij}\) according to how relevant each \(X_j\) is to every other. One way to capture relevance is through similarity, leading to “dot-product self-attention”^{2}:
\(A_{ij}=\text{softmax}(XX^T)_{ij}=\frac{\exp(X_i^TX_j)}{\sum_{k}\exp(X_i^TX_k)}.\)
As \(X\in\mathbb{R}^{n\times m}\), \(XX^T\) has dimensions \(n \times n\), i.e., it grows quadratically with the sequence length.
To get \(Y\), we can just do \(Y=\text{softmax}(XX^T)X.\)
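As a sanity check, the recipe so far fits in a few lines of numpy (a sketch for intuition, not a library implementation):

```python
import numpy as np

def self_attention(X):
    """Plain dot-product self-attention: Y = softmax(X X^T) X.

    X has shape (n, m): a sequence of n m-dimensional vectors.
    The softmax is applied row-wise, so each row of A sums to 1
    and each Y_i is a weighted mean of the rows of X.
    """
    S = X @ X.T                           # (n, n) similarity scores
    S = S - S.max(axis=1, keepdims=True)  # stabilise the exponentials
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)  # rows sum to 1
    return A @ X                          # (n, m), same shape as X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Y = self_attention(X)
print(Y.shape)  # (5, 3)
```

Note that if all rows of \(X\) are identical, the attention weights are uniform and \(Y=X\), as you would expect from a weighted mean.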
This is fine, but: (a) there are no learnable parameters anywhere, so nothing adapts to the data, and (b) the same vectors \(X_j\) are used both to compute the attention weights and to form the output.
We can address these points by introducing a new matrix, \(U\in\mathbb{R}^{m\times D}\), with learnable parameters such that \(\tilde{X} = XU\), and so
\[Y=\text{softmax}(\tilde{X}\tilde{X}^T)\tilde{X}=\text{softmax}(XUU^TX^T)XU.\]Progress, but now \(\tilde{X}\tilde{X}^T\) is always a symmetric matrix regardless of \(U\), so we cannot capture asymmetric relationships. This motivates using different parameters for the parts of the attention matrix and the final mapping:
\[\begin{align} Q&=XW_Q,\ W_Q\in\mathbb{R}^{m\times D_K},\\ K&=XW_K,\ W_K\in\mathbb{R}^{m\times D_K},\\ V&=XW_V,\ W_V\in\mathbb{R}^{m\times D_V}. \end{align}\]Those are the celebrated query, key, and value matrices^{3}, respectively, all learnable. Typically, setting \(D=D_K=D_V\) makes it easier to work things out. With those matrices, we adjust attention as \(Y=\text{softmax}(QK^T)V.\) Quick dimensionality check: \(QK^T\) is \((n\times D_K)(D_K\times n)=n\times n\), and multiplying by \(V\) gives \((n\times n)(n\times D_V)=n\times D_V\), so we get one \(D_V\)-dimensional output vector per sequence position.
While \(Y=\text{softmax}(QK^T)V\) is very close to the usual self-attention, we are missing a scaling constant. Let’s derive this here.
Suppose that you have two \(D_K\)-dimensional vectors \(q,k\), each with elements that have zero mean and unit variance and are all independent. Then: \(\text{Var}[(q,k)]=\sum_{i=1}^{D_K} \text{Var}[q_ik_i]=\sum_{i=1}^{D_K}1=D_K.\) We used independence to split \(\text{Var}[(q,k)]\) into a sum. Therefore, the standard deviation of \((q,k)\) is \(\sqrt{D_K}\), and dividing by it is exactly what we need to give the entries of \(QK^{T}\) unit variance. This keeps the products from growing with \(D_K\), which makes learning easier. So, finally, \(Y=\text{softmax}(QK^T/\sqrt{D_K})V,\) which is the usual form of dot-product attention.
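A quick Monte Carlo check of this variance calculation (a sketch; the 200k trials and \(D_K=64\) are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
D_K = 64
n_trials = 200_000

# q, k with independent, zero-mean, unit-variance entries
q = rng.normal(size=(n_trials, D_K))
k = rng.normal(size=(n_trials, D_K))
dots = np.sum(q * k, axis=1)            # the dot product (q, k), one per trial

print(np.var(dots))                     # close to D_K = 64
print(np.var(dots / np.sqrt(D_K)))      # close to 1 after the scaling
```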
Zero mean and unit variance are a matter of pre-processing, but independence is not. In fact, you would hope a sequence does not have independent elements, as otherwise there is no information to use to predict the next element. I now see this scaling as a way to control the size of the terms and help with learning, but it’s important to remember this point, as it is not always explicitly stated^{4}.
We first need to form \(Q\), \(K\), and \(V\), which costs \(O(nD^2)\) if we assume \(D=D_V=D_K\approx m\) and a sequence of length \(n\). Then, the matrix product \(QK^{T}\) costs \(O(n^2D)\), and so does multiplying the result by \(V\) (I’m ignoring the application of the softmax here and the scaling).
After this part, when dealing with a transformer block, we have an MLP layer that takes as input each output from the attention layer (\(n\) of them in total). This layer has cost \(O(nD^2)\), as each of the \(n\) vectors is processed independently. Therefore, the total cost is \(\max\{O(nD^2), O(n^2D)\}\).
\(D\) is fixed at the time the transformer is designed, whereas \(n\) is the length of the input sequence, so you can see which of the two is going to be a challenge during inference with large inputs.
Bishop, C.M. and Bishop, H., 2023. Deep learning: Foundations and concepts. Springer Nature. ↩
There are so many variants of attention now: grouped attention, linearised attention, etc. ↩
Query / Key / Value is a retrieval metaphor; see, for example, the discussion on Cross Validated. ↩
Though the authors of “Attention is all you need” do mention this assumption in the celebrated “footnote 4”. ↩
In this paper, the author proves an intermediate bound that sits between the two sides of the classical Cauchy–Schwarz (CS) inequality.
First things first, for all \(x,y\) in some Hilbert space \(H\) with inner product \((.,.)\) we have:
\[|(x,y)|\leq \|x\|\|y\|.\]The author shares a nice proof of CS that I’m not sure I’ve seen before (but it looks neat). We define the matrix \(C=C(x,y)\) with \(c_{ij}=\frac{1}{\sqrt{2}}(x_iy_j-x_jy_i),\ i,j=1,\ldots,n\). Then, its second norm is
\[\|C\|_2=\sqrt{\sum_{ij}c_{ij}^2}.\]If we substitute the definition of \(c_{ij}\) into the above and carry out the algebra, we get \(\|C\|^2_2=\|x\|^2\|y\|^2-(x,y)^2,\) which proves CS as the norm is non-negative.
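The identity \(\|C\|_2^2=\|x\|^2\|y\|^2-(x,y)^2\) is easy to verify numerically (a quick numpy check; the dimension is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 7
x, y = rng.normal(size=n), rng.normal(size=n)

# C with c_ij = (x_i y_j - x_j y_i) / sqrt(2)
C = (np.outer(x, y) - np.outer(y, x)) / np.sqrt(2)

lhs = np.sum(C**2)                                   # ||C||_2^2
rhs = np.dot(x, x) * np.dot(y, y) - np.dot(x, y)**2  # ||x||^2 ||y||^2 - (x, y)^2
print(abs(lhs - rhs))  # ~ 0, up to floating point
```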
Now, suppose \(V\subseteq H\) is some closed subspace of \(H\). Then if \(P\) is the orthogonal projection onto \(V\) (i.e., \(PH=V\)), the author defines:
\[D(x,y|P):=\|Px\|\cdot \|Py\|+\|P^{\perp}x\|\cdot \|P^{\perp}y\|,\]for all \(x,y\in H\) and where \(P^{\perp}\) is the projection on the orthogonal complement of \(V\). Then, \(|(x,y)|\leq D(x,y|P)\leq \|x\|\|y\|.\)
I like inequalities with free parameters! We can pick \(P\) depending on the problem and the bounds adapt correspondingly. If \(P\) is the trivial projection onto all of \(H\) (or onto \(\{0\}\)), we recover the usual CS.
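The sandwich \(|(x,y)|\leq D(x,y|P)\leq \|x\|\|y\|\) can be checked numerically. In the sketch below I build the orthogonal projection onto a random \(k\)-dimensional subspace via a QR decomposition (the dimensions are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 4
x, y = rng.normal(size=n), rng.normal(size=n)

# Orthogonal projection P onto a random k-dimensional subspace V
U, _ = np.linalg.qr(rng.normal(size=(n, k)))  # orthonormal basis of V
P = U @ U.T
Px, Py = P @ x, P @ y
Ppx, Ppy = x - Px, y - Py                     # projections onto the complement

# D(x, y | P) = ||Px|| ||Py|| + ||P^perp x|| ||P^perp y||
D = (np.linalg.norm(Px) * np.linalg.norm(Py)
     + np.linalg.norm(Ppx) * np.linalg.norm(Ppy))

print(abs(x @ y) <= D <= np.linalg.norm(x) * np.linalg.norm(y))  # True
```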
Proof: It’s a short argument.
\[|(x,y)|=|(Px,Py)+(P^{\perp}x, P^{\perp}y)|\leq |(Px,Py)|+|(P^{\perp}x, P^{\perp}y)|\leq \|Px\|\|Py\|+\|P^{\perp}x\|\|P^{\perp}y\|,\]where we used bilinearity of inner product, triangle inequality, and CS inequality for each term. Now, if \(a=\|Px\|, b=\|Py\|, c=\|P^{\perp}x\|, d=\|P^{\perp}y\|\), the author shows that
\[ab+cd=\sqrt{(ab+cd)^2}\leq \sqrt{(ab+cd)^2+(ad-bc)^2}=\sqrt{(a^2+c^2)(b^2+d^2)},\]which is important because, with the definitions of \(a,b,c,d\) above, \(a^2+c^2=\|x\|^2\) and similarly \(b^2+d^2=\|y\|^2\).
The author shows this with an appeal to algebra (as shown above), but you will notice a simpler way; this bound is just CS but applied to the vectors \(u=(a,c)\) and \(v=(b,d)\). \(\square\)
What if we have more than one subspace? Suppose \(V, U\subseteq H\) with corresponding projections \(P,Q\), then \(x=Px+P^{\perp}x\) and similarly for \(y\), which gives
\[\begin{align} |(x,y)|&\leq |(Px,Qy)|+|(Px, Q^{\perp}y)|+|(P^{\perp}x,Qy)|+|(P^{\perp}x, Q^{\perp}y)|\\ &\leq (\|Px\|+\|P^{\perp}x\|)(\|Qy\|+\|Q^{\perp}y\|), \end{align}\]again by bilinearity, the triangle inequality, and CS applied to each term.
The last bound is not as good: setting \(P=Q\) does not recover the previous result (note that \(\|x\|\leq \|Px\|+\|P^{\perp}x\|\)). That’s because some terms that cancel in the first bound do not cancel in the second; for example, \((Px, P^{\perp}y)=0\) regardless of \(x,y\), and so on.
The maths is really interesting (and is for another post), but I like to have some visuals that I can share with interested parties when I do presentations, so I created a notebook for this.
The densities below are kernel-density estimates (i.e., estimated probability densities) of the “death times” of the \(H_0\) homology for the \(X\) point cloud (the one created with make_classification below). But what does “death time” mean here?
Persistent homology (PH) is all about understanding the aspects of the shape of a manifold from sampled points (aka, a point cloud). In this note, we are looking at only one attribute that is captured by PH, the connected components of the manifold. PH looks at the point cloud at various scales, from the scale of the individual points to the scale of the entire dataset. As PH works through the different scales, it identifies when connected components get created (birth) and when they merge (death, for some of them). As every point on its own initially constitutes a connected component, the birth times are all equal to zero.
The density plots below track the death times and how they change as we manipulate the point cloud. For each case, I’m showing both the death times as they are (Normalised=False) and what happens if we normalise by the maximum finite persistence time. Normalising makes the death times invariant to point cloud scaling (as you will see below).
import numpy as np
from sklearn import datasets

N_FEAT = 50
X, _ = datasets.make_classification(
n_classes=2,
n_samples=100,
n_features=N_FEAT,
n_redundant=0,
n_informative=N_FEAT,
    random_state=0,
n_clusters_per_class=4,
class_sep=1.0
)
# change X to have mean 2 and std 3
X = 2 + 3 * X
# contraction mapping
Xcontr = X / 2
# expansion mapping
Xexp = X * 2
Things are as expected up to this point. A few more interesting operations follow.
# generate a random affine contraction matrix A
A = np.random.rand(N_FEAT, N_FEAT)
A = A / (np.linalg.norm(A)+1e-10)
b = np.random.rand(N_FEAT)
Xaff = X @ A + b
# map to a lower dimension
Xlow = X[:, :2]
# map to lower dimensions with a random affine map
A = np.random.rand(N_FEAT, 10)
b = np.random.rand(10)
Xaff_low = X @ A + b
# same affine transformation but with a relu function applied to the output
Xaff_relu = np.maximum(0, X @ A + b)
# two layer neural network with relu activation
N_OUTPUT = 10
A1 = np.random.rand(N_FEAT, 20)
b1 = np.random.rand(20)
A2 = np.random.rand(20, N_OUTPUT)
b2 = np.random.rand(N_OUTPUT)
Xnn = X @ A1 + b1
Xnn = np.maximum(0, Xnn)
Xnn = Xnn @ A2 + b2
# layernorm
Xlayernorm = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
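For completeness, here is one way the \(H_0\) death times themselves can be computed. The notebook uses a persistent homology library, but for \(H_0\) specifically there is a handy shortcut: in a Vietoris–Rips filtration, the finite \(H_0\) death times coincide with the merge heights of single-linkage clustering (equivalently, the minimum-spanning-tree edge lengths), so scipy suffices. Treat this as a sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def h0_death_times(X, normalised=False):
    """Finite H0 death times of the Vietoris-Rips filtration on X.

    These coincide with the merge distances of single-linkage
    clustering, i.e., the minimum-spanning-tree edge lengths.
    """
    deaths = linkage(X, method="single")[:, 2]  # merge distances
    if normalised:
        deaths = deaths / deaths.max()
    return deaths

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
d = h0_death_times(X)             # 99 merges for 100 points
d_scaled = h0_death_times(2 * X)

print(np.allclose(d_scaled, 2 * d))  # scaling the cloud scales the death times
print(np.allclose(h0_death_times(X, normalised=True),
                  h0_death_times(2 * X, normalised=True)))  # invariant once normalised
```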
It’s a language-model fine-tuning paradigm for few-shot learning without using prompts. It relies on contrastive learning, and the authors published a really nice library that makes the method plug & play with sentence-transformers.
I have been contributing some time to that library on GitHub as well.
Integrated gradients (IG) is a method for attributing the output of a neural network to its inputs. It was first developed to explain the outputs of image classifiers, but it can be used for any model that takes a vector as input.
It occurred to me while building this that there are various places one could perturb to apply IG, and that the perturbation path probably also matters a lot.
To start with, I have a variable, say, \(Y\sim P\). Given data from this variable, \(y_i\), we can estimate the mean
\[E[Y]\approx S_N=\frac{1}{N}\sum_{i=1}^{N}y_i.\]This is the estimator we all know and use. It’s unbiased, but it may have large variance, which means that, for a fixed \(N\), most random sums could fall far from \(E[Y]\). What can we do to improve on this?
One idea is to introduce a new variable, \(X\). Then, we adjust the data as
\[y'_i=y_i+a(x_i-E[x_i]),\]where \(a\) is some parameter that we can pick later. What’s the advantage of this? First, this adjustment doesn’t change the mean: \(E[Y']=E[Y]+a(E[X]-E[X])=E[Y].\) So an estimator based on \(Y'\) is unbiased. Not bad.
What is the variance of this new estimator?
\[\mathrm{Var}[Y'] = \mathrm{Var}[Y] + a^2\mathrm{Var}[X] + 2a\,\mathrm{Cov}[Y,X].\]The first two terms are non-negative, but the last one can be negative as long as \(aX\) and \(Y\) are negatively correlated. For instance, picking \(a=1\) and \(X=-Y\),
\[\mathrm{Var}[Y'] = \mathrm{Var}[Y] + \mathrm{Var}[Y] - 2\mathrm{Cov}[Y,Y] = 0,\]which is the best possible variance. Realistically, we are back where we started if we set \(X=-Y\) (it requires knowing \(E[Y]\), the very thing we are estimating); however, if we have other variables that are strongly correlated with \(Y\) and whose means we know, this idea can get quite useful (and dramatically reduce variance).
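Here’s a small simulation of the idea (a toy example of my own: \(Y=e^U\) with \(U\sim\mathrm{Unif}(0,1)\) and control variate \(X=U\), whose mean \(E[X]=1/2\) we know exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Estimate E[Y] for Y = exp(U), U ~ Uniform(0, 1); true value is e - 1.
u = rng.uniform(size=N)
y = np.exp(u)

# Control variate X = U, with known mean 1/2, strongly correlated with Y.
x = u
a = -np.cov(y, x)[0, 1] / np.var(x)  # variance-minimising choice of a
y_adj = y + a * (x - 0.5)            # y'_i = y_i + a (x_i - E[X])

print(y.mean(), y_adj.mean())        # both ~ e - 1 = 1.718...
print(np.var(y), np.var(y_adj))      # variance drops sharply
```

The variance-minimising choice \(a=-\mathrm{Cov}[Y,X]/\mathrm{Var}[X]\) follows from minimising the quadratic in \(a\) above.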
Why would all this be useful for hypothesis testing?
To conduct one, we first split the population into two groups, traditionally called “treatment” and “control”. We intervene on the treatment group in some way, say via a variable \(Z\), and we wish to understand whether the magnitude of \(E[Y_T-Y_C]\) (also called the average treatment effect, or ATE) is due to \(Z\) or to random chance.
Suppose that the difference is indeed due to \(Z\). If we use data to estimate \(E[Y_{T}-Y_{C}]\), the estimator will have variance, and that affects how small an effect we can confidently separate from random chance.
We are now thinking about the “power” of a hypothesis test, and there are at least three ways to improve our situation from here: (1) collect more samples, (2) increase the size of the effect, or (3) reduce the variance of the estimator.
Now we can discuss two ways to increase the power of the test by using option 3, i.e., variance reduction.
You can find implementations of CUPED and CUPAC in this notebook.
CUPED stands for “Controlled-Experiment using Pre-experiment data”; see Deng, Xu, Kohavi, Walker, 2013.
At its core, CUPED is a proposal for how to use pre-experiment data with control variates to reduce the variance of the ATE estimator. The authors propose using any covariates \(X_i\), \(i=1,\ldots, n\), that we have from before the experiment took place, as well as past values of \(Y\), to fit a linear model \(Y=X\beta +\epsilon\), with predictions \(\hat{Y}=X\hat{\beta}\).
Then, we can use the linear model to get predictions for the \(Y\) variables in the control and treatment groups and adjust the true values as \(Y'_T=Y_{T}-(\hat{Y}_T-E[\hat{Y}_T]),\) and similarly for \(Y_C\). Instead of the original ATE, we then construct an estimator for \(E[Y'_T-Y'_C]\), which, because of the control variates method, has smaller variance!
CUPAC, i.e., “Control Using Predictions As Covariates”, introduced by DoorDash engineering (Li, Tang, and Bauman), takes this one step further: there’s nothing special about using a linear model. One can use a more expressive ML model, get a closer fit to \(Y\), and reduce the variance further.
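A minimal sketch of the CUPED adjustment on simulated data (the covariate, noise level, and effect size below are my own toy choices, not the paper’s setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000

# Hypothetical experiment: a pre-experiment metric x predicts the outcome y,
# and the treatment adds a small effect tau on top.
tau = 0.1
x = rng.normal(size=2 * n)                  # pre-experiment covariate
treat = np.repeat([1, 0], n)                # first half treated, second half control
y = 2.0 * x + tau * treat + rng.normal(scale=0.5, size=2 * n)

# CUPED adjustment: fit y on the pre-experiment covariate with a linear model,
# then subtract the centred prediction (a control variate with known mean).
beta = np.cov(y, x)[0, 1] / np.var(x)
y_adj = y - beta * (x - x.mean())

ate_raw = y[treat == 1].mean() - y[treat == 0].mean()
ate_cuped = y_adj[treat == 1].mean() - y_adj[treat == 0].mean()
print(ate_raw, ate_cuped)                   # both estimate tau = 0.1
print(y.var(), y_adj.var())                 # the adjusted outcome has far lower variance
```

Swapping the linear fit for predictions from any ML model trained on pre-experiment data gives the CUPAC flavour.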
Inspired by this tweet, I wanted to understand the basics of Stein’s characterisation of the Normal distribution.
With Stein’s idea, we can identify the distribution of a random variable by checking that it satisfies some condition in expectation. For example, for the standard normal: \(X\sim N(0,1)\) if and only if, for all \(f\) with \(E[|f'|]<\infty\), we have:
\[\mathbb{E}[xf(x)-f'(x)]=0.\]The operator \(Af:=xf-f'\) is then called the Stein operator and we can rewrite the result with this operator as: \(\mathbb{E}_{P}[Af]=0\) for all \(f\in C^1_b\) iff \(P\) is \(N(0,1)\).
This operator is not unique, as we can always add parts with zero expectation under \(P\); e.g., \(Bf:=xf-f'+x\) satisfies
\[\mathbb{E}[Bf]=\mathbb{E}[Af]+\mathbb{E}[x]=\mathbb{E}[Af].\]How can we show \(A\) characterises the standard Normal? A key identity is that the probability density function (PDF) of the \(N(0,1)\) satisfies \(P'+xP=0\). So, if \(P\) is indeed the PDF of the standard normal, then applying integration by parts to \(E_{P}[f']\) and using the differential equation gives us Stein’s formula.
Now, if \(P\) is the PDF of any other distribution and it satisfies Stein’s formula for all \(f\in C^1_b\), then integration by parts on \(E_{P}[f']\) leads us back to the \(P'+xP=0\). That ODE is separable with solution \(P\propto \exp(-x^2/2)\), which, up to the normalisation, is the PDF of the standard normal!
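We can see the characterisation at work numerically (a sketch with \(f=\tanh\) as the test function; the shifted normal is my own counterexample):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)        # samples from N(0, 1)

f = np.tanh                           # a bounded C^1 test function
fprime = lambda t: 1 - np.tanh(t) ** 2

# E[x f(x) - f'(x)] should vanish for the standard normal...
print(np.mean(x * f(x) - fprime(x)))  # ~ 0

# ...but not for other distributions, e.g. a shifted normal N(1, 1):
z = x + 1.0
print(np.mean(z * f(z) - fprime(z)))  # clearly non-zero
```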
This strategy of deriving an ODE for the density function, getting its weak form by multiplying with a smooth function \(f\) and integrating can be repeated to get Stein operators for other distributions, e.g., the exponential, etc.
Nishiyama, T., 2019. A New Lower Bound for Kullback-Leibler Divergence Based on Hammersley-Chapman-Robbins Bound. arXiv:1907.00288 [cs, math, stat].
The following notes are not very polished, but present the general idea. I hope they are useful.
The strategy of the paper is quite nice; first, Nishiyama shows that if we have two distributions \(P,Q\) and define the mixture \(R_t(x):=P(x)+t(Q(x)-P(x)), t\in [0,1]\), then:
\[\frac{d}{dt}D_a(P|R_t)=\frac{1-a}{t}D_a(P|R_t)+\frac{1+a}{t}D_{a+1}(P|R_t),\]for any \(a\) and \(D_a\) being the alpha-divergence. Setting \(a=1\) leaves us with the KL divergence and \(\chi^2\), which recovers the nice identity:
\[\frac{d}{dt}KL(P|R_t)=\frac{1}{t}\chi^2(P|R_t).\]Now, fix a function \(f\) for which \(E_Q, E_P, V_P, V_{R_t}\) (expectations and variances) are finite and \(V_{R_t}>0\) for all \(t\in [0,1]\). Then, applying the HCR inequality gives:
\[\frac{d}{dt}KL(P|R_t)\geq \frac{(E_{R_t}-E_P)^2}{tV_{R_t}}.\]Integrating the above over \([0,1]\) gives the result of the paper, since \(\int_0^1 \frac{d}{dt}KL(P|R_t)\,dt=KL(P|R_1)=KL(P|Q).\)
We can also show that
\[KL(P|Q)=\int_0^1 \frac{\chi^2(P|R_t)}{t}dt=\int_0^1 t\int_{\Omega}\frac{(Q-P)^2}{P+t(Q-P)}\,dx\,dt.\]Assuming everything exists, we can exchange the integrals (à la Fubini) and then expand the inner function around \(t=1\) to introduce the chi-square divergence:
\[KL(P|Q)=\chi^2(P|Q)+...\]The lower bound is:
\[KL(P|Q)\geq \int_0^1\frac{(E_{R_t}-E_P)^2}{tV_{R_t}}dt,\]where the integral depends on \(E_P, E_Q, V_P, V_Q\) and can be computed analytically and written as a sum of logarithms (by using partial fractions).
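As a concrete check (my own example, not from the paper), take \(P=N(0,1)\) and \(Q=N(1,1)\), where \(KL(P|Q)=1/2\) in closed form. The moments of the mixture \(R_t\) are straightforward, and note that \(E_{R_t}-E_P=t(\mu_Q-\mu_P)\), so one factor of \(t\) cancels in the integrand:

```python
from scipy.integrate import quad

# P = N(0, 1), Q = N(1, 1); KL(P|Q) = (mu_Q - mu_P)^2 / 2 = 0.5
mu_p, mu_q, var_p, var_q = 0.0, 1.0, 1.0, 1.0
kl = (mu_q - mu_p) ** 2 / (2 * var_q)

def integrand(t):
    # (E_{R_t} - E_P)^2 / (t V_{R_t}), with E_{R_t} - E_P = t (mu_q - mu_p)
    mean_rt = (1 - t) * mu_p + t * mu_q
    second = (1 - t) * (var_p + mu_p**2) + t * (var_q + mu_q**2)
    var_rt = second - mean_rt**2
    return t * (mu_q - mu_p) ** 2 / var_rt

bound, _ = quad(integrand, 0, 1)
print(kl, bound)  # bound ~ 0.43 <= KL = 0.5: it holds and is fairly tight
```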
Starting from the equality:
\[KL(P|Q)=\int_0^1 \frac{\chi^2(P|R_t)}{t}dt\]we can derive tighter lower bounds for the KL. First, we write the variational representation of \(\chi^2\)
\[\chi^2(P|R_t)=\sup_{h}\left \{ 2E_P[h]-E_{R_t}[h^2]-1\right \}.\]Restricting the family of functions we optimise over leads to lower bounds on chi-square, and thus to lower bounds on the KL. For example, the HCR bound can be derived by considering first-degree polynomials.
Suppose we observe \(E(t)=E_{P_t}[f]\), the expectation of some unknown function \(f\) under a family of distributions \(P_t\). Can we recover \(f\)? Clearly, the answer depends both on how rigid \(f\) is and on the family \(P_t\). We can cast this as a functional analysis problem by defining \(k(x,t)=P_t(x)\) to be a kernel and the expectation to be an integral transform. Then the question becomes: is there an inverse kernel, say \(k^{-1}(x,t)\), such that
\[\int k^{-1}(x,t)E(t)dt=f(x)?\]When does such an inverse exist, and when is it unique? Hints can be taken from the Laplace transform, i.e., \(P_t(x)\propto e^{-tx}\) (up to a normalizing constant, this is just the exponential distribution). In general, though, this can be a hard problem.
If we know \(E(t)=E_{P_t}[f]\) for every \(t\in A\), can we recover \(f\)? Formally, we have the equation:
\[E(t)=\int k(x,t)f(x)dx,\]with appropriate limits for the integral. This equation is called a “Fredholm Equation of the first kind” and is closely studied in functional analysis and signal processing.
If we assume the existence of an inverse kernel, how can we approximate it? One idea — which is also kind of a standard approach — is to fix a set of orthonormal basis functions, describe everything in terms of them, and then reduce everything to a linear algebra problem.
We start from the identity \(\log(x)=\int_{0}^{\infty}\frac{e^{-t}-e^{-tx}}{t}dt,\) which can be shown by using \(\log(x)=\int_{1}^{x}\frac{1}{t}dt\) along with \(\frac{1}{x}=\int_{0}^{\infty}e^{-xu}du\). It all becomes more interesting when we take an expectation though:
\[\mathbb{E}[\log(X)]=\int_{0}^{\infty}\frac{e^{-t}-\mathbb{E}[e^{-tX}]}{t}dt\]This allows us to express the expectation of the logarithm in terms of the moment generating function of \(X\). For instance, if \(X\) is normal, then such a representation will probably be simpler. It is not obvious that it will help with computation, but it does suggest that we can use stuff like concentration bounds for expectations of logarithms.
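As a sanity check (my own example), take \(X\sim\mathrm{Exp}(1)\): then \(\mathbb{E}[e^{-tX}]=1/(1+t)\) and \(\mathbb{E}[\log X]=-\gamma\) (the Euler–Mascheroni constant), which the representation reproduces:

```python
import numpy as np
from scipy.integrate import quad

# For X ~ Exp(1): E[e^{-tX}] = 1 / (1 + t), and E[log X] = -gamma.
integrand = lambda t: (np.exp(-t) - 1.0 / (1.0 + t)) / t
val, _ = quad(integrand, 0, np.inf)

euler_gamma = 0.5772156649015329
print(val)  # ~ -0.5772, i.e., -gamma
```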
It’s fun to apply the same idea to the KL.
\[R(Q|P)=\int Q(x) \log \frac{Q(x)}{P(x)}dx.\]For simplicity, let \(w(x):=\frac{Q(x)}{P(x)}\), so that \(R(Q|P)=\mathbb{E}_Q[\log w]\). Then, from the representation above:
\[\mathbb{E}_Q[\log w]=\int_{0}^{\infty}\frac{e^{-t}-\mathbb{E}_Q[e^{-tw}]}{t}dt.\]The last equation is a different way to express the KL divergence. It’s not particularly useful as is (to my eyes), as the MGF of \(w\) is a tough cookie to compute. However, with a lower bound on the MGF we could get an upper bound on the KL that is not trivial.