Demystifying Noise Contrastive Estimation

J jxmo.io ↗

▲ 5 points • 0 comments • by jxmorris12 • 1w ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully human-written

0 %

AI likelihood · overall

Human

100% human-written 0% AI-generated

SEGMENTS · HUMAN 7 of 7

SEGMENTS · AI 0 of 7

WORD COUNT 1,752

PEAK AI % 0% · §5

Analyzed

Jun 17

backend: pangram/v3.3

Segments scanned

7 windows

avg 250 words each

Distribution

100 / 0%

human / AI fraction

Verdict

Human

Pangram v3.3

Article text · 1,752 words · 7 segments analyzed

Human AI-generated

§1 Human · 0%

Introduction

This document describes two machine learning methods, Noise Contrastive Estimation (NCE) [5] and its follow-up InfoNCE [15]. We discuss two variants of NCE as well as InfoNCE and a related technique called partition function estimation. NCE-based methods are used for estimating the parameters of a statistical distribution by differentiating between “real data” and “noise”.

We will try to keep our terminology consistent. The original NCE method is sometimes referred to as Local NCE [3,13] or Binary NCE . We’ll call it Local NCE. The follow-up InfoNCE is very similar to an NCE variant called both Global NCE [13] and Ranking NCE [8], which we’ll call Global NCE.

Applications. Local NCE and Global NCE are both computationally inexpensive methods for learning a conditional likelihood $p_\theta(x \mid c)$. NCE is useful in general when the number of possible $x$’s is very large, like in language modeling [6,8], where $|X|$ is the number of words in a vocabulary. InfoNCE is a method for maximizing the mutual information between two variables, which forms the foundation of joint text-to-image learning methods like CLIP [11], where $x$ is an image and $c$ the text of its caption, as well as general contrastive learning-based techniques like SimCLR [1], where $x$ and $c$ are two different ‘views’ of the same image.

Overview. We begin by discussing how NCE works to approximate $p(x \mid c)$. We then consider how we could approximate $p(x \mid c)$ via importance sampling, a method sometimes known as partition function estimation. Then we connect partition function estimation to Global NCE and InfoNCE. We end with discussion of the differences between the methods and suggestions for which to use in practice.

What are x and c?

Local NCE and Global NCE focus on learning $p(x \mid c)$, the likelihood of some $x$ given some ‘context’ $c$. InfoNCE maximizes $\frac{p(x \mid c)}{p(x)}$, a proxy for mutual information between $x$ and $c$. But what are $x$ and $c$ anyway?

§2 Human · 0%

Here are examples of x’s and c’s from the literature:

In NLP: Local NCE and Global NCE are both used for language modeling, where $x$ is a word and $c$ is the word’s context (the other words in a window around $x$). Here, we want to learn $p(x \mid c)$, the probability of the next word given context. [6] In speech recognition: Local NCE is used for learning $p(x \mid c)$ where $x$ is a word predicted from audio $c$ in which the word was spoken. [18] Also in speech recognition: InfoNCE is used to maximize the mutual information between representations of the same word in different contexts. Here, $x$ is a word and $c$ is the audio context. [19] In reinforcement learning: InfoNCE is used as a regularizer, where $x$ and $c$ are representations of the same game state at different times. Here, we want to maximize the mutual information between $x$ and $c$, which is proportional to $\frac{p(x \mid c)}{p(x)}$. [16] In computer vision: InfoNCE is used for contrastive learning, where $x$ and $c$ are different views (different crops with random augmentations like color filters and image-stretching applied) of the same image, and we again want to maximize the mutual information between $x$ and $c$. [1] In generative adversarial networks (GANs): Local NCE is similar to the ‘discriminator’ of a GAN, except that with a GAN, the ‘generator’ $q(x)$ is learned, and with Local NCE the generator $q(x)$ is fixed throughout training. [17]

Local NCE

How can we estimate $p_{\theta}(x \mid c)$ where $c$ is some context and $x$ is one of a large number of classes?

§3 Human · 0%

We can rewrite the objective like this:

\[p_{\theta}(x \mid c) = \dfrac{f_\theta (x, c)}{\sum_{x'} f_\theta (x', c)} = \dfrac{f_\theta (x, c)}{Z_\theta (c)}\]

where $f_\theta(x, c)$ assigns a score to $x$ in context and $Z_\theta (c)$ is the normalizing constant or partition function. $Z$ is difficult to compute in this case of many possible classes, because we have to sum over every possible $x’$.

Local NCE reduces the problem of learning $p_\theta(x \mid c)$ to learning a binary classifier $p(d \mid x, c)$, where $d$ tells us whether data point $c$ is “real”, i.e. sampled from $p(x \mid c)$, rather than “noise”, sampled from some noise distribution $q(x)$. Technically, $d$ is an indicator variable where $[d=0]$ implies that $x \sim q(x)$.

In Local NCE, we sample $k$ real samples (or negative samples) from $q(x)$ and assign them the label $D=0$. We sample one true data point (or positive sample) from $p(x \mid c)$. We can write the conditional probability of $d$ having observed $x$ and $c$:

\[\begin{align*} p(D = 0 \mid x, c) &= \dfrac{k \cdot q(x)}{p(x \mid c) + k \cdot q(x)} \\ p(D = 1 \mid x, c) &= \dfrac{p(x \mid c)}{p(x \mid c) + k \cdot q(x)} \end{align*}\]

Now we want to write these probabilities in terms of the scoring function $f_\theta$. Local NCE typically makes a key assumption that we can assume $Z_\theta(c) \approx 1$; this assumption is called self-normalization and tends to work well in practice, since neural networks are expressive enough to learn self-normalized scoring functions.

§4 Human · 0%

Self-normalization implies that $p(x \mid c) = f_\theta (x, c)$. With this assumption, we can rewrite the conditional likelihood in terms of $f_\theta$:

\[\begin{align*} p(D = 0 \mid x, c) &= \dfrac{k \cdot q(x)}{f_\theta(x,c) + k \cdot q(x)} \\ p(D = 1 \mid x, c) &= \dfrac{f_\theta(x,c)}{f_\theta(x,c) + k \cdot q(x)} \end{align*}\]

This is a typical binary classification problem: learn a $\theta$ that maximizes the conditional log-likelihood of $p(D \mid x, c)$. So we can write the loss as the binary classification error:

\[\begin{equation} \mathcal{L}_{\text{LocalNCE}} = \sum_{(x, c) \in D} \left[ \\ \log p(D = 1 \mid x, c) \\ + k \mathbb{E}_{x' \sim q} \log p(D = 0 \mid x', c) \right] \end{equation}\]

This loss $\mathcal{L}_{\text{LocalNCE}}$ isn’t actually useful yet because it still requires computing an expectation over $q(x)$, which would still loop over all possible classes for $x$! So, finally, we approximate the inner expectation Monte-Carlo sampling, using $k$ samples from $q(x)$:

\[\begin{equation*} \mathcal{L}_{\text{LocalNCE,MC}} = \sum_{(x, c) \in D} \left( \log p(D = 1 \mid x, c) + \sum_{i=1, x' \sim q}^{k} \log p(D = 0 \mid x', c) \right) \end{equation*}\]

The original NCE paper [5] shows how, as $k \rightarrow \infty$, the gradient of $\mathcal{L}$ is only zero once $f_\theta(x, c) = p(x \mid c)$. See the appendix for more detail.

§5 Human · 0%

So the main idea is that we can optimize $p(D \mid x, c)$ instead of $p(x \mid c)$, and the self-normalized scoring function we learn $f_\theta (x \mid c)$ will approximate $p_\theta(x \mid c)$.

Application: Negative sampling

NCE gives us a way to learn $f(x, c) \approx p(x \mid c)$ without computing the partition function, which can be expensive or even intractable. But it still requires some engineering; in particular, we have to choose $k$, the number of noise examples, and $q(x)$, the noise distribution.

Let’s choose some reasonable defaults for these parameters in the case of language modeling. We might choose $k=|V|$, so the probability of choosing a real point (as opposed to noise) is roughly $\frac{1}{|V|}$. And (without any prior knowledge of language) it seems reasonable to set $q(x)$ to a uniform distribution over the vocabulary, i.e. $q(x) = \frac{1}{|V|}$.

Given these choices for $k$ and $q$, we can simplify the NCE objective:

\[\begin{equation*} p(D = 0 \mid x, c) = \dfrac{ |V \mid \cdot \frac{1}{|V|} }{f_\theta(x,c) + |V \mid \cdot \frac{1}{|V|}} = \dfrac{1}{f_\theta(x,c) + 1} \end{equation*}\]

\[\begin{equation*} p(D = 1 \mid x, c) = \dfrac{f_\theta(x,c) }{f_\theta(x,c) + |V \mid \cdot \frac{1}{|V|}} = \dfrac{f_\theta(x,c) }{f_\theta(x,c) + 1} \end{equation*}\]

Although these are the equations for the loss in negative sampling, the actual sampling process uses a different $q(x)$, so negative sampling isn't truly using NCE. Consequently, negative sampling isn't modeling $p(x \mid c)$ either, but a different quantity related to $p(x, c)$. See [4].

§6 Human · 0%

This variation of NCE with $k = |V|$ and $q(x) = \frac{1}{|V|}$ is used in negative sampling [9,4], the method underlying the popular word2vec algorithm.

Approximating Z

Another way to learn $p(x \mid c)$ is to directly approximate it via sampling. When deriving NCE, we noted that:

\[\begin{equation*} p_\theta(x \mid c) = \dfrac{f_\theta(x, c)}{Z_\theta(c)} \end{equation*}\]

And with NCE, we avoided computing the partition function $Z$ by forcing the scoring function $f$ to “include” the partition function, via self-normalization. Suppose that instead of avoiding computing the partition function (by forcing the model to self-normalize) we decided to approximate $Z$ by sampling $x \sim q(x)$ $k$ times:

\[\begin{align*} Z_\theta(c) = \sum_j f_\theta(x_j, c) &= \sum_j f_\theta(x_j, c) \dfrac{q(x_j)}{q(x_j)} \\ &= \mathbb{E}_{x_j \in X} \left[ \dfrac{f_\theta(x_j, c)}{q(x)} \right] \\ &\approx \sum_{j=1, x_j \sim q(x)}^{k}\dfrac{f_\theta(x_j, c)}{q(x_j)} \end{align*}\]

This type of approximation via sampling from $q(x)$ is called importance sampling. We can plug the approximation of $Z_\theta$ in to approximate the conditional likelihood:

\[\begin{equation} p(x \mid c) \approx \dfrac{f_\theta(x, c)}{\sum_{j=1, x_j \sim q(x)} {f_\theta (x_j, c) / q(x_j)}} \end{equation}\]

Note that the approximation done here is in computing $Z_\theta(x, c)$, which is why this approach is also called partition function estimation.

Global NCE

Given Equation 2, one way to learn $f$ would be via maximum likelihood estimation to directly maximize $p(x \mid c)$. However, in our case, we don’t know the true value of $p(x \mid c)$, or it’s too expensive to compute directly.

§7 Human · 0%

So, as with Local NCE, we need to consider some surrogate objective to help us learn $p(x \mid c)$ in the setting where we can sample from $p(x \mid c)$, but can’t evaluate it directly.

Global NCE / Ranking NCE

So how can we find a surrogate objective? With Local NCE, we created a binary classification task where the classifier determines whether some $x$ came from context $p(x \mid c)$ or noise $q(x)$. The insight behind Global NCE is very similar. With Global NCE, the classifier identifies the real data point from multiple examples $x_1, …, x_k$ where one data point $x_i \sim p(x \mid c)$ is real and the remaining data points $x_{j \neq i} \sim q(x)$ are noise.

With these $x$’s, let’s create a categorical indicator variable $d$ where $[d = i]$ tells us that $x_i$ is the positive example. Then we can use $d$ to write the following objective, which is the cross-entropy of identifying the real example correctly (see the appendix for a proof):

\[\begin{align} p(d = i \mid x_i, c; k) &= \dfrac {p(x_i \mid c) \prod_{l \neq i} p(x_l)} {\sum_{n=1}^{k} p(x_n \mid c) \prod_{l \neq k} p(x_l)} \\ &= \dfrac { \frac{p(x_i \mid c)}{p(x_i)} } { \sum_{j=1}^{k} \frac{p(x_j \mid c)}{p(x_j)} } \end{align}\]

Finally, we want