VC Dimension and the Fundamental Theorem of Statistical Learning — from Scratch
Pangram verdict · v3.3
We believe that this document is primarily AI-generated with some human-written content
AI likelihood · overall
AIArticle text · 1,593 words · 6 segments analyzed
May 2026 This post answers a single question: when does learning from data actually work? You train a model on samples, it performs well on those samples, and you hope it performs well on new data. When is that hope justified? The answer turns out to be a clean equivalence: a hypothesis class is learnable if and only if it has finite VC dimension. This is the Fundamental Theorem of Statistical Learning. We'll build every piece from scratch: Markov's inequality, Hoeffding's lemma, concentration inequalities, the union bound, VC dimension, the Sauer–Shelah lemma, symmetrization, and the generalization bounds that tie it all together. Nothing is assumed beyond basic probability (expectations, independence, indicator random variables). We prove both directions of the equivalence: the upper bound (finite VC dimension implies learnability) and the lower bound (via information theory: total variation, KL divergence, Pinsker's inequality, and Le Cam's method). This is Part 1 of two. Part 2 continues the story with Rademacher complexity — a probabilistic refinement of VC dimension that gives tighter, data-dependent bounds — and shows how all the pieces connect in one beautiful chain. 1. The Setup: What Does “Learning” Mean? We work in the simplest interesting setting: binary classification. Definition (The Learning Problem) Instance space \(\mathcal{X}\): the set of all possible inputs (images, documents, feature vectors, etc.). Label space \(\mathcal{Y} = \{0, 1\}\): binary labels. Distribution \(\mathcal{D}\): an unknown probability distribution over \(\mathcal{X} \times \mathcal{Y}\). Nature draws \((x, y) \sim \mathcal{D}\). Hypothesis class \(\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}\): a fixed set of classifiers \(h : \mathcal{X} \to \{0,1\}\) that the learner is allowed to pick from. Training sample \(S = ((x_1, y_1), \ldots, (x_m, y_m))\): drawn i.i.d. from \(\mathcal{D}\). Definition (True Risk and Empirical Risk) The true risk (or population loss) of a hypothesis \(h\) is $$ L_{\mathcal{D}}(h) = \Pr_{(x,y)\sim\mathcal{D}}[h(x) \neq y].
$$ The empirical risk (or training error) on a sample \(S\) of size \(m\) is $$ L_S(h) = \frac{1}{m}\sum_{i=1}^m \mathbf{1}[h(x_i) \neq y_i]. $$ Example (Spam Classification) Suppose \(\mathcal{X}\) = all possible emails and \(\mathcal{Y} = \{0, 1\}\) where 1 = spam. Nature has some distribution \(\mathcal{D}\) over (email, label) pairs — this captures both which emails show up and how they're labeled. We don't know \(\mathcal{D}\); we only see a training set of \(m\) labeled emails. The true risk \(L_{\mathcal{D}}(h)\) is the probability that classifier \(h\) misclassifies a random future email. We never know this exactly — it involves all emails that could ever arrive. The empirical risk \(L_S(h)\) is the fraction of training emails that \(h\) gets wrong. This we can compute. The fundamental question: when does small \(L_S(h)\) guarantee small \(L_{\mathcal{D}}(h)\)? The learner sees only \(S\) and wants to find \(h \in \mathcal{H}\) with small \(L_{\mathcal{D}}(h)\). The most natural strategy is Empirical Risk Minimization (ERM): pick any \(h \in \mathcal{H}\) that minimizes \(L_S(h)\). In plain English: choose the hypothesis that makes the fewest mistakes on the training data. But ERM has an obvious failure mode: overfitting. If \(\mathcal{H}\) is expressive enough, some \(h \in \mathcal{H}\) will have \(L_S(h) = 0\) purely by memorizing the training data, while having terrible true risk. So we need to understand when ERM is trustworthy. This leads to two questions that the rest of the post will answer: When is learning possible at all? Under what conditions on \(\mathcal{H}\) can any algorithm generalize? When does ERM specifically work? When does the training error track the true error simultaneously for every hypothesis? These questions are formalized by two definitions: PAC learnability (question 1) and uniform convergence (question 2). Let's build up to each one. What Should “Learnable” Mean? We want a definition that captures: "given enough data, the learner will probably do well."
There are three things to get right: How well? The learner's error should be within \(\epsilon\) of the best possible error in \(\mathcal{H}\). We use \(\epsilon\) ("accuracy parameter") because we can't demand perfection — with finite data, there's always some approximation error. How probably? Since \(S\) is random, the learner might get unlucky (e.g., all training emails are spam). We allow a failure probability \(\delta\) ("confidence parameter"). Against what? The guarantee must hold for every distribution \(\mathcal{D}\). We don't know how nature generates data, so we can't assume anything about it. Definition (PAC Learning) A hypothesis class \(\mathcal{H}\) is PAC learnable (Probably Approximately Correct) if there exists a learning algorithm \(A\) and a function \(m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}\) such that for every \(\epsilon, \delta \in (0,1)\) and every distribution \(\mathcal{D}\) over \(\mathcal{X} \times \{0,1\}\): if \(m \geq m_{\mathcal{H}}(\epsilon, \delta)\), then with probability at least \(1 - \delta\) over the draw of \(S \sim \mathcal{D}^m\), $$ L_{\mathcal{D}}(A(S)) \leq \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon. $$ The function \(m_{\mathcal{H}}\) is called the sample complexity. Let's unpack this piece by piece: \(A(S)\) is the hypothesis that the algorithm outputs after seeing training data \(S\). \(\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)\) is the error of the best hypothesis in the class. Call this \(h^*\). We can't beat \(h^*\) (we're restricted to picking from \(\mathcal{H}\)), so we aim to be within \(\epsilon\) of it. The "\(\text{for every } \mathcal{D}\)" part is crucial: the same algorithm must work regardless of how data is distributed. This is what makes PAC learning distribution-free. \(m_{\mathcal{H}}(\epsilon, \delta)\) tells us how much data we need. Smaller \(\epsilon\) or \(\delta\) require more data.
This is the agnostic PAC model: we don't assume any \(h \in \mathcal{H}\) is perfect (i.e., we don't assume \(\min_h L_{\mathcal{D}}(h) = 0\)). Real-world problems are almost always agnostic — no simple model perfectly captures spam, disease, or fraud. Example (PAC Learning in Practice) Suppose \(\mathcal{H}\) = linear classifiers in \(\mathbb{R}^2\) and we want: \(\epsilon = 0.05\): the learner's error is within 5% of the best linear classifier. \(\delta = 0.01\): this guarantee holds 99% of the time. PAC learnability means there's a magic number \(m_{\mathcal{H}}(0.05, 0.01)\) such that if we collect at least that many labeled examples, any distribution over \(\mathbb{R}^2 \times \{0,1\}\) will produce a training set from which a good classifier can be found. If no such \(m\) exists (i.e., for any sample size, some adversarial distribution defeats the learner), then \(\mathcal{H}\) is not PAC learnable. Uniform Convergence: A Sufficient Condition for ERM PAC learnability asks whether some algorithm works. But we'd like to know when the simplest algorithm — ERM — works. For ERM to succeed, we need something stronger: the training error must be a reliable estimate of the true error, not just for the hypothesis ERM picks, but for every hypothesis simultaneously. Why "simultaneously"? Because ERM searches over \(\mathcal{H}\) to find the hypothesis with the lowest training error. If training error is only accurate for a few hypotheses but wildly off for others, ERM might pick one of the "wildly off" ones — a hypothesis that looks great on training data but is actually terrible. Definition (Uniform Convergence) \(\mathcal{H}\) has the uniform convergence property if for every \(\epsilon, \delta > 0\), there exists \(m_0\) such that for all \(m \geq m_0\) and all distributions \(\mathcal{D}\), $$ \Pr_S\!\left[\sup_{h \in \mathcal{H}} |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right] < \delta.
$$ The \(\sup\) over \(\mathcal{H}\) is doing the heavy lifting: it says the worst-case gap between training and true error, over all hypotheses, is small. Compare this to pointwise convergence, where we'd only require \(|L_S(h) - L_{\mathcal{D}}(h)| < \epsilon\) for each individual \(h\) separately. Example (Uniform vs. Pointwise) Imagine \(\mathcal{H} = \{h_1, h_2\}\) with two hypotheses, and you have \(m = 100\) training points. Pointwise: \(L_S(h_1)\) is close to \(L_{\mathcal{D}}(h_1)\), AND separately \(L_S(h_2)\) is close to \(L_{\mathcal{D}}(h_2)\). Each guarantee might hold with probability 0.99. Uniform: BOTH gaps are small at the same time, with probability 0.99. This is stronger because bad luck on \(h_1\) and bad luck on \(h_2\) might coincide. With two hypotheses, the difference is minor (union bound costs a factor of 2). But when \(\mathcal{H}\) is infinite, pointwise convergence is essentially free (law of large numbers) while uniform convergence may or may not hold — and it's uniform convergence that determines whether ERM works. Why Uniform Convergence Makes ERM Work The punchline is a short chain of three inequalities. But to see why each one is true, let's first walk through a concrete scenario, then state the general argument. Concrete Scenario Suppose \(\mathcal{H} = \{h_1, h_2, h_3\}\) and the true risks (which we can't see) are: \(L_{\mathcal{D}}(h_1) = 0.20\) — this is the best hypothesis (\(h^* = h_1\)). \(L_{\mathcal{D}}(h_2) = 0.35\). \(L_{\mathcal{D}}(h_3) = 0.50\). Now suppose uniform convergence holds with \(\epsilon = 0.03\). This means every hypothesis's training error is within 0.03 of its true error.
One possible training set might give: \(L_S(h_1) = 0.22\) (true: 0.20, gap: 0.02 \(\leq\) 0.03 \(\checkmark\)) \(L_S(h_2) = 0.33\) (true: 0.35, gap: 0.02 \(\leq\) 0.03 \(\checkmark\)) \(L_S(h_3) = 0.48\) (true: 0.50, gap: 0.02 \(\leq\) 0.03 \(\checkmark\)) ERM picks \(h_1\) (lowest training error: 0.22). The true error of ERM's pick is \(L_{\mathcal{D}}(h_1) = 0.20\). The best possible true error is also 0.20. ERM nailed it. But what if the training errors shift around? Uniform convergence allows gaps up to \(\epsilon = 0.03\), so another training set might give: \(L_S(h_1) = 0.23\) \(L_S(h_2) = 0.32\) \(L_S(h_3) = 0.52\) Now ERM picks \(h_1\) again (0.23 is still the lowest). True error of the pick is still 0.20. Fine. Worst case: The gaps conspire maximally against us: \(L_S(h_1) = 0.23\) (true 0.20, shifted UP by 0.03) \(L_S(h_2) = 0.32\) (true 0.35, shifted DOWN by 0.03) \(L_S(h_3) = 0.47\) (true 0.50, shifted DOWN by 0.03) Now ERM picks \(h_1\) (0.23 vs 0.32 vs 0.47 — still the smallest). Even in this worst case, ERM picks the right answer.