The quadratic sandwich

F fedemagnani.github.io ↗

▲ 129 points • 11 comments • by cpp_frog • 4d ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully AI-generated

97 %

AI likelihood · overall

0% human-written 100% AI-generated

SEGMENTS · HUMAN 0 of 6

SEGMENTS · AI 6 of 6

WORD COUNT 1,928

PEAK AI % 100% · §1

Analyzed

May 23

backend: pangram/v3.3

Segments scanned

6 windows

avg 321 words each

Distribution

0 / 100%

human / AI fraction

Verdict

Pangram v3.3

Article text · 1,928 words · 6 segments analyzed

Human AI-generated

§1 AI · 100%

2026-04-08

If you have ever tried to minimize a function with gradient descent, you probably noticed that some functions are a joy to optimize and others are a nightmare. The difference often boils down to two properties: strong convexity and L-smoothness. These two concepts define a “sandwich” of quadratic bounds around your function that tells you exactly how well-behaved it is. If the sandwich is tight, life is good. If one slice of bread is missing, things get ugly fast.

In this post we’ll build up both concepts from scratch, see how they combine into the quadratic sandwich, understand what happens at the level of the Hessian’s eigenvalues, and pick up a neat trick to verify L-smoothness without ever computing an eigenvalue.

Strong convexity — the function can’t be too flat

A differentiable function \(f:\mathbb{R}^n\to\mathbb{R}\) is \(\mu\)-strongly convex (with \(\mu > 0\)) if for all \(x, y\)

\[f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2\]

If this looks familiar, it’s because the first two terms on the right are the first-order Taylor expansion of \(f\) at \(x\). For a plain convex function, the Taylor expansion is already a global underestimator (that’s the subgradient inequality). But strong convexity asks for more: the function must stay above the tangent plus a quadratic gap. The parameter \(\mu\) controls how aggressive this gap is — the bigger \(\mu\), the more the function curves upward and away from its linear approximation.

The intuition is that a strongly convex function has a guaranteed minimum curvature of \(\mu\) in every direction. It can’t flatten out, it can’t plateau, it can’t have a degenerate valley where one direction is basically flat. There is always a force pulling you toward the minimum, and that force grows linearly with the distance from the minimizer.

§2 AI · 100%

L-smoothness — the function can’t be too steep

A differentiable function \(f\) is \(L\)-smooth if its gradient is Lipschitz continuous:

\[\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \forall \; x, y\]

Read this carefully: the change in the gradient between any two points is always dominated by a rescaled version of the change in the input. No matter how far apart \(x\) and \(y\) are, the gradient difference \(\|\nabla f(x) - \nabla f(y)\|\) can never outpace \(L\) times the input difference \(\|x - y\|\). The constant \(L\) acts as a leash on the gradient: it can move, but it can’t jerk. No abrupt turns, no sudden spikes in curvature.

Now here’s the equivalent characterization that will matter for the sandwich, sometimes called the descent lemma: if \(f\) is convex and \(L\)-smooth, then for all \(x, y\)

\[f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2\]

This is not obvious from the Lipschitz condition alone — it requires a short derivation that we work out in the Appendix. The key idea is to integrate the gradient along the segment from \(x\) to \(y\) and use Cauchy-Schwarz together with the Lipschitz bound to control the error.

Look at the structure: same shape as the strong convexity condition, but with the inequality flipped and \(\mu\) replaced by \(L\). The function now stays below its tangent plus a quadratic term. In other words, the function can bend, but not more than a quadratic with curvature \(L\).

The parameter \(L\) caps the maximum curvature in any direction. If strong convexity sets a floor on curvature, L-smoothness sets a ceiling.

The quadratic sandwich

Now we put both slices of bread together.

§3 AI · 100%

If \(f\) is \(\mu\)-strongly convex and \(L\)-smooth, then both inequalities hold simultaneously and for all \(x, y\) we get

\[\begin{cases} f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2 \leq f(y) \\[6pt] f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2 \end{cases}\]

The function is trapped between two parabolas centered at any point \(x\): a tighter one from below (curvature \(\mu\)) and a wider one from above (curvature \(L\)). This is the quadratic sandwich.

The proof of the sandwich is worked out in the Appendix.

The condition number

The ratio

\[\kappa = \frac{L}{\mu}\]

is called the condition number and it measures how thick the sandwich is. By construction, since \(L \geq \mu\) (the maximum curvature can’t be smaller than the minimum curvature), the condition number is always bounded below by 1: \(\kappa \geq 1\). A small \(\kappa\) (close to 1) means the function is “almost quadratic” — both bounds are tight, and the function is easy to optimize. A large \(\kappa\) means the curvature varies wildly across directions, and this is where gradient descent starts to suffer. Think about it in two scenarios:

Some directions lack curvature (\(\mu\) is tiny): a small \(\mu\) doesn’t necessarily mean the function is flat — you could still have a nonzero slope. But it means you have no “acceleration” to exploit: the gradient barely changes from one iterate to the next, so you can’t gauge whether you’re getting closer to the minimum or how to adjust your step size. Think of walking on a constant slope — you’re moving, but the terrain gives you no feedback about your progress. The gradient changes too abruptly (\(L\) is huge): contrary to the lack of strong convexity — where the problem is that the gradient doesn’t change — here the problem is that it changes too much. All of a sudden, between one iterate and the next, the gradient spikes or flips direction in a way that your step size didn’t anticipate.

§4 AI · 100%

Note that a large \(L\) alone is not necessarily catastrophic: if \(\mu\) is also large (high curvature everywhere), the function is just a steep but well-behaved bowl, and you can simply use a small step size \(1/L\) that works uniformly.

The real trouble comes from the spread between \(L\) and \(\mu\) — a large condition number \(\kappa = L/\mu\). When the gap is significant, some directions have high curvature (where the gradient changes fast) and others have low curvature (where the gradient is nearly constant). A single step size cannot serve both: sized for the high-curvature directions it’s too conservative for the low-curvature ones, and vice versa. This mismatch is what causes the classic zigzagging behavior of gradient descent — it’s not \(L\) alone or \(\mu\) alone, but the ill-conditioning between them.

When \(\kappa = 1\) (that is, \(\mu = L\)), the sandwich is perfectly balanced: the same quadratic function bounds \(f\) from above and from below. Since \(f\) is squeezed between two identical parabolas, it must be that parabola. The sandwich collapses into an equality: for all \(x, y\)

\[f(y) = f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2\]

The function is exactly its own second-order approximation everywhere — no error, no slack. Now evaluate this at the minimizer \(x^\star\) where \(\nabla f(x^\star) = 0\):

\[f(y) = f(x^\star) + \frac{\mu}{2}\|y - x^\star\|^2\]

So \(f\) is a perfect quadratic bowl centered at \(x^\star\). But we can squeeze out more. Differentiating the equality with respect to \(y\) gives \(\nabla f(y) = \mu(y - x^\star)\): the gradient at any point is just a rescaled vector pointing radially away from the minimizer.

§5 AI · 100%

There is no zigzagging, no misalignment — gradient descent with step size \(1/\mu\) reaches the minimum in exactly one step:

\[y - \frac{1}{\mu}\nabla f(y) = y - (y - x^\star) = x^\star\]

In other words, the only functions with a perfect sandwich are quadratics, and quadratics are the only functions where gradient descent doesn’t need to iterate at all.

What goes wrong without one slice of bread

Without strong convexity

Set \(\mu = 0\) and the lower bound degenerates into the tangent hyperplane — you lose the quadratic pull toward the minimum. The condition number \(\kappa = L/\mu\) blows up to \(+\infty\), which is the mathematical way of saying “gradient descent is going to have a bad time”.

The secret sauce of strong convexity is that the gradient is guaranteed to change in an appreciable way at every step. For a \(\mu\)-strongly convex function, \(\|\nabla f(x)\|\) grows at least proportionally to \(\|x - x^\star\|\): a big gradient means you’re far from the optimum, a small gradient means you’re close. At every iterate, the gradient norm lets you prelude how close you are to the minimum — it’s a calibrated signal. This has concrete algorithmic consequences: if you know roughly how far you are, you can take appropriately sized steps — large when far, small when close (assuming the field changes smoothly under your feet, which is what L-smoothness guarantees).

Without strong convexity, the gradient loses this calibration. Consider \(f(x) = \|x\|_1\): the gradient is \(\pm 1\) everywhere except the origin — it gives you the direction but says nothing about the distance. Whether you’re at \(x = 100\) or \(x = 0.001\), the gradient screams with the same intensity. The gradient doesn’t change at all as you move, so you have no way to gauge your progress. The same happens with the Huber loss: once you’re in the linear regime, the gradient is constant and you can’t tell if you’re close or far. Without a gradient that scales with the distance, the solver is flying blind — it has no way to modulate its step size based on proximity to the minimum.

Even worse, without strong convexity you lose the guarantee that the minimizer is unique.

§6 AI · 100%

The function might have a whole subspace of minimizers, or a flat region where the gradient vanishes but you’re nowhere near the optimum.

Without L-smoothness

Imagine you’re kicking a ball blindly toward a target — you can’t see the field, but someone tells you the direction to kick. As long as you’re kicking in the mud, life is predictable: the ball moves a little bit each time, the friction keeps things under control, and you can reasonably expect where it will land after each kick. You make steady progress, kick after kick. But now imagine the mud suddenly disappears and you’re on ice. You kick with the same energy as before, but this time the ball flies away — it overshoots the target and lands even further from it than where you started. The ground changed under your feet, but your kicking force didn’t adapt.

That’s the core problem without L-smoothness: you find a step size that works well in one region (where the gradient is moderate, the “mud”), you proceed confidently, but you end up in a region where the gradient has exploded (the “ice”) — the change in the gradient was way bigger than the change in the input. Using the same step size now produces a catastrophic overshoot, and you might end up further from the minimum than where you began.

Mathematically: remove the upper bound and the function is free to spike arbitrarily. A solver that takes a step based on the current gradient has no guarantee about what the function value will be at the new point — it could be much higher than expected because the curvature exploded between the current point and the next.

Consider \(f(x) = -\ln(x)\) for \(x > 0\): the second derivative is \(\frac{1}{x^2}\), which is unbounded as \(x \to 0\). Far from the origin the function is gentle (at \(x = 10\), the curvature is just \(0.01\)), but near the origin the curvature explodes (at \(x = 0.1\), the curvature is \(100\)). A step size calibrated for the gentle region will massively overshoot if it carries you into the steep region, and a step size conservative enough for the steep region will crawl everywhere else.