Polynomial autoencoder

ivanpleshkov.dev

▲ 106 points • 36 comments • by timvisee • 2w ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is primarily human-written, with a small amount of AI content detected




The most direct way to compress an embedding (other than quantization) is to fit PCA on the corpus and keep the top-d eigenvectors. It works, but PCA is a linear projection, and neural-network embeddings on the sphere are structurally nonlinear: the well-known cone effect in transformers. Some of the variance lives in a nonlinear tail that a linear decoder can’t reach.

This post is about a closed-form way to add a quadratic decoder on top of PCA, to capture part of that nonlinear tail. The encoder stays as plain PCA. The decoder is a degree-2 polynomial lift plus Ridge regression (ordinary least squares with L2 regularization), also closed-form. No SGD, no epochs, no hyperparameter search. One np.linalg.solve over corpus statistics.

The construction itself isn’t mine. “PCA encoder + quadratic decoder + least-squares fit” appears in the dynamical-systems literature under the name quadratic manifold (see Jain 2017, Geelen-Willcox 2022/2023, Schwerdtner-Peherstorfer 2024+; more in §9). I only stumbled onto these papers after running my experiments and believing the construction was new. The polynomial lift doesn’t come up much in modern ML conversations, and this post is a note about a useful trick from an adjacent discipline carrying over to retrieval.

The concrete result. BEIR/FiQA, mxbai-embed-large-v1 (1024d), per-vector budget 512 bytes. Metric is NDCG@10 (Normalized Discounted Cumulative Gain over the top-10, the standard retrieval ranking-quality measure; range [0, 1], higher is better):

| method | NDCG@10 | Δ vs raw 1024d |
|---|---|---|
| raw 1024d (4096 bytes) | 0.4525 | n/a |
| PCA top-256 | 0.4168 | -3.58 p.p. |
| poly-AE 256d | 0.4441 | -0.85 p.p. |
| matryoshka top-256 | 0.4039 | -4.86 p.p. |
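For readers who want the metric pinned down, here is a minimal sketch of NDCG@10 for a single query. It assumes the linear-gain form (gain equals the graded relevance, discount is log2(rank + 1)); the function name ndcg_at_k and its arguments are illustrative and are not taken from the post's beir_eval.py.

```python
import math

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: doc ids in retrieved order (best first).
    qrels: dict doc_id -> graded relevance (ids not present count as 0).
    Assumption: linear gain with a log2(position + 1) discount.
    """
    def dcg(gains):
        # enumerate() is 0-based, so position = rank + 1 and the
        # discount log2(position + 1) becomes log2(rank + 2).
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical query with two relevant documents.
qrels = {"a": 1, "b": 1}
perfect = ndcg_at_k(["a", "b", "c"], qrels)
shifted = ndcg_at_k(["c", "a", "b"], qrels)
```

A ranking that puts both relevant documents first scores 1.0, and pushing them down one position lowers the score. A dataset-level figure like the ones in the table is then the mean of this per-query value over all queries.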


PCA already gives 4× per-vector memory compression at -3.58 p.p. NDCG. A quadratic decoder on top of PCA pulls back another +2.73 p.p., closing about three quarters of the gap to raw (0.85 p.p. remain) at the same byte budget. Matryoshka is in the table as another familiar baseline (here it drops more than PCA; that is a known side observation, not the central claim of this post; see the §3 footnote and §4).

Measured on four models (nomic-v1.5, mxbai-large, bge-base, e5-base). Poly-AE over PCA: +1 to +4.4 p.p. at d=128, +0.03 to +2.7 p.p. at d=256. Full table in §3.

Full implementation: ~150 lines of numpy, MIT-licensed, in the repository github.com/IvanPleshkov/poly-autoencoder. The BEIR eval script is beir_eval.py in the same repo. Reproduces in 30–40 minutes on an M-series MacBook (10–15 min to encode the corpus, ~15 min for the Ridge solve at d=256).

Contents:
- §1 what we’re comparing
- §2 setup
- §3 headline table
- §4 where the quadratic decoder helps
- §5 why linear projection loses
- §6 method
- §7 small-corpus caveat
- §8 residual compression
- §9 where the method came from
- §10 limits
- §11 what to try next

1. What we’re comparing

Four lines in every measurement:
- raw: the full embedding, no compression. Quality ceiling and the most bytes per vector; the “expensive” baseline.
- matryoshka: embedding[:d] plus L2 normalization. On models trained with a matryoshka loss (nomic, mxbai in our sample), this is a valid matryoshka vector. On models without matryoshka training (bge-base, e5-base) it’s a test of what happens when you naively slice a non-MRL model, the scenario users of the bge family, e5, and custom-fine-tuned embeddings actually fall into.


- PCA: top-d eigenvectors of the corpus covariance. Vectors live in d-dimensional PCA coordinates.
- poly-AE: our method. Encode with PCA into p ∈ ℝ^d, decode with a quadratic polynomial back to a full D-dimensional V̂, retrieve on V̂.

At a fixed d, the three compressed methods store 2d bytes per vector (fp16 coordinates); raw stays at full size.

2. Experimental setup

BEIR is the standard set of retrieval datasets (SciFact, FiQA, NFCorpus, TREC-COVID, etc.). The metric is NDCG@10. Corpus and queries are encoded with the chosen model, top-10 are retrieved by cosine similarity, and NDCG is computed against the labeled qrels.

PCA and poly-AE are fit transductively: statistics are computed on the corpus we want to compress. Queries never participate in the fit; they hit a fixed encoder/decoder at inference time. This matches a production deployment: an index operator computes PCA + Ridge once on their data and then serves queries.

For the main runs we use FiQA: 57K documents, 648 queries, 1706 qrels.

3. The headline table: four models

NDCG@10 on FiQA at budgets of 256 fp16 (512 bytes/vector) and 128 fp16 (256 bytes):

| Model | D | d | raw | matryoshka† | PCA | poly-AE | poly over matryoshka | poly vs raw |
|---|---|---|---|---|---|---|---|---|
| nomic-embed-text-v1.5 | 768 | 128 | 0.3746 | 0.3190 | 0.3273 | 0.3380 | +1.90 p.p. | -3.65 p.p. |
| nomic-embed-text-v1.5 | 768 | 256 | 0.3746 | 0.3508 | 0.3670 | 0.3673 | +1.65 p.p. | -0.73 p.p. |
| mxbai-embed-large-v1 | 1024 | 128 | 0.4525 | 0.3503 | 0.3689 | 0.4129 | +6.26 p.p. | -3.97 p.p. |
| mxbai-embed-large-v1 | 1024 | 256 | 0.4525 | 0.4039 | 0.4168 | 0.4441 | +4.02 p.p. | -0.85 p.p. |
| bge-base-en-v1.5* | 768 | 128 | 0.4062 | 0.2914 | 0.3266 | 0.3654 | +7.40 p.p. | -4.09 p.p. |
| bge-base-en-v1.5* | 768 | 256 | 0.4062 | 0.3574 | 0.3688 | 0.3958 | +3.84 p.p. | -1.05 p.p. |
| e5-base-v2* | 768 | 128 | 0.3987 | 0.2498 | 0.3065 | 0.3317 | +8.18 p.p. | -6.70 p.p. |
| e5-base-v2* | 768 | 256 | 0.3987 | 0.3333 | 0.3618 | 0.3852 | +5.19 p.p. | -1.35 p.p. |

† The “matryoshka” column is embedding[:d] plus L2 normalization. On nomic and mxbai it’s a valid matryoshka vector. On models marked with * (bge-base, e5-base) the model wasn’t trained for matryoshka, and the slice here is a test of what happens when you naively slice a non-MRL model. This is the scenario that users of the bge family, e5, and custom-fine-tuned embeddings actually fall into, and we measure it here honestly.

What this shows:
- Poly-AE is consistently ahead of PCA across all four models. Lift: +1 to +4.4 p.p. NDCG at d=128, +0.03 to +2.7 p.p. at d=256. Where the quadratic decoder actually helps, and at what d, is discussed in §4.
- At d=256, poly-AE loses 0.7–1.4 p.p. NDCG vs raw 768/1024 on all four models. 4× per-vector memory compression for less than 1.5 p.p. lost: the main number of the post.
- On non-matryoshka-trained models, the matryoshka column drops more than PCA: up to about 15 p.p. NDCG below raw at d=128 (e5-base-v2). This is a side observation: the post compares PCA and poly-AE, not PCA vs matryoshka. If the matryoshka numbers in the table look surprising, there’s a short pointer in §4.

4. Where the quadratic decoder actually helps

PCA is the linear baseline. The quadratic decoder adds the nonlinear piece that the linear one can’t reach (mechanics in §5). How much does that actually help on retrieval, and at what d?

Poly-AE lift over PCA, by model and d:

| Model | poly over PCA, d=128 | poly over PCA, d=256 |
|---|---|---|
| nomic-v1.5 | +1.07 p.p. | +0.03 p.p. |
| mxbai-large | +4.40 p.p. | +2.73 p.p. |
| bge-base | +3.88 p.p. | +2.70 p.p. |
| e5-base-v2 | +2.52 p.p. | +2.34 p.p. |
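For reference, a minimal sketch of the two compressed baselines that poly-AE is measured against in §3 and §4, run on synthetic data. This is not the code from the post's repository: the helper names (matryoshka_slice, fit_pca, pca_encode), the random stand-in corpus, and the choice to center the data before PCA are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def matryoshka_slice(emb, d):
    # Keep the first d coordinates, then re-normalize to unit length.
    return l2_normalize(emb[:, :d])

def fit_pca(corpus, d):
    # Top-d eigenvectors of the corpus covariance (assumption: centered data).
    mean = corpus.mean(axis=0)
    cov = np.cov(corpus - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :d]              # (D, d), largest eigenvalue first
    return mean, basis

def pca_encode(x, mean, basis):
    return (x - mean) @ basis                    # (n, d) PCA coordinates

rng = np.random.default_rng(0)
D, d = 64, 16                                    # toy sizes, not the post's 768/1024
corpus = l2_normalize(rng.normal(size=(500, D)))
queries = l2_normalize(rng.normal(size=(5, D)))

mean, basis = fit_pca(corpus, d)                 # fit on the corpus only
corpus_codes = pca_encode(corpus, mean, basis).astype(np.float16)
query_codes = pca_encode(queries, mean, basis)   # queries reuse the fixed encoder

bytes_per_vector = corpus_codes.shape[1] * corpus_codes.itemsize   # 2 * d in fp16
sliced = matryoshka_slice(corpus, d)
```

The PCA statistics come from the corpus alone and the queries pass through the same fixed projection, which mirrors the transductive setup described in §2. Both baselines end up at 2d bytes per vector once stored as fp16.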


The picture:

At d=128 (8× compression) poly is consistently 1–4 p.p. ahead of PCA. This is the regime where the linear decoder starts dropping noticeable variance into the nonlinear tail, and the quadratic correction pulls it back. Sweet spot for the method.

At d=256 (4× compression) the gap is uneven. On mxbai/bge/e5 — a stable +2.3–2.7 p.p. On nomic — close to zero (+0.03). Likely reason: nomic was carefully trained with multi-slice contrastive loss, its latent is more isotropic, and at d=256 the linear projection already takes most of what’s there. On non-MRL models the nonlinear tail is bigger → poly helps more.

More anisotropy → bigger lift. The stronger the cone effect, the more variance lives in the nonlinear tail that PCA can’t reach but poly can. That’s the geometry §5 unpacks.
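One rough way to check how anisotropic a given embedding set is: the average cosine similarity between distinct vectors, which sits near 0 for isotropic data and approaches 1 as the cone narrows. This is a common proxy, not a measurement used in the post; the function name and the synthetic data below are illustrative.

```python
import numpy as np

def mean_pairwise_cosine(emb):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    n = unit.shape[0]
    # The sum of all pairwise dot products equals ||sum of unit vectors||^2;
    # subtract the n self-similarities (each exactly 1).
    total = np.sum(unit.sum(axis=0) ** 2)
    return (total - n) / (n * (n - 1))

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(2000, 64))
# Shift every vector along one shared direction to mimic a narrow cone.
cone = isotropic + 4.0 * np.ones(64)

iso_score = mean_pairwise_cosine(isotropic)
cone_score = mean_pairwise_cosine(cone)
```

On real embeddings the same function can be run on a corpus sample. A higher score only says the cloud is more cone-like; whether that translates into a bigger poly-AE lift for a given model still has to be measured.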

Side: where matryoshka sits in the table

In §3 you can see that on non-MRL models (bge, e5) the matryoshka column drops more than PCA, i.e. on a random non-MRL-trained model, naive slicing works worse than a corpus-side linear projection. This is a known result; the “MRL vs PCA on retrieval” question has been discussed independently of this post: see Matryoshka-Adaptor 2024, SMEC “Rethinking MRL” 2025, CoRECT 2025, and the YouTube video literally titled «Is PCA enough?».


This post compares PCA and poly-AE; matryoshka is in the §3 table as a third reference point.

Where poly-AE doesn’t apply

A corpus-side PCA fit is a required part of the method. That means poly-AE doesn’t fit settings where a per-corpus fit is unavailable or impractical:
- multi-tenant SaaS: one model, thousands of clients with different corpora; fitting PCA per client is operational pain;
- streaming indices: statistics drift over time, PCA needs periodic refits;
- edge inference: phone, browser, embedded; you don’t want to ship a per-client PCA matrix alongside the model.

In those settings you want an MRL-trained model and embedding[:d], and poly-AE isn’t an alternative: it also needs corpus statistics.

Practical takeaway

In the operator-fit setting (fixed corpus, the operator fits compression once), you have two working modes:
- d=256 gives 4× compression at -0.7 to -1.4 p.p. NDCG vs raw. Poly over PCA: from +0.03 (nomic) to +2.3–2.7 p.p. (mxbai/bge/e5). On an MRL-trained model like nomic the gap to PCA is minimal; on non-MRL models poly is clearly ahead.
- d=128 gives 8× compression. Poly over PCA: +1 to +4.4 p.p. on any model. Sweet spot for the method.

5. Why a linear projection loses information

PCA is the best possible linear projection of data into a d-dimensional subspace (in the squared-reconstruction-error sense). But “best linear” doesn’t mean “good enough”: if the data has nonlinear structure, a linear decoder can’t reach it, period.

Transformer embeddings have such structure, and it is well-studied: the cone effect. The point cloud is concentrated inside a narrow cone on the unit sphere and is heavily nonlinearly structured inside that cone.

[Figure. Left: isotropic data. Right: one example of what anisotropic data can look like.]

PCA only catches projections along orthogonal eigenvectors, i.e. the linear ellipsoid that the cloud doesn’t actually fit in. Whatever lives in the curvature of the manifold is structurally invisible to a linear decoder.
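A toy illustration of that point (not from the post's experiments): points on a parabola are intrinsically one-dimensional, yet a 1-D PCA decoder still leaves a residual, and that residual is almost entirely a quadratic function of the PCA coordinate. The data and variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=2000)
X = np.stack([t, 0.5 * t**2], axis=1)         # curved 1-D manifold in 2-D

mean = X.mean(axis=0)
Xc = X - mean
# Top principal direction from the SVD of the centered data.
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
v = vt[0]                                      # shape (2,)

p = Xc @ v                                     # 1-D PCA coordinate
linear_recon = np.outer(p, v)                  # best linear decode from p
residual = Xc - linear_recon
linear_mse = np.mean(np.sum(residual**2, axis=1))

# Fit what the linear decoder missed with [1, p, p^2] by least squares.
lift = np.stack([np.ones_like(p), p, p**2], axis=1)
coef, *_ = np.linalg.lstsq(lift, residual, rcond=None)
quad_mse = np.mean(np.sum((residual - lift @ coef)**2, axis=1))
```

Here linear_mse stays clearly above zero even though one number per point describes the data, while quad_mse drops to a small fraction of it. Real embeddings are far messier than a parabola, so this shows the mechanism only, not the size of the effect.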


The stronger the anisotropy (the narrower the cone), the more variance sits in this nonlinear tail that PCA structurally can’t reach.

So what we want is clear: a decoder that can deal in quadratic combinations of coordinates, i.e. a decoder that captures local curvature of the manifold. Then we’d recover some of the information that PCA loses.

6. A polynomial decoder via a linear lift

§5 made it clear that we need a nonlinear decoder. The straightforward route is to train a neural network. But then we lose the closed-form pipeline and take on SGD, learning rate, batch size, convergence, early stopping. We’d like a nonlinear decoder solvable by a single formula.

Standard regression trick: the polynomial lift. Take vector p and lift it into all monomials up to degree 2: bias, linear terms, squares, and pairwise products. On a 2D example:

lift([p₁, p₂]) = [1, p₁, p₂, p₁², p₁·p₂, p₂²]

Here 1 is the bias, p₁ and p₂ are the linear terms, and p₁², p₁·p₂, p₂² are the quadratic terms.

Any linear combination of these six is a quadratic function of the original p₁, p₂. So a linear regression on the lift is a quadratic regression in the original space.
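A minimal numpy sketch of this idea, assuming the decoder is fit as a Ridge regression from the lifted PCA coordinates to the original vectors. The names (quadratic_lift, fit_ridge_decoder), the regularization value, and the synthetic data are illustrative and are not taken from the post's repository.

```python
import numpy as np

def quadratic_lift(P):
    """Map (n, d) coordinates to all monomials of degree <= 2.

    Output columns: bias, d linear terms, d*(d+1)/2 products p_i*p_j (i <= j).
    """
    n, d = P.shape
    i, j = np.triu_indices(d)                   # index pairs with i <= j
    quad = P[:, i] * P[:, j]
    return np.hstack([np.ones((n, 1)), P, quad])

def fit_ridge_decoder(P, V, lam=1e-3):
    """Closed-form Ridge: W = (Phi^T Phi + lam*I)^(-1) Phi^T V.

    Simplification: lam*I also penalizes the bias column.
    """
    Phi = quadratic_lift(P)
    gram = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(gram, Phi.T @ V)     # one linear solve, no SGD

# The 2-D example from the text: [1, p1, p2, p1^2, p1*p2, p2^2]
example = quadratic_lift(np.array([[2.0, 3.0]]))

# Synthetic check: targets that are exactly quadratic in p are recovered.
rng = np.random.default_rng(0)
P = rng.normal(size=(500, 4))
Phi = quadratic_lift(P)
true_W = rng.normal(size=(Phi.shape[1], 8))
V = Phi @ true_W
W = fit_ridge_decoder(P, V, lam=1e-8)
V_hat = quadratic_lift(P) @ W
max_err = np.max(np.abs(V_hat - V))
```

The lift has 1 + d + d(d+1)/2 columns. If the full lift is used at d = 256, that is 33,153 columns, so the Gram matrix handed to np.linalg.solve is 33,153 × 33,153; the cost sits in that single solve rather than in any training loop.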