the-stats-duck v0.6.0 — statistics that live in your SQL

K kolistat.com ↗

▲ 133 points • 19 comments • by caerbannogwhite • 3d ago • HN discussion ↗

the-stats-duck, our open-source DuckDB extension, just shipped v0.6.0, and in keeping with the whole operation, the release is named i-m-not-dead. It very much isn't!

In case you haven't met it before: the-stats-duck is the statistics engine that Bedevere and KoliLang lean on. It lets DuckDB do real statistics (distributions, tests, regression, even plots) without ever leaving SQL. It is MIT-licensed, and it runs anywhere DuckDB runs, including a browser! That is quite handy, because every demo below is a live Bedevere instance running the-stats-duck right here in your browser. The dataset is the famous Palmer Penguins; change the SQL and re-run it.

    SELECT * FROM 'penguins';

The first thing anyone usually does with a new dataset is squint at it. meta() does the squinting for you and returns the full profile, one row per column:

    SELECT column_name, kind, n_missing, n_distinct, mean, median, stddev, top
    FROM meta('penguins');

meta() overlaps with DuckDB's built-in SUMMARIZE, but it is a table function, so you can join it, filter it, and compose it in CTEs. Need to know "how many numeric columns, and how many missing values in total"? That's just an aggregation over meta().

Regression without leaving SQL: lm + an R-style formula

Yes, ordinary least squares, in a SQL query, with the formula syntax you already know:

    SELECT *
    FROM lm_summary(
        'penguins',
        formula := 'body_mass_g ~ flipper_length_mm + bill_length_mm'
    );
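For the curious, here is roughly what a fit like this computes, sketched in plain Python rather than SQL. It is not the extension's code: it assumes an intercept column is always added, drops incomplete rows, and solves the normal equations (X'X) b = X'y with a hand-rolled Cholesky factorization. The function names and the toy data are made up for illustration.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_cholesky(low, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def ols(rows):
    """rows are (y, x1, x2, ...) tuples; returns [intercept, b1, b2, ...]."""
    # Complete-case filter: drop any row that has a missing value.
    complete = [r for r in rows if all(v is not None for v in r)]
    design = [[1.0] + [float(v) for v in r[1:]] for r in complete]
    y = [float(r[0]) for r in complete]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_cholesky(cholesky(xtx), xty)


# Toy data generated from y = 1 + 2*a + 3*b, plus one incomplete row (hypothetical values).
toy = [(1 + 2 * a + 3 * b, a, b) for a, b in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]]
toy.append((None, 5, 5))
coefficients = ols(toy)
```

On the toy data the recovered coefficients match the generating intercept and slopes up to floating-point error, and the incomplete row has no effect on the result. Standard errors, t statistics and the model-level summary are left out of the sketch.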

lm() gives you the coefficient table (estimate, std_error, t_statistic, and p_value per term); lm_summary() gives the model-level line (R², adjusted R², F and its p-value, residual df, sigma). The formula language handles additive predictors and intercept removal (- 1); interactions and inline transforms aren't in v0.6 yet, but we might add them soon! Under the hood it's a Cholesky solve of X'X on complete cases only (rows with missing values are dropped), cross-verified against R on cars and mtcars to four decimals.

Confidence intervals by brute force: bootstrap()

Don't want to assume normality? Resample:

    WITH b AS (
        SELECT bootstrap(body_mass_g, 'mean', 2000, 42) AS samples
        FROM penguins
    )
    SELECT list_aggregate(samples, 'quantile_cont', 0.025) AS lo,
           list_aggregate(samples, 'quantile_cont', 0.975) AS hi
    FROM b;
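If the percentile bootstrap is new to you, the idea behind the query fits in a few lines of plain Python. This is a sketch of the technique, not the extension's implementation: the helper names mirror the SQL ones only for readability, the interpolation rule is an assumption about how a continuous quantile is typically defined, and the body masses are made-up values rather than the real dataset.

```python
import random
import statistics


def bootstrap(values, statistic, n_iters, seed=None):
    """Resample `values` with replacement n_iters times; return the statistic of each resample."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    n = len(values)
    return [statistic(rng.choices(values, k=n)) for _ in range(n_iters)]


def quantile_cont(samples, p):
    """Continuous quantile: linear interpolation between neighbouring order statistics."""
    ordered = sorted(samples)
    pos = p * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


# Hypothetical body masses in grams, for illustration only.
masses = [3750.0, 3800.0, 3250.0, 3450.0, 3650.0, 3625.0, 4675.0, 3475.0]
samples = bootstrap(masses, statistics.fmean, 2000, seed=42)
lo, hi = quantile_cont(samples, 0.025), quantile_cont(samples, 0.975)
```

The pair (lo, hi) is the 95% percentile interval for the mean: the middle 95% of the 2,000 resampled means. Swapping statistics.fmean for statistics.median gives an interval for the median with no other changes.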

bootstrap(value, statistic, n_iters [, seed]) resamples with replacement and hands back a LIST<DOUBLE> holding the statistic from each resample (mean, median, stddev, …). Pair it with list_aggregate(list, 'quantile_cont', p) for a percentile interval. Pass a seed and it's reproducible, even across GROUP BY groups.

Charts, straight from SQL: VISUALIZE … DRAW

the-stats-duck also speaks a small (for now) plot grammar, ggsql, that compiles to Vega-Lite, which Bedevere renders. v0.6.0 adds violin marks, 2-D facets (FACET BY row, col), and a per-layer STAT smooth (LOESS), so DRAW point DRAW line STAT smooth is your scatter-with-trend in one line. Here is a violin plot of body mass by species:

    VISUALIZE species AS x, body_mass_g AS y
    FROM penguins
    DRAW violin
    TITLE 'Body Mass (g) by species';

Scatter with a fitted line: STAT smooth

The other half of v0.6.0's chart story is the per-layer STAT modifier, appended after a mark as DRAW <mark> STAT <name>. It transforms that layer only, so you can stack a raw mark under a statistical one. The canonical use is a scatter with a LOESS overlay:

    VISUALIZE bill_depth_mm AS x, bill_length_mm AS y, species AS color
    FROM penguins
    DRAW point
    DRAW line STAT smooth
    SCALE x ZERO false
    SCALE x LABEL 'Bill Depth (mm)'
    SCALE y ZERO false
    SCALE y LABEL 'Bill Length (mm)'
    TITLE 'Penguins Bill Depth VS Bill Length (Smooth)';
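To give a feel for what a LOESS overlay does to the points, here is a bare-bones version in plain Python: for each x it fits a weighted straight line to the nearest neighbours and keeps the fitted value. This is a textbook sketch under stated assumptions (tricube weights, a bandwidth of 0.5, no robustness iterations); the Vega-Lite transform the post refers to may use different defaults, and the function name and data are made up.

```python
import math


def loess(xs, ys, bandwidth=0.5):
    """Locally weighted linear fit, evaluated at every x in xs."""
    n = len(xs)
    k = max(2, math.ceil(bandwidth * n))  # number of neighbours in each local fit
    fitted = []
    for x0 in xs:
        # Indices of the k points closest to x0 along the x axis.
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        d_max = max(abs(xs[i] - x0) for i in nearest) or 1.0
        # Tricube weights: 1 at x0, falling to 0 at the farthest neighbour.
        w = [(1 - (abs(xs[i] - x0) / d_max) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        mx = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (xs[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (xs[i] - mx) * (ys[i] - my) for wi, i in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0  # fall back to the local mean
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Sanity check on exactly linear toy data: the smoother should return the line itself.
xs = [float(x) for x in range(10)]
ys = [2 * x + 1 for x in xs]
fitted = loess(xs, ys)
```

To mimic the one-curve-per-color behaviour, the same function would simply be run once per group.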

Three modifiers ship:

- smooth injects a Vega-Lite LOESS transform on (x, y), grouped by color when it's mapped: one fitted curve per species in the chart above.
- summary rewrites the layer's data SQL to AVG(y) GROUP BY x [, color, facet, facet2] ORDER BY x, collapsing each cell to its mean.
- identity is the explicit no-op.

smooth is rejected on marks that already emit their own transform (regression, density, violin, histogram), which bake the statistic in themselves.

A straight-line fit: DRAW regression

Where STAT smooth traces the data's local shape, the regression mark draws a single straight line: a least-squares y ~ x fit, grouped by color. Same axes as above, but each species gets a linear trend instead of a LOESS curve:

    VISUALIZE bill_depth_mm AS x, bill_length_mm AS y, species AS color
    FROM penguins
    DRAW point
    DRAW regression
    SCALE x ZERO false
    SCALE x LABEL 'Bill Depth (mm)'
    SCALE y ZERO false
    SCALE y LABEL 'Bill Length (mm)'
    TITLE 'Penguins Bill Depth VS Bill Length (Regression)';
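What "a least-squares y ~ x fit, grouped by color" amounts to can be sketched in plain Python: split the rows by group, then apply the closed-form slope and intercept to each group separately. The helper names and the two toy groups are hypothetical, and this says nothing about how the mark is implemented internally.

```python
from collections import defaultdict


def fit_line(points):
    """Least-squares y ~ x for (x, y) pairs; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # assumes the group has at least two distinct x values
    return mean_y - slope * mean_x, slope


def fit_by_group(rows):
    """rows are (x, y, group) triples; returns {group: (intercept, slope)}."""
    groups = defaultdict(list)
    for x, y, group in rows:
        groups[group].append((x, y))
    return {group: fit_line(points) for group, points in groups.items()}


# Two toy groups with different trends (made-up data).
rows = [(x, 10 + 2 * x, "A") for x in range(5)] + [(x, 40 - 3 * x, "B") for x in range(5)]
lines = fit_by_group(rows)
```

Each group gets its own intercept and slope, which is why a per-color fit can tell a different story from a single line through all the points.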

Ten new distributions (and the full d/p/q/r)

The distribution zoo grew: negative binomial, hypergeometric, Weibull, log-normal, and Poisson, each with the R-style d / p / q triple, plus a complete set of random samplers: rnorm, rt, rchisq, rf, rgamma, rbeta, rexp, rweibull, rlnorm, rpois. Every one is cross-checked against R to six decimals. Simulate, fit, and test without round-tripping to another language.

Here's the Poisson(λ = 3) probability mass function: dpois evaluated over 0…10 and drawn as a bar chart, all in SQL:

    WITH pois AS (
        SELECT k, dpois(k, 3) AS pmf
        FROM range(0, 11) AS t(k)
    )
    VISUALIZE k AS x, pmf AS y
    FROM pois
    DRAW bar;
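As a cross-check on what the bars should look like, the same probability mass function is easy to write down in plain Python. This is the textbook formula P(X = k) = e^(-λ) λ^k / k!, evaluated in log space. The function borrows the name dpois only for readability and is not the extension's code.

```python
import math


def dpois(k, lam):
    """Poisson probability mass P(X = k) for rate lam > 0, computed in log space."""
    if k < 0:
        return 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


# Same grid as the SQL example: k = 0, 1, ..., 10 with rate 3.
pmf = [(k, dpois(k, 3.0)) for k in range(0, 11)]
total = sum(p for _, p in pmf)
```

With λ = 3 the two tallest bars, k = 2 and k = 3, have the same height, and the eleven values sum to just under 1 because the tail beyond k = 10 is cut off.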

And it got fast: read_stat is 52x quicker

Last, and this one's for the XPT crowd: reading SAS / SPSS / Stata files used to be accidentally O(N²), because the reader re-parsed the file from byte zero for every 2,048-row chunk! 😱 v0.6.0 parses each file exactly once into a buffered, spillable column store and streams from it. Here are the numbers:

- a 200k-row, 7 MB XPT: ~67 s → ~1.3 s (52x)
- the real CDISC pilot qs.xpt (122k rows): ~39 s → ~1.1 s (36x)
- 1.6M rows: ~70 min → ~15 s (roughly 280x)
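Why restarting from byte zero is quadratic can be seen by counting parsed rows under a simple cost model. The sketch below assumes that serving a chunk means parsing every row from the start of the file through the end of that chunk. That is a simplification for illustration: the function names are hypothetical, and the counts are a model, not measured timings.

```python
def rows_parsed_restarting(n_rows, chunk_size=2048):
    """Rows parsed when every chunk request restarts at the beginning of the file."""
    total = 0
    for chunk_end in range(chunk_size, n_rows + chunk_size, chunk_size):
        total += min(chunk_end, n_rows)  # parse from row 0 up to the end of this chunk
    return total


def rows_parsed_once(n_rows):
    """Rows parsed when the file is read a single time and chunks are streamed from a buffer."""
    return n_rows


restarting = rows_parsed_restarting(200_000)
single_pass = rows_parsed_once(200_000)
```

Under this model, doubling the number of rows roughly quadruples the work for the restarting reader but only doubles it for the single-pass reader, which is consistent with the speedup growing on the larger files.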

Go play

the-stats-duck is open source: github.com/KoliStat/the-stats-duck (full v0.6.0 notes here). It powers Bedevere and KoliLang, but it's just a DuckDB extension: drop it into any DuckDB and your SQL grows a statistics department.

Not dead. Doing your statistics.