DuckDB Internals: Why is DuckDB Fast? (Part 1)

G greybeam.ai ↗

▲ 479 points • 151 comments • by marklit • 1w ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is a mix of AI-generated, AI-assisted, and human-written content

32 %

AI likelihood · overall

Mixed

68% human-written 24% AI-generated

SEGMENTS · HUMAN 3 of 6

SEGMENTS · AI 3 of 6

WORD COUNT 1,793

PEAK AI % 98% · §4

Analyzed

Jun 19

backend: pangram/v3.3

Segments scanned

6 windows

avg 299 words each

Distribution

68 / 24%

human / AI fraction

Verdict

Mixed

Pangram v3.3

Article text · 1,793 words · 6 segments analyzed

Human AI-generated

§1 Human · 4%

DuckDB has gone from a research project at CWI Amsterdam in 2019 to one of the most widely adopted databases of the past decade. The list of places it shows up is long: notebooks, ETL pipelines, dashboards, CI test runners, embedded analytics inside SaaS products, even an iPhone running TPC-H at scale factor 100.iPhone in a box of dry ice, running TPC-H. (source)Companies have started building real products around it. MotherDuck is wrapping DuckDB into a cloud data warehouse. BI and data app platforms like Hex, Omni, and Evidence use it as an in-app execution engine and cache. Fivetran's Managed Data Lake Service uses DuckDB inside its data-lake writer for merging and compaction. Rill builds an open-source BI tool on top of it. We use it at Greybeam too, powering millions of queries for BI and analytics workloads.What is DuckDB?DuckDB is an in-process analytical SQL database. Analytical means it's optimized for the kind of queries that scan millions of rows to filter, aggregate, and join — not the kind that look up a single record by primary key. In-process means there's no server. You don't connect to DuckDB; you load it as a library inside your program, the same way you'd load NumPy or Polars.DuckDB has received widespread adoption because it's just so damn easy to use. It ships as a single binary under 20 MB with no external dependencies. You install it with pip install duckdb, brew install duckdb, or by linking libduckdb into a C++ project. It opens any directory of Parquet, CSV, or JSON files like they were already a SQL database.DuckDB also happens to be one of the fastest single-node analytical engines available, regularly holding its own against entire clusters that cost millions of dollars per year.This is the first post in a three-part deep dive into DuckDB internals. We'll follow a query from the moment it enters the engine to the moment the result is returned, and at each stage we'll look at the design choice that makes it fast.

§2 Human · 11%

DuckDB's speed comes from a handful design choices: In-process execution Columnar, compressed storage with zonemaps Vectorized execution Morsel-driven parallelism Snapshot isolation with optimistic MVCC And much more!

This post covers the path from your SQL to the moment the engine is ready to run the query, plus the storage layer the query will read from. By the end you'll have a clear mental model of DuckDB's setup work and storage layout. Query execution is covered in Part 2 so make sure to subscribe!Queries Run In-ProcessYou point DuckDB at a 6 GB Parquet file on your laptop. The results come back in under a second. No cluster, no setup, no migration, no CREATE TABLE. How does that work?SELECT * FROM 'orders.parquet';Most analytical databases are servers. Snowflake, Postgres, BigQuery, Redshift. You open a connection, send SQL over TCP (a protocol to send data over a network), and wait for results to come back. Along the way, every record in the result is serialized into a wire protocol, transmitted across the network, and deserialized on the other end.Serializing and DeserializingInside a database, a query result lives as typed values at specific memory addresses.

§3 AI · 94%

A 64-bit integer here, a pointer to a string there. Those addresses only exist in that process. To send the result to a client on another machine, the database has to rewrite every value into an agreed byte format (Postgres has its own, MySQL has another, with ODBC and JDBC as client-side APIs that drivers expose on top) so it can be pushed through a TCP socket. The client then parses those bytes back into its own native types. Every value may be touched multiple times, once to encode and once to decode, and on a large result set, that work often takes longer than the query itself.DuckDB is not a server. It's a library. There is no DuckDB daemon, no port, no cluster. You load libduckdb into your program and call functions directly against it.In 2017, Mark Raasveldt and Hannes Mühleisen published Don't Hold My Data Hostage, a paper measuring what actually happens when you pull a result set out of a warehouse. They found that the client protocol itself — ODBC, JDBC, and similar row-by-row value APIs — was often the slowest single step in the entire query, sometimes dwarfing the time the database spent computing the answer.Two costs drive this. The first is raw bandwidth: a typical gigabit Ethernet link caps you at around 125 MB/s, and a large result set can take longer to transmit than it took to compute. The second is per-value overhead. ODBC and JDBC hand back results one row and one value at a time, which means the client makes a separate function call for every field in every row. On a 100-million-row result, that's hundreds of millions of function calls, each one doing its own little memory copy, type check, and string allocation.ADBC transfers data between systems in columnar Arrow format, which avoids the row-by-row serialization/deserialization that ODBC and JDBC require. Our friends at Columnar are making this commonplace.Difference between ODBC and ADBC when connecting to Snowflake.DuckDB sidesteps both bottlenecks by living in the same process as the client.When a Python script runs con.sql("SELECT ... FROM my_df") against a pandas dataframe, DuckDB can use a feature called a replacement scan.

§4 AI · 98%

Instead of copying the dataframe into an internal table first, DuckDB replaces the table reference with a function that reads from the dataframe when the query runs.In the best case, DuckDB can read the same underlying buffers the Python process already owns, so it avoids materializing a second full copy of the data. This is zero-copy! If NumPy says "here's a buffer (contiguous chunk of memory) of 1 million int64 values," DuckDB can often read that same buffer directly because it understands the same physical layout. In practice, whether the path is truly zero-copy depends on the dataframe’s physical layout, column types, null representation, and string storage. If the types or layouts do not line up, DuckDB may allocate converted buffers for some columns.Arrow is the cleanest version of this story because Arrow is already a columnar, typed memory format designed for sharing data between systems. That is why returning DuckDB results as Arrow, or querying Arrow-backed data, can avoid much of the row-by-row conversion overhead that traditional APIs impose.From SQL to Logical PlanOnce your SQL reaches DuckDB, it goes through the usual stages: parse, bind, plan, optimize.ParsingThe first step is to parse SQL into an abstract syntax tree (AST). DuckDB uses a fork of the Postgres parser, which is part of why DuckDB's dialect feels so familiar. An AST is a tree representation of your query where each node is a syntactic construct: a SELECT statement, a column reference, a function call, a join, a literal. Parsing turns the flat string SELECT sum(l_quantity) FROM lineitem WHERE l_shipdate > '2024-01-01' into a structured object the engine can actually reason about.Select( expressions=[ Sum( this=Column( this=Identifier(this=l_quantity, quoted=False)))], from_=From( this=Table( this=Identifier(this=lineitem, quoted=False))), where=Where( this=GT( this=Column( this=Identifier(this=l_shipdate, quoted=False)), expression=Literal(this='2024-01-01', is_string=True))))AST from the SQLGlot library.A tree structure is what lets the rest of the engine do its job. The binder walks the nodes to resolve l_quantity to a specific column in a specific table.

§5 AI · 95%

The optimizer pattern-matches subtrees to recognize that the WHERE predicate can be pushed down into the scan. The physical planner maps function call nodes to executable operators. None of these passes can operate on raw SQL. They need to traverse, pattern-match, and rewrite a typed structure. BindingThe next step is binding, which resolves every name in the AST against the catalog. lineitem becomes a specific table with a known schema. l_quantity becomes a specific column with a known type. sum becomes a specific aggregate function whose input type matches that column. Type checking happens here too: comparing l_shipdate to the string '2024-01-01' works because the binder coerces the literal to a date.The output is a bound tree where every node knows what it refers to and what type it produces. Errors like unresolved columns, ambiguous references, and type mismatches surface at this stage.At this point, DuckDB has turned raw SQL text into a typed tree. The engine no longer sees l_quantity as just a string in a query; it sees a specific column with a specific type from a specific table.The OptimizerIn DuckDB, the optimizer consists of a sequence of small, focused transformations that you can, in fact, inspect and disable individually.D SELECT * FROM duckdb_optimizers(); ┌────────────────────────────┐ │ name │ │ varchar │ ├────────────────────────────┤ │ expression_rewriter │ │ filter_pullup │ │ filter_pushdown │ │ empty_result_pullup │ │ cte_filter_pusher │ │ regex_range │ │ in_clause │ │ join_order │ │ deliminator │ │ unnest_rewriter │ │ unused_columns │ │ statistics_propagation │ │ common_subexpressions │ │ common_aggregate │ │ column_lifetime │ │ limit_pushdown │ │ row_group_pruner │

§6 Human · 13%

│ top_n │ │ top_n_window_elimination │ │ build_side_probe_side │ │ compressed_materialization │ │ duplicate_groups │ │ reorder_filter │ │ sampling_pushdown │ │ join_filter_pushdown │ │ extension │ │ materialized_cte │ │ sum_rewriter │ │ late_materialization │ │ cte_inlining │ │ common_subplan │ │ join_elimination │ │ window_self_join │ └────────────────────────────┘ 33 rows Running SET disabled_optimizers = 'filter_pullup, join_order' turns specific passes off so you can see what they were doing. Here are a few interesting optimizers:Filter pushdownThis is a classic database optimization: move WHERE predicates as close to the scan as possible so you prune data as early as possible. DuckDB first pulls filters up to the top of the plan so they can be combined and reorganized, then pushes them back down as far as possible.Read from bottom up. Filter pushdown moves filter earlier in tree when possible.Subquery unnestingCorrelated subqueries traditionally force a database to run the inner query once per outer row, which is slow. DuckDB implements techniques from the Unnesting Arbitrary Queries paper to rewrite these as joins, which are dramatically faster.Dynamic join-filter pushdownDuring a hash join (more on hash joins here), the build side has to be fully read before the probe side starts. DuckDB takes advantage of that ordering: once the build side is in memory, it computes the min and max of the join key values it actually contains, then pushes those bounds back into the probe-side scan as a runtime filter. If the build side turned out to contain values only between 100 and 200, the probe scan can use the table's zonemaps to skip any row groups outside that range before reading them.When the build side has fewer than 50 distinct join key values, the filter becomes an IN list instead of a min-max range, which is more precise and skips even more rows.Join order optimizationJoin order is the most consequential decision the optimizer makes. The order in which joins run determines how big each intermediate result is.