ggsql: A grammar of graphics for SQL

O opensource.posit.co ↗

▲ 452 points • 85 comments • by thomasp85 • 5w ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully human-written

0 %

AI likelihood · overall

Human

100% human-written 0% AI-generated

SEGMENTS · HUMAN 5 of 5

SEGMENTS · AI 0 of 5

WORD COUNT 1,725

PEAK AI % 0% · §1

Analyzed

Apr 20

backend: pangram/v3.3

Segments scanned

5 windows

avg 345 words each

Distribution

100 / 0%

human / AI fraction

Verdict

Human

Pangram v3.3

Article text · 1,725 words · 5 segments analyzed

Human AI-generated

§1 Human · 0%

Today, we are super excited to announce the alpha-release of

ggsql . As the name suggests, ggsql is an implementation of the grammar of graphics based on SQL syntax, bringing rich, structured visualization support to SQL. It is ready for use in Quarto, Jupyter notebooks, Positron and VS Code among others.In this post we will go over some of the motivations that lead us to develop this tool, as well as give you ample examples of its use; so you can hopefully get as excited about it as we are.Meet ggsql# Before we discuss the why, let’s see what ggsql is all about with some examples.The first plot# To get our feet wet, lets start with the hello-world of visualizations: A scatterplot, using the built-in penguins dataset:VISUALIZE bill_len AS x, bill_dep AS y FROM ggsql:penguins DRAW pointThat wasn’t too bad. Sure, it has the verbosity of SQL, but that also means that you can speak your plot code out loud and understand what it does. We can break down what is going on here line-by-line: We initiate the visual query with VISUALIZE and provide a mapping from the built-in penguins dataset, relating x to the data in the bill_len column, and y in the bill_dep column. We draw a point layer that, by default, uses the mapping we defined at the top. With this in place, we can begin to add to the visualization:VISUALIZE bill_len AS x, bill_dep AS y, species AS color FROM ggsql:penguins DRAW pointWe see that a single addition to the mappings adds colored categories to the plot. This gradual evolution of plot code is one of the biggest strengths of the grammar of graphics. There are no predefined plot types, only modular parts that can be combined, added, and removed. To further emphasize this, let’s add a smooth regression line to the plot:VISUALIZE bill_len AS x, bill_dep AS y, species AS color FROM ggsql:penguins DRAW point DRAW smoothWe add a new layer on top of the point layer. This layer also borrows the same mapping as the point layer. Since we color by species, the smooth line is split into one for each species.

§2 Human · 0%

We can continue doing this, adding more mappings, adding or swapping layers, controlling how scales are applied etc until we arrive at the plot we need, however simple or complicated it may be. In the above example we may well end up deciding we are more interested in looking at the distribution of species across the three islands the data was collected from:VISUALIZE island AS x, species AS color FROM ggsql:penguins DRAW barWhile a completely different plot, you can see how much of the code from the previous plot carries over.A complete example# With our first couple of plots under the belt, let’s move on to a complete example. It will contains parts we have not seen before, but don’t worry, we will go through it below, even the parts we’ve already seen before. The example is an adaptation of a visualization created by

Jack Davison for TidyTuesday.WITH astronauts AS ( SELECT * FROM 'astronauts.parquet' QUALIFY ROW_NUMBER() OVER ( PARTITION BY name ORDER BY mission_number DESC ) = 1 ) SELECT *, year_of_selection - year_of_birth AS age, 'Age at selection' AS category FROM astronauts UNION ALL SELECT *, year_of_mission - year_of_birth AS age, 'Age at mission' AS category FROM astronauts

VISUALIZE age AS x, category AS fill DRAW histogram SETTING binwidth => 1, position => 'identity' PLACE rule SETTING x => (34, 44), linetype => 'dotted' PLACE text SETTING x => (34, 44, 60), y => (66, 49, 20), label => ( 'Mean age at selection = 34', 'Mean age at mission = 44', 'John Glenn was 77\non his last mission -\nthe oldest person to\ntravel in space!' ), hjust => 'left', vjust => 'top', offset => (10, 0) SCALE fill TO accent LABEL title => 'How old are astronauts on their most recent mission?',

§3 Human · 0%

subtitle => 'Age of astronauts when they were selected and when they were sent on their mission', x => 'Age of astronaut (years)', fill => nullThat was a lot of code, but on the flip-side we have now covered a lot of the most important aspects of the syntax with one example.At the topmost level there are two parts to this query: The SQL query, and the visualization query. The SQL query is anything from the beginning to the VISUALIZE clause. It is your standard SQL, and it accepts anything your backend accepts (in this blog post we use a DuckDB backend). The result of the query is funnelled directly into the visualization rather than being returned as a table like you’d normally expect.Since the point of this post is not to teach you SQL we won’t spend much more time discussing the SQL query part. The main take away is that everything before the VISUALIZE clause is pure SQL, any resulting table is automatically used by your visualization, and any table or CTE created there is available for referencing in the visualization query.As we saw in the first examples, the SQL query part is optional. If your data is already in the right shape for plotting you can skip it and instead name the source directly in the VISUALIZE clause:VISUALIZE year_of_selection AS x, year_of_mission AS y FROM 'astronauts.parquet'Now, let’s look at the visual query — everything from VISUALIZE and onwards. VISUALIZE marks the end of the SQL query and the beginning of the visualization query (or VISUALISE for those who prefer UK spelling). It can stand on its own or, as we do here, have one or more mappings which will become defaults for every subsequent layer. Mappings are purely for relating data to abstract visual properties. A mapping is like a SELECT where you alias columns to a visual properties (called aesthetics in the grammar of graphics). In the visualization above we say that the age column holds the values used for x (position along the x axis) and the category column holds the values used for fill (the fill color of the entity). We do not say anything about how to draw it yet.Following the VISUALIZE query we have a DRAW clause. DRAW is how we add layers to our visualization. There is a large selection of different layer types in ggsql.

§4 Human · 0%

Some are straightforward: e.g. point for drawing a scatterplot. Some are more involved: histogram (which we use here) requires calculation of derived statistics like binned count. A visualization can have any number of layers and layers will be rendered in the sequence they are defined. DRAW has a sibling clause called PLACE. It is used for annotation and works like DRAW except it doesn’t get data from a table but rather as provided literal values. It follows that our visualization above contains three layers: A histogram layer showing data from our table, a rule annotation layer showing precomputed mean values for each category, and a text annotation layer adding context to the visualization. It is worth mentioning that a layer does not correspond to a single graphical entity. Like with the text layer above, each layer can render multiple separate entities of its type so there is no need to have e.g. 3 line layers to render line plots for 3 different categories.After the DRAW and PLACE clauses we have a SCALE clause. This clause controls how data values are translated into values that are meaningful for the aesthetic. In our case, the category column holds the strings “Age at mission” and “Age at selection” which doesn’t in itself translate to a color value. The clause SCALE fill TO accent tells ggsql to use the “accent” color palette when converting the values mapped to fill to actual colors. Scales can be used for much more, like applying transformations to continuous data, defining break points, and setting specific scale types (like ordinal or binned).The last clause in our visual query is LABEL which allows us to add or modify various text labels like title, subtitle, and axis and legend titles.Stepping back# That was a mouthful. But there are two very silvery linings to it all: You now know the most important aspects of the syntax (there are more, of course, but you can grow into that) Many visualization queries will be much simpler than the one above We have already seen examples of shorter visual queries above but let’s continue with a boxplot of astronaut

§5 Human · 0%

birth year split by sex:VISUALIZE sex AS x, year_of_birth AS y FROM 'astronauts.parquet' DRAW boxplotThat’s much shorter than the last plot code but still, if you are coming from a different plotting system you may even think this is overly verbose (e.g. compared to something like boxplot(astronauts.sex, astronauts.year_of_birth)). Yes, it is longer, but it is also more structured, composable, and self-descriptive. These features (which are a direct result of its grammar of graphics lineage) means that both you and your future LLM coding buddy will have an easier time internalizing the workings of all types of plots that can be made. The 18 years of dominance of ggplot2 (which shares these features) in the R ecosystem is a testament to this.As an example, let’s change the above plot to instead show the same relationship as a jittered scatterplot.VISUALIZE sex AS x, year_of_birth AS y FROM 'astronauts.parquet' DRAW point SETTING position => 'jitter'Or perhaps the jitter follows the distribution of the data so it doubles as a violin plot:VISUALIZE sex AS x, year_of_birth AS y FROM 'astronauts.parquet' DRAW point SETTING position => 'jitter', distribution => 'density'As you can see the syntax and composable nature makes visualization iteration very ergonomic, something that is extremely valuable in both explorative analyses and visualization design.Why ggsql?# Writing a new visualization library from scratch is a big task and you might wonder why we’re doing it again. Some of the reasons are: We want to engage with and help data analysts and data scientists that predominantly work in SQL SQL and the grammar of graphics fit together extremely well We want to create an extremely powerful, code-based, visualization tool that doesn’t require an entire programming language (like R or python) LLMs speak SQL very well and also presents a new interface to data visualization creation We have learned so much from 18 years of

ggplot2 development that we’re excited to apply to a blank canvas Let’s discuss each in turn.