OpenSCAD LLM Benchmark: Building the Pantheon | ModelRift Blog


▲ 417 points • 157 comments • by jetter • 2d ago • HN discussion ↗


We ran a small practical benchmark: give several AI coding tools the same kind of task and ask them to build the Pantheon in OpenSCAD.

ModelRift generates OpenSCAD for every 3D model on the platform. The LLM’s ability to handle spatial geometry directly affects what we can ship, so we track how models improve on this kind of task.

The goal was to see how well each system could turn architectural reference material into parametric CAD code, using the OpenSCAD CLI to render previews and iterate.

The prompt was intentionally visual and architectural: build the Pantheon from reference images, including the rotunda, dome, portico, columns, pediment, and recognizable front details.

Overview of the six current benchmark results. Each thumbnail is labeled with the client and model used for that run.

Why Pantheon?

This was not a basic OpenSCAD syntax test. All of the current coding LLMs can produce a simple “cube with a hole” model in OpenSCAD perfectly well. That kind of prompt mostly tests whether the model knows difference(), cube(), and cylinder().

The Pantheon is more useful as a benchmark because it sits in a middle ground. OpenSCAD is not a good fit for natural sculpted models, organic surfaces, or character-like geometry. It is much better at Boolean operations, radial symmetry, extrusions, and clean constructive shapes. The Pantheon has a large radial rotunda and dome, a central oculus, straight portico faces, columns, stepped bases, and a triangular pediment. That mix makes it illustrative without being impossible.

It is also recognizable. A weak result still looks vaguely like a domed building, but a better result has to get the relationship between the round drum, the rectangular portico, the dome rings, and the front facade roughly right.

Prompt

The prompt used for the benchmark was:

see two ref images and build .scad file with openscad implementation of pantheon.
use openscad CLI (available) to preview your work (by rendering openscad model to .png) and iterate until you are happy with the result.

Reference Images

Reference #1 is the front facade view on the left. Reference #2 is the aerial/top view on the right. The combined image was generated with ffmpeg from the two source images used in the benchmark.

Results

The six current benchmark outputs, labeled by client and model.


| Tool and model | Time | Quality | Summary |
| --- | --- | --- | --- |
| Cursor 3.5 / Composer 2.5 | ●●●●● 5/5, fastest | ●○○○○ 1.4/5 | Quickest run, but the weakest output. It captured a dome and portico, but the proportions, color discipline, and architectural details were the poorest. |
| Codex 5.5 High | ●●●●○ 4/5, baseline | ●●●○○ 3.0/5 | Strong detail density, including the inscription on the entablature. The main issue was a mismatch between preview renders and the final STL. |
| Claude Code 2.1 / Opus 4.7 | ●●○○○ 2/5, slower | ●●●○○ 3.0/5 | Better structure than Cursor, with a clearer portico and stepped base, but too monochrome and less convincing than the stronger runs. |
| Claude Code 2.1 / Sonnet 4.6 | ●○○○○ 1/5, slowest | ●●●◐○ 3.4/5 | The model had clean massing, balanced proportions, and the most plausible overall read among the original autonomous batch, but took the longest to implement. |
| Google Antigravity 2.0 / Gemini 3.5 Flash High (best autonomous result) | ●○○○○ 1/5, around 12 min | ●●●●◐ 4.5/5 | Strongest autonomous output. It used real Pantheon dimensions, included the inscription, and was the only agent to implement the signature interior coffered ceiling pattern. |
| ModelRift / Gemini Flash 3.0 (human-in-the-loop winner) | ●○○○○ 1/5, about 10 min | ●●●◐○ 3.8/5 | Best non-autonomous result. It used ModelRift’s iterative annotation workflow with Gemini Flash 3.0 and took about 2x the Claude Code time. |

The scores are relative to this benchmark only. They are not general model rankings, and the time score reflects observed implementation time, not project publication timestamps. The quality scores are intentionally conservative: even the best result is not close to a perfect Pantheon model.

Workflow Notes

The client workflow mattered almost as much as the model.


Codex Desktop shows the images that the LLM has loaded into context directly inside the conversation. For visual CAD work, that is very convenient: you can see whether the agent is actually using the same references you intended. Cursor Agent and Claude Code CLI were workable, but their process views made visual context less explicit.

All tested systems handled the local OpenSCAD toolchain well. OpenSCAD was installed on the test Mac and available on PATH, and every agent used it successfully to render PNG previews during iteration. The limiting factor was not tool access. It was geometric judgment, camera setup, and whether a previewed model exported into a clean final mesh.

Codex also made the preview iteration easier to follow. It exposed the reference images, the OpenSCAD file edits, and generated preview images in the same thread.

After the public benchmark result, Codex attempted to investigate and fix the problematic roof and entablature export issue. That follow-up was not included in the final benchmark results, because the published comparison uses the original submitted models.

Cursor had the fastest interaction loop, and its UI showed a useful plan plus generated OpenSCAD code side by side. The output quality still lagged behind the slower runs.

Claude Code was more terminal-centric. It did read the images and iterate with OpenSCAD commands, but the process was less visual while the model was being built.
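The render-and-export loop described in these notes comes down to a couple of OpenSCAD CLI calls. Below is a minimal Python sketch of how a harness could build them. The helper names, file names, and default image size are illustrative assumptions, not what any of the agents actually ran. The `-o`, `--imgsize`, `--camera`, `--autocenter`, `--viewall`, and `--render` options are standard OpenSCAD command-line flags.

```python
import shutil
import subprocess


def preview_command(scad_file, png_file, size=(1024, 768), camera=None,
                    full_render=False):
    """Build an OpenSCAD argument list that renders scad_file to a PNG."""
    cmd = ["openscad", "-o", str(png_file), f"--imgsize={size[0]},{size[1]}"]
    if full_render:
        # Ask for full geometry evaluation instead of the quick preview.
        cmd.append("--render")
    if camera is None:
        # Let OpenSCAD frame the whole model instead of guessing a viewpoint.
        cmd += ["--autocenter", "--viewall"]
    else:
        # Seven values: translate x, y, z, rotate x, y, z, distance.
        cmd.append("--camera=" + ",".join(str(v) for v in camera))
    cmd.append(str(scad_file))
    return cmd


def export_command(scad_file, stl_file):
    """Build the argument list for the final mesh export."""
    return ["openscad", "-o", str(stl_file), str(scad_file)]


def run_if_available(cmd):
    """Run cmd only when the binary is on PATH; return True on success."""
    if shutil.which(cmd[0]) is None:
        return False
    return subprocess.run(cmd, check=False).returncode == 0


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(preview_command("pantheon.scad", "pantheon_front.png"))
    print(export_command("pantheon.scad", "pantheon.stl"))
```

Passing `full_render=True` may bring the PNG closer to what the exported STL will contain, which is relevant to the preview-versus-mesh gap noted above. Fixing the camera with an explicit `camera` tuple keeps successive previews comparable between iterations.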

Google Antigravity 2.0 / Gemini 3.5 Flash High

Short demo clip of the Antigravity result and workflow.

We added this run on May 22, 2026, immediately after Google launched Antigravity 2.0 at I/O 2026 and published Gemini 3.5 Flash on May 19, 2026. It is a good early signal for Flash 3.5: the result was the best fully autonomous model in this benchmark.

The product context was messy. Antigravity 1.0 was a VS Code-based IDE. Antigravity 2.0 is closer to Codex Desktop: an agent-first desktop app with plans, task execution, previews, and less of the old editor-centric workflow. That migration drew a lot of release-week criticism because users who wanted the previous IDE experience did not have a smooth path back other than downgrading or pinning the older app.

Even with that rough migration, Flash 3.5 High was impressive here.


Antigravity did something the other autonomous agents did not: it searched for real Pantheon parameters instead of only eyeballing the reference images. The plan and code used explicit measurements for the rotunda, dome, portico, and oculus, then turned those into parametric OpenSCAD values.

The implementation plan was more architectural than the others: “Implement a detailed, visually stunning, and dimensionally accurate 3D model of the Pantheon in Rome using OpenSCAD.”

It also proposed a cutaway mode, which mattered because the Pantheon is not just a dome from the outside: “To showcase both the exterior (stepped rings, portico) and the interior (coffers, niches, perfect spherical proportion), I will include a toggle in the code show_cutaway = false;.”

The strongest detail was the ceiling. The plan called out the actual coffer structure: “The Pantheon dome interior has 5 rings of 28 coffers. Subtracting these mathematically in OpenSCAD is highly detailed and looks amazing.”
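To give a feel for what “5 rings of 28 coffers” means as parametric geometry, here is a minimal Python sketch that lays out the 140 coffer positions and emits an OpenSCAD module of cutter cubes. It is not Antigravity's code. The elevation range, cutter size, and depth are hypothetical values, and the radius is only an approximation (about half of the commonly cited 43.3 m interior span).

```python
RINGS = 5        # from the plan quoted above
PER_RING = 28    # from the plan quoted above


def coffer_angles(rings=RINGS, per_ring=PER_RING, min_elev=12.0, max_elev=58.0):
    """Return (elevation_deg, azimuth_deg) for every coffer.

    min_elev and max_elev are illustrative guesses, not surveyed values.
    """
    angles = []
    for ring in range(rings):
        # Spread the rings evenly between the lowest and highest elevation.
        t = ring / (rings - 1) if rings > 1 else 0.0
        elev = min_elev + t * (max_elev - min_elev)
        for i in range(per_ring):
            angles.append((elev, 360.0 * i / per_ring))
    return angles


def coffer_module(angles, radius=21.65, size=1.5, depth=0.6):
    """Emit an OpenSCAD module holding one cutter cube per coffer.

    radius is an assumed interior dome radius in metres; size and depth
    are hypothetical cutter dimensions.
    """
    lines = ["module coffer_cutters() {"]
    for elev, azim in angles:
        # rotate() applies X, then Y, then Z: tilt up by elev, then spin by azim.
        lines.append(
            f"  rotate([0, {-elev:.2f}, {azim:.2f}]) "
            f"translate([{radius}, 0, 0]) "
            f"cube([{depth}, {size}, {size}], center = true);"
        )
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(coffer_module(coffer_angles()))
```

The emitted `coffer_cutters()` module would be subtracted from the dome shell inside a `difference()`. The real coffers shrink toward the oculus and are stepped in depth, which this sketch leaves out.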


Antigravity was the only autonomous agent that implemented the Pantheon’s signature interior ceiling pattern: repeated square coffers visible through the oculus.

The dedicated cutaway render makes the same choice easier to see:

The exterior result also had several details that usually get skipped in quick OpenSCAD outputs: mixed grey and red column materials, a readable inscription, stepped roof rings, and a correct broad relationship between the rotunda, intermediate block, portico, and dome.

The score is 4.5/5 for quality and 1/5 for speed. It was not fast, but it moved the autonomous ceiling for this benchmark. Flash 3.5 looks very promising for spatial code generation when paired with a tool that can plan, render, inspect, and revise.

ModelRift / Gemini Flash 3.0

This result used ModelRift with Gemini Flash 3.0 and a human-in-the-loop process. It was not an autonomous single-pass benchmark like the first four runs. The workflow took about 10 minutes, roughly 2x the Claude Code time, so it gets the same 1/5 speed score.

This benchmark was run on May 21, 2026, shortly after Gemini 3.5 Flash was published. The Antigravity result above shows that 3.5 Flash is strong, but for ModelRift’s default model we still have to balance quality against cost and latency: Google’s published Gemini API pricing lists Gemini 3.5 Flash standard pricing at $1.50 input and $9.00 output per 1M tokens, while Gemini 3 Flash is listed at $0.50 input and $3.00 output. That is a 3x increase on both input and output over the previous Flash generation, and far above the older Gemini 1.5 Flash-era cost baseline.

The quality was better than the original autonomous batch: 3.8/5. The model is still not perfect, but the portico, column layout, roof, dome ribs, and overall massing are more coherent.
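The pricing comparison above is simple arithmetic, sketched here in Python. The per-1M-token prices are the ones quoted in the text. The dictionary keys are plain labels rather than API model identifiers, and the token counts in the example are hypothetical, not measured from any benchmark run.

```python
# USD per 1M tokens, as quoted above.
PRICES_PER_1M = {
    "gemini-3.5-flash": {"input": 1.50, "output": 9.00},
    "gemini-3-flash": {"input": 0.50, "output": 3.00},
}


def price_ratio(new, old, kind):
    """Return how many times more expensive `new` is than `old` for one token kind."""
    return PRICES_PER_1M[new][kind] / PRICES_PER_1M[old][kind]


def job_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one generation job at the listed prices."""
    price = PRICES_PER_1M[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


if __name__ == "__main__":
    for kind in ("input", "output"):
        print(kind, price_ratio("gemini-3.5-flash", "gemini-3-flash", kind))
    # Hypothetical job size: 20k input tokens, 8k output tokens.
    for model in PRICES_PER_1M:
        print(model, round(job_cost(model, 20_000, 8_000), 4))
```

Because both input and output prices scale by the same factor, any job costs three times as much on the newer model regardless of its input-to-output mix.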


The main difference was that visual feedback could be attached directly to the current render instead of being described only in text.

The first ModelRift pass produced a valid model quickly, but the roof and portico details were still rough. That is where annotation mode helped.

Instead of writing a long spatial correction, the feedback could point at the missing or weak features on the render.

This is the workflow ModelRift is designed around: generate a model, inspect it in the browser, draw visual notes on the render, and ask the AI to revise the OpenSCAD. For spatial CAD tasks, that loop is much more precise than text-only instructions.

Codex 5.5 High

Codex 5.5 High produced the densest model. It included the rotunda, dome ribs, oculus, layered masonry bands, a front portico, columns, surrounding base details, and even text on the entablature: M AGRIPPA L F COS TERTIVM FECIT.

That inscription was impressive because text in OpenSCAD is not just decorative from a modeling perspective. It has to be placed, extruded, oriented, and kept thin enough to read without overwhelming the geometry.

The failure mode was also interesting. During iteration, the render previews looked better than the final exported STL. In the final result, the entablature and portico roof area developed a problematic ceiling-like surface that changed how the front assembly read. So Codex showed strong spatial reasoning and ambition, but it also exposed a real export-risk issue: preview correctness is not always final mesh correctness.

The editor screenshot above shows one of the intermediate project previews. The final public STL preview differs enough to matter, especially around the portico and entablature.

A later Codex attempt did analyze that issue and started removing the high-risk contact patterns near the portico roof and dome junction. That repair pass was useful process evidence, but it is not counted in the table because it happened after the benchmark result was recorded.

Claude Sonnet

Claude Sonnet produced the cleanest model in the original autonomous batch. It did not attempt the same level of micro-detail as Codex, but the silhouette was cleaner and the major architectural parts fit together more naturally.

The dome, drum, portico, and column layout read as one building rather than a set of adjacent primitives.
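The export risk described in the Codex section (a preview that looks right while the final STL does not) suggests checking the exported mesh itself rather than only the PNG. Here is a minimal Python sketch of one such check: it counts how many triangles share each edge and reports edges that are not shared by exactly two. It assumes an ASCII STL file, and the function names are hypothetical. We do not know that this test would have flagged the specific surface Codex produced, since unwanted geometry can still be a valid closed mesh.

```python
from collections import Counter


def parse_ascii_stl(text):
    """Return triangles as tuples of three (x, y, z) vertices from ASCII STL text."""
    verts = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "vertex":
            verts.append(tuple(float(p) for p in parts[1:]))
    usable = len(verts) - len(verts) % 3  # ignore a trailing partial facet
    return [tuple(verts[i:i + 3]) for i in range(0, usable, 3)]


def edge_problems(triangles):
    """Return {edge: count} for edges not shared by exactly two triangles.

    An empty result means every edge is shared by exactly two facets, which
    is a necessary condition for a closed, printable surface.
    """
    counts = Counter()
    for tri in triangles:
        for i in range(3):
            # Undirected edge between consecutive vertices of the facet.
            counts[frozenset((tri[i], tri[(i + 1) % 3]))] += 1
    return {edge: n for edge, n in counts.items() if n != 2}


if __name__ == "__main__":
    # A closed tetrahedron as a tiny stand-in for an exported model.
    a, b, c, d = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
    closed = [(a, b, c), (a, b, d), (a, c, d), (b, c, d)]
    print(len(edge_problems(closed)))       # closed surface
    print(len(edge_problems(closed[:3])))   # one facet removed
```

A check like this could run after every export in the iteration loop, so that a mesh problem shows up as a failed step rather than as a visual surprise in the published model.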