Laguna XS.2 and M.1: A Deeper Dive

P poolside.ai ↗

▲ 101 points • 50 comments • by tosh • 2mo ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully human-written


We’ve released the first two models in the Laguna family, Laguna M.1 and Laguna XS.2, alongside the runtime we use to train and operate agents, available through two product experiences in preview.

Laguna M.1 came first, finishing pre-training at the end of last year; it's the foundation for everything else we're building across the family. Laguna XS.2 is a much smaller model, but remarkably capable for its size, and it's our first open-weight release. Both models are free to use for a limited time via our API and on OpenRouter, and Laguna XS.2 weights are also available under an Apache 2.0 license.

Laguna XS.2 and Laguna M.1 are agentic coding models built for long-horizon work. To date, we’ve been focused on serving our government and public sector clients with capable models deployable into the highest-security environments. And while our commitment to these customers remains, we’re now ready to share where we are with the world. We’re also excited to release the weights of Laguna XS.2, our newest generation model, to the open ecosystem to support builders and the wider research community.

We're working toward models that enable more capable agents, and we believe the path runs through coding capability and increasingly long-horizon tasks. Creating software is the core skill through which many other capabilities get expressed.

Today, most agents interact with the world through tool calling, where structured interfaces restrict agents to a fixed set of actions defined in advance. We think this is a transitional pattern. Software is a much more expressive interface. An agent that can write and execute code can compose actions, parallelize work, and build its own ad-hoc systems to interact with the world.

These models are the work of the roughly 60 people who make up our Applied Research organization, across architecture, data, pre-training, and reinforcement learning. We're excited to bring this work into the world and see what the community builds with it.


[Figure legend: Laguna M.1 (225B-A23B), Laguna XS.2 (33B-A3B), Qwen3.5 (397B-A17B), Qwen3.5 (35B-A3B), Qwen3.6 (35B-A3B), Claude Sonnet 4.6 (-)]

Laguna M.1 is our most capable model to date and completed pre-training at the end of last year. It's a 225B total parameter Mixture of Experts (MoE) model with 23B activated parameters, trained completely in-house and from scratch on 30T tokens, using 6,144 interconnected NVIDIA Hopper GPUs. Laguna M.1 reaches 46.9% on SWE-bench Pro and 40.7% on Terminal-Bench 2.0.

[Figure legend: Laguna M.1 (225B-A23B), Devstral 2 (123B dense†), GLM-4.7 (355B-A32B), DeepSeek-V4-Flash (284B-A13B), Qwen3.5 (397B-A17B), Claude Sonnet 4.6 (-)]

† We have chosen to include dense models with larger activated parameter counts to highlight the relative efficiency of MoE models.

Laguna XS.2 is our second-generation MoE and our first open-weight model, built on everything we've learned since training Laguna M.1 across data (including synthetic data) and RL. At 33B total parameters with 3B activated, and trained on 30T tokens, it's a highly capable open-weight agentic coding model in its weight class, reaching 44.5% on SWE-bench Pro and 30.1% on Terminal-Bench 2.0. The weights are available for download today under Apache 2.0.
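To make the "total versus activated" parameter counts above concrete, here is a small arithmetic sketch. The parameter counts come from the model descriptions and chart labels in this post; the helper names (`active_fraction`, `active_ratio`) are ours, and treating activated parameters as a proxy for per-token compute is a rough simplification, not a measured cost.

```python
# Parameter counts in billions, as quoted in this post.
LAGUNA = {
    "Laguna M.1": {"total_b": 225, "active_b": 23},
    "Laguna XS.2": {"total_b": 33, "active_b": 3},
}
# Dense comparison models activate every parameter for every token.
DENSE_TOTAL_B = {"Devstral 2": 123, "Devstral Small 2": 24, "Gemma 4": 31}


def active_fraction(total_b, active_b):
    """Share of a model's parameters that are activated per token."""
    return active_b / total_b


def active_ratio(dense_total_b, moe_active_b):
    """How many times more parameters a dense model activates than an MoE."""
    return dense_total_b / moe_active_b


for name, params in LAGUNA.items():
    pct = 100 * active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {pct:.1f}% of parameters activated per token")

ratio = active_ratio(DENSE_TOTAL_B["Devstral 2"], LAGUNA["Laguna M.1"]["active_b"])
print(f"Devstral 2 activates about {ratio:.1f}x the parameters of Laguna M.1")
```

This is the sense in which the † footnote describes the dense models as having larger activated parameter counts.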


[Figure legend: Laguna XS.2 (33B-A3B), Devstral Small 2 (24B dense†), Gemma 4* (31B dense†), Qwen3.5 (35B-A3B), Qwen3.6 (35B-A3B), Claude Haiku 4.5* (-), GPT-5.4 Nano (-)]

† We have chosen to include dense models with larger activated parameter counts to highlight the relative efficiency of MoE models.

Our agent harness, an Agent Client Protocol (ACP) server, is the same carrier we use for agent RL training and evaluation. We're releasing it alongside the models because we believe models and agents should be seen and used together as the gap between them closes.

| Benchmark | Laguna M.1 (225B-A23B) | Devstral 2 (123B dense) | GLM-4.7 (355B-A32B) | DeepSeek-V4-Flash (284B-A13B) | Qwen3.5 (397B-A17B) | Claude Sonnet 4.6 (-) |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | 72.5 | 72.2 | 73.8 | 79.0 | 76.2 | 79.6 |
| SWE-bench Multilingual | 67.3 | 61.3 | 66.7 | 73.3 | 69.3 | - |
| SWE-bench Pro | 46.9 | - | - | 52.6 | 50.9 | - |
| Terminal-Bench 2.0 | 40.7 | 32.6 | 41.0 | 56.9 | 52.5 | 59.1 |

| Benchmark | Laguna XS.2 (33B-A3B) | Devstral Small 2 (24B dense) | Gemma 4 (31B dense) | Qwen3.5 (35B-A3B) | Qwen3.6 (35B-A3B) | Claude Haiku 4.5 (-) | GPT-5.4 Nano (-) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | 68.2 | 68.0 | 52.0 | 69.2 | 73.4 | 73.3 | - |
| SWE-bench Multilingual | 62.4 | 55.7 | 51.7 | 60.3 | 67.2 | - | - |
| SWE-bench Pro | 44.5 | - | 35.7 | 44.6 | 49.5 | 39.5 | 52.4 |
| Terminal-Bench 2.0 | 30.1 | 22.5 | 42.9 | 40.5 | 51.5 | 29.8 | 46.3 |
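The footnotes report each Laguna score as mean pass@1 averaged over several runs. As a minimal sketch of that aggregation, assuming each run records one pass/fail outcome per task: the task names and outcomes below are made up, and this is not the Harbor Framework's API.

```python
from statistics import mean


def pass_at_1(run):
    """Fraction of tasks solved in a single run (one attempt per task)."""
    return sum(run.values()) / len(run)


def mean_pass_at_1(runs):
    """Average the single-run pass rate over independent runs, as a percentage."""
    return 100.0 * mean(pass_at_1(run) for run in runs)


# Hypothetical outcomes for four tasks over three independent runs.
runs = [
    {"task-a": True, "task-b": False, "task-c": True, "task-d": False},
    {"task-a": True, "task-b": True, "task-c": True, "task-d": False},
    {"task-a": False, "task-b": False, "task-c": True, "task-d": False},
]

print(f"mean pass@1 over {len(runs)} runs: {mean_pass_at_1(runs):.1f}%")
```

Averaging over runs matters here because sampling at temperature=0.7 makes each run's outcome vary.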

Footnotes:

All benchmarking for Laguna M.1 and Laguna XS.2 was completed using the Laude Institute's Harbor Framework with our agent harness, using a maximum of 500 steps and sandboxed execution using 8 GB RAM/2 CPUs (with the exception of Terminal-Bench 2.0; see below). The same sampling parameters were used across both models and for all benchmarking: temperature=0.7 and top_k=20.

Some base task images and verifiers were patched to fix infrastructure reliability issues inherent in task setup, such as rate limits on third-party dependencies in external registries used by the verifier. More details outlining these updates and other findings will follow in a future technical blog post.

SWE-bench Pro: mean pass@1 averaged over 3 runs.
SWE-bench Verified: mean pass@1 averaged over 4 runs.
SWE-bench Multilingual: mean pass@1 averaged over 7 runs.
Terminal-Bench 2.0: mean pass@1 averaged over 5 runs, using 48 GB RAM/32 CPUs.

* We used the highest publicly-referenced scores for all comparison models across each benchmark. In all cases these were official scores published in release blog posts or equivalent, with the exception of Gemma 4 31B IT where the highest published scores were reported by the Qwen team, and Claude Haiku 4.5 where the highest published (verified) scores for SWE-bench Pro and Terminal-Bench 2.0 are from their respective official leaderboards.

Open weights

Laguna XS.2 is our first open-weight model.


Until now, we've been focused on building for the public sector, where security requirements like on-prem and air-gapped deployments make shipping frontier models a uniquely hard but important problem. That work continues and remains core to what we do.

At the same time, we believe the West needs strong open-weight models, and we want to contribute to that ecosystem. The fastest way for us to improve our models is to bring the world along in building and evaluating them, and we want people to know they can look to us to contribute going forward.

The most exciting applications of foundation models come from people building on top of capable starting points. If you want to fine-tune, quantize, or serve, the weights are yours. We will release Laguna XS.2-base soon.

OpenRouter · Ollama

Coming soon

We're bringing Laguna XS.2 to more leading frameworks in the coming weeks, with the help of our partners and the community.

Working with NVIDIA

Every aspect of our Laguna series, from data curation and pre-training through post-training, was conducted on NVIDIA hardware. Additionally, Laguna XS.2 is supported in NVIDIA TensorRT-LLM on Day 1. We're also providing an NVFP4 version of Laguna XS.2, so you can expect strong performance on NVIDIA Blackwell architecture.

Model building

We train all our models from scratch. That means our own data work, our own training codebase (Titan, which we cover in this blog post), and our own agent RL infrastructure. Laguna pushed the limits of that stack, particularly across three domains: our data pipeline including synthetic data, how we optimized the efficiency of the Muon optimizer, and our async on-policy RL scheme.

Data and automixing

Both Laguna M.1 and XS.2 were trained on more than 30T tokens. Reaching that scale, and using it productively in training, required pushing the limits of data generation, processing, curation, and mixing.

Large-scale web data

We take great care in building and curating our datasets. We treat web data curation as a joint optimization of quality and diversity. We model quality as a continuous, multi-dimensional signal and rank data using a composite score, using models heavily across the stack for quality signals. Crucially, we don't only keep top-quality data.
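As a rough illustration of this curation idea (score documents with a composite quality signal, then keep a share of the mid- and lower-quality buckets instead of only the top one), here is a minimal sketch. The signal names, weights, bucket thresholds, and retention rates are all hypothetical values chosen for the example; the post does not specify them.

```python
import random

# Hypothetical quality signals and weights; the real pipeline uses model-based scorers.
SIGNAL_WEIGHTS = {"educational_value": 0.5, "coherence": 0.3, "cleanliness": 0.2}
# Assumed (bucket name, minimum score, retention rate), ordered from best to worst.
BUCKETS = [("high", 0.7, 1.0), ("mid", 0.4, 0.5), ("low", 0.0, 0.2)]


def composite_score(signals):
    """Collapse several quality signals in [0, 1] into one weighted score."""
    return sum(SIGNAL_WEIGHTS[name] * signals[name] for name in SIGNAL_WEIGHTS)


def bucket_for(score):
    """Return the (bucket name, retention rate) for a composite score."""
    for name, threshold, keep_rate in BUCKETS:
        if score >= threshold:
            return name, keep_rate
    return BUCKETS[-1][0], BUCKETS[-1][2]


def curate(docs, seed=0):
    """Keep every high-quality doc and a random share of the lower buckets."""
    rng = random.Random(seed)
    kept = []
    for doc in docs:
        bucket, keep_rate = bucket_for(composite_score(doc["signals"]))
        if rng.random() < keep_rate:
            kept.append((doc["id"], bucket))
    return kept
```

Setting the lower-bucket retention rates to zero would recover a precision-only filter, which is the design this section argues against.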


We found the top-quality data to be biased toward STEM and reasoning, so we retain portions of mid- and lower-quality buckets to preserve diversity, which is critical for generalization.

Compared to precision-focused pipelines optimized for short token horizons, this approach yields ~2× more unique tokens while maintaining performance. The gain persists when scaling to longer training horizons, which highlights the importance of diversity alongside quality.

We also conducted a detailed deduplication analysis and confirmed FineWeb's hypothesis that global deduplication disproportionately removes high-quality data. By matching the quality distribution between global and snapshot deduplication, we could further close the gap on downstream performance.

Synthetic data

To round out natural web data, we use synthetic data to complement the training mix along dimensions that are otherwise hard to control. In Laguna XS.2, it contributes about 13% of the final training mix throughout all pre-training stages, building on organic data rather than replacing it, and expanding where it falls short. The Laguna series uses approx. 4.4T+ synthetic tokens.

To preserve diversity and validity at pre-training scale, our work spans a spectrum between seed-heavy and pipeline-heavy generation. At the seed-heavy end, we reshape content across formats (Q&A, structured lists, dialogue, and so on) to regularize how information is presented, so the model sees valuable artifacts through multiple angles. At the pipeline-heavy end, we move into feature extraction and recomposition, surfacing implicit reasoning, structure, and relationships, and teaching them in new forms and contexts.

We also expand synthetic generation beyond narrow, high-signal domains. Alongside STEM and code, we apply these pipelines across the broader data mix, expanding coverage while maintaining high, grounded signal density.

Our approach is designed to integrate within the larger training ecosystem, focusing on robustness and letting synthetic data contribute earlier and more consistently throughout training.

AutoMixer: data mixture optimization

Data curation and the mix that goes into training have a large impact on final model performance. We developed an automixing framework to systematically explore and optimize pre-training data mixtures. Instead of relying on manual heuristics, each run of the automixer trains a swarm of ~60 sufficiently large proxy models, each on a different data mix, and measures performance across key capability groups (code, math, STEM, common sense). From these runs, we fit surrogate regressors that approximate how changes in dataset proportions affect downstream evaluation.
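To make the automixing loop concrete, here is a toy sketch of the general pattern: sample about 60 mixes, score a proxy for each, fit a surrogate, then search the surrogate for a promising mix. Everything specific in it is an assumption for illustration, not the AutoMixer implementation: the dataset groups in `SOURCES`, the log-proportion feature map, the ridge least-squares regressor, the random search, and especially `proxy_eval`, which stands in for training and evaluating a real proxy model with a made-up response function.

```python
import math
import random

SOURCES = ["code", "math", "stem", "web"]  # hypothetical dataset groups


def random_mix(rng, k):
    """Sample one data mix: k positive proportions that sum to 1."""
    raw = [rng.random() + 1e-6 for _ in range(k)]
    total = sum(raw)
    return [x / total for x in raw]


def features(mix):
    """Assumed feature map: log-proportions, which model diminishing returns."""
    return [math.log(p) for p in mix]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_surrogate(mixes, scores, ridge=1e-8):
    """Ridge least squares for score ~ sum_i w_i * log(p_i)."""
    rows = [features(m) for m in mixes]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * s for r, s in zip(rows, scores)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, mix):
    """Surrogate's predicted score for a mix (higher is better)."""
    return sum(w * f for w, f in zip(weights, features(mix)))


TRUE_W = [0.40, 0.25, 0.20, 0.15]  # made-up ground truth, for the toy only


def proxy_eval(mix, rng):
    """Stand-in for training and evaluating one proxy model on this mix."""
    return sum(w * math.log(p) for w, p in zip(TRUE_W, mix)) + rng.gauss(0, 0.001)


def run_automix(n_proxies=60, n_candidates=5000, seed=0):
    rng = random.Random(seed)
    mixes = [random_mix(rng, len(SOURCES)) for _ in range(n_proxies)]
    scores = [proxy_eval(m, rng) for m in mixes]
    weights = fit_surrogate(mixes, scores)
    candidates = [random_mix(rng, len(SOURCES)) for _ in range(n_candidates)]
    best = max(candidates, key=lambda m: predict(weights, m))
    return weights, best
```

Calling `run_automix()` returns the fitted surrogate weights and the candidate mix with the highest predicted score. A production system would presumably fit one surrogate per capability group and trade them off, which this single-score toy does not attempt.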