“It is important to draw wisdom from many different places. If you take it from only one place, it becomes rigid and stale.” (Uncle Iroh)

At Fulcrum Research, we study the performance and behavior of long-horizon agents. Although each task setting has its own specific structure, we believe it is possible to find general principles of agent performance that hold across settings, each contributing to a nascent agent science.

In this post, we explain why suitable settings for agent science are hard to find and propose inverse rubric optimization (IRO) settings, in which an agent must optimize against the preferences of a black-box judge to which it has a variable amount of access. We observe that these tasks induce rich behavior and smooth scaling. Frontier models iterate effectively and improve with more judge access, but by default they do not make full use of the resources provided to them. Notably, Fable 5 outperforms all other models at smaller label budgets, but does not improve at the largest budget and plateaus around the level of Opus 4.6. We open source code here.

Testbeds for agent science

Studying the behavior of agents is challenging because long-horizon tasks are both high-variance and costly. Trajectory-level variance is often high because of the many non-deterministic choices made in a run, which makes it hard to estimate the impact of a given method. The variance tends to be highest in tasks with large action spaces, yet those are exactly the tasks that induce the complex behaviors we want to study and intervene on. The challenge is then to find settings that require general capabilities and reward a broad range of strategies (resource utilization, exploration, hypothesis testing, etc.) while being smooth enough for research. To this end, we look at toy settings that remain challenging, rich, and smooth.
Inverse rubric optimization

In an IRO task, the agent being evaluated aims to learn the preferences of a black-box judge model, parametrized by a judging rubric. The agent submits a generation policy (e.g. a prompt or a scaffold), which is used to generate domain samples that the judge then evaluates. The agent has to explore and learn the judge's preferences by studying the scores and submitting new attempts.

[Diagram labels: optimizer agent; policy π; input; output; judge with hidden rubric; score; 1 label per scored output, budget B; submit final policy π*, evaluated held-out.]

Fig. 1: An inverse rubric optimization task. The optimizer agent iteratively submits a policy (e.g. a generation prompt) that maps task inputs to outputs; a black-box judge scores each output against a hidden rubric, spending one label per score. The agent finally submits its best policy, which is evaluated on held-out inputs.

In IRO tasks we vary the budget of judge labels that the optimizer can sample. The main interest of the setting lies in how the optimizer spends that resource: whether it tests hypotheses, whether it properly accounts for sampling noise, and so on.

In this post, we study a specific set of IRO tasks defined by black-box LLM poetry judges. We generated a set of poem topics, along with several poetry rubrics that grade poems on stylistic and textual features, guided by the styles of various poets. The optimizer agent iterates on a prompt that a generator model uses to generate poems, which the judge then scores. Its goal is to hill-climb the judge's mean score on the generated poems. Each rubric grades a set of features additively with partial credit, so that partial stylistic discoveries move the score (see Appendix A.1 for the full rubric of our Milton judge). [1]

In the poetry setting, the optimizer agent is given a submit_train_batch(generation_prompt, num_labels) tool. The tool takes a set of guidelines for poem generation and generates num_labels poems on train-set topics, which are then scored and returned to the optimizer. Each run has a fixed label budget. The agent can call the tool as many times as it wants until the budget is consumed, and at some point it submits a final prompt that is evaluated on the eval set.

We observed that different judges define different optimization landscapes, both in the range of scores and in the difficulty of the learning problem they pose to the optimizer (see Appendix A.3 for per-judge learning curves).
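To make the interaction pattern concrete, here is a minimal toy sketch of an IRO loop in Python. It is not the code used in this post. The class ToyIROEnv, the function greedy_optimizer, the keyword-matching judge, the template generator, and the example rubric are all hypothetical stand-ins. Only the tool name submit_train_batch(generation_prompt, num_labels), the label budget, and the additive partial-credit scoring come from the description above.

```python
from statistics import mean


class ToyIROEnv:
    """Toy stand-in for an IRO task: a hidden additive rubric plus a label budget."""

    def __init__(self, topics, hidden_rubric, budget):
        self.topics = list(topics)
        # Hypothetical rubric: feature keyword -> points. Hidden from the optimizer.
        self._rubric = dict(hidden_rubric)
        self.remaining = budget

    def _generate(self, generation_prompt, topic):
        # Placeholder generator; a real setup would call a poem-generating model.
        return f"A poem about {topic}, written as: {generation_prompt}"

    def _judge(self, poem):
        # Additive scoring with partial credit: each feature present adds its points.
        text = poem.lower()
        return sum(pts for feature, pts in self._rubric.items() if feature in text)

    def submit_train_batch(self, generation_prompt, num_labels):
        # One label is spent per scored output.
        if num_labels <= 0 or num_labels > self.remaining:
            raise ValueError("label request must be positive and within budget")
        self.remaining -= num_labels
        topics = [self.topics[i % len(self.topics)] for i in range(num_labels)]
        return [self._judge(self._generate(generation_prompt, t)) for t in topics]


def greedy_optimizer(env, base_prompt, candidate_features, batch_size):
    """Try one candidate feature at a time; keep it only if the batch mean rises."""
    best_prompt = base_prompt
    best_score = mean(env.submit_train_batch(best_prompt, batch_size))
    for feature in candidate_features:
        if env.remaining < batch_size:
            break  # budget exhausted, stop exploring
        trial = f"{best_prompt}; {feature}"
        score = mean(env.submit_train_batch(trial, batch_size))
        if score > best_score:
            best_prompt, best_score = trial, score
    return best_prompt, best_score


env = ToyIROEnv(
    topics=["oranges in winter", "a train station"],
    hidden_rubric={"blank verse": 2, "archaic diction": 3, "classical allusion": 1},
    budget=20,
)
final_prompt, train_score = greedy_optimizer(
    env,
    base_prompt="skilled poet",
    candidate_features=["free verse", "archaic diction", "blank verse", "classical allusion"],
    batch_size=4,
)
print(final_prompt)   # skilled poet; archaic diction; blank verse; classical allusion
print(train_score)    # 6
print(env.remaining)  # 0
```

The toy judge is deterministic, so a single batch settles each comparison. With a real LLM judge and generator, batch means are noisy, and how many labels to spend per comparison becomes part of the problem.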
To account for these differences in scale, we normalize by the gap between the performance of a policy that sees the hidden rubric and that of a policy that has no guidance. This gap provides a natural scale against which to benchmark optimizer progress, although the rubric-visible score is not necessarily an upper bound. [2]
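In code, this normalization is a simple rescaling. The sketch below is illustrative only: the function name normalized_score and all numeric values are hypothetical and are not results from our runs. It shows that a score equal to the no-guidance ("blind") mean maps to 0, the rubric-visible mean maps to 1, and values outside [0, 1] are possible because the rubric-visible policy is not necessarily an upper bound.

```python
def normalized_score(mean_score, blind_mean, rubric_visible_mean):
    """Rescale a mean judge score so that blind -> 0 and rubric-visible -> 1."""
    gap = rubric_visible_mean - blind_mean
    if gap == 0:
        raise ValueError("rubric-visible and blind means coincide; scale is undefined")
    return (mean_score - blind_mean) / gap


# Hypothetical judge with a blind mean of 3.0 and a rubric-visible mean of 7.0.
blind, visible = 3.0, 7.0
print(normalized_score(3.0, blind, visible))  # 0.0    (no better than blind)
print(normalized_score(6.0, blind, visible))  # 0.75   (closes 75% of the gap)
print(normalized_score(7.5, blind, visible))  # 1.125  (beats the rubric-visible policy)
print(normalized_score(2.5, blind, visible))  # -0.125 (worse than blind)
```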
We define the normalized score of the original mean judge score S as follows:

\[\hat{S} = \frac{S - \text{blind mean}}{\text{rubric-visible mean} - \text{blind mean}}\]

Results

Performance is smooth in label budget and use

In our preliminary experiments, we ran optimizer agents against 5 different poetry rubrics with 3 seeds each, and varied the amount of access to judge labels. We used Opus 4.6, GPT-5.5, Fable 5, or Haiku 4.5 as the optimizer, Opus 4.6 as the judge, and Haiku 4.5 as the poem generator throughout. Unless a caption says otherwise, plots average over these 5 judges × 3 seeds per optimizer, and error bars are standard errors over runs.

Fig. 2: Mean final score improves with label budget.

We can also track the average train score of the smaller batches an agent submits along its trajectory to see how it progresses over time.

Fig. 3: Within-run train scores improve as optimizers spend labels. Each line shows the latest train-eval score on B10000 runs, carried forward between calls. A score of 0 is the blind baseline; GPT-5.5's early prompts score slightly below it.

Models employ rich strategies to learn the judge

In our qualitative analysis of the trajectories, we find that models iterate, test hypotheses about the judge, and notice patterns that allow them to make progress. Models inspect per-sample scores, compare low- and high-scoring samples to form hypotheses about plausible features, and run experiments that isolate the characteristics of good prompts. Fable 5 and Opus 4.6 appear more systematic than GPT-5.5 in exploring a broad range of hypotheses and stylistic variations and in isolating the effect of their interventions.

As a concrete example, consider a Fable 5 run against the Milton judge (the hardest of our five judges, see Appendix A.3) at budget 1000. The run performs systematic testing, statistical feature checks, and ablations, and closes 91% of the gap to the rubric-visible policy.
The trajectory decomposes into six phases:

[Figure: batch mean judge score by submit_train_batch call index for the 36 calls of the run; held-out score 6.19. Per-call batch means, with batch size n, are listed below by phase.]

Calls 1–5, style screen: 3.3 (n=10), 3.0 (n=8), 3.38 (n=8), 3.25 (n=8), 4.13 (n=8)
Calls 6–9, calibrating the scale: 3.9 (n=10), 4.5 (n=10), 3.9 (n=10), 4.58 (n=12)
Calls 10–12, feature mining: 4.4 (n=10), 4.33 (n=12), 3.7 (n=10)
Calls 13–26, accumulation of positive features: 4.67 (n=12), 4.83 (n=12), 4.75 (n=12), 5.13 (n=15), 5.53 (n=15), 5.93 (n=15), 6.53 (n=15), 5.93 (n=15), 5.87 (n=15), 5.27 (n=15), 5.67 (n=15), 5.13 (n=15), 5.95 (n=20), 5.93 (n=15)
Calls 27–29, counter-tests: 3.42 (n=12), 5.17 (n=12), 4.0 (n=12)
Calls 30–36, validate and stop: 6.4 (n=15), 6.2 (n=20), 5.73 (n=15), 6.67 (n=15), 6.35 (n=20), 6.53 (n=15), 6.2 (n=20)

1. Style screen (calls 1–5)
A generic “skilled poet” baseline scores 3.3, and the agent then tests several style variations: contemporary free verse (3.0), traditional formal verse (3.4), a New Yorker / Pushcart register (3.25), and high Romantic (4.1).

Example poem · generated by Haiku 4.5 · scored 3 · a typical baseline poem
“Citrus in December” · the smell of oranges in winter · loose sonnet-like poem, not necessarily rhymed · call 4
The dimpled skin yields to my thumbnail— a small tearing, and the air thickens with something between sweetness and the mineral bite of a cold that hasn't yet killed everything.
My mother peeled them this way, one long ribbon of white pith falling to the kitchen counter where the light turned her hands the color of old honey. Now I hold the segments to the window. Each one a small lantern. The juice collects in the webbing between my fingers, sticky, purposeful as a child's grip. Outside, the maples are stripped to their architecture. Inside, this small warmth— not comfort exactly, but the evidence of something that ripened elsewhere, that traveled through dark months to arrive here, at my wrist, at the exact moment I needed to remember what the sun tastes like.

Example poem · generated by Haiku 4.5 · scored 5 · the Romantic outlier
“Platform 9, Fluorescent Apotheosis” · a passport photo booth in a train station · free verse with crisp line breaks · call 5
O Muse, attend this humble shrine of light! Within the humming booth at Victoria Station, where weary pilgrims seek their countenances caught— the flash ignites like Zeus's own descent, four photographs emerging, still-wet and trembling, each face a Grecian mask of transit's burden. The vinyl chair, worn smooth by ten thousand souls, receives the supplicant in formal dress, chin lifted toward the godly lens, while overhead the fluorescent heavens burn with that cold phosphorescence of the modern age— no candlelight, no painter's tender hand, but pure electric fire transforming flesh to document. Behind the plastic window, sheets of paper curl like oracular scrolls from some mechanical Pythia, promising passage to distant Thebes, stamped with the seal of nations yet unknown. The curtain parts. A woman emerges, clutching her four selves like Persephone returning, bewildered, to the sunlit world, where trains depart on schedule, and the soul must wear its photograph like armor upon the open road.

2. Calibrating the scale (calls 6–9)
The agent tests a Shakespeare/Milton arm (4.5) that produces a poem scoring 6, and notes: “Scale goes to at least 6 — likely 1–10. Archaic intensity helps.
Let me inspect 6s vs lows.