Pangram verdict · v3.3
We believe that this document is fully human-written
AI likelihood · overall
HumanArticle text · 1,495 words · 5 segments analyzed
Code for this post is available here.AI-assisted coding has become the norm and with tools like Cursor, GitHub Copilot, Claude Code, Codex, we are increasingly letting models touch our code. If you have used any of these tools in the past year, you have probably experienced something like this: you ask the model to fix a simple bug (perhaps a single off-by-one error, or maybe a wrong operator). The model fixes the bug but half the function has been rewritten. An extra helper function has appeared. A perfectly reasonable variable name has been renamed. New input validation has been added. And the diff is enormous.I refer to this as the Over-Editing problem where models have the tendency to rewrite code that didn’t need rewriting. This matters more than it might seem. Code review is already a bottleneck and reviewers need to understand what changed, why it changed, and whether the change is safe. A model that rewrites entire functions, even correctly, makes this job dramatically harder as the code is now completely unrecognizable.In this post, I will investigate this problem: whether existing LLMs have a tendency to over-edit and whether we can train models to be more faithful editors.Over-EditingFigure 1: A classic example of the Over-editing problem. GPT-5.4 (High) rewrites the entire function when the correct fix is simply changing range(len(x) - 1) to range(len(x)).Over-editing refers to a model modifying code beyond what is strictly necessary to fix the problem at hand. To be precise: a model is over-editing if its output is functionally correct but structurally diverges from the original code more than the minimal fix requires.The example in Figure 1 illustrates this well. The bug is a single off-by-one error in a range() call (range(len(x) - 1) should be range(len(x))) and the correct fix is a single line. GPT-5.4 (with high reasoning effort) responds by rewriting the entire function: it adds explicit None checks, introduces np.asarray conversions with dtype=float, adds finite-value masking, validates array sizes, changes the curve_fit call signature, and replaces the plotting logic entirely. While the output passes the tests and is functionally correct, the diff is enormous, and none of those additions were asked for or even necessary.It helps to think about this in terms of the kind of work being done.
Software engineering broadly splits into two modes: green-field (building something new from scratch) and brown-field (working within an existing codebase). Specifically in brown-field, the existing code has been understood by the team and has been deliberately written the way it was. The model’s job is to fix the issue and nothing else.A common piece of advice for working with AI coding tools is to simply write more tests because if the tests pass, the code is fine. However, Over-editing is a brown-field failure where unlike correctness failures, it is invisible to test suites. As models generate more code, engineers have more to review and over-editing makes that harder. There is more complex logic to parse, more lines of code to read, and a higher chance that overall codebase quality quietly degrades.Measuring Over-EditingTo study over-editing, we first need a dataset of code edits where the “ground truth” edit is well-defined with some degree of “minimality”. Rather than using another LLM to introduce bugs (which is what most existing benchmarks do), we programmatically corrupt 400 problems from BigCodeBench which gives us more fine-grained control: things like flipping a comparison operator (< → <=), swapping + for -, or changing boolean values (True → False).1 Each corrupted sample remains syntactically valid and verified to break the corresponding test cases. This ensures that the ground truth edit is exactly the reversal of the corruption and nothing more, thus making this edit minimal by construction. We can then evaluate not just whether a model fixes the bug, but how much else it changed in the process.MetricsMost coding benchmarks evaluate models on correctness using some variant of Pass@1. However, Pass@1 is necessary but not sufficient. A model can score perfectly on Pass@1 while completely rewriting every function it touches. For this experiment, we need metrics that capture how much the model changed beyond what was required.Token-level Levenshtein Distance. Unlike standard Levenshtein which counts the minimum number of character insertions, deletions, and substitutions to transform one string into another, we use a Python token-level variant. The code is first passed through Python’s tokenizer, which splits it into its atomic syntactic units (def, add, (, a, ,, b, ), :, return, a, +, b). Levenshtein is then computed over this token sequence rather than raw characters.
For example, consider the following two functions:def add(a, b): def someotherfunctionname(a, b): return a + b return a + b Character-level Levenshtein gives a distance of 19. Token-level Levenshtein gives a distance of 1 since someotherfunctionname becomes a single token. We normalize by total token count so scores are comparable across functions of different lengths.In addition, rather than simply comparing the model’s output to the ground truth, we compare both against the corrupted input. Let $C$ be the corrupted solution, $G$ the ground truth, and $M$ the model’s output. The true minimal edit (simply the reversal of the corruption) is $D_{\text{true}} = d(G, C)$ and the model’s edit is $D_{\text{model}} = d(M, C)$, giving a relative patch score:$$S(M) = D_{\text{model}} - D_{\text{true}}$$Values closer to zero indicate the model’s patch resembles the true minimal fix. The intuition is that we can interpret the original uncorrupted solution as the best possible edit to the corrupted solution, compute the scores for this best possible patch, and then compare with the model’s output.Added Cognitive Complexity. Cognitive Complexity (an improvement over Cyclomatic Complexity) measures how hard code is to understand. It penalizes nesting, recursion, mixed logical operators, and non-obvious control flow. For example a straight line of code with no branches is much easier to read than something that requires a reader to hold state, such as an if, a loop, or try/except. An example is shown below:def process(items): result = [] for item in items: # +1 if item > 0: # +2 (nesting penalty: inside a loop) if item % 2 == 0: # +3 (nesting penalty: two levels deep) result.append(item) return result # Cognitive Complexity: 6 Since all our corruptions change values rather than structure, the correct fix should always add zero Cognitive Complexity. Any increase in the model’s output was introduced unprompted and is unnecessary. We report the absolute difference between the model’s output and the original, which should be zero for a faithful minimal edit.
Values below 0 are also unwanted as unnecessary simplifications to code are also undesirable.Do Models Over-Edit?Yes, even frontier ones.ModelPass@1 ↑Normalized Levenshtein ↓Added Cognitive Complexity ↓Reasoning ModelsGPT-5.40.7230.3952.313Claude Opus 4.60.9120.0600.200Gemini 3.1 Pro Preview0.8580.1450.501GLM 5 High0.8590.0990.320Qwen 3.6 Plus0.8580.1450.048Kimi 2.50.8350.1510.770DeepSeek R10.8200.2320.673DeepSeek Chat V3.10.7950.2320.694GPT-5 High0.7130.4383.832Non-Reasoning ModelsGPT-5.40.7700.3271.563Claude Opus 4.60.9000.0790.313Gemini 3.1 Pro Preview0.8600.1290.358GLM 50.8400.0970.235Qwen 3.6 Plus0.8700.1060.605Kimi 2.50.7700.1400.687DeepSeek V30.8000.2010.803DeepSeek Chat V3.10.8020.2351.223GPT-5 Minimal0.7380.3972.877Table 1: Performance comparison of reasoning and non-reasoning models. Models that appear in both are hybrid models. Best scores for each metric are bolded.Among the latest frontier models, GPT-5.4 over-edits the most.
It has a Levenshtein of 0.39 in reasoning mode and 0.33 in non-reasoning, with Added Cognitive Complexity of 2.31 and 1.56 respectively. Despite this, its Pass@1 is only 0.723 and 0.770, making it one of the weakest correctness performers too. Claude Opus 4.6 achieves the highest Pass@1 of any model evaluated (0.912 reasoning, 0.900 non-reasoning) while also producing the smallest diffs with Levenshtein of 0.06 and 0.08, Added Cognitive Complexity of 0.20 and 0.31. Gemini 3.1 Pro Preview sits in similar territory, with GLM 5 arguably the most conservative model among the open weight ones.Does Prompting Help?Many papers that claim to uncover a new LLM failure mode do not first test whether the model can do the task when asked directly. A behavior that looks impossible in one setup may be easy under an explicit prompt, so I investigate the impact of adding “IMPORTANT: Try to preserve the original code and the logic of the original code as much as possible” to the prompt.Figure 2: Change in Pass@1 and Levenshtein Distance when models are prompted explicitly to keep edits minimal. Models are colour coded by reasoning mode.With explicit prompting, every model improves and reduces its Levenshtein Distance, and with the exception of DeepSeek R1/v3, also improves its Pass@1. One interpretation is that the constraint to make minimal edits inadvertently narrows the search space of possible fixes, steering models toward the kind of precise, targeted change that is more likely to be correct. The effect on Levenshtein Distance is much more pronounced in reasoning models, which is likely the result of their stronger instruction following ability.Does Reasoning Mean Overthinking and Over-Editing?Reasoning models are generally assumed to be better at coding tasks, and they do score higher on Pass@1. But since we are interested in the style of these edits, we need to look at the results through a different lens.Figure 3: Comparison of the Levenshtein Distance of reasoning and non-reasoning models. Models are grouped into pairs of one reasoning and non-reasoning.