Notes from Production
A year of AI capabilities compounding
January 04, 2026

I'm probably deep in the AI echo chamber — I write software for a living and I'm building products that run on these models. But even knowing that, I'm surprised by how far my day-to-day experience is from the average sentiment I see from other engineers, let alone people outside tech.
So I want to share what I'm actually seeing, from two angles: using AI coding agents as a developer, and building case-handling systems at forml, where we've watched these models evolve over the past two years. The pattern is the same in both places, and I think it's worth paying attention to.
What I'm seeing on the coding side
A year ago, the realistic pitch for AI coding tools was: autocomplete, boilerplate generation, maybe help with straightforward functions if you prompt carefully. Useful, but clearly assistive. The idea of giving an agent a vague bug report and having it autonomously investigate — pulling in code, tracing logic, cross-referencing domain-specific documentation — would have been pretty unrealistic.
That's what we have running in production today.
Our system takes a user report like "this doesn't work" or "the calculation here is wrong," pulls in the relevant context (input data, code paths, our documentation, plus targeted searches of the relevant regulations, laws, and customer guidelines), and produces an investigation. These investigations used to take 5 to 60 minutes each. You'd need to understand what the user actually meant, trace through the logic, figure out where things went wrong, and often dig into some niche corner of domain-specific law to understand whether our implementation or the user's expectation was off.
Now an agent does this and regularly produces analyses that correctly identify both the specific bug and the broader legal issue at play, on obscure edge cases I wouldn't have expected it to handle. The kind of investigation that, if a junior engineer handed it to me, I'd think "okay, this person actually understands the problem."
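Concretely, the flow looks something like the sketch below. This is a minimal illustration, not our actual code, and every helper in it (run_agent, search_regulations, and so on) is a hypothetical stand-in for whatever retrieval layer and agent runtime you happen to use.

```python
from dataclasses import dataclass

@dataclass
class Investigation:
    summary: str        # what the user actually meant
    suspected_bug: str  # where in the code things go wrong
    legal_context: str  # the regulation or guideline that settles who is right

# Placeholder context sources: in practice these are real retrieval calls.
def fetch_case_inputs(case_id: str) -> str:
    return f"submitted data for {case_id}"

def relevant_code_paths(report: str) -> str:
    return "calculation and validation modules touched by this report"

def search_docs(report: str) -> str:
    return "matching internal documentation"

def search_regulations(report: str) -> str:
    return "matching statutes, laws, and customer guidelines"

def run_agent(objective: str, context: dict) -> Investigation:
    # Stub: in the real system this is an LLM agent with tool access.
    return Investigation(summary="...", suspected_bug="...", legal_context="...")

def investigate(report: str, case_id: str) -> Investigation:
    # Hand the agent everything a human investigator would pull up first.
    context = {
        "report": report,
        "inputs": fetch_case_inputs(case_id),
        "code": relevant_code_paths(report),
        "docs": search_docs(report),
        "regulations": search_regulations(report),
    }
    # One clear, high-level objective instead of a pile of hand-written rules.
    objective = (
        "Work out what the user actually meant, trace the relevant logic, and "
        "say whether the fault is in our implementation or in the user's "
        "expectation. Cite the regulation that decides it."
    )
    return run_agent(objective, context)

print(investigate("the calculation here is wrong", "case-4711"))
```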
That shift happened fast. If your mental model of these tools is from early 2025, it's already outdated.
Seeing these results makes it easy to over-extrapolate. These same systems also make bizarre logical errors that a 6th grader wouldn't make, sometimes right after solving something genuinely hard. They lack architectural taste — they'll take the greedy, short-term path without thinking about maintainability or technical debt. Left unsupervised, they produce slop. If you want to see what happens when non-technical people try to build complex software with AI alone, browse the microSaaS subreddit and watch people hit walls after "really trying."
You need capable people in the driver's seat. But the ceiling of what these tools can do keeps rising, fast.
The same pattern, different domain
At forml, we build software that processes incoming German public administration forms, mostly social benefits like Wohngeld. The bulk of cases need clarification or additional information before they can actually be processed. What's needed depends entirely on the specific case and the documents submitted. You can't cover this with standard forms; they'd become impossibly complex and people would misunderstand them anyway.
In a lot of industries, this kind of document processing has existed for a while. You turn incoming data into structured information, have software handle the business rules, and let humans focus on the genuinely complex cases. But for public administration in Germany, this was impossible before pre-trained models, because you can't legally train on the input data. So we were working with LLMs from the start and have watched the entire evolution from the inside.
Early on, these models struggled to even classify documents reliably. Extracting key information from something like a rental contract, which has no standard format and can get pretty wild, required heavy scaffolding: specialized models for classification, aggressive prompt engineering covering every edge case, processing page-by-page and stitching results together. You did what you had to do.
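For a sense of what that scaffolding looked like, here's a rough reconstruction rather than actual production code; classify_page and extract_fields are hypothetical placeholders for the specialized models and per-document prompt chains we relied on.

```python
# Early-generation document handling: split, classify, extract, stitch.
def classify_page(page: str) -> str:
    # Stub for a dedicated classification model.
    return "rental_contract"

def extract_fields(doc_type: str, page: str) -> dict:
    # Stub for a prompt chain tuned to one document type and its edge cases.
    return {"monthly_rent": "950 EUR"}

def process_document(pages: list[str]) -> dict:
    merged: dict = {}
    for page in pages:
        doc_type = classify_page(page)
        fields = extract_fields(doc_type, page)
        # Stitch per-page results together and hope the pieces agree.
        for key, value in fields.items():
            merged.setdefault(key, value)
    return merged

print(process_document(["page 1 text", "page 2 text"]))
```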
Then six months would pass and the next generation of models would make your careful scaffolding irrelevant. Problems you'd built intricate solutions for just dissolved. Every time, the lesson was the same: give the model more context, more wiggle room, clearer high-level objectives. Stop trying to anticipate every edge case with hand-crafted rules. The models beat any heuristics we could come up with.
By the second half of 2025, matching a well-prompted agent with handcrafted business rules had become prohibitively complex. Just dumping the entire input alongside complex instructions about what we want became our best-performing solution. At the start of 2025 that wouldn't have worked reliably at all.
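In code, the shape of that solution is almost embarrassingly simple. The sketch below assumes a generic completion-style wrapper (call_model is a hypothetical placeholder, not any specific vendor's API), and the instructions are heavily abbreviated; the real ones are much longer.

```python
INSTRUCTIONS = """
You are reviewing a Wohngeld case. You receive the full application, every
submitted document, and our processing guidelines. Decide what is missing or
contradictory, which clarification questions to send to the applicant, and
which rules apply. Explain your reasoning and cite the guideline for each point.
"""

def call_model(prompt: str) -> str:
    # Stub for the actual LLM call.
    return "clarification questions + applicable rules, with citations"

def review_case(application: str, documents: list[str], guidelines: str) -> str:
    # No page splitting, no per-document prompts, no hand-written decision tree:
    # the whole case plus the objective goes into a single request.
    prompt = "\n\n".join([INSTRUCTIONS, guidelines, application, *documents])
    return call_model(prompt)

print(review_case("application text", ["rental contract text", "payslip text"], "guideline text"))
```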
This matters especially for long-tail problems. Those cases where you think "how often does this even happen, ten times a year?" make up the bulk of total volume. Hand-crafted heuristics can't cover them. Models with good context and clear objectives just handle them.
The recurring lesson
What makes both of these systems work isn't that we found a magic model. It's that we gave them everything a human would need to do the job: rich context, clear objectives, and access to the right resources — and the models got good enough to actually use all of that. When you do that today, you can often just watch it go.
This approach only became viable because instruction-following improved dramatically. A year ago, complex multi-step instructions would confuse models. You had to simplify, break things down, hand-hold. Now, you can throw a dense task description at an agent and it actually executes.
Where this is heading
The trendline here is pretty clear. If you look at evaluations like METR's task-length studies or Epoch's Capabilities Index, the scope of what these models can handle autonomously roughly doubles every 4-7 months. That matches what I've been seeing in practice on both sides of my work.
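As a back-of-the-envelope check on what that doubling rate implies (nothing more than arithmetic on the claimed 4-7 month range):

```python
# If the autonomous task horizon doubles every 4-7 months, this is what it
# compounds to over two and three years. Purely illustrative arithmetic.
for doubling_months in (4, 7):
    for years in (2, 3):
        growth = 2 ** (years * 12 / doubling_months)
        print(f"doubling every {doubling_months} months, {years} years: ~{growth:.0f}x")
```

Even at the slow end of that range, that's roughly an order of magnitude more autonomous scope within two years.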
I don't have strong arguments for why this would stop anytime soon. There's a lot of capital and competitive pressure behind continued scaling.
For SWEs, the job is already shifting toward something closer to project management. Figuring out what actually needs to be built, defining it clearly, validating that the output is good, maintaining taste about architecture and long-term maintainability. The part where you personally type out the implementation is becoming a smaller slice of the work.
If the current trends hold, SWE work will look drastically different within the next two to three years. Not five, not ten.