Hey folks, I wrote a post earlier this year about getting started with AI agents where I covered the basics: what agents are, the different patterns, how to pick a framework. That post was me being optimistic and excited about the potential.
This one’s different. This is what I’ve learned after actually shipping agent-based features in production at Spektra over the past several months. Some of it confirms the hype. A lot of it doesn’t.
Microsoft Ignite 2025 just wrapped up, and agents were everywhere. Every keynote, every demo. Copilot Studio agents, Azure AI Foundry agents, Semantic Kernel agents. You’d think we all have autonomous AI running our businesses. We don’t. But some of us have gotten real agents into production, and I want to talk about what that actually looks like.
The Demo-to-Production Gap Is Massive
Let me give you a number that stuck with me. According to a 2024 Camunda survey, roughly 63% of organizations piloting AI agents hadn’t moved a single one to production. And the ones that did? Most ended up scaling back the scope quite a bit from what they originally planned.
I get it. I’ve built the demos too. You wire up an LLM to a few tools, give it a system prompt, and watch it do something cool on stage. Takes an afternoon. Getting that same thing to work reliably at 2 AM on a Tuesday when real users depend on it? That’s a completely different engineering challenge.
The gap comes down to three things: reliability, cost, and the long tail of weird inputs your users will throw at it.
Single Agent with Tools Wins. Multi-Agent Is Mostly Pain.
I have opinions on this and I’m not going to hedge.
For production workloads, a single well-prompted agent with clearly defined tools beats a multi-agent system almost every time. I know that sounds boring. Multi-agent architectures look amazing in papers and conference talks. You’ve got the planner agent, the researcher agent, the reviewer agent, all collaborating. It’s beautiful.
It’s also a debugging nightmare.
We tried a multi-agent setup in CloudLabs where one agent diagnosed a failed deployment, a second proposed a fix, and a third validated the fix before applying it. In theory: checks and balances, very elegant. In practice: the agents disagreed with each other, passed context back and forth in ways that lost information, and the whole thing took 45-60 seconds when a single agent could do the same job in 12.
We ripped it out. Reliability went from about 72% to 89%. Latency dropped by 4x. I’m not saying multi-agent never makes sense, but for enterprise automation? Start with one agent, give it good tools, and only add complexity when you’ve genuinely hit a wall.
AutoGen and CrewAI are cool frameworks. We prototyped with both. But Semantic Kernel with a single agent and Azure AI Foundry for deployment is what we actually run in production.
Tool Calling Is the Hard Part
Everyone talks about the LLM as if it’s the tricky bit. It’s not. The LLM is the easy part. Tool calling is where everything breaks.
You define a tool, say `get_lab_status(lab_id: string)`, and tell the agent when to use it. Sounds simple. Then in production:
- The agent calls `get_lab_status` with an environment ID instead of a lab ID because the user mentioned both in their message
- The agent calls the tool twice in a row for no reason
- The agent hallucinates a tool that doesn’t exist (we saw it invent `restart_lab_engine` out of thin air)
- The agent decides to call three tools simultaneously when they need to be sequential
- The tool returns an error and the agent just… ignores it and makes up an answer
We track tool call accuracy in production. With GPT-4o and well-written tool descriptions, we get correct tool selection about 91% of the time. Sounds good until you realize that for a workflow requiring 4 sequential tool calls, end-to-end success drops to ~68% (0.91^4). That’s a lot of failures.
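That compounding is worth internalizing. A quick sketch of the arithmetic (assuming each call's success is independent, which is a simplification):

```python
# Per-step tool selection accuracy compounds over sequential calls.
per_call_accuracy = 0.91

for steps in (1, 2, 4, 6):
    end_to_end = per_call_accuracy ** steps
    print(f"{steps} sequential tool calls -> {end_to_end:.1%} end-to-end success")
```

At six sequential calls you're already down near a coin flip's neighborhood, which is why shortening workflows matters as much as improving per-call accuracy.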
What helped:
- Extremely detailed tool descriptions. Not just "Gets lab status" but a paragraph explaining what the tool does, what each parameter means, when to use it vs. other tools, and what the return format looks like. We roughly tripled the length of our tool descriptions and accuracy jumped from 84% to 91%.
- Fewer tools. Started with 15 tools. Cut to 8 by combining related operations. Fewer choices, fewer wrong choices.
- JSON schema validation on tool call arguments. Catches malformed calls before they hit your backend.
- Retry with feedback. When a tool call fails, feed the error back and let the agent try again. Recovers about 60% of failures on the first retry.
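To make the last three items concrete, here's a minimal sketch in the style of OpenAI function-calling schemas. The tool name and description are illustrative (this isn't our actual schema), and the validator is deliberately hand-rolled to show the idea; in practice you'd use a real JSON Schema library:

```python
# Hypothetical tool definition. The long description is the point: it tells
# the model exactly when (and when not) to pick this tool, and what it returns.
GET_LAB_STATUS = {
    "name": "get_lab_status",
    "description": (
        "Returns the current status of a hands-on lab. Use this when the user "
        "asks whether a lab is running, provisioning, or failed. Requires a lab "
        "ID (format 'lab-<number>'), NOT an environment ID; for environment-level "
        "status use get_environment_status instead. Returns JSON with 'state', "
        "'started_at', and 'last_error'."
    ),
    "parameters": {
        "type": "object",
        "properties": {"lab_id": {"type": "string"}},
        "required": ["lab_id"],
    },
}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Minimal argument validation: required keys present, no unknown keys,
    string types respected. Returns a list of human-readable errors."""
    errors = []
    props = schema["parameters"]["properties"]
    for key in schema["parameters"].get("required", []):
        if key not in args:
            errors.append(f"missing required argument '{key}'")
    for key, value in args.items():
        if key not in props:
            errors.append(f"unknown argument '{key}'")
        elif props[key]["type"] == "string" and not isinstance(value, str):
            errors.append(f"argument '{key}' must be a string")
    return errors

# Retry with feedback: instead of failing the interaction, send the errors
# back to the model as a tool message and let it correct the call.
errors = validate_args(GET_LAB_STATUS, {"environment_id": "env-42"})
```

The error list is exactly what you feed back to the agent on retry; a sentence like "missing required argument 'lab_id'" is enough context for the model to fix its own call most of the time.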
The Cost Problem Nobody Talks About Honestly
I mentioned token costs in my CloudLabs AI post and my RAG post. Agents make the cost problem way worse because of loops.
A simple RAG query: one embedding call, one LLM call. Maybe $0.002-0.005 per query with GPT-4o.
An agent workflow: the LLM reasons about what to do, calls a tool, gets the result, reasons again, calls another tool, reasons again, formulates an answer. That’s 3-5 LLM calls minimum per user interaction, each with a growing context window.
Our lab troubleshooting agent averages 3.2 LLM calls per interaction. Cost per interaction: $0.04-0.08. At our volume that’s around $1,200/month for one agent feature, and that’s after optimization. The first version ran $0.15-0.20 per interaction because we hadn’t set max iteration limits and the agent would occasionally loop 8-10 times.
What we do now:
- Hard cap of 5 tool-calling iterations per interaction. If the agent hasn’t solved it by then, it escalates to a human.
- Aggressive context window management. We summarize earlier tool results instead of keeping the full JSON responses in context.
- Model routing. Not every step needs GPT-4o. We use GPT-4o-mini for the initial intent classification and only escalate to GPT-4o for the actual reasoning steps. This cut costs by about 40%.
- Caching for repeated patterns (same as we did with RAG, and it works even better for agent workflows because the same tool sequences recur frequently).
When You Shouldn’t Use an Agent at All
This is maybe the most important section.
I’ve seen teams reach for agents when a simple API call with an LLM would do the job. If your "agent" always follows the same three steps in the same order, it’s not an agent. It’s a pipeline. Build it as a pipeline. It’ll be faster, cheaper, and more reliable.
Use an agent when the workflow genuinely requires dynamic decision-making. Our lab troubleshooting agent is a good example: depending on the error type, it might check deployment logs, then quota, then network config, or it might skip straight to resolution. The path varies every time.
Don’t use an agent when you can enumerate all the paths upfront. Our lab content generation feature? Fixed pipeline: extract requirements, generate outline, expand sections, format. No agent needed. Calling it an "agent" would just add latency and cost for zero benefit.
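For contrast, here's roughly what that pipeline looks like in code, not the real implementation, just the shape, with `llm` as a hypothetical single-completion helper:

```python
# Fixed pipeline: extract -> outline -> expand -> format. No loop, no
# tool selection, no agent. Each step is one plain LLM call.
def generate_lab_content(requirements_doc: str, llm) -> str:
    requirements = llm(f"Extract the lab requirements from:\n{requirements_doc}")
    outline = llm(f"Generate a lab outline for these requirements:\n{requirements}")
    sections = [
        llm(f"Expand this outline section into full instructions:\n{line}")
        for line in outline.splitlines()
        if line.strip()
    ]
    return llm("Format these sections as a single lab guide:\n\n" + "\n\n".join(sections))
```

Four predictable steps means four predictable costs, one obvious place to look when a step misbehaves, and no possibility of an infinite loop.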
I’d estimate that 60-70% of the "agent" use cases I see people building would be better served by a well-designed prompt chain or a simple function-calling API call.
The Gotchas List
Alright, here are the things that bit us in production that no conference talk prepared us for.
Infinite loops. Agent needs more info, calls a tool, gets a confusing result, decides it needs more info, calls the same tool again. We had one interaction rack up 47 tool calls before we caught it. That’s why the hard cap exists now.
Context window overflow. Tool results can be big. A deployment log might be 3,000 tokens. After a few tool calls, your context is mostly tool output and the agent loses track of the original question. Truncate tool results aggressively.
Hallucinated confidence. This one is subtle. When an agent calls tools and gets real data back, it sounds MORE confident in its answer, even when it’s wrong. Users trust it more because it "looked stuff up." We added confidence scoring and flag anything below 0.7 as uncertain.
Cold start latency. First interaction takes 8-12 seconds because of the reasoning step plus the first tool call. Users expect chatbot speed (2-3 seconds). Streaming partial responses helped, but the latency is real.
MCP: The Standard We’ve Been Waiting For
One bright spot. Anthropic’s Model Context Protocol (MCP) has been gaining traction and I think it’s going to be a big deal.
The problem it solves: every agent framework has its own way of defining tools. Semantic Kernel has one format, LangChain has another, OpenAI’s function calling has another. If you build a tool for one framework, you rebuild it for another. MCP standardizes this. Think of it like USB for AI tool connectivity.
I like it because it separates the tool server from the agent (update tools without redeploying your agent), it supports discovery (an agent can ask "what tools are available?" at runtime), and it’s already supported by Claude, Semantic Kernel, and a growing list of frameworks.
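To make the "USB for tools" idea concrete: over MCP, a server advertises its tools as JSON with a name, description, and a JSON Schema for the inputs. Roughly (this is the shape as I read the spec at the time of writing, reusing our earlier example tool; don't treat it as a verbatim wire sample):

```json
{
  "tools": [
    {
      "name": "get_lab_status",
      "description": "Returns the current status of a hands-on lab. Requires a lab ID, not an environment ID.",
      "inputSchema": {
        "type": "object",
        "properties": {
          "lab_id": { "type": "string" }
        },
        "required": ["lab_id"]
      }
    }
  ]
}
```

Any MCP-aware agent can discover that tool at runtime and call it, regardless of which framework the agent itself is built on. That's the whole pitch.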
We’re prototyping it now. I expect we’ll migrate our tool layer to MCP in early 2026.
My Recommendations for Getting Agents to Production
After nine months of building this stuff, here’s what I’d actually tell someone starting today:
- Build a pipeline first, promote to agent later. Get your workflow running as a fixed sequence of LLM calls and tools. Only convert to a dynamic agent if the fixed sequence can’t handle the variance in user requests.
- Invest in tool descriptions like they’re documentation, because they are. Tool description quality determines your agent’s accuracy more than which LLM you pick.
- Set hard limits on everything. Max iterations, max tokens per tool result, max cost per interaction. The defaults in most frameworks are "unlimited," and that’s dangerous.
- Measure tool call accuracy separately from answer quality. A great answer built on a wrong tool call is a ticking time bomb.
- Use Azure AI Foundry for deployment if you’re in the Microsoft ecosystem. Pair it with Semantic Kernel for the agent logic. That combo works.
- Watch MCP. Build new tools with MCP compatibility in mind even if you’re not using it yet.
Resources
- Azure AI Foundry documentation
- Semantic Kernel Agents
- Model Context Protocol spec
- AutoGen framework
- OpenAI function calling guide
- My earlier post: Building AI Agents: A Practical Framework for Getting Started
- My RAG post: RAG Patterns for Enterprise
If you’re running agents in production, I want to hear from you. What’s your tool call success rate? What patterns are you using for cost control? Drop a comment or find me on LinkedIn.
Happy building, folks!
Amit
Assisted by AI during writing