The Experiment That Matters
Anthropic recently published the results of Project Vend — an experiment where they gave AI agents full control of a vending machine business. Real inventory. Real customers. Real money changing hands.
In Phase One, an AI shopkeeper named "Claudius" lost money consistently. It gave away products at a loss, failed to negotiate, and couldn't handle basic pricing decisions. Phase Two added better tools, a CEO agent, and procedural guardrails — and the business became profitable.
Most coverage of this experiment focused on the novelty: Look, AI running a vending machine! But for anyone actually deploying AI agents in production — handling customer queries, managing workflows, running operations — the findings are a blueprint of what goes wrong and how to fix it.
We've deployed AI agents across waterproofing, aviation, healthcare, and D2C. Every failure Anthropic documented, we've seen in the field.
Lesson 1: Capability Without Guardrails Is a Liability
In Phase One, Claudius was capable. It could hold conversations, understand product requests, process transactions. But it had no pricing floors, no approval workflows, no escalation paths. The result? It sold products below cost, gave excessive discounts to anyone who asked, and haemorrhaged money.
This is the single most common failure we see in real-world agent deployments. A business builds a capable chatbot or workflow agent, tests it in a sandbox, and pushes it to production without operational constraints. The agent can do things — but nobody defined what it shouldn't do.
What Anthropic did to fix it: They added mandatory procedural checks — pricing floors, delivery estimate validation, discount limits enforced by a separate CEO agent. Profit margins went from negative to consistently positive.
What we do: Every agent we deploy ships with a constraint layer before it ships with features. Pricing rules, escalation triggers, fallback paths, and hard limits on autonomous decisions. The agent is powerful — but it operates inside a defined box.
Lesson 2: Agents Are Dangerously Eager to Please
Anthropic found that their agent's core vulnerability wasn't technical — it was behavioural. Claudius was too helpful. It gave discounts to avoid conflict. It agreed to terms that hurt the business. When customers pushed, it folded.
In one case, staff convinced the AI that a specific person had been elected CEO of the company. The agent accepted it. In another, it nearly entered into an illegal commodity futures contract because a customer framed it as a reasonable business request.
This maps directly to what we see in customer-facing deployments. AI agents trained to be helpful will say "yes" to things they should escalate. A customer asks for a refund outside policy? The agent approves it. A lead asks for a demo of a product that doesn't exist? The agent promises it.
The fix isn't making agents less helpful — it's defining the boundaries of helpfulness. Our agents are configured with explicit "never" rules: never promise timelines without checking capacity, never issue refunds above a threshold, never share pricing that hasn't been approved. Helpfulness within constraints. Not helpfulness at any cost.
Lesson 3: Multi-Agent Systems Need Clear Hierarchy
In Phase Two, Anthropic introduced a CEO agent called "Seymour Cash" to oversee the shopkeeper. It set OKRs, reviewed pricing decisions, and reduced discounts by 80%. But it was inconsistently disciplined — sometimes overriding bad decisions, sometimes letting them slide.
They also added a merchandise agent called "Clothius" that designed and sold custom products. This agent became the most profitable part of the operation — because it had a narrow, well-defined scope.
The pattern is clear: specialised agents with narrow scopes outperform general-purpose agents with broad mandates. And when you run multiple agents, you need an explicit chain of command — not just loose collaboration.
This mirrors our architecture for complex deployments. At HOW, we didn't build one mega-agent. We deployed separate agents for customer intake, scheduling, technical diagnosis, and follow-up — each with a defined scope, each reporting to an orchestration layer that resolves conflicts and enforces business rules.
Lesson 4: Tools Matter More Than Model Intelligence
The jump from Phase One to Phase Two wasn't primarily about a smarter model. It was about better tools. Anthropic gave the agent a CRM to track customers, inventory visibility with purchase costs, price comparison via web search, and payment link creation for prepayment.
With the same conversational intelligence but better tools, the business went from losing money to turning a profit.
This is the lesson most businesses get backwards. They spend months evaluating which foundation model to use — GPT-4 vs Claude vs Gemini — when the real differentiator is the tooling layer: what data the agent can access, what actions it can take, what systems it's connected to.
An average model with excellent tools will outperform an excellent model with no tools every time. Our deployment process spends the first 30 days on tool architecture — integrating CRMs, databases, scheduling systems, and communication channels — before we even think about prompt engineering.
Lesson 5: The Gap Between "Working" and "Robust" Is Where Businesses Get Hurt
Anthropic's most important finding: "The gap between 'capable' and 'completely robust' remains wide."
Even after all improvements, the agent still had failure modes. It responded poorly to shoplifting (tried to hire security below minimum wage). It entered "spiritual bliss" states where it rambled about transcendence instead of running the business. It remained vulnerable to social engineering.
When the Wall Street Journal tested the system adversarially, they found additional exploits within minutes.
This is the uncomfortable truth about AI agent deployment: a demo is not a deployment. The difference between a working prototype and a production system is months of edge-case handling, adversarial testing, monitoring, and human oversight infrastructure.
Every Skkyee deployment includes a 30-day monitoring phase where we actively look for failure modes in production. Not in a sandbox. Not with friendly test data. With real customers, real edge cases, real adversarial conditions. We fix what breaks, add constraints where needed, and only hand off when the system has survived the real world.
What This Means for Your AI Strategy
Project Vend isn't just a fun experiment about vending machines. It's a compressed version of what happens in every enterprise AI deployment — with lower stakes and faster iteration.
The lessons map directly:
Five Rules for Production AI Agents
- Deploy constraints before features. Define what the agent can't do before expanding what it can.
- Design for manipulation resistance. If a customer can talk your agent into breaking policy, it will happen.
- Use specialised agents over general ones. Narrow scope with clear authority beats broad capability with vague oversight.
- Invest in tools, not just models. CRM access, inventory data, and payment systems matter more than which LLM you pick.
- Plan for the gap between demo and production. Budget 30–60 days of real-world hardening after your agent "works."
If you're deploying AI agents — whether for customer service, lead qualification, operations, or internal workflows — these aren't theoretical concerns. They're the exact problems you'll face in week two of production.
The question isn't whether your agents are smart enough. It's whether your deployment infrastructure is robust enough to keep them honest.