How we cut API costs from $800/day to $150/day
Running 27 AI agents across 3 machines around the clock costs real money. Here's exactly how we cut the bill, and why most teams are spending 5x more than they need to.
The problem
When we started running multiple agents on Reflectt, every agent used the same model: Claude Opus. It's a great model. It's also expensive when you're running 27 agents making hundreds of API calls per day.
At peak, we hit $800/day in API costs. For a bootstrapped team, that's not sustainable. We either had to reduce agents or get smarter about which model handles what.
We got smarter.
The insight: not every task needs your best model
Think about how a human team works. You don't put your senior architect on every PR review. Junior devs handle the routine stuff. Senior engineers handle the architecture decisions. The CEO signs off on the big calls.
Same logic applies to AI agents. We identified three tiers:
Tier 1: Judgment calls (Claude Opus)
Code reviews, architecture decisions, PR approvals, anything where the agent needs to reason about tradeoffs. This is maybe 10-15% of total API calls, but it's the slice where quality matters most.
Tier 2: Medium complexity (Claude Sonnet)
Feature implementation, bug fixes, documentation, refactoring. The bulk of day-to-day coding work. Sonnet handles this well at a fraction of the cost of Opus.
Tier 3: Grunt work (gpt-codex)
Tests, scaffolding, formatting, repetitive changes, file moves, boilerplate. Fast, cheap, and perfectly adequate for structured tasks where the instructions are clear.
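Here's a minimal sketch of what that tiering could look like as a lookup table. The task labels and the `model_for_task` helper are hypothetical names made up for illustration, and the model identifiers are the ones from the config example later in this post. Treat it as one way to express the idea, not as how any particular framework routes work.

```python
from enum import Enum


class Tier(Enum):
    JUDGMENT = 1  # reasoning about tradeoffs
    MEDIUM = 2    # day-to-day coding work
    GRUNT = 3     # structured, clearly specified tasks


# Model identifiers as they appear in the config example in this post.
TIER_MODELS = {
    Tier.JUDGMENT: "anthropic/claude-opus-4-6",
    Tier.MEDIUM: "anthropic/claude-sonnet-4-6",
    Tier.GRUNT: "openai-codex/gpt-5.3-codex",
}

# Hypothetical task labels, chosen to mirror the three tier descriptions.
TASK_TIERS = {
    "code_review": Tier.JUDGMENT,
    "architecture_decision": Tier.JUDGMENT,
    "pr_approval": Tier.JUDGMENT,
    "feature_implementation": Tier.MEDIUM,
    "bug_fix": Tier.MEDIUM,
    "documentation": Tier.MEDIUM,
    "refactoring": Tier.MEDIUM,
    "tests": Tier.GRUNT,
    "scaffolding": Tier.GRUNT,
    "formatting": Tier.GRUNT,
    "boilerplate": Tier.GRUNT,
}


def model_for_task(task_type: str) -> str:
    """Return the model for a task type; unknown tasks get the middle tier."""
    tier = TASK_TIERS.get(task_type, Tier.MEDIUM)
    return TIER_MODELS[tier]
```

Defaulting unknown task types to the middle tier is a design choice in this sketch: it avoids paying top-tier prices by accident while keeping unclassified work away from the cheapest model.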
How to implement it
In OpenClaw, you set a default model and override per agent:
    {
      "default_model": "anthropic/claude-sonnet-4-6",
      "agents": {
        "reviewer": {
          "model": "anthropic/claude-opus-4-6"
        },
        "builder": {
          "model": "openai-codex/gpt-5.3-codex"
        }
      }
    }

The key insight: your reviewer agent is where model quality matters most. If the reviewer catches mistakes, the builder can be cheaper and faster. Defense in depth.
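To make the default-plus-override behavior concrete, here is a small sketch that parses the same JSON and resolves a model per agent. The `resolve_model` function and the `docs-writer` agent name are hypothetical, and the fallback rule (use `default_model` when an agent has no `model` entry) is our reading of the config, not OpenClaw's actual resolution code.

```python
import json

# The config from this post, embedded as a string so the sketch is self-contained.
CONFIG_TEXT = """
{
  "default_model": "anthropic/claude-sonnet-4-6",
  "agents": {
    "reviewer": {"model": "anthropic/claude-opus-4-6"},
    "builder": {"model": "openai-codex/gpt-5.3-codex"}
  }
}
"""


def resolve_model(config: dict, agent_name: str) -> str:
    """Return the agent's override if present, otherwise the default model.

    Assumption: agents without an entry, or without a "model" key,
    fall back to "default_model".
    """
    agent = config.get("agents", {}).get(agent_name, {})
    return agent.get("model", config["default_model"])


config = json.loads(CONFIG_TEXT)

for name in ("reviewer", "builder", "docs-writer"):  # "docs-writer" is hypothetical
    print(f"{name}: {resolve_model(config, name)}")
```

With this rule, adding a new agent costs nothing in configuration: it runs on the default model until you have a reason to move it up or down a tier.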
The results
- 81% cost reduction with no measurable quality loss
- Faster execution: gpt-codex is faster than Opus for simple tasks
- Better quality on judgment calls: Opus gets more context budget when it only handles the important stuff
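The headline percentage follows directly from the two daily figures quoted in this post. A quick check, using only those numbers (the `cost_reduction` helper name is ours):

```python
def cost_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


before_per_day = 800  # peak daily API cost in USD, from this post
after_per_day = 150   # daily API cost after tiering, from this post

reduction = cost_reduction(before_per_day, after_per_day)
daily_savings = before_per_day - after_per_day
ratio = before_per_day / after_per_day

print(f"reduction: {reduction:.2%}")       # 81.25%
print(f"daily savings: ${daily_savings}")  # $650
print(f"before/after ratio: {ratio:.1f}x") # 5.3x
```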
What we learned
- The reviewer matters more than the builder. Spend your model budget on the agent that catches mistakes, not the one that makes them.
- Start cheap, upgrade when it fails. Default to Sonnet. Only escalate to Opus for roles where you've seen quality issues.
- Track per-agent costs. You can't optimize what you can't measure. We tag every API call with the agent name.
- Negative constraints reduce cost too. Adding "I never use Opus for formatting tasks" to a SOUL.md is surprisingly effective.
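Per-agent tracking doesn't need much machinery. The sketch below tags each call with an agent name and accumulates cost from token counts. Everything in it is illustrative: `CostTracker`, `Price`, the model labels, and especially the prices are placeholders rather than real provider rates, and it assumes your API client reports input and output token counts per call.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens. Values used below are placeholders."""
    input_per_mtok: float
    output_per_mtok: float


# Placeholder prices and model labels: substitute your provider's real rates.
PRICES = {
    "model-expensive": Price(input_per_mtok=10.0, output_per_mtok=50.0),
    "model-cheap": Price(input_per_mtok=1.0, output_per_mtok=5.0),
}


class CostTracker:
    """Accumulates API spend keyed by agent name."""

    def __init__(self, prices: dict):
        self.prices = prices
        self.totals = defaultdict(float)

    def record(self, agent: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        """Add one call's cost to the agent's running total and return it."""
        price = self.prices[model]
        cost = (input_tokens * price.input_per_mtok
                + output_tokens * price.output_per_mtok) / 1_000_000
        self.totals[agent] += cost
        return cost

    def report(self) -> list:
        """Return (agent, total_cost) pairs, most expensive first."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)


tracker = CostTracker(PRICES)
tracker.record("reviewer", "model-expensive", 1_000_000, 100_000)
tracker.record("builder", "model-cheap", 2_000_000, 200_000)

for agent, total in tracker.report():
    print(f"{agent}: ${total:.2f}")
```

Sorting the report by spend puts the agents worth re-tiering at the top, which is usually the first thing you want to see.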
Try it
If you're running agents on OpenClaw and spending more than you want, model tiering is the single highest-ROI change you can make. It takes 5 minutes to configure and saves hundreds per day.
Self-host Reflectt for free, then set your model tiers and watch the bill drop.