Engineering · March 11, 2026

How we cut API costs from $800/day to $150/day

Running 27 AI agents across 3 machines around the clock costs real money. Here's exactly how we did it — and why most teams are spending 5x more than they need to.

The problem

When we started running multiple agents on Reflectt, every agent used the same model — Claude Opus. It's a great model. It's also expensive when you're running 27 agents making hundreds of API calls per day.

At peak, we hit $800/day in API costs. For a bootstrapped team, that's not sustainable. We either had to reduce agents or get smarter about which model handles what.

We got smarter.

The insight: not every task needs your best model

Think about how a human team works. You don't put your senior architect on every PR review. Junior devs handle the routine stuff. Senior engineers handle the architecture decisions. The CEO signs off on the big calls.

Same logic applies to AI agents. We identified three tiers:

Tier 1: Judgment calls → Claude Opus

Code reviews, architecture decisions, PR approvals, anything where the agent needs to reason about tradeoffs. This is maybe 10-15% of total API calls, but it's the 10% where quality matters most.

Tier 2: Medium complexity → Claude Sonnet

Feature implementation, bug fixes, documentation, refactoring. The bulk of day-to-day coding work. Sonnet handles this well at a fraction of Opus cost.

Tier 3: Grunt work → gpt-codex

Tests, scaffolding, formatting, repetitive changes, file moves, boilerplate. Fast, cheap, and perfectly adequate for structured tasks where the instructions are clear.

How to implement it

In OpenClaw, you set a default model and override per agent:

openclaw.json (per-agent model config)

{
  "default_model": "anthropic/claude-sonnet-4-6",
  "agents": {
    "reviewer": {
      "model": "anthropic/claude-opus-4-6"
    },
    "builder": {
      "model": "openai-codex/gpt-5.3-codex"
    }
  }
}

The key insight: your reviewer agent is where model quality matters most. If the reviewer catches mistakes, the builder can be cheaper and faster. Defense in depth.

The results

$800

Before (all Opus)

$150

After (tiered models)

81% cost reduction with no measurable quality loss
Faster execution — gpt-codex is faster than Opus for simple tasks
Better quality on judgment calls — Opus gets more context budget when it only handles the important stuff

What we learned

The reviewer matters more than the builder. Spend your model budget on the agent that catches mistakes, not the one that makes them.
Start cheap, upgrade when it fails. Default to Sonnet. Only escalate to Opus for roles where you've seen quality issues.
Track per-agent costs. You can't optimize what you can't measure. We tag every API call with the agent name.
Negative constraints reduce cost too. Adding “I never use Opus for formatting tasks” to a SOUL.md is surprisingly effective.

Try it

If you're running agents on OpenClaw and spending more than you want, model tiering is the single highest-ROI change you can make. It takes 5 minutes to configure and saves hundreds per day.

Self-host Reflectt for free → — then set your model tiers and watch the bill drop.