AI Usage & Model Economics
576 sessions over 20 days. Actual spend $22 (DeepSeek V4 Flash/Pro). Same token volume on premium models would cost 5–30× more. The feedback scaler shows how many retries of a cheap model equal one call of an expensive one.
Legend: Blue bars = daily token volume (left axis). Lines = cumulative cost if the same volume were processed by each model (right axis). Green = actual DeepSeek cost. Orange = inferred Claude Sonnet 4.6 ($3/$15 per M tokens). Blue = inferred GPT 5.1 Codex ($1.25/$10 per M tokens). The gap between the green and orange lines is the premium tax: what you pay for top-tier model capability on the same token volume.
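As a minimal sketch, the cumulative inferred-cost lines can be reproduced by running a cumulative sum of daily token volume through each model's blended price. The daily volumes below are placeholders (the real values come from the agent logs), and the blended per-million price assumes the 500/300 prompt/completion split from the methodology.

```python
from itertools import accumulate

# Hypothetical daily token volumes in millions -- placeholder data only.
daily_tokens_m = [2.1, 3.4, 1.8, 4.0, 2.9]

# Blended USD per million tokens, assuming the 500 prompt / 300 completion
# split described in the methodology: (500*p_in + 300*p_out) / 800.
def blended(p_in: float, p_out: float) -> float:
    return (500 * p_in + 300 * p_out) / 800

models = {
    "DeepSeek V4 Flash": blended(0.14, 0.28),
    "Claude Sonnet 4.6": blended(3.00, 15.00),
    "GPT 5.1 Codex": blended(1.25, 10.00),
}

# One cumulative-cost line per model over the same token volume.
for name, price in models.items():
    line = [round(tok_m * price, 2) for tok_m in accumulate(daily_tokens_m)]
    print(name, line)
```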
Cost by model: what-if projection
If the same 70,834 messages were processed by each model, the total cost would range from $22 (DeepSeek Flash) to $5,350 (Claude Opus), a roughly 240× spread between the cheapest and most expensive model.
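A sketch of this projection, assuming the ~800-token exchange (500 prompt / 300 completion) from the methodology. Claude Opus 4 pricing ($15/$75 per M tokens) is inferred from the $0.30 cost-per-task figure in the table below rather than stated in the methodology, and totals are sensitive to the per-exchange token assumption, so expect them to differ from billed figures.

```python
MESSAGES = 70_834
PROMPT_TOK, OUTPUT_TOK = 500, 300  # per exchange, per the methodology

# (prompt, completion) price in USD per million tokens
PRICES = {
    "DeepSeek V4 Flash": (0.14, 0.28),
    "DeepSeek V4 Pro": (1.74, 3.48),
    "GPT 5.1 Codex": (1.25, 10.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4": (15.00, 75.00),  # inferred from the $0.30/task figure
}

for model, (p_in, p_out) in PRICES.items():
    total = MESSAGES * (PROMPT_TOK * p_in + OUTPUT_TOK * p_out) / 1e6
    print(f"{model:18s} ${total:>10,.2f}")
```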
Model capability: SWE-bench & agent leaderboards
Cost is meaningless without capability context. SWE-bench Verified measures a model's ability to resolve real GitHub issues. Agent leaderboards add tool-use, multi-step reasoning, and self-correction. The question: how many Flash retries equal one premium call?
| Model | SWE-bench Verified | Cost/Task* | Flash retries to match |
|---|---|---|---|
| DeepSeek V4 Flash | ~32% | $0.002 | — |
| DeepSeek Reasoner | ~38% | $0.002 | 1× |
| DeepSeek V4 Pro | ~45% | $0.024 | 12× |
| GPT-4o Mini | ~28% | $0.003 | 1.5× |
| Grok 3 Mini | ~30% | $0.004 | 2× |
| GPT 5.1 Codex Mini | ~35% | $0.005 | 2.5× |
| Gemini 2.5 Flash | ~34% | $0.008 | 4× |
| o4-mini | ~42% | $0.020 | 10× |
| o3 | ~48% | $0.036 | 18× |
| GPT 5.1 Codex | ~52% | $0.033 | 16× |
| Claude Sonnet 4.6 | ~55% | $0.060 | 30× |
| Claude Opus 4 | ~62% | $0.300 | 150× |
*Cost per task assumes 10K prompt + 2K output tokens at each model's per-million pricing. SWE-bench Verified scores are approximate, taken from public leaderboards (swebench.com, May 2026). Agent leaderboard rankings correlate strongly with SWE-bench scores. The Flash-retries column shows how many Flash calls cost the same as one call of the given model.
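The two derived columns can be reproduced from pricing alone. A sketch, using the 10K/2K task shape from the footnote; the table rounds the resulting ratios (e.g. 30.6× prints as 30×):

```python
# USD per million tokens: (prompt, completion)
PRICES = {
    "DeepSeek V4 Flash": (0.14, 0.28),
    "DeepSeek V4 Pro": (1.74, 3.48),
    "GPT 5.1 Codex": (1.25, 10.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4": (15.00, 75.00),
}

def cost_per_task(p_in: float, p_out: float,
                  prompt_tok: int = 10_000, output_tok: int = 2_000) -> float:
    """Cost of one task: 10K prompt + 2K output tokens, per the footnote."""
    return (prompt_tok * p_in + output_tok * p_out) / 1e6

flash = cost_per_task(*PRICES["DeepSeek V4 Flash"])
for model, (p_in, p_out) in PRICES.items():
    c = cost_per_task(p_in, p_out)
    print(f"{model:18s} ${c:.3f}/task  {c / flash:5.1f}x Flash retries")
```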
The feedback scaler in practice: Flash at 32% SWE-bench can retry 30 times for the cost of one Claude Sonnet 4.6 call. If those 30 iterations use self-consistency, verification loops, and ensemble scoring, the effective capability can approach or exceed Sonnet's single-call 55%. The break-even sits around 12–16× for premium coding models, well within reach of modern agentic workflows (reflection, self-debug, multi-turn refinement). A cheap model with enough tokens can close the gap.
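To make the break-even claim concrete, here is the standard pass@k-style calculation under one loud assumption: attempts are independent and a verifier reliably recognizes a correct patch when it appears. Real verification is noisy and failures are correlated, so treat this as an upper bound on what retries buy.

```python
import math

def p_at_least_one(p_single: float, k: int) -> float:
    """P(>=1 success in k independent attempts) = 1 - (1 - p)^k."""
    return 1 - (1 - p_single) ** k

def retries_to_match(p_single: float, p_target: float) -> int:
    """Smallest k such that 1 - (1 - p_single)^k >= p_target."""
    return math.ceil(math.log(1 - p_target) / math.log(1 - p_single))

flash, sonnet = 0.32, 0.55  # approximate SWE-bench Verified scores from the table

k = retries_to_match(flash, sonnet)
print(k, p_at_least_one(flash, k))
# Under the perfect-verifier assumption, k = 3 retries already clear 55%
# (1 - 0.68**3 ~= 0.686), far below the 30-retry cost budget. A noisy
# verifier and correlated failures push k up in practice.
```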
Methodology
Costs are estimated from message count at ~800 tokens per exchange (500 prompt + 300 completion). Inferred costs apply each model's pricing to the same token volume. DeepSeek pricing from the models cache ($0.14/$0.28 per M tokens Flash, $1.74/$3.48 Pro). Claude Sonnet 4.6 at $3/$15 per M tokens (302ai). GPT 5.1 Codex at $1.25/$10 per M tokens (OpenAI via nano-gpt). Session data from local Hermes agent logs. Actual DeepSeek billing may differ due to prompt caching discounts.
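As a rough illustration of the caching caveat, the effective prompt price can be modeled as a mix of cache-hit and cache-miss billing. Both the hit rate and the cache-hit discount below are hypothetical placeholders, not DeepSeek's published numbers.

```python
def effective_prompt_price(base: float, hit_rate: float, cache_discount: float) -> float:
    """Blend cache-hit and cache-miss prompt pricing (USD per M tokens)."""
    return hit_rate * base * cache_discount + (1 - hit_rate) * base

FLASH_PROMPT = 0.14      # USD per M prompt tokens, from the methodology
HIT_RATE = 0.50          # hypothetical: half of prompt tokens hit the cache
CACHE_DISCOUNT = 0.10    # hypothetical: cached tokens billed at 10% of base

print(effective_prompt_price(FLASH_PROMPT, HIT_RATE, CACHE_DISCOUNT))
# ~0.077 USD/M: with these placeholder numbers, prompt-side spend drops ~45%,
# which is one reason estimated and billed totals diverge.
```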