AI Usage & Model Economics

Token consumption, inferred costs, and capability benchmarks across the Hermes agent deployment.

576 sessions over 20 days. Actual spend: $22 (DeepSeek V4 Flash/Pro). The same token volume on premium models would cost 5–30× more. The feedback scaler shows how many retries of a cheap model you can buy for the price of one call to an expensive one.

Sessions: 576
Actual cost (DeepSeek): $22
If Claude Sonnet 4.6: $114
If GPT 5.1 Codex: $87

Legend: blue bars show daily token volume (left axis); lines show cumulative cost if the same volume were processed by each model (right axis). Green = actual DeepSeek cost. Orange = inferred Claude Sonnet 4.6 ($3/$15 per M tokens). Blue = inferred GPT 5.1 Codex ($1.25/$10 per M tokens). The gap between the green and orange lines is the premium tax: what you pay for top-tier model capability on the same token volume.

Cost by model — what-if projection

If the same 70,834 messages were processed by each model, the total cost would range from $22 (DeepSeek Flash) to $5,350 (Claude Opus), a spread of roughly 240×.
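A minimal sketch of the projection method, using the token model from the Methodology section (~500 prompt + ~300 completion tokens per exchange) and the per-million-token rates quoted there. The model list is truncated for brevity, and the outputs will not exactly reproduce the headline figures, since actual usage mixed Flash and Pro and the session-to-exchange mapping is approximate.

```python
# What-if projection: apply each model's per-token pricing to the same
# estimated token volume. Token assumptions come from the Methodology
# section; pricing pairs are (input $/M tokens, output $/M tokens).

MESSAGES = 70_834
PROMPT_TOK, OUTPUT_TOK = 500, 300  # per-exchange estimate

PRICING = {
    "DeepSeek V4 Flash": (0.14, 0.28),    # models cache
    "DeepSeek V4 Pro":   (1.74, 3.48),    # models cache
    "GPT 5.1 Codex":     (1.25, 10.00),   # OpenAI via nano-gpt
    "Claude Sonnet 4.6": (3.00, 15.00),   # 302ai
}

def what_if(messages: int, in_price: float, out_price: float) -> float:
    """Total dollars if the same exchanges all ran through this model."""
    return (messages * PROMPT_TOK * in_price
            + messages * OUTPUT_TOK * out_price) / 1e6

for model, (inp, outp) in PRICING.items():
    print(f"{model:18s} ${what_if(MESSAGES, inp, outp):,.2f}")
```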

Model capability: SWE-bench & agent leaderboards

Cost is meaningless without capability context. SWE-bench Verified measures a model's ability to resolve real GitHub issues; agent leaderboards add tool use, multi-step reasoning, and self-correction. The question: how many Flash retries equal one premium call?

Model                SWE-bench Verified   Cost/Task*   Flash retries to match
DeepSeek V4 Flash    ~32%                 $0.002       1× (baseline)
DeepSeek Reasoner    ~38%                 $0.002       1×
DeepSeek V4 Pro      ~45%                 $0.024       12×
GPT-4o Mini          ~28%                 $0.003       1.5×
Grok 3 Mini          ~30%                 $0.004       2×
GPT 5.1 Codex Mini   ~35%                 $0.005       2.5×
Gemini 2.5 Flash     ~34%                 $0.008       4×
o4-mini              ~42%                 $0.020       10×
o3                   ~48%                 $0.036       18×
GPT 5.1 Codex        ~52%                 $0.033       16×
Claude Sonnet 4.6    ~55%                 $0.060       30×
Claude Opus 4        ~62%                 $0.300       150×

*Cost per task assumes 10K prompt + 2K output tokens per task. SWE-bench Verified scores are approximate, taken from public leaderboards (swebench.com, May 2026); agent leaderboard rankings correlate strongly with SWE-bench scores. The Flash-retries column shows how many Flash iterations cost the same as one call to each model.
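The Cost/Task and Flash-retries columns follow directly from that footnote. A short worked check, using the per-million-token rates listed under Methodology (the Claude Opus 4 rate of $15/$75 is not stated in this document and is assumed here):

```python
# Derive Cost/Task and the Flash-retries multiplier from the footnote's
# fixed task shape: 10K prompt + 2K output tokens (0.010M and 0.002M).

def cost_per_task(in_price: float, out_price: float) -> float:
    """Dollar cost of one task at 10K prompt + 2K output tokens."""
    return 0.010 * in_price + 0.002 * out_price

flash  = cost_per_task(0.14, 0.28)    # ~$0.002  DeepSeek V4 Flash
sonnet = cost_per_task(3.00, 15.00)   # $0.060   Claude Sonnet 4.6
opus   = cost_per_task(15.00, 75.00)  # $0.300   Claude Opus 4 (assumed rates)

print(f"Sonnet: {sonnet / flash:.0f}x Flash retries")  # ~31x; ~30x after the
print(f"Opus:   {opus / flash:.0f}x Flash retries")    # table rounds Flash to $0.002
```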

The feedback scaler in practice: Flash at 32% SWE-bench can retry 30 times for the cost of one Claude Sonnet 4.6 call. If those 30 iterations use self-consistency, verification loops, and ensemble scoring, the effective capability can approach or exceed Sonnet's single-call 55%. The break-even is around 12–16× for premium coding models — which is well within reach of modern agentic workflows (reflection, self-debug, multi-turn refinement). The cheap model with enough tokens can close the gap.
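A minimal sketch of that retry loop, assuming a hypothetical `generate` (one Flash completion) and `passes_tests` (hard verification, e.g. applying a patch and running the repo's test suite). Both stubs are illustrative stand-ins, not Hermes or provider APIs:

```python
from collections import Counter

def generate(task: str) -> str:
    """Hypothetical: one cheap DeepSeek Flash completion for `task`."""
    raise NotImplementedError  # wire up to your provider's API

def passes_tests(candidate: str) -> bool:
    """Hypothetical: hard verification, e.g. run the test suite."""
    raise NotImplementedError

def solve_with_retries(task: str, budget: int = 30) -> str | None:
    """Spend up to `budget` cheap calls; ~30 Flash calls = one Sonnet call."""
    candidates: list[str] = []
    for _ in range(budget):
        candidate = generate(task)
        if passes_tests(candidate):   # verified solution: stop early
            return candidate
        candidates.append(candidate)
    if not candidates:
        return None
    # Fallback self-consistency: return the answer produced most often.
    return Counter(candidates).most_common(1)[0][0]
```

Hard verification is what makes the economics work: a retry only pays off if failures are detected cheaply. When no candidate verifies, majority voting recovers some signal from the remaining budget.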

Methodology

Costs are estimated from message count at ~800 tokens per exchange (500 prompt + 300 completion). Inferred costs apply each model's pricing to the same token volume. DeepSeek pricing from the models cache ($0.14/$0.28 per M tokens Flash, $1.74/$3.48 Pro); Claude Sonnet 4.6 at $3/$15 per M tokens (302ai); GPT 5.1 Codex at $1.25/$10 per M tokens (OpenAI via nano-gpt). Session data comes from local Hermes agent logs. Actual DeepSeek billing may differ due to prompt caching discounts.