AI Usage & Model Economics

Token consumption, inferred costs, and capability benchmarks across the Hermes agent deployment.

576 sessions over 20 days. Actual spend: $22 (DeepSeek V4 Flash/Pro). The same token volume on premium models would cost 5–30× more. The feedback scaler shows how many retries of a cheap model you can buy for the price of one call to an expensive one.

Sessions: 576
Actual cost (DeepSeek): $22
If Claude Sonnet 4.6: $114
If GPT 5.1 Codex: $87

Legend: blue bars show daily token volume (left axis); lines show cumulative cost if the same volume were processed by each model (right axis). Green = actual DeepSeek cost. Orange = inferred Claude Sonnet 4.6 ($3/$15 per M tokens). Blue = inferred GPT 5.1 Codex ($1.25/$10 per M tokens). The gap between the green and orange lines is the premium tax: what you pay for top-tier model capability on the same token volume.

Cost by model — what-if projection

If the same 70,834 messages were processed by each model, the total cost would range from $22 (DeepSeek Flash) to $5,350 (Claude Opus), a spread of roughly 240×.
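A minimal sketch of the projection method, using the token model from the Methodology section (~500 prompt + ~300 completion tokens per exchange) and the per-million-token rates quoted there. The model list is truncated for brevity, and the outputs will not exactly reproduce the headline figures, since actual usage mixed Flash and Pro and the session-to-exchange mapping is approximate.

```python
# What-if projection: apply each model's per-token pricing to the same
# estimated token volume. Token assumptions come from the Methodology
# section; pricing pairs are (input $/M tokens, output $/M tokens).

MESSAGES = 70_834
PROMPT_TOK, OUTPUT_TOK = 500, 300  # per-exchange estimate

PRICING = {
    "DeepSeek V4 Flash": (0.14, 0.28),    # models cache
    "DeepSeek V4 Pro":   (1.74, 3.48),    # models cache
    "GPT 5.1 Codex":     (1.25, 10.00),   # OpenAI via nano-gpt
    "Claude Sonnet 4.6": (3.00, 15.00),   # 302ai
}

def what_if(messages: int, in_price: float, out_price: float) -> float:
    """Total dollars if the same exchanges all ran through this model."""
    return (messages * PROMPT_TOK * in_price
            + messages * OUTPUT_TOK * out_price) / 1e6

for model, (inp, outp) in PRICING.items():
    print(f"{model:18s} ${what_if(MESSAGES, inp, outp):,.2f}")
```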

Model capability: SWE-bench & agent leaderboards

Cost is meaningless without capability context. SWE-bench Verified measures a model's ability to resolve real GitHub issues; agent leaderboards add tool use, multi-step reasoning, and self-correction. The question: how many Flash retries equal one premium call?

Model                SWE-bench Verified   Cost/Task*   Flash retries to match
DeepSeek V4 Flash    ~32%                 $0.002       1× (baseline)
DeepSeek Reasoner    ~38%                 $0.002       1×
DeepSeek V4 Pro      ~45%                 $0.024       12×
GPT-4o Mini          ~28%                 $0.003       1.5×
Grok 3 Mini          ~30%                 $0.004       2×
GPT 5.1 Codex Mini   ~35%                 $0.005       2.5×
Gemini 2.5 Flash     ~34%                 $0.008       4×
o4-mini              ~42%                 $0.020       10×
o3                   ~48%                 $0.036       18×
GPT 5.1 Codex        ~52%                 $0.033       16×
Claude Sonnet 4.6    ~55%                 $0.060       30×
Claude Opus 4        ~62%                 $0.300       150×

*Cost per task assumes 10K prompt + 2K output tokens per task. SWE-bench Verified scores are approximate, taken from public leaderboards (swebench.com, May 2026); agent leaderboard rankings correlate strongly with SWE-bench scores. The Flash-retries column shows how many Flash iterations cost the same as one call to each model.
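The Cost/Task and Flash-retries columns follow directly from that footnote. A short worked check, using the per-million-token rates listed under Methodology (the Claude Opus 4 rate of $15/$75 is not stated in this document and is assumed here):

```python
# Derive Cost/Task and the Flash-retries multiplier from the footnote's
# fixed task shape: 10K prompt + 2K output tokens (0.010M and 0.002M).

def cost_per_task(in_price: float, out_price: float) -> float:
    """Dollar cost of one task at 10K prompt + 2K output tokens."""
    return 0.010 * in_price + 0.002 * out_price

flash  = cost_per_task(0.14, 0.28)    # ~$0.002  DeepSeek V4 Flash
sonnet = cost_per_task(3.00, 15.00)   # $0.060   Claude Sonnet 4.6
opus   = cost_per_task(15.00, 75.00)  # $0.300   Claude Opus 4 (assumed rates)

print(f"Sonnet: {sonnet / flash:.0f}x Flash retries")  # ~31x; ~30x after the
print(f"Opus:   {opus / flash:.0f}x Flash retries")    # table rounds Flash to $0.002
```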

The feedback scaler in practice: Flash at 32% SWE-bench can retry 30 times for the cost of one Claude Sonnet 4.6 call. If those 30 iterations use self-consistency, verification loops, and ensemble scoring, the effective capability can approach or exceed Sonnet's single-call 55%. The break-even is around 12–16× for premium coding models — which is well within reach of modern agentic workflows (reflection, self-debug, multi-turn refinement). The cheap model with enough tokens can close the gap.
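A minimal sketch of that retry loop, assuming a hypothetical `generate` (one Flash completion) and `passes_tests` (hard verification, e.g. applying a patch and running the repo's test suite). Both stubs are illustrative stand-ins, not Hermes or provider APIs:

```python
from collections import Counter

def generate(task: str) -> str:
    """Hypothetical: one cheap DeepSeek Flash completion for `task`."""
    raise NotImplementedError  # wire up to your provider's API

def passes_tests(candidate: str) -> bool:
    """Hypothetical: hard verification, e.g. run the test suite."""
    raise NotImplementedError

def solve_with_retries(task: str, budget: int = 30) -> str | None:
    """Spend up to `budget` cheap calls; ~30 Flash calls = one Sonnet call."""
    candidates: list[str] = []
    for _ in range(budget):
        candidate = generate(task)
        if passes_tests(candidate):   # verified solution: stop early
            return candidate
        candidates.append(candidate)
    if not candidates:
        return None
    # Fallback self-consistency: return the answer produced most often.
    return Counter(candidates).most_common(1)[0][0]
```

Hard verification is what makes the economics work: a retry only pays off if failures are detected cheaply. When no candidate verifies, majority voting recovers some signal from the remaining budget.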

Methodology

Costs are estimated from message count at ~800 tokens per exchange (500 prompt + 300 completion). Inferred costs apply each model's pricing to the same token volume. DeepSeek pricing from the models cache ($0.14/$0.28 per M tokens Flash, $1.74/$3.48 Pro); Claude Sonnet 4.6 at $3/$15 per M tokens (302ai); GPT 5.1 Codex at $1.25/$10 per M tokens (OpenAI via nano-gpt). Session data comes from local Hermes agent logs. Actual DeepSeek billing may differ due to prompt caching discounts.