Hermes — how we pick the right frontier-lab model for every message
MiyoMind is not a wrapper around one model. Hermes is our routing layer — for each message it weighs complexity, latency, modality, and cost, then picks the right brain.
Most AI products pick a model and ship. They wrap GPT or Claude or Gemini, slap a UI on top, and charge a markup. The user gets one brain for every message — a "Hi, what time is it?" and a "Read this 80-page contract and find the indemnity clauses" both go to the same place.
That is wasteful at the bottom and weak at the top. A flagship model on a one-line greeting is paying to think about nothing. A small model on a hard reasoning task gives you a confidently wrong answer.
Hermes is the layer that fixes this. For each message it weighs complexity, latency budget, modality, and credit cost, then picks the right model from a stable of frontier-lab options.
The tiers
We organise models into three tiers. Hermes picks one, runs the call, and finalises the credit cost based on actual usage.
Low — fast, cheap, good enough
- Gemini 3 Flash — Google's low-latency workhorse.
- Qwen 3.6 Pro — strong open-weights option for routine tasks.
- Grok 4 Fast — fast and cheeky for quick replies.
Medium — the everyday brain
- Gemini 3.1 Pro — broad, capable, well-priced.
- GPT 5.4 Mini — tight reasoning at moderate cost.
High — when the task deserves it
- Claude Sonnet 4.6 — our default for nuanced writing and analysis.
- Claude Opus 4.6 — heavyweight reasoning, used sparingly.
- GPT 5.4 — frontier-class generalist.
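If you want to picture the tiers above as data, here is a minimal sketch of the catalog in TypeScript. The field names, capability flags, and model ids are illustrative, not our production schema.

```typescript
type Tier = "low" | "medium" | "high";

interface CatalogEntry {
  id: string;      // provider/model id sent upstream (illustrative)
  vision: boolean; // can this model accept images?
  notes: string;
}

// Adding a model the labs ship tomorrow is one new entry here;
// nothing downstream needs to change.
const CATALOG: Record<Tier, CatalogEntry[]> = {
  low: [
    { id: "google/gemini-3-flash", vision: true,  notes: "low-latency workhorse" },
    { id: "qwen/qwen-3.6-pro",     vision: false, notes: "open-weights, routine tasks" },
    { id: "x-ai/grok-4-fast",      vision: false, notes: "fast, quick replies" },
  ],
  medium: [
    { id: "google/gemini-3.1-pro", vision: true,  notes: "broad, well-priced" },
    { id: "openai/gpt-5.4-mini",   vision: true,  notes: "tight reasoning, moderate cost" },
  ],
  high: [
    { id: "anthropic/claude-sonnet-4.6", vision: true, notes: "default for nuanced writing" },
    { id: "anthropic/claude-opus-4.6",   vision: true, notes: "heavyweight reasoning" },
    { id: "openai/gpt-5.4",              vision: true, notes: "frontier-class generalist" },
  ],
};
```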
How the decision is made
Hermes looks at four signals before picking a model.
- Complexity — message length, presence of code or math, number of follow-up turns, whether the user is asking for analysis or just a fact.
- Latency budget — voice messages and live chat want sub-second responses; long-running research can take longer.
- Modality — images go to a vision-capable model, voice transcripts go through Whisper first, and pure text takes the cheapest viable option.
- Credit cost — the user's remaining balance, their plan tier, and the per-credit rack rate. We never silently choose a model that would empty a free user's monthly allowance on one message.
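Building on the catalog sketch above, the decision over these four signals can be read as a small scoring function. The thresholds, the balance cutoff, and the tie-break are illustrative stand-ins for the real logic, not the production values.

```typescript
// Illustrative routing signals; names and thresholds are assumptions.
interface Signals {
  complexity: number;        // 0..1, from length, code/math detection, turn count
  latencyBudgetMs: number;   // voice and live chat vs. long-running research
  hasImage: boolean;         // modality: images need a vision-capable model
  remainingCredits: number;  // user balance, in credits
}

function pickTier(s: Signals): Tier {
  // A hard latency budget caps the tier: fast replies come from fast models.
  if (s.latencyBudgetMs < 1000) return "low";

  // Complexity pushes upward; simple facts stay cheap.
  let tier: Tier = s.complexity > 0.7 ? "high" : s.complexity > 0.3 ? "medium" : "low";

  // Never silently drain a small balance on one message: demote instead.
  if (tier === "high" && s.remainingCredits < 50) tier = "medium";

  return tier;
}

function pickModel(s: Signals): CatalogEntry {
  const candidates = CATALOG[pickTier(s)];
  // Modality filter: images require a vision-capable entry.
  const viable = s.hasImage ? candidates.filter((m) => m.vision) : candidates;
  return viable[0]; // in practice: weighted by price and recent provider health
}
```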
Cost discipline
A credit is one half of a cent. The API cost behind it averages around 0.15 cents, roughly a 3.3× markup, or a gross margin of about 70%. That margin pays for hosting, voice transcription, search, image generation, support, and the long road of a product company.
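The arithmetic, in cents:

```typescript
const CREDIT_PRICE_CENTS = 0.5;  // one credit = half a cent
const AVG_API_COST_CENTS = 0.15; // average upstream cost per credit charged

const markup = CREDIT_PRICE_CENTS / AVG_API_COST_CENTS;                              // ≈ 3.33×
const grossMargin = (CREDIT_PRICE_CENTS - AVG_API_COST_CENTS) / CREDIT_PRICE_CENTS;  // 0.70
```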
For each LLM call we pre-deduct ten credits before we know the answer. When the response comes back, we read the actual token usage from OpenRouter, compute the real cost, and finalise the ledger entry. Surplus refunds go back the same minute. If OpenRouter does not report usage, we refund the full pre-deduct rather than guess.
There is no one-credit floor. A tiny call that genuinely costs a fraction of a credit gets billed for that fraction. The ledger is append-only — every charge, every refund, every adjustment is a row you can see in your dashboard.
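Here is a sketch of that pre-deduct-then-finalise flow. The ledger shape, the usage fields, the per-token rate, and how an overage beyond the pre-deduct is handled are our assumptions for illustration; the invariants are the ones above: pre-deduct ten credits, finalise from reported usage, refund in full when usage is missing, bill fractions with no floor, and only ever append.

```typescript
type LedgerRow = { callId: string; kind: "charge" | "refund"; credits: number; at: Date };

const PRE_DEDUCT_CREDITS = 10;
const ledger: LedgerRow[] = []; // append-only: rows are never edited or deleted

function append(row: LedgerRow): void {
  ledger.push(row);
}

// Illustrative: convert reported token usage to credits via a per-model rate.
function creditsFor(
  usage: { prompt_tokens: number; completion_tokens: number },
  creditsPerToken: number,
): number {
  return (usage.prompt_tokens + usage.completion_tokens) * creditsPerToken; // fractions allowed, no floor
}

async function billedCall(
  callId: string,
  run: () => Promise<{ usage?: { prompt_tokens: number; completion_tokens: number } }>,
  creditsPerToken: number,
) {
  // 1. Pre-deduct before we know the answer.
  append({ callId, kind: "charge", credits: PRE_DEDUCT_CREDITS, at: new Date() });

  const response = await run();

  if (!response.usage) {
    // 2a. No usage reported: refund the full pre-deduct rather than guess.
    append({ callId, kind: "refund", credits: PRE_DEDUCT_CREDITS, at: new Date() });
    return response;
  }

  // 2b. Finalise against actual usage and refund the surplus.
  const realCost = creditsFor(response.usage, creditsPerToken);
  const surplus = PRE_DEDUCT_CREDITS - realCost;
  if (surplus > 0) {
    append({ callId, kind: "refund", credits: surplus, at: new Date() });
  } else if (surplus < 0) {
    // Assumption: an overage beyond the pre-deduct is appended as an extra charge.
    append({ callId, kind: "charge", credits: -surplus, at: new Date() });
  }
  return response;
}
```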
“Routing is the boring part of an AI product. It is also where most of the wins live.”
We will keep adding models as the labs ship new ones. The routing layer is designed to absorb that — new model in the catalog, new entry in the tier table, no code changes downstream.