AI Glossary

What is AI inference?

AI inference is the process of running an already-trained model on new input to produce an output, such as an answer, image, or prediction. Training teaches the model once; inference is every time you use it afterward. Each inference call consumes compute, which is why AI products meter and charge for usage.

Last updated June 2, 2026

Every time you ask an AI assistant a question and it answers, you are running inference. Inference is the "use" phase of a machine-learning model: the model takes input it has never seen before, runs it through the fixed parameters it learned during training, and returns an output. No learning happens during inference. The model's weights do not change; it simply applies what it already knows to your specific prompt.

How is inference different from training?

Training and inference are two separate phases with very different costs and frequencies. Training builds the model: it processes enormous datasets, adjusts billions of internal parameters, and happens once (or occasionally, when a model is updated). Inference uses the finished model: it runs millions or billions of times, once per request, and never changes the parameters. A useful analogy is education versus work. Training is the years of study; inference is answering each question on the job afterward.
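The difference can be made concrete with a toy model. The sketch below is only an illustration, assuming a one-weight model y = w * x fitted by plain gradient descent; the function names, learning rate, and data are made up for the example and do not describe how any production model is trained.

```python
# Toy model y = w * x with a single weight. A stand-in for a real network,
# which has billions of weights but goes through the same two phases.

def train(data, learning_rate=0.01, epochs=200):
    """Training: repeatedly adjust the weight to reduce squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            gradient = 2 * (w * x - y) * x  # d/dw of (w*x - y)**2
            w -= learning_rate * gradient   # the weight changes here
    return {"w": w}

def infer(model, x):
    """Inference: read the weight, never write it."""
    return model["w"] * x

# Training runs once, on known (x, y) pairs that follow y = 2x.
model = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
weight_after_training = model["w"]

# Inference runs once per request, on inputs that were not in the training data.
predictions = [infer(model, x) for x in (4.0, 5.0, 10.0)]

print(round(model["w"], 3))                 # close to 2.0
print(model["w"] == weight_after_training)  # True: inference left the weight untouched
```

The loop inside train is where all the expensive parameter updates happen; infer is a single cheap read of the result, repeated for every request.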

Aspect	Training	Inference
What it does	Learns parameters from data	Applies learned parameters to new input
How often it runs	Once, or rarely (model updates)	Every single request
Changes the model?	Yes, weights are updated	No, weights are fixed
Main cost driver	Massive one-time compute	Recurring per-request compute
Who pays for it	Model labs / builders	Everyone who uses the model

Training vs. inference at a glance

This split matters because the two costs land on different people. The labs that train frontier models absorb the gigantic one-time training cost. Anyone who builds a product on top of those models, including MiyoMind, pays the recurring inference cost for every message a user sends.

How does inference actually work?

For a large language model, inference happens token by token. Your prompt is converted into tokens (chunks of text), fed through the network, and the model predicts the next token, then the next, until the response is complete. Three things drive cost and speed:

Prefill, where the model reads and processes your entire input prompt at once. Longer prompts and more context cost more here.
Decode, where the model generates the output one token at a time. Longer answers cost more here, and this is what you watch when text streams in word by word.
Context window, the maximum amount of text (prompt plus response) the model can take into account in a single call. The more of that window a request fills, the more compute it needs.
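The three pieces above can be sketched as a toy generation loop. Everything here is a simplifying assumption: the NEXT_TOKEN lookup table stands in for a trained network, whitespace splitting stands in for a real tokenizer (which usually works on sub-word chunks), and the context window is counted in these toy tokens.

```python
# Hypothetical next-token table standing in for a trained network.
NEXT_TOKEN = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}

def generate(prompt, max_new_tokens=10, context_window=8):
    """Toy autoregressive generation: prefill once, then decode token by token."""
    # Prefill: the whole prompt is tokenized and read in one pass.
    prompt_tokens = prompt.split()
    if not prompt_tokens:
        raise ValueError("prompt must contain at least one token")
    if len(prompt_tokens) > context_window:
        raise ValueError("prompt does not fit in the context window")

    # Decode: predict one token at a time, each based on what came before.
    output = []
    while (len(output) < max_new_tokens
           and len(prompt_tokens) + len(output) < context_window):
        last_token = (prompt_tokens + output)[-1]
        next_token = NEXT_TOKEN.get(last_token, "<eos>")
        if next_token == "<eos>":  # the model signals the response is complete
            break
        output.append(next_token)
    return output

print(generate("the"))                                  # ['cat', 'sat', 'down']
print(generate("the", max_new_tokens=2))                # ['cat', 'sat']
print(generate("hello the cat", context_window=4))      # ['sat']
```

The last call stops after one token because prompt and output together have filled the four-token window, which is the same limit a real context window imposes at a much larger scale.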

Two practical properties fall out of this: latency (how long you wait for the answer) and cost (how much compute the call burned). Latency depends on prompt length, output length, the size of the model, and how busy the serving hardware is. Cost scales with the number of input and output tokens, which is exactly why nearly every AI API prices by tokens processed.
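As a rough sketch of how token counts turn into cost and latency, the helper functions below apply simple linear formulas. The prices and throughput figures are invented for illustration and are not any provider's real rates; model size and hardware load are assumed to be folded into the two tokens-per-second numbers.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      usd_per_million_input, usd_per_million_output):
    """Token-based pricing: input and output tokens are billed at separate rates."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000

def estimate_latency_s(input_tokens, output_tokens,
                       prefill_tokens_per_s, decode_tokens_per_s):
    """Latency as prefill time (reading the prompt) plus decode time (writing the answer)."""
    return (input_tokens / prefill_tokens_per_s
            + output_tokens / decode_tokens_per_s)

# Hypothetical request: a 2,000-token prompt that produces a 500-token answer.
cost = estimate_cost_usd(2_000, 500,
                         usd_per_million_input=3.0,    # assumed price
                         usd_per_million_output=15.0)  # assumed price
latency = estimate_latency_s(2_000, 500,
                             prefill_tokens_per_s=5_000,  # assumed throughput
                             decode_tokens_per_s=50)      # assumed throughput

print(f"cost: ${cost:.4f}")         # cost: $0.0135
print(f"latency: {latency:.1f} s")  # latency: 10.4 s
```

With these assumed numbers the 500 output tokens account for most of both the cost and the wait, which matches the intuition that long answers are the expensive part of a call.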

~90%: share of a deployed AI model's lifetime compute cost attributed to inference rather than training. Source: AWS, "Optimizing AI responsiveness," 2024

Why is inference metered and charged?

Inference is metered because it is a real, recurring expense, not a fixed one. Every answer consumes GPU or accelerator time, and that compute has to be paid for on each request, so ten thousand messages cost far more to serve than one. A subscription alone cannot cover unbounded usage, so AI products track how much each request actually costs and charge proportionally. This is the honest model: heavy users pay for heavy use, and light users do not subsidise them.

How does MiyoMind handle inference?

MiyoMind is a personal AI assistant you talk to inside WhatsApp, Telegram, Discord, and a web dashboard at miyomind.com. Behind every reply is an inference call to a frontier model. MiyoMind runs the open-source OpenClaw agent runtime, a model router called Hermes, and its own orchestration, memory, billing, safety, and routing code, then routes each request to the right model from OpenAI, Anthropic, Google, xAI, or Alibaba. Hermes picks the model based on the task, so a quick reminder does not pay for the most expensive model and a hard reasoning task gets the capacity it needs.
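The idea of task-based routing can be illustrated with a minimal sketch. This is not Hermes's actual logic, which the text does not describe beyond choosing a model based on the task; the model names, prices, and difficulty tiers below are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelOption:
    name: str                  # hypothetical model name
    usd_per_1k_tokens: float   # hypothetical price
    max_difficulty: int        # hardest task tier this model handles well

# Hypothetical catalog, ordered from cheapest to most capable.
CATALOG = [
    ModelOption("small-fast-model", 0.0005, 1),
    ModelOption("mid-tier-model", 0.003, 2),
    ModelOption("large-reasoning-model", 0.015, 3),
]

def route(task_difficulty):
    """Pick the cheapest model that is capable enough for the task."""
    capable = [m for m in CATALOG if m.max_difficulty >= task_difficulty]
    if not capable:
        raise ValueError(f"no model can handle difficulty {task_difficulty}")
    return min(capable, key=lambda m: m.usd_per_1k_tokens)

print(route(1).name)  # small-fast-model (e.g. a quick reminder)
print(route(3).name)  # large-reasoning-model (e.g. a hard reasoning task)
```

Choosing the cheapest capable option is one simple policy among many; a real router would also need some way to estimate task difficulty, which this sketch takes as a given input.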

Because inference has a genuine cost, MiyoMind meters it with credits. One credit is worth roughly $0.005 of value, and credits track the actual model and tool usage of each request rather than charging a flat fee per message. The plans line up with how much inference you expect to run:

Free, $0/mo, 100 credits every month, no card required, running on a shared direct-agent path with no dedicated container.
Plus, $14.99/mo, 6,000 credits/mo, with one dedicated, sandboxed container.
Pro, $39.99/mo, 18,000 credits/mo, with one dedicated, sandboxed container.
Top-up packs (600/$3, 2,000/$10, 5,000/$25, 10,000/$50) when you need more inference without changing plans.
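The arithmetic behind these figures can be checked with a few lines of code. The sketch below only divides the prices and credit amounts listed above and applies the stated value of roughly $0.005 per credit; the function and variable names are made up, and it is not a description of how MiyoMind's billing code works.

```python
CREDIT_VALUE_USD = 0.005  # "roughly", per the text above

# credits -> price in USD, as listed above
TOP_UP_PACKS = {600: 3.00, 2_000: 10.00, 5_000: 25.00, 10_000: 50.00}
# plan name -> (monthly price in USD, monthly credits), as listed above
PLANS = {"Free": (0.00, 100), "Plus": (14.99, 6_000), "Pro": (39.99, 18_000)}

def usd_per_credit(price_usd, credits):
    """What one credit costs the buyer at a given price point."""
    return price_usd / credits

def credit_value_usd(credits):
    """Approximate usage value of a credit balance at the stated rate."""
    return credits * CREDIT_VALUE_USD

for credits, price in TOP_UP_PACKS.items():
    print(f"top-up {credits}: ${usd_per_credit(price, credits):.4f} per credit")

for name, (price, credits) in PLANS.items():
    print(f"{name}: ${usd_per_credit(price, credits):.4f} per credit, "
          f"about ${credit_value_usd(credits):.2f} of usage value")
```

On these figures every top-up pack works out to $0.005 per credit, matching the stated credit value, while the credits included in the Plus and Pro plans cost less per credit than that.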

Paid users get their own isolated Docker container with no public internet egress, a read-only root filesystem, and zero stored external API keys, while integrations and memories are encrypted at rest with AES-256-GCM. The point of metering is fairness: you see your usage, the price reflects the compute each inference actually used, and there are no hidden surprises.

Frequently asked questions

What is AI inference in simple terms?

AI inference is the act of using a trained model to get an answer. You give it input it has not seen before, it runs that input through what it already learned, and it returns an output. Nothing about the model changes during inference; it only applies existing knowledge.

Is inference the same as training?

No. Training is the one-time process of teaching a model by adjusting its internal parameters on large datasets. Inference is using the finished model to produce outputs, and it runs every time you make a request. Training updates the model; inference does not.

Why does AI inference cost money?

Every inference call runs on real GPU or accelerator hardware that consumes power and compute time. That cost recurs with every request, so AI products meter usage and charge for it rather than offering unlimited compute for a flat fee.

What affects how fast AI inference is?

Latency depends mainly on how long your input prompt is, how long the output needs to be, the size of the model, and how busy the serving hardware is. Larger models and longer responses take more time, which is why answers often stream in token by token.

How does MiyoMind charge for inference?

MiyoMind uses credits that meter actual model and tool usage per request, where one credit is worth roughly $0.005 of value. Plans range from a free tier with 100 monthly credits to Pro at $39.99/mo with 18,000 credits, plus top-up packs for extra usage.

Does inference change or improve the AI model?

No. During inference the model's parameters stay fixed, so a model does not learn from your individual prompts at run time. Improvements come from a separate training or fine-tuning process that the model's builders run, not from everyday use.

