What your code actually talks to — and the properties of language models that make intelligent routing both necessary and possible.
When a user opens Claude.ai or ChatGPT in a browser, they see a polished interface that remembers their history, manages sessions, and presents a coherent conversation. None of that exists at the API layer.
When your code calls an LLM API, you are talking to a stateless mathematical function. You send a payload. You get a response. That's it. The intelligence is real. The memory is not. All of the session management, history tracking, and continuity you experience in a chat UI are built by the product layer sitting on top of the raw API — not by the model itself.
Understanding this is the foundation for everything else. The API is simpler, more powerful, and more honest than the UI suggests.
You send a JSON payload with a list of messages (each with a role and content) and some parameters. The model reads the entire payload, generates a response, and returns it. Nothing is retained between calls. The model has no idea a previous call ever happened unless you include it in the payload.
Language models are stateless. This is not a limitation — it is a design property. When you call the API, the model processes whatever is in the payload and returns a result. It has no memory of any previous call, no accumulated state, no understanding that you called it thirty seconds ago with a related question.
To simulate a conversation, you pass the entire history on every turn. Turn one: you send the system prompt and the user's first message. Turn two: you send the system prompt, the first user message, the model's first response, and the new user message. Turn three: all of the above, plus turn two. And so on.
The practical implication: as a conversation grows, so does the payload. A long multi-turn session eventually sends thousands of tokens worth of history on every single turn — just to reconstruct the context the model needs to respond sensibly.
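The turn-by-turn pattern above can be sketched in a few lines of Python. The `build_payload` helper and the message shapes are illustrative, not any particular provider's SDK:

```python
# Minimal sketch of a client rebuilding conversation state on every call.
# `build_payload` and the message shapes are illustrative, not a real SDK.

def build_payload(system_prompt, history, new_user_message):
    """Assemble the full message list for this turn. Nothing lives
    server-side, so the entire history travels with every request."""
    messages = list(history)
    messages.append({"role": "user", "content": new_user_message})
    return {"system": system_prompt, "messages": messages}

history = []

# Turn one: just the new question.
turn1 = build_payload("You are a helpful assistant.", history, "What is a token?")

# Pretend the API returned a reply, and record both sides in history.
history.append({"role": "user", "content": "What is a token?"})
history.append({"role": "assistant", "content": "A token is a chunk of text..."})

# Turn two resends everything, plus the new message.
turn2 = build_payload("You are a helpful assistant.", history, "Make it shorter.")

print(len(turn1["messages"]))  # 1
print(len(turn2["messages"]))  # 3
```

Note that the client, not the model, owns `history` — which is exactly why the payload grows every turn.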
This is what makes Tally possible. Because every API call is independent, there is no technical reason every call in a conversation must go to the same model. Turn 1 (complex question) → Sonnet. Turn 2 ("make it shorter") → Haiku. The model doesn't care. It only sees what you send it.
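A per-turn router can be sketched as a plain function. The model names, trivial-follow-up list, and token threshold below are all made up — the point is only that each call can be routed independently:

```python
# Hypothetical per-turn router. Model names and thresholds are invented;
# the point is only that each call can be routed independently.
def pick_model(user_message: str, history_tokens: int) -> str:
    trivial = {"make it shorter", "fix the typo", "thanks"}
    if user_message.lower().strip().rstrip(".!") in trivial:
        return "haiku"            # cheap model for trivial follow-ups
    if history_tokens > 50_000:
        return "sonnet-long"      # needs a long-context model
    return "sonnet"               # default for substantive questions

print(pick_model("Explain CAP theorem trade-offs", 1_200))  # sonnet
print(pick_model("Make it shorter", 2_000))                 # haiku
```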
Language models don't read text the way you do. They process tokens — chunks of text that sit somewhere between individual characters and full words. Common English words are usually one token. Longer or unusual words might be two or three. Punctuation, whitespace, and code symbols each have their own tokenisation rules.
A rough heuristic: 1 token ≈ 0.75 words, or about 4 characters. A 1,000-word document is roughly 1,300 tokens. A short email is 50–200 tokens. A large code file with docstrings might be 4,000–8,000 tokens.
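The 4-characters-per-token heuristic is easy to turn into a budgeting function. Real tokenisers (e.g. tiktoken for OpenAI models) give exact counts; this is only for back-of-envelope estimates:

```python
# The ~4-characters-per-token heuristic as a function. Real tokenisers
# give exact counts; this is only for back-of-envelope budgeting.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

doc = "word " * 1_000             # a crude 1,000-word document
print(estimate_tokens(doc))       # 1250 with these 5-character "words"
```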
Tokens matter because every token costs money. Providers charge separately for input tokens (what you send) and output tokens (what you get back). Output is typically more expensive — generating tokens requires sequential computation, while reading input can be parallelised.
At scale, this difference is enormous. One million calls averaging 2,000 input + 500 output tokens costs roughly $2,100 on Haiku vs $18,750 on Opus — for the same tasks, many of which Haiku handles perfectly well.
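The arithmetic generalises to a one-line cost model. The per-million-token prices below are placeholders, not current rates for any named model — check your provider's price sheet:

```python
# Back-of-envelope cost model. The per-million-token prices below are
# PLACEHOLDERS, not current rates -- check your provider's price sheet.
PRICES = {                        # (input $/M tokens, output $/M tokens)
    "cheap-model":    (0.25, 1.25),
    "flagship-model": (15.00, 75.00),
}

def total_cost(model: str, calls: int, in_tok: int, out_tok: int) -> float:
    p_in, p_out = PRICES[model]
    return calls * (in_tok * p_in + out_tok * p_out) / 1_000_000

print(total_cost("cheap-model", 1_000_000, 2_000, 500))     # 1125.0
print(total_cost("flagship-model", 1_000_000, 2_000, 500))  # 67500.0
```

Whatever the current prices, the structure is the same: cost scales linearly with call volume, and the input/output split means trimming history pays off twice.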
Every model has a context window — the maximum number of tokens it can process in a single call, counting both input and output. Think of it as working memory. Whatever fits in the window is available to the model. Whatever doesn't, doesn't exist.
Modern flagship models have large context windows: Claude 3.7 Sonnet supports 200k tokens; GPT-4o supports 128k. Smaller models typically have smaller windows — though this changes rapidly as providers update their offerings.
A task with a 50,000-token document to analyse can only go to a model whose context window accommodates it. Tally tracks context window sizes per model and will never recommend a model that can't fit the payload. This is a hard constraint enforced before any cost or quality optimisation runs.
Context windows also affect cost. Sending 100,000 input tokens on every turn of a long conversation adds up fast. Part of good routing is recognising when a task is long-context and ensuring it goes to a model with the right capability at an acceptable price.
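The hard constraint described above amounts to a simple fit check before any optimisation. The window sizes here are illustrative, not Tally's actual model table:

```python
# Hard-constraint check: a model is eligible only if input plus the
# reserved output budget fits its window. Sizes here are illustrative.
CONTEXT_WINDOWS = {"small": 32_000, "mid": 128_000, "large": 200_000}

def eligible_models(input_tokens: int, max_output_tokens: int) -> list[str]:
    needed = input_tokens + max_output_tokens
    return [m for m, win in CONTEXT_WINDOWS.items() if win >= needed]

# A 50k-token document plus a 4k output budget rules out the small model.
print(eligible_models(50_000, 4_000))   # ['mid', 'large']
```

Only after this filter runs does it make sense to compare the surviving models on cost and quality.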
Most LLM APIs organise a call as a list of messages with roles:
System — Instructions to the model. Personality, constraints, format requirements. Sent every call. Often large. Usually the same across all turns in a session.
User — What the human said (or what your app is submitting as the request).
Assistant — The model's previous response, included in the history so the model can refer back to what it said.
The system prompt alone can be hundreds or thousands of tokens. If your system prompt is 2,000 tokens and you have a 10-turn conversation where each user message and each assistant response is around 500 tokens, your 10th API call is sending roughly 12,000 tokens of context just to situate the model — before the actual question is even asked.
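Assuming roughly 500 tokens for each user message and each assistant response, the arithmetic behind that 10th call works out as:

```python
# The arithmetic behind the 10th-call payload, assuming ~500 tokens for
# each user message and each assistant response.
system_tokens = 2_000
per_message = 500
completed_turns = 9     # turns 1-9: one user + one assistant message each

call_10_input = system_tokens + completed_turns * 2 * per_message + per_message
print(call_10_input)    # 11500 -- roughly 12,000 tokens of context
```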
Tally observes the shape of calls, not the content. It sees that a call has a long context, not what's in it. It sees that a system prompt is large, not what it says. Your prompts stay in your application. Always.
Beyond the message payload, API calls include parameters that control how the model generates:
Temperature — Randomness. Low (0.0–0.3) gives deterministic, predictable responses. High (0.7–1.0) gives more creative, varied responses. Code generation usually wants low temperature. Creative writing usually wants higher. Tool/MCP calls should be near zero.
Top-p (nucleus sampling) — Another randomness dial, complementary to temperature. Most applications only need to touch temperature and leave top-p at default.
Structured outputs / JSON mode — Force the model to respond in valid JSON matching a schema. Requires a model that supports it. Non-streaming is strongly preferred when using structured outputs — you need the full, parseable response before you can act on it.
Max tokens — A hard cap on output length. The model stops when it hits this. Setting it appropriately prevents runaway outputs and controls cost.
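Put together, a request with these parameters set has roughly the following shape. Field names follow the common messages-API convention; the exact schema and model name vary by provider:

```python
# Shape of a typical chat request with generation parameters set.
# Field names follow the common messages-API convention; the exact
# schema and model name vary by provider.
import json

request = {
    "model": "example-model",
    "max_tokens": 512,            # hard cap on output length (and cost)
    "temperature": 0.2,           # low: predictable, good for code/tools
    "messages": [
        {"role": "user", "content": "Summarise this changelog."},
    ],
    # "top_p": 1.0,               # usually left at its default
}
print(json.dumps(request, indent=2))
```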
Statelessness is the property that makes per-turn routing not just possible but correct. There is no model-level continuity to preserve — only the context you explicitly pass. Tally evaluates the shape of each turn and picks the right model for that specific request. The conversation stays coherent. The spend does not.
One API key. Instant access to intelligent model routing across all major providers.