What your code actually talks to — and the properties of language models that make intelligent routing both necessary and possible.
When a user opens Claude.ai or ChatGPT in a browser, they see a polished interface that remembers their history, manages sessions, and presents a coherent conversation. None of that exists at the API layer.
When your code calls an LLM API, you are talking to a stateless mathematical function. You send a payload. You get a response. That's it. The intelligence is real. The memory is not. All of the session management, history tracking, and continuity you experience in a chat UI are built by the product layer sitting on top of the raw API — not by the model itself.
Understanding this is the foundation for everything else. The API is simpler, more powerful, and more honest than the UI suggests.
You send a JSON payload with a list of messages (each with a role and content) and some parameters. The model reads the entire payload, generates a response, and returns it. Nothing is retained between calls. The model has no idea a previous call ever happened unless you include it in the payload.
Language models are stateless. This is not a limitation — it is a design property. When you call the API, the model processes whatever is in the payload and returns a result. It has no memory of any previous call, no accumulated state, no understanding that you called it thirty seconds ago with a related question.
To simulate a conversation, you pass the entire history on every turn. Turn one: you send the system prompt and the user's first message. Turn two: you send the system prompt, the first user message, the model's first response, and the new user message. Turn three: all of the above, plus turn two. And so on.
The practical implication: as a conversation grows, so does the payload. A long multi-turn session eventually sends thousands of tokens worth of history on every single turn — just to reconstruct the context the model needs to respond sensibly.
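The turn-by-turn pattern above can be sketched in a few lines of Python. The `build_payload` helper and the message shapes are illustrative, not any particular provider's SDK:

```python
# Minimal sketch of a client rebuilding conversation state on every call.
# `build_payload` and the message shapes are illustrative, not a real SDK.

def build_payload(system_prompt, history, new_user_message):
    """Assemble the full message list for this turn. Nothing lives
    server-side, so the entire history travels with every request."""
    messages = list(history)
    messages.append({"role": "user", "content": new_user_message})
    return {"system": system_prompt, "messages": messages}

history = []

# Turn one: just the new question.
turn1 = build_payload("You are a helpful assistant.", history, "What is a token?")

# Pretend the API returned a reply, and record both sides in history.
history.append({"role": "user", "content": "What is a token?"})
history.append({"role": "assistant", "content": "A token is a chunk of text..."})

# Turn two resends everything, plus the new message.
turn2 = build_payload("You are a helpful assistant.", history, "Make it shorter.")

print(len(turn1["messages"]))  # 1
print(len(turn2["messages"]))  # 3
```

Note that the client, not the model, owns `history` — which is exactly why the payload grows every turn.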
This is what makes Tally possible. Because every API call is independent, there is no technical reason every call in a conversation must go to the same model. Turn 1 (complex question) → Sonnet. Turn 2 ("make it shorter") → Haiku. The model doesn't care. It only sees what you send it.
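A per-turn router can be sketched as a plain function. The model names, trivial-follow-up list, and token threshold below are all made up — the point is only that each call can be routed independently:

```python
# Hypothetical per-turn router. Model names and thresholds are invented;
# the point is only that each call can be routed independently.
def pick_model(user_message: str, history_tokens: int) -> str:
    trivial = {"make it shorter", "fix the typo", "thanks"}
    if user_message.lower().strip().rstrip(".!") in trivial:
        return "haiku"            # cheap model for trivial follow-ups
    if history_tokens > 50_000:
        return "sonnet-long"      # needs a long-context model
    return "sonnet"               # default for substantive questions

print(pick_model("Explain CAP theorem trade-offs", 1_200))  # sonnet
print(pick_model("Make it shorter", 2_000))                 # haiku
```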
Language models don't read text the way you do. They process tokens — chunks of text that sit somewhere between individual characters and full words. Common English words are usually one token. Longer or unusual words might be two or three. Punctuation, whitespace, and code symbols each have their own tokenisation rules.
A rough heuristic: 1 token ≈ 0.75 words, or about 4 characters. A 1,000-word document is roughly 1,300 tokens. A short email is 50–200 tokens. A large code file with docstrings might be 4,000–8,000 tokens.
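The 4-characters-per-token heuristic is easy to turn into a budgeting function. Real tokenisers (e.g. tiktoken for OpenAI models) give exact counts; this is only for back-of-envelope estimates:

```python
# The ~4-characters-per-token heuristic as a function. Real tokenisers
# give exact counts; this is only for back-of-envelope budgeting.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

doc = "word " * 1_000             # a crude 1,000-word document
print(estimate_tokens(doc))       # 1250 with these 5-character "words"
```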
Tokens matter because every token costs money. Providers charge separately for input tokens (what you send) and output tokens (what you get back). Output is typically more expensive — generating tokens requires sequential computation, while reading input can be parallelised.
At scale, this difference is enormous. One million calls averaging 2,000 input + 500 output tokens costs roughly $2,100 on Haiku vs $18,750 on Opus — for the same tasks, many of which Haiku handles perfectly well.
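The arithmetic generalises to a one-line cost model. The per-million-token prices below are placeholders, not current rates for any named model — check your provider's price sheet:

```python
# Back-of-envelope cost model. The per-million-token prices below are
# PLACEHOLDERS, not current rates -- check your provider's price sheet.
PRICES = {                        # (input $/M tokens, output $/M tokens)
    "cheap-model":    (0.25, 1.25),
    "flagship-model": (15.00, 75.00),
}

def total_cost(model: str, calls: int, in_tok: int, out_tok: int) -> float:
    p_in, p_out = PRICES[model]
    return calls * (in_tok * p_in + out_tok * p_out) / 1_000_000

print(total_cost("cheap-model", 1_000_000, 2_000, 500))     # 1125.0
print(total_cost("flagship-model", 1_000_000, 2_000, 500))  # 67500.0
```

Whatever the current prices, the structure is the same: cost scales linearly with call volume, and the input/output split means trimming history pays off twice.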
Every model has a context window — the maximum number of tokens it can process in a single call, counting both input and output. Think of it as working memory. Whatever fits in the window is available to the model. Whatever doesn't, doesn't exist.
Modern flagship models have large context windows: Claude 3.7 Sonnet supports 200k tokens; GPT-4o supports 128k. Smaller models typically have smaller windows — though this changes rapidly as providers update their offerings.
A task with a 50,000-token document to analyse can only go to a model whose context window accommodates it. Tally tracks context window sizes per model and will never recommend a model that can't fit the payload. This is a hard constraint enforced before any cost or quality optimisation runs.
Context windows also affect cost. Sending 100,000 input tokens on every turn of a long conversation adds up fast. Part of good routing is recognising when a task is long-context and ensuring it goes to a model with the right capability at an acceptable price.
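The hard constraint described above amounts to a simple fit check before any optimisation. The window sizes here are illustrative, not Tally's actual model table:

```python
# Hard-constraint check: a model is eligible only if input plus the
# reserved output budget fits its window. Sizes here are illustrative.
CONTEXT_WINDOWS = {"small": 32_000, "mid": 128_000, "large": 200_000}

def eligible_models(input_tokens: int, max_output_tokens: int) -> list[str]:
    needed = input_tokens + max_output_tokens
    return [m for m, win in CONTEXT_WINDOWS.items() if win >= needed]

# A 50k-token document plus a 4k output budget rules out the small model.
print(eligible_models(50_000, 4_000))   # ['mid', 'large']
```

Only after this filter runs does it make sense to compare the surviving models on cost and quality.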
Most LLM APIs organise a call as a list of messages with roles:
System — Instructions to the model. Personality, constraints, format requirements. Sent every call. Often large. Usually the same across all turns in a session.
User — What the human said (or what your app is submitting as the request).
Assistant — The model's previous response, included in the history so the model can refer back to what it said.
The system prompt alone can be hundreds or thousands of tokens. If your system prompt is 2,000 tokens and you have a 10-turn conversation where each user message and each assistant response is around 500 tokens, your 10th API call is sending roughly 12,000 tokens of context just to situate the model — before the actual question is even asked.
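Assuming roughly 500 tokens for each user message and each assistant response, the arithmetic behind that 10th call works out as:

```python
# The arithmetic behind the 10th-call payload, assuming ~500 tokens for
# each user message and each assistant response.
system_tokens = 2_000
per_message = 500
completed_turns = 9     # turns 1-9: one user + one assistant message each

call_10_input = system_tokens + completed_turns * 2 * per_message + per_message
print(call_10_input)    # 11500 -- roughly 12,000 tokens of context
```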
Tally observes the shape of calls, not the content. It sees that a call has a long context, not what's in it. It sees that a system prompt is large, not what it says. Your prompts stay in your application. Always.
Beyond the message payload, API calls include parameters that control how the model generates:
Temperature — Randomness. Low (0.0–0.3) gives deterministic, predictable responses. High (0.7–1.0) gives more creative, varied responses. Code generation usually wants low temperature. Creative writing usually wants higher. Tool/MCP calls should be near zero.
Top-p (nucleus sampling) — Another randomness dial, complementary to temperature. Most applications only need to touch temperature and leave top-p at default.
Structured outputs / JSON mode — Force the model to respond in valid JSON matching a schema. Requires a model that supports it. Non-streaming is strongly preferred when using structured outputs — you need the full, parseable response before you can act on it.
Max tokens — A hard cap on output length. The model stops when it hits this. Setting it appropriately prevents runaway outputs and controls cost.
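Put together, a request with these parameters set has roughly the following shape. Field names follow the common messages-API convention; the exact schema and model name vary by provider:

```python
# Shape of a typical chat request with generation parameters set.
# Field names follow the common messages-API convention; the exact
# schema and model name vary by provider.
import json

request = {
    "model": "example-model",
    "max_tokens": 512,            # hard cap on output length (and cost)
    "temperature": 0.2,           # low: predictable, good for code/tools
    "messages": [
        {"role": "user", "content": "Summarise this changelog."},
    ],
    # "top_p": 1.0,               # usually left at its default
}
print(json.dumps(request, indent=2))
```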
Statelessness is the property that makes per-turn routing not just possible but correct. There is no model-level continuity to preserve — only the context you explicitly pass. Tally evaluates the shape of each turn and picks the right model for that specific request. The conversation stays coherent. The spend does not.
One API key. Instant access to intelligent model routing across all major providers.