Two ways to receive a model's response — each with distinct trade-offs. Tally reads your task shape and tells you which to use before you make the call.
When you make a non-streaming API call, your code blocks and waits. The model generates the entire response token-by-token, then hands it to you all at once. If the response is 2,000 tokens, you wait for all 2,000 before seeing any of it.
When you enable streaming, the API uses Server-Sent Events (SSE) — an HTTP connection that stays open and delivers tokens (or small groups of tokens) as they are generated. Your code receives a stream of chunks you can process and display incrementally.
The underlying computation is the same; the model generates tokens sequentially either way. The only difference is when those tokens are handed to you. Streaming brings time to first token (TTFT) down to roughly the network round-trip. With non-streaming, TTFT equals the full generation time.
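The contrast can be sketched with a plain generator standing in for model generation (no real API involved): both paths produce identical text, but the streaming path surfaces each chunk the moment it exists.

```python
def generate_tokens():
    # Stand-in for sequential model generation: yields chunks one at a time.
    for token in ["Stream", "ing ", "deliv", "ers ", "tokens ", "as ", "they ", "arrive."]:
        yield token

def non_streaming_call():
    # Blocks until every token exists, then returns the whole response at once.
    return "".join(generate_tokens())

def streaming_call(on_chunk):
    # Hands each chunk to a callback immediately, e.g. to render in a UI.
    parts = []
    for chunk in generate_tokens():
        on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)

full = non_streaming_call()
chunks = []
streamed = streaming_call(chunks.append)
assert full == streamed  # same content, different delivery timing
```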
The two modes are not interchangeable for every use case. Streaming forces you into a pipeline architecture — your code must handle partial state. Non-streaming gives you a complete, validated response you can act on atomically. The wrong choice creates either a poor user experience or fragile engineering.
Streaming is the right choice for user-facing interfaces, but it comes with genuine engineering complexity that's easy to underestimate:
Token counting. With non-streaming, the response includes a usage object with exact input and output token counts. With streaming, you typically don't get this until the final chunk, and some providers deliver it in a separate terminal event. If you're doing cost tracking mid-stream, you need to count chunks yourself — and chunk boundaries don't align with tokens precisely.
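A minimal illustration of why counting chunks yourself goes wrong. The chunk format below (a `text` field per event, `usage` only on the terminal event) is a hypothetical shape for illustration; real providers each have their own event schema.

```python
# Chunks as a provider might deliver them: text fragments, with exact
# usage arriving only on the terminal event. Field names are hypothetical.
chunks = [
    {"text": "The qui"},
    {"text": "ck brown "},
    {"text": "fox."},
    {"text": "", "usage": {"output_tokens": 5}},
]

naive_count = sum(1 for c in chunks if c["text"])  # counts chunks, not tokens
exact_count = next(c["usage"]["output_tokens"] for c in chunks if "usage" in c)

# 3 chunks vs 5 tokens: chunk boundaries are not token boundaries.
assert naive_count != exact_count
```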
Structured output parsing. If you've asked the model to respond in JSON, a streaming response gives you fragments of JSON that aren't parseable until the stream closes. Attempting to parse partial JSON mid-stream requires special tooling and is a common source of bugs. Non-streaming is strongly recommended for any structured output.
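The failure mode is easy to demonstrate with the standard library alone: any prefix of a JSON document is itself invalid JSON, so nothing parses until the stream closes.

```python
import json

full_response = '{"name": "Ada", "tags": ["pioneer", "math"]}'

# Mid-stream you hold only a prefix of the document.
partial = full_response[:23]

try:
    json.loads(partial)
    parsed_early = True
except json.JSONDecodeError:
    parsed_early = False  # fragments are not valid JSON

assert parsed_early is False
assert json.loads(full_response)["name"] == "Ada"  # parse after the stream closes
```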
Error handling and retries. If a stream fails midway, you have a partial response, which may be worse than no response. Retry logic for streams is substantially more complex than for atomic responses: you need to decide whether to discard and retry from the start or attempt to resume. Most implementations just discard and retry, which wastes work.
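The discard-and-retry strategy looks like this in miniature, using a fake stream that drops the connection partway through the first attempt (all names here are illustrative, not a real SDK):

```python
class FlakyStream:
    """Fake stream that drops the connection mid-response on the first attempt."""
    def __init__(self, chunks, fail_on_attempt=1, fail_after=2):
        self.chunks = chunks
        self.attempt = 0
        self.fail_on_attempt = fail_on_attempt
        self.fail_after = fail_after

    def read(self):
        self.attempt += 1
        for i, chunk in enumerate(self.chunks):
            if self.attempt == self.fail_on_attempt and i == self.fail_after:
                raise ConnectionError("stream dropped mid-response")
            yield chunk

def stream_with_retry(stream, max_retries=3):
    # Discard-and-retry: the partial output is thrown away and the whole
    # request is replayed from the start. Simple, but all work so far is lost.
    for _ in range(max_retries):
        parts = []
        try:
            for chunk in stream.read():
                parts.append(chunk)
            return "".join(parts)
        except ConnectionError:
            continue  # discard the partial response, start over
    raise RuntimeError("stream failed after retries")

s = FlakyStream(["Hello", ", ", "world"])
assert stream_with_retry(s) == "Hello, world"
```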
Tool / MCP calls always use non-streaming internally. Even if you request a streaming response, the model must complete its tool-use reasoning before the tool is invoked, and the tool result must be fully received before the model can synthesise a final response. Piping this through a stream adds complexity with no benefit.
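The two-phase shape described above can be sketched as follows. The client methods, the tool-call response shape, and the tool itself are all assumptions made up for this example; the point is only that phase 1 is atomic and phase 2 may stream.

```python
def lookup_weather(city):
    # Hypothetical tool for illustration.
    return {"city": city, "temp_c": 18}

class FakeClient:
    def complete(self, prompt, tools):
        # Phase 1: tool-call arguments must arrive complete and parseable.
        return {"tool_call": {"name": "lookup_weather",
                              "args": {"city": "Oslo"}}, "text": None}

    def stream(self, prompt, tool_result):
        # Phase 2: the synthesis turn can stream to a human reader.
        yield f"It is {tool_result['temp_c']}°C "
        yield f"in {tool_result['city']}."

def answer_with_tool(client, prompt, tools):
    turn = client.complete(prompt=prompt, tools=list(tools))  # atomic phase
    call = turn["tool_call"]
    if call:
        result = tools[call["name"]](**call["args"])
        return "".join(client.stream(prompt=prompt, tool_result=result))
    return turn["text"]

answer = answer_with_tool(FakeClient(), "Weather in Oslo?",
                          {"lookup_weather": lookup_weather})
```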
When you call route(), Tally returns a routing recommendation that includes not just which model to use, but whether to use streaming. This recommendation is derived entirely from the semantic envelope (the task shape), not from the content of the prompt.
The signals Tally reads from the envelope:
- structure_type: "json" → non-streaming. "prose" or "code" → streaming preferred for long outputs.
- consumer: "human" → streaming. "machine" or "pipeline" → non-streaming.
- Expected length: short completions (under ~200 tokens) gain little from streaming; long outputs gain substantially from progressive rendering.
- tools_used > 0 → non-streaming recommendation for the tool-call phases. The final synthesis turn may still stream.
The response from route() includes a stream_recommended boolean.
You don't have to think about it — just use what Tally tells you.
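In practice, honoring the recommendation is a two-branch dispatch. The decision shape and client methods below are assumptions for illustration, not the documented Tally SDK surface; only the stream_recommended boolean comes from the text above.

```python
def dispatch(decision, prompt, client, render=lambda chunk: None):
    """Act on a routing decision shaped like
    {"model": str, "stream_recommended": bool} (assumed shape)."""
    if decision["stream_recommended"]:
        parts = []
        for chunk in client.stream(model=decision["model"], prompt=prompt):
            render(chunk)          # progressive rendering for a human reader
            parts.append(chunk)
        return "".join(parts)
    # Atomic path: one complete response you can validate and act on.
    return client.complete(model=decision["model"], prompt=prompt)

class FakeClient:
    # Stand-in for a model client; a real one would make API calls.
    def stream(self, model, prompt):
        yield from ["chunk-1 ", "chunk-2"]

    def complete(self, model, prompt):
        return "complete response"

streamed = dispatch({"model": "m", "stream_recommended": True}, "hi", FakeClient())
atomic = dispatch({"model": "m", "stream_recommended": False}, "hi", FakeClient())
```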
| Task | Output type | Consumer | Recommendation |
|---|---|---|---|
| Conversational chat UI | Prose | Human | Stream |
| Generate a long article | Prose | Human | Stream |
| Generate a large code file | Code | Human (editor) | Stream |
| Extract structured data to JSON | JSON | Machine | Don't Stream |
| Function / tool call | Tool args | Machine | Don't Stream |
| Classify or label input | Short text | Machine | Don't Stream |
| Summarise an email (short) | Short prose | Either | Either |
| MCP tool invocation + synthesis | Mixed | Human | Don't Stream tool phase, then Stream synthesis |
Tally doesn't just pick the cheapest model. It returns a complete routing decision: which model to use, whether to stream the response, and the confidence behind both. You implement the recommendation; Tally does the thinking.
Route smarter — model, streaming mode, and confidence. All from one API.
Next up
Shapes →