How to Stream LLM Responses in Your Application
Streaming LLM responses reduces perceived latency and improves UX. Here's how server-sent events work, how to implement streaming with OpenAI and Anthropic, and what to watch for in production.
Without streaming, an LLM request works like a traditional API call: you send a prompt, wait for the model to generate the entire response, and receive it all at once. For a 500-token response from GPT-5.4, that might mean waiting 3-8 seconds staring at a blank screen before anything appears.
Streaming changes this. The model sends tokens as they’re generated, so the first word appears in milliseconds. The user starts reading while the model is still thinking. The total generation time is the same, but the perceived latency drops dramatically.
Every production LLM application should stream by default. Here’s how it works.
Why Streaming Matters
Two metrics define the responsiveness of an LLM application:
Time to First Token (TTFT) is how long the user waits before seeing any output. Without streaming, TTFT equals the full generation time. With streaming, TTFT is typically 100-500ms regardless of response length.
Time to Last Token (TTLT) is when the complete response is available. Streaming doesn’t change this. A 500-token response takes the same total time whether you stream it or not.
The UX difference is significant. Users perceive a streaming response as faster even when the total time is identical, because they’re reading content as it arrives rather than waiting for a blank screen to fill.
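To make the two metrics concrete, here's a small simulation. No real model is involved; a hypothetical generator stands in for the model, emitting tokens with a fixed per-token delay:

```python
import time

def fake_token_stream(n_tokens=20, delay=0.01):
    """Hypothetical stand-in for a model: yields tokens with a fixed delay."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"token{i} "

start = time.perf_counter()
ttft = None
tokens = []
for token in fake_token_stream():
    if ttft is None:
        ttft = time.perf_counter() - start  # time to first token
    tokens.append(token)
ttlt = time.perf_counter() - start  # time to last token

print(f"TTFT: {ttft:.3f}s, TTLT: {ttlt:.3f}s")
```

TTFT here is roughly one token's delay; TTLT is roughly twenty of them. Streaming doesn't shrink TTLT, it just changes which number the user experiences as waiting.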
Server-Sent Events
Most LLM APIs use Server-Sent Events (SSE) for streaming. SSE is a simple protocol: the server sends a stream of lines prefixed with data: over a long-lived HTTP connection, and each line carries one chunk of the response.
data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":" world"}}]}
data: [DONE]
SSE is unidirectional (server to client), which is all you need for streaming completions. The client sends one request, then reads chunks as they arrive. No WebSocket complexity required.
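A minimal client-side parser for this format fits in a few lines of Python. This sketch handles only the data: lines shown above; a real SSE parser must also handle the event:, id:, and retry: fields and multi-line data payloads:

```python
def parse_sse(raw: bytes) -> list[str]:
    """Return the payload of each `data: ` line in an SSE byte stream."""
    payloads = []
    for line in raw.decode("utf-8").splitlines():
        if line.startswith("data: "):
            payloads.append(line[len("data: "):])
    return payloads

chunks = parse_sse(
    b'data: {"choices":[{"delta":{"content":"Hello"}}]}\n\n'
    b'data: {"choices":[{"delta":{"content":" world"}}]}\n\n'
    b'data: [DONE]\n\n'
)
```

The [DONE] sentinel arrives as an ordinary data line, so the consumer is responsible for recognizing it and stopping.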
Streaming with OpenAI
Add stream=True to any chat completion request:
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Explain embeddings in two paragraphs."}],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```
Each chunk contains a delta object with the incremental content. The delta replaces the message field you’d see in a non-streaming response. When generation is complete, the final chunk signals the end of the stream.
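If you also need the complete text at the end, for logging or storage, accumulate the deltas as they arrive. A sketch with simulated delta values (real chunks are SDK objects; the None mimics the role-only first chunk, which is why the content check above matters):

```python
# Simulated content deltas: None stands in for chunks without content.
deltas = [None, "Stream", "ing ", "works."]

full_text = "".join(d for d in deltas if d)
```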
Streaming with Anthropic
Anthropic’s API uses a similar pattern with event types:
```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain embeddings in two paragraphs."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
Anthropic’s streaming emits typed events: message_start, content_block_start, content_block_delta, content_block_stop, and message_stop. The text_stream helper abstracts over these, but you’ll need the raw events if you’re handling tool use or multiple content blocks.
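A sketch of what raw-event handling looks like, using plain dicts to mimic the event sequence (the real SDK delivers typed objects, so you'd use attribute access rather than key lookup):

```python
# Hypothetical simplified events mirroring Anthropic's event sequence.
events = [
    {"type": "message_start"},
    {"type": "content_block_start", "index": 0},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": " world"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_stop"},
]

text = ""
for event in events:
    # Only text deltas contribute to the visible output; other event
    # types mark block boundaries (useful for tool use and multi-block replies).
    if (event["type"] == "content_block_delta"
            and event["delta"]["type"] == "text_delta"):
        text += event["delta"]["text"]
```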
Streaming Tool Calls
Streaming gets more complex when the model uses function calling. Instead of receiving complete JSON arguments at once, you receive them incrementally:
delta: {"tool_calls":[{"function":{"arguments":"{\"ci"}}]}
delta: {"tool_calls":[{"function":{"arguments":"ty\": "}}]}
delta: {"tool_calls":[{"function":{"arguments":"\"London\"}"}}]}
You need to accumulate the argument string across chunks and parse it only when the tool call is complete. Attempting to parse partial JSON on every chunk will fail. Buffer the argument fragments, detect completion (the stream moves to the next tool call or ends), then parse and execute.
```python
import json

tool_args = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        tool_args += delta.tool_calls[0].function.arguments or ""

# Parse only after the stream ends, once the JSON is complete.
args = json.loads(tool_args)
```
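To see why buffering matters, here's the same accumulation run against the fragments from the deltas shown earlier, hardcoded for illustration. Parsing any partial prefix raises json.JSONDecodeError; only the full concatenation parses:

```python
import json

# The argument fragments as they arrived across chunks.
fragments = ['{"ci', 'ty": ', '"London"}']

buffer = ""
for frag in fragments:
    buffer += frag
    # json.loads(buffer) would fail here for every partial prefix.

args = json.loads(buffer)  # only the complete string parses cleanly
```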
Forwarding Streams to the Browser
In a web application, you typically proxy the LLM stream through your backend to the client. The simplest approach is forwarding SSE directly:
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/chat")
async def chat(q: str):
    def generate():
        stream = client.chat.completions.create(
            model="gpt-5.4",
            messages=[{"role": "user", "content": q}],
            stream=True,
        )
        for chunk in stream:
            content = chunk.choices[0].delta.content or ""
            yield f"data: {content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
```
On the client side, use fetch with a readable stream to consume the events and append tokens to the DOM as they arrive:
```javascript
const response = await fetch('/chat?q=Hello');
const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const text = decoder.decode(value, { stream: true });
  const lines = text.split('\n').filter(line => line.startsWith('data: '));
  for (const line of lines) {
    const content = line.replace('data: ', '');
    if (content === '[DONE]') return;
    document.getElementById('output').textContent += content;
  }
}
```
The EventSource API also works but doesn’t support POST requests or custom headers, which limits its usefulness for LLM applications where you need to send conversation history in the request body. The fetch approach is more flexible.
Streaming with the Vercel AI SDK
If you’re building with Next.js or another JavaScript framework, the Vercel AI SDK (v5) abstracts away most of the streaming plumbing. It handles SSE parsing, React state updates, and connection management through a transport-based architecture:
```tsx
import { useState } from 'react';
import { useChat } from '@ai-sdk/react';

export default function Chat() {
  // In AI SDK v5, useChat no longer manages input state; hold it yourself.
  const [input, setInput] = useState('');
  const { messages, sendMessage } = useChat();
  return (
    <div>
      {messages.map(m => (
        <div key={m.id}>
          {m.role}: {m.parts.filter(p => p.type === 'text').map(p => p.text).join('')}
        </div>
      ))}
      <form onSubmit={e => { e.preventDefault(); sendMessage({ text: input }); setInput(''); }}>
        <input value={input} onChange={e => setInput(e.target.value)} />
      </form>
    </div>
  );
}
```
The useChat hook manages the streaming connection, accumulates tokens into messages, and re-renders the component as tokens arrive. It supports OpenAI, Anthropic, and other providers through a unified interface. For many web applications, this is the fastest path to a working streaming UI.
Production Considerations
Error handling. Streams can disconnect mid-response. Your client should detect incomplete responses and either retry or show the partial result with an error indicator. Don’t silently swallow dropped connections.
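A sketch of the detection logic, with a hypothetical generator standing in for a connection that drops mid-response:

```python
def read_stream(lines):
    """Collect streamed text; report whether the stream ended cleanly."""
    parts, complete = [], False
    try:
        for line in lines:
            if line == "[DONE]":
                complete = True
                break
            parts.append(line)
    except ConnectionError:
        pass  # dropped mid-stream: fall through with the partial text
    return "".join(parts), complete

def dropped_stream():
    """Hypothetical stream that disconnects before finishing."""
    yield "Hel"
    yield "lo"
    raise ConnectionError

partial, ok = read_stream(dropped_stream())
```

The caller can then decide whether to retry, discard, or render the partial text with an error indicator, rather than presenting a truncated answer as complete.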
Token counting. With streaming, you don’t get the usage summary until the stream ends (or in a final chunk). If you need to track token usage for cost monitoring, capture the final chunk’s usage field or count tokens client-side as they arrive.
Rate limits. A streaming request holds a connection open for the duration of generation. If your model takes 10 seconds to produce a response, that’s 10 seconds of connection time. Plan your connection pool and timeout settings accordingly.
Buffering. Some reverse proxies (Nginx, Cloudflare) buffer responses by default, which defeats streaming. Configure X-Accel-Buffering: no or equivalent to ensure chunks flow through immediately.
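One common mitigation is to set the anti-buffering headers on the response itself, in addition to (or instead of) changing proxy configuration. A sketch of the headers typically used on SSE responses (pass them to your framework's streaming response; X-Accel-Buffering is Nginx-specific and ignored elsewhere):

```python
# Headers commonly set on SSE responses to keep chunks flowing.
sse_headers = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",   # discourage intermediary caching
    "X-Accel-Buffering": "no",     # Nginx: disable proxy buffering
    "Connection": "keep-alive",
}
```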
Cancellation. If the user navigates away mid-stream, close the connection. Most SDKs support aborting a stream, which also stops the model from generating further tokens, saving you money on output tokens you’ll never use.
When Not to Stream
Streaming adds complexity. Skip it when:
- You need the complete response before doing anything (e.g., parsing structured JSON, running validation)
- The response is short enough that TTFT doesn’t matter (under 50 tokens)
- You’re processing responses in batch, not showing them to a user
For structured output workflows where you need valid JSON before proceeding, non-streaming requests are simpler and less error-prone. For anything user-facing, stream by default.