
How to Build Stateful AI Agents with OpenAI's Responses API: Containers, Skills, and Shell

Learn how to use OpenAI's Responses API with hosted containers, shell, skills, and compaction to build long-running AI agents.

OpenAI’s new Responses API computer environment lets you build stateful agents that can run shell commands, persist files in hosted containers, load reusable skills, and survive long sessions with server-side compaction. The March 11 engineering write-up explains how OpenAI wired these pieces together, and the official announcement plus the API docs for containers, skills, and response compaction cover the full surface. This walkthrough shows how to create a container-backed agent, add files and network policy, package skills, and manage long-running sessions in code.

What this runtime adds to the Responses API

The practical shift is from a stateless prompt-response loop to an agent runtime with execution state.

OpenAI’s hosted environment adds four capabilities that matter in production:

| Capability | What it does | Why you use it |
|---|---|---|
| Shell tool | Executes terminal commands in a hosted environment | Run scripts, CLIs, servers, data transforms |
| Containers | Persist files and runtime state between steps | Keep working directories, outputs, and intermediate artifacts |
| Skills | Load reusable, versioned bundles into the environment | Standardize workflows and tool instructions |
| Compaction | Compress long histories into token-efficient state | Keep multi-step agents running without blowing context |

Two constraints matter up front.

First, shell execution requires a model trained for it. OpenAI states that GPT-5.2 and later are trained to propose shell commands. Second, the container examples in the docs show a 4g memory limit and an expiry policy based on last_active_at, with an example inactivity window of 20 minutes.
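Because expiry is tied to inactivity, a workflow that pauses between steps can check staleness before reusing a workspace. A minimal sketch, assuming `last_active_at` is an ISO timestamp and defaulting to the docs' 20-minute example window (the real `expires_after` shape may differ, so the `minutes` field here is an assumption):

```javascript
// Sketch: decide whether a container is likely expired, assuming the
// expiry policy is last_active_at plus an inactivity window in minutes.
function isLikelyExpired(container, now = Date.now()) {
  const windowMinutes = container.expires_after?.minutes ?? 20; // assumed shape
  const lastActive = new Date(container.last_active_at).getTime();
  return now - lastActive > windowMinutes * 60 * 1000;
}
```

If the check says the container is likely gone, recreate it and re-upload the files the task needs rather than assuming prior state survived.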

If you need a refresher on why state matters for agents, see What Are AI Agents and How Do They Work? and AI Agents vs Chatbots: What’s the Difference?.

Installation and setup

You can use the OpenAI SDK for JavaScript or call the REST API directly. The examples below use Node.js.

npm install openai
export OPENAI_API_KEY=your_api_key_here

Create a client:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

For shell-enabled workflows, use a GPT-5.2+ model and be explicit in the system prompt that the model can use shell when needed. OpenAI’s runtime only executes tool calls proposed by the model through the orchestrator, so your prompt should describe the job and the available environment clearly. That pattern overlaps with good context design more broadly, which is covered in Context Engineering: The Most Important AI Skill in 2026.

Create a hosted container

A container is the working environment for your agent. It holds files, can be configured with network controls, and expires after inactivity.

This example creates a container with a domain allowlist:

const container = await client.containers.create({
  name: "research-agent-workspace",
  network_policy: {
    type: "allowlist",
    allow: [
      { domain: "api.github.com" },
      { domain: "raw.githubusercontent.com" }
    ]
  }
});

console.log(container.id);

OpenAI’s container creation docs show a status like running, a memory limit example of 4g, and an expiry policy tied to last_active_at with a 20-minute sample window. Treat that as the baseline shape of the hosted environment when designing tasks that may pause between steps.

You can inspect the container later:

const current = await client.containers.retrieve(container.id);
console.log({
  id: current.id,
  status: current.status,
  lastActiveAt: current.last_active_at,
  expiresAfter: current.expires_after
});

Add files instead of stuffing data into prompts

One of OpenAI’s main recommendations is to stop copying large files into prompt context. Put them in the container filesystem and let the agent read what it needs.

If you already uploaded a file through OpenAI’s file APIs, you can attach it to the container by ID. You can also send multipart content directly, depending on your workflow. This example uses an existing file ID:

await client.containers.files.create(container.id, {
  file_id: "file_abc123"
});

List files in the workspace:

const files = await client.containers.files.list(container.id);
for (const f of files.data) {
  console.log(f.id, f.path, f.size_bytes);
}

For tabular data, OpenAI recommends keeping structured data in something like SQLite instead of pasting CSV or spreadsheet content into the model context. That is a good default for agents that need repeatable queries, joins, or filtered retrieval. If your application needs retrieval beyond the container filesystem, compare that with How to Build a RAG Application (Step by Step) and Fine-Tuning vs RAG: When to Use Each Approach.
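One way to act on that recommendation is to have the agent import CSVs into SQLite inside the container instead of pasting rows into the prompt. This hypothetical helper builds the `sqlite3` CLI command your orchestration code (or the model, via shell) would run; the function name and paths are illustrative, not part of any API:

```javascript
// Sketch: build a sqlite3 CLI command that imports a CSV into a table,
// so tabular data lives in the container filesystem, not the prompt.
function sqliteImportCommand(csvPath, dbPath, table) {
  // Basic guard: only allow simple identifiers as table names.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(table)) {
    throw new Error(`unsafe table name: ${table}`);
  }
  return `sqlite3 ${dbPath} ".mode csv" ".import ${csvPath} ${table}"`;
}
```

From there the agent can answer questions with `SELECT` queries whose output is small and repeatable, instead of re-reading the whole file each turn.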

Run a basic stateful agent with shell access

The key pattern is simple. Create a container, ask the model to work inside it, and let the Responses API orchestrate shell execution and tool output over multiple turns.

const response = await client.responses.create({
  model: "gpt-5.2",
  input: [
    {
      role: "system",
      content: [
        {
          type: "input_text",
          text: "You are a coding agent. Use shell when useful. Store outputs in the working directory and explain final results briefly."
        }
      ]
    },
    {
      role: "user",
      content: [
        {
          type: "input_text",
          text: "Create a small Node.js script that fetches the latest OpenAI blog RSS feed and writes the titles to titles.txt."
        }
      ]
    }
  ],
  container: {
    id: container.id
  }
});

console.log(response.output_text);

The exact tool call items in the response may vary by SDK version, but the architecture is the same. The model proposes shell commands, the Responses API runs them in the hosted container, then the model continues with the results.

OpenAI’s March 11 write-up also notes that models can propose multiple shell commands in one step, and the platform can run them concurrently in separate container sessions. That is useful for parallel downloads, test shards, or independent data transforms. It also means you should cap or constrain output where possible so logs do not consume context unnecessarily.
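One simple way to cap output is to truncate long shell logs before they re-enter the conversation, keeping the head and tail where errors usually appear. A minimal sketch; the line budget is an arbitrary application choice, not an API parameter:

```javascript
// Sketch: keep the head and tail of long shell output, eliding the middle.
function capOutput(text, maxLines = 40) {
  const lines = text.split("\n");
  if (lines.length <= maxLines) return text;
  const head = lines.slice(0, maxLines / 2);
  const tail = lines.slice(-maxLines / 2);
  return [...head, `… (${lines.length - maxLines} lines elided) …`, ...tail].join("\n");
}
```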

Add a skill to standardize agent behavior

A skill is a reusable bundle, typically a folder with a SKILL.md plus supporting assets. Skills are versioned and immutable by version, which makes them useful for repeatable agent workflows.

You can provision a container with one or more existing skills:

const skilledContainer = await client.containers.create({
  name: "agent-with-skills",
  skills: [
    {
      id: "skill_123",
      version: "ver_20260311"
    }
  ]
});

The Skills API exposes metadata such as default_version and latest_version, and new versions are created separately. That versioning model is important in production because agent behavior often changes more from instructions and tool packaging than from model changes.
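You can enforce pinning in your own provisioning code. A sketch that resolves which skill version to load, assuming the `default_version` metadata field the Skills API is described as exposing:

```javascript
// Sketch: resolve the skill version to provision.
// Prefer an explicit pin; otherwise fall back to the skill's default_version.
function resolveSkillVersion(skillMeta, pinnedVersion) {
  if (pinnedVersion) return { id: skillMeta.id, version: pinnedVersion };
  if (!skillMeta.default_version) {
    throw new Error(`skill ${skillMeta.id} has no default_version; pin one explicitly`);
  }
  return { id: skillMeta.id, version: skillMeta.default_version };
}
```

Failing loudly when no version can be resolved is deliberate: silently falling back to `latest_version` is exactly the drift that pinning is meant to prevent.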

Here is a typical decision table:

| Option | Best for | Tradeoff |
|---|---|---|
| No skill | One-off tasks, prototypes | Behavior drifts across prompts |
| Shared default skill | Team-wide baseline workflow | Changes affect all consumers unless pinned |
| Version-pinned skill | Production agents, audits, repeatability | More version management overhead |
| Inline skill bundle | Dynamic or generated skills per task | Harder to reuse and inspect |

For a deeper look at packaging, see What Are Agent Skills and Why They Matter and How to Create Your First Agent Skill. If your team also uses editor-native rules, Agent Skills vs Cursor Rules: When to Use Each is the useful comparison.

Configure network access and secrets carefully

The security model matters because a shell-capable agent can make outbound requests. OpenAI’s hosted containers route outbound traffic through an egress proxy with centralized policy, using allowlists and domain-scoped secret injection.

At the API level, the part you control directly is the network policy. Keep it narrow:

const lockedDownContainer = await client.containers.create({
  name: "finance-agent",
  network_policy: {
    type: "allowlist",
    allow: [
      { domain: "api.stripe.com" },
      { domain: "storage.googleapis.com" }
    ]
  }
});

That design has three practical implications:

| Practice | Why it matters |
|---|---|
| Allowlist only required domains | Limits blast radius for tool misuse |
| Keep secrets scoped to specific destinations | Reduces accidental leakage paths |
| Prefer APIs over open web browsing | Makes outputs more deterministic and auditable |

If your agent handles sensitive data, pair this with your normal application controls for audit logs, request validation, and secret rotation. The shell runtime gives the model more power, so permission design becomes part of prompt design.
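The same allowlist can be mirrored client-side so your application rejects agent-requested URLs before a request is ever attempted. A sketch, assuming the `{ domain }` entry shape shown above; exact hostname matching is a deliberate choice, since implicitly allowing subdomains widens the policy:

```javascript
// Sketch: check a URL against an allowlist of { domain } entries.
// Exact match only: api.github.com does not implicitly allow subdomains.
function isAllowed(url, allow) {
  const host = new URL(url).hostname;
  return allow.some((entry) => entry.domain === host);
}
```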

Manage long sessions with compaction

Long-running agents accumulate history quickly, especially when shell output and file operations are part of the loop. OpenAI added server-side compaction so the Responses API can preserve important state in a compressed representation.

You can trigger compaction directly with the POST /responses/compact endpoint when you want explicit control:

const compacted = await client.responses.compact({
  response_id: response.id
});

console.log(compacted.id);

The compaction object includes a typed compaction item with encrypted_content. OpenAI also states that compaction can happen automatically once a threshold is crossed, depending on the workflow.

Use compaction when:

  • the agent runs across many tool calls,
  • shell output is large,
  • intermediate reasoning no longer needs to stay verbatim in context,
  • or you want to keep a session alive past normal prompt window pressure.
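If you trigger compaction yourself, a simple heuristic is to watch the token usage the API reports and compact once it crosses a fraction of the model's context window. A sketch with an assumed usage shape (`{ total_tokens }`) and an arbitrary 70% threshold:

```javascript
// Sketch: decide when to call responses.compact, based on reported usage.
function shouldCompact(usage, contextWindow, threshold = 0.7) {
  if (!usage || !usage.total_tokens) return false;
  return usage.total_tokens / contextWindow >= threshold;
}
```

Run the check after each turn; when it fires, call the compact endpoint and continue the session from the compacted response.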

This is closely related to context management in general. If you want the broader mental model, Context Windows Explained: Why Your AI Forgets is the right companion piece.

Tradeoffs and limitations

This runtime is powerful, but it changes your architecture.

| Limitation | What it means for your app |
|---|---|
| Container expiry | Idle agents lose their workspace after the expiration window |
| Hosted environment limits | Memory, process behavior, and runtime packages are constrained |
| Shell requires supported models | Use GPT-5.2+ for shell-trained behavior |
| Output can overwhelm context | You need log caps, summaries, or compaction |
| Skill quality matters | Poorly packaged instructions reduce reliability |
| Network controls need design | External access should be minimal and explicit |

The default pattern works well for coding agents, report generation, ETL-style automation, and research workflows. It is less ideal for tasks that need long-term persistence across days unless you externalize state to databases or object storage and treat containers as short-lived workspaces.

Start with one narrowly scoped workflow, create a version-pinned skill, keep the network allowlist tight, and store task data in container files or SQLite instead of prompt text. Once that loop is stable, add compaction and move the workflow behind your application’s job queue so the agent can run reliably under production load.
