Ai Engineering 4 min read

How to Expose Ephemeral vLLM Endpoints on Hugging Face Jobs

Learn how to spin up temporary, OpenAI-compatible vLLM inference endpoints on Hugging Face serverless infrastructure using a single CLI command.

Hugging Face recently introduced a feature that allows developers to expose network ports directly from background compute tasks, turning them into reachable inference endpoints. This workflow, detailed in the Hugging Face Jobs announcement, lets you spin up a temporary, OpenAI-compatible vLLM server on serverless infrastructure using a single CLI command. You can provision high-throughput inference backends for tasks like data labeling, batch generation, or local agent testing without paying for permanent infrastructure.

This approach differs from standard Inference Endpoints. It focuses entirely on ephemeral workloads where a server only needs to exist for the duration of a specific job. When the job completes or times out, the billing stops automatically.

Prerequisites and Setup

To use the port exposure feature, you need the Hugging Face CLI configured on your local machine. The feature relies on the hf jobs module, which requires a specific library version.

Ensure your local environment has huggingface_hub version 1.20.0 or higher installed. You also need an active Hugging Face token with read permissions for the namespace where the job will execute.

Configure your environment by logging in through the CLI:

bash huggingface-cli login

This token handles both authenticating your job submission and verifying requests made to the exposed endpoint later.

Launching the vLLM Server

Starting the server requires passing the --expose flag to the hf jobs run command. This flag instructs the Hugging Face infrastructure to route the container’s internal port through a public proxy, generating a unique URL.

The command uses the official vllm/vllm-openai Docker image. This guarantees compatibility with any OpenAI SDK-based client, including tools like instructor, llm-cli, or various testing frameworks.

To launch a Qwen 3 model on a single NVIDIA A10G GPU, execute the following command:

bash hf jobs run —flavor a10g-large —expose 8000 vllm/vllm-openai:latest vllm serve Qwen/Qwen3-4B —host 0.0.0.0 —port 8000

Several parameters control the deployment:

  • --flavor: Defines the hardware. Options include a10g-large, l4, or h100, billed on a pay-per-minute basis.
  • --expose 8000: Tells the Hugging Face proxy to map internal port 8000 to the public internet.
  • vllm serve Qwen/Qwen3-4B: The standard vLLM entrypoint and the target model ID.
  • --host 0.0.0.0: Required to ensure the vLLM server binds to all network interfaces within the container, allowing the proxy to reach it.

If you need to load a private model or a gated model, initialize the job with the -s flag to pass your token as a secret.

bash hf jobs run -s HF_TOKEN —flavor a10g-large —expose 8000 vllm/vllm-openai:latest vllm serve LiquidAI/LFM2.5-8B-A1B —host 0.0.0.0 —port 8000

Connecting to the Endpoint

Once the job starts, the CLI outputs a unique job ID. The server becomes reachable at a specific URL formatted based on that ID and the exposed port.

The URL pattern is: https://<job_id>--<port>.hf.jobs

Accessing this endpoint is not anonymous. The proxy enforces authentication, requiring a Hugging Face token passed as a standard Bearer token in the authorization header. You must use a token that has read access to the namespace that owns the job.

Because the server uses the OpenAI-compatible vLLM image, you can connect to it using the standard openai-python client. Point the client to the generated URL and use your Hugging Face token as the API key.

python from openai import OpenAI

client = OpenAI( base_url=“https://your-job-id—8000.hf.jobs/v1”, api_key=“your_huggingface_token” )

response = client.chat.completions.create( model=“Qwen/Qwen3-4B”, messages=[{“role”: “user”, “content”: “Write a python script to parse JSON.”}] )

print(response.choices[0].message.content)

Architecture Tradeoffs and Use Cases

Hugging Face positions this workflow specifically for temporary tasks. It is not designed to replace high-availability production APIs. Understanding the architectural differences helps you choose the right deployment method for your workload.

FeatureHF Jobs (vLLM)Inference Endpoints
PersistenceEphemeral (stops when job ends/times out)Permanent (stays up until deleted)
ScalingManual/Fixed hardwareManaged Autoscaling
SetupOne command (hf jobs run)UI or API Managed
Best ForExperiments, Evals, Batch generationProduction APIs, Web Apps

The ephemeral nature makes HF Jobs ideal when you need to evaluate and test AI agents against specific model checkpoints. You can spin up a specific version of a model, run an evaluation suite like lm-evaluation-harness, and let the server terminate automatically. This pattern is particularly useful if you are trying to reduce LLM API costs in production by shifting testing workloads off permanent, expensive endpoints.

It also functions well as a temporary backend for coding agents like Claude Code or Cursor. A developer can launch an isolated instance of a smaller model to act as a local sandbox for testing tool integrations or complex reasoning loops.

Review the Hugging Face Jobs documentation to verify the exact timeout limits and available hardware flavors for your account tier before scripting large evaluation runs.

Get Insanely Good at AI

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.

Keep Reading