How to Expose Ephemeral vLLM Endpoints on Hugging Face Jobs
Learn how to spin up temporary, OpenAI-compatible vLLM inference endpoints on Hugging Face serverless infrastructure using a single CLI command.
Hugging Face recently introduced a feature that allows developers to expose network ports directly from background compute tasks, turning them into reachable inference endpoints. This workflow, detailed in the Hugging Face Jobs announcement, lets you spin up a temporary, OpenAI-compatible vLLM server on serverless infrastructure using a single CLI command. You can provision high-throughput inference backends for tasks like data labeling, batch generation, or local agent testing without paying for permanent infrastructure.
This approach differs from standard Inference Endpoints. It focuses entirely on ephemeral workloads where a server only needs to exist for the duration of a specific job. When the job completes or times out, the billing stops automatically.
Prerequisites and Setup
To use the port exposure feature, you need the Hugging Face CLI configured on your local machine. The feature relies on the hf jobs module, which requires a specific library version.
Ensure your local environment has huggingface_hub version 1.20.0 or higher installed. You also need an active Hugging Face token with read permissions for the namespace where the job will execute.
Configure your environment by logging in through the CLI:
bash huggingface-cli login
This token handles both authenticating your job submission and verifying requests made to the exposed endpoint later.
Launching the vLLM Server
Starting the server requires passing the --expose flag to the hf jobs run command. This flag instructs the Hugging Face infrastructure to route the container’s internal port through a public proxy, generating a unique URL.
The command uses the official vllm/vllm-openai Docker image. This guarantees compatibility with any OpenAI SDK-based client, including tools like instructor, llm-cli, or various testing frameworks.
To launch a Qwen 3 model on a single NVIDIA A10G GPU, execute the following command:
bash hf jobs run —flavor a10g-large —expose 8000 vllm/vllm-openai:latest vllm serve Qwen/Qwen3-4B —host 0.0.0.0 —port 8000
Several parameters control the deployment:
--flavor: Defines the hardware. Options includea10g-large,l4, orh100, billed on a pay-per-minute basis.--expose 8000: Tells the Hugging Face proxy to map internal port 8000 to the public internet.vllm serve Qwen/Qwen3-4B: The standard vLLM entrypoint and the target model ID.--host 0.0.0.0: Required to ensure the vLLM server binds to all network interfaces within the container, allowing the proxy to reach it.
If you need to load a private model or a gated model, initialize the job with the -s flag to pass your token as a secret.
bash hf jobs run -s HF_TOKEN —flavor a10g-large —expose 8000 vllm/vllm-openai:latest vllm serve LiquidAI/LFM2.5-8B-A1B —host 0.0.0.0 —port 8000
Connecting to the Endpoint
Once the job starts, the CLI outputs a unique job ID. The server becomes reachable at a specific URL formatted based on that ID and the exposed port.
The URL pattern is:
https://<job_id>--<port>.hf.jobs
Accessing this endpoint is not anonymous. The proxy enforces authentication, requiring a Hugging Face token passed as a standard Bearer token in the authorization header. You must use a token that has read access to the namespace that owns the job.
Because the server uses the OpenAI-compatible vLLM image, you can connect to it using the standard openai-python client. Point the client to the generated URL and use your Hugging Face token as the API key.
python from openai import OpenAI
client = OpenAI( base_url=“https://your-job-id—8000.hf.jobs/v1”, api_key=“your_huggingface_token” )
response = client.chat.completions.create( model=“Qwen/Qwen3-4B”, messages=[{“role”: “user”, “content”: “Write a python script to parse JSON.”}] )
print(response.choices[0].message.content)
Architecture Tradeoffs and Use Cases
Hugging Face positions this workflow specifically for temporary tasks. It is not designed to replace high-availability production APIs. Understanding the architectural differences helps you choose the right deployment method for your workload.
| Feature | HF Jobs (vLLM) | Inference Endpoints |
|---|---|---|
| Persistence | Ephemeral (stops when job ends/times out) | Permanent (stays up until deleted) |
| Scaling | Manual/Fixed hardware | Managed Autoscaling |
| Setup | One command (hf jobs run) | UI or API Managed |
| Best For | Experiments, Evals, Batch generation | Production APIs, Web Apps |
The ephemeral nature makes HF Jobs ideal when you need to evaluate and test AI agents against specific model checkpoints. You can spin up a specific version of a model, run an evaluation suite like lm-evaluation-harness, and let the server terminate automatically. This pattern is particularly useful if you are trying to reduce LLM API costs in production by shifting testing workloads off permanent, expensive endpoints.
It also functions well as a temporary backend for coding agents like Claude Code or Cursor. A developer can launch an isolated instance of a smaller model to act as a local sandbox for testing tool integrations or complex reasoning loops.
Review the Hugging Face Jobs documentation to verify the exact timeout limits and available hardware flavors for your account tier before scripting large evaluation runs.
Get Insanely Good at AI
The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
Keep Reading
Cohere Transcribe debuts as open-source ASR model
Cohere Transcribe launches as a 2B open-source speech-to-text model with 14-language support, self-hosting, and vLLM serving.
How to Run In-Loop Model Evaluations With olmo-eval
Learn how to set up olmo-eval to test large language model checkpoints during the training process using vLLM, LiteLLM, and Docker-based agent sandboxes.
8K Context Reranking Hits Hugging Face With Ettin Cross-Encoders
Hugging Face released six open-source cross-encoders under the Ettin Reranker family with an 8,192-token context window for long-form document retrieval.
Outpacing Whisper: Cohere Transcribe Hits Top ASR Speed
Experience enterprise-grade audio intelligence with Cohere Transcribe, a new open-weights model topping the ASR leaderboard with 3x faster speeds than Whisper.
Hugging Face Reports Chinese Open Models Overtook U.S. on Hub as Qwen and DeepSeek Drive Derivative Boom
Hugging Face's Spring 2026 report says Chinese open models now lead Hub adoption, with Qwen and DeepSeek powering a surge in derivatives.