Serving LLMs in Secure & Air Gapped Environments
There's a version of this job that's clean and simple: you pick a model, call an API, and ship. OpenAI, Anthropic, Google. They handle the infrastructure, you handle the application. Done.
Then there's the other version. The one where the client is a defense contractor, a hospital, a government agency, or any institution that operates under strict data sovereignty requirements. The one where the hardware lives in a room you need a badge to enter, and that hardware has never seen the internet. And never will.
That's what this post is about.
What is an Air Gapped Environment?
The term sounds more dramatic than it is. An air gapped machine is simply one that physically can't talk to the internet: not firewall blocked, not rate limited, but physically disconnected. No network cable, no active Wi-Fi card, sometimes not even a USB port that isn't explicitly approved. The name comes from the literal gap of air between the machine and any external network.
To move data in or out you need physical media: a USB drive, a hard disk, sometimes a disc. Someone physically walks it through a checkpoint. It sounds theatrical until you realize this is standard operating procedure in defense, healthcare, and government, where the cost of a breach isn't a bad news cycle, it's something much worse.
The security argument is simple: you can't remotely compromise what you can't reach. The operational cost is equally simple, and much more annoying: you have to do everything yourself. Updates, packages, model weights, Docker images. All of it, manually, every time.
Prerequisites
This guide assumes you're already comfortable with Linux (you'll live in the terminal), Docker (Dockerfile + Compose, since the whole deployment is containerized), and LLM fundamentals: quantization, VRAM budgeting, tensor parallelism.
If any of those feel shaky, fix that first. Air gapped deployments don't simplify anything, they just stack on top.
The LLM Serving Framework Landscape
I've touched most of these. Here's where they actually land:
| Framework | Best For | Key Trait |
|---|---|---|
| llama.cpp | CPU / edge / no GPU | Extreme portability, C++ core, minimal dependencies |
| Ollama | Local dev & prototyping | Single command simplicity, wraps llama.cpp |
| vLLM | Multi-user production (no K8s) | High throughput via PagedAttention, OpenAI compatible |
| NVIDIA Triton | NVIDIA native production | Deep NVIDIA ecosystem integration |
| TensorRT LLM | Max performance on NVIDIA | Requires model compilation, highest throughput ceiling |
| Ray Serve | Distributed / cluster based | Native Kubernetes + distributed compute |
| AIBrix | Serverless LLM on K8s | Kubernetes native, autoscaling |
For bare metal GPU clusters without Kubernetes, vLLM is the one I'd pick. It's not perfect, but the tradeoffs land in the right places: you get real throughput, multi GPU support that actually works, and an OpenAI compatible API with zero extra configuration.
Ollama works great on a laptop. Put it in front of any meaningful concurrent load and you'll watch response times crater. Its memory model preallocates statically and just wasn't built for that use case. llama.cpp is the right answer when you're on CPU only hardware or constrained edge devices. TensorRT LLM can eke out better raw numbers, but you're compiling the model specifically for your hardware and locking yourself into NVIDIA's toolchain in a way that tends to become someone's full time job. Most teams don't have appetite for that.
Why vLLM?
The core reason vLLM wins in this setup is PagedAttention. The UC Berkeley team published the paper in 2023 and the insight is genuinely elegant once you see it.
Every LLM request needs a KV cache, a memory buffer that holds the attention state as tokens are generated. The problem with older systems is they allocate this buffer upfront, at the maximum possible size. Since you can't know in advance how long the output will be, you're often reserving two or three times the memory you'll actually use. PagedAttention borrows the OS trick of paging: allocate in small fixed size blocks, only as needed, and release them when you're done.
The practical outcome is that you fit a lot more concurrent requests onto the same GPU. Batch sizes go up. The PagedAttention paper reports 2–4× higher throughput at the same latency compared with earlier serving systems. When you have eight users hitting the endpoint at once (it happens), the difference is very real.
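To make the allocation argument concrete, here is a toy counting exercise. It is a sketch, not vLLM's allocator: it counts reserved token slots only, the function names are hypothetical, and the block size of 16 tokens, the 4096-token maximum, and the request lengths are made-up values for illustration.

```python
import math


def preallocated_slots(num_requests: int, max_len: int) -> int:
    """Token slots reserved when every request gets a max-length buffer upfront."""
    return num_requests * max_len


def paged_slots(actual_lengths: list[int], block_size: int = 16) -> int:
    """Token slots reserved when each request holds only the fixed-size blocks it has filled."""
    return sum(math.ceil(n / block_size) * block_size for n in actual_lengths)


if __name__ == "__main__":
    # Made-up sequence lengths for eight concurrent requests.
    lengths = [120, 800, 45, 2000, 310, 64, 1500, 990]
    upfront = preallocated_slots(len(lengths), max_len=4096)
    paged = paged_slots(lengths, block_size=16)
    print(f"preallocated: {upfront} slots, paged: {paged} slots")
    print(f"upfront reservation is {upfront / paged:.1f}x the paged one")
```

The waste in the paged case is bounded by one partially filled block per request, which is why the freed memory can go to extra concurrent requests instead.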
The Full Deployment Workflow
Step 1: Choose your model
First, verify your model is on the vLLM supported list. Not everything is. Some newer architectures lag behind by a few releases. Once that's confirmed, pull up the model card on Hugging Face and write down:
- Parameter count → determines base VRAM requirements
- Number of attention heads / shards → determines valid tensor-parallel-size values
- Quantization options → bitsandbytes, AWQ, GPTQ, etc.
- Architecture → some newer architectures require the latest vLLM image
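For the parameter count item, a back-of-the-envelope estimate is usually enough to rule configurations in or out. The sketch below covers the weights only, assumes 1 GB = 10^9 bytes, and uses hypothetical helper names; the KV cache, activations, and any quantization overhead need room on top of this.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def weights_fit(
    params_billions: float,
    bits_per_weight: int,
    gpu_gb: float,
    num_gpus: int,
    utilization: float = 0.95,
) -> bool:
    """True if an even split of the weights fits in the usable share of each GPU.

    `utilization` mirrors the idea behind --gpu-memory-utilization: the
    fraction of each GPU's memory the server is allowed to claim.
    """
    per_gpu = weight_memory_gb(params_billions, bits_per_weight) / num_gpus
    return per_gpu <= gpu_gb * utilization


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"70B at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB of weights")
    print("fits on 2 x 80 GB at 16-bit:", weights_fit(70, 16, gpu_gb=80, num_gpus=2))
```

A result of True here only means the weights load; how many concurrent requests you can serve depends on what is left over for the KV cache.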
Step 2: Pull the vLLM Docker image
vLLM ships official images via Docker Hub. Run nvidia-smi before pulling: the CUDA version the image is built against has to be supported by your host driver (nvidia-smi reports the highest CUDA version the driver supports), and getting this wrong is annoying to debug after the fact.
docker pull vllm/vllm-openai:latest
Note: When a new model architecture drops, you sometimes need the latest vLLM image to load it. If a model fails with a cryptic error on startup, pulling a newer image is the first thing to try before going any deeper.
Step 3: Download the model
Do this on your staging machine while it still has internet. vLLM pulls the weights from Hugging Face into the mounted volume on first boot.
# docker-compose.staging.yml
services:
  llm-server:
    container_name: llm-server
    image: vllm/vllm-openai:latest
    network_mode: host
    ipc: host  # full shared memory access for multi-GPU comms
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2  # adjust to your GPU count
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1  # GPU IDs to use
    volumes:
      - model-cache:/root/.cache/huggingface
    entrypoint: python3
    command: >
      -m vllm.entrypoints.openai.api_server
      --model <model_id>
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 2
      --quantization bitsandbytes
      --gpu-memory-utilization 0.95
volumes:
  model-cache:
Understanding --tensor-parallel-size
This tells vLLM how many GPUs to split the model across. The constraint is strict: the model's attention head count must be evenly divisible by whatever number you pick. A model with 32 heads can run on 1, 2, 4, or 8 GPUs, but not 3.
Check the model card first. If you get it wrong, vLLM will catch it on startup and tell you why.
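The divisibility rule is easy to check before you ever start a container. This is a minimal sketch with a hypothetical helper name; it assumes you have read the head count off the model card (many Hugging Face configs expose it as num_attention_heads in config.json) and it checks only this one rule, so vLLM may still reject a size for other reasons.

```python
def valid_tp_sizes(num_attention_heads: int, num_gpus: int) -> list[int]:
    """GPU counts up to num_gpus that divide the attention head count evenly."""
    return [n for n in range(1, num_gpus + 1) if num_attention_heads % n == 0]


if __name__ == "__main__":
    # Head counts below are illustrative, not tied to a specific model.
    for heads in (32, 40, 28):
        print(f"{heads} heads on an 8-GPU box:", valid_tp_sizes(heads, 8))
```

If the list does not contain your GPU count, pick the largest value that is in it and restrict CUDA_VISIBLE_DEVICES to that many devices.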
Step 4: Move to air gapped infrastructure
This is where most people lose time. You have two environments with no path between them, and you need to get both a large Docker image and several gigabytes of model weights from one side to the other.
Option A: Internal Docker Registry:
# Tag for your internal registry
docker tag vllm/vllm-openai:latest <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest
# Push to the internal registry (accessible within the air-gapped network)
docker push <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest
Option B: Tarball transfer (truly offline):
# Export the image on staging
docker save vllm/vllm-openai:latest | gzip > vllm-image.tar.gz
# Load it on the air-gapped machine after physical transfer
docker load < vllm-image.tar.gz
Move the model files:
# If the two environments share a network segment (staging to production)
rsync -avz /path/to/model-cache/ user@production-host:/path/to/model-cache/
# If truly offline: copy to external drive, physically walk it over, copy off
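Large files that travel on physical media are worth verifying on arrival. One way to do that, sketched here with the standard library only, is to hash every file on the staging side and compare on the air gapped side. The function names are hypothetical and the manifest format (relative path to SHA-256 hex digest) is an assumption, not something vLLM or Docker requires.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-gigabyte weights never sit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root, POSIX style) to its SHA-256."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(expected: dict[str, str], root: Path) -> list[str]:
    """Return paths that are missing, unexpected, or whose hash differs under root."""
    actual = build_manifest(root)
    return sorted(
        name
        for name in expected.keys() | actual.keys()
        if expected.get(name) != actual.get(name)
    )
```

In practice you would dump the result of build_manifest to a JSON file on staging, carry it on the same drive, and run find_mismatches against the copied directory before starting the server. An empty list means the copy is byte-for-byte identical.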
Step 5: Serve on air gapped infrastructure
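The production Compose file is almost identical to staging. Two things change: the image source points at your internal registry (or at the locally loaded tag if you went the tarball route), and the volume mount switches to the local path where you dropped the model weights. It's also worth adding HF_HUB_OFFLINE=1 to the environment block so the Hugging Face client reads from the local cache instead of trying to reach the Hub on startup.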
# docker-compose.production.yml
services:
  llm-server:
    container_name: llm-server
    image: <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest  # from internal registry
    network_mode: host
    ipc: host
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
    volumes:
      - /data/models:/root/.cache/huggingface  # local path to copied model files
    entrypoint: python3
    command: >
      -m vllm.entrypoints.openai.api_server
      --model <model_id>
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 2
      --quantization bitsandbytes
      --gpu-memory-utilization 0.95
Start it detached:
docker compose -f docker-compose.production.yml up -d
Watch startup logs:
docker compose logs -f llm-server
You'll watch it load each shard, log how much memory it's claiming per GPU, and eventually print the FastAPI startup line with the port it's bound to. That line is the one you're waiting for.
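If you would rather script the wait than watch logs, polling the health endpoint works too. The sketch below uses only the standard library; the function names, the ten-minute deadline, and the five-second interval are assumptions for illustration, and it treats any HTTP 200 from the URL you pass as "ready".

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout_s: float = 2.0) -> int:
    """Return the HTTP status of a GET, or 0 if the server is not reachable yet."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, etc.


def wait_until_healthy(
    url: str,
    deadline_s: float = 600.0,
    interval_s: float = 5.0,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until it returns 200 or roughly deadline_s seconds of waiting have passed."""
    waited = 0.0
    while True:
        if probe(url) == 200:
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

With the Compose file above, a call like wait_until_healthy("http://localhost:8000/health") would block until the server is up or the deadline passes. Large models can take several minutes to load, so a generous deadline is the safer default.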
Step 6: Use the API
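vLLM's server is OpenAI compatible: same endpoint paths, same request shape. Point your client's base URL at the local host and port, set the model name to the same <model_id> the server was started with, and the rest of your existing code doesn't need to change.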
from openai import OpenAI

client = OpenAI(
    base_url="http://<HOST>:<PORT>/v1",
    api_key="not-needed",  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="<model_id>",
    messages=[
        {"role": "user", "content": "Explain transformer attention in one paragraph."}
    ],
    stream=True,
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Everything you'd expect is there:
| Endpoint | Description |
|---|---|
| GET /v1/models | List loaded models |
| POST /v1/chat/completions | Chat (OpenAI compatible) |
| POST /v1/completions | Text completion |
| GET /metrics | Prometheus compatible metrics |
| GET /health | Health check |
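On a machine where you cannot pip install anything new, the standard library is enough to confirm which model the server thinks it is serving. This sketch assumes the response follows the OpenAI list shape (a top-level data array whose entries carry an id); the function names are hypothetical.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str, timeout_s: float = 5.0) -> list[str]:
    """GET <base_url>/v1/models and return the IDs the server reports."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
        return model_ids(json.load(resp))
```

Calling fetch_model_ids("http://<HOST>:<PORT>") should return the ID you passed to --model. If your client gets a "model not found" style error, comparing against this list is a quick first check.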
Wrapping Up
This isn't the glamorous part of AI engineering. Nobody's writing blog posts about their internal Docker registry or how they spent an afternoon syncing 70GB of model weights onto a hard drive to walk it through a security checkpoint.
But some of the most important AI deployments are going to be exactly this: unglamorous, slow to set up, running on hardware that's never touched the internet and never will. And when you're in that situation, you need a stack that actually works under those constraints. vLLM does. You get real concurrent throughput, multi GPU tensor parallelism, and an API your application already speaks, all running entirely on premises.
You own the maintenance burden. There's no getting around that. But in environments where data sovereignty isn't a preference but a legal or contractual requirement, that burden is already yours regardless of what you deploy.
If your setup has Kubernetes, the conversation shifts to Ray Serve or AIBrix. That's a different post.
Have questions, or run into something I didn't cover? Feel free to reach out.