Running LLMs in Secure & Air-Gapped Infrastructure
There's a version of this job that's clean and simple: you pick a model, call an API, and ship. OpenAI, Anthropic, Google — they handle the infrastructure, you handle the application. Done.
Then there's the other version. The one where the client is a defense contractor, a hospital, a government agency, or any institution that operates under strict data sovereignty requirements. The one where the hardware lives in a room you need a badge to enter, and that hardware has never seen the internet — and never will.
That's what this post is about.
What is an Air-Gapped Environment?
Before anything else, let's make sure we're speaking the same language.
An air-gapped system is a network or machine that is completely physically isolated from external networks — no internet, no Wi-Fi, no Bluetooth, nothing. The term comes from the literal "air gap" between the machine and any unsecured network. To move data in or out, you need physical media: a USB drive, a hard drive, a disc. There's no remote connection to exploit.
This sounds extreme, but it's standard practice in defense, government, healthcare, and other domains with strict data sovereignty requirements. The security benefit is obvious: you can't hack what you can't reach. The operational cost is just as obvious: you have to do everything manually — including deploying AI.
Prerequisites
Before you attempt this, you should be comfortable with the following:
- Linux — you'll be living in the terminal
- Docker (Dockerfile + Compose) — the entire deployment is containerized
- LLM fundamentals — quantization, VRAM budgeting, tensor parallelism
If any of those feel shaky, shore them up first. The air-gapped deployment layer adds complexity on top of all three.
The LLM Serving Framework Landscape
There are several frameworks for running LLMs locally on your own hardware. Here's how they compare:
| Framework | Best For | Key Trait |
|---|---|---|
| llama.cpp | CPU / edge / no-GPU | Extreme portability, C++ core, minimal dependencies |
| Ollama | Local dev & prototyping | Single-command simplicity, wraps llama.cpp |
| vLLM | Multi-user production (no K8s) | High throughput via PagedAttention, OpenAI-compatible |
| NVIDIA Triton | NVIDIA-native production | Deep NVIDIA ecosystem integration |
| TensorRT-LLM | Max performance on NVIDIA | Requires model compilation, highest throughput ceiling |
| Ray Serve | Distributed / cluster-based | Native Kubernetes + distributed compute |
| AIBrix | Serverless LLM on K8s | Kubernetes-native, autoscaling |
After working through most of these, my recommendation for bare-metal GPU clusters without Kubernetes is vLLM. It hits the right balance: production-grade performance without requiring a container orchestration setup, an OpenAI-compatible API out of the box, and solid multi-GPU support.
A quick note on the others: Ollama is fantastic for personal use and prototyping, but under concurrent load its static memory allocation model creates a throughput ceiling. llama.cpp shines on CPU-only or edge deployments. TensorRT-LLM can be faster, but it requires model compilation and is deeply tied to the NVIDIA toolchain — more engineering overhead than most teams want.
Why vLLM?
vLLM was built at UC Berkeley and introduced PagedAttention — an attention algorithm that manages the KV cache the way operating systems manage virtual memory: in fixed-size pages.
Instead of pre-allocating a contiguous memory block for each request's KV cache (an approach that wastes 60–80% of KV-cache memory through fragmentation and over-reservation), PagedAttention allocates memory on demand in fixed-size blocks. The result: near-zero memory waste, larger batch sizes, and 2–4× higher throughput than earlier serving systems at the same latency.
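To make the memory math concrete, here's a back-of-envelope sketch. The model dimensions are illustrative (roughly 7B-class, no grouped-query attention), not taken from any specific config:

```python
# Back-of-envelope KV-cache sizing (illustrative numbers, fp16 cache, no GQA).
num_layers = 32        # transformer layers
num_kv_heads = 32      # KV heads per layer
head_dim = 128         # dimension per head
bytes_per_elem = 2     # fp16

# Each layer stores one K and one V vector per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")   # 512 KiB

# Naive serving reserves the full context window per request up front:
max_seq_len = 4096
reserved = max_seq_len * kv_bytes_per_token      # 2 GiB per request
used = 300 * kv_bytes_per_token                  # request only produced 300 tokens
print(f"Reservation wasted: {100 * (1 - used / reserved):.0f}%")    # ~93%
```

PagedAttention sidesteps that waste by handing out cache blocks of 16 tokens (vLLM's default block size) only as a sequence actually grows, so the unused reservation goes back into the batch.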
In practice, this matters a lot when multiple users are hitting the same endpoint simultaneously. vLLM handles concurrent requests gracefully. Ollama doesn't.
The Full Deployment Workflow
Let's go through each step.
Step 1 — Choose Your Model
Start at the vLLM supported models list to confirm compatibility. Then browse the model's Hugging Face card for the details you'll need:
- Parameter count → determines base VRAM requirements (see the sizing sketch below)
- Number of attention heads / shards → determines valid `tensor-parallel-size` values
- Quantization options → `bitsandbytes`, `AWQ`, `GPTQ`, etc.
- Architecture → some newer architectures require the latest vLLM image
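For the first bullet, a usable rule of thumb is weights VRAM ≈ parameter count × bytes per parameter, plus headroom for KV cache and activations. A minimal sketch (the sizes are illustrative):

```python
# Rough weights-only VRAM estimate. Real usage is higher: KV cache, activations,
# and CUDA overhead come on top (vLLM's --gpu-memory-utilization caps the total).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4bit": 0.5}

def weights_vram_gib(num_params_b: float, dtype: str) -> float:
    return num_params_b * 1e9 * BYTES_PER_PARAM[dtype] / 2**30

for dtype in ("fp16", "int8", "4bit"):
    print(f"70B @ {dtype}: ~{weights_vram_gib(70, dtype):.0f} GiB")
# fp16 ~130 GiB, int8 ~65 GiB, 4bit ~33 GiB; this decides whether the
# model fits across your GPUs at all.
```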
Step 2 — Pull the vLLM Docker Image
vLLM distributes official Docker images via Docker Hub.
The image's CUDA runtime must be supported by your GPU driver — plain `nvidia-smi` prints the highest CUDA version the installed driver supports, so check it before pulling.
```bash
docker pull vllm/vllm-openai:latest
```
Note: When a new model architecture is released, you sometimes need the latest vLLM image to support it. If a model fails to load with a cryptic error, pulling a newer image is often the fix before going deeper into debugging.
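If you'd rather script the compatibility check, the driver's maximum supported CUDA version appears in `nvidia-smi`'s header output; a minimal sketch:

```python
import re
import subprocess

# nvidia-smi prints "CUDA Version: XX.Y" in its header; that's the highest
# CUDA runtime the installed driver supports.
out = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True).stdout
match = re.search(r"CUDA Version:\s*([\d.]+)", out)
print(f"Driver supports up to CUDA {match.group(1)}" if match else "No CUDA version found")
```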
Step 3 — Download the Model
Run this on your staging machine (internet-connected). vLLM will download the model weights from Hugging Face into the mounted volume.
```yaml
# docker-compose.staging.yml
services:
  llm-server:
    container_name: llm-server
    image: vllm/vllm-openai:latest
    network_mode: host
    ipc: host                # full shared memory access for multi-GPU comms
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2       # adjust to your GPU count
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1   # GPU IDs to use
    volumes:
      - model-cache:/root/.cache/huggingface
    entrypoint: python3
    command: >
      -m vllm.entrypoints.openai.api_server
      --model <model_id>
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 2
      --quantization bitsandbytes
      --gpu-memory-utilization 0.95

volumes:
  model-cache:
```
Understanding --tensor-parallel-size
This flag tells vLLM how many GPUs to split the model across. The hard constraint: the model's number of attention heads must be evenly divisible by this number.
Check the model card before setting this value, or compute the valid options directly from the config, as in the sketch below. If you get it wrong, vLLM will refuse to start and tell you why.
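Here's a minimal sketch that pulls a model's config.json from Hugging Face (run it on the staging machine, which has internet) and lists the tensor-parallel sizes the head count allows. The repo ID is a placeholder:

```python
import json
from huggingface_hub import hf_hub_download  # staging machine only: needs internet

repo_id = "<model_id>"  # placeholder: your model's Hugging Face repo ID
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")

with open(config_path) as f:
    config = json.load(f)

heads = config["num_attention_heads"]
# Valid --tensor-parallel-size values are the divisors of the head count
# (capped, in practice, by how many GPUs you actually have).
valid = [n for n in range(1, heads + 1) if heads % n == 0]
print(f"{heads} attention heads -> valid tensor-parallel sizes: {valid}")
```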
Step 4 — Move to Air-Gapped Infrastructure
This is where it gets more involved. Two artifacts have to cross the air gap: the Docker image and the model weights. For the image, you have two options:
Option A — Internal Docker Registry:
```bash
# Tag for your internal registry
docker tag vllm/vllm-openai:latest <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest

# Push to the internal registry (accessible within the air-gapped network)
docker push <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest
```
Option B — Tarball transfer (truly offline):
```bash
# Export the image on staging
docker save vllm/vllm-openai:latest | gzip > vllm-image.tar.gz

# Load it on the air-gapped machine after physical transfer
docker load < vllm-image.tar.gz
```
Move the model files:
```bash
# If the two environments share a network segment (staging to production)
rsync -avz /path/to/model-cache/ user@production-host:/path/to/model-cache/

# If truly offline: copy to an external drive, physically walk it over, copy off
```
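However the files travel, verify they arrived intact before debugging anything else. A minimal sketch that hashes every file under a directory, so you can compare staging against production (the path is illustrative):

```python
import hashlib
from pathlib import Path

def sha256_tree(root: str) -> dict[str, str]:
    """Map each file's relative path to its SHA-256 digest."""
    digests = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
                    h.update(chunk)
            digests[str(path.relative_to(root))] = h.hexdigest()
    return digests

# Run on both machines and diff the output (or carry the listing over with the drive).
for rel, digest in sha256_tree("/path/to/model-cache").items():
    print(f"{digest}  {rel}")
```

Plain `sha256sum` on the shell does the same job if you prefer.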
Step 5 — Serve on Air-Gapped Infrastructure
The production Compose file is nearly identical to staging. Three things change: the image now comes from the internal registry, the volume mounts the copied model files directly, and the Hugging Face libraries are pinned to offline mode so nothing ever tries to phone home.
```yaml
# docker-compose.production.yml
services:
  llm-server:
    container_name: llm-server
    image: <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest  # from internal registry
    network_mode: host
    ipc: host
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - HF_HUB_OFFLINE=1        # force the HF hub library to use the local cache
      - TRANSFORMERS_OFFLINE=1  # never attempt to reach huggingface.co
    volumes:
      - /data/models:/root/.cache/huggingface  # local path to the copied model files
    entrypoint: python3
    command: >
      -m vllm.entrypoints.openai.api_server
      --model <model_id>
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 2
      --quantization bitsandbytes
      --gpu-memory-utilization 0.95
```
Start it detached:
```bash
docker compose -f docker-compose.production.yml up -d
```
Watch startup logs:
```bash
docker compose logs -f llm-server
```
You'll see vLLM loading the model shards across GPUs, memory allocation logs, and finally a message confirming the FastAPI server is running on your configured port.
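Loading a large model can take a while, so don't trust a single glance at the logs. A small sketch that polls the /health endpoint until the server is actually ready (host and port are whatever you configured above):

```python
import time
import urllib.error
import urllib.request

URL = "http://localhost:8000/health"  # adjust host/port to your config

# /health returns 200 once the model is loaded and the server accepts traffic.
for attempt in range(60):
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            if resp.status == 200:
                print("vLLM is up")
                break
    except (urllib.error.URLError, OSError):
        pass
    time.sleep(10)
else:
    raise SystemExit("Server never became healthy; check `docker compose logs`.")
```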
Step 6 — Use the API
vLLM exposes a fully OpenAI-compatible REST API. You can use it exactly like the OpenAI API — just point your client at the local endpoint.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<HOST>:<PORT>/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="<model_id>",
    messages=[
        {"role": "user", "content": "Explain transformer attention in one paragraph."}
    ],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
The main endpoints:

| Endpoint | Description |
|---|---|
| `GET /v1/models` | List loaded models |
| `POST /v1/chat/completions` | Chat (OpenAI-compatible) |
| `POST /v1/completions` | Text completion |
| `GET /metrics` | Prometheus-compatible metrics |
| `GET /health` | Health check |
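And since concurrent load is where vLLM earns its keep over Ollama, here's a minimal sketch that fires several requests in parallel with the async client (same placeholder host and model as above):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://<HOST>:<PORT>/v1", api_key="not-needed")

async def ask(i: int) -> str:
    resp = await client.chat.completions.create(
        model="<model_id>",
        messages=[{"role": "user", "content": f"Give me fact #{i} about GPUs."}],
    )
    return resp.choices[0].message.content or ""

async def main() -> None:
    # vLLM continuously batches these instead of queuing them one by one.
    answers = await asyncio.gather(*(ask(i) for i in range(8)))
    for a in answers:
        print(a[:80])

asyncio.run(main())
```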
Wrapping Up
This workflow gives you a fully self-contained LLM serving stack that:
- Runs entirely on-premise — no outbound network calls, ever
- Scales across GPUs via tensor parallelism
- Exposes a standard API your existing application code can use without changes
- Works where cloud is a non-starter — classified environments, healthcare, finance
The operational overhead is real: you own the updates, the image management, the model versioning, and the hardware. But in contexts where data sovereignty isn't negotiable, this is the tradeoff you're making, and vLLM makes it about as smooth as it can be without Kubernetes in the picture.
If your environment does have Kubernetes, the conversation shifts to Ray Serve or AIBrix. That's a different story.
Have questions or ran into something I didn't cover? Feel free to reach out.