Serving LLMs in Secure & Air Gapped Environments

09.04.2025 · 12 min read

#GenAI #MLOps #Infrastructure

There's a version of this job that's clean and simple: you pick a model, call an API, and ship. OpenAI, Anthropic, Google. They handle the infrastructure, you handle the application. Done.

Then there's the other version. The one where the client is a defense contractor, a hospital, a government agency, or any institution that operates under strict data sovereignty requirements. The one where the hardware lives in a room you need a badge to enter, and that hardware has never seen the internet. And never will.

That's what this post is about.

What is an Air Gapped Environment?

The term sounds more dramatic than it is. An air gapped machine is just one that physically can't talk to the internet: not firewall blocked, not rate limited, but physically disconnected. No network cable, no Wi-Fi card active, sometimes not even a USB port that isn't explicitly approved. The name comes from the literal gap of air between the machine and any external network.

To move data in or out you need physical media: a USB drive, a hard disk, sometimes a disc. Someone physically walks it through a checkpoint. It sounds theatrical until you realize this is standard operating procedure in defense, healthcare, and government, where the cost of a breach isn't a bad news cycle, it's something much worse.

The security argument is simple: you can't remotely compromise what you can't reach. The operational cost is equally simple, and much more annoying: you have to do everything yourself. Updates, packages, model weights, Docker images. All of it, manually, every time.

Prerequisites

This guide assumes you're already comfortable with Linux (you'll live in the terminal), Docker (Dockerfile + Compose, since the whole deployment is containerized), and LLM fundamentals: quantization, VRAM budgeting, tensor parallelism.

If any of those feel shaky, fix that first. Air gapped deployments don't simplify anything, they just stack on top.

The LLM Serving Framework Landscape

I've touched most of these. Here's where they actually land:

Framework	Best For	Key Trait
llama.cpp	CPU / edge / no GPU	Extreme portability, C++ core, minimal dependencies
Ollama	Local dev & prototyping	Single command simplicity, wraps llama.cpp
vLLM	Multi-user production (no K8s)	High throughput via PagedAttention, OpenAI compatible
NVIDIA Triton	NVIDIA native production	Deep NVIDIA ecosystem integration
TensorRT LLM	Max performance on NVIDIA	Requires model compilation, highest throughput ceiling
Ray Serve	Distributed / cluster based	Native Kubernetes + distributed compute
AIBrix	Serverless LLM on K8s	Kubernetes native, autoscaling

For bare metal GPU clusters without Kubernetes, vLLM is the one I'd pick. It's not perfect, but the tradeoffs land in the right places: you get real throughput, multi GPU support that actually works, and an OpenAI compatible API with zero extra configuration.

Ollama works great on a laptop. Put it in front of any meaningful concurrent load and you'll watch response times crater. Its memory model preallocates statically and just wasn't built for that use case. llama.cpp is the right answer when you're on CPU only hardware or constrained edge devices. TensorRT LLM can eke out better raw numbers, but you're compiling the model specifically for your hardware and locking yourself into NVIDIA's toolchain in a way that tends to become someone's full time job. Most teams don't have appetite for that.

Why vLLM?

The core reason vLLM wins in this setup is PagedAttention. The UC Berkeley team published the paper in 2023 and the insight is genuinely elegant once you see it.

Every LLM request needs a KV cache, a memory buffer that holds the attention state as tokens are generated. The problem with older systems is they allocate this buffer upfront, at the maximum possible size. Since you can't know in advance how long the output will be, you're often reserving two or three times the memory you'll actually use. PagedAttention borrows the OS trick of paging: allocate in small fixed size blocks, only as needed, and release them when you're done.

The practical outcome is that you fit a lot more concurrent requests into the same GPU. Batch sizes go up. The paper reports throughput at a given latency target 2–4× higher than FasterTransformer and Orca, the state-of-the-art systems it compared against. When you have eight users hitting the endpoint at once (it happens), the difference is very real.
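
To put rough numbers on the memory argument, here is a toy accounting sketch. It is not vLLM's allocator. The 16-token block size, the 2048-token maximum, and the request lengths are all assumptions picked for illustration.

```python
import math

BLOCK_TOKENS = 16      # assumed block size, in tokens
MAX_SEQ_TOKENS = 2048  # assumed maximum sequence length


def preallocated_slots(seq_lens, max_len=MAX_SEQ_TOKENS):
    """Token slots reserved when every request gets a max-size buffer upfront."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block=BLOCK_TOKENS):
    """Token slots reserved when each request only holds the blocks it has filled."""
    return sum(math.ceil(n / block) * block for n in seq_lens)


# Hypothetical final lengths (prompt + output) of eight concurrent requests.
lengths = [120, 340, 90, 1500, 610, 45, 800, 256]

upfront = preallocated_slots(lengths)
paged = paged_slots(lengths)
print(f"upfront: {upfront} slots, paged: {paged} slots")
```

With paging, the only waste is the unfilled tail of each request's last block, so the slots you get back can go to more concurrent requests.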

The Full Deployment Workflow

Step 1: Choose your model

First, verify your model is on the vLLM supported list. Not everything is. Some newer architectures lag behind by a few releases. Once that's confirmed, pull up the model card on Hugging Face and write down:

Parameter count → determines base VRAM requirements
Number of attention heads / shards → determines valid tensor-parallel-size values
Quantization options → bitsandbytes, AWQ, GPTQ, etc.
Architecture → some newer architectures require the latest vLLM image
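
For the VRAM line, a back-of-the-envelope check is usually enough at this stage. The sketch below only counts the weights. The bytes-per-parameter figures are approximations (quantized formats carry extra scale data), and the GPU sizes in the usage lines are hypothetical.

```python
# Approximate storage cost per parameter; quantized formats add some overhead.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(params_billion, dtype="fp16"):
    """Approximate memory for the weights alone, in GiB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 2**30


def headroom_gib(params_billion, dtype, gpus, vram_gib_per_gpu, utilization=0.95):
    """GiB left for KV cache and activations after the weights are loaded.

    A negative result means the weights alone do not fit in the budget.
    """
    budget = gpus * vram_gib_per_gpu * utilization
    return budget - weight_gib(params_billion, dtype)


# Hypothetical example: a 70B model on two 80 GiB GPUs.
print(f"fp16 weights: {weight_gib(70, 'fp16'):.1f} GiB")
print(f"fp16 headroom: {headroom_gib(70, 'fp16', 2, 80):.1f} GiB")
print(f"int4 headroom: {headroom_gib(70, 'int4', 2, 80):.1f} GiB")
```

The `utilization` default mirrors the `--gpu-memory-utilization 0.95` value used later in the Compose file. Whatever headroom is left is what the KV cache gets, so a small positive number means the model loads but serves very few concurrent requests.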

Step 2: Pull the vLLM Docker image

vLLM ships official images via Docker Hub. Run nvidia-smi before pulling: the image is built against a specific CUDA version, and your host driver has to support it. Getting this wrong is annoying to debug after the fact.

docker pull vllm/vllm-openai:latest

Note: When a new model architecture drops, you sometimes need the latest vLLM image to load it. If a model fails with a cryptic error on startup, pulling a newer image is the first thing to try before going any deeper.
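
Since you can't look anything up once you're inside, it helps to record the driver version and VRAM of every GPU before you commit to an image. This is a small sketch around nvidia-smi's CSV query mode. The helper names are mine, and the sample row is made up for illustration.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,driver_version,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV rows into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, driver, mem_mib = [field.strip() for field in line.split(",")]
        gpus.append(
            {"index": int(index), "name": name, "driver": driver, "memory_mib": int(mem_mib)}
        )
    return gpus


def query_gpus():
    """Run nvidia-smi on this host and parse its output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Hypothetical sample output, so the parser can be tried without a GPU.
SAMPLE = "0, NVIDIA A100-SXM4-80GB, 535.129.03, 81920\n"
print(parse_gpu_csv(SAMPLE))
```

On the target host, call `query_gpus()` and save the result next to your deployment notes.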

Step 3: Download the model

Do this on your staging machine while it still has internet. vLLM pulls the weights from Hugging Face into the mounted volume on first boot.

# docker-compose.staging.yml
services:
  llm-server:
    container_name: llm-server
    image: vllm/vllm-openai:latest
    network_mode: host
    ipc: host                        # full shared memory access for multi-GPU comms
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2               # adjust to your GPU count
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1     # GPU IDs to use
    volumes:
      - model-cache:/root/.cache/huggingface
    entrypoint: python3
    command: >
      -m vllm.entrypoints.openai.api_server
      --model <model_id>
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 2
      --quantization bitsandbytes
      --gpu-memory-utilization 0.95

volumes:
  model-cache:

Understanding `--tensor-parallel-size`

This tells vLLM how many GPUs to split the model across. The constraint is absolute: the model's attention head count must be evenly divisible by whatever number you pick.

Check the model card first. If you get it wrong, vLLM will catch it on startup and tell you why.
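
The divisibility rule is easy to check before you boot anything. This sketch assumes the model's config.json exposes the head count under `num_attention_heads`, which is the usual key in Hugging Face configs but not guaranteed for every architecture. vLLM may also apply further checks that this does not cover.

```python
import json


def heads_from_config(config_path):
    """Read the attention head count from a Hugging Face style config.json."""
    with open(config_path) as handle:
        return json.load(handle)["num_attention_heads"]


def valid_tp_sizes(num_attention_heads, num_gpus):
    """Tensor-parallel sizes up to num_gpus that divide the head count evenly."""
    return [tp for tp in range(1, num_gpus + 1) if num_attention_heads % tp == 0]


# Hypothetical model with 32 heads on an 8-GPU host.
print(valid_tp_sizes(32, 8))  # [1, 2, 4, 8]
```

If the value you planned to pass to `--tensor-parallel-size` is not in that list, pick the nearest one that is and adjust the GPU count in the Compose file to match.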

Step 4: Move to air gapped infrastructure

This is where most people lose time. You have two environments with no path between them, and you need to get both a large Docker image and several gigabytes of model weights from one side to the other.

Option A: Internal Docker registry (only if staging can reach a registry inside the secure network):

# Tag for your internal registry
docker tag vllm/vllm-openai:latest <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest

# Push to the internal registry (accessible within the air-gapped network)
docker push <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest

Option B: Tarball transfer (truly offline):

# Export the image on staging
docker save vllm/vllm-openai:latest | gzip > vllm-image.tar.gz

# Load it on the air-gapped machine after physical transfer
docker load < vllm-image.tar.gz

Move the model files:

# If the two environments share a network segment (staging to production)
rsync -avz /path/to/model-cache/ user@production-host:/path/to/model-cache/

# If truly offline: copy to external drive, physically walk it over, copy off
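
One thing I'd add on top of the steps above, and it is my own habit rather than anything vLLM requires: hash everything before it leaves staging and check it again on the other side. A corrupted shard on a drive you walked through a checkpoint is a long round trip to fix. The function names below are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large weight shards never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose digest differs."""
    actual = build_manifest(root)
    return sorted(name for name, digest in manifest.items() if actual.get(name) != digest)
```

Run `build_manifest` on the staging copy, carry the result across as a JSON file alongside the weights, and run `verify_manifest` on the air gapped host. An empty list means every file arrived intact.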

Step 5: Serve on air gapped infrastructure

The production Compose file is almost identical to staging. Three things change: the image source points at your internal registry, the volume mount switches to the local path where you dropped the model weights, and HF_HUB_OFFLINE=1 tells the Hugging Face libraries to load from that local cache instead of trying to reach the Hub.

# docker-compose.production.yml
services:
  llm-server:
    container_name: llm-server
    image: <REGISTRY_IP>:<REGISTRY_PORT>/vllm-openai:latest  # from internal registry
    network_mode: host
    ipc: host
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - HF_HUB_OFFLINE=1             # never try to reach Hugging Face; use the local cache only
    volumes:
      - /data/models:/root/.cache/huggingface  # local path to copied model files
    entrypoint: python3
    command: >
      -m vllm.entrypoints.openai.api_server
      --model <model_id>
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 2
      --quantization bitsandbytes
      --gpu-memory-utilization 0.95

Start it detached:

docker compose -f docker-compose.production.yml up -d

Watch startup logs:

docker compose -f docker-compose.production.yml logs -f llm-server

You'll watch it load each shard, log how much memory it's claiming per GPU, and eventually print the FastAPI startup line with the port it's bound to. That line is the one you're waiting for.
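
If you'd rather script the wait than stare at logs, polling the health endpoint works. This is a minimal sketch using only the standard library. The attempt count and delay are arbitrary defaults, and the helper names are mine.

```python
import time
import urllib.request


def is_healthy(base_url, timeout=5.0):
    """True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors (all OSError subclasses).
        return False


def wait_until_ready(base_url, attempts=60, delay=10.0, probe=is_healthy, sleep=time.sleep):
    """Poll the server until it is healthy or the attempts run out."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        sleep(delay)
    return False
```

Call it as `wait_until_ready("http://<HOST>:<PORT>")` with your own host and port. Large models can take several minutes to load, so size `attempts * delay` generously.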

Step 6: Use the API

vLLM's API is OpenAI compatible: same endpoints, same request shape. Point your client at the local host and port, and existing code needs no change beyond the base URL.

from openai import OpenAI

client = OpenAI(
    base_url="http://<HOST>:<PORT>/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="<model_id>",
    messages=[
        {"role": "user", "content": "Explain transformer attention in one paragraph."}
    ],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Everything you'd expect is there:

Endpoint	Description
`GET /v1/models`	List loaded models
`POST /v1/chat/completions`	Chat (OpenAI compatible)
`POST /v1/completions`	Text completion
`GET /metrics`	Prometheus compatible metrics
`GET /health`	Health check
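
The metrics endpoint returns the Prometheus text format, which is simple enough to read without a Prometheus server if you just want a quick look. The parser below is a sketch that assumes each sample line is `name{labels} value` with no trailing timestamp. The metric names in the sample are placeholders, not actual vLLM metric names.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {series: value}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Placeholder sample; real output comes from GET /metrics on your server.
SAMPLE = """\
# HELP example_requests_running Requests currently being processed.
# TYPE example_requests_running gauge
example_requests_running{model="demo"} 3.0
example_requests_waiting{model="demo"} 0.0
"""

print(parse_metrics(SAMPLE))
```

For anything beyond a spot check, point a Prometheus instance inside the secure network at the endpoint instead.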

Wrapping Up

This isn't the glamorous part of AI engineering. Nobody's writing blog posts about their internal Docker registry or how they spent an afternoon syncing 70GB of model weights onto a hard drive to walk it through a security checkpoint.

But some of the most important AI deployments are going to be exactly this: unglamorous, slow to set up, running on hardware that's never touched the internet and never will. And when you're in that situation, you need a stack that actually works under those constraints. vLLM does. You get real concurrent throughput, multi GPU tensor parallelism, and an API your application already speaks, all running entirely on premises.

You own the maintenance burden. There's no getting around that. But in environments where data sovereignty isn't a preference but a legal or contractual requirement, that burden is already yours regardless of what you deploy.

If your setup has Kubernetes, the conversation shifts to Ray Serve or AIBrix. That's a different post.

Have questions or run into something I didn't cover? Feel free to reach out.