LiteLLM: One Gateway for Every LLM You'll Ever Run
There's a version of every AI project that stays clean. One model, one SDK, one API key. You call it, it responds, you ship. Simple.
Then the project grows. The client needs Anthropic for reasoning tasks and OpenAI for the flows your team already has prompts for. Your On-Prem cluster runs vLLM because data can't leave the building. Someone spins up Ollama locally for prototyping. Now you have four different SDKs, four different auth schemes, four different ways to handle errors.
The code that's supposed to be your product is slowly becoming a compatibility layer for someone else's API.
That's the problem. LiteLLM is the fix.
The Mess That Builds Up
Every provider has its own SDK and its own call shape. Anthropic has one. OpenAI has another. Gemini has one too. Ollama speaks plain HTTP. vLLM speaks a dialect of the OpenAI spec. Each one has its own way to pass a message, its own error format, its own auth pattern.
Now imagine swapping one provider for another mid-project. You're not changing a model string. You're rewriting the integration. And every new model you add makes the next swap harder.
That's the maintenance trap.
What LiteLLM Actually Is
LiteLLM is an AI gateway. One API endpoint that your application always calls, and LiteLLM routes it to whichever model you configured: cloud, On-Prem, local, doesn't matter.
The API LiteLLM exposes is OpenAI-compatible. Your application doesn't need a new SDK.
It just changes the base_url to point at LiteLLM instead of api.openai.com, and from that
point on it doesn't know or care what's behind the gateway.
Swapping models becomes a config change, not a code change.
Why It's More Than Just Routing
The routing is the obvious part. The part that actually justifies running this in a team or organization is everything that sits on top of it.
When multiple people or teams use AI through the same application or platform, the questions come up fast:
- Which team is burning the most tokens?
- Who called the expensive model for something a cheap one could handle?
- How much has this project cost this month?
Without a gateway, you answer these questions by scraping billing dashboards across three providers and hoping someone remembered to tag their API keys by team.
With LiteLLM, every call is logged centrally. You define teams and users, assign them virtual keys, and set hard budget caps or rate limits. LiteLLM enforces those caps at the gateway. The call never reaches the model if the budget is gone.
That's not a developer convenience. That's what you need to give 50 people in an organization access to AI without losing control of spend.
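For a sense of what that looks like in practice, here's a minimal sketch of minting a budget-capped virtual key through the proxy's /key/generate endpoint. The team name, budget figures, and model list are illustrative; check the LiteLLM docs for the exact parameters your version supports.

import os
import requests

# Sketch: create a virtual key with a hard budget cap and a rate limit.
# Field names follow LiteLLM's /key/generate endpoint; the values are made up.
resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"},
    json={
        "team_id": "data-science",       # hypothetical team
        "models": ["anthropic-chat"],    # restrict to model_names from config.yaml
        "max_budget": 50.0,              # hard USD cap enforced at the gateway
        "budget_duration": "30d",        # budget window
        "rpm_limit": 60,                 # requests per minute
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["key"])  # the virtual key you hand to the team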
The Architecture
When you deploy LiteLLM via Docker, you get two things:
The proxy (backend): handles routing, authentication, logging, rate limiting, and budget enforcement. Every application in your stack talks to this.
The UI (frontend): shows you usage per team, per user, per model. Budget status. Cost breakdown. Logs for every call. The full list of models configured on the backend.
PostgreSQL stores all usage logs. Everything the dashboard shows you comes from there. In a secure environment, that database stays entirely inside your perimeter.
Configuring Your Models
Everything LiteLLM knows about your models lives in config.yaml.
# config.yaml
model_list:
  # Cloud providers
  - model_name: anthropic-chat
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: os.environ/ANTHROPIC_API_KEY

  # On-Prem: vLLM
  - model_name: llama3-vllm
    litellm_params:
      model: hosted_vllm/meta-llama/Llama-3.1-70B-Instruct
      api_base: http://vllm-host:8000/v1

litellm_settings:
  drop_params: true
  request_timeout: 600

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true
The model_name is what your application sends in the API call. The litellm_params block is what LiteLLM actually calls behind the scenes. Your app never needs to know the difference.
On-Prem: vLLM
For vLLM, the correct prefix is hosted_vllm/, not openai/ as you might assume.
LiteLLM treats hosted_vllm differently from a generic OpenAI-compatible endpoint.
The model string after the prefix must match exactly what you passed to --model when starting vLLM. Get it wrong and the first request routed to that entry comes back as a 404.
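One way to avoid the mismatch entirely is to ask the vLLM server what it's serving before you write the config entry. A small sketch, assuming the server from the config above is reachable at http://vllm-host:8000:

import requests

# vLLM's OpenAI-compatible server lists the model id it serves at /v1/models.
# The id it reports is exactly what goes after hosted_vllm/ in config.yaml.
models = requests.get("http://vllm-host:8000/v1/models", timeout=10).json()
for m in models["data"]:
    print(m["id"])  # e.g. meta-llama/Llama-3.1-70B-Instruct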
Each vLLM server you run is just another entry in model_list. The gateway handles the rest.
Calling the Gateway
Once the proxy is up, your application points at it instead of the provider directly.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-your-virtual-key"
)

response = client.chat.completions.create(
    model="anthropic-chat",  # model_name from config.yaml
    messages=[{"role": "user", "content": "Summarize this document."}]
)
Change the model string to route to a different provider. That's it. Streaming, tool use,
and everything else the OpenAI SDK supports works the same way with no other changes needed.
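For example, streaming against the On-Prem model is the same client and the same call shape, just a different model string:

# Same client as above, now routed to the vLLM entry and streamed.
stream = client.chat.completions.create(
    model="llama3-vllm",  # the On-Prem entry from config.yaml
    messages=[{"role": "user", "content": "Summarize this document."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)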
Why LiteLLM Is a High-Value Target
In March 2026, versions 1.82.7 and 1.82.8 of the litellm PyPI package were compromised.
A threat actor known as TeamPCP got into the maintainer's account by first poisoning
Trivy, an open source security scanner used in
LiteLLM's CI/CD pipeline. Once the scanner was compromised, they had the ability to publish
under the real package name. The malicious versions were live for roughly three hours and
downloaded over 119,000 times before PyPI pulled them. The payload harvested SSH keys, cloud
credentials, and environment variables on every Python startup.
The reason LiteLLM is worth targeting is exactly what makes it useful. It holds API keys for every provider you've connected. Compromise one LiteLLM deployment and you get Anthropic, OpenAI, Gemini, and your On-Prem credentials in one shot. That's a far more valuable target than going after any individual provider account.
A fixed version (v1.83.0) was released with a rebuilt CI/CD pipeline. If you're running
LiteLLM, make sure you're not on 1.82.7 or 1.82.8. And treat the LITELLM_MASTER_KEY and
your provider API keys with the same seriousness as any production secret.
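If you want a quick guard in code rather than a mental note, something like this works as a one-off check (not a substitute for pinning and scanning your dependencies):

from importlib.metadata import version

# Fail fast if the environment still has one of the compromised releases.
installed = version("litellm")
assert installed not in {"1.82.7", "1.82.8"}, f"litellm {installed} is compromised; upgrade to >= 1.83.0"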
There's a full breakdown from BleepingComputer, and an official PyPI incident report, if you want the details.
Observability: Prometheus + Grafana
LiteLLM exposes Prometheus metrics at /metrics out of the box. No extra configuration.
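A quick way to see what's there before wiring up Prometheus; the metric names below are illustrative, and the exact set depends on your LiteLLM version and how metrics are enabled on your deployment:

import requests

# Pull the raw Prometheus exposition text from the proxy and peek at the LiteLLM metrics.
text = requests.get("http://localhost:4000/metrics", timeout=10).text
litellm_metrics = [line for line in text.splitlines() if line.startswith("litellm_")]
print("\n".join(litellm_metrics[:10]))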
Add Prometheus and Grafana to the same Compose stack.
vLLM ships with a Prometheus exporter built in. GPU utilization, queue depth, token throughput, request latency, and KV cache usage per running instance. When you're running multiple vLLM servers behind LiteLLM, you scrape each one separately and aggregate in Grafana.
For ready-made dashboards, the vLLM Monitoring v2 dashboard and the LiteLLM Dashboard on Grafana are the ones to import. Drop them in, point them at your Prometheus datasource, and you get full vLLM and LiteLLM coverage out of the box.
The Full Picture
[Diagram: applications send OpenAI-format calls to the LiteLLM proxy, which routes them to cloud and On-Prem models, logs usage to PostgreSQL, and exposes metrics scraped by Prometheus and visualized in Grafana.]
Wrapping Up
You're going to end up talking to multiple LLMs. The only question is whether your codebase handles that directly or a gateway does.
LiteLLM gives you one endpoint, one config file, and one place to see what everything costs. Cloud and On-Prem models live behind the same interface. The observability stack clips on cleanly with Prometheus and Grafana. Set it up once and it stays out of your way.
Something I missed, or a setup that didn't work the way I described? Feel free to reach out.