When you’re dealing with a mixed bag of hardware, perhaps some RTX PROs and AMD GPUs, the challenge isn't just running the models; it's making them work together as a single, resilient API.
In this post, we’ll walk through how we built a heterogeneous AI rack using vLLM for the engine and HAProxy to bridge the gap between team Green and team Red.
The Architecture: Heterogeneous Hardware, Unified API
The core problem with mixing AMD and Nvidia is that you cannot (currently) run a single model instance sharded across both. You need to run distinct clusters and route traffic intelligently.
Our setup consists of:
Nvidia Clusters: Running the CUDA-optimized version of vLLM.
AMD Clusters: Running the ROCm-optimized version of vLLM.
HAProxy Layer: Acting as the "Traffic Controller" to balance requests based on health and capacity.
Step 1: Preparing the GPU Nodes
Since vLLM handles Nvidia and AMD differently, we use Docker to abstract the environment complexities.
For Nvidia (CUDA)
We use the standard vLLM image. On each Nvidia node, ensure you have the latest Nvidia Container Toolkit installed.
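As a sketch, a single Nvidia worker can be started like this (the model name and host port are our choices for this rack; adjust them to your setup):

```shell
# Launch an OpenAI-compatible vLLM server on the Nvidia node.
# --gpus all requires the Nvidia Container Toolkit; --ipc=host is
# recommended by vLLM for shared-memory tensor transfers.
docker run -d --name vllm-nvidia \
  --gpus all \
  --ipc=host \
  -p 8001:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```

The host-side port 8001 is what HAProxy will later point at for the Nvidia backend.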
For AMD (ROCm)
AMD requires the ROCm-specific build. We use the official rocm/vllm image, which is optimized for the Instinct and RX series.
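The AMD launch looks similar, but instead of `--gpus all` the container needs the ROCm device nodes passed through. This is a sketch following the ROCm Docker conventions; the exact entrypoint can vary between rocm/vllm tags, so you may need to invoke `vllm serve` explicitly as shown here:

```shell
# Launch vLLM on an AMD node via the ROCm image.
# /dev/kfd and /dev/dri expose the GPUs; --group-add video grants access.
docker run -d --name vllm-amd \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  --ipc=host \
  -p 8002:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  rocm/vllm:latest \
  vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```

Host port 8002 becomes the AMD backend address in HAProxy.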
Step 2: Load Balancing with HAProxy
Now that we have two independent vLLM endpoints (port 8001 for Nvidia and 8002 for AMD), we need a single entry point. HAProxy is perfect here because it’s incredibly fast and supports HTTP health checks, which are essential for AI workloads: vLLM nodes can take minutes to load a model, and you don’t want traffic hitting a node before it’s ready.
The HAProxy Config
We want the Least Connections algorithm (leastconn). Why? Because LLM requests have highly variable processing times. A simple Round Robin might send a 4096-token request to a worker already struggling with a massive prompt.
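A minimal config tying it together might look like the following (backend IPs are placeholders for our rack; the long client/server timeouts are deliberate, since LLM responses can stream for minutes):

```
# /etc/haproxy/haproxy.cfg (sketch)
defaults
    mode http
    timeout connect 5s
    timeout client  300s
    timeout server  300s

frontend llm_api
    bind *:8000
    default_backend vllm_workers

backend vllm_workers
    balance leastconn
    option httpchk GET /health
    server nvidia-1 10.0.0.11:8001 check
    server amd-1    10.0.0.12:8002 check
```

With this in place, clients talk only to port 8000; HAProxy picks whichever healthy worker has the fewest in-flight requests.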
Step 3: Handling Heterogeneous Latency
In our rack, the RTX Pro cards have significantly more VRAM than the AMD cards, which allows for larger batch sizes. We use the weight parameter in HAProxy to send more concurrent traffic to the RTX nodes while keeping the AMD nodes responsive for lower-latency tasks.
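In the backend definition, that just means giving each server line a weight. The 2:1 ratio below is illustrative, not a tuned value; pick weights that match your actual batch-size headroom:

```
backend vllm_workers
    balance leastconn
    option httpchk GET /health
    # ~2x the concurrent traffic to the larger-VRAM RTX node
    server nvidia-1 10.0.0.11:8001 check weight 20
    server amd-1    10.0.0.12:8002 check weight 10
```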
Key Learnings from the Rack:
Unified API: By using vLLM’s OpenAI-compatible server on both clusters, the frontend application doesn't even know it’s switching between CUDA and ROCm.
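From the client’s point of view, there is one URL and one request shape. The sketch below builds a standard OpenAI-style chat payload; the HAProxy address (`http://rack-lb:8000`) and the model name are placeholders for whatever your rack exposes:

```python
import json

# One payload works for both clusters, because both run
# vLLM's OpenAI-compatible server behind the same HAProxy frontend.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # example model name
    "messages": [{"role": "user", "content": "Hello from the rack"}],
    "max_tokens": 64,
}
body = json.dumps(payload)

# POST `body` to http://rack-lb:8000/v1/chat/completions with any
# HTTP client; whether CUDA or ROCm serves it is invisible to the caller.
print(sorted(payload))
```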
Failover: If an AMD driver crashes, HAProxy’s health check (/health) automatically pulls that node out of the rotation within 5 seconds, and the Nvidia cluster takes the full load.
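The roughly five-second ejection window comes from the check timing on the server line. These intervals are the ones we’d reach for, not mandated values; with a 2-second interval and two consecutive failures required, a dead node is pulled in about 4–5 seconds:

```
# Check /health every 2s; eject after 2 failures, readmit after 3 successes
server amd-1 10.0.0.12:8002 check inter 2s fall 2 rise 3
```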
Monitoring: We use the HAProxy Stats page to visualize which cluster is bottlenecked in real-time.
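Enabling the stats page is a small addition to the same config; the port and refresh interval here are our choices, and in production you’d want to add authentication:

```
# Built-in HAProxy dashboard showing per-server sessions and health
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 5s
```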
We also tested LiteLLM but didn't find it stable enough for our needs.
Why "Build" instead of "Buy"?
Building this locally allowed us to keep our data private: it never leaves our own infrastructure, which was the most important factor for us. Mixing hardware also means we aren't at the mercy of a single vendor’s supply chain. If we find a deal on new AMD cards, we plug them in; if we get a shipment of RTX Pros, we scale the Nvidia backend.
Antti Koskela
Youlearn it OY