Introduction
Public cloud services are often the fastest and easiest option, but for us they were not the whole story. We wanted an environment where performance, cost control, security, and operational visibility remain firmly in our own hands. That is why we decided to build our own data center infrastructure, based on open-source technologies and a modern hyperconverged architecture.
In this post, we explain how we built a Proxmox HA cluster backed by a Ceph storage system, and why this combination became the foundation of our platform.
Goals and Design Principles
At the start of the project, we defined a clear set of goals:
- High availability (HA) with no single points of failure
- Horizontal scalability, allowing capacity to grow node by node
- Strong performance for demanding workloads
- Full control over our data
- Open-source technologies without vendor lock-in
Based on these requirements, Proxmox and Ceph quickly emerged as the strongest candidates.
Our Own Rack in a Data Center
The environment is built in our own 19-inch rack located in a colocation data center. We own and manage all of the hardware ourselves, while the facility provides professional data center services, including:
- redundant power delivery
- controlled cooling
- redundant network connectivity
- physical security
This setup combines the flexibility of owning our hardware with the reliability of a professional data center.
Hyperconverged Architecture
We chose a hyperconverged model, where:
- each server node provides both compute and storage resources
- no separate SAN or NAS system is required
- performance and capacity scale together as new nodes are added
This approach simplifies the overall architecture and removes the dedicated storage array, a traditional single point of failure.
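As a rough illustration of how compute and storage grow together, the following Python sketch totals the resources of a cluster as identical nodes are added. The node specification (32 cores, 256 GiB RAM, 16 TB NVMe) is a hypothetical example chosen for easy arithmetic, not a description of the actual hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One hyperconverged server; all figures are hypothetical."""
    cores: int
    ram_gib: int
    nvme_tb: int


def cluster_totals(nodes):
    """Sum compute and raw storage over every node in the cluster."""
    return {
        "cores": sum(n.cores for n in nodes),
        "ram_gib": sum(n.ram_gib for n in nodes),
        "raw_nvme_tb": sum(n.nvme_tb for n in nodes),
    }


example_node = Node(cores=32, ram_gib=256, nvme_tb=16)
for count in (3, 4, 5):
    # Each added node raises cores, RAM, and raw storage in one step.
    print(count, cluster_totals([example_node] * count))
```

The storage figure here is raw capacity; the usable amount is lower once replication is taken into account.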
Proxmox VE – The Virtualization Layer
Virtualization is built on Proxmox Virtual Environment, which uses the KVM hypervisor under the hood. Proxmox provides:
- centralized cluster management
- built-in high availability for virtual machines
- live migration between nodes
- native integration with Ceph storage
Virtual machines can be migrated between nodes without downtime, and in case of hardware failure, Proxmox HA automatically restarts workloads on healthy nodes.
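Automatic restarts depend on the cluster keeping quorum: Proxmox clusters use corosync, which needs a strict majority of votes before the surviving nodes may act. The sketch below shows that arithmetic in Python, assuming one vote per node and no external quorum device; the function names are illustrative.

```python
def quorum_votes(total_nodes: int) -> int:
    """Votes needed for a strict majority, assuming one vote per node."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """Nodes that can fail while the remainder still holds quorum."""
    return total_nodes - quorum_votes(total_nodes)


for n in range(2, 6):
    print(f"{n} nodes: quorum={quorum_votes(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under these assumptions a two-node cluster tolerates no failures, which is why three nodes is the practical minimum for HA, and a fourth node does not raise the number of tolerated failures beyond one.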
Ceph – Distributed Storage at Scale
Storage is provided by Ceph, a fully distributed storage system running across the Proxmox cluster.
Key benefits of Ceph in our setup include:
- data replication across multiple nodes
- no dependency on a single disk or storage controller
- excellent performance using NVMe drives
- seamless integration with Proxmox
If a node or disk fails, the remaining replicas keep the data available and the storage cluster continues to serve I/O.
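Replication has a capacity cost and a defined failure limit. The following Python sketch illustrates both, assuming a replicated pool with size=3 and min_size=2 (Ceph's defaults for replicated pools; the settings actually used are not stated here) and a hypothetical 48 TB of raw capacity.

```python
def usable_capacity_tb(raw_tb: float, size: int = 3) -> float:
    """Approximate usable capacity of a replicated pool (ignores overhead)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return raw_tb / size


def pool_accepts_io(size: int, min_size: int, replicas_lost: int) -> bool:
    """A placement group serves I/O while surviving replicas >= min_size."""
    return size - replicas_lost >= min_size


# Hypothetical cluster: 48 TB raw, three copies of every object.
print(usable_capacity_tb(48))
for lost in range(3):
    print(lost, pool_accepts_io(size=3, min_size=2, replicas_lost=lost))
```

With these assumed settings, losing one replica leaves the data readable and writable, while losing two blocks I/O on the affected placement groups until a replica is restored. In practice the pool should also stay well below full so that Ceph has room to recover.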
High Availability in Practice
By combining Proxmox HA with Ceph replication, we achieve an environment where:
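- the failure of a single server does not take services offline: affected virtual machines are restarted automatically on the remaining nodes
- planned maintenance can be performed without downtime by live-migrating virtual machines off a node first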
- unplanned outages are minimized
This level of resilience is essential for business-critical services that must be available around the clock.
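Restarting workloads on healthy nodes only works if those nodes have spare capacity. The Python sketch below is a simplified headroom check for memory: it asks whether all VM memory would still fit after the largest node fails. The node sizes, VM sizes, and the 32 GiB per-node reserve for the host and Ceph are assumptions for illustration, and the check compares totals only, ignoring CPU and how VMs pack onto individual nodes.

```python
def survives_node_loss(node_ram_gib, vm_ram_gib, reserve_gib=32):
    """Return True if all VM memory still fits after the largest node fails.

    node_ram_gib: RAM installed in each node.
    vm_ram_gib: RAM allocated to each VM.
    reserve_gib: per-node RAM held back for the host and Ceph (assumed value).
    """
    if len(node_ram_gib) < 2:
        return False  # nowhere to restart workloads
    remaining = sorted(node_ram_gib)[:-1]  # drop the largest node
    available = sum(ram - reserve_gib for ram in remaining)
    return sum(vm_ram_gib) <= available


nodes = [256, 256, 256]  # hypothetical three-node cluster
print(survives_node_loss(nodes, [64] * 6))  # 384 GiB needed, 448 GiB available
print(survives_node_loss(nodes, [64] * 8))  # 512 GiB needed, 448 GiB available
```

A check like this is a planning aid rather than a guarantee, but it makes the cost of "N+1" visible: one node's worth of capacity has to stay unused.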
GPU Acceleration and Local LLM Workloads
As part of the overall platform, the cluster is also equipped with GPU acceleration, which we use to run local large language models (LLMs). This allows us to deliver AI-powered services without relying on external cloud providers.
The GPUs are integrated into the Proxmox environment so that:
- virtual machines can access GPU resources directly (PCIe passthrough or vGPU)
- compute resources can be allocated flexibly across workloads
- AI workloads run close to the data, reducing latency
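When deciding how much GPU memory a model needs, a common first approximation is that the weights alone take the parameter count multiplied by the bytes per parameter. The Python sketch below applies that rule; the 7-billion-parameter model and the precisions listed are illustrative, and the estimate leaves out the KV cache, activations, and framework overhead, so real requirements are higher.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """Memory for model weights only, in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


# Hypothetical 7-billion-parameter model at different precisions.
for precision in BYTES_PER_PARAM:
    print(precision, weight_memory_gb(7, precision))
```

An estimate like this helps decide whether a model fits on a single passed-through GPU or needs quantization or several GPUs.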
Running LLMs locally provides several advantages:
- improved data privacy and compliance
- predictable and controlled costs
- the ability to fine-tune and optimize models for our specific use cases
This makes our data center not only highly available and scalable, but also AI-ready by design.
Conclusion
By building our own data center platform on a Proxmox HA cluster with Ceph storage and GPU acceleration, we have created a robust foundation for business-critical services.
The environment is:
- scalable
- fault-tolerant
- fully under our control
Most importantly, it is designed to grow with our needs, both today and in the future.
If you would like to learn more about the architecture or our real-world experience with Proxmox, Ceph, and local AI workloads, feel free to get in touch.
Antti Koskela
Youlearn it oy