From Empty Floor to GPU Autoscaling: Building a Modern Datacenter

The Hardware Stack

The interactive story above gives you the visual overview, now let's go deeper into the technical decisions behind each layer.

GPU Selection

Not all GPUs are created equal, and the choice depends entirely on your workload profile.

Accelerator

Best For

Memory

Interconnect

NVIDIA H100

General training, inference

80 GB HBM3

NVLink 4 — 900 GB/s

NVIDIA H200

Memory-bound training + long-context inference

141 GB HBM3e — 4.8 TB/s

NVLink 4 — 900 GB/s

NVIDIA B200

Frontier training and inference throughput

192 GB HBM3e — 8 TB/s

NVLink 5 — 1.8 TB/s

NVIDIA GB200 NVL72

Rack-scale training and inference as one unit

72 GPUs, 13.5 TB HBM3e

NVLink fabric — 130 TB/s aggregate

AMD MI325X

Inference on large models

256 GB HBM3e — 6 TB/s

Infinity Fabric

AMD MI355X

Frontier training, FP4/FP6 inference

288 GB HBM3e — 8 TB/s

Infinity Fabric

The key is matching architecture to workload, not chasing the newest SKU for its own sake. For large-batch training at frontier scale, Blackwell-class GB200 NVL72 racks treat 72 GPUs as a single NVLink domain and collapse what used to require multi-rack InfiniBand fabrics into a single liquid-cooled cabinet. For memory-bound inference on very long contexts, the H200's 141 GB of HBM3e lets models fit on-card that would otherwise force tensor parallelism across multiple GPUs. On the AMD side, the MI355X's 288 GB of HBM3e and native FP4/FP6 support make it a strong inference-per-dollar option for customers who can tolerate a younger software stack in exchange for price-performance.

Networking: The Hidden Bottleneck

In a GPU datacenter, the network is often the limiting factor, not the GPUs themselves. Distributed training workloads need to synchronize gradients across dozens or hundreds of GPUs, and any network bottleneck directly impacts training time.

Our Network Architecture

Spine-leaf topology with 400GbE Ethernet and RDMA over Converged Ethernet (RoCE) for GPU-to-GPU communication within racks. For larger training clusters, we deploy InfiniBand NDR (400Gb/s) with adaptive routing to minimize congestion. Every rack has redundant top-of-rack switches, and the spine layer is designed for non-blocking throughput.

Power and Cooling

A single 8-GPU Hopper server draws 6 to 10 kW. A Blackwell-class HGX server pushes past 14 kW, and a full GB200 NVL72 rack lands at 120 kW, against 5 to 8 kW for a traditional compute rack. This fundamentally changes the datacenter design. Air cooling runs out of headroom past about 30 kW per rack; liquid cooling is no longer a nice-to-have, it is the only way to land modern AI hardware.

💧

Direct Liquid Cooling

Cold plates on each GPU with facility-level coolant distribution. Eliminates the acoustic nightmare of thousands of high-RPM fans.

⚡

PUE 1.1-1.2

Power Usage Effectiveness far below the 1.5+ industry average for air-cooled facilities. Lower energy costs, smaller carbon footprint.

📐

High-Density Racks

Direct-to-chip cold plates carry heat out of the rack rather than into the room. This is what enables 120 kW per rack in a GB200 NVL72 footprint, roughly 15x the density of a traditional compute rack, with far lower facility PUE.

Kubernetes for GPU Workloads

Standard Kubernetes treats compute as fungible, any CPU core is the same as any other. GPUs break this assumption completely. A workload that needs 8 GPUs with NVLink interconnects can't be scattered across random nodes.

The NVIDIA GPU Operator

We deploy the NVIDIA GPU Operator across all GPU nodes. It manages the full lifecycle: drivers, container runtimes, device plugins, and monitoring exporters. When a new GPU node joins the cluster, the operator automatically installs drivers, configures the container runtime for GPU passthrough, and registers GPU resources with the Kubernetes scheduler.

GPU Operator Device Plugin DCGM Exporter Container Runtime Node Feature Discovery

Multi-Instance GPU (MIG)

Not every workload needs a full GPU. Small inference models might only need a fraction of an H100's capacity. MIG lets us partition a single physical GPU into up to seven isolated instances, each with dedicated memory and compute resources.

Instead of a small model using 10% of a GPU's compute, MIG lets us pack multiple models onto the same card, dramatically improving utilization and cost efficiency.

Topology-Aware Scheduling

For distributed training jobs that need multiple GPUs, placement matters enormously. Eight GPUs on a single node connected via NVLink (900 GB/s) will train 2-3x faster than eight GPUs spread across nodes communicating over even 400GbE. Our scheduler uses custom topology constraints and NUMA-aware scheduling to ensure multi-GPU workloads land on the most efficient hardware configuration.

Inference at Scale

Training gets all the headlines, but inference is where most GPU cycles are actually spent. Running models in production at low latency and high throughput requires its own set of engineering challenges.

Model Serving

We use a combination of serving frameworks depending on the workload:

vLLM

PagedAttention-minnehåndtering øker LLM-throughput med 2-4x sammenlignet med naive implementasjoner. Ideell for høykonkurrerende chat- og completion-API-er.

TensorRT-LLM

Kompilerer modeller til optimaliserte kjøreplaner som presser ut hver eneste FLOP av maskinvaren. Best for latency-kritiske applikasjoner.

Triton Inference Server

Handles model ensemble pipelines and provides a unified API layer. Supports dynamic batching and model versioning out of the box.

Batching Strategies

The key to efficient GPU inference is keeping the hardware saturated. Dynamic batching collects incoming requests and groups them before sending them to the GPU. Continuous batching goes further, as individual requests in a batch complete, new requests are immediately inserted without waiting for the longest request to finish.

Continuous Batching Impact

For LLM inference with variable output lengths, continuous batching can double throughput compared to static batching. Requests that generate short outputs don't block the GPU from processing new work.

Latency vs. Throughput

Every inference deployment involves a tradeoff. We configure this per-workload:

Workload Type

Batch Size

Priority

Use Case

Real-time chat

Small (1-4)

Latency

Customer-facing AI assistants

API endpoints

Medium (8-32)

Balanced

Developer APIs, search

Bulk processing

Large (64+)

Throughput

Document analysis, batch embeddings

Custom Models: LoRA Fine-Tuning & Data Pipelines

Raw inference is only half the story. Most organizations need models tailored to their specific domain, trained on their data, speaking their language, understanding their context. That's where our fine-tuning platform comes in.

LoRA

Parameter-efficient tuning

0.1%

Parameters modified

10x

Faster than full fine-tune

LoRA Fine-Tuning at Scale

Full fine-tuning of a 70B parameter model requires hundreds of GBs of GPU memory and days of compute. LoRA (Low-Rank Adaptation) changes the game, by training small adapter matrices instead of modifying all model weights, we achieve comparable quality while modifying less than 0.1% of parameters.

How LoRA Works

Instead of updating all billions of parameters, LoRA freezes the pre-trained model and injects small trainable matrices into each transformer layer. The result: fine-tuning that takes hours instead of days, uses a fraction of the GPU memory, and produces adapters that are just megabytes, not gigabytes. You can hot-swap adapters at inference time to serve multiple specialized models from a single base.

Our platform handles the full LoRA workflow: dataset preparation, training job orchestration, adapter management, and deployment. Upload your data, configure your hyperparameters, and we handle the rest, scheduling training across available GPUs, monitoring loss curves, and automatically deploying the best checkpoint.

🔧

Adapter Management

Version, A/B test, and hot-swap LoRA adapters without reloading the base model. Serve dozens of specialized models from a single GPU.

📊

Training Dashboard

Real-time loss curves, evaluation metrics, and resource utilization. Automatic early stopping and checkpoint management.

Data Pipelines: Scraping & Processing

Fine-tuning is only as good as your data. Our platform includes a complete data pipeline for building high-quality training datasets, from web scraping and document ingestion to cleaning, deduplication, and formatting.

🌐

Web Scraping

Automated, scheduled scraping with intelligent content extraction. Handle JavaScript-rendered pages, rate limiting, and deduplication at scale.

🧹

Data Cleaning

PII removal, quality filtering, deduplication, and format conversion. Transform raw scraped data into clean training-ready datasets.

📝

Instruction Tuning

Generate instruction/response pairs from your documents. Build chat-optimized datasets from knowledge bases, FAQs, and documentation.

Enterprise Data Integrations

Fine-tuning and RAG are only useful if you can actually get to your data. Our platform connects to the systems where your business knowledge lives, no manual exports, no CSV uploads, no copy-pasting documents.

📁

SharePoint & OneDrive

Ingest documents, wikis, and team sites directly. Automatic sync keeps your training data and RAG indexes up to date as content changes.

🗄️

Databases

Connect to PostgreSQL, MySQL, MSSQL, MongoDB, and more. Query structured data for RAG context or extract records for fine-tuning datasets.

📄

Document Processing

Parse PDFs, Word docs, Excel spreadsheets, PowerPoint, HTML, and Markdown. OCR for scanned documents. Structured extraction from any format.

🔗

File Servers & Network Shares

SMB/CIFS and NFS mount support. Crawl file servers on a schedule, index new documents automatically, and respect existing folder permissions.

Supported Integrations

SharePoint, OneDrive, Google Drive, Confluence, Notion, Jira, Slack, Teams, S3, Azure Blob, GCS, SFTP, REST APIs, GraphQL endpoints, IMAP email — and custom connectors for anything we don't support out of the box. Data stays in your environment; we bring the compute to the data, not the other way around.

SharePoint OneDrive Google Drive Confluence Notion PostgreSQL MongoDB S3 REST APIs PDF Word/Excel OCR

Chat & Conversation Handling

Once your model is fine-tuned, it needs a production-grade serving layer that handles the complexity of real-world conversations.

Our Chat Platform

Multi-turn conversation management with context windowing, memory summarization, and retrieval-augmented generation (RAG). Built-in guardrails for content safety, token budgeting, and response quality monitoring. Deploy as an API, embed as a widget, or integrate with your existing tools.

RBAC: Granular Access Control

In any multi-team environment, not everyone should have access to everything. Our platform provides fine-grained role-based access control across two critical dimensions: models and MCP tools.

🔐

Model-Level RBAC

Control which teams can access which models and LoRA adapters. Restrict expensive GPU models to production workloads, give dev teams access to smaller models, and ensure sensitive fine-tuned models are only available to authorized users.

🧩

MCP Tool Permissions

Define which MCP tools each role can invoke. Limit code execution tools to engineering, restrict data access tools by department, and audit every tool call. Zero-trust by default, no tool access without explicit permission.

Why This Matters

Without RBAC, every user with API access can run any model and invoke any tool — including tools that access databases, execute code, or call external services. Our permission system ensures that a marketing intern's chatbot can't accidentally invoke a production database tool or burn through your most expensive GPU allocation.

LoRA Fine-Tuning Web Scraping Data Pipelines Chat API RBAC RAG Guardrails MCP Tools

The full loop, scrape domain data, clean and format it, fine-tune a LoRA adapter, deploy it behind a chat API with RAG and guardrails, runs entirely on our platform. No stitching together five different tools from five different vendors.

Autoscaling in Practice

The interactive story showed autoscaling at a high level, demand rises, new pods spin up, new nodes provision. In practice, GPU autoscaling is significantly more complex than CPU autoscaling.

Custom GPU Metrics

The standard Kubernetes HPA scales on CPU and memory, useless for GPU workloads. We deploy custom metrics adapters that expose GPU-specific signals:

Queue Depth GPU Utilization % Memory Pressure P99 Latency Tokens/Second

The HPA watches these metrics and scales the number of inference pods accordingly. When demand drops, pods scale down just as aggressively.

Cluster Autoscaling

When the HPA adds more pods than the current cluster can handle, the Cluster Autoscaler provisions new GPU nodes from our warm pool. The entire process, IPMI boot, OS install via PXE, GPU Operator setup, cluster join, takes minutes, not hours.

Cost Optimization

Smart Scaling Strategy

GPU nodes are expensive. We use predictive scaling based on historical patterns (many AI workloads have predictable daily cycles) combined with reactive scaling for unexpected spikes. Bin packing algorithms consolidate workloads onto fewer nodes during off-peak hours, allowing empty nodes to power down.

Security and Zero Trust

GPU infrastructure handles some of the most valuable data in any organization, proprietary models, training data, and inference inputs that may contain sensitive customer information. Security can't be an afterthought.

Network Isolation

Every tenant's GPU workloads run in isolated network namespaces with Kubernetes NetworkPolicies enforcing strict ingress/egress rules. GPU-to-GPU training traffic is segregated on dedicated VLANs. The control plane communicates exclusively over mutual TLS.

Encrypted Inference

Data in transit is encrypted end-to-end, from the client request through the load balancer, into the inference pod, and back. For customers with sensitive workloads we also offer confidential computing, where model weights and inference data remain encrypted in GPU memory via the hardware TEE features in recent NVIDIA and AMD accelerators. This is aligned with GDPR data-minimisation requirements out of the box, and full per-action audit logging is available for customers who need it for their own compliance programmes.

The Result: Infrastructure That Just Works

When all these layers work together, physical infrastructure, networking, Kubernetes orchestration, GPU-aware scheduling, intelligent autoscaling, and zero-trust security, you get a platform where AI teams focus on building models instead of managing infrastructure.

Deploy a model, set your latency and throughput targets, and the platform handles the rest: scaling GPUs up and down with demand, routing traffic for optimal performance, and keeping everything secure and compliant.

That's what ZeroSubnet delivers. Not just GPUs in a rack, but a complete, managed platform for AI infrastructure at any scale. If you're building AI applications and spending too much time on infrastructure, let's talk.