The Hardware Stack
The interactive story above gives you the visual overview, now let's go deeper into the technical decisions behind each layer.
GPU Selection
Not all GPUs are created equal, and the choice depends entirely on your workload profile.
Accelerator
Best For
Memory
Interconnect
NVIDIA H100
General training, inference
80 GB HBM3
NVLink 4 — 900 GB/s
NVIDIA H200
Memory-bound training + long-context inference
141 GB HBM3e — 4.8 TB/s
NVLink 4 — 900 GB/s
NVIDIA B200
Frontier training and inference throughput
192 GB HBM3e — 8 TB/s
NVLink 5 — 1.8 TB/s
NVIDIA GB200 NVL72
Rack-scale training and inference as one unit
72 GPUs, 13.5 TB HBM3e
NVLink fabric — 130 TB/s aggregate
AMD MI325X
Inference on large models
256 GB HBM3e — 6 TB/s
Infinity Fabric
AMD MI355X
Frontier training, FP4/FP6 inference
288 GB HBM3e — 8 TB/s
Infinity Fabric
The key is matching architecture to workload, not chasing the newest SKU for its own sake. For large-batch training at frontier scale, Blackwell-class GB200 NVL72 racks treat 72 GPUs as a single NVLink domain and collapse what used to require multi-rack InfiniBand fabrics into a single liquid-cooled cabinet. For memory-bound inference on very long contexts, the H200's 141 GB of HBM3e lets models fit on-card that would otherwise force tensor parallelism across multiple GPUs. On the AMD side, the MI355X's 288 GB of HBM3e and native FP4/FP6 support make it a strong inference-per-dollar option for customers who can tolerate a younger software stack in exchange for price-performance.
Networking: The Hidden Bottleneck
In a GPU datacenter, the network is often the limiting factor, not the GPUs themselves. Distributed training workloads need to synchronize gradients across dozens or hundreds of GPUs, and any network bottleneck directly impacts training time.
Our Network Architecture
Spine-leaf topology with 400GbE Ethernet and RDMA over Converged Ethernet (RoCE) for GPU-to-GPU communication within racks. For larger training clusters, we deploy InfiniBand NDR (400Gb/s) with adaptive routing to minimize congestion. Every rack has redundant top-of-rack switches, and the spine layer is designed for non-blocking throughput.
Power and Cooling
A single 8-GPU Hopper server draws 6 to 10 kW. A Blackwell-class HGX server pushes past 14 kW, and a full GB200 NVL72 rack lands at 120 kW, against 5 to 8 kW for a traditional compute rack. This fundamentally changes the datacenter design. Air cooling runs out of headroom past about 30 kW per rack; liquid cooling is no longer a nice-to-have, it is the only way to land modern AI hardware.
💧
Direct Liquid Cooling
Cold plates on each GPU with facility-level coolant distribution. Eliminates the acoustic nightmare of thousands of high-RPM fans.
⚡
PUE 1.1-1.2
Power Usage Effectiveness far below the 1.5+ industry average for air-cooled facilities. Lower energy costs, smaller carbon footprint.
📐
High-Density Racks
Direct-to-chip cold plates carry heat out of the rack rather than into the room. This is what enables 120 kW per rack in a GB200 NVL72 footprint, roughly 15x the density of a traditional compute rack, with far lower facility PUE.
Kubernetes for GPU Workloads
Standard Kubernetes treats compute as fungible, any CPU core is the same as any other. GPUs break this assumption completely. A workload that needs 8 GPUs with NVLink interconnects can't be scattered across random nodes.
The NVIDIA GPU Operator
We deploy the NVIDIA GPU Operator across all GPU nodes. It manages the full lifecycle: drivers, container runtimes, device plugins, and monitoring exporters. When a new GPU node joins the cluster, the operator automatically installs drivers, configures the container runtime for GPU passthrough, and registers GPU resources with the Kubernetes scheduler.
GPU Operator Device Plugin DCGM Exporter Container Runtime Node Feature Discovery
Multi-Instance GPU (MIG)
Not every workload needs a full GPU. Small inference models might only need a fraction of an H100's capacity. MIG lets us partition a single physical GPU into up to seven isolated instances, each with dedicated memory and compute resources.
Instead of a small model using 10% of a GPU's compute, MIG lets us pack multiple models onto the same card, dramatically improving utilization and cost efficiency.
Topology-Aware Scheduling
For distributed training jobs that need multiple GPUs, placement matters enormously. Eight GPUs on a single node connected via NVLink (900 GB/s) will train 2-3x faster than eight GPUs spread across nodes communicating over even 400GbE. Our scheduler uses custom topology constraints and NUMA-aware scheduling to ensure multi-GPU workloads land on the most efficient hardware configuration.
Inference at Scale
Training gets all the headlines, but inference is where most GPU cycles are actually spent. Running models in production at low latency and high throughput requires its own set of engineering challenges.
Model Serving
We use a combination of serving frameworks depending on the workload:
vLLM
PagedAttention-minnehåndtering øker LLM-throughput med 2-4x sammenlignet med naive implementasjoner. Ideell for høykonkurrerende chat- og completion-API-er.
TensorRT-LLM
Kompilerer modeller til optimaliserte kjøreplaner som presser ut hver eneste FLOP av maskinvaren. Best for latency-kritiske applikasjoner.
Triton Inference Server
Handles model ensemble pipelines and provides a unified API layer. Supports dynamic batching and model versioning out of the box.
Batching Strategies
The key to efficient GPU inference is keeping the hardware saturated. Dynamic batching collects incoming requests and groups them before sending them to the GPU. Continuous batching goes further, as individual requests in a batch complete, new requests are immediately inserted without waiting for the longest request to finish.
Continuous Batching Impact
For LLM inference with variable output lengths, continuous batching can double throughput compared to static batching. Requests that generate short outputs don't block the GPU from processing new work.
Latency vs. Throughput
Every inference deployment involves a tradeoff. We configure this per-workload:
Workload Type
Batch Size
Priority
Use Case
Real-time chat
Small (1-4)
Latency
Customer-facing AI assistants
API endpoints
Medium (8-32)
Balanced
Developer APIs, search
Bulk processing
Large (64+)
Throughput
Document analysis, batch embeddings
Custom Models: LoRA Fine-Tuning & Data Pipelines
Raw inference is only half the story. Most organizations need models tailored to their specific domain, trained on their data, speaking their language, understanding their context. That's where our fine-tuning platform comes in.
LoRA
Parameter-efficient tuning
10x
Faster than full fine-tune
LoRA Fine-Tuning at Scale
Full fine-tuning of a 70B parameter model requires hundreds of GBs of GPU memory and days of compute. LoRA (Low-Rank Adaptation) changes the game, by training small adapter matrices instead of modifying all model weights, we achieve comparable quality while modifying less than 0.1% of parameters.
How LoRA Works
Instead of updating all billions of parameters, LoRA freezes the pre-trained model and injects small trainable matrices into each transformer layer. The result: fine-tuning that takes hours instead of days, uses a fraction of the GPU memory, and produces adapters that are just megabytes, not gigabytes. You can hot-swap adapters at inference time to serve multiple specialized models from a single base.
Our platform handles the full LoRA workflow: dataset preparation, training job orchestration, adapter management, and deployment. Upload your data, configure your hyperparameters, and we handle the rest, scheduling training across available GPUs, monitoring loss curves, and automatically deploying the best checkpoint.
🔧
Adapter Management
Version, A/B test, and hot-swap LoRA adapters without reloading the base model. Serve dozens of specialized models from a single GPU.
📊
Training Dashboard
Real-time loss curves, evaluation metrics, and resource utilization. Automatic early stopping and checkpoint management.
Data Pipelines: Scraping & Processing
Fine-tuning is only as good as your data. Our platform includes a complete data pipeline for building high-quality training datasets, from web scraping and document ingestion to cleaning, deduplication, and formatting.
🌐
Web Scraping
Automated, scheduled scraping with intelligent content extraction. Handle JavaScript-rendered pages, rate limiting, and deduplication at scale.
🧹
Data Cleaning
PII removal, quality filtering, deduplication, and format conversion. Transform raw scraped data into clean training-ready datasets.
📝
Instruction Tuning
Generate instruction/response pairs from your documents. Build chat-optimized datasets from knowledge bases, FAQs, and documentation.
Enterprise Data Integrations
Fine-tuning and RAG are only useful if you can actually get to your data. Our platform connects to the systems where your business knowledge lives, no manual exports, no CSV uploads, no copy-pasting documents.
📁
SharePoint & OneDrive
Ingest documents, wikis, and team sites directly. Automatic sync keeps your training data and RAG indexes up to date as content changes.
🗄️
Databases
Connect to PostgreSQL, MySQL, MSSQL, MongoDB, and more. Query structured data for RAG context or extract records for fine-tuning datasets.
📄
Document Processing
Parse PDFs, Word docs, Excel spreadsheets, PowerPoint, HTML, and Markdown. OCR for scanned documents. Structured extraction from any format.
🔗
File Servers & Network Shares
SMB/CIFS and NFS mount support. Crawl file servers on a schedule, index new documents automatically, and respect existing folder permissions.
Supported Integrations
SharePoint, OneDrive, Google Drive, Confluence, Notion, Jira, Slack, Teams, S3, Azure Blob, GCS, SFTP, REST APIs, GraphQL endpoints, IMAP email — and custom connectors for anything we don't support out of the box. Data stays in your environment; we bring the compute to the data, not the other way around.
SharePoint OneDrive Google Drive Confluence Notion PostgreSQL MongoDB S3 REST APIs PDF Word/Excel OCR
Chat & Conversation Handling
Once your model is fine-tuned, it needs a production-grade serving layer that handles the complexity of real-world conversations.
Our Chat Platform
Multi-turn conversation management with context windowing, memory summarization, and retrieval-augmented generation (RAG). Built-in guardrails for content safety, token budgeting, and response quality monitoring. Deploy as an API, embed as a widget, or integrate with your existing tools.
RBAC: Granular Access Control
In any multi-team environment, not everyone should have access to everything. Our platform provides fine-grained role-based access control across two critical dimensions: models and MCP tools.
🔐
Model-Level RBAC
Control which teams can access which models and LoRA adapters. Restrict expensive GPU models to production workloads, give dev teams access to smaller models, and ensure sensitive fine-tuned models are only available to authorized users.
🧩
MCP Tool Permissions
Define which MCP tools each role can invoke. Limit code execution tools to engineering, restrict data access tools by department, and audit every tool call. Zero-trust by default, no tool access without explicit permission.
Why This Matters
Without RBAC, every user with API access can run any model and invoke any tool — including tools that access databases, execute code, or call external services. Our permission system ensures that a marketing intern's chatbot can't accidentally invoke a production database tool or burn through your most expensive GPU allocation.
LoRA Fine-Tuning Web Scraping Data Pipelines Chat API RBAC RAG Guardrails MCP Tools
The full loop, scrape domain data, clean and format it, fine-tune a LoRA adapter, deploy it behind a chat API with RAG and guardrails, runs entirely on our platform. No stitching together five different tools from five different vendors.
Autoscaling in Practice
The interactive story showed autoscaling at a high level, demand rises, new pods spin up, new nodes provision. In practice, GPU autoscaling is significantly more complex than CPU autoscaling.
Custom GPU Metrics
The standard Kubernetes HPA scales on CPU and memory, useless for GPU workloads. We deploy custom metrics adapters that expose GPU-specific signals:
Queue Depth GPU Utilization % Memory Pressure P99 Latency Tokens/Second
The HPA watches these metrics and scales the number of inference pods accordingly. When demand drops, pods scale down just as aggressively.
Cluster Autoscaling
When the HPA adds more pods than the current cluster can handle, the Cluster Autoscaler provisions new GPU nodes from our warm pool. The entire process, IPMI boot, OS install via PXE, GPU Operator setup, cluster join, takes minutes, not hours.
Cost Optimization
Smart Scaling Strategy
GPU nodes are expensive. We use predictive scaling based on historical patterns (many AI workloads have predictable daily cycles) combined with reactive scaling for unexpected spikes. Bin packing algorithms consolidate workloads onto fewer nodes during off-peak hours, allowing empty nodes to power down.
Security and Zero Trust
GPU infrastructure handles some of the most valuable data in any organization, proprietary models, training data, and inference inputs that may contain sensitive customer information. Security can't be an afterthought.
Network Isolation
Every tenant's GPU workloads run in isolated network namespaces with Kubernetes NetworkPolicies enforcing strict ingress/egress rules. GPU-to-GPU training traffic is segregated on dedicated VLANs. The control plane communicates exclusively over mutual TLS.
Encrypted Inference
Data in transit is encrypted end-to-end, from the client request through the load balancer, into the inference pod, and back. For customers with sensitive workloads we also offer confidential computing, where model weights and inference data remain encrypted in GPU memory via the hardware TEE features in recent NVIDIA and AMD accelerators. This is aligned with GDPR data-minimisation requirements out of the box, and full per-action audit logging is available for customers who need it for their own compliance programmes.
The Result: Infrastructure That Just Works
When all these layers work together, physical infrastructure, networking, Kubernetes orchestration, GPU-aware scheduling, intelligent autoscaling, and zero-trust security, you get a platform where AI teams focus on building models instead of managing infrastructure.
Deploy a model, set your latency and throughput targets, and the platform handles the rest: scaling GPUs up and down with demand, routing traffic for optimal performance, and keeping everything secure and compliant.
That's what ZeroSubnet delivers. Not just GPUs in a rack, but a complete, managed platform for AI infrastructure at any scale. If you're building AI applications and spending too much time on infrastructure, let's talk.