Deployment Models
Scalability
Platform Scale
- GPU Nodes — Scalable to thousands of nodes per deployment
- Accelerators — Scalable to thousands of GPUs per deployment
- GPU Vendors — Multi-vendor support for NVIDIA, AMD, and Intel
- Tenants — Multi-tenant with dedicated IAM realms per tenant
- Clusters per Tenant — Configurable via quota system
- Network Isolation — Configurable pool of tenant VRFs, multiple VLANs per VRF, expandable
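As a rough illustration of the network isolation model above, the sketch below hands each tenant one VRF and several VLANs from a configurable pool that can be expanded later. The class name `VrfPool`, its methods, and the VRF/VLAN identifiers are hypothetical and chosen for illustration only; the platform's actual allocation logic is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class VrfPool:
    """Hypothetical pool of tenant VRFs and VLAN IDs (illustrative only)."""
    free_vrfs: list[str]
    free_vlans: list[int]
    assigned: dict[str, dict] = field(default_factory=dict)

    def allocate(self, tenant: str, vlan_count: int) -> dict:
        """Assign one VRF and `vlan_count` VLANs to a tenant."""
        if tenant in self.assigned:
            raise ValueError(f"tenant {tenant!r} already has a VRF")
        if not self.free_vrfs:
            raise RuntimeError("VRF pool exhausted; expand the pool first")
        if vlan_count > len(self.free_vlans):
            raise RuntimeError("not enough free VLANs; expand the pool first")
        vrf = self.free_vrfs.pop(0)
        vlans = [self.free_vlans.pop(0) for _ in range(vlan_count)]
        self.assigned[tenant] = {"vrf": vrf, "vlans": vlans}
        return self.assigned[tenant]

    def expand(self, vrfs: list[str], vlans: list[int]) -> None:
        """Grow the pool, mirroring the 'expandable' property above."""
        self.free_vrfs.extend(vrfs)
        self.free_vlans.extend(vlans)


pool = VrfPool(free_vrfs=["vrf-a", "vrf-b"], free_vlans=[100, 101, 102, 103])
print(pool.allocate("tenant-1", vlan_count=2))
```

Keeping the pool explicit makes exhaustion a visible error rather than a silent overlap between tenants.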
Tenant Onboarding Flow
- Domain Created — Platform Super Admin creates tenant with name, initial quota, and network allocation
- Identity Provisioned — IAM realm auto-created with default roles, clients, and auth flows
- First Admin User — Domain Admin created with full tenant-level permissions
- Network Allocated — VRFs/VLANs assigned (multiple VLANs per tenant), resource quotas configured
- Organization & Project Setup — Domain Admin creates org/project hierarchy, sets per-project quotas
- Users Invited — Users created/invited with appropriate roles and project assignments
- Ready — Tenant is fully operational and can provision bare metal, create clusters, and deploy workloads
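The onboarding steps above run in a fixed order, which can be sketched as a small state machine. This is a minimal sketch, assuming each step must finish before the next begins; `OnboardingStep` and `TenantOnboarding` are hypothetical names, not part of any platform API.

```python
from enum import IntEnum


class OnboardingStep(IntEnum):
    """Steps in the order listed in the onboarding flow."""
    DOMAIN_CREATED = 1
    IDENTITY_PROVISIONED = 2
    FIRST_ADMIN_USER = 3
    NETWORK_ALLOCATED = 4
    ORG_PROJECT_SETUP = 5
    USERS_INVITED = 6
    READY = 7


class TenantOnboarding:
    """Hypothetical tracker that enforces the step order (assumption)."""

    def __init__(self, tenant: str) -> None:
        self.tenant = tenant
        self.completed: list[OnboardingStep] = []

    @property
    def next_step(self):
        """Return the next pending step, or None when all are done."""
        if len(self.completed) == len(OnboardingStep):
            return None
        return OnboardingStep(len(self.completed) + 1)

    def complete(self, step: OnboardingStep) -> None:
        """Mark a step done; reject skipped or repeated steps."""
        if step != self.next_step:
            raise ValueError(f"expected {self.next_step!r}, got {step!r}")
        self.completed.append(step)

    @property
    def is_ready(self) -> bool:
        return OnboardingStep.READY in self.completed


flow = TenantOnboarding("acme")
for step in OnboardingStep:
    flow.complete(step)
print(flow.is_ready)  # True
```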
Support & SLA
High Availability Infrastructure
The control plane uses multiple redundancy strategies:
- Portal & Load Balancing — Multiple replicas with automatic failover
- Identity Management — Clustered IAM with session replication
- Database Layer — Primary-replica configuration with rapid failover
- Service Coordination — etcd quorum for distributed consensus
- Metrics — HA metrics cluster with data replication
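The etcd quorum mentioned above follows majority arithmetic: a cluster of n members needs floor(n/2) + 1 of them to agree, so it tolerates the loss of the remaining members. The short sketch below works this out for a few cluster sizes; the sizes are examples only, since the member count used in a given deployment is not stated here.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum_size(members)


for n in (3, 4, 5):
    print(n, quorum_size(n), failure_tolerance(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

A four-member cluster tolerates no more failures than a three-member one, which is why odd member counts are the usual choice.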
Data Protection
- Configuration Backups — Automated backups every 6 hours
- Backup Retention — 12-24 months in object storage
- Audit Logs — Continuous storage with long-term retention
- Metrics — 90 days hot storage, 12 months cold archives
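To put the figures above in concrete terms, the sketch below counts how many configuration backups accumulate over a retention window and classifies a metric sample by age. It assumes a 365-day year and assumes the 12-month cold window is measured from ingestion; both are simplifications for illustration, and the function names are hypothetical.

```python
HOT_DAYS = 90            # hot storage window stated above
COLD_DAYS = 365          # 12 months, approximated as 365 days (assumption)
BACKUP_INTERVAL_HOURS = 6


def backups_retained(retention_days: int,
                     interval_hours: int = BACKUP_INTERVAL_HOURS) -> int:
    """Number of backups taken during a retention window."""
    if retention_days < 0 or interval_hours <= 0:
        raise ValueError("retention must be >= 0 and interval must be > 0")
    return retention_days * 24 // interval_hours


def metrics_tier(age_days: int) -> str:
    """Classify a metric sample as hot, cold, or expired by its age."""
    if age_days < 0:
        raise ValueError("age cannot be negative")
    if age_days <= HOT_DAYS:
        return "hot"
    if age_days <= COLD_DAYS:
        return "cold"
    return "expired"


print(backups_retained(365))   # 1460 backups at 12 months
print(backups_retained(730))   # 2920 backups at 24 months
print(metrics_tier(120))       # cold
```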
Monitoring & Alerts
Exporters cover every infrastructure layer. Alerts fire when critical thresholds are crossed:
- GPU temperature exceeding safe limits
- CPU utilization above threshold
- Node-level memory utilization above threshold
- Storage latency anomalies
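The alert conditions above can be sketched as a simple threshold check over a metrics sample. All limits, units, and metric names below are hypothetical placeholders, not the platform's actual values, and the fixed latency limit is a simplification of anomaly detection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    unit: str


# Hypothetical limits for illustration; real values are deployment-specific.
THRESHOLDS = [
    Threshold("gpu_temperature", 85.0, "C"),
    Threshold("cpu_utilization", 90.0, "%"),
    Threshold("node_memory_utilization", 90.0, "%"),
    Threshold("storage_latency", 20.0, "ms"),
]


def evaluate(sample: dict[str, float]) -> list[str]:
    """Return one alert message per metric that exceeds its limit."""
    alerts = []
    for t in THRESHOLDS:
        value = sample.get(t.metric)
        # Metrics missing from the sample are skipped rather than alerted on.
        if value is not None and value > t.limit:
            alerts.append(
                f"{t.metric}: {value}{t.unit} exceeds {t.limit}{t.unit}"
            )
    return alerts


print(evaluate({"gpu_temperature": 91.0, "cpu_utilization": 40.0}))
```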
Incident Management
Incidents follow severity classifications from Critical (platform-wide outages) through Low (minor issues with workarounds). Response includes detection, triage, investigation with correlation ID tracking, resolution, and post-incident review.
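The incident lifecycle described above could be modeled roughly as follows. This is a sketch under stated assumptions: only Critical and Low are named in the text, so the intermediate `HIGH` and `MEDIUM` levels are assumed, and the `Incident` class and its UUID-based correlation ID are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    CRITICAL = 1   # platform-wide outages
    HIGH = 2       # assumed intermediate level
    MEDIUM = 3     # assumed intermediate level
    LOW = 4        # minor issues with workarounds


# Response phases in the order given in the text.
PHASES = (
    "detection",
    "triage",
    "investigation",
    "resolution",
    "post-incident review",
)


@dataclass
class Incident:
    summary: str
    severity: Severity
    # Hypothetical correlation ID used to tie logs and events together.
    correlation_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    phase_index: int = 0

    @property
    def phase(self) -> str:
        return PHASES[self.phase_index]

    def advance(self) -> str:
        """Move to the next phase; the final phase cannot be advanced."""
        if self.phase_index == len(PHASES) - 1:
            raise ValueError("incident is already in post-incident review")
        self.phase_index += 1
        return self.phase


incident = Incident("control plane unreachable", Severity.CRITICAL)
print(incident.phase)      # detection
print(incident.advance())  # triage
```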
Support Tiers
- Standard — Business-hours ticket support with next-business-day response
- Enterprise — Priority response for critical issues plus dedicated engineer
- Premium — Rapid critical response with assigned team and custom SLA negotiation
Specific SLA terms, response times, and support tiers are defined per customer engagement.