Integration Design Document
Acronyms & Abbreviations
| Acronym | Definition |
|---|---|
| BMaaS | Bare Metal as a Service |
| CCP | Cirrus Cloud Platform |
| CMP | Cloud Management Platform |
| GPUaaS | GPU as a Service |
| — | Runtime Environment (Coredge GPUaaS Platform) |
| IAM | Identity and Access Management |
| IB | InfiniBand |
| JWT | JSON Web Token |
| MAAS | Metal as a Service (Canonical) |
| NDFC | Nexus Dashboard Fabric Controller (Cisco) |
| OCP | OpenShift Container Platform |
| RBAC | Role-Based Access Control |
| ABAC | Attribute-Based Access Control |
| VPC | Virtual Private Cloud |
| VRF | Virtual Routing and Forwarding |
| VXLAN | Virtual Extensible LAN |
1. Purpose and Scope
1.1 Purpose
This Integration Design Document (IDD) defines the technical integration architecture, data flows, API contracts, security controls, and operational procedures for connecting with the Coredge GPUaaS Platform (Runtime Environment). The document serves as the authoritative reference for development, testing, operations, and governance teams involved in delivering unified GPU-accelerated cloud services on Coredge's sovereign cloud infrastructure.
1.2 Scope
The integration covers the following functional domains:
- Identity and Access Management (IAM) — Keycloak federation between CCP and GPUaaS
- Compute — GPU as a Service provisioning via OpenStack APIs and Coredge Bare Metal
- Container Orchestration — Kubernetes (Cloud Orbiter / OCP) cluster lifecycle integration
- Network — VPC, VRF, VLAN, and VXLAN/EVPN fabric orchestration (Cisco NDFC)
- Storage — NetApp (CCP) and DDN AI400 / VAST Data (GPUaaS) integration paths
- Metering and Billing — orbiter-metering to billing pipeline
- Monitoring and Alerting — Prometheus / VictoriaMetrics / Zabbix federation
- Security — mTLS service mesh, RBAC/ABAC policy enforcement, audit trails
1.3 Out of Scope
- Hardware procurement, cabling, and physical data center operations
- Penetration testing or third-party security audits
- Day-2 operations for underlying bare-metal physical infrastructure
- Application-layer changes unrelated to integration APIs
- MVP2 and MVP3 services unless explicitly noted
2. System Overview
2.1 Cirrus Cloud Platform (CCP)
Coredge is building a sovereign cloud platform for government and enterprise customers across India. The Cloud Management Platform layer is delivered by Cirrus Cloud Platform (CCP), which provides a hyper-scaler-grade self-service portal spanning IaaS, PaaS, and SaaS services. CCP is composed of the following orchestration layers:
- Cirrus Cloud Platform (CCP) — Cloud Management Platform, self-service portal, and IaaS orchestrator (OpenStack-based)
- Cloud Orbiter — Kubernetes orchestrator for container workloads
CCP runs as a microservices application deployed on Kubernetes (management cluster) within each availability zone, with active-passive high availability across two AZs per region and global services replicated across North and South regions.
2.2 Coredge GPUaaS Platform
The Coredge GPUaaS Platform is a purpose-built, bare-metal GPU cloud that provisions, orchestrates, and meters NVIDIA (H100) GPU infrastructure. It delivers:
- Bare Metal Provisioning via Canonical MAAS over IPMI/PXE
- Kubernetes Cluster Orchestration on bare metal GPU nodes
- Slurm HPC Cluster Orchestration via Slinky Operator
- High-performance storage via DDN AI400 (Lustre/InfiniBand) and VAST Data (CSI/S3)
- Network isolation via Cisco NDFC (VXLAN/EVPN, per-tenant VRFs)
- Full metering pipeline (DCGM, Prometheus, VictoriaMetrics, orbiter-metering)
2.3 Integration Relationship
CCP acts as the unified customer-facing control plane. The Coredge GPUaaS Platform acts as a specialized compute substrate exposed through CCP service catalogue entries. When a customer provisions GPU as a Service (GPUaaS) through the CCP Cloud Portal, CCP delegates to Coredge platform APIs for resource lifecycle management, while retaining control of IAM, billing aggregation, quota governance, and customer onboarding.
| Concern | Owner: CCP | Owner: Coredge GPUaaS |
|---|---|---|
| Customer Portal / UI | CCP Self-Service Console | No direct portal exposure |
| Tenant Identity & SSO | Keycloak (CCP Auth) — Federation | Keycloak (GPUaaS) — realm per tenant |
| Service Catalogue | CCP Service Catalogue + subscription | GPUaaS API endpoints |
| GPU Compute Provisioning | API delegation via OpenStack / Coredge APIs | Bare Metal Provisioning (MAAS) |
| Kubernetes Orchestration | Cloud Orbiter (CCP) for CaaS | compass-orchestrator for GPU K8s |
| Network Fabric | VPC / Firewall / LB (OpenStack + CheckPoint/Palo Alto) | VRF/VLAN (Cisco NDFC) for GPU nodes |
| Storage | NetApp (Block, Object, File) | DDN AI400 (GPU workload), VAST (platform) |
| Metering & Billing | orbiter-metering aggregation, billing | DCGM, Prometheus, orbiter-metering |
| Monitoring | Zabbix, Grafana (CCP) | VictoriaMetrics, Prometheus, Grafana |
| Quota Enforcement | CCP quota service per tenant/cell | orbiter-metering + domain quota |
| Backup & DR | Veritas backup agent, geo-replicated object storage | VAST S3 (pg_dump, mongodump, etcd) |
3. Integration Architecture
3.1 Architecture Principles
- API-First: All integrations are implemented through well-defined REST APIs or gRPC contracts with versioned endpoints.
- Loose Coupling: CCP and GPUaaS communicate through defined interface contracts; internal implementation changes in either system must not break the integration.
- Zero-Trust Security: Every inter-service call requires mutual authentication (mTLS) and JWT-based authorization. No implicit trust based on network location.
- Single Source of Truth per Domain: CCP owns customer identity and billing records; Coredge GPUaaS owns GPU hardware state and real-time metrics.
- Idempotency: All provisioning API calls must be idempotent. Retry logic must not produce duplicate resources.
- Observability: Every integration point emits structured logs with correlation IDs for end-to-end request tracing.
3.2 Logical Integration Layers
The integration is organized into four logical layers:
| Layer | Name | Description |
|---|---|---|
| L1 | Identity & Access | Federated Keycloak realms, JWT propagation, RBAC/ABAC synchronization between CCP and GPUaaS. |
| L2 | Control Plane | Resource lifecycle APIs (provision, scale, delete) for GPU Bare Metal, Kubernetes clusters, VPCs, and storage volumes. |
| L3 | Data Plane | Tenant networking fabric (VXLAN/EVPN), InfiniBand storage access (DDN/UFM), and workload traffic paths. |
| L4 | Observability & Billing | Metering event streams, quota synchronization, billing aggregation, monitoring federation, and audit log consolidation. |
3.3 Integration Topology
The following describes the high-level request flow from customer to GPU hardware:
- Customer accesses the CCP Cloud Portal (CCP Self-Service Console) with an identity authenticated and federated through CCP Keycloak.
- Customer places a GPU service order through the CCP service catalogue. Coredge calls the CCP onboarding API (POST /api/organizations) to create or update the tenant.
- CCP validates the request against tenant quota (orbiter-metering quota service), then dispatches a provisioning event to the Coredge GPUaaS API (baremetal-manager or compass-orchestrator) over the private integration network.
- Coredge GPUaaS executes the provisioning pipeline: network fabric setup (Cisco NDFC), IPMI/PXE boot (MAAS), OS install, GPU agent deployment, storage allocation (DDN/UFM), and cluster formation (K8s or Slurm).
- The provisioned resource is registered back in CCP inventory. Tenant gains self-service access from the CCP portal.
- GPU usage data flows from DCGM Exporter to Prometheus to VictoriaMetrics to orbiter-metering. CCP pulls aggregated billing records for invoice generation.
4. Integration Points
4.1 Identity and Access Management
4.1.1 Overview
Both CCP and the Coredge GPUaaS Platform use Keycloak as their Identity Provider. The integration federates these two Keycloak deployments so that a customer authenticated on CCP does not need to re-authenticate when their workloads are dispatched to the GPUaaS platform.
4.1.2 Federation Model
CCP Keycloak (master IdP) issues signed JWT tokens (RS256) containing tenant realm, role claims, and project/organization attributes. The Coredge GPUaaS Keycloak is configured as a relying party — it trusts tokens issued by CCP Keycloak after validating the token signature against the CCP Keycloak public key endpoint (JWKS URI).
💡Note
Coredge serves as the canonical identity store. All customer accounts are created, modified, and deactivated exclusively in Coredge. CCP Keycloak federation uses SAML 2.0 or OIDC, depending on the identity provider's configuration.
4.1.3 Token Structure
| JWT Claim | Source | Usage in GPUaaS |
|---|---|---|
| sub | CCP Keycloak | User unique identifier — maps to Coredge domain user record |
| realm_name | CCP Keycloak | Tenant realm — maps to GPUaaS Domain (tenant isolation boundary) |
| roles | CCP Keycloak | RBAC roles (e.g., Cell Administrator, Cell VM Admin) — translated to GPUaaS Project Admin or Domain Admin |
| domain_id | CCP (custom claim) | CCP tenant identifier — maps to GPUaaS Domain ID |
| project_id | CCP (custom claim) | CCP cell/project identifier — maps to GPUaaS Project scope |
| org_id | CCP (custom claim) | CCP organization identifier — maps to GPUaaS Organization scope |
| exp | Keycloak | Token expiry: 5–15 minute TTL for access tokens |
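The validation path described in 4.1.2 can be sketched in a few lines. The following is a minimal example using PyJWT; the JWKS URL is illustrative, and the custom claim names follow the table above.

```python
import jwt  # PyJWT
from jwt import PyJWKClient

# Illustrative JWKS URI; the real realm path depends on the CCP Keycloak deployment.
CCP_JWKS_URI = "https://keycloak.ccp.example/realms/tenant-a/protocol/openid-connect/certs"
_jwks_client = PyJWKClient(CCP_JWKS_URI)

def validate_ccp_token(token: str) -> dict:
    """Verify signature, expiry, and required claims of a CCP-issued RS256 JWT."""
    signing_key = _jwks_client.get_signing_key_from_jwt(token)
    claims = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],            # CCP Keycloak signs with RS256 (4.1.2)
        options={
            "require": ["exp", "sub"],   # expiry and subject are mandatory
            "verify_aud": False,         # audience policy is assumed to sit at the gateway
        },
    )
    return {
        "user_id": claims["sub"],
        "domain_id": claims.get("domain_id"),    # tenant isolation boundary (4.1.3)
        "project_id": claims.get("project_id"),
        "org_id": claims.get("org_id"),
        "roles": claims.get("roles", []),
    }
```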
4.1.4 Role Mapping
| CCP Role | GPUaaS Role | Effective Permissions |
|---|---|---|
| Tenant Super Administrator | Platform Super Admin (scoped to domain) | Full control within tenant domain |
| Tenant Administrator | Domain Admin | Full domain control: clusters, BM, networks, quotas |
| Cell Administrator | Project Admin | Full project control: GPU clusters, storage, networks |
| Cell VM Admin | Project Admin (Compute scope) | GPU node and cluster lifecycle management |
| Cell Container Admin | Project Admin (K8s scope) | Kubernetes cluster creation, management, workload deployment |
| Cell Viewer / Tenant Viewer | Viewer | Read-only access to resources and dashboards |
| Tenant Billing Admin | Domain Admin (Billing scope only) | Metering dashboard, cost reports, quota usage |
4.1.5 Authentication Flow
- Customer logs into CCP Cloud Portal — CCP redirects to identity provider (SAML/OIDC).
- The identity provider authenticates the user (with optional MFA: TOTP, SMS, or hardware key).
- Coredge issues an assertion to CCP Keycloak; CCP Keycloak issues a signed JWT containing domain, org, and project claims.
- CCP backend services validate the JWT on every request (signature + expiry + realm).
- When CCP dispatches a request to the Coredge GPUaaS API, it includes the JWT in the Authorization header. GPUaaS orbiter-auth validates the token against the CCP Keycloak JWKS endpoint.
- orbiter-auth performs RBAC (role check) + ABAC (domain/project attribute check) before allowing the request to proceed.
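A minimal sketch of that combined check, assuming the claim dictionary produced by the validation sketch in 4.1.3; the function name and argument shapes are illustrative, not the actual orbiter-auth interface.

```python
def authorize(claims: dict, required_role: str,
              resource_domain: str, resource_project: str) -> bool:
    """RBAC role check followed by ABAC domain/project attribute check."""
    if required_role not in claims.get("roles", []):     # RBAC: caller must hold the mapped role (4.1.4)
        return False
    if claims.get("domain_id") != resource_domain:       # ABAC: token must be scoped to the resource's domain
        return False
    return claims.get("project_id") == resource_project  # ABAC: and to its project
```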
4.2 Compute — GPU as a Service
4.2.1 Integration Model
GPU as a Service (GPUaaS) is listed in the CCP Service Catalogue. Integration is via OpenStack APIs (for VM-based GPU slices where applicable) and via the Coredge baremetal-manager API for dedicated bare-metal GPU node provisioning. Both paths are fronted by the CCP API Gateway.
4.2.2 Bare Metal Manager API
| Method | Endpoint | Request Body | Description |
|---|---|---|---|
| POST | /api/baremetal-manager/allocate | {flavor, os_image, network_id, project_id, tenant_id} | Allocate a bare-metal GPU node to a tenant. Triggers NDFC network setup, MAAS provisioning. |
| GET | /api/baremetal-manager/{node_id}/status | — | Poll provisioning state: PENDING, PROVISIONING, ACTIVE, FAILED. |
| POST | /api/baremetal-manager/{node_id}/release | {drain: true} | Drain workloads and release node back to available pool. |
| GET | /api/baremetal-manager/flavors | — | List available GPU node flavors (H100 8-GPU, etc.) |
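A sketch of the CCP-side allocation call, assuming Python requests with an mTLS client certificate pair; the hostname, certificate paths, and the flavor/image names are illustrative.

```python
import uuid
import requests

GPUAAS_API = "https://gpuaas-api.internal.example"   # illustrative private-link hostname
CLIENT_CERT = ("/etc/ccp/pki/client.crt", "/etc/ccp/pki/client.key")  # mTLS client pair

def allocate_gpu_node(token: str, tenant_id: str, project_id: str, network_id: str) -> dict:
    """Allocate a bare-metal GPU node per the table above."""
    resp = requests.post(
        f"{GPUAAS_API}/api/baremetal-manager/allocate",
        json={
            "flavor": "h100-8gpu",              # illustrative flavor name
            "os_image": "golden-ubuntu-22.04",  # illustrative golden image
            "network_id": network_id,
            "project_id": project_id,
            "tenant_id": tenant_id,
        },
        headers={
            "Authorization": f"Bearer {token}",
            "X-Correlation-ID": str(uuid.uuid4()),  # end-to-end tracing (Section 3.1)
        },
        cert=CLIENT_CERT,   # mTLS on the private integration network
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()      # expected to carry node_id for status polling
```

Because allocation must be idempotent (Section 3.1), a production client would also treat HTTP 409 as "verify state, then reconcile" per Section 8.1 rather than blindly retrying.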
4.2.3 End-to-End Provisioning Data Flow
- CCP Self-Service Console receives a GPU node provisioning request from tenant.
- CCP API Gateway validates the JWT and forwards to platform microservice.
- Platform service checks quota against orbiter-metering. If quota exceeded, returns HTTP 422 with quota-exceeded error to the tenant.
- CCP calls POST /api/baremetal-manager/allocate on Coredge GPUaaS over the private integration network (mTLS).
- Coredge baremetal-manager triggers: (a) NDFC network fabric setup — VRF + VLAN allocation, (b) MAAS IPMI power-on + PXE boot, (c) OS install via golden image, (d) Cloud-init, agent deployment.
- Storage allocation: DDN tenant directory created, NodeMap assigned, IB PKey created via UFM.
- Agent registers with GPUaaS portal via gRPC (port 8030/8040). Admin approves host.
- Provisioning state transitions to ACTIVE. GPUaaS notifies CCP via webhook (POST /ccp/webhooks/baremetal/state-change); a sketch of the CCP-side receiver follows this list.
- CCP registers the node in its inventory, updates tenant resource view. WebSocket notification sent to portal.
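A sketch of the CCP-side webhook receiver for step 8, using Flask; the payload field names and the inventory/alarm hooks are assumptions, not a documented contract.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def register_node_in_inventory(node_id: str) -> None:
    ...  # update CCP inventory and push the WebSocket notification to the portal (step 9)

def raise_provisioning_alarm(node_id: str) -> None:
    ...  # notify Ops via the CCP Alarm Service

@app.route("/ccp/webhooks/baremetal/state-change", methods=["POST"])
def baremetal_state_change():
    """Handle async provisioning state changes from Coredge GPUaaS."""
    event = request.get_json()
    node_id, state = event["node_id"], event["state"]   # assumed payload fields
    if state == "ACTIVE":
        register_node_in_inventory(node_id)
    elif state == "FAILED":
        raise_provisioning_alarm(node_id)
    return jsonify({"received": True}), 200
```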
4.3 Container Orchestration — Kubernetes
4.3.1 Integration Model
CCP delivers Container as a Service (CaaS) via Cloud Orbiter, which supports both OCP (OpenShift Container Platform) and standard Kubernetes clusters. The Coredge GPUaaS Platform delivers GPU-accelerated Kubernetes clusters (via compass-orchestrator) on bare-metal nodes. These two systems integrate at the cluster registration and cluster agent level.
GPU Kubernetes clusters provisioned by Coredge are registered back into CCP Cloud Orbiter using the Cluster Agent protocol (gRPC, port 8030/8040), enabling CCP tenants to manage GPU workloads from the unified CCP portal.
4.3.2 Cluster Registration API
| Method | Endpoint (CCP Cloud Orbiter) | Description |
|---|---|---|
| POST | /api/orbiter/clusters/register | Register an externally provisioned GPU K8s cluster with Cloud Orbiter. Body: {cluster_name, kubeconfig_secret, node_count, gpu_type, tenant_id, project_id}. |
| GET | /api/orbiter/clusters/{cluster_id} | Get cluster status, node health, GPU availability, addon state. |
| POST | /api/orbiter/clusters/{cluster_id}/scale | Scale worker nodes up or down. GPUaaS executes kubeadm join/drain. |
| DELETE | /api/orbiter/clusters/{cluster_id} | Initiate cluster teardown: drain → kubeadm reset → deregister → release BM nodes. |
4.3.3 GPU Operator Integration
When a GPU Kubernetes cluster is provisioned, the Coredge compass-orchestrator deploys the NVIDIA GPU Operator, whose device-plugin DaemonSet registers GPU resources with the Kubernetes node resource API (nvidia.com/gpu: 8 per node). These resource labels are propagated to Cloud Orbiter and are available for workload scheduling via node selectors and resource requests in CCP-deployed workloads.
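To illustrate how a CCP-deployed workload consumes those advertised resources, the following sketch submits a pod requesting a full 8-GPU node via the official Kubernetes Python client; the image and namespace are illustrative.

```python
from kubernetes import client, config

def submit_gpu_pod(namespace: str = "tenant-a") -> None:
    """Create a pod that requests GPUs advertised by the NVIDIA GPU Operator."""
    config.load_kube_config()   # kubeconfig obtained via the Cloud Orbiter registration
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-training-job"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.01-py3",   # illustrative training image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "8"},          # one full H100 node (4.3.3)
                ),
            )],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace, pod)
```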
4.4 Network Integration
4.4.1 Overview
CCP manages tenant VPC, Firewall, and Load Balancer resources via OpenStack APIs and CheckPoint/Palo Alto integrations. The Coredge GPUaaS Platform manages GPU node networking via Cisco NDFC (VXLAN/EVPN). These two network domains are connected through a defined inter-domain routing policy that allows GPU workload traffic to reach CCP-managed services (e.g., load balancers, object storage endpoints) while maintaining tenant isolation.
4.4.2 Network Segmentation Model
| Network Segment | Owner | Integration Point |
|---|---|---|
| Tenant VPC (CCP) | CCP / OpenStack | Customer workload network. GPU nodes egress through CCP-managed VPC gateways for external services. |
| Tenant VRF (GPUaaS) | Coredge / Cisco NDFC | GPU bare-metal node isolation. 4 VLANs per tenant: Control Plane, GPU Worker, LB, Reserved. |
| GPU Node Management VLAN 901 | Coredge | Cluster agent gRPC communications to GPUaaS portal. |
| Provisioning VRF VLAN 902 | Coredge | MAAS PXE/DHCP relay for OS provisioning. Not exposed to tenant. |
| InfiniBand Fabric (UFM PKey) | Coredge / NVIDIA UFM | GPU-to-GPU RDMA (NCCL) and GPU-to-DDN storage (Lustre). Per-tenant PKey isolation. |
| External / Internet Gateway | CCP (NAT Gateway) | GPU nodes access external services (package updates, ML model registries) through CCP NAT Gateway. |
| CCP–GPUaaS Private Link | Both | Dedicated private network segment for integration API calls between CCP and Coredge GPUaaS. mTLS enforced. |
4.4.3 VPC Lifecycle Coordination
When a tenant requests a VPC through the CCP portal, CCP calls OpenStack APIs to create the VPC constructs. If GPU bare-metal nodes are allocated to the tenant in the same provisioning request, CCP additionally notifies the Coredge network-manager to allocate a matching tenant VRF. The VRF is linked to the tenant VPC through a pre-configured L3 routing policy on the Palo Alto firewall, enabling east-west traffic between CCP VMs and GPU bare-metal nodes within the same tenant boundary.
💡Note
Dynamic VRF creation in Coredge GPUaaS currently requires a manual firewall rule update on the Palo Alto firewall. Pre-created (pooled) VRF allocation is the preferred path for production deployments.
4.5 Storage Integration
4.5.1 Storage Architecture Mapping
| Use Case | CCP Storage (NetApp) | GPUaaS Storage | Integration Notes |
|---|---|---|---|
| VM Block Storage | NetApp Block (iSCSI/FC) | Not applicable | CCP-owned. No integration required. |
| Object Storage (S3) | NetApp S3-compatible | VAST S3 (platform internal) | Tenant S3 endpoints served from CCP NetApp. GPUaaS uses VAST S3 internally for backup/config only. |
| File Storage (NFS) | NetApp NFS | DDN Lustre (GPU workloads) | Separate NFS mounts. GPU workloads use DDN for high-throughput AI/ML data access. |
| GPU Training Data | Not applicable | DDN AI400 (Lustre over IB) | Accessed by GPU bare-metal nodes via InfiniBand. 4 Tb/s per node aggregate throughput. |
| Platform DB Backup | NetApp-backed Veritas agent | VAST S3 (pg_dump, mongodump, etcd) | CCP backup: Veritas. GPUaaS backup: VAST S3. Cross-region replication applies to CCP data only. |
| CCP Config & Logs | Object storage in local region (5 TB) | Not applicable | Managed by CCP only. Backup copied cross-region. |
4.5.2 Storage Provisioning Data Flow (GPU Workloads)
- Tenant subscribes to GPU service — CCP creates tenant record and notifies Coredge GPUaaS to create tenant storage allocation.
- Coredge Storage Plugin (via SSH to DDN MGS) creates the tenant directory on Lustre: /lustre/{tenant-id}/ (a provisioning sketch follows this list).
- NVIDIA UFM creates an InfiniBand PKey for the tenant. All GPU node mlx GUIDs are added as PKey members.
- A DDN NodeMap is created, mapping the tenant's IB IP address ranges to the tenant directory. Only mapped IPs can mount the filesystem.
- NFS over VIP is configured for environments requiring Ethernet storage access.
- Quotas are set on the tenant directory according to the subscribed service tier.
- NodeMap is activated — tenant's GPU nodes can now access the DDN filesystem over InfiniBand with hardware-level isolation enforced by both UFM PKey and DDN NodeMap (dual isolation).
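A minimal sketch of steps 2 and 6, assuming SSH access to the DDN MGS and Lustre project quotas; the hostname, the project-ID scheme, and exact lfs options are assumptions that vary by DDN/Lustre release.

```python
import subprocess

DDN_MGS = "ddn-mgs.internal.example"   # illustrative MGS hostname (step 2: SSH access)

def provision_tenant_storage(tenant_id: str, project_id: int, hard_limit: str = "100T") -> None:
    """Create the tenant directory and set a quota on it (steps 2 and 6 above)."""
    tenant_dir = f"/lustre/{tenant_id}"
    for cmd in (
        f"mkdir -p {tenant_dir}",
        f"lfs project -s -p {project_id} {tenant_dir}",           # tag directory with an inherited project ID
        f"lfs setquota -p {project_id} -B {hard_limit} /lustre",  # hard block limit for the project
    ):
        subprocess.run(["ssh", DDN_MGS, cmd], check=True)         # fail fast if any step errors
```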
4.6 Metering, Billing, and Quota
4.6.1 Metering Pipeline
The Coredge GPUaaS Platform meters GPU resource consumption at hardware level (15-second granularity) using DCGM Exporter, Node Exporter, Slurm job accounting (slurmdbd), and storage/network telemetry. The orbiter-metering service aggregates these raw metrics into billable usage records and exposes a billing export API consumed by CCP for invoice generation.
| Metric | Source | Granularity | Billing Unit |
|---|---|---|---|
| GPU-Hours | DCGM Exporter (per GPU, per node) | 15-sec raw → hourly billable | GPU-node-hours × rate card |
| CPU-Hours | Node Exporter | 15-sec → hourly | vCPU-hours × rate card |
| Bare Metal Node-Hours | baremetal-manager (MongoDB state) | Allocation start/end timestamp | Node-hours × rate card |
| K8s Cluster-Hours | compass-orchestrator (MongoDB) | Cluster create/delete timestamp | Cluster-hours × rate card |
| Slurm Job GPU-Hours | slurmdbd → MariaDB | Per job on completion | AllocGRES (GPU count) × Elapsed × rate |
| Storage (DDN) | DDN Storage Plugin | Per tenant directory, polled | GB-hours × rate card |
| InfiniBand Bandwidth | NVIDIA UFM per PKey | Real-time, per tenant PKey | TB transferred × rate (if applicable) |
4.6.2 Billing Export API (Coredge → CCP)
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/metering/usage?tenant_id={id}&from={ts}&to={ts} | Retrieve aggregated usage records for a tenant within a time window. Returns GPU-hrs, CPU-hrs, node-hrs, cluster-hrs, storage-GB, per project. |
| GET | /api/metering/quota/{tenant_id} | Current quota usage vs. allocated quota per resource type. |
| POST | /api/metering/quota/{tenant_id} | Update quota allocation (called by CCP when customer upgrades/downgrades subscription). |
| GET | /api/metering/export/csv?tenant_id={id}&period={month} | Download CSV billing export for a billing period. Compatible with invoice import format. |
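A sketch of the CCP-side usage pull for the first row of the table; the host and mTLS pair are the same illustrative values used in the Section 4.2 sketch, and the parameter names follow the endpoint signature above.

```python
import requests

GPUAAS_API = "https://gpuaas-api.internal.example"   # illustrative private-link hostname
CLIENT_CERT = ("/etc/ccp/pki/client.crt", "/etc/ccp/pki/client.key")

def fetch_usage(token: str, tenant_id: str, start_ts: str, end_ts: str) -> dict:
    """Retrieve aggregated usage records for a billing window."""
    resp = requests.get(
        f"{GPUAAS_API}/api/metering/usage",
        params={"tenant_id": tenant_id, "from": start_ts, "to": end_ts},
        headers={"Authorization": f"Bearer {token}"},
        cert=CLIENT_CERT,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()   # per-project GPU-hrs, CPU-hrs, node-hrs, cluster-hrs, storage-GB
```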
4.6.3 Quota Synchronization Flow
- Customer subscribes/upgrades GPU service tier on CCP portal.
- Coredge calls CCP quota management API to update tenant allocation.
- CCP propagates the new quota to Coredge GPUaaS via POST /api/metering/quota/{tenant_id}.
- orbiter-metering updates the in-memory quota counter. New quota takes effect immediately for subsequent resource requests.
- If a resource request would exceed quota, the Coredge API returns HTTP 422 (Quota Exceeded). CCP displays the error to the tenant with a prompt to upgrade their subscription.
- Quota dashboard (80% and 90% warning thresholds) is updated in both CCP portal and Coredge tenant dashboard.
4.7 Monitoring and Alerting Integration
4.7.1 Monitoring Stack Federation
| Component | CCP | GPUaaS (Coredge) | Integration |
|---|---|---|---|
| Metrics Collection | Zabbix Agent (node, service) | DCGM, Node, K8s State Exporters | Federated scrape |
| Metrics Storage | Zabbix DB | VictoriaMetrics (HA) | API bridge for CCP consumption |
| Alerting | Zabbix Alert Rules | Prometheus AlertManager | CCP Notification Service (SMTP/SMS) |
| Dashboards | Grafana (CCP-managed) | External + Internal Grafana (GPUaaS) | GPU dashboards embedded in CCP portal via iFrame / API |
| Log Aggregation | APM/NPM/IPM (CCP Log Analyzer) | VictoriaMetrics + log rotation | Log forwarding via syslog/agent to CCP Log Analyzer |
| Audit Logs | CCP audit trail (ordr_mgmt) | Correlation ID log stream (S3 every 6h) | Cross-correlated via X-Correlation-ID header |
4.7.2 Alert Thresholds (GPU Infrastructure)
| Alert | Warning | Critical | Notification Target |
|---|---|---|---|
| GPU Temperature | > 80°C for 5 min | > 90°C for 2 min | Coredge Ops + CCP Notification Service → Tenant |
| GPU Memory Utilization | > 90% for 15 min | > 95% | CCP Alarm Service → Tenant dashboard |
| CPU Utilization (K8s nodes) | > 75% | > 90% | Coredge Ops team |
| Memory Utilization | > 70% | > 85% | Coredge Ops team |
| DDN Storage Quota | > 80% used | > 90% used | CCP Notification Service → Tenant |
| Cluster Node Not Ready | 1 node down > 2 min | 2+ nodes down | Coredge Ops + CCP Alarm Service → Tenant |
| Storage Latency (DDN) | > 10 ms | > 15 ms | Coredge Ops team |
4.8 Tenant Onboarding Integration
4.8.1 End-to-End Onboarding Flow
- Customer self-registers on CCP or is onboarded by the Coredge Business team.
- Customer subscribes to Cirrus Cloud Platform. The platform calls CCP POST /api/organizations with party/billing account details.
- CCP Keycloak auto-provisions a tenant realm. Default roles (Tenant Super Administrator, Tenant Administrator) are created.
- CCP creates default resources: project/cell/VPC in the default region, default service catalogue.
- CCP notifies the Coredge GPUaaS tenant onboarding API (POST /api/tenants) to create the corresponding domain, allocate a VRF pool (4 VLANs), and set initial resource quotas; a sketch of this call follows the list.
- Coredge GPUaaS auto-provisions: (a) Keycloak realm for the domain, (b) VRF/VLAN allocation in Cisco NDFC, (c) Initial storage directory structure in DDN Lustre, (d) Quota records in orbiter-metering.
- Tenant administrator receives credentials and can log in to CCP portal. CCP identity federation is active.
- Resource hierarchy is enforced: Tenant (mapped to LSI) → Cell → Resources (in CCP); Domain → Organization → Project (in GPUaaS).
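A sketch of the CCP-to-GPUaaS call in step 5; the request body fields are assumptions based on what the flow says the call must convey (domain creation, a 4-VLAN VRF pool, initial quotas).

```python
import uuid
import requests

def onboard_gpuaas_tenant(token: str, tenant_id: str, tenant_name: str, quotas: dict) -> dict:
    """Create the corresponding GPUaaS domain after the CCP tenant exists (step 5)."""
    resp = requests.post(
        "https://gpuaas-api.internal.example/api/tenants",   # illustrative endpoint host
        json={
            "tenant_id": tenant_id,
            "name": tenant_name,
            "vlan_count": 4,          # per-tenant VRF pool size (4.4.2)
            "initial_quota": quotas,  # seeded into orbiter-metering (step 6d)
        },
        headers={
            "Authorization": f"Bearer {token}",
            "X-Correlation-ID": str(uuid.uuid4()),
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```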
4.8.2 Coredge ↔ CCP ↔ GPUaaS Entity Mapping
| Coredge Entity | CCP (ACP) Entity | GPUaaS Entity | Notes |
|---|---|---|---|
| Party | — | — | master entity |
| Billing Account (BA) | — | — | billing scope |
| Logical Subscriber Identity (LSI) | — | — | subscriber record |
| Tenant | Tenant | Domain (Tenant) | 1:1:1 mapping enforced |
| — | Cell | Project | Multiple cells per tenant allowed |
| — | Resources | Resources (BM/K8s/Storage) | Scoped within cell/project |
5. Security Architecture
5.1 Security Principles
- Zero Trust: No implicit trust between any systems. Every request authenticated and authorized regardless of network origin.
- Least Privilege: Users and services receive only the minimum permissions required for their function.
- Defense in Depth: Multiple independent security controls at network, identity, data, and application layers.
- Encryption Everywhere: All data in transit encrypted with TLS 1.2+ / mTLS. All data at rest encrypted with AES-256.
- Auditability: All actions logged with correlation IDs, timestamps, and user identity. Logs retained per compliance requirements.
5.2 Security Controls per Layer
| Layer | Control | Implementation |
|---|---|---|
| Identity | Federated IAM | Keycloak with SAML/OIDC federation. RS256-signed JWTs. 5–15 min access token TTL. Single-use refresh tokens. |
| Identity | MFA | TOTP, SMS, email, hardware keys. Configurable per realm and per role. |
| Identity | Session Management | Admin force-logout. Token revocation. Correlation ID tracking (X-Correlation-ID header). |
| Network | Transport Encryption | TLS 1.2+ on all external endpoints. HSTS enforced. Cipher: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384. |
| Network | mTLS (Service Mesh) | Mutual TLS with automated PKI for all inter-service calls (CCP microservices, Coredge microservices, cross-system integration). |
| Network | Network Segmentation | Palo Alto stateful firewall between orchestration and GPU nodes. Per-tenant VRF isolation. InfiniBand PKey per tenant. |
| Network | API Gateway | CCP API Gateway enforces authentication and rate limiting on all inbound API calls. |
| Data | Encryption at Rest | AES-256 for all stored data (databases, object storage, backups). Key management via platform PKI. |
| Data | Tenant Data Isolation | Database-level tenant tagging. RBAC + ABAC enforcement. Grafana data source isolation per tenant. |
| Application | RBAC + ABAC | orbiter-auth performs role check AND attribute (domain/project/org) check on every request. |
| Application | Image Security | CIS-benchmarked golden OS images. Non-privileged containers. Vulnerability scanning in CI/CD pipeline. |
| Operations | Audit Logging | All provisioning, access, and configuration changes logged with user ID, timestamp, correlation ID. Backed to S3 every 6 hours. |
| Operations | Compliance | NIST 800-53 Rev 5 (AC, AU), ISO/IEC 27001 (A.9, A.10), HIPAA (IAM, RBAC, audit). |
6. High Availability and Disaster Recovery
6.1 CCP High Availability
- Each region has two CCP clusters: AZ1 (Primary / Active) and AZ2 (Standby / Passive).
- 3-node web layer (reverse proxy) in DMZ per AZ. Kubernetes cluster: 3 master + 5 worker nodes per AZ.
- PostgreSQL: Active-Passive with Logical/Streaming Replication across AZs. 3-node cluster per AZ (3+3 node setup, with arbiter VM for manual failover).
- MongoDB: Active-Passive replication within region. MongoDB Active-Active replication (change-stream) for global services (tenant/project/user metadata) across North and South regions.
- OpenFGA (AuthZ DB): Active-Passive between two regions. Writes always go to primary region. Read-heavy access pattern.
- Global GSLB probe detects Active cluster failure and routes traffic to Passive. Internal 2n+1 quorum system validates failover decision.
6.2 GPUaaS High Availability
- Kubernetes control plane: 3 or 5 etcd nodes for HA quorum.
- Slurm: slurmctld deployed as HA Kubernetes Deployment. MariaDB StatefulSet with VAST CSI persistent storage.
- MAAS: Region + Rack Controller HA configuration.
- VictoriaMetrics: HA deployment with replication. No data loss on single-node failure.
- DDN AI400: Redundant MGS/MDS nodes. Multiple OSTs for distributed parallel I/O.
6.3 Backup Strategy
| Data Source | Method | Frequency | Retention | Destination |
|---|---|---|---|---|
| CCP Keycloak PostgreSQL | Veritas backup agent | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| CCP Config MongoDB | Veritas backup agent | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| CCP Metrics MongoDB | Veritas backup agent | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| CCP K8s etcd | etcdctl snapshot | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| GPUaaS PostgreSQL (metering) | pg_dump (full + incremental) | Every 6 hours | 12–24 months | VAST S3 'NetBackup' |
| GPUaaS MongoDB (orchestration) | mongodump (full + incremental) | Every 6 hours | 12–24 months | VAST S3 'NetBackup' |
| GPUaaS etcd (K8s clusters) | etcdctl snapshot | Every 6 hours | Critical / indefinite | VAST S3 'NetBackup' |
7. API Integration Matrix
The following table provides a consolidated view of all integration APIs between CCP and the Coredge GPUaaS Platform. All APIs are secured with mTLS between systems and require a valid JWT in the Authorization header.
| # | Domain | API / Endpoint | Direction | Description |
|---|---|---|---|---|
| 1 | IAM | JWKS URI: /realms/{realm}/protocol/openid-connect/certs | CCP ← GPUaaS | Keycloak public key endpoint for JWT signature verification. |
| 2 | IAM | POST /realms/{realm}/protocol/openid-connect/token | CCP → GPUaaS | Service-to-service token exchange for integration calls. |
| 3 | Onboarding | POST /api/organizations | → CCP | Creates the tenant organization in CCP upon subscription. |
| 4 | Onboarding | POST /api/tenants | CCP → GPUaaS | CCP creates corresponding domain in GPUaaS after CCP tenant is created. |
| 5 | Compute | POST /api/baremetal-manager/allocate | CCP → GPUaaS | Allocate bare-metal GPU node to tenant. |
| 6 | Compute | GET /api/baremetal-manager/{node_id}/status | CCP → GPUaaS | Poll GPU node provisioning state. |
| 7 | Compute | POST /api/baremetal-manager/{node_id}/release | CCP → GPUaaS | Release bare-metal GPU node. |
| 8 | Webhook | POST /ccp/webhooks/baremetal/state-change | GPUaaS → CCP | Async state-change notification (ACTIVE, FAILED, RELEASED). |
| 9 | K8s | POST /api/orbiter/clusters/register | CCP ← GPUaaS | Register GPU K8s cluster with CCP Cloud Orbiter. |
| 10 | K8s | POST /api/orbiter/clusters/{id}/scale | CCP → GPUaaS | Scale GPU K8s cluster worker nodes. |
| 11 | K8s | DELETE /api/orbiter/clusters/{id} | CCP → GPUaaS | Tear down GPU K8s cluster. |
| 12 | Network | POST /api/network-manager/vpc | CCP → GPUaaS | Allocate tenant VRF/VLAN from pool (triggered alongside CCP VPC creation). |
| 13 | Storage | POST /api/storage/tenant | CCP → GPUaaS | Create DDN tenant directory + NodeMap + UFM PKey. |
| 14 | Metering | GET /api/metering/usage | CCP → GPUaaS | Retrieve aggregated GPU usage records for billing period. |
| 15 | Metering | POST /api/metering/quota/{id} | CCP → GPUaaS | Update tenant quota allocation after subscription change. |
| 16 | Metering | GET /api/metering/export/csv | CCP → GPUaaS | Download billing CSV for invoice generation. |
| 17 | Monitoring | GET /api/metrics/gpu/{tenant_id} | CCP → GPUaaS | Fetch GPU utilization metrics for CCP tenant dashboard. |
| 18 | Monitoring | POST /api/alerts/subscribe | CCP → GPUaaS | Register CCP webhook to receive GPU infrastructure alerts. |
| 19 | Monitoring | POST /ccp/webhooks/alerts | GPUaaS → CCP | GPU alert delivery to CCP Notification Service (SMTP/SMS). |
8. Error Handling and Resilience
8.1 Error Response Standard
All integration API responses follow a consistent error schema:
| HTTP Code | Error Type | Handling Strategy |
|---|---|---|
| 400 | Bad Request | Validate request schema before retrying. Do not retry. Log with correlation ID and surface to tenant. |
| 401 | Unauthorized | Refresh JWT token and retry once. If still 401, re-initiate auth flow. Log token expiry event. |
| 403 | Forbidden | Do not retry. RBAC/ABAC rejection. Log for audit. Surface 'Insufficient permissions' to tenant. |
| 404 | Not Found | Do not retry. Resource may have been deleted. Trigger reconciliation to sync state. |
| 409 | Conflict | Idempotency check: the resource may already exist. Query the GET endpoint to verify state before retrying. |
| 422 | Quota Exceeded | Do not retry. Prompt tenant to upgrade subscription. Log quota breach event. |
| 429 | Rate Limited | Implement exponential backoff with jitter. Respect Retry-After header. |
| 503 | Service Unavailable | Exponential backoff: initial 1s, max 60s, max 5 retries. Alert Ops team if sustained. |
| 5xx | Server Error | Retry with exponential backoff (max 3 retries). Trigger circuit breaker after 5 consecutive failures. |
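The table above reduces to a small dispatch function on the CCP client side. The following sketch maps status codes to the table's strategies; the strategy labels are illustrative.

```python
def classify_response(status: int) -> str:
    """Map an integration-API status code to a handling strategy (table above)."""
    if status < 400:
        return "ok"
    if status == 401:
        return "refresh-token-and-retry-once"
    if status == 409:
        return "verify-state-then-retry"    # idempotency check via the GET endpoint
    if status in (400, 403, 404, 422):
        return "fail-fast"                  # do not retry; log and surface to tenant
    if status == 429 or status >= 500:
        return "retry-with-backoff"         # jittered exponential backoff (Section 8.2)
    return "fail-fast"
```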
8.2 Retry and Circuit Breaker Policy
| Integration Call | Max Retries | Initial Backoff | Max Backoff | Circuit Breaker |
|---|---|---|---|---|
| BM Node Provisioning | 3 | 5 seconds | 60 seconds | 5 failures / 30s window |
| Quota Check | 2 | 500 ms | 5 seconds | 10 failures / 10s window |
| Token Refresh | 1 | Immediate | — | 3 failures → re-auth |
| Metering Export | 3 | 2 seconds | 30 seconds | 5 failures / 60s window |
| Alert Webhook Delivery | 5 | 1 second | 120 seconds | Dead-letter queue after 5 failures |
| State Change Webhook | 5 | 2 seconds | 60 seconds | Dead-letter queue after 5 failures |
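A sketch of the backoff and circuit-breaker policy from the table, shown with the BM Node Provisioning parameters; the full-jitter variant is an assumption (the table specifies backoff bounds, not the jitter algorithm).

```python
import random
import time

def retry_with_backoff(call, max_retries: int = 3, initial: float = 5.0, cap: float = 60.0):
    """Jittered exponential backoff, parameterized per the BM Node Provisioning row."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise
            delay = min(cap, initial * (2 ** attempt))
            time.sleep(random.uniform(0, delay))   # full jitter avoids synchronized retries

class CircuitBreaker:
    """Open after `threshold` failures inside a sliding `window` (seconds)."""
    def __init__(self, threshold: int = 5, window: float = 30.0):
        self.threshold, self.window, self.failures = threshold, window, []

    def record_failure(self) -> bool:
        now = time.time()
        self.failures = [t for t in self.failures if now - t < self.window] + [now]
        return len(self.failures) >= self.threshold   # True: stop calling and alert Ops
```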
9. Pre-Requisites and Deployment Considerations
9.1 CCP Pre-Requisites
- Wildcard SSL certificates for CCP hosting and dynamic customer account URLs
- Load Balancer VIPs for each CCP endpoint (console, API gateway, orbiter, auth)
- DNS server with credentials to create dynamic domains per customer account
- Accessible container registry for CCP component images
- Kubernetes-compliant storage with high IOPS performance (NVMe-backed NFS)
- SMTP server credentials for CCP Notification Service
- NTP and DNS server connectivity
- Connectivity and API credentials to integrate with the platform
- Private network link to Coredge GPUaaS integration API endpoints
9.2 Coredge GPUaaS Pre-Requisites
- MAAS HA controller accessible at 172.26.5.8 with IPMI credentials for all GPU node BMCs
- Cisco NDFC access (REST API) for VRF/VLAN automation
- NVIDIA UFM management interface access for InfiniBand PKey management
- DDN AI400 MGS SSH access for Storage Plugin tenant directory management
- VAST Data CSI driver deployed in GPUaaS management Kubernetes cluster
- Palo Alto firewall API access for tenant network rule management
- CCP Keycloak JWKS URI reachable from Coredge GPUaaS (for JWT validation)
- Private network link to CCP for webhook delivery and integration API calls
- NVIDIA GPU drivers and GPU Operator images in accessible container registry
9.3 Deployment Constraints
- CCP must be deployed in the control plane of each availability zone, not in the workload pod.
- The Coredge GPUaaS management cluster must be on a dedicated infrastructure separate from GPU workload nodes.
- All VMs within a cluster (Postgres, Kubernetes) must have anti-affinity rules enabled to prevent co-location on a single physical host.
- Database clusters use a 3+3 node setup (3 VMs per AZ). The two-AZ setup requires manual failover scripts (developed by Coredge team) due to the absence of a third AZ for automatic arbiter node placement.
- OpenFGA Postgres DB and Global MongoDB VMs are stretched across the 2 AZs per region and routed accordingly.
10. RACI Matrix — Integration Responsibilities
💡Note
R = Responsible | A = Accountable | C = Consulted | I = Informed
| Task | R | A | C | I |
|---|---|---|---|---|
| CCP Major / Minor Upgrade | Coredge | Coredge | Coredge | Coredge |
| OS Patching — CCP Cluster VMs | Coredge | Coredge | Coredge | Coredge |
| CCP Kubernetes Cluster Patching | Coredge | Coredge | Coredge | Coredge |
| GPU Node OS Patching (MAAS golden image update) | Coredge | Coredge | Coredge | Coredge |
| Infrastructure for CCP Management Cluster | Coredge | Coredge | Coredge | Coredge |
| Storage Driver Plugin for CCP PVCs | Coredge | Coredge | Coredge | Coredge |
| SSL Certificates and LB Config | Coredge | Coredge | Coredge | Coredge |
| Keycloak Federation Configuration (CCP ↔ GPUaaS) | Coredge | Coredge | Coredge | Coredge |
| Integration APIs (CCP) | Coredge | Coredge | Coredge | Coredge |
| Integration API Development (CCP ↔ GPUaaS) | Coredge | Coredge | Coredge | Coredge |
| Cisco NDFC VRF/VLAN Configuration | Coredge | Coredge | Coredge | Coredge |
| Palo Alto Firewall Rule Management | Coredge | Coredge | Coredge | Coredge |
| DDN Storage Provisioning Setup | Coredge | Coredge | Coredge | Coredge |
| orbiter-metering Rate Card Configuration | Coredge | Coredge | Coredge | Coredge |
| Database Failover Script Development | Coredge | Coredge | Coredge | Coredge |
| Service Catalogue and Rate Card | Coredge | Coredge | Coredge | Coredge |
| Integration Testing Execution | Coredge | Coredge | Coredge | Coredge |
11. Integration Testing Strategy
11.1 Test Categories
| Test Type | Scope | Success Criteria |
|---|---|---|
| Unit / Contract Testing | Individual API endpoint request/response schema validation | 100% schema conformance. All error codes return correct structure. |
| Integration Testing | End-to-end provisioning flow: CCP → GPUaaS → bare-metal node ACTIVE | Node provisioned within SLA. Webhook received. Portal reflects ACTIVE state. |
| IAM / Auth Testing | JWT federation: CCP token accepted by GPUaaS orbiter-auth | All role mappings enforce correct RBAC + ABAC. Expired/tampered tokens rejected. |
| Quota Testing | Quota enforcement: resource creation blocked when quota exceeded | HTTP 422 returned immediately on quota breach. Dashboard reflects 90% warning. |
| Failover Testing | CCP AZ failover: traffic switches from AZ1 to AZ2 | Recovery within RTO. No data loss. Tenant sessions restored. |
| Billing Accuracy Testing | GPU metering: DCGM data flows through the full pipeline to the billing export | GPU-hours within 1% variance of expected. Correct tenant attribution. |
| Monitoring Integration | Alert delivery: GPU temp alert fires and reaches CCP Notification Service | Alert delivered within 60 seconds of threshold breach. |
| Performance Testing (CCP) | CCP portal load: 50,000 VMs and 200,000 pods under management | Portal response < 3s P95. No degradation at peak load. |
11.2 Exclusions from Test Scope
- Penetration testing (scheduled separately, not in scope of this integration IDD)
- Performance testing for infrastructure components other than CCP (handled by respective component owners)
- Day-2 operations testing for underlying bare-metal physical infrastructure
12. Open Items and Known Constraints
| # | Item | Status | Resolution Plan |
|---|---|---|---|
| 1 | Network Load Balancer — no out-of-the-box integration from CCP. Requires Automation Platform. | Open | Integration approach to be defined with Coredge Network team. Automation Platform API contract to be agreed. |
| 2 | Dynamic VRF creation in Coredge requires manual Palo Alto firewall rule addition. | Open | Pre-created (pooled) VRF assignment is the interim production approach. Automated firewall API integration on roadmap. |
| 3 | Database failover script (3+3 node setup, two-AZ region) needs joint development. | In Progress | Script to be developed collaboratively by Coredge. Target completion before MVP1 UAT. |
| 4 | NAT Gateway integration approach (SNAT / Software Appliance) is TBD. | Open | Integration approach to be confirmed by Network team post-MVP1 planning review. |
| 5 | VPN Gateway (Site-to-Site and Point-to-Site) integration uses Zscaler APIs — no out-of-the-box support. | Open | Zscaler API integration to be scoped and agreed as separate work item. |
| 6 | CCP API mapping for Party, Billing Account (BA), and Logical Subscriber Identity (LSI) entities is incomplete. | Open | Mapping to be finalized with Business team guidance. Required before go-live. |
| 7 | MVP2 and MVP3 integration requirements (CDN, DRaaS, MariaDB, NoSQL, Kafka, etc.) are deferred. | Deferred | To be defined post-MVP1. Out of scope for this IDD version. |
13. References
| # | Document | Owner | Version |
|---|---|---|---|
| 1 | Cloud Management Platform (CMP) for Cloud — High Level Design | Coredge | 1.10 (Aug 2025) |
| 2 | Coredge GPUaaS Platform — Technical Reference Document (Unified Architecture Guide) | Coredge Cloud Infrastructure | 1.0 (Feb 2026) |
| 3 | Coredge Cloud — Service Catalogue (portal) | Coredge | Current |
| 4 | Coredge Statement of Work — Cloud CMP Engagement | Coredge | Current |
| 5 | NIST Special Publication 800-53 Rev 5 — Security and Privacy Controls | NIST | Rev 5 |
| 6 | ISO/IEC 27001:2022 — Information Security Management | ISO/IEC | 2022 |
| 7 | Keycloak Server Administration Guide | Red Hat / Keycloak Community | Current |
| 8 | Cisco NDFC — REST API Reference | Cisco | Current |
| 9 | NVIDIA DCGM User Guide | NVIDIA | Current |
| 10 | DDN AI400 Lustre Administration Guide | DDN | Current |