
Integration Design Document

Acronyms & Abbreviations

| Acronym | Definition |
|---|---|
| BMaaS | Bare Metal as a Service |
| CCP | Cirrus Cloud Platform |
| CMP | Cloud Management Platform |
| GPUaaS | GPU as a Service — Runtime Environment (Coredge GPUaaS Platform) |
| IAM | Identity and Access Management |
| IB | InfiniBand |
| JWT | JSON Web Token |
| MAAS | Metal as a Service (Canonical) |
| NDFC | Nexus Dashboard Fabric Controller (Cisco) |
| OCP | OpenShift Container Platform |
| RBAC | Role-Based Access Control |
| ABAC | Attribute-Based Access Control |
| VPC | Virtual Private Cloud |
| VRF | Virtual Routing and Forwarding |
| VXLAN | Virtual Extensible LAN |

1. Purpose and Scope

1.1 Purpose

This Integration Design Document (IDD) defines the technical integration architecture, data flows, API contracts, security controls, and operational procedures for connecting the Cirrus Cloud Platform (CCP) with the Coredge GPUaaS Platform (Runtime Environment). The document serves as the authoritative reference for the development, testing, operations, and governance teams involved in delivering unified GPU-accelerated cloud services on Coredge's sovereign cloud infrastructure.

1.2 Scope

The integration covers the following functional domains:

  • Identity and Access Management (IAM) — Keycloak federation between CCP and GPUaaS
  • Compute — GPU as a Service provisioning via OpenStack APIs and Coredge Bare Metal
  • Container Orchestration — Kubernetes (Cloud Orbiter / OCP) cluster lifecycle integration
  • Network — VPC, VRF, VLAN, and VXLAN/EVPN fabric orchestration (Cisco NDFC)
  • Storage — NetApp (CCP) and DDN AI400 / VAST Data (GPUaaS) integration paths
  • Metering and Billing — orbiter-metering to billing pipeline
  • Monitoring and Alerting — Prometheus / VictoriaMetrics / Zabbix federation
  • Security — mTLS service mesh, RBAC/ABAC policy enforcement, audit trails

1.3 Out of Scope

  • Hardware procurement, cabling, and physical data center operations
  • Penetration testing or third-party security audits
  • Day-2 operations for underlying bare-metal physical infrastructure
  • Application-layer changes unrelated to integration APIs
  • MVP2 and MVP3 services unless explicitly noted

2. System Overview

2.1 Cirrus Cloud Platform (CCP)

Coredge is building a sovereign cloud platform for government and enterprise customers across India. The Cloud Management Platform layer is delivered by Cirrus Cloud Platform (CCP), which provides a hyper-scaler-grade self-service portal spanning IaaS, PaaS, and SaaS services. CCP is composed of the following orchestration layers:

  • Cirrus Cloud Platform (CCP) — Cloud Management Platform and self-service portal
  • Cirrus Cloud Platform (CCP) — IaaS orchestrator (OpenStack-based)
  • Cloud Orbiter — Kubernetes orchestrator for container workloads

CCP runs as a microservices application deployed on Kubernetes (management cluster) within each availability zone, with active-passive high availability across two AZs per region and global services replicated across North and South regions.

2.2 Coredge GPUaaS Platform — Runtime Environment

The Coredge GPUaaS Platform is a purpose-built, bare-metal GPU cloud that provisions, orchestrates, and meters NVIDIA (H100) GPU infrastructure. It delivers:

  • Bare Metal Provisioning via Canonical MAAS over IPMI/PXE
  • Kubernetes Cluster Orchestration on bare metal GPU nodes
  • Slurm HPC Cluster Orchestration via Slinky Operator
  • High-performance storage via DDN AI400 (Lustre/InfiniBand) and VAST Data (CSI/S3)
  • Network isolation via Cisco NDFC (VXLAN/EVPN, per-tenant VRFs)
  • Full metering pipeline (DCGM, Prometheus, VictoriaMetrics, orbiter-metering)

2.3 Integration Relationship

CCP acts as the unified customer-facing control plane. The Coredge GPUaaS Platform acts as a specialized compute substrate exposed through CCP service catalogue entries. When a customer provisions GPU as a Service (GPUaaS) through the CCP Cloud Portal, CCP delegates to Coredge platform APIs for resource lifecycle management, while retaining control of IAM, billing aggregation, quota governance, and customer onboarding.

| Concern | Owner: CCP | Owner: Coredge GPUaaS |
|---|---|---|
| Customer Portal / UI | CCP Self-Service Console | No direct portal exposure |
| Tenant Identity & SSO | Keycloak (CCP Auth) — federation | Keycloak (GPUaaS) — realm per tenant |
| Service Catalogue | CCP Service Catalogue + subscription | GPUaaS API endpoints |
| GPU Compute Provisioning | API delegation via OpenStack / Coredge APIs | Bare Metal Provisioning (MAAS) |
| Kubernetes Orchestration | Cloud Orbiter (CCP) for CaaS | compass-orchestrator for GPU K8s |
| Network Fabric | VPC / Firewall / LB (OpenStack + CheckPoint/Palo Alto) | VRF/VLAN (Cisco NDFC) for GPU nodes |
| Storage | NetApp (Block, Object, File) | DDN AI400 (GPU workloads), VAST (platform) |
| Metering & Billing | orbiter-metering aggregation, billing | DCGM, Prometheus, orbiter-metering |
| Monitoring | Zabbix, Grafana (CCP) | VictoriaMetrics, Prometheus, Grafana |
| Quota Enforcement | CCP quota service per tenant/cell | orbiter-metering + domain quota |
| Backup & DR | Veritas backup agent, geo-replicated object storage | VAST S3 (pg_dump, mongodump, etcd) |

3. Integration Architecture

3.1 Architecture Principles

  • API-First: All integrations are implemented through well-defined REST APIs or gRPC contracts with versioned endpoints.
  • Loose Coupling: CCP and GPUaaS communicate through defined interface contracts; internal implementation changes in either system must not break the integration.
  • Zero-Trust Security: Every inter-service call requires mutual authentication (mTLS) and JWT-based authorization. No implicit trust based on network location.
  • Single Source of Truth per Domain: CCP owns customer identity and billing records; Coredge GPUaaS owns GPU hardware state and real-time metrics.
  • Idempotency: All provisioning API calls must be idempotent. Retry logic must not produce duplicate resources.
  • Observability: Every integration point emits structured logs with correlation IDs for end-to-end request tracing.
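
The idempotency and observability principles above imply a small amount of plumbing on every outbound integration call. The sketch below, in Python, shows one way to generate a correlation ID, emit a structured log line, and attach the headers; the X-Correlation-ID header is named elsewhere in this document, while the Idempotency-Key header name is an illustrative assumption rather than a confirmed contract.

```python
import json
import logging
import uuid

log = logging.getLogger("ccp.integration")

def integration_headers(jwt_token: str, correlation_id: str | None = None) -> dict:
    """Build headers for an outbound CCP -> GPUaaS integration call (sketch)."""
    correlation_id = correlation_id or str(uuid.uuid4())
    # Structured log line so the same correlation ID can be traced end to end.
    log.info(json.dumps({"event": "outbound_call", "correlation_id": correlation_id}))
    return {
        "Authorization": f"Bearer {jwt_token}",
        "X-Correlation-ID": correlation_id,
        "Idempotency-Key": correlation_id,  # assumed header enabling safe retries
    }
```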

3.2 Logical Integration Layers

The integration is organized into four logical layers:

| Layer | Name | Description |
|---|---|---|
| L1 | Identity & Access | Federated Keycloak realms, JWT propagation, RBAC/ABAC synchronization between CCP and GPUaaS. |
| L2 | Control Plane | Resource lifecycle APIs (provision, scale, delete) for GPU bare metal, Kubernetes clusters, VPCs, and storage volumes. |
| L3 | Data Plane | Tenant networking fabric (VXLAN/EVPN), InfiniBand storage access (DDN/UFM), and workload traffic paths. |
| L4 | Observability & Billing | Metering event streams, quota synchronization, billing aggregation, monitoring federation, and audit log consolidation. |

3.3 Integration Topology

The following describes the high-level request flow from customer to GPU hardware:

  1. The customer accesses the CCP Cloud Portal (CCP Self-Service Console) with an authenticated identity, federated through CCP Keycloak.
  2. The customer places a GPU service order through the CCP service catalogue; Coredge calls the CCP onboarding API (POST /api/organizations) to create or update the tenant.
  3. CCP validates the request against tenant quota (orbiter-metering quota service), then dispatches a provisioning event to the Coredge GPUaaS API (baremetal-manager or compass-orchestrator) over the private integration network.
  4. Coredge GPUaaS executes the provisioning pipeline: network fabric setup (Cisco NDFC), IPMI/PXE boot (MAAS), OS install, GPU agent deployment, storage allocation (DDN/UFM), and cluster formation (K8s or Slurm).
  5. The provisioned resource is registered back in CCP inventory. Tenant gains self-service access from the CCP portal.
  6. GPU usage data flows from DCGM Exporter to Prometheus to VictoriaMetrics to orbiter-metering. CCP pulls aggregated billing records for invoice generation.

4. Integration Points

4.1 Identity and Access Management

4.1.1 Overview

Both CCP and the Coredge GPUaaS Platform use Keycloak as their Identity Provider. The integration federates these two Keycloak deployments so that a customer authenticated on CCP does not need to re-authenticate when their workloads are dispatched to the GPUaaS platform.

4.1.2 Federation Model

CCP Keycloak (master IdP) issues signed JWT tokens (RS256) containing tenant realm, role claims, and project/organization attributes. The Coredge GPUaaS Keycloak is configured as a relying party — it trusts tokens issued by CCP Keycloak after validating the token signature against the CCP Keycloak public key endpoint (JWKS URI).
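
A minimal sketch of the relying-party check described above, using the PyJWT library: GPUaaS orbiter-auth fetches the signing key from the CCP Keycloak JWKS endpoint and validates signature and expiry of an RS256 token. The realm name in the URL and the audience handling are illustrative assumptions.

```python
import jwt  # PyJWT

JWKS_URL = "https://ccp-keycloak.example.internal/realms/ccp/protocol/openid-connect/certs"
jwks_client = jwt.PyJWKClient(JWKS_URL)

def validate_ccp_token(token: str) -> dict:
    """Validate a CCP-issued JWT against the CCP Keycloak JWKS endpoint (sketch)."""
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    # Signature and expiry (exp) are verified here; a JWT error is raised on failure.
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        options={"verify_aud": False},  # audience enforcement depends on client configuration
    )
```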

💡 Note: Coredge serves as the canonical identity store; all customer accounts are created, modified, and deactivated exclusively there. CCP Keycloak federation uses SAML 2.0 or OIDC, depending on the provider's configuration.

4.1.3 Token Structure

| JWT Claim | Source | Usage in GPUaaS |
|---|---|---|
| sub | CCP Keycloak | User unique identifier — maps to Coredge domain user record |
| realm_name | CCP Keycloak | Tenant realm — maps to GPUaaS Domain (tenant isolation boundary) |
| roles | CCP Keycloak | RBAC roles (e.g., Cell Administrator, Cell VM Admin) — translated to GPUaaS Project Admin or Domain Admin |
| domain_id | CCP (custom claim) | CCP tenant identifier — maps to GPUaaS Domain ID |
| project_id | CCP (custom claim) | CCP cell/project identifier — maps to GPUaaS Project scope |
| org_id | CCP (custom claim) | CCP organization identifier — maps to GPUaaS Organization scope |
| exp | Keycloak | Token expiry — 5–15 minute TTL for access tokens |
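
For reference, a decoded access-token payload matching the claim table above might look like the following; all identifier values are illustrative placeholders, not real tenant data.

```python
# Example decoded JWT payload (illustrative values only).
example_claims = {
    "sub": "f3b2c1d0-1234-4a5b-9c8d-0e1f2a3b4c5d",
    "realm_name": "tenant-acme",
    "roles": ["Cell Administrator", "Cell VM Admin"],
    "domain_id": "dom-00042",
    "project_id": "cell-gpu-01",
    "org_id": "org-acme",
    "exp": 1767225600,  # epoch seconds; 5-15 minute TTL from issuance
}
```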

4.1.4 Role Mapping

| CCP Role | GPUaaS Role | Effective Permissions |
|---|---|---|
| Tenant Super Administrator | Platform Super Admin (scoped to domain) | Full control within tenant domain |
| Tenant Administrator | Domain Admin | Full domain control: clusters, BM, networks, quotas |
| Cell Administrator | Project Admin | Full project control: GPU clusters, storage, networks |
| Cell VM Admin | Project Admin (Compute scope) | GPU node and cluster lifecycle management |
| Cell Container Admin | Project Admin (K8s scope) | Kubernetes cluster creation, management, workload deployment |
| Cell Viewer / Tenant Viewer | Viewer | Read-only access to resources and dashboards |
| Tenant Billing Admin | Domain Admin (Billing scope only) | Metering dashboard, cost reports, quota usage |

4.1.5 Authentication Flow

  1. Customer logs into CCP Cloud Portal — CCP redirects to identity provider (SAML/OIDC).
  2. The identity provider authenticates the user (with optional MFA: TOTP, SMS, or hardware key).
  3. Coredge issues an assertion to CCP Keycloak; CCP Keycloak issues a signed JWT containing domain, org, and project claims.
  4. CCP backend services validate the JWT on every request (signature + expiry + realm).
  5. When CCP dispatches a request to the Coredge GPUaaS API, it includes the JWT in the Authorization header. GPUaaS orbiter-auth validates the token against the CCP Keycloak JWKS endpoint.
  6. orbiter-auth performs RBAC (role check) + ABAC (domain/project attribute check) before allowing the request to proceed.
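
A minimal sketch of the combined check in step 6: RBAC verifies that the caller holds an allowed role, then ABAC verifies that the token's domain/project attributes match the scope of the requested resource. Function, parameter, and role-set names are illustrative, not the actual orbiter-auth interface.

```python
ALLOWED_ROLES = {"Domain Admin", "Project Admin", "Platform Super Admin"}

def authorize(claims: dict, target_domain: str, target_project: str) -> bool:
    """RBAC + ABAC decision for a single request (sketch)."""
    # RBAC: the caller must hold at least one role permitted for the operation.
    if not ALLOWED_ROLES.intersection(claims.get("roles", [])):
        return False
    # ABAC: token attributes must match the target domain and project scope.
    if claims.get("domain_id") != target_domain:
        return False
    if claims.get("project_id") not in (None, target_project):
        return False
    return True
```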

4.2 GPU Compute — Bare Metal and VM Provisioning

4.2.1 GPU as a Service Integration

GPU as a Service (GPUaaS) is listed in the CCP Service Catalogue. Integration is via OpenStack APIs (for VM-based GPU slices where applicable) and via the Coredge baremetal-manager API for dedicated bare-metal GPU node provisioning. Both paths are fronted by the CCP API Gateway.

4.2.2 Bare Metal GPU Provisioning — API Contract

| Method | Endpoint | Request Body | Description |
|---|---|---|---|
| POST | /api/baremetal-manager/allocate | {flavor, os_image, network_id, project_id, tenant_id} | Allocate a bare-metal GPU node to a tenant. Triggers NDFC network setup and MAAS provisioning. |
| GET | /api/baremetal-manager/{node_id}/status | — | Poll provisioning state: PENDING, PROVISIONING, ACTIVE, FAILED. |
| POST | /api/baremetal-manager/{node_id}/release | {drain: true} | Drain workloads and release the node back to the available pool. |
| GET | /api/baremetal-manager/flavors | — | List available GPU node flavors (H100 8-GPU, etc.). |
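
The sketch below illustrates the allocate-then-poll pattern implied by this contract, assuming a requests-based client over the mTLS private integration network. The base URL, certificate paths, flavor/image names, response field node_id, and polling interval are all illustrative assumptions.

```python
import time
import requests

BASE = "https://gpuaas-api.example.internal"  # private integration endpoint (illustrative)
MTLS = ("/etc/ccp/client.crt", "/etc/ccp/client.key")

def allocate_and_wait(jwt_token: str, tenant_id: str, project_id: str) -> dict:
    """Allocate a bare-metal GPU node and poll until a terminal state (sketch)."""
    headers = {"Authorization": f"Bearer {jwt_token}"}
    body = {
        "flavor": "h100-8gpu",
        "os_image": "ubuntu-22.04-gpu-golden",
        "network_id": "vrf-pool-auto",
        "project_id": project_id,
        "tenant_id": tenant_id,
    }
    resp = requests.post(f"{BASE}/api/baremetal-manager/allocate",
                         json=body, headers=headers, cert=MTLS, timeout=30)
    resp.raise_for_status()
    node_id = resp.json()["node_id"]  # assumed response field

    while True:  # poll the status endpoint until ACTIVE or FAILED
        status = requests.get(f"{BASE}/api/baremetal-manager/{node_id}/status",
                              headers=headers, cert=MTLS, timeout=10).json()
        if status["state"] in ("ACTIVE", "FAILED"):
            return status
        time.sleep(30)
```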

4.2.3 End-to-End Provisioning Data Flow

  1. CCP Self-Service Console receives a GPU node provisioning request from tenant.
  2. CCP API Gateway validates the JWT and forwards to platform microservice.
  3. Platform service checks quota against orbiter-metering. If quota exceeded, returns HTTP 422 with quota-exceeded error to the tenant.
  4. CCP calls POST /api/baremetal-manager/allocate on Coredge GPUaaS over the private integration network (mTLS).
  5. Coredge baremetal-manager triggers: (a) NDFC network fabric setup — VRF + VLAN allocation, (b) MAAS IPMI power-on + PXE boot, (c) OS install via golden image, (d) Cloud-init, agent deployment.
  6. Storage allocation: DDN tenant directory created, NodeMap assigned, IB PKey created via UFM.
  7. Agent registers with GPUaaS portal via gRPC (port 8030/8040). Admin approves host.
  8. Provisioning state transitions to ACTIVE. GPUaaS notifies CCP via webhook (POST /ccp/webhooks/baremetal/state-change).
  9. CCP registers the node in its inventory, updates tenant resource view. WebSocket notification sent to portal.
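
Step 8 above delivers an asynchronous state-change webhook to CCP. A minimal sketch of a CCP-side receiver is shown below using FastAPI; the payload shape and the returned acknowledgement are assumptions, and in the target design the channel is additionally protected by mTLS on the private integration network.

```python
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

@app.post("/ccp/webhooks/baremetal/state-change")
async def baremetal_state_change(event: dict,
                                 x_correlation_id: str = Header(default="")):
    """Receive a GPUaaS bare-metal state-change notification (sketch)."""
    if event.get("state") not in ("ACTIVE", "FAILED", "RELEASED"):
        raise HTTPException(status_code=400, detail="unknown state")
    # Here CCP would update its inventory and push a WebSocket notification
    # to the tenant portal (step 9).
    return {"acknowledged": True, "correlation_id": x_correlation_id}
```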

4.3 Container Orchestration — Kubernetes

4.3.1 Integration Model

CCP delivers Container as a Service (CaaS) via Cloud Orbiter, which supports both OCP (OpenShift Container Platform) and standard Kubernetes clusters. The Coredge GPUaaS Platform delivers GPU-accelerated Kubernetes clusters (via compass-orchestrator) on bare-metal nodes. These two systems integrate at the cluster registration and cluster agent level.

GPU Kubernetes clusters provisioned by Coredge are registered back into CCP Cloud Orbiter using the Cluster Agent protocol (gRPC, port 8030/8040), enabling CCP tenants to manage GPU workloads from the unified CCP portal.

4.3.2 Cluster Registration API

| Method | Endpoint (CCP Cloud Orbiter) | Description |
|---|---|---|
| POST | /api/orbiter/clusters/register | Register an externally provisioned GPU K8s cluster with Cloud Orbiter. Body: {cluster_name, kubeconfig_secret, node_count, gpu_type, tenant_id, project_id}. |
| GET | /api/orbiter/clusters/{cluster_id} | Get cluster status, node health, GPU availability, addon state. |
| POST | /api/orbiter/clusters/{cluster_id}/scale | Scale worker nodes up or down. GPUaaS executes kubeadm join/drain. |
| DELETE | /api/orbiter/clusters/{cluster_id} | Initiate cluster teardown: drain → kubeadm reset → deregister → release BM nodes. |

4.3.3 GPU Operator Integration

When a GPU Kubernetes cluster is provisioned, the Coredge compass-orchestrator deploys the NVIDIA GPU Operator, whose device plugin DaemonSet registers GPU resources with the Kubernetes node resource API (nvidia.com/gpu: 8 per node). These resources are propagated to Cloud Orbiter and are available for workload scheduling via node selectors and resource requests in CCP-deployed workloads.
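
An illustrative workload manifest (expressed here as a Python dict, equivalent to a YAML manifest applied via kubectl) requesting the GPU resources advertised by the GPU Operator. The node-selector label key and image path are assumptions; the resource name nvidia.com/gpu is the standard one exposed by the NVIDIA device plugin.

```python
# Pod spec requesting one full 8-GPU node (illustrative manifest).
training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "llm-train-0"},
    "spec": {
        "nodeSelector": {"gpu.coredge.io/flavor": "h100-8gpu"},  # assumed label
        "containers": [{
            "name": "trainer",
            "image": "registry.example.internal/ml/trainer:latest",
            "resources": {"limits": {"nvidia.com/gpu": 8}},
        }],
    },
}
```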

4.4 Network Integration

4.4.1 Overview

CCP manages tenant VPC, Firewall, and Load Balancer resources via OpenStack APIs and CheckPoint/Palo Alto integrations. The Coredge GPUaaS Platform manages GPU node networking via Cisco NDFC (VXLAN/EVPN). These two network domains are connected through a defined inter-domain routing policy that allows GPU workload traffic to reach CCP-managed services (e.g., load balancers, object storage endpoints) while maintaining tenant isolation.

4.4.2 Network Segmentation Model

| Network Segment | Owner | Integration Point |
|---|---|---|
| Tenant VPC (CCP) | CCP / OpenStack | Customer workload network. GPU nodes egress through CCP-managed VPC gateways for external services. |
| Tenant VRF (GPUaaS) | Coredge / Cisco NDFC | GPU bare-metal node isolation. 4 VLANs per tenant: Control Plane, GPU Worker, LB, Reserved. |
| GPU Node Management (VLAN 901) | Coredge | Cluster agent gRPC communications to GPUaaS portal. |
| Provisioning VRF (VLAN 902) | Coredge | MAAS PXE/DHCP relay for OS provisioning. Not exposed to tenants. |
| InfiniBand Fabric (UFM PKey) | Coredge / NVIDIA UFM | GPU-to-GPU RDMA (NCCL) and GPU-to-DDN storage (Lustre). Per-tenant PKey isolation. |
| External / Internet Gateway | CCP (NAT Gateway) | GPU nodes access external services (package updates, ML model registries) through the CCP NAT Gateway. |
| CCP–GPUaaS Private Link | Both | Dedicated private network segment for integration API calls between CCP and Coredge GPUaaS. mTLS enforced. |

4.4.3 VPC Lifecycle Coordination

When a tenant requests a VPC through the CCP portal, CCP calls OpenStack APIs to create the VPC constructs. If GPU bare-metal nodes are allocated to the tenant in the same provisioning request, CCP additionally notifies the Coredge network-manager to allocate a matching tenant VRF. The VRF is linked to the tenant VPC through a pre-configured L3 routing policy on the Palo Alto firewall, enabling east-west traffic between CCP VMs and GPU bare-metal nodes within the same tenant boundary.

💡 Note: Dynamic VRF creation in Coredge GPUaaS currently requires a manual firewall rule update on the Palo Alto firewall. Pre-created (pooled) VRF allocation is the preferred path for production deployments.

4.5 Storage Integration

4.5.1 Storage Architecture Mapping

| Use Case | CCP Storage (NetApp) | GPUaaS Storage | Integration Notes |
|---|---|---|---|
| VM Block Storage | NetApp Block (iSCSI/FC) | Not applicable | CCP-owned. No integration required. |
| Object Storage (S3) | NetApp S3-compatible | VAST S3 (platform internal) | Tenant S3 endpoints served from CCP NetApp. GPUaaS uses VAST S3 internally for backup/config only. |
| File Storage (NFS) | NetApp NFS | DDN Lustre (GPU workloads) | Separate NFS mounts. GPU workloads use DDN for high-throughput AI/ML data access. |
| GPU Training Data | Not applicable | DDN AI400 (Lustre over IB) | Accessed by GPU bare-metal nodes via InfiniBand. 4 Tb/s per node aggregate throughput. |
| Platform DB Backup | NetApp-backed Veritas agent | VAST S3 (pg_dump, mongodump, etcd) | CCP backup: Veritas. GPUaaS backup: VAST S3. Cross-region replication applies to CCP data only. |
| CCP Config & Logs | Object storage in local region (5 TB) | Not applicable | Managed by CCP only. Backup copied cross-region. |

4.5.2 Storage Provisioning Data Flow (GPU Workloads)

  1. Tenant subscribes to GPU service — CCP creates tenant record and notifies Coredge GPUaaS to create tenant storage allocation.
  2. Coredge Storage Plugin (via SSH to DDN MGS) creates tenant directory on Lustre: /lustre/{tenant-id}/.
  3. NVIDIA UFM creates an InfiniBand PKey for the tenant. All GPU node mlx GUIDs are added as PKey members.
  4. A DDN NodeMap is created, mapping the tenant's IB IP address ranges to the tenant directory. Only mapped IPs can mount the filesystem.
  5. NFS over VIP is configured for environments requiring Ethernet storage access.
  6. Quotas are set on the tenant directory according to the subscribed service tier.
  7. NodeMap is activated — tenant's GPU nodes can now access the DDN filesystem over InfiniBand with hardware-level isolation enforced by both UFM PKey and DDN NodeMap (dual isolation).

4.6 Metering, Billing, and Quota

4.6.1 Metering Pipeline

The Coredge GPUaaS Platform meters GPU resource consumption at hardware level (15-second granularity) using DCGM Exporter, Node Exporter, Slurm job accounting (slurmdbd), and storage/network telemetry. The orbiter-metering service aggregates these raw metrics into billable usage records and exposes a billing export API consumed by CCP for invoice generation.

| Metric | Source | Granularity | Billing Unit |
|---|---|---|---|
| GPU-Hours | DCGM Exporter (per GPU, per node) | 15-sec raw → hourly billable | GPU-node-hours × rate card |
| CPU-Hours | Node Exporter | 15-sec → hourly | vCPU-hours × rate card |
| Bare Metal Node-Hours | baremetal-manager (MongoDB state) | Allocation start/end timestamp | Node-hours × rate card |
| K8s Cluster-Hours | compass-orchestrator (MongoDB) | Cluster create/delete timestamp | Cluster-hours × rate card |
| Slurm Job GPU-Hours | slurmdbd → MariaDB | Per job on completion | AllocGRES (GPU count) × Elapsed × rate |
| Storage (DDN) | DDN Storage Plugin | Per tenant directory, polled | GB-hours × rate card |
| InfiniBand Bandwidth | NVIDIA UFM per PKey | Real-time, per tenant PKey | TB transferred × rate (if applicable) |
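
To make the granularity-to-billing-unit conversion in the first row concrete, the sketch below rolls 15-second DCGM allocation samples up into billable GPU-hours. The data shape and the rate value are illustrative assumptions; actual rates come from the CCP rate card.

```python
SAMPLE_INTERVAL_SEC = 15  # DCGM sampling interval used by the metering pipeline

def gpu_hours(samples_per_gpu: dict[str, int]) -> float:
    """samples_per_gpu maps a GPU UUID to the count of 15-second samples in
    which that GPU was allocated to the tenant during the billing window."""
    total_seconds = sum(count * SAMPLE_INTERVAL_SEC for count in samples_per_gpu.values())
    return total_seconds / 3600.0

# Example: 8 GPUs allocated for one full hour (240 samples x 15 s each) -> 8.0 GPU-hours.
usage = gpu_hours({f"GPU-{i}": 240 for i in range(8)})
billable_amount = usage * 250.0  # illustrative rate per GPU-hour
```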

4.6.2 Billing Export API (Coredge → CCP)

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/metering/usage?tenant_id={id}&from={ts}&to={ts} | Retrieve aggregated usage records for a tenant within a time window. Returns GPU-hrs, CPU-hrs, node-hrs, cluster-hrs, storage-GB, per project. |
| GET | /api/metering/quota/{tenant_id} | Current quota usage vs. allocated quota per resource type. |
| POST | /api/metering/quota/{tenant_id} | Update quota allocation (called by CCP when a customer upgrades/downgrades a subscription). |
| GET | /api/metering/export/csv?tenant_id={id}&period={month} | Download CSV billing export for a billing period. Compatible with the invoice import format. |

4.6.3 Quota Synchronization Flow

  1. Customer subscribes/upgrades GPU service tier on CCP portal.
  2. Coredge calls CCP quota management API to update tenant allocation.
  3. CCP propagates the new quota to Coredge GPUaaS via POST /api/metering/quota/{tenant_id}.
  4. orbiter-metering updates the in-memory quota counter. New quota takes effect immediately for subsequent resource requests.
  5. If a resource request would exceed quota, the Coredge API returns HTTP 422 (Quota Exceeded). CCP displays the error to the tenant with a prompt to upgrade their subscription.
  6. Quota dashboard (80% and 90% warning thresholds) is updated in both CCP portal and Coredge tenant dashboard.
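
Steps 4 and 5 describe quota enforcement at request time. The sketch below shows one possible shape of that check in a FastAPI service, returning HTTP 422 on breach as specified above; the endpoint path, quota data structure, and request body are illustrative assumptions and not the actual orbiter-metering interface.

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

# In-memory quota counters per tenant (illustrative data shape).
QUOTA = {"tenant-acme": {"gpu_nodes": {"allocated": 6, "limit": 8}}}

@app.post("/api/metering/quota-check/{tenant_id}")  # assumed endpoint for the sketch
def quota_check(tenant_id: str, req: dict):
    counters = QUOTA.get(tenant_id, {}).get(req["resource"], {"allocated": 0, "limit": 0})
    if counters["allocated"] + req.get("requested", 1) > counters["limit"]:
        # Quota breach is rejected with HTTP 422 and surfaced by CCP to the tenant.
        raise HTTPException(status_code=422, detail="Quota Exceeded")
    return {"allowed": True, "remaining": counters["limit"] - counters["allocated"]}
```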

4.7 Monitoring and Alerting Integration

4.7.1 Monitoring Stack Federation

| Component | CCP | GPUaaS (Coredge) | Integration |
|---|---|---|---|
| Metrics Collection | Zabbix Agent (node, service) | DCGM, Node, K8s State Exporters | Federated scrape |
| Metrics Storage | Zabbix DB | VictoriaMetrics (HA) | API bridge for CCP consumption |
| Alerting | Zabbix Alert Rules | Prometheus Alertmanager | CCP Notification Service (SMTP/SMS) |
| Dashboards | Grafana (CCP-managed) | External + Internal Grafana (GPUaaS) | GPU dashboards embedded in CCP portal via iFrame / API |
| Log Aggregation | APM/NPM/IPM (CCP Log Analyzer) | VictoriaMetrics + log rotation | Log forwarding via syslog/agent to CCP Log Analyzer |
| Audit Logs | CCP audit trail (ordr_mgmt) | Correlation ID log stream (S3 every 6 h) | Cross-correlated via X-Correlation-ID header |

4.7.2 Alert Thresholds (GPU Infrastructure)

| Alert | Warning | Critical | Notification Target |
|---|---|---|---|
| GPU Temperature | > 80 °C for 5 min | > 90 °C for 2 min | Coredge Ops + CCP Notification Service → Tenant |
| GPU Memory Utilization | > 90% for 15 min | > 95% | CCP Alarm Service → Tenant dashboard |
| CPU Utilization (K8s nodes) | > 75% | > 90% | Coredge Ops team |
| Memory Utilization | > 70% | > 85% | Coredge Ops team |
| DDN Storage Quota | > 80% used | > 90% used | CCP Notification Service → Tenant |
| Cluster Node Not Ready | 1 node down > 2 min | 2+ nodes down | Coredge Ops + CCP Alarm Service → Tenant |
| Storage Latency (DDN) | > 10 ms | > 15 ms | Coredge Ops team |

4.8 Tenant Onboarding Integration

4.8.1 End-to-End Onboarding Flow

  1. Customer self-registers on CCP or is onboarded by the Coredge Business team.
  2. Customer subscribes to Cirrus Cloud Platform. The platform calls CCP POST /api/organizations with party/billing account details.
  3. CCP Keycloak auto-provisions a tenant realm. Default roles (Tenant Super Administrator, Tenant Administrator) are created.
  4. CCP creates default resources: project/cell/VPC in the default region, default service catalogue.
  5. CCP notifies Coredge GPUaaS tenant onboarding API (POST /api/tenants) to create the corresponding domain, allocate a VRF pool (4 VLANs), and set initial resource quotas.
  6. Coredge GPUaaS auto-provisions: (a) Keycloak realm for the domain, (b) VRF/VLAN allocation in Cisco NDFC, (c) Initial storage directory structure in DDN Lustre, (d) Quota records in orbiter-metering.
  7. Tenant administrator receives credentials and can log in to CCP portal. CCP identity federation is active.
  8. Resource hierarchy is enforced: Tenant (mapped to LSI) → Cell → Resources (in CCP); Domain → Organization → Project (in GPUaaS).
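
Step 5 notifies the GPUaaS tenant onboarding API. An illustrative request body for POST /api/tenants is sketched below; the field names are assumptions aligned with the entities described in this section, not a confirmed schema.

```python
# Illustrative POST /api/tenants payload (field names are assumptions).
new_tenant_request = {
    "tenant_id": "tenant-acme",        # CCP tenant / GPUaaS domain (1:1 mapping)
    "domain_name": "acme",
    "vrf_pool": {"vlan_count": 4},     # Control Plane, GPU Worker, LB, Reserved
    "quotas": {"gpu_nodes": 8, "k8s_clusters": 2, "ddn_storage_gb": 50_000},
    "keycloak_realm": "tenant-acme",   # federated with the corresponding CCP realm
}
```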

4.8.2 Coredge to GPUaaS Entity Mapping

| Coredge Entity | CCP (ACP) Entity | GPUaaS Entity | Notes |
|---|---|---|---|
| Party | — | — | Master entity |
| Billing Account (BA) | — | — | Billing scope |
| Logical Subscriber Identity (LSI) | — | — | Subscriber record |
| Tenant | Tenant | Domain (Tenant) | 1:1:1 mapping enforced |
| — | Cell | Project | Multiple cells per tenant allowed |
| — | Resources | Resources (BM/K8s/Storage) | Scoped within cell/project |

5. Security Architecture

5.1 Security Principles

  • Zero Trust: No implicit trust between any systems. Every request authenticated and authorized regardless of network origin.
  • Least Privilege: Users and services receive only the minimum permissions required for their function.
  • Defense in Depth: Multiple independent security controls at network, identity, data, and application layers.
  • Encryption Everywhere: All data in transit encrypted with TLS 1.2+ / mTLS. All data at rest encrypted with AES-256.
  • Auditability: All actions logged with correlation IDs, timestamps, and user identity. Logs retained per compliance requirements.

5.2 Security Controls per Layer

| Layer | Control | Implementation |
|---|---|---|
| Identity | Federated IAM | Keycloak with SAML/OIDC federation. RS256-signed JWTs. 5–15 min access token TTL. Single-use refresh tokens. |
| Identity | MFA | TOTP, SMS, email, hardware keys. Configurable per realm and per role. |
| Identity | Session Management | Admin force-logout. Token revocation. Correlation ID tracking (X-Correlation-ID header). |
| Network | Transport Encryption | TLS 1.2+ on all external endpoints. HSTS enforced. Cipher: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384. |
| Network | mTLS (Service Mesh) | Mutual TLS with automated PKI for all inter-service calls (CCP microservices, Coredge microservices, cross-system integration). |
| Network | Network Segmentation | Palo Alto stateful firewall between orchestration and GPU nodes. Per-tenant VRF isolation. InfiniBand PKey per tenant. |
| Network | API Gateway | CCP API Gateway enforces authentication and rate limiting on all inbound API calls. |
| Data | Encryption at Rest | AES-256 for all stored data (databases, object storage, backups). Key management via platform PKI. |
| Data | Tenant Data Isolation | Database-level tenant tagging. RBAC + ABAC enforcement. Grafana data source isolation per tenant. |
| Application | RBAC + ABAC | orbiter-auth performs role check AND attribute (domain/project/org) check on every request. |
| Application | Image Security | CIS-benchmarked golden OS images. Non-privileged containers. Vulnerability scanning in CI/CD pipeline. |
| Operations | Audit Logging | All provisioning, access, and configuration changes logged with user ID, timestamp, correlation ID. Backed up to S3 every 6 hours. |
| Operations | Compliance | NIST 800-53 Rev 5 (AC, AU), ISO/IEC 27001 (A.9, A.10), HIPAA (IAM, RBAC, audit). |

6. High Availability and Disaster Recovery

6.1 CCP High Availability

  • Each region has two CCP clusters: AZ1 (Primary / Active) and AZ2 (Standby / Passive).
  • 3-node web layer (reverse proxy) in DMZ per AZ. Kubernetes cluster: 3 master + 5 worker nodes per AZ.
  • PostgreSQL: Active-Passive with Logical/Streaming Replication across AZs. 3-node cluster per AZ (3+3 node setup, with arbiter VM for manual failover).
  • MongoDB: Active-Passive replication within region. MongoDB Active-Active replication (change-stream) for global services (tenant/project/user metadata) across North and South regions.
  • OpenFGA (AuthZ DB): Active-Passive between two regions. Writes always go to primary region. Read-heavy access pattern.
  • Global GSLB probe detects Active cluster failure and routes traffic to Passive. Internal 2n+1 quorum system validates failover decision.

6.2 GPUaaS High Availability

  • Kubernetes control plane: 3 or 5 etcd nodes for HA quorum.
  • Slurm: slurmctld deployed as HA Kubernetes Deployment. MariaDB StatefulSet with VAST CSI persistent storage.
  • MAAS: Region + Rack Controller HA configuration.
  • VictoriaMetrics: HA deployment with replication. No data loss on single-node failure.
  • DDN AI400: Redundant MGS/MDS nodes. Multiple OSTs for distributed parallel I/O.

6.3 Backup Strategy

| Data Source | Method | Frequency | Retention | Destination |
|---|---|---|---|---|
| CCP Keycloak PostgreSQL | Veritas backup agent | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| CCP Config MongoDB | Veritas backup agent | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| CCP Metrics MongoDB | Veritas backup agent | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| CCP K8s etcd | etcdctl snapshot | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| GPUaaS PostgreSQL (metering) | pg_dump (full + incremental) | Every 6 hours | 12–24 months | VAST S3 'NetBackup' |
| GPUaaS MongoDB (orchestration) | mongodump (full + incremental) | Every 6 hours | 12–24 months | VAST S3 'NetBackup' |
| GPUaaS etcd (K8s clusters) | etcdctl snapshot | Every 6 hours | Critical / indefinite | VAST S3 'NetBackup' |

7. API Integration Matrix

The following table provides a consolidated view of all integration APIs between CCP and the Coredge GPUaaS Platform. All APIs are secured with mTLS between systems and require a valid JWT in the Authorization header.

| # | Domain | API / Endpoint | Direction | Description |
|---|---|---|---|---|
| 1 | IAM | JWKS URI: /realms/{realm}/protocol/openid-connect/certs | CCP ← GPUaaS | Keycloak public key endpoint for JWT signature verification. |
| 2 | IAM | POST /realms/{realm}/protocol/openid-connect/token | CCP → GPUaaS | Service-to-service token exchange for integration calls. |
| 3 | Onboarding | POST /api/organizations | → CCP | Creates the tenant organization in CCP upon subscription. |
| 4 | Onboarding | POST /api/tenants | CCP → GPUaaS | CCP creates the corresponding domain in GPUaaS after the CCP tenant is created. |
| 5 | Compute | POST /api/baremetal-manager/allocate | CCP → GPUaaS | Allocate a bare-metal GPU node to a tenant. |
| 6 | Compute | GET /api/baremetal-manager/{node_id}/status | CCP → GPUaaS | Poll GPU node provisioning state. |
| 7 | Compute | POST /api/baremetal-manager/{node_id}/release | CCP → GPUaaS | Release a bare-metal GPU node. |
| 8 | Webhook | POST /ccp/webhooks/baremetal/state-change | GPUaaS → CCP | Async state-change notification (ACTIVE, FAILED, RELEASED). |
| 9 | K8s | POST /api/orbiter/clusters/register | CCP ← GPUaaS | Register a GPU K8s cluster with CCP Cloud Orbiter. |
| 10 | K8s | POST /api/orbiter/clusters/{id}/scale | CCP → GPUaaS | Scale GPU K8s cluster worker nodes. |
| 11 | K8s | DELETE /api/orbiter/clusters/{id} | CCP → GPUaaS | Tear down a GPU K8s cluster. |
| 12 | Network | POST /api/network-manager/vpc | CCP → GPUaaS | Allocate a tenant VRF/VLAN from the pool (triggered alongside CCP VPC creation). |
| 13 | Storage | POST /api/storage/tenant | CCP → GPUaaS | Create DDN tenant directory + NodeMap + UFM PKey. |
| 14 | Metering | GET /api/metering/usage | CCP → GPUaaS | Retrieve aggregated GPU usage records for a billing period. |
| 15 | Metering | POST /api/metering/quota/{id} | CCP → GPUaaS | Update tenant quota allocation after a subscription change. |
| 16 | Metering | GET /api/metering/export/csv | CCP → GPUaaS | Download billing CSV for invoice generation. |
| 17 | Monitoring | GET /api/metrics/gpu/{tenant_id} | CCP → GPUaaS | Fetch GPU utilization metrics for the CCP tenant dashboard. |
| 18 | Monitoring | POST /api/alerts/subscribe | CCP → GPUaaS | Register a CCP webhook to receive GPU infrastructure alerts. |
| 19 | Monitoring | POST /ccp/webhooks/alerts | GPUaaS → CCP | GPU alert delivery to CCP Notification Service (SMTP/SMS). |

8. Error Handling and Resilience

8.1 Error Response Standard

All integration API responses follow a consistent error schema:

| HTTP Code | Error Type | Handling Strategy |
|---|---|---|
| 400 | Bad Request | Do not retry. Validate the request schema before resubmitting. Log with correlation ID and surface to tenant. |
| 401 | Unauthorized | Refresh the JWT and retry once. If still 401, re-initiate the auth flow. Log token expiry event. |
| 403 | Forbidden | Do not retry. RBAC/ABAC rejection. Log for audit. Surface 'Insufficient permissions' to tenant. |
| 404 | Not Found | Do not retry. The resource may have been deleted. Trigger reconciliation to sync state. |
| 409 | Conflict | Idempotency check: the resource may already exist. Query the GET endpoint to verify state before retrying. |
| 422 | Quota Exceeded | Do not retry. Prompt tenant to upgrade subscription. Log quota breach event. |
| 429 | Rate Limited | Exponential backoff with jitter. Respect the Retry-After header. |
| 503 | Service Unavailable | Exponential backoff: initial 1 s, max 60 s, max 5 retries. Alert Ops team if sustained. |
| 5xx | Server Error | Retry with exponential backoff (max 3 retries). Trip the circuit breaker after 5 consecutive failures. |
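
A minimal sketch of the retry discipline in the table above: exponential backoff with jitter, bounded retries, and Retry-After honoured when present. The retryable-status set, default parameters, and use of the requests library are illustrative assumptions.

```python
import random
import time
import requests

RETRYABLE = {429, 503}  # statuses retried per the table above

def call_with_backoff(method: str, url: str, *, max_retries: int = 5,
                      initial: float = 1.0, cap: float = 60.0, **kwargs):
    """Issue an HTTP call with bounded exponential backoff and jitter (sketch)."""
    delay = initial
    for attempt in range(max_retries + 1):
        resp = requests.request(method, url, **kwargs)
        if resp.status_code not in RETRYABLE or attempt == max_retries:
            return resp
        # Honour Retry-After (seconds) when present, otherwise back off exponentially.
        retry_after = resp.headers.get("Retry-After")
        sleep_for = float(retry_after) if retry_after else min(cap, delay) * random.uniform(0.5, 1.0)
        time.sleep(sleep_for)
        delay = min(cap, delay * 2)
```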

8.2 Retry and Circuit Breaker Policy

| Integration Call | Max Retries | Initial Backoff | Max Backoff | Circuit Breaker |
|---|---|---|---|---|
| BM Node Provisioning | 3 | 5 seconds | 60 seconds | 5 failures / 30 s window |
| Quota Check | 2 | 500 ms | 5 seconds | 10 failures / 10 s window |
| Token Refresh | 1 | Immediate | — | 3 failures → re-auth |
| Metering Export | 3 | 2 seconds | 30 seconds | 5 failures / 60 s window |
| Alert Webhook Delivery | 5 | 1 second | 120 seconds | Dead-letter queue after 5 failures |
| State Change Webhook | 5 | 2 seconds | 60 seconds | Dead-letter queue after 5 failures |
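
The circuit-breaker column maps to a simple pattern: after N consecutive failures within a sliding window, calls are short-circuited until a cool-down elapses. A minimal sketch is shown below with thresholds mirroring the BM Node Provisioning row; the cool-down duration and class interface are illustrative assumptions.

```python
import time

class CircuitBreaker:
    """Trip after N failures within a window; fail fast until a cool-down elapses."""

    def __init__(self, failure_threshold: int = 5, window_sec: float = 30.0,
                 cooldown_sec: float = 60.0):
        self.failure_threshold = failure_threshold
        self.window_sec = window_sec
        self.cooldown_sec = cooldown_sec
        self.failures: list[float] = []      # timestamps of recent failures
        self.opened_at: float | None = None  # set when the breaker trips

    def allow(self) -> bool:
        if self.opened_at and time.monotonic() - self.opened_at < self.cooldown_sec:
            return False                     # circuit open: fail fast
        self.opened_at = None
        return True

    def record(self, success: bool) -> None:
        now = time.monotonic()
        if success:
            self.failures.clear()
            return
        self.failures = [t for t in self.failures if now - t < self.window_sec]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now             # trip the breaker
```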

9. Pre-Requisites and Deployment Considerations

9.1 CCP Pre-Requisites

  • Wildcard SSL certificates for CCP hosting and dynamic customer account URLs
  • Load Balancer VIPs for each CCP endpoint (console, API gateway, orbiter, auth)
  • DNS server with credentials to create dynamic domains per customer account
  • Accessible container registry for CCP component images
  • Kubernetes-compliant storage with high IOPS performance (NVMe-backed NFS)
  • SMTP server credentials for CCP Notification Service
  • NTP and DNS server connectivity
  • Connectivity and API credentials to integrate with the platform
  • Private network link to Coredge GPUaaS integration API endpoints

9.2 Coredge GPUaaS Pre-Requisites

  • MAAS HA controller accessible at 172.26.5.8 with IPMI credentials for all GPU node BMCs
  • Cisco NDFC access (REST API) for VRF/VLAN automation
  • NVIDIA UFM management interface access for InfiniBand PKey management
  • DDN AI400 MGS SSH access for Storage Plugin tenant directory management
  • VAST Data CSI driver deployed in GPUaaS management Kubernetes cluster
  • Palo Alto firewall API access for tenant network rule management
  • CCP Keycloak JWKS URI reachable from Coredge GPUaaS (for JWT validation)
  • Private network link to CCP for webhook delivery and integration API calls
  • NVIDIA GPU drivers and GPU Operator images in accessible container registry

9.3 Deployment Constraints

  • CCP must be deployed in the control plane of each availability zone, not in the workload pod.
  • The Coredge GPUaaS management cluster must be on a dedicated infrastructure separate from GPU workload nodes.
  • All VMs within a cluster (Postgres, Kubernetes) must have anti-affinity rules enabled to prevent co-location on a single physical host.
  • Database clusters use a 3+3 node setup (3 VMs per AZ). The two-AZ setup requires manual failover scripts (developed by Coredge team) due to the absence of a third AZ for automatic arbiter node placement.
  • OpenFGA Postgres DB and Global MongoDB VMs are stretched across the 2 AZs per region and routed accordingly.

10. RACI Matrix — Integration Responsibilities

💡 Note: R = Responsible  |  A = Accountable  |  C = Consulted  |  I = Informed

| Task | R | A | C | I |
|---|---|---|---|---|
| CCP Major / Minor Upgrade | Coredge | Coredge | Coredge | Coredge |
| OS Patching — CCP Cluster VMs | Coredge | Coredge | Coredge | Coredge |
| CCP Kubernetes Cluster Patching | Coredge | Coredge | Coredge | Coredge |
| GPU Node OS Patching (MAAS golden image update) | Coredge | Coredge | Coredge | Coredge |
| Infrastructure for CCP Management Cluster | Coredge | Coredge | Coredge | Coredge |
| Storage Driver Plugin for CCP PVCs | Coredge | Coredge | Coredge | Coredge |
| SSL Certificates and LB Config | Coredge | Coredge | Coredge | Coredge |
| Keycloak Federation Configuration (CCP ↔ GPUaaS) | Coredge | Coredge | Coredge | Coredge |
| Integration APIs (CCP) | Coredge | Coredge | Coredge | Coredge |
| Integration API Development (CCP ↔ GPUaaS) | Coredge | Coredge | Coredge | Coredge |
| Cisco NDFC VRF/VLAN Configuration | Coredge | Coredge | Coredge | Coredge |
| Palo Alto Firewall Rule Management | Coredge | Coredge | Coredge | Coredge |
| DDN Storage Provisioning Setup | Coredge | Coredge | Coredge | Coredge |
| orbiter-metering Rate Card Configuration | Coredge | Coredge | Coredge | Coredge |
| Database Failover Script Development | Coredge | Coredge | Coredge | Coredge |
| Service Catalogue and Rate Card | Coredge | Coredge | Coredge | Coredge |
| Integration Testing Execution | Coredge | Coredge | Coredge | Coredge |

11. Integration Testing Strategy

11.1 Test Categories

| Test Type | Scope | Success Criteria |
|---|---|---|
| Unit / Contract Testing | Individual API endpoint request/response schema validation | 100% schema conformance. All error codes return the correct structure. |
| Integration Testing | End-to-end provisioning flow: CCP → GPUaaS → bare-metal node ACTIVE | Node provisioned within SLA. Webhook received. Portal reflects ACTIVE state. |
| IAM / Auth Testing | JWT federation: CCP token accepted by GPUaaS orbiter-auth | All role mappings enforce correct RBAC + ABAC. Expired/tampered tokens rejected. |
| Quota Testing | Quota enforcement: resource creation blocked when quota is exceeded | HTTP 422 returned immediately on quota breach. Dashboard reflects the 90% warning. |
| Failover Testing | CCP AZ failover: traffic switches from AZ1 to AZ2 | Recovery within RTO. No data loss. Tenant sessions restored. |
| Billing Accuracy Testing | GPU metering: DCGM data flows through to the billing export | GPU-hours within 1% variance of expected. Correct tenant attribution. |
| Monitoring Integration | Alert delivery: GPU temperature alert fires and reaches CCP Notification Service | Alert delivered within 60 seconds of threshold breach. |
| Performance Testing (CCP) | CCP portal load: 50,000 VMs and 200,000 pods under management | Portal response < 3 s at P95. No degradation at peak load. |

11.2 Exclusions from Test Scope

  • Penetration testing (scheduled separately, not in scope of this integration IDD)
  • Performance testing for infrastructure components other than CCP (handled by respective component owners)
  • Day-2 operations testing for underlying bare-metal physical infrastructure

12. Open Items and Known Constraints

| # | Item | Status | Resolution Plan |
|---|---|---|---|
| 1 | Network Load Balancer — no out-of-the-box integration from CCP; requires the Automation Platform. | Open | Integration approach to be defined with the Coredge Network team. Automation Platform API contract to be agreed. |
| 2 | Dynamic VRF creation in Coredge requires manual Palo Alto firewall rule addition. | Open | Pre-created (pooled) VRF assignment is the interim production approach. Automated firewall API integration is on the roadmap. |
| 3 | Database failover script (3+3 node setup, two-AZ region) needs joint development. | In Progress | Script to be developed collaboratively by Coredge. Target completion before MVP1 UAT. |
| 4 | NAT Gateway integration approach (SNAT / Software Appliance) is TBD. | Open | Integration approach to be confirmed by the Network team post-MVP1 planning review. |
| 5 | VPN Gateway (Site-to-Site and Point-to-Site) integration uses Zscaler APIs — no out-of-the-box support. | Open | Zscaler API integration to be scoped and agreed as a separate work item. |
| 6 | CCP API mapping for Party, Billing Account (BA), and Logical Subscriber Identity (LSI) entities is incomplete. | Open | Mapping to be finalized with Business team guidance. Required before go-live. |
| 7 | MVP2 and MVP3 integration requirements (CDN, DRaaS, MariaDB, NoSQL, Kafka, etc.) are deferred. | Deferred | To be defined post-MVP1. Out of scope for this IDD version. |

13. References

| # | Document | Owner | Version |
|---|---|---|---|
| 1 | Cloud Management Platform (CMP) for Cloud — High Level Design | Coredge | 1.10 (Aug 2025) |
| 2 | Coredge GPUaaS Platform — Technical Reference Document (Unified Architecture Guide) | Coredge Cloud Infrastructure | 1.0 (Feb 2026) |
| 3 | Coredge Cloud — Service Catalogue (portal) | Coredge | Current |
| 4 | Coredge Statement of Work — Cloud CMP Engagement | Coredge | Current |
| 5 | NIST Special Publication 800-53 Rev 5 — Security and Privacy Controls | NIST | Rev 5 |
| 6 | ISO/IEC 27001:2022 — Information Security Management | ISO/IEC | 2022 |
| 7 | Keycloak Server Administration Guide | Red Hat / Keycloak Community | Current |
| 8 | Cisco NDFC — REST API Reference | Cisco | Current |
| 9 | NVIDIA DCGM User Guide | NVIDIA | Current |
| 10 | DDN AI400 Lustre Administration Guide | DDN | Current |