Integration Design Document
Acronyms & Abbreviations
| Acronym | Definition |
|---|---|
| BMaaS | Bare Metal as a Service |
| CCP | Cirrus Cloud Platform |
| CMP | Cloud Management Platform |
| GPUaaS | GPU as a Service |
| — | Runtime Environment (Coredge GPUaaS Platform) |
| IAM | Identity and Access Management |
| IB | InfiniBand |
| JWT | JSON Web Token |
| MAAS | Metal as a Service (Canonical) |
| NDFC | Nexus Dashboard Fabric Controller (Cisco) |
| OCP | OpenShift Container Platform |
| RBAC | Role-Based Access Control |
| ABAC | Attribute-Based Access Control |
| VPC | Virtual Private Cloud |
| VRF | Virtual Routing and Forwarding |
| VXLAN | Virtual Extensible LAN |
1. Purpose and Scope
1.1 Purpose
This Integration Design Document (IDD) defines the technical integration architecture, data flows, API contracts, security controls, and operational procedures for connecting with the Coredge GPUaaS Platform (Runtime Environment). The document serves as the authoritative reference for development, testing, operations, and governance teams involved in delivering unified GPU-accelerated cloud services on Coredge's sovereign cloud infrastructure.
1.2 Scope
The integration covers the following functional domains:
- Identity and Access Management (IAM) — Keycloak federation between CCP and GPUaaS
- Compute — GPU as a Service provisioning via OpenStack APIs and Coredge Bare Metal
- Container Orchestration — Kubernetes (Cloud Orbiter / OCP) cluster lifecycle integration
- Network — VPC, VRF, VLAN, and VXLAN/EVPN fabric orchestration (Cisco NDFC)
- Storage — NetApp (CCP) and DDN AI400 / VAST Data (GPUaaS) integration paths
- Metering and Billing — orbiter-metering to billing pipeline
- Monitoring and Alerting — Prometheus / VictoriaMetrics / Zabbix federation
- Security — mTLS service mesh, RBAC/ABAC policy enforcement, audit trails
1.3 Out of Scope
- Hardware procurement, cabling, and physical data center operations
- Penetration testing or third-party security audits
- Day-2 operations for underlying bare-metal physical infrastructure
- Application-layer changes unrelated to integration APIs
- MVP2 and MVP3 services unless explicitly noted
2. System Overview
2.1 Cirrus Cloud Platform (CCP)
Coredge is building a sovereign cloud platform for government and enterprise customers across India. The Cloud Management Platform layer is delivered by Cirrus Cloud Platform (CCP), which provides a hyper-scaler-grade self-service portal spanning IaaS, PaaS, and SaaS services. CCP is composed of the following orchestration layers:
- Cirrus Cloud Platform (CCP) — Cloud Management Platform, self-service portal, and IaaS orchestrator (OpenStack-based)
- Cloud Orbiter — Kubernetes orchestrator for container workloads
CCP runs as a microservices application deployed on Kubernetes (management cluster) within each availability zone, with active-passive high availability across two AZs per region and global services replicated across North and South regions.
2.2 Coredge GPUaaS Platform
The Coredge GPUaaS Platform is a purpose-built, bare-metal GPU cloud that provisions, orchestrates, and meters NVIDIA (H100) GPU infrastructure. It delivers:
- Bare Metal Provisioning via Canonical MAAS over IPMI/PXE
- Kubernetes Cluster Orchestration on bare metal GPU nodes
- Slurm HPC Cluster Orchestration via Slinky Operator
- High-performance storage via DDN AI400 (Lustre/InfiniBand) and VAST Data (CSI/S3)
- Network isolation via Cisco NDFC (VXLAN/EVPN, per-tenant VRFs)
- Full metering pipeline (DCGM, Prometheus, VictoriaMetrics, orbiter-metering)
2.3 Integration Relationship
CCP acts as the unified customer-facing control plane. The Coredge GPUaaS Platform acts as a specialized compute substrate exposed through CCP service catalogue entries. When a customer provisions GPU as a Service (GPUaaS) through the CCP Cloud Portal, CCP delegates to Coredge platform APIs for resource lifecycle management, while retaining control of IAM, billing aggregation, quota governance, and customer onboarding.
| Concern | Owner: CCP | Owner: Coredge GPUaaS |
|---|---|---|
| Customer Portal / UI | CCP Self-Service Console | No direct portal exposure |
| Tenant Identity & SSO | Keycloak (CCP Auth) — Federation | Keycloak (GPUaaS) — realm per tenant |
| Service Catalogue | CCP Service Catalogue + subscription | GPUaaS API endpoints |
| GPU Compute Provisioning | API delegation via OpenStack / Coredge APIs | Bare Metal Provisioning (MAAS) |
| Kubernetes Orchestration | Cloud Orbiter (CCP) for CaaS | compass-orchestrator for GPU K8s |
| Network Fabric | VPC / Firewall / LB (OpenStack + CheckPoint/Palo Alto) | VRF/VLAN (Cisco NDFC) for GPU nodes |
| Storage | NetApp (Block, Object, File) | DDN AI400 (GPU workload), VAST (platform) |
| Metering & Billing | orbiter-metering aggregation, billing | DCGM, Prometheus, orbiter-metering |
| Monitoring | Zabbix, Grafana (CCP) | VictoriaMetrics, Prometheus, Grafana |
| Quota Enforcement | CCP quota service per tenant/cell | orbiter-metering + domain quota |
| Backup & DR | Veritas backup agent, geo-replicated object storage | VAST S3 (pg_dump, mongodump, etcd) |
3. Integration Architecture
3.1 Architecture Principles
- API-First: All integrations are implemented through well-defined REST APIs or gRPC contracts with versioned endpoints.
- Loose Coupling: CCP and GPUaaS communicate through defined interface contracts; internal implementation changes in either system must not break the integration.
- Zero-Trust Security: Every inter-service call requires mutual authentication (mTLS) and JWT-based authorization. No implicit trust based on network location.
- Single Source of Truth per Domain: CCP owns customer identity and billing records; Coredge GPUaaS owns GPU hardware state and real-time metrics.
- Idempotency: All provisioning API calls must be idempotent. Retry logic must not produce duplicate resources.
- Observability: Every integration point emits structured logs with correlation IDs for end-to-end request tracing.
3.2 Logical Integration Layers
The integration is organized into four logical layers:
| Layer | Name | Description |
|---|---|---|
| L1 | Identity & Access | Federated Keycloak realms, JWT propagation, RBAC/ABAC synchronization between CCP and GPUaaS. |
| L2 | Control Plane | Resource lifecycle APIs (provision, scale, delete) for GPU Bare Metal, Kubernetes clusters, VPCs, and storage volumes. |
| L3 | Data Plane | Tenant networking fabric (VXLAN/EVPN), InfiniBand storage access (DDN/UFM), and workload traffic paths. |
| L4 | Observability & Billing | Metering event streams, quota synchronization, billing aggregation, monitoring federation, and audit log consolidation. |
3.3 Integration Topology
The following describes the high-level request flow from customer to GPU hardware:
- Customer accesses the CCP Cloud Portal (CCP Self-Service Console) with an identity authenticated and federated through CCP Keycloak.
- Customer places a GPU service order through the CCP service catalogue. Coredge calls the CCP onboarding API (POST /api/organizations) to create or update the tenant.
- CCP validates the request against tenant quota (orbiter-metering quota service), then dispatches a provisioning event to the Coredge GPUaaS API (baremetal-manager or compass-orchestrator) over the private integration network.
- Coredge GPUaaS executes the provisioning pipeline: network fabric setup (Cisco NDFC), IPMI/PXE boot (MAAS), OS install, GPU agent deployment, storage allocation (DDN/UFM), and cluster formation (K8s or Slurm).
- The provisioned resource is registered back in CCP inventory. Tenant gains self-service access from the CCP portal.
- GPU usage data flows from DCGM Exporter to Prometheus to VictoriaMetrics to orbiter-metering. CCP pulls aggregated billing records for invoice generation.
4. Integration Points
4.1 Identity and Access Management
4.1.1 Overview
Both CCP and the Coredge GPUaaS Platform use Keycloak as their Identity Provider. The integration federates these two Keycloak deployments so that a customer authenticated on CCP does not need to re-authenticate when their workloads are dispatched to the GPUaaS platform.
4.1.2 Federation Model
CCP Keycloak (master IdP) issues signed JWT tokens (RS256) containing tenant realm, role claims, and project/organization attributes. The Coredge GPUaaS Keycloak is configured as a relying party — it trusts tokens issued by CCP Keycloak after validating the token signature against the CCP Keycloak public key endpoint (JWKS URI).
💡Note
Coredge serves as the canonical identity store. All customer accounts are created, modified, and deactivated exclusively in Coredge. CCP Keycloak federation uses SAML 2.0 or OIDC, depending on the identity provider's configuration.
4.1.3 Token Structure
| JWT Claim | Source | Usage in GPUaaS |
|---|---|---|
| sub | CCP Keycloak | User unique identifier — maps to Coredge domain user record |
| realm_name | CCP Keycloak | Tenant realm — maps to GPUaaS Domain (tenant isolation boundary) |
| roles | CCP Keycloak | RBAC roles (e.g., Cell Administrator, Cell VM Admin) — translated to GPUaaS Project Admin or Domain Admin |
| domain_id | CCP (custom claim) | CCP tenant identifier — maps to GPUaaS Domain ID |
| project_id | CCP (custom claim) | CCP cell/project identifier — maps to GPUaaS Project scope |
| org_id | CCP (custom claim) | CCP organization identifier — maps to GPUaaS Organization scope |
| exp | Keycloak | Token expiry: 5–15 minute TTL for access tokens |
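The validation path described in 4.1.2 can be sketched in a few lines. The following is a minimal example using PyJWT; the JWKS URL is illustrative, and the custom claim names follow the table above.

```python
import jwt  # PyJWT
from jwt import PyJWKClient

# Illustrative JWKS URI; the real realm path depends on the CCP Keycloak deployment.
CCP_JWKS_URI = "https://keycloak.ccp.example/realms/tenant-a/protocol/openid-connect/certs"
_jwks_client = PyJWKClient(CCP_JWKS_URI)

def validate_ccp_token(token: str) -> dict:
    """Verify signature, expiry, and required claims of a CCP-issued RS256 JWT."""
    signing_key = _jwks_client.get_signing_key_from_jwt(token)
    claims = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],            # CCP Keycloak signs with RS256 (4.1.2)
        options={
            "require": ["exp", "sub"],   # expiry and subject are mandatory
            "verify_aud": False,         # audience policy is assumed to sit at the gateway
        },
    )
    return {
        "user_id": claims["sub"],
        "domain_id": claims.get("domain_id"),    # tenant isolation boundary (4.1.3)
        "project_id": claims.get("project_id"),
        "org_id": claims.get("org_id"),
        "roles": claims.get("roles", []),
    }
```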
4.1.4 Role Mapping
| CCP Role | GPUaaS Role | Effective Permissions |
|---|---|---|
| Tenant Super Administrator | Platform Super Admin (scoped to domain) | Full control within tenant domain |
| Tenant Administrator | Domain Admin | Full domain control: clusters, BM, networks, quotas |
| Cell Administrator | Project Admin | Full project control: GPU clusters, storage, networks |
| Cell VM Admin | Project Admin (Compute scope) | GPU node and cluster lifecycle management |
| Cell Container Admin | Project Admin (K8s scope) | Kubernetes cluster creation, management, workload deployment |
| Cell Viewer / Tenant Viewer | Viewer | Read-only access to resources and dashboards |
| Tenant Billing Admin | Domain Admin (Billing scope only) | Metering dashboard, cost reports, quota usage |
4.1.5 Authentication Flow
- Customer logs into CCP Cloud Portal — CCP redirects to identity provider (SAML/OIDC).
- The identity provider authenticates the user (with optional MFA: TOTP, SMS, or hardware key).
- Coredge issues an assertion to CCP Keycloak; CCP Keycloak issues a signed JWT containing domain, org, and project claims.
- CCP backend services validate the JWT on every request (signature + expiry + realm).
- When CCP dispatches a request to the Coredge GPUaaS API, it includes the JWT in the Authorization header. GPUaaS orbiter-auth validates the token against the CCP Keycloak JWKS endpoint.
- orbiter-auth performs RBAC (role check) + ABAC (domain/project attribute check) before allowing the request to proceed.
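A minimal sketch of that combined check, assuming the claim dictionary produced by the validation sketch in 4.1.3; the function name and argument shapes are illustrative, not the actual orbiter-auth interface.

```python
def authorize(claims: dict, required_role: str,
              resource_domain: str, resource_project: str) -> bool:
    """RBAC role check followed by ABAC domain/project attribute check."""
    if required_role not in claims.get("roles", []):     # RBAC: caller must hold the mapped role (4.1.4)
        return False
    if claims.get("domain_id") != resource_domain:       # ABAC: token must be scoped to the resource's domain
        return False
    return claims.get("project_id") == resource_project  # ABAC: and to its project
```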
4.2 Compute — GPU as a Service
4.2.1 Integration Model
GPU as a Service (GPUaaS) is listed in the CCP Service Catalogue. Integration is via OpenStack APIs (for VM-based GPU slices where applicable) and via the Coredge baremetal-manager API for dedicated bare-metal GPU node provisioning. Both paths are fronted by the CCP API Gateway.
4.2.2 Bare Metal Manager API
| Method | Endpoint | Request Body | Description |
|---|---|---|---|
| POST | /api/baremetal-manager/allocate | {flavor, os_image, network_id, project_id, tenant_id} | Allocate a bare-metal GPU node to a tenant. Triggers NDFC network setup, MAAS provisioning. |
| GET | /api/baremetal-manager/{node_id}/status | — | Poll provisioning state: PENDING, PROVISIONING, ACTIVE, FAILED. |
| POST | /api/baremetal-manager/{node_id}/release | {drain: true} | Drain workloads and release node back to available pool. |
| GET | /api/baremetal-manager/flavors | — | List available GPU node flavors (H100 8-GPU, etc.) |
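A sketch of the CCP-side allocation call, assuming Python requests with an mTLS client certificate pair; the hostname, certificate paths, and the flavor/image names are illustrative.

```python
import uuid
import requests

GPUAAS_API = "https://gpuaas-api.internal.example"   # illustrative private-link hostname
CLIENT_CERT = ("/etc/ccp/pki/client.crt", "/etc/ccp/pki/client.key")  # mTLS client pair

def allocate_gpu_node(token: str, tenant_id: str, project_id: str, network_id: str) -> dict:
    """Allocate a bare-metal GPU node per the table above."""
    resp = requests.post(
        f"{GPUAAS_API}/api/baremetal-manager/allocate",
        json={
            "flavor": "h100-8gpu",              # illustrative flavor name
            "os_image": "golden-ubuntu-22.04",  # illustrative golden image
            "network_id": network_id,
            "project_id": project_id,
            "tenant_id": tenant_id,
        },
        headers={
            "Authorization": f"Bearer {token}",
            "X-Correlation-ID": str(uuid.uuid4()),  # end-to-end tracing (Section 3.1)
        },
        cert=CLIENT_CERT,   # mTLS on the private integration network
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()      # expected to carry node_id for status polling
```

Because allocation must be idempotent (Section 3.1), a production client would also treat HTTP 409 as "verify state, then reconcile" per Section 8.1 rather than blindly retrying.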
4.2.3 End-to-End Provisioning Data Flow
- CCP Self-Service Console receives a GPU node provisioning request from tenant.
- CCP API Gateway validates the JWT and forwards to platform microservice.
- Platform service checks quota against orbiter-metering. If quota exceeded, returns HTTP 422 with quota-exceeded error to the tenant.
- CCP calls POST /api/baremetal-manager/allocate on Coredge GPUaaS over the private integration network (mTLS).
- Coredge baremetal-manager triggers: (a) NDFC network fabric setup — VRF + VLAN allocation, (b) MAAS IPMI power-on + PXE boot, (c) OS install via golden image, (d) Cloud-init, agent deployment.
- Storage allocation: DDN tenant directory created, NodeMap assigned, IB PKey created via UFM.
- Agent registers with GPUaaS portal via gRPC (port 8030/8040). Admin approves host.
- Provisioning state transitions to ACTIVE. GPUaaS notifies CCP via webhook (POST /ccp/webhooks/baremetal/state-change); a sketch of the CCP-side receiver follows this list.
- CCP registers the node in its inventory, updates tenant resource view. WebSocket notification sent to portal.
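A sketch of the CCP-side webhook receiver for step 8, using Flask; the payload field names and the inventory/alarm hooks are assumptions, not a documented contract.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def register_node_in_inventory(node_id: str) -> None:
    ...  # update CCP inventory and push the WebSocket notification to the portal (step 9)

def raise_provisioning_alarm(node_id: str) -> None:
    ...  # notify Ops via the CCP Alarm Service

@app.route("/ccp/webhooks/baremetal/state-change", methods=["POST"])
def baremetal_state_change():
    """Handle async provisioning state changes from Coredge GPUaaS."""
    event = request.get_json()
    node_id, state = event["node_id"], event["state"]   # assumed payload fields
    if state == "ACTIVE":
        register_node_in_inventory(node_id)
    elif state == "FAILED":
        raise_provisioning_alarm(node_id)
    return jsonify({"received": True}), 200
```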
4.3 Container Orchestration — Kubernetes
4.3.1 Integration Model
CCP delivers Container as a Service (CaaS) via Cloud Orbiter, which supports both OCP (OpenShift Container Platform) and standard Kubernetes clusters. The Coredge GPUaaS Platform delivers GPU-accelerated Kubernetes clusters (via compass-orchestrator) on bare-metal nodes. These two systems integrate at the cluster registration and cluster agent level.
GPU Kubernetes clusters provisioned by Coredge are registered back into CCP Cloud Orbiter using the Cluster Agent protocol (gRPC, port 8030/8040), enabling CCP tenants to manage GPU workloads from the unified CCP portal.
4.3.2 Cluster Registration API
| Method | Endpoint (CCP Cloud Orbiter) | Description |
|---|---|---|
| POST | /api/orbiter/clusters/register | Register an externally provisioned GPU K8s cluster with Cloud Orbiter. Body: {cluster_name, kubeconfig_secret, node_count, gpu_type, tenant_id, project_id}. |
| GET | /api/orbiter/clusters/{cluster_id} | Get cluster status, node health, GPU availability, addon state. |
| POST | /api/orbiter/clusters/{cluster_id}/scale | Scale worker nodes up or down. GPUaaS executes kubeadm join/drain. |
| DELETE | /api/orbiter/clusters/{cluster_id} | Initiate cluster teardown: drain → kubeadm reset → deregister → release BM nodes. |
4.3.3 GPU Operator Integration
When a GPU Kubernetes cluster is provisioned, the Coredge compass-orchestrator deploys the NVIDIA GPU Operator, whose device-plugin DaemonSet registers GPU resources with the Kubernetes node resource API (nvidia.com/gpu: 8 per node). These resource labels are propagated to Cloud Orbiter and are available for workload scheduling via node selectors and resource requests in CCP-deployed workloads.
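To illustrate how a CCP-deployed workload consumes those advertised resources, the following sketch submits a pod requesting a full 8-GPU node via the official Kubernetes Python client; the image and namespace are illustrative.

```python
from kubernetes import client, config

def submit_gpu_pod(namespace: str = "tenant-a") -> None:
    """Create a pod that requests GPUs advertised by the NVIDIA GPU Operator."""
    config.load_kube_config()   # kubeconfig obtained via the Cloud Orbiter registration
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-training-job"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.01-py3",   # illustrative training image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "8"},          # one full H100 node (4.3.3)
                ),
            )],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace, pod)
```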
4.4 Network Integration
4.4.1 Overview
CCP manages tenant VPC, Firewall, and Load Balancer resources via OpenStack APIs and CheckPoint/Palo Alto integrations. The Coredge GPUaaS Platform manages GPU node networking via Cisco NDFC (VXLAN/EVPN). These two network domains are connected through a defined inter-domain routing policy that allows GPU workload traffic to reach CCP-managed services (e.g., load balancers, object storage endpoints) while maintaining tenant isolation.
4.4.2 Network Segmentation Model
| Network Segment | Owner | Integration Point |
|---|---|---|
| Tenant VPC (CCP) | CCP / OpenStack | Customer workload network. GPU nodes egress through CCP-managed VPC gateways for external services. |
| Tenant VRF (GPUaaS) | Coredge / Cisco NDFC | GPU bare-metal node isolation. 4 VLANs per tenant: Control Plane, GPU Worker, LB, Reserved. |
| GPU Node Management VLAN 901 | Coredge | Cluster agent gRPC communications to GPUaaS portal. |
| Provisioning VRF VLAN 902 | Coredge | MAAS PXE/DHCP relay for OS provisioning. Not exposed to tenant. |
| InfiniBand Fabric (UFM PKey) | Coredge / NVIDIA UFM | GPU-to-GPU RDMA (NCCL) and GPU-to-DDN storage (Lustre). Per-tenant PKey isolation. |
| External / Internet Gateway | CCP (NAT Gateway) | GPU nodes access external services (package updates, ML model registries) through CCP NAT Gateway. |
| CCP–GPUaaS Private Link | Both | Dedicated private network segment for integration API calls between CCP and Coredge GPUaaS. mTLS enforced. |
4.4.3 VPC Lifecycle Coordination
When a tenant requests a VPC through the CCP portal, CCP calls OpenStack APIs to create the VPC constructs. If GPU bare-metal nodes are allocated to the tenant in the same provisioning request, CCP additionally notifies the Coredge network-manager to allocate a matching tenant VRF. The VRF is linked to the tenant VPC through a pre-configured L3 routing policy on the Palo Alto firewall, enabling east-west traffic between CCP VMs and GPU bare-metal nodes within the same tenant boundary.
💡Note
Dynamic VRF creation in Coredge GPUaaS currently requires a manual firewall rule update on the Palo Alto firewall. Pre-created (pooled) VRF allocation is the preferred path for production deployments.
4.5 Storage Integration
4.5.1 Storage Architecture Mapping
| Use Case | CCP Storage (NetApp) | GPUaaS Storage | Integration Notes |
|---|---|---|---|
| VM Block Storage | NetApp Block (iSCSI/FC) | Not applicable | CCP-owned. No integration required. |
| Object Storage (S3) | NetApp S3-compatible | VAST S3 (platform internal) | Tenant S3 endpoints served from CCP NetApp. GPUaaS uses VAST S3 internally for backup/config only. |
| File Storage (NFS) | NetApp NFS | DDN Lustre (GPU workloads) | Separate NFS mounts. GPU workloads use DDN for high-throughput AI/ML data access. |
| GPU Training Data | Not applicable | DDN AI400 (Lustre over IB) | Accessed by GPU bare-metal nodes via InfiniBand. 4 Tb/s per node aggregate throughput. |
| Platform DB Backup | NetApp-backed Veritas agent | VAST S3 (pg_dump, mongodump, etcd) | CCP backup: Veritas. GPUaaS backup: VAST S3. Cross-region replication applies to CCP data only. |
| CCP Config & Logs | Object storage in local region (5 TB) | Not applicable | Managed by CCP only. Backup copied cross-region. |
4.5.2 Storage Provisioning Data Flow (GPU Workloads)
- Tenant subscribes to GPU service — CCP creates tenant record and notifies Coredge GPUaaS to create tenant storage allocation.
- Coredge Storage Plugin (via SSH to DDN MGS) creates the tenant directory on Lustre: /lustre/{tenant-id}/ (a provisioning sketch follows this list).
- NVIDIA UFM creates an InfiniBand PKey for the tenant. All GPU node mlx GUIDs are added as PKey members.
- A DDN NodeMap is created, mapping the tenant's IB IP address ranges to the tenant directory. Only mapped IPs can mount the filesystem.
- NFS over VIP is configured for environments requiring Ethernet storage access.
- Quotas are set on the tenant directory according to the subscribed service tier.
- NodeMap is activated — tenant's GPU nodes can now access the DDN filesystem over InfiniBand with hardware-level isolation enforced by both UFM PKey and DDN NodeMap (dual isolation).
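A minimal sketch of steps 2 and 6, assuming SSH access to the DDN MGS and Lustre project quotas; the hostname, the project-ID scheme, and exact lfs options are assumptions that vary by DDN/Lustre release.

```python
import subprocess

DDN_MGS = "ddn-mgs.internal.example"   # illustrative MGS hostname (step 2: SSH access)

def provision_tenant_storage(tenant_id: str, project_id: int, hard_limit: str = "100T") -> None:
    """Create the tenant directory and set a quota on it (steps 2 and 6 above)."""
    tenant_dir = f"/lustre/{tenant_id}"
    for cmd in (
        f"mkdir -p {tenant_dir}",
        f"lfs project -s -p {project_id} {tenant_dir}",           # tag directory with an inherited project ID
        f"lfs setquota -p {project_id} -B {hard_limit} /lustre",  # hard block limit for the project
    ):
        subprocess.run(["ssh", DDN_MGS, cmd], check=True)         # fail fast if any step errors
```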
4.6 Metering, Billing, and Quota
4.6.1 Metering Pipeline
The Coredge GPUaaS Platform meters GPU resource consumption at hardware level (15-second granularity) using DCGM Exporter, Node Exporter, Slurm job accounting (slurmdbd), and storage/network telemetry. The orbiter-metering service aggregates these raw metrics into billable usage records and exposes a billing export API consumed by CCP for invoice generation.
| Metric | Source | Granularity | Billing Unit |
|---|---|---|---|
| GPU-Hours | DCGM Exporter (per GPU, per node) | 15-sec raw → hourly billable | GPU-node-hours × rate card |
| CPU-Hours | Node Exporter | 15-sec → hourly | vCPU-hours × rate card |
| Bare Metal Node-Hours | baremetal-manager (MongoDB state) | Allocation start/end timestamp | Node-hours × rate card |
| K8s Cluster-Hours | compass-orchestrator (MongoDB) | Cluster create/delete timestamp | Cluster-hours × rate card |
| Slurm Job GPU-Hours | slurmdbd → MariaDB | Per job on completion | AllocGRES (GPU count) × Elapsed × rate |
| Storage (DDN) | DDN Storage Plugin | Per tenant directory, polled | GB-hours × rate card |
| InfiniBand Bandwidth | NVIDIA UFM per PKey | Real-time, per tenant PKey | TB transferred × rate (if applicable) |
4.6.2 Billing Export API (Coredge → CCP)
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/metering/usage?tenant_id={id}&from={ts}&to={ts} | Retrieve aggregated usage records for a tenant within a time window. Returns GPU-hrs, CPU-hrs, node-hrs, cluster-hrs, storage-GB, per project. |
| GET | /api/metering/quota/{tenant_id} | Current quota usage vs. allocated quota per resource type. |
| POST | /api/metering/quota/{tenant_id} | Update quota allocation (called by CCP when customer upgrades/downgrades subscription). |
| GET | /api/metering/export/csv?tenant_id={id}&period={month} | Download CSV billing export for a billing period. Compatible with invoice import format. |
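A sketch of the CCP-side usage pull for the first row of the table; the host and mTLS pair are the same illustrative values used in the Section 4.2 sketch, and the parameter names follow the endpoint signature above.

```python
import requests

GPUAAS_API = "https://gpuaas-api.internal.example"   # illustrative private-link hostname
CLIENT_CERT = ("/etc/ccp/pki/client.crt", "/etc/ccp/pki/client.key")

def fetch_usage(token: str, tenant_id: str, start_ts: str, end_ts: str) -> dict:
    """Retrieve aggregated usage records for a billing window."""
    resp = requests.get(
        f"{GPUAAS_API}/api/metering/usage",
        params={"tenant_id": tenant_id, "from": start_ts, "to": end_ts},
        headers={"Authorization": f"Bearer {token}"},
        cert=CLIENT_CERT,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()   # per-project GPU-hrs, CPU-hrs, node-hrs, cluster-hrs, storage-GB
```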
4.6.3 Quota Synchronization Flow
- Customer subscribes/upgrades GPU service tier on CCP portal.
- Coredge calls CCP quota management API to update tenant allocation.
- CCP propagates the new quota to Coredge GPUaaS via POST /api/metering/quota/{tenant_id}.
- orbiter-metering updates the in-memory quota counter. New quota takes effect immediately for subsequent resource requests.
- If a resource request would exceed quota, the Coredge API returns HTTP 422 (Quota Exceeded). CCP displays the error to the tenant with a prompt to upgrade their subscription.
- Quota dashboard (80% and 90% warning thresholds) is updated in both CCP portal and Coredge tenant dashboard.
4.7 Monitoring and Alerting Integration
4.7.1 Monitoring Stack Federation
| Component | CCP | GPUaaS (Coredge) | Integration |
|---|---|---|---|
| Metrics Collection | Zabbix Agent (node, service) | DCGM, Node, K8s State Exporters | Federated scrape |
| Metrics Storage | Zabbix DB | VictoriaMetrics (HA) | API bridge for CCP consumption |
| Alerting | Zabbix Alert Rules | Prometheus AlertManager | CCP Notification Service (SMTP/SMS) |
| Dashboards | Grafana (CCP-managed) | External + Internal Grafana (GPUaaS) | GPU dashboards embedded in CCP portal via iFrame / API |
| Log Aggregation | APM/NPM/IPM (CCP Log Analyzer) | VictoriaMetrics + log rotation | Log forwarding via syslog/agent to CCP Log Analyzer |
| Audit Logs | CCP audit trail (ordr_mgmt) | Correlation ID log stream (S3 every 6h) | Cross-correlated via X-Correlation-ID header |
4.7.2 Alert Thresholds (GPU Infrastructure)
| Alert | Warning | Critical | Notification Target |
|---|---|---|---|
| GPU Temperature | > 80°C for 5 min | > 90°C for 2 min | Coredge Ops + CCP Notification Service → Tenant |
| GPU Memory Utilization | > 90% for 15 min | > 95% | CCP Alarm Service → Tenant dashboard |
| CPU Utilization (K8s nodes) | > 75% | > 90% | Coredge Ops team |
| Memory Utilization | > 70% | > 85% | Coredge Ops team |
| DDN Storage Quota | > 80% used | > 90% used | CCP Notification Service → Tenant |
| Cluster Node Not Ready | 1 node down > 2 min | 2+ nodes down | Coredge Ops + CCP Alarm Service → Tenant |
| Storage Latency (DDN) | > 10 ms | > 15 ms | Coredge Ops team |
4.8 Tenant Onboarding Integration
4.8.1 End-to-End Onboarding Flow
- Customer self-registers on CCP or is onboarded by the Coredge Business team.
- Customer subscribes to Cirrus Cloud Platform. The platform calls CCP POST /api/organizations with party/billing account details.
- CCP Keycloak auto-provisions a tenant realm. Default roles (Tenant Super Administrator, Tenant Administrator) are created.
- CCP creates default resources: project/cell/VPC in the default region, default service catalogue.
- CCP notifies the Coredge GPUaaS tenant onboarding API (POST /api/tenants) to create the corresponding domain, allocate a VRF pool (4 VLANs), and set initial resource quotas; a sketch of this call follows the list.
- Coredge GPUaaS auto-provisions: (a) Keycloak realm for the domain, (b) VRF/VLAN allocation in Cisco NDFC, (c) Initial storage directory structure in DDN Lustre, (d) Quota records in orbiter-metering.
- Tenant administrator receives credentials and can log in to CCP portal. CCP identity federation is active.
- Resource hierarchy is enforced: Tenant (mapped to LSI) → Cell → Resources (in CCP); Domain → Organization → Project (in GPUaaS).
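A sketch of the CCP-to-GPUaaS call in step 5; the request body fields are assumptions based on what the flow says the call must convey (domain creation, a 4-VLAN VRF pool, initial quotas).

```python
import uuid
import requests

def onboard_gpuaas_tenant(token: str, tenant_id: str, tenant_name: str, quotas: dict) -> dict:
    """Create the corresponding GPUaaS domain after the CCP tenant exists (step 5)."""
    resp = requests.post(
        "https://gpuaas-api.internal.example/api/tenants",   # illustrative endpoint host
        json={
            "tenant_id": tenant_id,
            "name": tenant_name,
            "vlan_count": 4,          # per-tenant VRF pool size (4.4.2)
            "initial_quota": quotas,  # seeded into orbiter-metering (step 6d)
        },
        headers={
            "Authorization": f"Bearer {token}",
            "X-Correlation-ID": str(uuid.uuid4()),
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```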
4.8.2 Coredge ↔ CCP ↔ GPUaaS Entity Mapping
| Coredge Entity | CCP (ACP) Entity | GPUaaS Entity | Notes |
|---|---|---|---|
| Party | — | — | master entity |
| Billing Account (BA) | — | — | billing scope |
| Logical Subscriber Identity (LSI) | — | — | subscriber record |
| Tenant | Tenant | Domain (Tenant) | 1:1:1 mapping enforced |
| — | Cell | Project | Multiple cells per tenant allowed |
| — | Resources | Resources (BM/K8s/Storage) | Scoped within cell/project |
5. Security Architecture
5.1 Security Principles
- Zero Trust: No implicit trust between any systems. Every request authenticated and authorized regardless of network origin.
- Least Privilege: Users and services receive only the minimum permissions required for their function.
- Defense in Depth: Multiple independent security controls at network, identity, data, and application layers.
- Encryption Everywhere: All data in transit encrypted with TLS 1.2+ / mTLS. All data at rest encrypted with AES-256.
- Auditability: All actions logged with correlation IDs, timestamps, and user identity. Logs retained per compliance requirements.
5.2 Security Controls per Layer
| Layer | Control | Implementation |
|---|---|---|
| Identity | Federated IAM | Keycloak with SAML/OIDC federation. RS256-signed JWTs. 5–15 min access token TTL. Single-use refresh tokens. |
| Identity | MFA | TOTP, SMS, email, hardware keys. Configurable per realm and per role. |
| Identity | Session Management | Admin force-logout. Token revocation. Correlation ID tracking (X-Correlation-ID header). |
| Network | Transport Encryption | TLS 1.2+ on all external endpoints. HSTS enforced. Cipher: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384. |
| Network | mTLS (Service Mesh) | Mutual TLS with automated PKI for all inter-service calls (CCP microservices, Coredge microservices, cross-system integration). |
| Network | Network Segmentation | Palo Alto stateful firewall between orchestration and GPU nodes. Per-tenant VRF isolation. InfiniBand PKey per tenant. |
| Network | API Gateway | CCP API Gateway enforces authentication and rate limiting on all inbound API calls. |
| Data | Encryption at Rest | AES-256 for all stored data (databases, object storage, backups). Key management via platform PKI. |
| Data | Tenant Data Isolation | Database-level tenant tagging. RBAC + ABAC enforcement. Grafana data source isolation per tenant. |
| Application | RBAC + ABAC | orbiter-auth performs role check AND attribute (domain/project/org) check on every request. |
| Application | Image Security | CIS-benchmarked golden OS images. Non-privileged containers. Vulnerability scanning in CI/CD pipeline. |
| Operations | Audit Logging | All provisioning, access, and configuration changes logged with user ID, timestamp, correlation ID. Backed to S3 every 6 hours. |
| Operations | Compliance | NIST 800-53 Rev 5 (AC, AU), ISO/IEC 27001 (A.9, A.10), HIPAA (IAM, RBAC, audit). |
6. High Availability and Disaster Recovery
6.1 CCP High Availability
- Each region has two CCP clusters: AZ1 (Primary / Active) and AZ2 (Standby / Passive).
- 3-node web layer (reverse proxy) in DMZ per AZ. Kubernetes cluster: 3 master + 5 worker nodes per AZ.
- PostgreSQL: Active-Passive with Logical/Streaming Replication across AZs. 3-node cluster per AZ (3+3 node setup, with arbiter VM for manual failover).
- MongoDB: Active-Passive replication within region. MongoDB Active-Active replication (change-stream) for global services (tenant/project/user metadata) across North and South regions.
- OpenFGA (AuthZ DB): Active-Passive between two regions. Writes always go to primary region. Read-heavy access pattern.
- Global GSLB probe detects Active cluster failure and routes traffic to Passive. Internal 2n+1 quorum system validates failover decision.
6.2 GPUaaS High Availability
- Kubernetes control plane: 3 or 5 etcd nodes for HA quorum.
- Slurm: slurmctld deployed as HA Kubernetes Deployment. MariaDB StatefulSet with VAST CSI persistent storage.
- MAAS: Region + Rack Controller HA configuration.
- VictoriaMetrics: HA deployment with replication. No data loss on single-node failure.
- DDN AI400: Redundant MGS/MDS nodes. Multiple OSTs for distributed parallel I/O.
6.3 Backup Strategy
| Data Source | Method | Frequency | Retention | Destination |
|---|---|---|---|---|
| CCP Keycloak PostgreSQL | Veritas backup agent | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| CCP Config MongoDB | Veritas backup agent | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| CCP Metrics MongoDB | Veritas backup agent | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| CCP K8s etcd | etcdctl snapshot | Incremental 30 min / Full 24 hrs | 3 months | Cross-region object storage |
| GPUaaS PostgreSQL (metering) | pg_dump (full + incremental) | Every 6 hours | 12–24 months | VAST S3 'NetBackup' |
| GPUaaS MongoDB (orchestration) | mongodump (full + incremental) | Every 6 hours | 12–24 months | VAST S3 'NetBackup' |
| GPUaaS etcd (K8s clusters) | etcdctl snapshot | Every 6 hours | Critical / indefinite | VAST S3 'NetBackup' |
7. API Integration Matrix
The following table provides a consolidated view of all integration APIs between CCP and the Coredge GPUaaS Platform. All APIs are secured with mTLS between systems and require a valid JWT in the Authorization header.
| # | Domain | API / Endpoint | Direction | Description |
|---|---|---|---|---|
| 1 | IAM | JWKS URI: /realms/{realm}/protocol/openid-connect/certs | CCP ← GPUaaS | Keycloak public key endpoint for JWT signature verification. |
| 2 | IAM | POST /realms/{realm}/protocol/openid-connect/token | CCP → GPUaaS | Service-to-service token exchange for integration calls. |
| 3 | Onboarding | POST /api/organizations | → CCP | Creates the tenant organization in CCP upon subscription. |
| 4 | Onboarding | POST /api/tenants | CCP → GPUaaS | CCP creates corresponding domain in GPUaaS after CCP tenant is created. |
| 5 | Compute | POST /api/baremetal-manager/allocate | CCP → GPUaaS | Allocate bare-metal GPU node to tenant. |
| 6 | Compute | GET /api/baremetal-manager/{node_id}/status | CCP → GPUaaS | Poll GPU node provisioning state. |
| 7 | Compute | POST /api/baremetal-manager/{node_id}/release | CCP → GPUaaS | Release bare-metal GPU node. |
| 8 | Webhook | POST /ccp/webhooks/baremetal/state-change | GPUaaS → CCP | Async state-change notification (ACTIVE, FAILED, RELEASED). |
| 9 | K8s | POST /api/orbiter/clusters/register | CCP ← GPUaaS | Register GPU K8s cluster with CCP Cloud Orbiter. |
| 10 | K8s | POST /api/orbiter/clusters/{id}/scale | CCP → GPUaaS | Scale GPU K8s cluster worker nodes. |
| 11 | K8s | DELETE /api/orbiter/clusters/{id} | CCP → GPUaaS | Tear down GPU K8s cluster. |
| 12 | Network | POST /api/network-manager/vpc | CCP → GPUaaS | Allocate tenant VRF/VLAN from pool (triggered alongside CCP VPC creation). |
| 13 | Storage | POST /api/storage/tenant | CCP → GPUaaS | Create DDN tenant directory + NodeMap + UFM PKey. |
| 14 | Metering | GET /api/metering/usage | CCP → GPUaaS | Retrieve aggregated GPU usage records for billing period. |
| 15 | Metering | POST /api/metering/quota/{id} | CCP → GPUaaS | Update tenant quota allocation after subscription change. |
| 16 | Metering | GET /api/metering/export/csv | CCP → GPUaaS | Download billing CSV for invoice generation. |
| 17 | Monitoring | GET /api/metrics/gpu/{tenant_id} | CCP → GPUaaS | Fetch GPU utilization metrics for CCP tenant dashboard. |
| 18 | Monitoring | POST /api/alerts/subscribe | CCP → GPUaaS | Register CCP webhook to receive GPU infrastructure alerts. |
| 19 | Monitoring | POST /ccp/webhooks/alerts | GPUaaS → CCP | GPU alert delivery to CCP Notification Service (SMTP/SMS). |
8. Error Handling and Resilience
8.1 Error Response Standard
All integration API responses follow a consistent error schema:
| HTTP Code | Error Type | Handling Strategy |
|---|---|---|
| 400 | Bad Request | Validate request schema before retrying. Do not retry. Log with correlation ID and surface to tenant. |
| 401 | Unauthorized | Refresh JWT token and retry once. If still 401, re-initiate auth flow. Log token expiry event. |
| 403 | Forbidden | Do not retry. RBAC/ABAC rejection. Log for audit. Surface 'Insufficient permissions' to tenant. |
| 404 | Not Found | Do not retry. Resource may have been deleted. Trigger reconciliation to sync state. |
| 409 | Conflict | Idempotency check: the resource may already exist. Query the GET endpoint to verify state before retrying. |
| 422 | Quota Exceeded | Do not retry. Prompt tenant to upgrade subscription. Log quota breach event. |
| 429 | Rate Limited | Implement exponential backoff with jitter. Respect Retry-After header. |
| 503 | Service Unavailable | Exponential backoff: initial 1s, max 60s, max 5 retries. Alert Ops team if sustained. |
| 5xx | Server Error | Retry with exponential backoff (max 3 retries). Trigger circuit breaker after 5 consecutive failures. |
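The table above reduces to a small dispatch function on the CCP client side. The following sketch maps status codes to the table's strategies; the strategy labels are illustrative.

```python
def classify_response(status: int) -> str:
    """Map an integration-API status code to a handling strategy (table above)."""
    if status < 400:
        return "ok"
    if status == 401:
        return "refresh-token-and-retry-once"
    if status == 409:
        return "verify-state-then-retry"    # idempotency check via the GET endpoint
    if status in (400, 403, 404, 422):
        return "fail-fast"                  # do not retry; log and surface to tenant
    if status == 429 or status >= 500:
        return "retry-with-backoff"         # jittered exponential backoff (Section 8.2)
    return "fail-fast"
```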
8.2 Retry and Circuit Breaker Policy
| Integration Call | Max Retries | Initial Backoff | Max Backoff | Circuit Breaker |
|---|---|---|---|---|
| BM Node Provisioning | 3 | 5 seconds | 60 seconds | 5 failures / 30s window |
| Quota Check | 2 | 500 ms | 5 seconds | 10 failures / 10s window |
| Token Refresh | 1 | Immediate | — | 3 failures → re-auth |
| Metering Export | 3 | 2 seconds | 30 seconds | 5 failures / 60s window |
| Alert Webhook Delivery | 5 | 1 second | 120 seconds | Dead-letter queue after 5 failures |
| State Change Webhook | 5 | 2 seconds | 60 seconds | Dead-letter queue after 5 failures |
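A sketch of the backoff and circuit-breaker policy from the table, shown with the BM Node Provisioning parameters; the full-jitter variant is an assumption (the table specifies backoff bounds, not the jitter algorithm).

```python
import random
import time

def retry_with_backoff(call, max_retries: int = 3, initial: float = 5.0, cap: float = 60.0):
    """Jittered exponential backoff, parameterized per the BM Node Provisioning row."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise
            delay = min(cap, initial * (2 ** attempt))
            time.sleep(random.uniform(0, delay))   # full jitter avoids synchronized retries

class CircuitBreaker:
    """Open after `threshold` failures inside a sliding `window` (seconds)."""
    def __init__(self, threshold: int = 5, window: float = 30.0):
        self.threshold, self.window, self.failures = threshold, window, []

    def record_failure(self) -> bool:
        now = time.time()
        self.failures = [t for t in self.failures if now - t < self.window] + [now]
        return len(self.failures) >= self.threshold   # True: stop calling and alert Ops
```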
9. Pre-Requisites and Deployment Considerations
9.1 CCP Pre-Requisites
- Wildcard SSL certificates for CCP hosting and dynamic customer account URLs
- Load Balancer VIPs for each CCP endpoint (console, API gateway, orbiter, auth)
- DNS server with credentials to create dynamic domains per customer account
- Accessible container registry for CCP component images
- Kubernetes-compliant storage with high IOPS performance (NVMe-backed NFS)
- SMTP server credentials for CCP Notification Service
- NTP and DNS server connectivity
- Connectivity and API credentials to integrate with the platform
- Private network link to Coredge GPUaaS integration API endpoints
9.2 Coredge GPUaaS Pre-Requisites
- MAAS HA controller accessible at 172.26.5.8 with IPMI credentials for all GPU node BMCs
- Cisco NDFC access (REST API) for VRF/VLAN automation
- NVIDIA UFM management interface access for InfiniBand PKey management
- DDN AI400 MGS SSH access for Storage Plugin tenant directory management
- VAST Data CSI driver deployed in GPUaaS management Kubernetes cluster
- Palo Alto firewall API access for tenant network rule management
- CCP Keycloak JWKS URI reachable from Coredge GPUaaS (for JWT validation)
- Private network link to CCP for webhook delivery and integration API calls
- NVIDIA GPU drivers and GPU Operator images in accessible container registry
9.3 Deployment Constraints
- CCP must be deployed in the control plane of each availability zone, not in the workload pod.
- The Coredge GPUaaS management cluster must be on a dedicated infrastructure separate from GPU workload nodes.
- All VMs within a cluster (Postgres, Kubernetes) must have anti-affinity rules enabled to prevent co-location on a single physical host.
- Database clusters use a 3+3 node setup (3 VMs per AZ). The two-AZ setup requires manual failover scripts (developed by Coredge team) due to the absence of a third AZ for automatic arbiter node placement.
- OpenFGA Postgres DB and Global MongoDB VMs are stretched across the 2 AZs per region and routed accordingly.
10. RACI Matrix — Integration Responsibilities
💡Note
R = Responsible | A = Accountable | C = Consulted | I = Informed
| Task | R | A | C | I |
|---|---|---|---|---|
| CCP Major / Minor Upgrade | Coredge | Coredge | Coredge | Coredge |
| OS Patching — CCP Cluster VMs | Coredge | Coredge | Coredge | Coredge |
| CCP Kubernetes Cluster Patching | Coredge | Coredge | Coredge | Coredge |
| GPU Node OS Patching (MAAS golden image update) | Coredge | Coredge | Coredge | Coredge |
| Infrastructure for CCP Management Cluster | Coredge | Coredge | Coredge | Coredge |
| Storage Driver Plugin for CCP PVCs | Coredge | Coredge | Coredge | Coredge |
| SSL Certificates and LB Config | Coredge | Coredge | Coredge | Coredge |
| Keycloak Federation Configuration (CCP ↔ GPUaaS) | Coredge | Coredge | Coredge | Coredge |
| Integration APIs (CCP) | Coredge | Coredge | Coredge | Coredge |
| Integration API Development (CCP ↔ GPUaaS) | Coredge | Coredge | Coredge | Coredge |
| Cisco NDFC VRF/VLAN Configuration | Coredge | Coredge | Coredge | Coredge |
| Palo Alto Firewall Rule Management | Coredge | Coredge | Coredge | Coredge |
| DDN Storage Provisioning Setup | Coredge | Coredge | Coredge | Coredge |
| orbiter-metering Rate Card Configuration | Coredge | Coredge | Coredge | Coredge |
| Database Failover Script Development | Coredge | Coredge | Coredge | Coredge |
| Service Catalogue and Rate Card | Coredge | Coredge | Coredge | Coredge |
| Integration Testing Execution | Coredge | Coredge | Coredge | Coredge |
11. Integration Testing Strategy
11.1 Test Categories
| Test Type | Scope | Success Criteria |
|---|---|---|
| Unit / Contract Testing | Individual API endpoint request/response schema validation | 100% schema conformance. All error codes return correct structure. |
| Integration Testing | End-to-end provisioning flow: CCP → GPUaaS → bare-metal node ACTIVE | Node provisioned within SLA. Webhook received. Portal reflects ACTIVE state. |
| IAM / Auth Testing | JWT federation: CCP token accepted by GPUaaS orbiter-auth | All role mappings enforce correct RBAC + ABAC. Expired/tampered tokens rejected. |
| Quota Testing | Quota enforcement: resource creation blocked when quota exceeded | HTTP 422 returned immediately on quota breach. Dashboard reflects 90% warning. |
| Failover Testing | CCP AZ failover: traffic switches from AZ1 to AZ2 | Recovery within RTO. No data loss. Tenant sessions restored. |
| Billing Accuracy Testing | GPU metering: DCGM data flows through the full pipeline to the billing export | GPU-hours within 1% variance of expected. Correct tenant attribution. |
| Monitoring Integration | Alert delivery: GPU temp alert fires and reaches CCP Notification Service | Alert delivered within 60 seconds of threshold breach. |
| Performance Testing (CCP) | CCP portal load: 50,000 VMs and 200,000 pods under management | Portal response < 3s P95. No degradation at peak load. |
11.2 Exclusions from Test Scope
- Penetration testing (scheduled separately, not in scope of this integration IDD)
- Performance testing for infrastructure components other than CCP (handled by respective component owners)
- Day-2 operations testing for underlying bare-metal physical infrastructure
12. Open Items and Known Constraints
| # | Item | Status | Resolution Plan |
|---|---|---|---|
| 1 | Network Load Balancer — no out-of-the-box integration from CCP. Requires Automation Platform. | Open | Integration approach to be defined with Coredge Network team. Automation Platform API contract to be agreed. |
| 2 | Dynamic VRF creation in Coredge requires manual Palo Alto firewall rule addition. | Open | Pre-created (pooled) VRF assignment is the interim production approach. Automated firewall API integration on roadmap. |
| 3 | Database failover script (3+3 node setup, two-AZ region) needs joint development. | In Progress | Script to be developed collaboratively by Coredge. Target completion before MVP1 UAT. |
| 4 | NAT Gateway integration approach (SNAT / Software Appliance) is TBD. | Open | Integration approach to be confirmed by Network team post-MVP1 planning review. |
| 5 | VPN Gateway (Site-to-Site and Point-to-Site) integration uses Zscaler APIs — no out-of-the-box support. | Open | Zscaler API integration to be scoped and agreed as separate work item. |
| 6 | CCP API mapping for Party, Billing Account (BA), and Logical Subscriber Identity (LSI) entities is incomplete. | Open | Mapping to be finalized with Business team guidance. Required before go-live. |
| 7 | MVP2 and MVP3 integration requirements (CDN, DRaaS, MariaDB, NoSQL, Kafka, etc.) are deferred. | Deferred | To be defined post-MVP1. Out of scope for this IDD version. |
13. References
| # | Document | Owner | Version |
|---|---|---|---|
| 1 | Cloud Management Platform (CMP) for Cloud — High Level Design | Coredge | 1.10 (Aug 2025) |
| 2 | Coredge GPUaaS Platform — Technical Reference Document (Unified Architecture Guide) | Coredge Cloud Infrastructure | 1.0 (Feb 2026) |
| 3 | Coredge Cloud — Service Catalogue (portal) | Coredge | Current |
| 4 | Coredge Statement of Work — Cloud CMP Engagement | Coredge | Current |
| 5 | NIST Special Publication 800-53 Rev 5 — Security and Privacy Controls | NIST | Rev 5 |
| 6 | ISO/IEC 27001:2022 — Information Security Management | ISO/IEC | 2022 |
| 7 | Keycloak Server Administration Guide | Red Hat / Keycloak Community | Current |
| 8 | Cisco NDFC — REST API Reference | Cisco | Current |
| 9 | NVIDIA DCGM User Guide | NVIDIA | Current |
| 10 | DDN AI400 Lustre Administration Guide | DDN | Current |