Dflare AI Data Sheet
Modern AI workloads require massive GPU scale, high-throughput data pipelines, strict multi-tenant isolation, and support for both cloud-native and HPC workloads. Traditional cloud and on-prem systems fail to deliver all four simultaneously. Traditional cloud providers impose virtualization overhead and noisy-neighbor effects that degrade training throughput, while on-premises HPC clusters lack the self-service provisioning and lifecycle management that modern AI teams require.
Dflare AI was purpose-built to address this gap — providing unified GPU infrastructure that combines bare metal performance, hardware-enforced isolation, and full lifecycle automation.
Platform Overview
Dflare AI is an enterprise GPU infrastructure platform designed to deliver bare metal performance with cloud-like usability. The platform is composed of four primary layers:
- Access Layer — Portal UI, REST APIs, CLI
- Control Plane — Workflow Orchestrator, Cluster Manager, Network Manager, Identity & Access, Monitoring & Metering
- Data Plane — GPU nodes, Kubernetes / Slurm workloads, InfiniBand fabric, Parallel filesystem
- ML Platform — GPU notebooks, distributed training, LLM inference, fine-tuning, experiment tracking, dataset management
Dflare AI manages bare metal GPU nodes across any infrastructure, offering a unified experience for application and infrastructure lifecycle management. The platform serves enterprises, AI labs, managed service providers, and government organizations, handling GPU provisioning, workload orchestration, security, and billing.
Industry Use Cases
Large-Scale AI Training
Multi-node, multi-GPU distributed training across hundreds or thousands of GPUs over an RDMA-based InfiniBand interconnect. Scale AI training from 8 to 10,000+ GPUs without architectural changes, achieving near-linear scaling.
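As a rough illustration of what "near-linear scaling" means in practice, the sketch below computes scaling efficiency relative to a small-scale baseline. The throughput figures are hypothetical, not Dflare AI benchmarks.

```python
# Illustrative only: scaling-efficiency arithmetic for multi-node training.
# The throughput numbers below are hypothetical, not measured results.

def scaling_efficiency(base_gpus, base_throughput, gpus, throughput):
    """Efficiency relative to perfectly linear scaling from a baseline run."""
    ideal = base_throughput * (gpus / base_gpus)
    return throughput / ideal

# Hypothetical runs: 8 GPUs at 1,000 samples/s, 512 GPUs at 60,800 samples/s.
eff = scaling_efficiency(8, 1_000.0, 512, 60_800.0)
print(f"{eff:.0%}")  # 95%
```

An efficiency near 1.0 at large GPU counts is what the RDMA interconnect is meant to preserve; values well below it usually point to communication overhead dominating compute.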
Unified Kubernetes + Slurm Orchestration
Run both containerized and Slurm workloads on the same bare metal infrastructure with unified networking, storage, security, and billing — managed through a single control plane.
GPU-as-a-Service for Multi-Tenant Environments
Offer dedicated bare metal GPU clusters to multiple tenants, with hardware-level isolation at the InfiniBand switch (partition keys), the filesystem (access control maps), and the network fabric (VRF/VXLAN).
Key Features
Bare Metal GPU Performance
Direct GPU access without virtualization overhead. Hardware-level BIOS and OS tuning pre-applied via golden images. Near-native efficiency (99%+ of bare metal GPU peak). Standard GPU slicing via NVIDIA MIG (Multi-Instance GPU) profiles enables efficient resource utilization.
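To make the MIG slicing concrete, the sketch below does the capacity accounting for an A100-style GPU with 7 compute slices. The profile names and slice counts follow NVIDIA's published MIG profiles, but the real table depends on the GPU model, so treat this as illustrative.

```python
# Sketch of MIG capacity accounting, assuming an A100-style GPU with 7
# compute slices. Profile slice counts follow NVIDIA's published MIG
# profiles; verify against your GPU's supported-profile list.

SLICES_PER_GPU = 7
PROFILE_SLICES = {"1g.5gb": 1, "2g.10gb": 2, "3g.20gb": 3, "7g.40gb": 7}

def fits(requested):
    """Return True if the requested mix of MIG instances fits on one GPU."""
    used = sum(PROFILE_SLICES[p] * n for p, n in requested.items())
    return used <= SLICES_PER_GPU

print(fits({"3g.20gb": 1, "2g.10gb": 2}))  # True: 3 + 4 = 7 slices
print(fits({"3g.20gb": 2, "2g.10gb": 1}))  # False: 6 + 2 = 8 slices
```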
Observability
GPU utilization, cluster health, and job performance monitoring. Metrics collectors, time-series database, and dashboards for real-time observability across compute, network, and storage layers.
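The kind of rollup such a metrics pipeline performs before writing to the time-series database can be sketched as follows; node names, sample values, and the utilization threshold are all hypothetical.

```python
# Toy rollup of GPU-utilization samples per node. Node names, values, and
# the 20% threshold are hypothetical, for illustration only.

from statistics import mean

samples = {
    "gpu-node-01": [0.92, 0.88, 0.95],
    "gpu-node-02": [0.12, 0.08, 0.10],
}

UNDERUTILIZED = 0.20  # flag nodes averaging below 20% utilization

for node, util in samples.items():
    avg = mean(util)
    flag = " (underutilized)" if avg < UNDERUTILIZED else ""
    print(f"{node}: {avg:.0%}{flag}")
```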
Dual-Fabric Network Architecture
Ethernet (VXLAN/EVPN/BGP) for control plane and tenant VPCs. InfiniBand for RDMA-based high-performance GPU-to-GPU and GPU-to-storage communication with Partition Keys for tenant isolation.
Zero Trust Security
Defense-in-depth security model. OAuth2/OIDC identity, RBAC + ABAC authorization, VRF + VLAN + PKey network isolation, storage ACLs, and TLS 1.2+ / mTLS encrypted service-to-service transport.
Central IAM and Multi-Tenancy Controls
Integrates with enterprise identity providers (OIDC/OAuth2). Issues short-lived JWTs, enforces RBAC/ABAC at every API boundary, and isolates tenants via dedicated realms and scoped tokens.
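The short-lived-token pattern can be illustrated with a minimal HS256 JWT built from the standard library. This is a demo only: a real deployment verifies tokens signed by the identity provider (typically RS256/ES256), and the subject, tenant, and secret below are placeholders.

```python
# Minimal HS256 JWT sketch using only the standard library. Demo only:
# production tokens come from the identity provider with asymmetric keys.

import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(subject, tenant, secret, ttl=300):
    """Issue a short-lived token (default 5 minutes) scoped to one tenant."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": subject, "tenant": tenant, "exp": int(time.time()) + ttl}
    signing_input = (
        b64url(json.dumps(header).encode())
        + "."
        + b64url(json.dumps(claims).encode())
    )
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

token = issue_token("alice", "tenant-a", b"demo-secret", ttl=300)
print(token.count("."))  # 2 separators: header.claims.signature
```

The short `exp` claim is what limits the blast radius of a leaked token; API boundaries then apply RBAC/ABAC checks against the `sub` and `tenant` claims.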
Unified Kubernetes + Slurm
Single control plane manages Kubernetes and Slurm. Kubernetes, powered by CKP (Coredge Kubernetes Platform), handles containerized workloads via device plugins. Supports CNCF Certified Kubernetes versions 1.33–1.35. Slurm handles batch workloads with GPU-aware scheduling using GRES. Both share the same underlying nodes, storage, and network.
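On the Slurm side, GPUs are requested through GRES strings of the form name[:type][:count], e.g. `gpu:a100:4`. The sketch below parses that basic form; real Slurm accepts richer syntax (per-node and per-task flags), so this is illustrative only.

```python
# Hedged sketch of parsing a basic Slurm GRES request string of the form
# name[:type][:count]. Real Slurm syntax is richer; illustrative only.

def parse_gres(spec):
    parts = spec.split(":")
    name = parts[0]
    gtype, count = None, 1
    for p in parts[1:]:
        if p.isdigit():
            count = int(p)   # trailing numeric part is the device count
        else:
            gtype = p        # non-numeric part names the device type
    return {"name": name, "type": gtype, "count": count}

print(parse_gres("gpu:a100:4"))  # {'name': 'gpu', 'type': 'a100', 'count': 4}
print(parse_gres("gpu:2"))       # {'name': 'gpu', 'type': None, 'count': 2}
```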
ML Platform
Integrated machine learning environment with GPU notebooks, distributed training, LLM inference with OpenAI-compatible APIs, model fine-tuning, experiment tracking with MLflow, and dataset management.
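"OpenAI-compatible" means clients send the standard chat-completions payload (conventionally POSTed to `/v1/chat/completions`). The sketch below builds such a payload without making a network call; the model name and message contents are placeholders.

```python
# Shape of a request body for an OpenAI-compatible chat-completions
# endpoint. Model id and messages are placeholders; no request is sent.

import json

payload = {
    "model": "my-finetuned-model",  # placeholder: a model served by the platform
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize our GPU utilization."},
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)
print(len(json.loads(body)["messages"]))  # 2
```

Because the wire format matches OpenAI's, existing SDKs and tools can target the platform's inference service by changing only the base URL and credentials.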
Key Benefits
Automated Lifecycle Management
From bare metal power-on to production cluster — fully automated. No SSH, no manual configuration. Provision → operate → monitor → bill.
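The billing end of that pipeline reduces to metering GPU-hours per tenant. A toy rollup, with hypothetical usage records and a hypothetical flat rate:

```python
# Toy GPU-hour metering of the kind the billing step might perform.
# Usage records and the per-GPU-hour rate are hypothetical.

records = [
    {"tenant": "tenant-a", "gpus": 8, "hours": 12.0},
    {"tenant": "tenant-a", "gpus": 16, "hours": 2.5},
    {"tenant": "tenant-b", "gpus": 4, "hours": 100.0},
]

RATE_PER_GPU_HOUR = 2.0  # hypothetical flat rate, in dollars

usage = {}
for r in records:
    usage[r["tenant"]] = usage.get(r["tenant"], 0.0) + r["gpus"] * r["hours"]

for tenant, gpu_hours in sorted(usage.items()):
    print(f"{tenant}: {gpu_hours:.1f} GPU-hours, ${gpu_hours * RATE_PER_GPU_HOUR:.2f}")
```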
Near-Bare-Metal Performance
Eliminate virtualization overhead — achieve near-bare-metal performance with cloud-like operational simplicity. Zero hypervisor, direct device access.
Hardware-Level Tenant Isolation
Isolation at the InfiniBand switch (partition keys), the filesystem (access control maps), and the network fabric (VRF/VXLAN). Enforce strict multi-tenancy through combined hardware and software controls.
Comprehensive Observability
Real-time observability across compute, network, and storage layers enables proactive capacity management and rapid incident response.
Scalable Multi-Cluster Management
Scale AI training from 8 to 10,000+ GPUs without architectural changes. Horizontally scalable GPU nodes with leaf-spine fabric expansion.
Zero Trust Security
Defense-in-depth security model based on zero-trust principles protects the managed GPU infrastructure at the identity, network, storage, and transport layers.
Contact
For more information or questions about Coredge's Dflare AI:
- Website: https://coredge.io
- Email: info@coredge.io