Monitoring Services
Business Value: Full operational visibility into platform health, workload performance, and resource utilization for both platform operators and tenants, with proactive alerting and structured incident response. CCP monitoring is designed to detect problems before users are impacted.
Monitoring Service Portfolio
| Service | Phase | Description |
|---|---|---|
| Log Analyzer | MVP1 | Centralized log aggregation, search, and analysis for platform and workload logs |
| Operational Metrics | MVP1 | Real-time collection and visualization of infrastructure and workload performance metrics |
| Alarm Service | MVP1 | Configurable threshold-based and anomaly-driven alerting across all monitored resources |
| Notification Service | MVP1 | Multi-channel alert delivery — email, SMS, webhook, and portal notifications |
Monitoring Architecture
CCP monitoring is delivered through the following integrated toolsets:
- Zabbix v7.4.3: Operational metrics collection, alarm management, and notification delivery. The primary tool for infrastructure-level monitoring — servers, network devices, services, and platform components.
- Prometheus & Grafana v9.4.3: Cluster and database health monitoring. Prometheus collects time-series metrics from Kubernetes clusters, OpenStack nodes, and database components. Grafana provides rich visualization dashboards.
- APM / NPM / IPM Integration: Application performance, network performance, and infrastructure performance monitoring tools integrated via the Log Analyzer service.
Tier 1 — Platform Operations Monitoring
Platform operations monitoring is always active, deployed alongside the CCP control plane for the service provider's operations team. It covers the full infrastructure stack:
Infrastructure Metrics
- Compute Nodes: CPU utilization, memory usage, disk I/O, network throughput, and temperature
- Kubernetes Clusters: Control plane health (etcd, API server, scheduler, controller-manager), node readiness, pod status
- Storage Systems: Volume utilization, IOPS, throughput, latency, and capacity thresholds
- Network: Switch and router health, bandwidth utilization, packet loss, and latency
- OpenStack Services: Nova, Neutron, Cinder, Keystone, Glance service health and API response times
- Platform Microservices: Health endpoints for all CCP microservices; alerting on service degradation
Platform Service Health
- Database Health: PostgreSQL replication lag, MongoDB replica set status, Redis cluster health
- Kafka Queues: Consumer lag, topic throughput, broker availability
- API Gateway: Request rates, error rates, and latency percentiles
- Identity Service: Keycloak login rates, token error rates, realm availability
Tier 2 — Tenant Workload Monitoring
Tenant workload monitoring is available as a service from the Self-Service Console. When enabled for a cell or cluster, it deploys monitoring agents within the tenant's environment with full multi-tenant isolation.
Tenant Monitoring Capabilities
- VM Metrics: Per-VM CPU, memory, disk I/O, and network metrics visible in the tenant portal
- Kubernetes Cluster Metrics: Pod health, deployment status, namespace resource usage, HPA status
- Container Metrics: Container-level CPU and memory usage; restart counts and OOM events
- Custom Dashboards: Tenants can create custom Grafana dashboards based on their workload metrics
- Multi-Tenant Isolation: Tenant A's metrics are completely isolated from Tenant B's; there is no cross-tenant metric visibility, even when tenants share infrastructure
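The isolation guarantee above can be illustrated with a minimal sketch. It assumes every metric sample carries a tenant label and that every query is scoped to the caller's tenant before any other filtering; the `Sample` type, the `tenant_id` label, and `query_metrics` are hypothetical names for illustration, not part of the CCP API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    tenant_id: str  # hypothetical label assumed to be attached to every sample
    metric: str
    value: float


def query_metrics(samples, caller_tenant_id, metric):
    """Return only the samples that belong to the calling tenant."""
    return [
        s for s in samples
        if s.tenant_id == caller_tenant_id and s.metric == metric
    ]


# Two tenants share the same backing store in this illustration.
samples = [
    Sample("tenant-a", "vm_cpu_percent", 41.0),
    Sample("tenant-b", "vm_cpu_percent", 87.5),
    Sample("tenant-a", "vm_cpu_percent", 43.5),
]

tenant_a_view = query_metrics(samples, "tenant-a", "vm_cpu_percent")
```

In this sketch the tenant filter is applied by the query function itself rather than supplied by the caller, so a tenant cannot widen its own view by changing query parameters.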
Database Monitoring
For DBaaS instances, dedicated monitoring is available via Prometheus exporters:
- PostgreSQL: Query performance, connection counts, replication lag, cache hit ratio
- MongoDB: Operation counts, replication status, index usage, document throughput
- MS SQL: Query wait times, blocking, transaction log usage, availability group health
- MariaDB: Thread status, slow query log, InnoDB buffer pool hit rate
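Several of the indicators above are derived from raw counters rather than read directly. As one example, a cache hit ratio is commonly computed as hits divided by total block requests. The sketch below assumes two cumulative counters similar to the `blks_hit` and `blks_read` columns of PostgreSQL's `pg_stat_database` view; how the CCP exporters actually derive the value is not specified here, and the function name is hypothetical.

```python
def cache_hit_ratio(blocks_hit, blocks_read):
    """Fraction of block requests served from cache.

    Returns None when there has been no activity, so that an idle
    database is not reported as having a 0% hit ratio.
    """
    total = blocks_hit + blocks_read
    if total == 0:
        return None
    return blocks_hit / total


# Illustrative counter values, not measurements from a real instance.
ratio = cache_hit_ratio(blocks_hit=990, blocks_read=10)
```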
Log Analyzer
The Log Analyzer service provides centralized log aggregation and search for platform and workload logs:
- Platform Logs: All CCP microservice logs, API gateway logs, authentication events, and infrastructure logs aggregated centrally
- Workload Logs: Application logs from VMs and containers forwarded to Log Analyzer for storage and search
- Security Logs: Firewall logs, VPN access logs, and access control decision logs forwarded to SIEM
- Log Retention: Configurable retention per log category; compliance-driven retention policies for security and audit logs
- Search Interface: Full-text search across log streams with time-range filtering and structured field queries
- Integration: APM / NPM / IPM tools integrate with Log Analyzer for correlated application and infrastructure analysis
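The search capabilities listed above combine three kinds of filters: full-text match, time range, and structured field equality. The following is a minimal in-memory sketch of how such a query composes; the record layout and the `search_logs` function are assumptions for illustration and do not reflect the Log Analyzer's actual query interface.

```python
from datetime import datetime, timezone


def search_logs(records, text=None, start=None, end=None, **fields):
    """Filter log records by substring, half-open time range, and field values."""
    results = []
    for rec in records:
        ts = rec["timestamp"]
        if start is not None and ts < start:
            continue
        if end is not None and ts >= end:
            continue
        if text is not None and text.lower() not in rec["message"].lower():
            continue
        # Every structured field given by the caller must match exactly.
        if any(rec.get(key) != value for key, value in fields.items()):
            continue
        results.append(rec)
    return results


UTC = timezone.utc
logs = [
    {"timestamp": datetime(2024, 1, 1, 10, 0, tzinfo=UTC),
     "service": "api-gateway", "level": "ERROR", "message": "Upstream timeout"},
    {"timestamp": datetime(2024, 1, 1, 10, 5, tzinfo=UTC),
     "service": "keycloak", "level": "INFO", "message": "Login succeeded"},
    {"timestamp": datetime(2024, 1, 1, 11, 0, tzinfo=UTC),
     "service": "api-gateway", "level": "ERROR", "message": "Upstream timeout again"},
]

hits = search_logs(
    logs,
    text="timeout",
    start=datetime(2024, 1, 1, 9, 30, tzinfo=UTC),
    end=datetime(2024, 1, 1, 10, 30, tzinfo=UTC),
    service="api-gateway",
)
```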
Alarm Service
The Alarm Service provides configurable, threshold-based alerting across all monitored resources:
Pre-Configured Alert Thresholds
| Alert Category | Trigger Condition | Default Action |
|---|---|---|
| CPU Utilization | Sustained utilization above configurable threshold | Alert operations team; capacity planning review |
| Memory Utilization | Node-level memory above configurable threshold | Capacity warning; alert operations |
| Storage Capacity | Volume or pool utilization above threshold | Capacity warning to tenant; escalation to operations |
| Kubernetes Node Not Ready | Node transitions to NotReady state | Immediate operations alert; auto-recovery attempt |
| etcd Leader Changes | etcd leader elections exceed a configurable rate | Stability investigation; alert operations |
| API Error Rate | HTTP 5xx error rate above threshold | Service degradation alert; escalation |
| Database Replication Lag | Replication lag exceeds threshold | HA risk alert; DR readiness review |
| Backup Failure | Scheduled backup job fails or misses window | Immediate alert; data protection risk escalation |
| Service Unavailability | CCP microservice health check fails | Platform alert; auto-restart attempt |
Custom Alarm Configuration
Users and administrators can create custom alarms on any metric, configuring:
- Threshold value and comparison operator
- Evaluation period (number of consecutive data points)
- Severity level (info, warning, critical)
- Notification channel (email, SMS, webhook, portal)
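The four configurable settings map naturally onto a small rule object. The sketch below shows one plausible reading of the evaluation period: the alarm fires only when the most recent N consecutive data points all breach the threshold. The `AlarmRule` class and its field names are hypothetical and are not the Alarm Service's actual schema.

```python
import operator
from dataclasses import dataclass

# Supported comparison operators (an assumed set for illustration).
_OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass
class AlarmRule:
    metric: str
    threshold: float
    comparison: str          # one of the keys in _OPERATORS
    evaluation_periods: int  # consecutive data points that must breach
    severity: str            # "info", "warning", or "critical"
    channels: tuple = ("portal",)

    def __post_init__(self):
        if self.comparison not in _OPERATORS:
            raise ValueError(f"unsupported operator: {self.comparison}")
        if self.evaluation_periods < 1:
            raise ValueError("evaluation_periods must be at least 1")

    def is_firing(self, datapoints):
        """True when the latest `evaluation_periods` points all breach."""
        recent = datapoints[-self.evaluation_periods:]
        if len(recent) < self.evaluation_periods:
            return False  # not enough data yet to decide
        compare = _OPERATORS[self.comparison]
        return all(compare(value, self.threshold) for value in recent)


rule = AlarmRule("cpu_percent", 80.0, ">", 3, "critical", ("email", "sms"))
firing = rule.is_firing([70.0, 85.0, 90.0, 95.0])
```

Requiring consecutive breaches instead of a single one is a common way to avoid alerting on short spikes, at the cost of slightly slower detection.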
Notification Service
The Notification Service delivers alerts and system events through multiple channels:
- Email (SMTP): Alert emails delivered via configured SMTP server; supports HTML templates for rich notifications
- SMS: Critical alert SMS to on-call phone numbers via configured SMS gateway
- Portal Notifications: Real-time in-portal notifications via SocketIO push; visible in the notification bell in the console
- Webhook: POST notifications to external systems (PagerDuty, OpsGenie, Slack, custom ITSM systems) for alert integration
- Notification Policies: Configurable escalation rules that alert the user first, then escalate to an administrator if the alert remains unacknowledged within a configured time window
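The escalation rule described in the notification policy can be sketched as a pure function of the alert's age and acknowledgement state. The function name, the `"user"` and `"admin"` target labels, and the 15-minute window used in the example are illustrative assumptions, not documented defaults.

```python
from datetime import datetime, timedelta


def notification_targets(raised_at, now, acknowledged, escalation_window):
    """Return who should have been notified for an alert as of `now`."""
    targets = ["user"]  # the user is always alerted first
    unacknowledged_too_long = (
        not acknowledged and now - raised_at >= escalation_window
    )
    if unacknowledged_too_long:
        targets.append("admin")
    return targets


raised = datetime(2024, 1, 1, 12, 0)
window = timedelta(minutes=15)  # assumed value for illustration

early = notification_targets(raised, raised + timedelta(minutes=5), False, window)
late = notification_targets(raised, raised + timedelta(minutes=20), False, window)
acked = notification_targets(raised, raised + timedelta(minutes=20), True, window)
```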
Incident Management
Platform incidents follow a structured severity classification:
| Severity | Definition | Response Model |
|---|---|---|
| Critical | Platform-wide outage or data loss risk | Immediate response; war room activation; executive notification |
| High | Major service degradation affecting multiple tenants | Rapid response; senior engineer engaged; tenant notification |
| Medium | Service degradation for single tenant or service | Standard response; investigation and resolution within SLA |
| Low | Minor issue with available workaround | Next-business-day response; ticket tracking |
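The severity definitions in the table can be read as an ordered set of checks, with the most severe condition tested first. The sketch below encodes that reading; the input flags and the `classify_incident` function are hypothetical, and in practice classification is likely to involve operator judgment rather than a fixed rule.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


def classify_incident(platform_wide_outage, data_loss_risk,
                      service_degraded, tenants_affected):
    """Map incident attributes to a severity, most severe check first."""
    if platform_wide_outage or data_loss_risk:
        return Severity.CRITICAL
    if service_degraded and tenants_affected > 1:
        return Severity.HIGH
    if service_degraded:
        return Severity.MEDIUM
    # Assumption: anything without service degradation is a minor issue.
    return Severity.LOW


severity = classify_incident(
    platform_wide_outage=False,
    data_loss_risk=False,
    service_degraded=True,
    tenants_affected=3,
)
```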