Monitoring Services

Business Value: Full operational visibility into platform health, workload performance, and resource utilization for both platform operators and tenants, with proactive alerting and structured incident response. CCP monitoring is designed to detect problems before users are impacted.

Monitoring Service Portfolio

Service	Phase	Description
Log Analyzer	MVP1	Centralized log aggregation, search, and analysis for platform and workload logs
Operational Metrics	MVP1	Real-time collection and visualization of infrastructure and workload performance metrics
Alarm Service	MVP1	Configurable threshold-based and anomaly-driven alerting across all monitored resources
Notification Service	MVP1	Multi-channel alert delivery — email, SMS, webhook, and portal notifications

Monitoring Architecture

CCP monitoring is delivered through two integrated toolsets, complemented by APM / NPM / IPM integrations:

Zabbix v7.4.3: Operational metrics collection, alarm management, and notification delivery. The primary tool for infrastructure-level monitoring — servers, network devices, services, and platform components.
Prometheus & Grafana v9.4.3: Cluster and database health monitoring. Prometheus collects time-series metrics from Kubernetes clusters, OpenStack nodes, and database components. Grafana provides rich visualization dashboards.
APM / NPM / IPM Integration: Application performance, network performance, and infrastructure performance monitoring tools integrated via the Log Analyzer service.

Tier 1 — Platform Operations Monitoring

Platform operations monitoring is always active, deployed alongside the CCP control plane for the service provider's operations team. It covers the full infrastructure stack:

Infrastructure Metrics

Compute Nodes: CPU utilization, memory usage, disk I/O, network throughput, and temperature
Kubernetes Clusters: Control plane health (etcd, API server, scheduler, controller-manager), node readiness, pod status
Storage Systems: Volume utilization, IOPS, throughput, latency, and capacity thresholds
Network: Switch and router health, bandwidth utilization, packet loss, and latency
OpenStack Services: Nova, Neutron, Cinder, Keystone, Glance service health and API response times
Platform Microservices: Health endpoints for all CCP microservices; alerting on service degradation

Platform Service Health

Database Health: PostgreSQL replication lag, MongoDB replica set status, Redis cluster health
Kafka Queues: Consumer lag, topic throughput, broker availability
API Gateway: Request rates, error rates, and latency percentiles
Identity Service: Keycloak login rates, token error rates, realm availability
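
As an illustration of the Kafka consumer lag metric listed above, the following minimal sketch computes lag as the difference between each partition's log end offset and the consumer group's committed offset. The function names and sample offsets are hypothetical and do not describe how CCP collects this metric; in practice the offsets would come from the Kafka brokers.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Return per-partition lag: log end offset minus committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts as fully unread.
        done = committed.get(partition, 0)
        lag[partition] = max(end - done, 0)  # clamp so lag is never negative
    return lag


def total_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum lag across all partitions of a topic for one consumer group."""
    return sum(consumer_lag(end_offsets, committed).values())


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 980, 2: 2100}
committed = {0: 1450, 1: 980, 2: 1900}

print(consumer_lag(end_offsets, committed))  # {0: 50, 1: 0, 2: 200}
print(total_lag(end_offsets, committed))     # 250
```

A sustained increase in total lag, rather than a single high reading, is usually the more useful alerting signal, because short bursts of lag are normal under load.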

Tier 2 — Tenant Workload Monitoring

Tenant workload monitoring is available as a service from the Self-Service Console. When enabled for a cell or cluster, it deploys monitoring agents within the tenant's environment with full multi-tenant isolation.

Tenant Monitoring Capabilities

VM Metrics: Per-VM CPU, memory, disk I/O, and network metrics visible in the tenant portal
Kubernetes Cluster Metrics: Pod health, deployment status, namespace resource usage, HPA status
Container Metrics: Container-level CPU and memory usage; restart counts and OOM events
Custom Dashboards: Tenants can create custom Grafana dashboards based on their workload metrics
Multi-Tenant Isolation: Each tenant's metrics are isolated from those of every other tenant; there is no cross-tenant metric visibility, even when tenants share infrastructure
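
One common way to enforce this kind of isolation is to scope every metric query by a tenant label taken from the authenticated session instead of from user input. The sketch below illustrates that idea only; the `Sample` type, the `tenant_id` label name, and `query_for_tenant` are assumptions made for the example and are not taken from the CCP implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    name: str               # metric name
    labels: Dict[str, str]  # metric labels, including the owning tenant
    value: float


def query_for_tenant(samples: List[Sample], session_tenant: str, name: str) -> List[Sample]:
    """Return only samples whose tenant label matches the session's tenant.

    The tenant comes from the authenticated session, never from the query
    itself, so a caller cannot request another tenant's data.
    Samples without a tenant label are never returned.
    """
    return [
        s for s in samples
        if s.name == name and s.labels.get("tenant_id") == session_tenant
    ]


# Hypothetical samples from two tenants on shared infrastructure.
samples = [
    Sample("vm_cpu_percent", {"tenant_id": "tenant-a", "vm": "web-1"}, 41.0),
    Sample("vm_cpu_percent", {"tenant_id": "tenant-b", "vm": "db-1"}, 77.5),
    Sample("vm_cpu_percent", {"vm": "unlabelled"}, 5.0),
]

visible = query_for_tenant(samples, "tenant-a", "vm_cpu_percent")
print([s.labels["vm"] for s in visible])  # ['web-1']
```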

Database Monitoring

For DBaaS instances, dedicated monitoring is available via Prometheus exporters:

PostgreSQL: Query performance, connection counts, replication lag, cache hit ratio
MongoDB: Operation counts, replication status, index usage, document throughput
MS SQL: Query wait times, blocking, transaction log usage, availability group health
MariaDB: Thread status, slow query log, InnoDB buffer pool hit rate
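
To make the hit-ratio metrics above concrete, the sketch below shows how they are commonly derived from database counters: `blks_hit` and `blks_read` from PostgreSQL's `pg_stat_database` view, and `Innodb_buffer_pool_read_requests` and `Innodb_buffer_pool_reads` from MariaDB's status variables. The Python function names are hypothetical, and this is not necessarily how the CCP exporters compute the values.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests served from memory; 0.0 when there was no activity."""
    total = hits + misses
    # Returning 0.0 for an idle database is an arbitrary choice for this sketch.
    return hits / total if total else 0.0


def postgres_cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """blks_hit: reads served from shared buffers; blks_read: reads that went to disk."""
    return hit_ratio(blks_hit, blks_read)


def innodb_buffer_pool_hit_rate(read_requests: int, reads: int) -> float:
    """read_requests: logical reads; reads: logical reads that required disk I/O."""
    return hit_ratio(read_requests - reads, reads)


# Hypothetical counter values.
print(postgres_cache_hit_ratio(blks_hit=9900, blks_read=100))          # 0.99
print(innodb_buffer_pool_hit_rate(read_requests=10000, reads=50))      # 0.995
```

Because these counters are cumulative, a monitoring system would normally compute the ratio over the change between two scrapes instead of over the lifetime totals.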

Log Analyzer

The Log Analyzer service provides centralized log aggregation and search for platform and workload logs:

Platform Logs: All CCP microservice logs, API gateway logs, authentication events, and infrastructure logs aggregated centrally
Workload Logs: Application logs from VMs and containers forwarded to Log Analyzer for storage and search
Security Logs: Firewall logs, VPN access logs, and access control decision logs forwarded to SIEM
Log Retention: Configurable retention per log category; compliance-driven retention policies for security and audit logs
Search Interface: Full-text search across log streams with time-range filtering and structured field queries
Integration: APM / NPM / IPM tools integrate with Log Analyzer for correlated application and infrastructure analysis
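
The search behavior described above (text matching combined with time-range filtering and structured field queries) can be sketched with a small in-memory example. The `LogRecord` type and `search` function are hypothetical and are not the Log Analyzer API; the case-insensitive substring match is a simplified stand-in for a real full-text index.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class LogRecord:
    timestamp: datetime
    message: str
    fields: Dict[str, str] = field(default_factory=dict)  # structured fields


def search(
    records: List[LogRecord],
    text: Optional[str] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    **field_filters: str,
) -> List[LogRecord]:
    """Return records in [start, end) whose message contains `text`
    and whose structured fields equal every given filter."""
    results = []
    for record in records:
        if start is not None and record.timestamp < start:
            continue
        if end is not None and record.timestamp >= end:
            continue
        if text is not None and text.lower() not in record.message.lower():
            continue
        if any(record.fields.get(k) != v for k, v in field_filters.items()):
            continue
        results.append(record)
    return results


# Hypothetical log records.
records = [
    LogRecord(datetime(2024, 1, 1, 10, 0), "Login failed for user alice", {"service": "identity", "level": "warn"}),
    LogRecord(datetime(2024, 1, 1, 10, 5), "Request timeout upstream", {"service": "gateway", "level": "error"}),
    LogRecord(datetime(2024, 1, 1, 11, 0), "Login failed for user bob", {"service": "identity", "level": "warn"}),
]

hits = search(records, text="login failed", end=datetime(2024, 1, 1, 10, 30), service="identity")
print([r.message for r in hits])  # ['Login failed for user alice']
```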

Alarm Service

The Alarm Service provides configurable threshold-based and anomaly-driven alerting across all monitored resources:

Pre-Configured Alert Thresholds

Alert Category	Trigger Condition	Default Action
CPU Utilization	Sustained utilization above configurable threshold	Alert operations team; capacity planning review
Memory Utilization	Node-level memory above configurable threshold	Capacity warning; alert operations
Storage Capacity	Volume or pool utilization above threshold	Capacity warning to tenant; escalation to operations
Kubernetes Node Not Ready	Node transitions to NotReady state	Immediate operations alert; auto-recovery attempt
etcd Leader Changes	etcd leader election above configurable rate	Stability investigation; alert operations
API Error Rate	HTTP 5xx error rate above threshold	Service degradation alert; escalation
Database Replication Lag	Replication lag exceeds threshold	HA risk alert; DR readiness review
Backup Failure	Scheduled backup job fails or misses window	Immediate alert; data protection risk escalation
Service Unavailability	CCP microservice health check fails	Platform alert; auto-restart attempt

Custom Alarm Configuration

Users and administrators can create custom alarms on any metric with configurable:
- Threshold value and comparison operator
- Evaluation period (number of consecutive data points)
- Severity level (info, warning, critical)
- Notification channel (email, SMS, webhook, portal)
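
The settings above can be illustrated with a minimal evaluation sketch: an alarm fires only when the most recent N consecutive data points all breach the threshold. The `AlarmRule` class, its field names, and `is_firing` are assumptions made for this example and do not reflect the actual Alarm Service schema.

```python
import operator
from dataclasses import dataclass
from typing import Sequence

# Supported comparison operators (an assumed set for this sketch).
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
SEVERITIES = ("info", "warning", "critical")


@dataclass
class AlarmRule:
    metric: str
    comparison: str          # one of OPERATORS
    threshold: float
    evaluation_periods: int  # consecutive data points that must breach
    severity: str            # one of SEVERITIES
    channels: Sequence[str]  # e.g. email, SMS, webhook, portal

    def __post_init__(self) -> None:
        if self.comparison not in OPERATORS:
            raise ValueError(f"unsupported operator: {self.comparison}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unsupported severity: {self.severity}")
        if self.evaluation_periods < 1:
            raise ValueError("evaluation_periods must be at least 1")


def is_firing(rule: AlarmRule, datapoints: Sequence[float]) -> bool:
    """True when the last `evaluation_periods` points all breach the threshold."""
    if len(datapoints) < rule.evaluation_periods:
        return False  # not enough data to evaluate yet
    compare = OPERATORS[rule.comparison]
    recent = datapoints[-rule.evaluation_periods:]
    return all(compare(value, rule.threshold) for value in recent)


rule = AlarmRule("cpu_percent", ">", 85.0, 3, "warning", ["email", "portal"])
print(is_firing(rule, [70, 90, 91, 95]))  # True: last three points exceed 85
print(is_firing(rule, [90, 91, 80]))      # False: latest point is below 85
```

Requiring several consecutive breaches trades a short detection delay for fewer alerts caused by momentary spikes.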

Notification Service

The Notification Service delivers alerts and system events through multiple channels:

Email (SMTP): Alert emails delivered via configured SMTP server; supports HTML templates for rich notifications
SMS: Critical alert SMS to on-call phone numbers via configured SMS gateway
Portal Notifications: Real-time in-portal notifications via SocketIO push; visible in the notification bell in the console
Webhook: POST notifications to external systems (PagerDuty, OpsGenie, Slack, custom ITSM systems) for alert integration
Notification Policies: Configurable escalation rules: alert the user first, then escalate to an administrator if the alert is not acknowledged within a configured time window
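
The escalation rule in the notification policy can be sketched as follows: the user is always notified first, and the administrator is added only if the alert was not acknowledged before the escalation window closed. The `Alert` type, the role names, and `notified_by` are hypothetical and do not describe the Notification Service's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Alert:
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None  # None while unacknowledged


def notified_by(alert: Alert, now: datetime, window: timedelta) -> List[str]:
    """Return the roles that should have been notified by time `now`."""
    targets = ["user"]  # the user is always alerted first
    deadline = alert.raised_at + window
    acked_in_time = (
        alert.acknowledged_at is not None and alert.acknowledged_at < deadline
    )
    # Escalate once the window has elapsed without a timely acknowledgement.
    if now >= deadline and not acked_in_time:
        targets.append("admin")
    return targets


raised = datetime(2024, 1, 1, 12, 0)
window = timedelta(minutes=15)

print(notified_by(Alert(raised), raised + timedelta(minutes=10), window))  # ['user']
print(notified_by(Alert(raised), raised + timedelta(minutes=20), window))  # ['user', 'admin']
```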

Incident Management

Platform incidents follow a structured severity classification:

Severity	Definition	Response Model
Critical	Platform-wide outage or data loss risk	Immediate response; war room activation; executive notification
High	Major service degradation affecting multiple tenants	Rapid response; senior engineer engaged; tenant notification
Medium	Service degradation for single tenant or service	Standard response; investigation and resolution within SLA
Low	Minor issue with available workaround	Next-business-day response; ticket tracking
