Monitoring Services

Business Value: Full operational visibility into platform health, workload performance, and resource utilization for both platform operators and tenants, with proactive alerting and structured incident response. CCP monitoring is designed to detect problems before users are affected.

Monitoring Service Portfolio

Service | Phase | Description
Log Analyzer | MVP1 | Centralized log aggregation, search, and analysis for platform and workload logs
Operational Metrics | MVP1 | Real-time collection and visualization of infrastructure and workload performance metrics
Alarm Service | MVP1 | Configurable threshold-based and anomaly-driven alerting across all monitored resources
Notification Service | MVP1 | Multi-channel alert delivery: email, SMS, webhook, and portal notifications

Monitoring Architecture

CCP monitoring is delivered through two integrated toolsets, complemented by integrations with external performance monitoring tools:

  • Zabbix v7.4.3: Operational metrics collection, alarm management, and notification delivery. The primary tool for infrastructure-level monitoring — servers, network devices, services, and platform components.
  • Prometheus & Grafana v9.4.3: Cluster and database health monitoring. Prometheus collects time-series metrics from Kubernetes clusters, OpenStack nodes, and database components. Grafana provides rich visualization dashboards.
  • APM / NPM / IPM Integration: Application performance, network performance, and infrastructure performance monitoring tools integrated via the Log Analyzer service.

Tier 1 — Platform Operations Monitoring

Platform operations monitoring is always active, deployed alongside the CCP control plane for the service provider's operations team. It covers the full infrastructure stack:

Infrastructure Metrics

  • Compute Nodes: CPU utilization, memory usage, disk I/O, network throughput, and temperature
  • Kubernetes Clusters: Control plane health (etcd, API server, scheduler, controller-manager), node readiness, pod status
  • Storage Systems: Volume utilization, IOPS, throughput, latency, and capacity thresholds
  • Network: Switch and router health, bandwidth utilization, packet loss, and latency
  • OpenStack Services: Nova, Neutron, Cinder, Keystone, Glance service health and API response times
  • Platform Microservices: Health endpoints for all CCP microservices; alerting on service degradation

Platform Service Health

  • Database Health: PostgreSQL replication lag, MongoDB replica set status, Redis cluster health
  • Kafka Queues: Consumer lag, topic throughput, broker availability
  • API Gateway: Request rates, error rates, and latency percentiles
  • Identity Service: Keycloak login rates, token error rates, realm availability
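
To illustrate the Kafka consumer lag metric listed above: lag per partition is conventionally the difference between the partition's latest (log end) offset and the consumer group's committed offset. The sketch below is hypothetical. The function name and the plain-dictionary inputs are assumptions made for illustration and do not describe how CCP or Zabbix actually collects these values.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Return per-partition lag: log end offset minus committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Simplifying assumption: a partition with no committed offset
        # is counted from offset 0, i.e. as entirely unread.
        done = committed.get(partition, 0)
        # Clamp at zero so a stale end-offset reading never yields negative lag.
        lag[partition] = max(end - done, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 980, 2: 40}
committed = {0: 1450, 1: 980}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
```

A sustained increase in the total, rather than a single high reading, is usually the more useful alerting signal, because brief spikes are normal during bursts of traffic.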

Tier 2 — Tenant Workload Monitoring

Tenant workload monitoring is available as a service from the Self-Service Console. When enabled for a cell or cluster, it deploys monitoring agents within the tenant's environment with full multi-tenant isolation.

Tenant Monitoring Capabilities

  • VM Metrics: Per-VM CPU, memory, disk I/O, and network metrics visible in the tenant portal
  • Kubernetes Cluster Metrics: Pod health, deployment status, namespace resource usage, HPA status
  • Container Metrics: Container-level CPU and memory usage; restart counts and OOM events
  • Custom Dashboards: Tenants can create custom Grafana dashboards based on their workload metrics
  • Multi-Tenant Isolation: Tenant A's metrics are completely isolated from Tenant B's, with no cross-tenant metric visibility regardless of shared infrastructure
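
The source does not describe how isolation is enforced. One common approach, shown here purely as an assumption, is to tag every sample with its owning tenant and scope every read to the requesting tenant. The `Sample` type, the `visible_to` function, and the metric names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    tenant_id: str   # owner of the sample (hypothetical field)
    metric: str
    value: float


def visible_to(tenant_id: str, samples: List[Sample]) -> List[Sample]:
    """Return only the samples owned by the requesting tenant."""
    return [s for s in samples if s.tenant_id == tenant_id]


samples = [
    Sample("tenant-a", "vm_cpu_percent", 41.0),
    Sample("tenant-b", "vm_cpu_percent", 87.5),
    Sample("tenant-a", "vm_memory_percent", 63.2),
]

tenant_a_view = visible_to("tenant-a", samples)
```

Applying the filter on the server side, before any query result leaves the monitoring backend, means a tenant cannot widen its own scope by altering a dashboard query.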

Database Monitoring

For DBaaS instances, dedicated monitoring is available via Prometheus exporters:

  • PostgreSQL: Query performance, connection counts, replication lag, cache hit ratio
  • MongoDB: Operation counts, replication status, index usage, document throughput
  • MS SQL: Query wait times, blocking, transaction log usage, availability group health
  • MariaDB: Thread status, slow query log, InnoDB buffer pool hit rate
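
As a worked example of one metric above, the PostgreSQL cache hit ratio is commonly derived from the `blks_hit` and `blks_read` counters in the `pg_stat_database` view as hits divided by total block requests. The function below is a minimal sketch of that arithmetic. The function name and the sample counter values are assumptions, and the exporter used by CCP may compute or label the metric differently.

```python
from typing import Optional


def cache_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Fraction of block requests served from shared buffers.

    Returns None when no blocks have been requested yet, so callers
    can distinguish "no data" from a genuinely poor ratio.
    """
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total


# Hypothetical counter values read from pg_stat_database.
ratio = cache_hit_ratio(blks_hit=9_900, blks_read=100)
```

Because both counters are cumulative, computing the ratio over the difference between two scrapes gives a more current picture than computing it over the lifetime totals.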

Log Analyzer

The Log Analyzer service provides centralized log aggregation and search for platform and workload logs:

  • Platform Logs: All CCP microservice logs, API gateway logs, authentication events, and infrastructure logs aggregated centrally
  • Workload Logs: Application logs from VMs and containers forwarded to Log Analyzer for storage and search
  • Security Logs: Firewall logs, VPN access logs, and access control decision logs forwarded to SIEM
  • Log Retention: Configurable retention per log category; compliance-driven retention policies for security and audit logs
  • Search Interface: Full-text search across log streams with time-range filtering and structured field queries
  • Integration: APM / NPM / IPM tools integrate with Log Analyzer for correlated application and infrastructure analysis
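
The search behaviour described above (full-text match, time-range filter, structured field match) can be sketched over in-memory records as follows. This is an illustration of the query semantics only. The `search_logs` function, the record layout, and the field names are hypothetical and are not the Log Analyzer API.

```python
from datetime import datetime


def search_logs(records, text=None, start=None, end=None, **fields):
    """Filter log records by substring, half-open time range, and exact field values."""
    results = []
    for rec in records:
        ts = rec["timestamp"]
        if start is not None and ts < start:
            continue
        if end is not None and ts >= end:   # end bound is exclusive
            continue
        if text is not None and text.lower() not in rec["message"].lower():
            continue
        # Structured field queries: every requested field must match exactly.
        if any(rec.get(key) != value for key, value in fields.items()):
            continue
        results.append(rec)
    return results


logs = [
    {"timestamp": datetime(2025, 1, 1, 10, 0), "level": "ERROR",
     "service": "api-gateway", "message": "Upstream timeout"},
    {"timestamp": datetime(2025, 1, 1, 10, 5), "level": "INFO",
     "service": "api-gateway", "message": "Request completed"},
    {"timestamp": datetime(2025, 1, 1, 11, 30), "level": "ERROR",
     "service": "identity", "message": "Token validation timeout"},
]

hits = search_logs(
    logs,
    text="timeout",
    start=datetime(2025, 1, 1, 10, 0),
    end=datetime(2025, 1, 1, 11, 0),
    level="ERROR",
)
```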

Alarm Service

The Alarm Service provides configurable threshold-based and anomaly-driven alerting across all monitored resources:

Pre-Configured Alert Thresholds

Alert Category | Trigger Condition | Default Action
CPU Utilization | Sustained utilization above configurable threshold | Alert operations team; capacity planning review
Memory Utilization | Node-level memory above configurable threshold | Capacity warning; alert operations
Storage Capacity | Volume or pool utilization above threshold | Capacity warning to tenant; escalation to operations
Kubernetes Node Not Ready | Node transitions to NotReady state | Immediate operations alert; auto-recovery attempt
etcd Leader Changes | etcd leader election above configurable rate | Stability investigation; alert operations
API Error Rate | HTTP 5xx error rate above threshold | Service degradation alert; escalation
Database Replication Lag | Replication lag exceeds threshold | HA risk alert; DR readiness review
Backup Failure | Scheduled backup job fails or misses window | Immediate alert; data protection risk escalation
Service Unavailability | CCP microservice health check fails | Platform alert; auto-restart attempt
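
As an example of how one of these trigger conditions could be evaluated, the sketch below computes an HTTP 5xx error rate over a window of response codes and compares it with a threshold. The function name, the window contents, and the 2% threshold are assumptions chosen for illustration. The source does not state the default threshold values.

```python
from typing import Sequence


def error_rate_5xx(status_codes: Sequence[int]) -> float:
    """Share of responses in the window with a 5xx status code."""
    if not status_codes:
        return 0.0   # no traffic in the window: treat as no errors
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


THRESHOLD = 0.02   # hypothetical threshold, not a documented CCP default

# Hypothetical window: 95 successful responses and 5 server errors.
window = [200] * 95 + [500, 502, 503, 504, 500]

rate = error_rate_5xx(window)
breached = rate > THRESHOLD
```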

Custom Alarm Configuration

  • Users and administrators can create custom alarms on any metric with configurable:
    • Threshold value and comparison operator
    • Evaluation period (number of consecutive data points)
    • Severity level (info, warning, critical)
    • Notification channel (email, SMS, webhook, portal)
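
The four configurable properties above map naturally onto a small rule object. The following is a minimal sketch, assuming that an alarm fires only when the most recent N consecutive data points all breach the threshold. The `AlarmRule` class, its field names, and the `is_firing` function are hypothetical and do not represent the Alarm Service's actual schema.

```python
import operator
from dataclasses import dataclass
from typing import Sequence

# Supported comparison operators (an assumed set, for illustration).
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass
class AlarmRule:
    metric: str
    threshold: float
    comparison: str          # one of the keys in OPERATORS
    evaluation_periods: int  # consecutive data points that must breach
    severity: str            # "info", "warning", or "critical"
    channel: str             # "email", "sms", "webhook", or "portal"


def is_firing(rule: AlarmRule, datapoints: Sequence[float]) -> bool:
    """True when the latest `evaluation_periods` points all breach the threshold."""
    if rule.evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(datapoints) < rule.evaluation_periods:
        return False   # not enough data to evaluate yet
    compare = OPERATORS[rule.comparison]
    recent = datapoints[-rule.evaluation_periods:]
    return all(compare(value, rule.threshold) for value in recent)


rule = AlarmRule("cpu_percent", 90.0, ">", 3, "critical", "email")

sustained = is_firing(rule, [50, 95, 96, 97])   # last three all above 90
interrupted = is_firing(rule, [95, 96, 50, 97]) # a dip breaks the streak
```

Requiring consecutive breaches is what keeps a single transient spike from paging anyone, at the cost of a detection delay equal to the evaluation period.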

Notification Service

The Notification Service delivers alerts and system events through multiple channels:

  • Email (SMTP): Alert emails delivered via configured SMTP server; supports HTML templates for rich notifications
  • SMS: Critical alert SMS to on-call phone numbers via configured SMS gateway
  • Portal Notifications: Real-time in-portal notifications via Socket.IO push; visible in the notification bell in the console
  • Webhook: POST notifications to external systems (PagerDuty, Opsgenie, Slack, custom ITSM systems) for alert integration
  • Notification Policies: Configurable escalation rules — alert user first, then escalate to admin if unacknowledged within time window
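
The escalation rule described above (alert the user first, then escalate to the admin if the alert stays unacknowledged past a time window) can be sketched as a small decision function. The `EscalationPolicy` class, its field names, the 15-minute window, and the example addresses are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EscalationPolicy:
    user: str                # first recipient
    admin: str               # escalation recipient
    ack_window_seconds: int  # time allowed for acknowledgement


def current_recipient(policy: EscalationPolicy,
                      seconds_since_alert: int,
                      acknowledged: bool) -> Optional[str]:
    """Return who should be notified now, or None if the alert is acknowledged."""
    if acknowledged:
        return None
    if seconds_since_alert < policy.ack_window_seconds:
        return policy.user
    # Window elapsed without acknowledgement: escalate.
    return policy.admin


policy = EscalationPolicy(
    user="owner@example.com",
    admin="admin@example.com",
    ack_window_seconds=900,   # hypothetical 15-minute window
)

first = current_recipient(policy, seconds_since_alert=0, acknowledged=False)
escalated = current_recipient(policy, seconds_since_alert=900, acknowledged=False)
```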

Incident Management

Platform incidents follow a structured severity classification:

Severity | Definition | Response Model
Critical | Platform-wide outage or data loss risk | Immediate response; war room activation; executive notification
High | Major service degradation affecting multiple tenants | Rapid response; senior engineer engaged; tenant notification
Medium | Service degradation for single tenant or service | Standard response; investigation and resolution within SLA
Low | Minor issue with available workaround | Next-business-day response; ticket tracking
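
The severity definitions can be read as an ordered set of checks, evaluated from most to least severe. The sketch below encodes that reading. The input flags, the function name, and the order in which Low and Medium are tested are assumptions, since the source defines the levels but not a classification procedure.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


def classify(platform_wide_outage: bool,
             data_loss_risk: bool,
             tenants_affected: int,
             minor_with_workaround: bool) -> Severity:
    """Map incident attributes (hypothetical inputs) to a severity level."""
    if platform_wide_outage or data_loss_risk:
        return Severity.CRITICAL
    if tenants_affected > 1:
        return Severity.HIGH
    if minor_with_workaround:
        return Severity.LOW
    # Remaining case: degradation limited to a single tenant or service.
    return Severity.MEDIUM


outage = classify(True, False, 40, False)
shared_degradation = classify(False, False, 3, False)
single_tenant = classify(False, False, 1, False)
minor = classify(False, False, 1, True)
```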