Monitoring and Debugging

The SMI platform provides multiple layers of health checking:

  • Deployment health checks — These confirm that the infrastructure meets the application requirements.

    NOTE: Some deployment health checks, namely input/output operations per second (IOPS) validation and network throughput, may impact performance and should be executed only during the deployment phase.

  • Run time health checks — These checks run continuously in the background to verify that logging and tracing are set to the lowest levels, and to monitor error rates and alarms.

  • Pod health checks — These confirm that the pod is alive and that its service is available. If a pod fails the health check, it is killed and re-scheduled onto another available node.

  • Performance checks — These checks provide data such as transactions per second (TPS), number of records (sessions), CPU and memory utilization, and errors.
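
The pod health check rule in the list above (a failing pod is killed and re-scheduled onto another available node) can be illustrated with a minimal Python sketch. This is not SMI or Kubernetes code: the names `ProbeResult`, `is_healthy`, `pick_reschedule_node`, and `handle_probe`, and the node and pod names, are hypothetical and exist only to show the decision logic.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one hypothetical pod health check."""
    alive: bool              # the pod responded to the probe
    service_available: bool  # the service inside the pod can take traffic


def is_healthy(result: ProbeResult) -> bool:
    # A pod passes only if it is alive AND its service is available.
    return result.alive and result.service_available


def pick_reschedule_node(current: str, available: Sequence[str]) -> Optional[str]:
    # Return the first available node other than the current one, or None.
    for node in available:
        if node != current:
            return node
    return None


def handle_probe(pod: str, current_node: str, result: ProbeResult,
                 available_nodes: Sequence[str]) -> str:
    # Assumed policy: a single failed check triggers kill and re-schedule.
    if is_healthy(result):
        return f"{pod}: healthy on {current_node}"
    target = pick_reschedule_node(current_node, available_nodes)
    if target is None:
        return f"{pod}: killed, no other node available"
    return f"{pod}: killed, re-scheduled onto {target}"


if __name__ == "__main__":
    failed = ProbeResult(alive=True, service_available=False)
    print(handle_probe("pod-1", "node-a", failed, ["node-a", "node-b"]))
```

A real scheduler would also weigh resource availability and placement constraints when choosing the target node; the sketch simply takes the first alternative.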

Statistics can be viewed through Grafana and streamed using Prometheus. They are also available in bulkstat format. The granularity of statistics can be as fine as 1 second. Statistics are stored for up to 3 days, with Thanos used to compress and compact the data.
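
As an illustration of pulling statistics at 1-second granularity, the following Python sketch builds a request URL for the standard Prometheus HTTP API range query endpoint (`/api/v1/query_range`, with the `query`, `start`, `end`, and `step` parameters). The host name `prometheus.example`, the port, and the query `up` are placeholders, not SMI values, and the helper name `build_range_query_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url: str, promql: str,
                          start: float, end: float,
                          step_seconds: int = 1) -> str:
    """Build a Prometheus range-query URL.

    start and end are Unix timestamps in seconds; step_seconds is the
    resolution of the returned series (1 matches the finest granularity
    mentioned above).
    """
    params = {
        "query": promql,
        "start": f"{start:.3f}",
        "end": f"{end:.3f}",
        "step": f"{step_seconds}s",
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder endpoint and query; substitute the values for your deployment.
    print(build_range_query_url("http://prometheus.example:9090",
                                "up", 1700000000, 1700000060))
```

Prometheus caps the number of points a single range query may return per series, so a 1-second step is best paired with a short time window rather than the full retention period.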

Logging utilizes journald and rsyslog to collect and distribute logs northbound to a fully featured logging platform. SMI also includes logging utilities to collect snapshots for troubleshooting and uploading to Cisco TAC support centers. Logging verbosity and detail levels are set via API, and can be set to Critical, Error, Warning, Informational, or Debug.
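
The five verbosity levels behave like an ordered severity scale. The sketch below shows that ordering in Python, assuming the levels map onto the standard syslog severity numbers (Critical = 2, Error = 3, Warning = 4, Informational = 6, Debug = 7). The mapping is an assumption based on the use of rsyslog, not a documented SMI behavior, and `LogLevel`, `parse_level`, and `should_emit` are hypothetical names rather than part of the SMI API.

```python
from enum import IntEnum


class LogLevel(IntEnum):
    """Levels named in the text, numbered with syslog severities (assumed mapping)."""
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    INFORMATIONAL = 6
    DEBUG = 7


def parse_level(name: str) -> LogLevel:
    # Accept names such as "Warning" or " debug " regardless of case.
    return LogLevel[name.strip().upper()]


def should_emit(configured: LogLevel, message: LogLevel) -> bool:
    # Lower numbers are more severe; emit a message when it is at least
    # as severe as the configured level.
    return message <= configured


if __name__ == "__main__":
    configured = parse_level("Warning")
    for level in LogLevel:
        print(level.name, should_emit(configured, level))
```

With the level set to Warning, Critical, Error, and Warning messages pass the filter while Informational and Debug messages are suppressed.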

Application and platform events can be forwarded northbound using Prometheus plugins such as VES, SNMP, or both.