Plan Your Cluster’s Alarms
Before you set up alarms, you should have an understanding of your clusters. By considering the problems that have previously occurred, you can predict which issues are most likely to happen again. Additionally, by knowing your system’s performance requirements and customer SLAs (service level agreements), you can configure alarms at thresholds that allow you to address small problems before they become major issues.
Predefined Alarms (YARN)
A set of predefined alarms in the Pepperdata dashboard monitors basic system resource usage, such as CPU load and storage. You cannot delete these alarms, but you can disable and enable them, customize them for your system (see Edit Alarms), and use them as the basis for planning and configuring additional alarms.
The predefined alarms appear in the Alarms Settings tab of the Alarms & Alerts page. (To show the Alarms & Alerts page, click the Alarms icon in the “top-nav” menu, and select View All Alarms. Or use the “left-nav” menu, and select Alarms.)
If you want to create a custom alarm that’s similar to a predefined alarm, the easiest approach is to start from the predefined alarm’s query and create the new alarm directly, rather than creating an alarm from a chart view or working out the required query string yourself. For more information, see Create Alarms From the Alarms Page.
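As a rough illustration of starting from an existing query, the sketch below treats a predefined query such as `h=*&m=c_rsspct` as an ordinary URL-style query string and overrides one of its parameters. The helper name `derive_query` and the host name `worker-01` are hypothetical, and the reading of `h` as a host selector is an assumption based on the "by Host" descriptions of the predefined alarms. The sketch only manipulates the string; it does not call any Pepperdata API.

```python
from urllib.parse import parse_qsl, urlencode


def derive_query(base_query: str, **overrides: str) -> str:
    """Return a copy of base_query with selected parameters replaced.

    Hypothetical helper: it assumes the alarm query is a plain
    key=value&key=value string, as the predefined queries appear to be.
    """
    params = dict(parse_qsl(base_query, keep_blank_values=True))
    params.update(overrides)
    # Keep "*" unescaped so wildcard selectors such as h=* stay readable.
    return urlencode(params, safe="*")


# Start from the predefined Memory alarm query and narrow it to one host
# (the host name is made up for this example).
print(derive_query("h=*&m=c_rsspct", h="worker-01"))
```

The resulting string could then be pasted into the query field when you create the new alarm from the Alarms page.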
The following table describes the predefined alarms for YARN clusters and their default queries, thresholds, and firing sensitivity. Firing sensitivity specifies the percentage of time, within a given period, that the threshold must be exceeded before the alarm actually fires (is triggered).
| Title | Description | Query | Threshold | Firing Sensitivity |
| --- | --- | --- | --- | --- |
| CPU Load | node Fifteen minute load average per core, by Host | h=*&m=n_15mlavg_per_core | > 5 | > 1% of time in any five minute period |
| Memory | User RAM percentage, by Host | h=*&m=c_rsspct | > 90 | > 1% of time in any five minute period |
| Disk I/O | node Percent of time doing I/O, by Host | h=*&m=n_dnmsdips_max_pct | > 90 | > 1% of time in any five minute period |
| Storage | node Percent of disk space used, by Host | h=*&m=c_capacity_disk_pct | > 90 | > 1% of time in any five minute period |
| HBase Garbage Collection | Task JVM old garbage collection time, by Host, by App | h=*&j=hbase&m=trfjgot | > 45000 | > 1% of time in any five minute period |
| Swap Status | node Swap state, by Host | h=*&m=n_sdss | > 0 | > 1% of time in any five minute period |
| Resource Manager Metrics | ResourceManager active nodes | m=rmc_act | < 1 | > 30% of time in any 30 minute period |
| Too Few Kafka Broker Active Controllers * | Kafka active controller count | m=kafka_activecontroller | < 1 | > 100% of time in any one minute period |
| Too Many Kafka Broker Active Controllers * | Kafka active controller count | m=kafka_activecontroller | > 1 | > 100% of time in any one minute period |
| Kafka Offline Partitions * | Kafka offline partitions count | m=kafka_offlinepartitionscount | 0 | > 100% of time in any one minute period |
* Available only when Streaming Spotlight is enabled.
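To make the interaction between threshold and firing sensitivity concrete, here is a minimal sketch of one plausible reading of the rule: an alarm fires when the fraction of samples in a window that violate the threshold is greater than the sensitivity percentage. The function name `should_fire`, the evenly spaced samples, and the sample counts are assumptions for illustration; the actual evaluation logic in Pepperdata is not described here and may differ.

```python
from typing import Sequence


def should_fire(
    samples: Sequence[float],
    threshold: float,
    min_fraction: float,
    above: bool = True,
) -> bool:
    """Decide whether an alarm would fire for one evaluation window.

    samples      -- metric values, assumed evenly spaced across the window
    threshold    -- alarm threshold (for example, 5 for CPU Load)
    min_fraction -- firing sensitivity as a fraction (0.01 means "> 1% of time")
    above        -- True for "> threshold" alarms, False for "< threshold"
    """
    if not samples:
        return False
    if above:
        violating = sum(1 for value in samples if value > threshold)
    else:
        violating = sum(1 for value in samples if value < threshold)
    return violating / len(samples) > min_fraction


# CPU Load style rule: > 5 for more than 1% of a five-minute window.
# With 60 assumed samples, a single violating sample is about 1.7% of the window.
window = [1.2] * 59 + [6.3]
print(should_fire(window, threshold=5, min_fraction=0.01))
```

Under this reading, a low sensitivity such as 1% means that even a brief excursion past the threshold triggers the alarm, whereas the 30% setting on Resource Manager Metrics tolerates short dips before firing.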
Alarms for Common Problems
Common cluster problems for which you might want to create alarms include:
- Slow (Long Running) Jobs
- Excessive Container Asks
- Specific (Named) Job Ran Too Long
- Job Failed or Killed
For information about the associated metrics and the steps for creating corresponding alarms, see Create Alarms for Common Problems.