How to Write Rules for Prometheus

2019-06-06 14:51:03

Prometheus supports two types of rules:

  • Recording rules
  • Alerting rules

Rules are evaluated at regular intervals and are loaded by listing rule files in the prometheus.yml configuration file:

rule_files:
  - /etc/prometheus/rules/*.rules
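
How often rules run is set by evaluation_interval in the global section of prometheus.yml; a rule group can override it with its own interval field, as the first example below does. A minimal sketch of the relevant global settings:

global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often rule groups are evaluated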

Recording Rules

Recording rules precompute an expression at each evaluation and store the result as a new time series. The example below records the current Shanghai hour; a second group then uses that series to restrict an HTTP check alert to daytime hours (06:00–22:00):

groups:
- name: recording_rules
  interval: 5s
  rules:
  - record: asia_shanghai_time
    expr: time()
  - record: asia_shanghai_hour
    # hour() works on UTC timestamps; add the UTC+8 offset and wrap past midnight
    expr: (hour(asia_shanghai_time) + 8) % 24
- name: School Http Status Check
  rules:
  - alert: HttpSchoolCheckStatus
    expr: probe_success{job="blackbox-school-http"} == 0 and on() asia_shanghai_hour >= 6 < 22
    for: 1m
    labels:
      severity: warning
      env: school
    annotations:
      description: "Host: {{ $labels.instance }}, job: {{ $labels.job }}, HTTP status code: {{ printf `probe_http_status_code{instance='%s'}` $labels.instance | query | first | value }}. HTTP check failed, please investigate!"
      summary: "HTTP check"
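
Beyond clock arithmetic, recording rules are mostly used to precompute expensive expressions so that alerts and dashboards read a cheap pre-aggregated series instead. A sketch following the level:metric:operations naming convention (the recorded name here is illustrative):

groups:
- name: cpu_precompute
  rules:
  # pre-aggregate per-instance CPU usage percent from the idle counter
  - record: instance:node_cpu_usage:avg_irate5m
    expr: 100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

The NodeCPUUsage alert in the next section could then be written as instance:node_cpu_usage:avg_irate5m > 80.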

Alerting Rules

An alerting rule fires when its expression has been true for the duration given by for; the listed labels and annotations are attached to the resulting alert. The examples below are grouped by subsystem.
  • CPU

    groups:
    - name: CPU alert rules
      rules:
      - alert: NodeCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 80
        for: 30m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.instance }}: High CPU usage detected"
          summary: "{{ $labels.instance }}: CPU usage is above 80% (current value is: {{ $value }})"
      - alert: ContextSwitching
        expr: rate(node_context_switches_total[5m]) > 1000
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Context switching (instance {{ $labels.instance }})"
          description: "Context switching is growing on node (> 1000 / s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  • Memory

    groups:
    - name: Memory alert rules
      rules:
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.instance }}: Memory usage is above 80% (current value is: {{ $value }})"
          summary: "{{ $labels.instance }}: High Memory usage detected"
  • Disk

    groups:
    - name: Disk alert rules
      rules:
      - alert: OutOfNodeDiskSpace
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.instance }}: Disk usage is above 80% (current value is: {{ $value }})"
          summary: "{{ $labels.instance }}: High Disk usage detected"
      - alert: UnusualDiskReadRate
        expr: sum by (instance) (irate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk read rate (instance {{ $labels.instance }})"
          description: "Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: UnusualDiskWriteRate
        expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk write rate (instance {{ $labels.instance }})"
          description: "Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: DiskWillFillIn16Hours
        expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs",mountpoint="/"}[1h], 16 * 3600) < 0 and on(instance, job) (time() - node_installation_time_seconds > 2 * 3600)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk will fill in 16 hours (instance {{ $labels.instance }})"
          description: "{{ $labels.instance }} will soon be out of disk space."
      - alert: UnusualDiskReadLatency
        # the ratio is seconds per operation, so 100ms = 0.1
        expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk read latency (instance {{ $labels.instance }})"
          description: "Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: UnusualDiskWriteLatency
        expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk write latency (instance {{ $labels.instance }})"
          description: "Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  • Network

    groups:
    - name: Unusual network throughput
      rules:
      - alert: UnusualNetworkThroughputIn
        expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Unusual network throughput in (instance {{ $labels.instance }})"
          description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: UnusualNetworkThroughputOut
        expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Unusual network throughput out (instance {{ $labels.instance }})"
          description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  • blackbox for ssl_expiry

    groups:
    - name: ssl_expiry.rules
      rules:
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 86400 * 30
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"
          description: "SSL certificate expires within 30 days\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: SslCertificateHasExpired
        expr: probe_ssl_earliest_cert_expiry - time() <= 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate has expired (instance {{ $labels.instance }})"
          description: "SSL certificate has expired already\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  • blackbox for http

    groups:
    - name: Http Status Check
      rules:
      - alert: HttpStatusCheck
        expr: probe_success{job="blackbox-http"} == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          description: "Host: {{ $labels.instance }}, job: {{ $labels.job }}, HTTP status code: {{ printf `probe_http_status_code{instance='%s'}` $labels.instance | query | first | value }}. HTTP check failed, please investigate!"
          summary: "HTTP check"
  • mysql

    groups:
    - name: MySQLStatsAlert
      rules:
      - alert: MySQLIsDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} MySQL is down"
          description: "MySQL database is down. This requires immediate action!"
      - alert: Mysql_High_QPS
        expr: rate(mysql_global_status_questions[5m]) > 500
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: Mysql_High_QPS detected"
          description: "{{ $labels.instance }}: MySQL operations exceed 500 per second (current value is: {{ $value }})"
      - alert: Mysql_Too_Many_Connections
        # threads_connected is a gauge, so compare it directly rather than taking rate()
        expr: mysql_global_status_threads_connected > 200
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: Mysql Too Many Connections detected"
          description: "{{ $labels.instance }}: MySQL has more than 200 connections (current value is: {{ $value }})"
      - alert: Mysql_Too_Many_slow_queries
        expr: rate(mysql_global_status_slow_queries[5m]) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: Mysql_Too_Many_slow_queries detected"
          description: "{{ $labels.instance }}: MySQL slow_queries exceed 3 per second (current value is: {{ $value }})"
      - alert: SQLThreadStopped
        expr: mysql_slave_status_slave_sql_running == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} SQL thread stopped"
          description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
      - alert: SlaveLaggingBehindMaster
        # seconds_behind_master is a gauge, so compare it directly rather than taking rate()
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
          description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are performance issues!"
  • Ali Cloud (the aliyun_* metrics below are exposed by the aliyun-exporter listed in the references)

    groups:
    - name: slb
      rules:
      - alert: slb_5xx_percent:warning
        expr: |-
          sum(aliyun_acs_slb_dashboard_StatusCode5xx) by (vip, port) /
          sum(aliyun_acs_slb_dashboard_Qps) by (vip, port) > 0.01
        for: 5m
        labels:
          severity: "2"  # label values are strings; 0 = critical, 1 = high, 2 = warning
        annotations:
          summary: 'SLB {{ $labels.vip }}:{{ $labels.port }} 5xx percent > 1%'
      - alert: slb_5xx_percent:high
        expr: |-
          sum(aliyun_acs_slb_dashboard_StatusCode5xx) by (vip, port) /
          sum(aliyun_acs_slb_dashboard_Qps) by (vip, port) > 0.05
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'SLB {{ $labels.vip }}:{{ $labels.port }} 5xx percent > 5%'
      - alert: slb_5xx_percent:critical
        expr: |-
          sum(aliyun_acs_slb_dashboard_StatusCode5xx) by (vip, port) /
          sum(aliyun_acs_slb_dashboard_Qps) by (vip, port) > 0.1
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'SLB {{ $labels.vip }}:{{ $labels.port }} 5xx percent > 10%'
      - alert: slb_response_time:high
        expr: avg(aliyun_acs_slb_dashboard_Rt) by (vip, port) > 200
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'SLB {{ $labels.vip }}:{{ $labels.port }} RT > 200ms'
      - alert: slb_response_time:critical
        expr: avg(aliyun_acs_slb_dashboard_Rt) by (vip, port) > 500
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'SLB {{ $labels.vip }}:{{ $labels.port }} RT > 500ms'
      - alert: slb_tx_traffic_drop_percent:critical
        expr: |-
          sum(aliyun_acs_slb_dashboard_DropTrafficTX) by (vip, port) /
          sum(aliyun_acs_slb_dashboard_TrafficTXNew) by (vip, port) > 0.001
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'SLB {{ $labels.vip }}:{{ $labels.port }} tx traffic drop percent > 0.1%'
      - alert: slb_rx_traffic_drop_percent:critical
        expr: |-
          sum(aliyun_acs_slb_dashboard_DropTrafficRX) by (vip, port) /
          sum(aliyun_acs_slb_dashboard_TrafficRXNew) by (vip, port) > 0.001
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'SLB {{ $labels.vip }}:{{ $labels.port }} rx traffic drop percent > 0.1%'

    - name: ecs
      rules:
      - alert: ecs_cpu_pressure:warning
        expr: |-
          (aliyun_acs_ecs_dashboard_CPUUtilization > 80)
          * on (instanceId) group_left(VpcAttributes,HostName,InnerIpAddress)
          label_replace(aliyun_meta_ecs_info, "instanceId", "$1", "InstanceId", "(.*)")
        for: 5m
        labels:
          severity: "2"
        annotations:
          summary: 'ECS {{ $labels.HostName }} cpu usage > 80%'
      - alert: ecs_cpu_pressure:high
        expr: |-
          (aliyun_acs_ecs_dashboard_CPUUtilization > 95)
          * on (instanceId) group_left(VpcAttributes,HostName,InnerIpAddress)
          label_replace(aliyun_meta_ecs_info, "instanceId", "$1", "InstanceId", "(.*)")
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'ECS {{ $labels.HostName }} cpu usage > 95%'
      - alert: ecs_memory_pressure:warning
        expr: |-
          (aliyun_acs_ecs_dashboard_memory_usedutilization > 80)
          * on (instanceId) group_left(VpcAttributes,HostName,InnerIpAddress)
          label_replace(aliyun_meta_ecs_info, "instanceId", "$1", "InstanceId", "(.*)")
        for: 5m
        labels:
          severity: "2"
        annotations:
          summary: 'ECS {{ $labels.HostName }} memory usage > 80%'
      - alert: ecs_memory_pressure:high
        expr: |-
          (aliyun_acs_ecs_dashboard_memory_usedutilization > 95)
          * on (instanceId) group_left(VpcAttributes,HostName,InnerIpAddress)
          label_replace(aliyun_meta_ecs_info, "instanceId", "$1", "InstanceId", "(.*)")
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'ECS {{ $labels.HostName }} memory usage > 95%'
      - alert: ecs_load_avg:warning
        expr: |-
          (aliyun_acs_ecs_dashboard_load_5m > 10)
          * on (instanceId) group_left(VpcAttributes,HostName,InnerIpAddress)
          label_replace(aliyun_meta_ecs_info, "instanceId", "$1", "InstanceId", "(.*)")
        for: 5m
        labels:
          severity: "2"
        annotations:
          summary: 'ECS {{ $labels.HostName }} loadAvg5m > 10'
      - alert: ecs_load_avg:high
        expr: |-
          (aliyun_acs_ecs_dashboard_load_5m > 20)
          * on (instanceId) group_left(VpcAttributes,HostName,InnerIpAddress)
          label_replace(aliyun_meta_ecs_info, "instanceId", "$1", "InstanceId", "(.*)")
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'ECS {{ $labels.HostName }} loadAvg5m > 20'
      - alert: ecs_disk_pressure:high
        expr: |-
          (aliyun_acs_ecs_dashboard_diskusage_utilization > 90)
          * on (instanceId) group_left(VpcAttributes,HostName,InnerIpAddress)
          label_replace(aliyun_meta_ecs_info, "instanceId", "$1", "InstanceId", "(.*)")
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'ECS {{ $labels.HostName }} disk usage > 90%'
      - alert: ecs_disk_pressure:critical
        expr: |-
          (aliyun_acs_ecs_dashboard_diskusage_utilization > 95)
          * on (instanceId) group_left(VpcAttributes,HostName,InnerIpAddress)
          label_replace(aliyun_meta_ecs_info, "instanceId", "$1", "InstanceId", "(.*)")
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'ECS {{ $labels.HostName }} disk usage > 95%'
      - alert: ecs_too_many_connections:warning
        expr: |-
          (aliyun_acs_ecs_dashboard_tcpconnection{state="TCP_TOTAL"} > 1000)
          * on (instanceId) group_left(VpcAttributes,HostName,InnerIpAddress)
          label_replace(aliyun_meta_ecs_info, "instanceId", "$1", "InstanceId", "(.*)")
        for: 5m
        labels:
          severity: "2"
        annotations:
          summary: 'ECS {{ $labels.HostName }} tcp_total > 1000'
      - alert: ecs_too_many_connections:high
        expr: |-
          (aliyun_acs_ecs_dashboard_tcpconnection{state="TCP_TOTAL"} > 2000)
          * on (instanceId) group_left(VpcAttributes,HostName,InnerIpAddress)
          label_replace(aliyun_meta_ecs_info, "instanceId", "$1", "InstanceId", "(.*)")
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'ECS {{ $labels.HostName }} tcp_total > 2000'

    - name: rds
      rules:
      - alert: rds_cpu_pressure:high
        expr: |-
          sum(aliyun_acs_rds_dashboard_CpuUsage
          * on (instanceId) group_left(DBInstanceDescription,ZoneId)
          label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))
          without (instance, userId, job) > 85
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'RDS {{ $labels.DBInstanceDescription }} under high cpu pressure > 85%'
      - alert: rds_cpu_pressure:critical
        expr: |-
          sum(aliyun_acs_rds_dashboard_CpuUsage
          * on (instanceId) group_left(DBInstanceDescription,ZoneId)
          label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))
          without (instance, userId, job) > 95
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'RDS {{ $labels.DBInstanceDescription }} under critical cpu pressure > 95%'
      - alert: rds_memory_pressure:high
        expr: |-
          sum(aliyun_acs_rds_dashboard_MemoryUsage
          * on (instanceId) group_left(DBInstanceDescription,ZoneId)
          label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))
          without (instance, userId, job) > 85
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'RDS {{ $labels.DBInstanceDescription }} under high memory pressure > 85%'
      - alert: rds_memory_pressure:critical
        expr: |-
          sum(aliyun_acs_rds_dashboard_MemoryUsage
          * on (instanceId) group_left(DBInstanceDescription,ZoneId)
          label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))
          without (instance, userId, job) > 95
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'RDS {{ $labels.DBInstanceDescription }} under critical memory pressure > 95%'
      - alert: rds_iops_pressure:high
        expr: |-
          sum(aliyun_acs_rds_dashboard_IOPSUsage
          * on (instanceId) group_left(DBInstanceDescription,ZoneId)
          label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))
          without (instance, userId, job) > 80
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'RDS {{ $labels.DBInstanceDescription }} under high iops pressure > 80%'
      - alert: rds_iops_pressure:critical
        expr: |-
          sum(aliyun_acs_rds_dashboard_IOPSUsage
          * on (instanceId) group_left(DBInstanceDescription,ZoneId)
          label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))
          without (instance, userId, job) > 90
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'RDS {{ $labels.DBInstanceDescription }} under critical iops pressure > 90%'
      - alert: rds_disk_space_exhausted:warning
        expr: |-
          sum(aliyun_acs_rds_dashboard_DiskUsage
          * on (instanceId) group_left(DBInstanceDescription,ZoneId)
          label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))
          without (instance, userId, job) > 85
        for: 5m
        labels:
          severity: "2"
        annotations:
          summary: 'RDS {{ $labels.DBInstanceDescription }} disk space under pressure > 85%'
      - alert: rds_disk_space_exhausted:critical
        expr: |-
          sum(aliyun_acs_rds_dashboard_DiskUsage
          * on (instanceId) group_left(DBInstanceDescription,ZoneId)
          label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))
          without (instance, userId, job) > 95
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'RDS {{ $labels.DBInstanceDescription }} disk space will be exhausted soon > 95%'
      - alert: rds_connection_pressure:high
        expr: |-
          sum(aliyun_acs_rds_dashboard_ConnectionUsage
          * on (instanceId) group_left(DBInstanceDescription,ZoneId)
          label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))
          without (instance, userId, job) > 85
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'RDS {{ $labels.DBInstanceDescription }} connection usage > 85%'
      - alert: rds_connection_pressure:critical
        expr: |-
          sum(aliyun_acs_rds_dashboard_ConnectionUsage
          * on (instanceId) group_left(DBInstanceDescription,ZoneId)
          label_replace(aliyun_meta_rds_info, "instanceId", "$1", "DBInstanceId", "(.*)"))
          without (instance, userId, job) > 95
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'RDS {{ $labels.DBInstanceDescription }} connection usage > 95%'

    - name: redis
      rules:
      - alert: redis_cpu_pressure:high
        expr: |-
          sum(aliyun_acs_kvstore_CpuUsage
          * on (instanceId) group_left(PrivateIp,InstanceName)
          label_replace(aliyun_meta_redis_info, "instanceId", "$1", "UserName", "(.*)"))
          without (instance, userId, job) > 85
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'Redis {{ $labels.InstanceName }} under high cpu pressure > 85%'
      - alert: redis_cpu_pressure:critical
        expr: |-
          sum(aliyun_acs_kvstore_CpuUsage
          * on (instanceId) group_left(PrivateIp,InstanceName)
          label_replace(aliyun_meta_redis_info, "instanceId", "$1", "UserName", "(.*)"))
          without (instance, userId, job) > 95
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'Redis {{ $labels.InstanceName }} under critical cpu pressure > 95%'
      - alert: redis_memory_pressure:high
        expr: |-
          sum(aliyun_acs_kvstore_MemoryUsage
          * on (instanceId) group_left(PrivateIp,InstanceName)
          label_replace(aliyun_meta_redis_info, "instanceId", "$1", "UserName", "(.*)"))
          without (instance, userId, job) > 85
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'Redis {{ $labels.InstanceName }} memory usage > 85%'
      - alert: redis_memory_pressure:critical
        expr: |-
          sum(aliyun_acs_kvstore_MemoryUsage
          * on (instanceId) group_left(PrivateIp,InstanceName)
          label_replace(aliyun_meta_redis_info, "instanceId", "$1", "UserName", "(.*)"))
          without (instance, userId, job) > 95
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'Redis {{ $labels.InstanceName }} memory usage > 95%'
      - alert: redis_connection_pressure:high
        expr: |-
          sum(aliyun_acs_kvstore_ConnectionUsage
          * on (instanceId) group_left(PrivateIp,InstanceName)
          label_replace(aliyun_meta_redis_info, "instanceId", "$1", "UserName", "(.*)"))
          without (instance, userId, job) > 85
        for: 5m
        labels:
          severity: "1"
        annotations:
          summary: 'Redis {{ $labels.InstanceName }} connection usage > 85%'
      - alert: redis_connection_pressure:critical
        expr: |-
          sum(aliyun_acs_kvstore_ConnectionUsage
          * on (instanceId) group_left(PrivateIp,InstanceName)
          label_replace(aliyun_meta_redis_info, "instanceId", "$1", "UserName", "(.*)"))
          without (instance, userId, job) > 95
        for: 5m
        labels:
          severity: "0"
        annotations:
          summary: 'Redis {{ $labels.InstanceName }} connection usage > 95%'
Checking Rule Files

Prometheus ships with promtool, a utility that validates the main configuration together with every rule file it references:

promtool check config /etc/prometheus/prometheus.yml
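
You can also validate rule files directly, and unit-test them with promtool test rules (available since Prometheus 2.5). The glob matches the rule_files setting above; the test file below is a minimal sketch for the MySQLIsDown alert, assuming the MySQL rules live in /etc/prometheus/rules/mysql.rules and using illustrative instance and job labels:

promtool check rules /etc/prometheus/rules/*.rules
promtool test rules mysql_rules_test.yml

# mysql_rules_test.yml
rule_files:
  - /etc/prometheus/rules/mysql.rules
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'mysql_up{instance="db1:9104", job="mysql"}'
        values: '0 0 0'
    alert_rule_test:
      # the rule uses for: 1m, so the alert is firing by the 2m mark
      - eval_time: 2m
        alertname: MySQLIsDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: db1:9104
              job: mysql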

References

  • Prometheus operators (in Chinese)
  • prometheus
  • Combining alert conditions
  • Prometheus in Practice
  • Time of day based notifications with Prometheus and Alertmanager
  • aliyun-exporter