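Suppose server A runs four processes: service-a, service-b, service-c, and service-d. We need to monitor these four processes and send an alert to DingTalk if any of them goes down.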

Write the names of the processes to monitor into process_list.txt, one per line.

[root@kafka1 scripts]# pwd
/data/scripts
[root@kafka1 scripts]# cat process_list.txt
service-a
service-b
service-c
service-d

Write the agent script, which pushes the data to Pushgateway

vim /data/scripts/agent.sh
#!/bin/bash

#########################################################################
# File Name: agent.sh
# Created on: 2021-03-14 11:36:19
# Author: Wu Kang
# Last Modified: 2021-03-14 12:24:50
# Description: Collect process counts and push them to Pushgateway once per second
#########################################################################

HOST=$(hostname)
# First non-loopback IPv4 address of this host
IP=$(ip a | grep inet | grep -Ev "127.0.0.1|fe80|::" | awk -F'/' '{print $1}' | awk '{print $2}' | head -1)


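# Write one process_count sample per monitored process to tmpdata.txt.
# Relative paths resolve against the directory the script is started from
# (/data/scripts in this setup).
function getdata(){
  > tmpdata.txt
  for process_name in $(cat process_list.txt)
  do
    # Count the ps lines that mention the process name, excluding the grep itself
    count=$(ps -ef | grep -- "$process_name" | grep -v grep | wc -l)
    line="process_count{host=\"$HOST\",process_name=\"$process_name\",ip=\"$IP\"} $count"
    echo "$line" >> tmpdata.txt
    echo "$line"
  done
}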


function pushdata(){
  curl -XPOST --data-binary @tmpdata.txt http://192.168.50.189:9091/metrics/job/process
}



function run(){
  while true
  do
    getdata
    pushdata
    sleep 1
  done

}

main(){
  run
}

main
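
Each line that getdata writes to tmpdata.txt is one sample in the Prometheus text exposition format: the metric name, the labels in braces, then the value. A minimal sketch of building such a line with printf is shown below. The helper name format_metric and the IP address in the example call are made up for illustration and are not part of the original script.

```shell
# format_metric is a hypothetical helper, not part of the original agent.sh.
# Arguments: host, process name, ip, count.
# Prints one process_count sample in the Prometheus text exposition format.
format_metric() {
  printf 'process_count{host="%s",process_name="%s",ip="%s"} %s\n' \
    "$1" "$2" "$3" "$4"
}

# Example call with made-up values:
format_metric kafka1 service-a 192.168.50.10 0
# prints: process_count{host="kafka1",process_name="service-a",ip="192.168.50.10"} 0
```

Using printf with quoted arguments avoids the nested single and double quotes that string concatenation needs, which makes the label syntax easier to check by eye.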

Turn the agent push script into a systemd service

cat > /usr/lib/systemd/system/agent.service <<EOF
[Unit]
Description=agent
Documentation=
After=network.target

[Service]
Type=simple
WorkingDirectory=/data/scripts
ExecStart=/data/scripts/agent.sh
ExecStop=/bin/kill -KILL \$MAINPID
ExecReload=/bin/kill -HUP \$MAINPID
KillMode=control-group
Restart=on-failure
RestartSec=3s

[Install]
WantedBy=multi-user.target
EOF
chmod +x /data/scripts/agent.sh
systemctl daemon-reload
systemctl enable agent
systemctl start agent
systemctl status agent

None of these processes exists yet, so every process count here is 0. Normally it should be 1 or greater.

image.png

Check the Pushgateway metrics page; the data is already there.

http://192.168.50.189:9091/metrics
image.png
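
The same check can be done from the shell instead of the browser. The sketch below assumes the Pushgateway address used in the agent script. The helper name only_process_count is hypothetical; it simply keeps the process_count samples from whatever text-format metrics it reads on stdin.

```shell
# only_process_count is a hypothetical helper, not part of the original setup.
# Reads Prometheus text-format metrics on stdin and keeps only the
# process_count samples, dropping comments and other metrics.
only_process_count() {
  grep '^process_count{'
}

# Usage against the Pushgateway from this article (needs network access):
#   curl -s http://192.168.50.189:9091/metrics | only_process_count
```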

In the Prometheus UI, querying process_count now returns the corresponding data.

image.png

Simulate starting the service-a process

[root@kafka1 ~]# touch service-a.sh
[root@kafka1 ~]# tail -f service-a.sh

image.png
image.png

Set up the alert rule by creating a new file, process.yml, in the Prometheus rules directory

groups:
- name: process
  rules:
  - alert: ProcessDown # alert name
    expr: process_count == 0
    for: 3s # how long the condition must hold before the alert fires
    labels:
      severity: warning
    annotations: # message details
      summary: "{{ $labels.process_name }} process is down"
      description: "{{ $labels.process_name }} is down, value is {{ $value }}"
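
The expression process_count == 0 selects every series whose current value is 0. As a rough local illustration only (this is not how Prometheus evaluates rules), the same selection can be applied to the text the agent pushes. The helper name zero_count_lines is hypothetical.

```shell
# zero_count_lines is a hypothetical helper for illustration only.
# Reads text-format samples on stdin and prints the process_count lines
# whose value (the last whitespace-separated field) is 0.
zero_count_lines() {
  awk 'index($0, "process_count{") == 1 && $NF == 0 { print }'
}

# Usage on the file written by the agent:
#   zero_count_lines < /data/scripts/tmpdata.txt
```

If promtool is installed, running `promtool check rules process.yml` validates the rule file syntax before Prometheus is reloaded.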

Three alerts are now showing

image.png

After stopping the process that was started earlier for the simulation, a fourth alert arrives

image.png

image.png