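For example, server A runs four processes: service-a, service-b, service-c and service-d. We need to monitor these four processes and send an alert to DingTalk if any of them goes down.
Write the names of the processes to monitor into process_list.txt: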
[root@kafka1 scripts]# pwd
/data/scripts
[root@kafka1 scripts]# cat process_list.txt
service-a
service-b
service-c
service-d
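The agent reads this file one name per line. As a small sketch (not part of the original script), the helper below reads such a list while skipping blank lines and `#` comment lines. The function name `read_process_list` and the comment convention are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print one process name per line from a list file,
# ignoring blank lines and lines that start with '#'.
read_process_list() {
    local file="$1" name
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r name || [ -n "$name" ]; do
        # Trim leading and trailing whitespace.
        name="${name#"${name%%[![:space:]]*}"}"
        name="${name%"${name##*[![:space:]]}"}"
        case "$name" in
            ''|'#'*) continue ;;
        esac
        printf '%s\n' "$name"
    done < "$file"
}

# Usage:
#   read_process_list /data/scripts/process_list.txt
```

Reading with `while read` instead of `for ... in $(cat ...)` avoids surprises if a line ever contains spaces or glob characters.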
Write the agent script, which pushes the data to pushgateway:
vim /data/scripts/agent.sh
#!/bin/bash
#########################################################################
# File Name: agent.sh
# Created on: 2021-03-14 11:36:19
# Author: Wu Kang
# Last Modified: 2021-03-14 12:24:50
# Description: Collect process information and push it to pushgateway once per second
#########################################################################
HOST=$(hostname)
# First IPv4 address that is not the loopback address
IP=$(ip a | grep inet | grep -Ev "127.0.0.1|fe80|::" | awk -F'/' '{print $1}' | awk '{print $2}' | head -1)

# Relative paths below assume the working directory is /data/scripts.
# Count the processes matching each name and write the metrics to tmpdata.txt
function getdata() {
    : > tmpdata.txt
    for process_name in $(cat process_list.txt)
    do
        count=$(ps -ef | grep -- "$process_name" | grep -v grep | wc -l)
        line='process_count{host="'$HOST'",process_name="'$process_name'",ip="'$IP'"} '$count
        echo "$line" >> tmpdata.txt
        echo "$line"
    done
}

# Push the collected metrics to pushgateway under the job name "process"
function pushdata() {
    curl -XPOST --data-binary @tmpdata.txt http://192.168.50.189:9091/metrics/job/process
}

# Collect and push once per second, forever
function run() {
    while true
    do
        getdata
        pushdata
        sleep 1
    done
}

main() {
    run
}

main
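The script above builds each metric line by string concatenation and does not check whether curl succeeded. The following is a minimal sketch of the same two steps as standalone functions, assuming you want a failed push to be reported on stderr. The function names `build_metric_line` and `push_metrics` and the 5-second timeout are illustrative choices, not part of the original script.

```shell
#!/bin/bash
# Build one sample line in the Prometheus text format.
# Arguments: host, process name, IP address, process count.
build_metric_line() {
    printf 'process_count{host="%s",process_name="%s",ip="%s"} %s\n' \
        "$1" "$2" "$3" "$4"
}

# Push a metrics file to pushgateway and report failures.
# Arguments: metrics file, pushgateway URL.
push_metrics() {
    local file="$1" url="$2"
    # --fail makes curl exit non-zero on HTTP error responses;
    # --max-time keeps a hung connection from blocking the loop.
    if ! curl --fail --silent --show-error --max-time 5 \
            -X POST --data-binary "@${file}" "$url"; then
        echo "push to ${url} failed" >&2
        return 1
    fi
}

# Usage with the pushgateway address from this post:
#   build_metric_line "$(hostname)" service-a "$IP" 0 > tmpdata.txt
#   push_metrics tmpdata.txt http://192.168.50.189:9091/metrics/job/process
```

Using `printf` with a fixed format string keeps the quoting in one place, which makes it harder to produce a malformed line when a label is added later.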
Turn the agent push script into a systemd service:
cat > /usr/lib/systemd/system/agent.service <<EOF
[Unit]
Description=agent
Documentation=
After=network.target
[Service]
Type=simple
WorkingDirectory=/data/scripts
ExecStart=/data/scripts/agent.sh
ExecStop=/bin/kill -KILL \$MAINPID
ExecReload=/bin/kill -HUP \$MAINPID
KillMode=control-group
Restart=on-failure
RestartSec=3s
[Install]
WantedBy=multi-user.target
EOF
chmod +x /data/scripts/agent.sh
systemctl enable agent
systemctl start agent
systemctl status agent
Because none of these processes exist yet, every process count in the output is 0; normally it should be 1 or more.
The pushgateway metrics page now shows the data:
http://192.168.50.189:9091/metrics
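The same check can be done from the command line. The sketch below filters the pushgateway output down to the process_count samples; the helper name `show_process_counts` is hypothetical, and it relies only on grep reading standard input.

```shell
#!/bin/bash
# Hypothetical helper: keep only process_count sample lines from
# Prometheus text-format input on stdin. Comment lines such as
# "# TYPE ..." start with '#' and are therefore dropped.
show_process_counts() {
    # "|| true" keeps the exit status zero when nothing matches.
    grep '^process_count{' || true
}

# Usage against the pushgateway address used in this post:
#   curl -s http://192.168.50.189:9091/metrics | show_process_counts
```

If this prints nothing, the agent has probably not pushed yet or the push is failing.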
In the Prometheus UI, querying process_count now returns the corresponding data.
Simulate starting the service-a process:
[root@kafka1 ~]# touch service-a.sh
[root@kafka1 ~]# tailf service-a.sh
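This simulation works because the agent counts every line of `ps -ef` output that contains the name, so the command line `tailf service-a.sh` is counted as service-a. The sketch below applies the same counting to a captured listing, so the behaviour can be inspected without starting real processes. The helper name `count_in_listing` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count the lines of a ps-style listing on stdin
# that match the given pattern, excluding grep's own entry, in the
# same way as the agent's "grep ... | grep -v grep | wc -l" pipeline.
count_in_listing() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status at zero.
    grep -v grep | grep -c -- "$1" || true
}

# Usage:
#   ps -ef | count_in_listing service-a
```

Note that the match is on any part of the command line: a process whose path contains `service-ab` would also be counted for `service-a`, so names in process_list.txt should be chosen to be unambiguous.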
Set up the alerting rule by creating a new file, process.yml, in the Prometheus rules directory:
groups:
- name: process
  rules:
  - alert: ProcessDown  # alert name
    expr: process_count == 0
    for: 3s  # how long the condition must hold before the alert fires
    labels:
      severity: warning
    annotations:  # alert details
      summary: "{{ $labels.process_name }} process is down"
      description: "{{ $labels.process_name }} is down, value is {{ $value }}"
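Before Prometheus picks up the file, it can help to validate it. The sketch below assumes that `promtool` (shipped with Prometheus) is on the PATH and that Prometheus was started with `--web.enable-lifecycle`, which is what enables the `/-/reload` endpoint. The function name `check_and_reload`, the rules path and the Prometheus address in the usage line are placeholders, not values given in this post.

```shell
#!/bin/bash
# Validate a rules file, then ask Prometheus to reload its configuration.
# Arguments: rules file path, Prometheus base URL.
check_and_reload() {
    local rules_file="$1" prom_url="$2"
    # promtool exits non-zero if the file has syntax or expression errors.
    promtool check rules "$rules_file" || return 1
    # /-/reload only works when Prometheus runs with --web.enable-lifecycle.
    curl --fail --silent --show-error -X POST "${prom_url}/-/reload"
}

# Usage (placeholder path and address):
#   check_and_reload /path/to/rules/process.yml http://localhost:9090
```

Without the lifecycle flag, restarting Prometheus or sending it SIGHUP achieves the same reload.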