起因
某台web服务器(跑cron)的突然负载特别高
因为有两台都在跑web,但是另一台正常,这一台有跑cron,所以决定再开一台服务器来跑cron
添加新服务器跑cron后,原服务器正常,但是新服务器还是负载高。
发现很多僵死进程
rails 16846 0.0 0.1 990992 18408 ? Ssl Feb29 0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
rails 17127 0.0 0.0 39104 3164 pts/1 R+ 21:59 0:00 ps aux
rails 17128 0.0 0.0 15984 1016 pts/1 S+ 21:59 0:00 grep --color=auto rails
rails 19128 0.0 2.0 985308 341748 ? Sl May31 0:21 bin/rails runner -e production FreeGo::Schedule.dispatch('comments')
rails 19345 0.0 0.1 989460 16624 ? Ssl 2019 0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
rails 20236 4.0 0.5 2304108 87100 ? Sl 2019 9991:16 main.rb
rails 23051 0.0 2.0 980624 337028 ? Sl May26 0:33 bin/rails runner -e production FreeGo::Schedule.dispatch('comments')
rails 26294 0.0 0.0 987608 13760 ? Ssl 2019 0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
rails 27071 0.0 2.1 988128 346472 ? Sl May24 0:36 bin/rails runner -e production FreeGo::Schedule.dispatch('comments')
rails 29847 0.0 2.1 988916 346160 ? Sl May28 0:31 bin/rails runner -e production FreeGo::Schedule.dispatch('gateway_inspection')
rails 31013 0.0 0.0 7540 156 ? S 2019 6:05 /rails_apps/elixir_obd_core/erts-10.0.5/bin/epmd -daemon
rails 31799 0.0 0.0 987884 15084 ? Ssl 2019 0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
rails 32317 0.0 0.1 1060984 17212 ? Ssl Mar20 0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
发现很多runner,一直占用大量内存 2%左右每个进程
判断是僵死的进程占用了大量的内存,导致服务器负载飙升
中间走了弯路
1,查为什么runner执行不完
实际在console里面直接运行runner很快就执行完了,不存在问题。
2,修改runner为task执行
一样也会有僵时的情况
3,觉得是connect不能释放的问题
对应的系统connect的 established 一直增加,查了redis pg的connect,觉得需要把相同时间的rake放在一起,减少connection
学会跟踪
之前知道看 /proc/{pid} 但是看status看不出什么问题
学会了 strace -p pid
对应僵死进程如下
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=0, tv_nsec=515857929}, NULL, 8) = 0 (Timeout)
read(3, 0x7ffe5643c790, 8) = -1 EAGAIN (Resource temporarily unavailable)
sched_yield() = 0
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=1, tv_nsec=0}, NULL, 8) = 0 (Timeout)
read(3, 0x7ffe5643c790, 8) = -1 EAGAIN (Resource temporarily unavailable)
sched_yield() = 0
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=1, tv_nsec=0}, NULL, 8) = 0 (Timeout)
read(3, 0x7ffe5643c790, 8) = -1 EAGAIN (Resource temporarily unavailable)
查了很久没有查到问题~不晚得为啥Resource temporarily unavailable。
后来网上查到 Resource temporarily unavailable 可能是无法create thread。
终于想到去摸cron的日志,production的日志,发现cron中有Resource temporarily unavailable的关键词(production中没有)
ThreadError: can't create Thread: Resource temporarily unavailable
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/threading/agent_thread.rb:25:in `initialize'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/threading/agent_thread.rb:25:in `new'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/threading/agent_thread.rb:25:in `create'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:717:in `start_worker_thread'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:500:in `setup_and_start_agent'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:482:in `check_config_and_start_agent'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:542:in `start'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/control/instance_methods.rb:135:in `start_agent'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/control/instance_methods.rb:75:in `init_plugin'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/newrelic_rpm.rb:41:in `block in <class:Railtie>'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/railties-6.0.3/lib/rails/initializable.rb:32:in `instance_exec'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/railties-6.0.3/lib/rails/initializable.rb:32:in `run'
/rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/railties-6.0.3/lib/rails/initializable.rb:61:in `block in run_initializers'
大概是new_relic的问题
然后再发现有new_relic的log,真相大白~
[2020-06-03 01:04:43 +0800 olive-cron (31785)] ERROR : ThreadError: can't create Thread: Resource temporarily unavailable
[2020-06-03 01:04:43 +0800 olive-cron (31785)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 62.510732672 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
[2020-06-03 01:04:44 +0800 olive-cron (31815)] ERROR : 1 error(s) running task for event '60_second_harvest' in Agent Event Loop:
[2020-06-03 01:04:44 +0800 olive-cron (31815)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 33.799507684 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
[2020-06-03 01:04:45 +0800 olive-cron (26931)] ERROR : 2 error(s) running task for event '5_second_harvest' in Agent Event Loop:
[2020-06-03 01:04:45 +0800 olive-cron (27383)] ERROR : 2 error(s) running task for event '5_second_harvest' in Agent Event Loop:
[2020-06-03 01:04:45 +0800 olive-cron (26931)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 18.096708066 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
[2020-06-03 01:04:45 +0800 olive-cron (27383)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 21.296923267 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
[2020-06-03 01:04:45 +0800 olive-cron (26931)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to@@@
Recoverable error connecting to collector-005.newrelic.com:443 after 62.510732672 seconds: Failed to open TCP connection to collector-005.newrelic.com:443
于是把cron中的new_relic去掉
暂时正常~