起因

某台web服务器(跑cron)的突然负载特别高
image.png
image.png
因为有两台都在跑web,但是另一台正常,这一台有跑cron,所以决定再开一台服务器来跑cron

image.png
添加新服务器跑cron后,原服务器正常,但是新服务器还是负载高。

发现很多僵死进程

  1. rails 16846 0.0 0.1 990992 18408 ? Ssl Feb29 0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
  2. rails 17127 0.0 0.0 39104 3164 pts/1 R+ 21:59 0:00 ps aux
  3. rails 17128 0.0 0.0 15984 1016 pts/1 S+ 21:59 0:00 grep --color=auto rails
  4. rails 19128 0.0 2.0 985308 341748 ? Sl May31 0:21 bin/rails runner -e production FreeGo::Schedule.dispatch('comments')
  5. rails 19345 0.0 0.1 989460 16624 ? Ssl 2019 0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
  6. rails 20236 4.0 0.5 2304108 87100 ? Sl 2019 9991:16 main.rb
  7. rails 23051 0.0 2.0 980624 337028 ? Sl May26 0:33 bin/rails runner -e production FreeGo::Schedule.dispatch('comments')
  8. rails 26294 0.0 0.0 987608 13760 ? Ssl 2019 0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
  9. rails 27071 0.0 2.1 988128 346472 ? Sl May24 0:36 bin/rails runner -e production FreeGo::Schedule.dispatch('comments')
  10. rails 29847 0.0 2.1 988916 346160 ? Sl May28 0:31 bin/rails runner -e production FreeGo::Schedule.dispatch('gateway_inspection')
  11. rails 31013 0.0 0.0 7540 156 ? S 2019 6:05 /rails_apps/elixir_obd_core/erts-10.0.5/bin/epmd -daemon
  12. rails 31799 0.0 0.0 987884 15084 ? Ssl 2019 0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
  13. rails 32317 0.0 0.1 1060984 17212 ? Ssl Mar20 0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}

发现很多runner,一直占用大量内存 2%左右每个进程
image.png
判断是僵死的进程占用了大量的内存,导致服务器负载飙升

中间走了弯路

1,查为什么runner执行不完

实际在console里面直接运行runner很快就执行完了,不存在问题。

2,修改runner为task执行

一样也会有僵时的情况

3,觉得是connect不能释放的问题

对应的系统connect的 established 一直增加,查了redis pg的connect,觉得需要把相同时间的rake放在一起,减少connection

学会跟踪

之前知道看 /proc/{pid} 但是看status看不出什么问题
学会了 strace -p pid
对应僵死进程如下

  1. ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=0, tv_nsec=515857929}, NULL, 8) = 0 (Timeout)
  2. read(3, 0x7ffe5643c790, 8) = -1 EAGAIN (Resource temporarily unavailable)
  3. sched_yield() = 0
  4. ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=1, tv_nsec=0}, NULL, 8) = 0 (Timeout)
  5. read(3, 0x7ffe5643c790, 8) = -1 EAGAIN (Resource temporarily unavailable)
  6. sched_yield() = 0
  7. ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=1, tv_nsec=0}, NULL, 8) = 0 (Timeout)
  8. read(3, 0x7ffe5643c790, 8) = -1 EAGAIN (Resource temporarily unavailable)

查了很久没有查到问题~不晚得为啥Resource temporarily unavailable。

后来网上查到 Resource temporarily unavailable 可能是无法create thread。

终于想到去摸cron的日志,production的日志,发现cron中有Resource temporarily unavailable的关键词(production中没有)

  1. ThreadError: can't create Thread: Resource temporarily unavailable
  2. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/threading/agent_thread.rb:25:in `initialize'
  3. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/threading/agent_thread.rb:25:in `new'
  4. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/threading/agent_thread.rb:25:in `create'
  5. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:717:in `start_worker_thread'
  6. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:500:in `setup_and_start_agent'
  7. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:482:in `check_config_and_start_agent'
  8. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:542:in `start'
  9. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/control/instance_methods.rb:135:in `start_agent'
  10. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/control/instance_methods.rb:75:in `init_plugin'
  11. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/newrelic_rpm.rb:41:in `block in <class:Railtie>'
  12. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/railties-6.0.3/lib/rails/initializable.rb:32:in `instance_exec'
  13. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/railties-6.0.3/lib/rails/initializable.rb:32:in `run'
  14. /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/railties-6.0.3/lib/rails/initializable.rb:61:in `block in run_initializers'

大概是new_relic的问题

然后再发现有new_relic的log,真相大白~

  1. [2020-06-03 01:04:43 +0800 olive-cron (31785)] ERROR : ThreadError: can't create Thread: Resource temporarily unavailable
  2. [2020-06-03 01:04:43 +0800 olive-cron (31785)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 62.510732672 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
  3. [2020-06-03 01:04:44 +0800 olive-cron (31815)] ERROR : 1 error(s) running task for event '60_second_harvest' in Agent Event Loop:
  4. [2020-06-03 01:04:44 +0800 olive-cron (31815)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 33.799507684 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
  5. [2020-06-03 01:04:45 +0800 olive-cron (26931)] ERROR : 2 error(s) running task for event '5_second_harvest' in Agent Event Loop:
  6. [2020-06-03 01:04:45 +0800 olive-cron (27383)] ERROR : 2 error(s) running task for event '5_second_harvest' in Agent Event Loop:
  7. [2020-06-03 01:04:45 +0800 olive-cron (26931)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 18.096708066 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
  8. [2020-06-03 01:04:45 +0800 olive-cron (27383)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 21.296923267 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
  9. [2020-06-03 01:04:45 +0800 olive-cron (26931)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to@@@

Recoverable error connecting to collector-005.newrelic.com:443 after 62.510732672 seconds: Failed to open TCP connection to collector-005.newrelic.com:443

于是把cron中的new_relic去掉

暂时正常~