记一次cron运行rails task无法释放的问题 - 《Rails 学习》

起因
发现很多僵死进程
中间走了弯路
学会跟踪

起因

某台web服务器（跑cron）的突然负载特别高

因为有两台都在跑web，但是另一台正常，这一台有跑cron，所以决定再开一台服务器来跑cron

添加新服务器跑cron后，原服务器正常，但是新服务器还是负载高。

发现很多僵死进程

rails    16846  0.0  0.1 990992 18408 ?        Ssl  Feb29   0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
rails    17127  0.0  0.0  39104  3164 pts/1    R+   21:59   0:00 ps aux
rails    17128  0.0  0.0  15984  1016 pts/1    S+   21:59   0:00 grep --color=auto rails
rails    19128  0.0  2.0 985308 341748 ?       Sl   May31   0:21 bin/rails runner -e production FreeGo::Schedule.dispatch('comments')
rails    19345  0.0  0.1 989460 16624 ?        Ssl   2019   0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
rails    20236  4.0  0.5 2304108 87100 ?       Sl    2019 9991:16 main.rb
rails    23051  0.0  2.0 980624 337028 ?       Sl   May26   0:33 bin/rails runner -e production FreeGo::Schedule.dispatch('comments')
rails    26294  0.0  0.0 987608 13760 ?        Ssl   2019   0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
rails    27071  0.0  2.1 988128 346472 ?       Sl   May24   0:36 bin/rails runner -e production FreeGo::Schedule.dispatch('comments')
rails    29847  0.0  2.1 988916 346160 ?       Sl   May28   0:31 bin/rails runner -e production FreeGo::Schedule.dispatch('gateway_inspection')
rails    31013  0.0  0.0   7540   156 ?        S     2019   6:05 /rails_apps/elixir_obd_core/erts-10.0.5/bin/epmd -daemon
rails    31799  0.0  0.0 987884 15084 ?        Ssl   2019   0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}
rails    32317  0.0  0.1 1060984 17212 ?       Ssl  Mar20   0:00 /home/rails/.nvm/versions/node/v8.2.1/bin/node /home/rails/.nvm/versions/node/v8.2.1/lib/node_modules/npm/node_modules/update-notifier/check.js {"pkg":{"name":"npm","version":"5.3.0"}}

发现很多runner，一直占用大量内存 2%左右每个进程

判断是僵死的进程占用了大量的内存，导致服务器负载飙升

中间走了弯路

1，查为什么runner执行不完

实际在console里面直接运行runner很快就执行完了，不存在问题。

2，修改runner为task执行

一样也会有僵时的情况

3，觉得是connect不能释放的问题

对应的系统connect的 established 一直增加，查了redis pg的connect，觉得需要把相同时间的rake放在一起，减少connection

学会跟踪

之前知道看 /proc/{pid} 但是看status看不出什么问题
学会了 strace -p pid
对应僵死进程如下

ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=0, tv_nsec=515857929}, NULL, 8) = 0 (Timeout)
read(3, 0x7ffe5643c790, 8)              = -1 EAGAIN (Resource temporarily unavailable)
sched_yield()                           = 0
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=1, tv_nsec=0}, NULL, 8) = 0 (Timeout)
read(3, 0x7ffe5643c790, 8)              = -1 EAGAIN (Resource temporarily unavailable)
sched_yield()                           = 0
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=1, tv_nsec=0}, NULL, 8) = 0 (Timeout)
read(3, 0x7ffe5643c790, 8)              = -1 EAGAIN (Resource temporarily unavailable)

查了很久没有查到问题~不晚得为啥Resource temporarily unavailable。

后来网上查到 Resource temporarily unavailable 可能是无法create thread。

终于想到去摸cron的日志，production的日志，发现cron中有Resource temporarily unavailable的关键词（production中没有）

ThreadError: can't create Thread: Resource temporarily unavailable
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/threading/agent_thread.rb:25:in `initialize'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/threading/agent_thread.rb:25:in `new'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/threading/agent_thread.rb:25:in `create'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:717:in `start_worker_thread'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:500:in `setup_and_start_agent'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:482:in `check_config_and_start_agent'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/agent/agent.rb:542:in `start'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/control/instance_methods.rb:135:in `start_agent'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/new_relic/control/instance_methods.rb:75:in `init_plugin'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/newrelic_rpm-6.9.0.363/lib/newrelic_rpm.rb:41:in `block in <class:Railtie>'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/railties-6.0.3/lib/rails/initializable.rb:32:in `instance_exec'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/railties-6.0.3/lib/rails/initializable.rb:32:in `run'
  /rails_apps/another_olive/shared/bundle/ruby/2.7.0/gems/railties-6.0.3/lib/rails/initializable.rb:61:in `block in run_initializers'

大概是new_relic的问题

然后再发现有new_relic的log，真相大白~

[2020-06-03 01:04:43 +0800 olive-cron (31785)] ERROR : ThreadError: can't create Thread: Resource temporarily unavailable
[2020-06-03 01:04:43 +0800 olive-cron (31785)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 62.510732672 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
[2020-06-03 01:04:44 +0800 olive-cron (31815)] ERROR : 1 error(s) running task for event '60_second_harvest' in Agent Event Loop:
[2020-06-03 01:04:44 +0800 olive-cron (31815)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 33.799507684 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
[2020-06-03 01:04:45 +0800 olive-cron (26931)] ERROR : 2 error(s) running task for event '5_second_harvest' in Agent Event Loop:
[2020-06-03 01:04:45 +0800 olive-cron (27383)] ERROR : 2 error(s) running task for event '5_second_harvest' in Agent Event Loop:
[2020-06-03 01:04:45 +0800 olive-cron (26931)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 18.096708066 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
[2020-06-03 01:04:45 +0800 olive-cron (27383)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to collector-005.newrelic.com:443 after 21.296923267 seconds: Failed to open TCP connection to collector-005.newrelic.com:443 (getaddrinfo: Temporary failure in name resolution)
[2020-06-03 01:04:45 +0800 olive-cron (26931)] ERROR : NewRelic::Agent::ServerConnectionException: Recoverable error connecting to@@@

Recoverable error connecting to collector-005.newrelic.com:443 after 62.510732672 seconds: Failed to open TCP connection to collector-005.newrelic.com:443

于是把cron中的new_relic去掉

暂时正常~