1. <dependency>
  2. <groupId>org.springframework.cloud</groupId>
  3. <artifactId>spring-cloud-starter-gateway</artifactId>
  4. <version>2.2.3.RELEASE</version>
  5. </dependency>
  6. <dependency>
  7. <groupId>org.springframework.boot</groupId>
  8. <artifactId>spring-boot-starter-webflux</artifactId>
  9. <version>2.3.0.RELEASE</version>
  10. </dependency>

背景及现象

线上有一个业务需要调用算法同学的python服务,然后每次算法同学上一个新的服务都需要和运维同学沟通半天,于是将算法的服务全部都集中到网关中,然后通过path来进行转发,以解决算法和运维之间的沟通问题。

在切换完以后,所有原来走Nginx直接转发到python的流量,由nginx转发到gw,再由gw转发到python服务。在切换到网关之后,网关开始频繁出现堆外OOM,由于该网关的代码是小程序曾经使用过的一套网关,因此是不存在BUG的。

同时,前台的dubbo服务在网关OOM的时候频繁出现调用超时。

处理问题

一开始看的时候看到机器的CPU很高,当时认为CPU影响到gw的运行了,于是部署了2个新的Tomcat到线上,此时由于重新启动了Tomcat,导致OOM消失并缓解了。一度认为是CPU的问题。

过了几周又出现了超时且服务器CPU低,同时观察到日志中出现了OOM的情况。
image.png
这个日志是字节码植入的,只是知道发生了OOM,但是并不知道是由于什么引发了堆外的OOM。

然后仔细检查日志的是发现了如下的数据
image.png
由于Netty的DataBuffer导致了堆外内存泄露。于是我添加了 -Dio.netty.leakDetectionLevel=advanced 让Netty帮忙进行内存监测,添加该操作,会在FGC的时候进行监测,何处可能存在内存溢出 Netty.

过来几天,观察到了如下情况

  1. 2021-08-14 14:24:11.042 ERROR 9797 --- [r-http-kqueue-6] io.netty.util.ResourceLeakDetector : LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
  2. Recent access records:
  3. #1:
  4. io.netty.buffer.AdvancedLeakAwareByteBuf.readBytes(AdvancedLeakAwareByteBuf.java:502)
  5. io.netty.buffer.ByteBufInputStream.read(ByteBufInputStream.java:180)
  6. java.io.InputStream.read(InputStream.java:101)
  7. org.springframework.util.StreamUtils.copy(StreamUtils.java:165)
  8. org.springframework.util.FileCopyUtils.copy(FileCopyUtils.java:112)
  9. org.springframework.cloud.gateway.filter.factory.rewrite.GzipMessageBodyResolver.encode(GzipMessageBodyResolver.java:52)
  10. reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:100)
  11. reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:178)
  12. reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
  13. reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
  14. java.util.concurrent.FutureTask.run(FutureTask.java:266)
  15. java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
  16. java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  17. java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  18. java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  19. java.lang.Thread.run(Thread.java:748)
  20. Created at:
  21. io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:363)
  22. io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
  23. io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
  24. io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115)
  25. org.springframework.core.io.buffer.NettyDataBufferFactory.allocateBuffer(NettyDataBufferFactory.java:71)
  26. org.springframework.core.io.buffer.NettyDataBufferFactory.allocateBuffer(NettyDataBufferFactory.java:39)
  27. org.springframework.core.codec.CharSequenceEncoder.encodeValue(CharSequenceEncoder.java:91)
  28. org.springframework.core.codec.CharSequenceEncoder.lambda$encode$0(CharSequenceEncoder.java:75)
  29. reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:100)
  30. reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:67)
  31. reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1755)
  32. reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:144)
  33. reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1755)
  34. reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:241)
  35. reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber.onNext(FluxOnAssembly.java:385)
  36. reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:121)
  37. reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:96)
  38. reactor.core.publisher.FluxMapFuseable$MapFuseableConditionalSubscriber.onNext(FluxMapFuseable.java:287)
  39. reactor.core.publisher.FluxFilterFuseable$FilterFuseableConditionalSubscriber.onNext(FluxFilterFuseable.java:330)
  40. reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1755)
  41. reactor.core.publisher.MonoCollect$CollectSubscriber.onComplete(MonoCollect.java:152)
  42. reactor.core.publisher.FluxJust$WeakScalarSubscription.request(FluxJust.java:101)
  43. reactor.core.publisher.MonoCollect$CollectSubscriber.onSubscribe(MonoCollect.java:112)
  44. reactor.core.publisher.FluxJust.subscribe(FluxJust.java:70)
  45. reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
  46. reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:150)
  47. reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:114)
  48. reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:114)
  49. reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:114)
  50. reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:178)
  51. reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
  52. reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
  53. java.util.concurrent.FutureTask.run(FutureTask.java:266)
  54. java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
  55. java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  56. java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  57. java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  58. java.lang.Thread.run(Thread.java:748)
  59. : 1 leak records were discarded because they were duplicates

GzipMessageBodyResolver

  1. package org.springframework.cloud.gateway.filter.factory.rewrite;
  2. import org.springframework.core.io.buffer.DataBuffer;
  3. import org.springframework.core.io.buffer.DataBufferUtils;
  4. import org.springframework.util.FileCopyUtils;
  5. import java.io.ByteArrayInputStream;
  6. import java.io.ByteArrayOutputStream;
  7. import java.io.IOException;
  8. import java.io.InputStream;
  9. import java.util.zip.GZIPInputStream;
  10. import java.util.zip.GZIPOutputStream;
  11. public class GzipMessageBodyResolver implements MessageBodyDecoder, MessageBodyEncoder {
  12. @Override
  13. public String encodingType() {
  14. return "gzip";
  15. }
  16. @Override
  17. public byte[] decode(byte[] encoded) {
  18. try {
  19. ByteArrayInputStream bis = new ByteArrayInputStream(encoded);
  20. GZIPInputStream gis = new GZIPInputStream(bis);
  21. return FileCopyUtils.copyToByteArray(gis);
  22. } catch (IOException e) {
  23. throw new IllegalStateException("couldn't decode body from gzip", e);
  24. }
  25. }
  26. @Override
  27. public byte[] encode(DataBuffer original) {
  28. try {
  29. ByteArrayOutputStream bis = new ByteArrayOutputStream();
  30. GZIPOutputStream gos = new GZIPOutputStream(bis);
  31. //带上所有信息,这里是第52行
  32. FileCopyUtils.copy(original.asInputStream(), gos);
  33. return bis.toByteArray();
  34. } catch (IOException e) {
  35. throw new IllegalStateException("couldn't encode body to gzip", e);
  36. }
  37. }
  38. }

调用这里的地方是 ModifyResponseBodyGatewayFilterFactory#writeBody , 如果存在 Content-Encoding 这个请求头,那么就进行encode。

使用Arthas查看返回的Content-Encoding,果然偶尔出现了 gzip的情况,然后运行的时间一长,就出现了OOM。

如何修复

明白这一层关系以后,我在本地重现了这个问题。

GzipMessageBodyResolver#encode 处,使用 DataBufferUtils#release 释放掉buffer即可,但是需要改动源码。为了修复问题,先拷贝一个类,改动发布上去。

降低-XX:MaxDirectMemorySize,由原来的512M降低到 128M2天过去并没有出现这个问题。

小结

  • 使用 -Dio.netty.leakDetectionLevel=advanced 让Netty监测可能是哪里出现了OOM,然后针对进行查看。
  • Netty监测到了地方以后,使用Arthas进行佐证
  • 本地模型进一步验证结论。

这里还使用jdk#bin目录下的一些工具,用于查看某些bean的实例化个数等等。

  • jstat, 查看fgc次数
  • jmap -histo 查看实例个数
  • jmap dump文件