First, the GitHub repo: https://github.com/uber-common/jvm-profiler

    Intro: Uber JVM Profiler provides a Java agent that collects various metrics and stack traces from Hadoop/Spark JVM processes in a distributed fashion, for example CPU/memory/IO metrics.

    Uber JVM Profiler also offers advanced profiling capabilities: it can trace arbitrary Java methods and arguments in user code without requiring any code changes. This can be used to trace the HDFS NameNode call latency of each Spark application and identify NameNode bottlenecks. It can also trace the HDFS file paths each Spark application reads or writes, to identify hot files for further optimization.

    The profiler was originally built for profiling Spark applications, which typically span dozens or hundreds of processes/machines, so that metrics from those different processes/machines can easily be correlated. It is also a generic Java agent and can be used with any JVM process.

    The idea: this plugin is a Java agent, so it does not intrude into your code, which is friendly. A Spark job launched with the agent sends memory and CPU consumption data in the background, at a configurable interval, to a reporter of our choosing (console, Kafka, etc.). When configuring a Spark job we usually size CPU and memory from experience 🙃, and in most cases we over-provision to keep the job safe, which wastes resources. With this plugin we finally have data to back up our resource-parameter tuning.

    Enough preamble, let's get started. My jobs run on Kubernetes, submitted through spark-k8s-operator.

    conf.set("spark.executor.extraJavaOptions",
      s"-javaagent:/opt/spark/jars/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.KafkaOutputReporter,metricInterval=10000,brokerList=broker01:9092,topicPrefix=spark_executor_jvm_profiler_")
    /**
     * spark.executor.extraJavaOptions  extra JVM options for the executors (there is a driver-side equivalent too)
     * -javaagent:/opt/spark/jars/jvm-profiler-1.0.0.jar=  location of the Java agent jar
     * reporter=com.uber.profiling.reporters.KafkaOutputReporter  use the Kafka reporter
     * metricInterval=10000  metric collection interval, in milliseconds
     * brokerList=broker01:9092  Kafka broker address
     * topicPrefix=spark_executor_jvm_profiler_  topic prefix; two topics result:
     *   spark_executor_jvm_profiler_CpuAndMemory and spark_executor_jvm_profiler_ProcessInfo
     */
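Since the jobs here are submitted through spark-k8s-operator, the same agent string can alternatively be set in the SparkApplication manifest instead of in code. A sketch only (the manifest name and surrounding fields are illustrative; it assumes the agent jar is already baked into the image at the path shown):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: jobname            # illustrative name
spec:
  sparkConf:
    # same agent options as the conf.set(...) call above
    "spark.executor.extraJavaOptions": "-javaagent:/opt/spark/jars/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.KafkaOutputReporter,metricInterval=10000,brokerList=broker01:9092,topicPrefix=spark_executor_jvm_profiler_"
```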

    The resulting Kafka data:

    {
        "heapMemoryMax": 1049100288,
        "role": "executor",
        "nonHeapMemoryTotalUsed": 131975152,
        "bufferPools": [
            {
                "totalCapacity": 446329,
                "name": "direct",
                "count": 9,
                "memoryUsed": 446330
            },
            {
                "totalCapacity": 0,
                "name": "mapped",
                "count": 0,
                "memoryUsed": 0
            }
        ],
        "heapMemoryTotalUsed": 385725280,
        "vmRSS": 926433280,
        "epochMillis": 1705601229434,
        "nonHeapMemoryCommitted": 134635520,
        "heapMemoryCommitted": 1049100288,
        "memoryPools": [
            {
                "peakUsageMax": 251658240,
                "usageMax": 251658240,
                "peakUsageUsed": 33978944,
                "name": "Code Cache",
                "peakUsageCommitted": 34275328,
                "usageUsed": 33978944,
                "type": "Non-heap memory",
                "usageCommitted": 34275328
            },
            {
                "peakUsageMax": -1,
                "usageMax": -1,
                "peakUsageUsed": 87132352,
                "name": "Metaspace",
                "peakUsageCommitted": 88956928,
                "usageUsed": 87132352,
                "type": "Non-heap memory",
                "usageCommitted": 88956928
            },
            {
                "peakUsageMax": 1073741824,
                "usageMax": 1073741824,
                "peakUsageUsed": 10863856,
                "name": "Compressed Class Space",
                "peakUsageCommitted": 11403264,
                "usageUsed": 10863856,
                "type": "Non-heap memory",
                "usageCommitted": 11403264
            },
            {
                "peakUsageMax": 344981504,
                "usageMax": 309329920,
                "peakUsageUsed": 340787200,
                "name": "PS Eden Space",
                "peakUsageCommitted": 340787200,
                "usageUsed": 92549024,
                "type": "Heap memory",
                "usageCommitted": 308805632
            },
            {
                "peakUsageMax": 58720256,
                "usageMax": 24117248,
                "peakUsageUsed": 44559888,
                "name": "PS Survivor Space",
                "peakUsageCommitted": 58720256,
                "usageUsed": 3010920,
                "type": "Heap memory",
                "usageCommitted": 24117248
            },
            {
                "peakUsageMax": 716177408,
                "usageMax": 716177408,
                "peakUsageUsed": 290179432,
                "name": "PS Old Gen",
                "peakUsageCommitted": 716177408,
                "usageUsed": 290179432,
                "type": "Heap memory",
                "usageCommitted": 716177408
            }
        ],
        "processCpuLoad": 0.10739614994934144,
        "systemCpuLoad": 0.11448834853090172,
        "processCpuTime": 93190000000,
        "vmHWM": 926433280,
        "appId": "spark-2d8e2e97d77542a48132dc8cc5697049",
        "vmPeak": 3852488704,
        "name": "37@jobname-2024-01-18-1705600859876-exec-1",
        "host": "jobname-2024-01-18-1705600859876-exec-1",
        "processUuid": "1493c603-54d3-424a-8343-65e9980338df",
        "nonHeapMemoryMax": -1,
        "vmSize": 3852484608,
        "gc": [
            {
                "collectionTime": 1351,
                "name": "PS Scavenge",
                "collectionCount": 217
            },
            {
                "collectionTime": 203,
                "name": "PS MarkSweep",
                "collectionCount": 3
            }
        ]
    }
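Each record on the CpuAndMemory topic is one JSON document like the above. A first step for the processing job is pulling out just the fields we care about; a minimal Python sketch (the helper function and the choice of fields are my own, but the field names match the sample above):

```python
import json

def parse_cpu_and_memory(msg: str) -> dict:
    """Extract the key fields from one CpuAndMemory JSON message."""
    m = json.loads(msg)
    return {
        "appId": m["appId"],
        "role": m["role"],
        "host": m["host"],
        "epochMillis": m["epochMillis"],
        "heapUsed": m["heapMemoryTotalUsed"],
        "heapMax": m["heapMemoryMax"],
        "rss": m["vmRSS"],
        "processCpuLoad": m["processCpuLoad"],
    }

# A shortened sample message (values taken from the record above):
sample = json.dumps({
    "appId": "spark-demo", "role": "executor",
    "host": "demo-exec-1", "epochMillis": 1705601229434,
    "heapMemoryTotalUsed": 385725280, "heapMemoryMax": 1049100288,
    "vmRSS": 926433280, "processCpuLoad": 0.107,
})
print(parse_cpu_and_memory(sample))
```

In practice the messages would come from a Kafka consumer subscribed to the two topics, but the parsing step is the same.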

    OK, now we have our monitoring data.

    Next we can write a data processing job over these metrics; before that, we need to know what the fields mean:

    The official docs have more detail: https://github.com/uber-common/jvm-profiler/blob/master/metricDetails.md

    Metric key           Meaning
    heapMemoryMax        maximum amount of heap memory available, in bytes
    role                 executor or driver
    heapMemoryTotalUsed  total heap memory used
    vmRSS                physical memory used by the JVM (resident set size), in bytes
    vmHWM                peak size ("high-water mark") of the JVM's resident set, in bytes
    appId                appId of the Spark application
    vmPeak               peak virtual memory used by the JVM, in bytes
    host                 host name
    vmSize               virtual memory size of the JVM
    systemCpuLoad        system CPU load
    processCpuLoad       process CPU load
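With the field meanings in hand, the offline statistics can be as simple as grouping samples by appId and computing peaks and averages. A sketch of that aggregation, not the author's actual job (input shape assumed to be the parsed records, one dict per Kafka message):

```python
from collections import defaultdict

def summarize(samples):
    """Per-app peak heap, peak RSS, and average process CPU load."""
    by_app = defaultdict(list)
    for s in samples:
        by_app[s["appId"]].append(s)
    return {
        app: {
            "peakHeap": max(s["heapMemoryTotalUsed"] for s in ss),
            "peakRss": max(s["vmRSS"] for s in ss),
            "avgCpu": sum(s["processCpuLoad"] for s in ss) / len(ss),
        }
        for app, ss in by_app.items()
    }

samples = [
    {"appId": "a1", "heapMemoryTotalUsed": 300, "vmRSS": 900, "processCpuLoad": 0.1},
    {"appId": "a1", "heapMemoryTotalUsed": 400, "vmRSS": 950, "processCpuLoad": 0.3},
]
print(summarize(samples)["a1"])
```

Comparing peakHeap/peakRss against the memory actually requested for the executors (and avgCpu against the requested cores) shows how much headroom each job really has, which is exactly the data support for resource tuning mentioned at the start.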

    Finally, run the offline job to produce the statistics and put them on a dashboard.

    (Figure 1: dashboard of Spark job memory/CPU usage collected by jvm-profiler)