First, the GitHub repo: https://github.com/uber-common/jvm-profiler
A quick overview (translated from the README): Uber JVM Profiler provides a Java Agent to collect various metrics and stack traces for Hadoop/Spark JVM processes in a distributed way, for example CPU/memory/IO metrics.

Uber JVM Profiler also provides advanced profiling capabilities to trace arbitrary Java methods and arguments in user code without requiring any code change. This feature can be used to trace the HDFS NameNode call latency for each Spark application and identify NameNode bottlenecks. It can also trace the HDFS file paths each Spark application reads or writes, to identify hot files for further optimization.

The profiler was initially created to profile Spark applications, which usually have dozens or hundreds of processes/machines per application, so that metrics from all of those processes/machines can easily be correlated. It is also a generic Java agent and can be used for any JVM process.
The idea: the plugin is a Java agent, so it doesn't touch your code at all, which is nice. A Spark job launched with the agent attached will, at a configurable interval, send CPU and memory consumption metrics in the background to whatever reporter you choose (console, Kafka, etc.). This matters because when we size a Spark job's CPU and memory we usually go by gut feeling 🙃, and to keep the job running safely we tend to over-provision, which is wasteful. With this plugin we finally have data to back up our resource tuning.
Enough talk, let's get started~ My jobs run on k8s via spark-k8s-operator.
conf.set("spark.executor.extraJavaOptions",s"-javaagent:/opt/spark/jars/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.KafkaOutputReporter,metricInterval=10000,brokerList=broker01:9092,topicPrefix=spark_executor_jvm_profiler_")/*** spark.executor.extraJavaOptions 这个是指定executor的额外的Java启动参数,当然还能指定driver的* -javaagent:/opt/spark/jars/jvm-profiler-1.0.0.jar= 指定Java agent的jar包位置。* reporter=com.uber.profiling.reporters.KafkaOutputReporter 指定为Kafkareporter* metricInterval=10000 数据采集时间间隔 毫秒* brokerList=broker01:9092 Kafka的broker地址* topicPrefix=spark_executor_jvm_profiler_ topic前缀,最后会有两个topic,spark_executor_jvm_profiler_CpuAndMemory 和 spark_executor_jvm_profiler_ProcessInfo*/
The resulting Kafka data:
{"heapMemoryMax":1049100288,"role":"executor","nonHeapMemoryTotalUsed":131975152,"bufferPools":[{"totalCapacity":446329,"name":"direct","count":9,"memoryUsed":446330},{"totalCapacity":0,"name":"mapped","count":0,"memoryUsed":0}],"heapMemoryTotalUsed":385725280,"vmRSS":926433280,"epochMillis":1705601229434,"nonHeapMemoryCommitted":134635520,"heapMemoryCommitted":1049100288,"memoryPools":[{"peakUsageMax":251658240,"usageMax":251658240,"peakUsageUsed":33978944,"name":"Code Cache","peakUsageCommitted":34275328,"usageUsed":33978944,"type":"Non-heap memory","usageCommitted":34275328},{"peakUsageMax":-1,"usageMax":-1,"peakUsageUsed":87132352,"name":"Metaspace","peakUsageCommitted":88956928,"usageUsed":87132352,"type":"Non-heap memory","usageCommitted":88956928},{"peakUsageMax":1073741824,"usageMax":1073741824,"peakUsageUsed":10863856,"name":"Compressed Class Space","peakUsageCommitted":11403264,"usageUsed":10863856,"type":"Non-heap memory","usageCommitted":11403264},{"peakUsageMax":344981504,"usageMax":309329920,"peakUsageUsed":340787200,"name":"PS Eden Space","peakUsageCommitted":340787200,"usageUsed":92549024,"type":"Heap memory","usageCommitted":308805632},{"peakUsageMax":58720256,"usageMax":24117248,"peakUsageUsed":44559888,"name":"PS Survivor Space","peakUsageCommitted":58720256,"usageUsed":3010920,"type":"Heap memory","usageCommitted":24117248},{"peakUsageMax":716177408,"usageMax":716177408,"peakUsageUsed":290179432,"name":"PS Old Gen","peakUsageCommitted":716177408,"usageUsed":290179432,"type":"Heap memory","usageCommitted":716177408}],"processCpuLoad":0.10739614994934144,"systemCpuLoad":0.11448834853090172,"processCpuTime":93190000000,"vmHWM":926433280,"appId":"spark-2d8e2e97d77542a48132dc8cc5697049","vmPeak":3852488704,"name":"37@jobname-2024-01-18-1705600859876-exec-1","host":"jobname-2024-01-18-1705600859876-exec-1","processUuid":"1493c603-54d3-424a-8343-65e9980338df","nonHeapMemoryMax":-1,"vmSize":3852484608,"gc":[{"collectionTime":1351,"name":"PS Scavenge","collectionCount":217},{"collectionTime":203,"name":"PS MarkSweep","collectionCount":3}]}
OK, so now we have our monitoring data.

Next we can write a processing job for these metrics, but before that we need to know what the fields mean:
The full official reference: https://github.com/uber-common/jvm-profiler/blob/master/metricDetails.md
| Metric key | Meaning | Notes |
|---|---|---|
| heapMemoryMax | maximum amount of heap memory available, in bytes | |
| role | executor or driver | |
| heapMemoryTotalUsed | total heap memory used, in bytes | |
| vmRSS | physical memory used by the JVM process (resident set size), in bytes | |
| vmHWM | peak resident set size of the JVM process ("high water mark"), in bytes | |
| appId | application id of the Spark job | |
| vmPeak | peak virtual memory used by the JVM process, in bytes | |
| host | hostname | |
| vmSize | virtual memory size of the JVM process, in bytes | |
| systemCpuLoad | system-wide CPU load | |
| processCpuLoad | CPU load of the JVM process | |
Finally, an offline job crunches these metrics into statistics for the dashboard; a rough sketch of such a job follows.
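A minimal sketch of that offline job, assuming the broker and topic prefix from the agent config above (so the CPU/memory topic is `spark_executor_jvm_profiler_CpuAndMemory`); it needs the `spark-sql-kafka` connector on the classpath, and where the result is written is left open:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object JvmProfilerStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jvm-profiler-stats").getOrCreate()

    // Only the fields we aggregate on; the full field list is in metricDetails.md.
    val schema = new StructType()
      .add("appId", StringType)
      .add("role", StringType)
      .add("heapMemoryMax", LongType)
      .add("heapMemoryTotalUsed", LongType)
      .add("vmRSS", LongType)
      .add("processCpuLoad", DoubleType)
      .add("epochMillis", LongType)

    // Batch read of the CpuAndMemory topic.
    val raw = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker01:9092")
      .option("subscribe", "spark_executor_jvm_profiler_CpuAndMemory")
      .option("startingOffsets", "earliest")
      .load()

    // Each Kafka value is one JSON record like the sample above.
    val metrics = raw
      .select(from_json(col("value").cast("string"), schema).as("m"))
      .select("m.*")

    // Peak usage per application and role: compare these against
    // spark.executor.memory / spark.executor.cores to spot over-provisioning.
    val peaks = metrics
      .groupBy("appId", "role")
      .agg(
        max("heapMemoryTotalUsed").as("peakHeapUsed"),
        max("heapMemoryMax").as("heapMax"),
        max("vmRSS").as("peakRss"),
        max("processCpuLoad").as("peakCpuLoad")
      )

    // In the real job this would be written to whatever table the dashboard reads from.
    peaks.show(false)
    spark.stop()
  }
}
```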

