RocksDB Trace, Replay, Analyzer, and Workload Generation



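The trace_replay API allows users to trace query information into a trace file. In the current implementation, the queries traced by the trace_replay API are Get, WriteBatch (Put, Delete, Merge, SingleDelete, and DeleteRange), and Iterator (Seek and SeekForPrev). The key, query timestamp, value (if applicable), and cf_id form one trace record. Because a lock protects the tracing instance and writing the trace file incurs extra I/O, DB performance is affected. In the tests run so far on MyRocks and ZippyDB shadow servers, however, performance was not a problem. The trace records from the same DB instance are written to a binary trace file. Users can specify the path of the trace file (for example, on a different storage device to reduce the I/O impact).

Currently, a trace file can be replayed with db_bench. The query records in the trace file are replayed against the target DB instance according to their timestamps. This reproduces a workload nearly identical to the one collected, which provides test cases that are closer to production.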


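A simple example of using the tracing APIs to generate a trace file:

  #include <memory>
  #include <string>
  #include "rocksdb/db.h"
  #include "rocksdb/trace_reader_writer.h"

  using namespace rocksdb;

  int main() {
    Env* env = Env::Default();
    EnvOptions env_options;
    std::string trace_path = "/tmp/trace_test_example";
    std::unique_ptr<TraceWriter> trace_writer;
    DB* db = nullptr;
    std::string db_name = "/tmp/rocksdb";
    Options options;
    options.create_if_missing = true;
    TraceOptions trace_opt;
    /* Create the trace file writer */
    NewFileTraceWriter(env, env_options, trace_path, &trace_writer);
    DB::Open(options, db_name, &db);
    /* Start tracing */
    db->StartTrace(trace_opt, std::move(trace_writer));
    /* Your calls to RocksDB APIs */
    /* End tracing */
    db->EndTrace();
    delete db;
    return 0;
  }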


  ./db_bench --benchmarks=replay --trace_file=/tmp/trace_test_example --num_column_families=5


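After completing the tracing steps with the trace_replay API, users get a binary trace file. In the trace file, Get, Seek, and SeekForPrev are each traced as separate trace records, while Put, Merge, Delete, SingleDelete, and DeleteRange queries are packed into WriteBatches. A tool is needed to: 1) interpret the trace into a human-readable format for further analysis; 2) provide rich and powerful in-memory processing options to analyze the trace and output the corresponding results; 3) make it easy to add new analysis options and query types. The trace_analyzer tool serves this purpose.


Note that most of the generated analysis output files are separated by column family and by query type, which means that each query type in each column family has its own output file. Usually, one specified output option generates one output file.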



The trace_analyzer options are:

  -analyze_delete (Analyze the Delete query.) type: bool default: false
  -analyze_get (Analyze the Get query.) type: bool default: false
  -analyze_iterator ( Analyze the iterate query like seek() and
    seekForPrev().) type: bool default: false
  -analyze_merge (Analyze the Merge query.) type: bool default: false
  -analyze_put (Analyze the Put query.) type: bool default: false
  -analyze_range_delete (Analyze the DeleteRange query.) type: bool
    default: false
  -analyze_single_delete (Analyze the SingleDelete query.) type: bool
    default: false
  -convert_to_human_readable_trace (Convert the binary trace file to a human
    readable txt file for further processing. This file will be extremely
    large (similar size as the original binary trace file). You can specify
    'no_key' to reduce the size, if key is not needed in the next step
    File name: <prefix>_human_readable_trace.txt
    Format:[type_id cf_id value_size time_in_micorsec <key>].) type: bool
    default: false
  -key_space_dir (<the directory stores full key space files>
    The key space files should be: <column family id>.txt) type: string
    default: ""
  -no_key ( Does not output the key to the result files to make smaller.)
    type: bool default: false
  -no_print (Do not print out any result) type: bool default: false
  -output_access_count_stats (Output the access count distribution statistics
    to file.
    File name: <prefix>-<query
    type>-<cf_id>-accessed_key_count_distribution.txt
    Format:[access_count number_of_access_count]) type: bool default: false
  -output_dir (The directory to store the output files.) type: string
    default: ""
  -output_ignore_count (<threshold>, ignores the access count <= this value,
    it will shorter the output.) type: int32 default: 0
  -output_key_distribution (Output the key size distribution.) type: bool
    default: false
  -output_key_stats (Output the key access count statistics to file
    for accessed keys:
    file name: <prefix>-<query type>-<cf_id>-accessed_key_stats.txt
    Format:[cf_id value_size access_keyid access_count]
    for the whole key space keys:
    File name: <prefix>-<query type>-<cf_id>-whole_key_stats.txt
    Format:[whole_key_space_keyid access_count]) type: bool default: false
  -output_prefix (The prefix used for all the output files.) type: string
    default: "trace"
  -output_prefix_cut (The number of bytes as prefix to cut the keys.
    if it is enabled, it will generate the following:
    for accessed keys:
    File name: <prefix>-<query type>-<cf_id>-accessed_key_prefix_cut.txt
    Format:[acessed_keyid access_count_of_prefix number_of_keys_in_prefix
    average_key_access prefix_succ_ratio prefix]
    for whole key space keys:
    File name: <prefix>-<query type>-<cf_id>-whole_key_prefix_cut.txt
    Format:[start_keyid_in_whole_keyspace prefix]
    if 'output_qps_stats' and 'top_k' are enabled, it will output:
    File name: <prefix>-<query
    type>-<cf_id>-accessed_top_k_qps_prefix_cut.txt
    Format:[the_top_ith_qps_time QPS], [prefix qps_of_this_second].)
    type: int32 default: 0
  -output_qps_stats (Output the query per second(qps) statistics
    For the overall qps, it will contain all qps of each query type. The time
    is started from the first trace record
    File name: <prefix>_qps_stats.txt
    Format: [qps_type_1 qps_type_2 ...... overall_qps]
    For each cf and query, it will have its own qps output
    File name: <prefix>-<query type>-<cf_id>_qps_stats.txt
    Format:[query_count_in_this_second].) type: bool default: false
  -output_time_series (Output the access time in second of each key, such
    that we can have the time series data of the queries
    File name: <prefix>-<query type>-<cf_id>-time_series.txt
    Format:[type_id time_in_sec access_keyid].) type: bool default: false
  -output_value_distribution (Out put the value size distribution, only
    available for Put and Merge.
    File name: <prefix>-<query
    type>-<cf_id>-accessed_value_size_distribution.txt
    Format:[Number_of_value_size_between x and x+value_interval is: <the
    count>]) type: bool default: false
  -print_correlation (intput format: [correlation pairs][.,.]
    Output the query correlations between the pairs of query types listed in
    the parameter, input should select the operations from:
    get, put, delete, single_delete, rangle_delete, merge. No space between
    the pairs separated by commar. Example: =[get,get]... It will print out
    the number of pairs of 'A after B' and the average time interval between
    the two query) type: string default: ""
  -print_overall_stats ( Print the stats of the whole trace, like total
    requests, keys, and etc.) type: bool default: true
  -print_top_k_access (<top K of the variables to be printed> Print the top k
    accessed keys, top k accessed prefix and etc.) type: int32 default: 1
  -trace_path (The trace file path.) type: string default: ""
  -value_interval (To output the value distribution, we need to set the value
    intervals and make the statistic of the value size distribution in
    different intervals. The default is 8.) type: int32 default: 8


An example:

  ./trace_analyzer -analyze_get -output_access_count_stats -output_dir=/data/trace/result -output_key_stats \
    -output_qps_stats -convert_to_human_readable_trace -output_value_distribution -output_key_distribution \
    -print_overall_stats -print_top_k_access=3 -output_prefix=test -trace_path=/data/trace/trace




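The raw binary trace stores encoded data structures and content, so a tool must use the RocksDB library to interpret it. To simplify further analysis of the trace, users can specify:

  -convert_to_human_readable_trace

The raw trace is converted to a txt file whose content is formatted as "[type_id cf_id value_size time_in_micorsec <key>]". If the key is not needed, users can specify "-no_key" to reduce the file size. This option is independent of all other options; once it is specified, the converted trace is generated. If the original keys are included, the txt file can be similar in size to the original trace file, or even larger.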
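Once the trace is in text form, it can be processed with ordinary scripts. The following is a minimal Python sketch, not part of trace_analyzer, that counts records per (type_id, cf_id). It assumes the fields are whitespace-separated in the documented order and that the key column may be absent when "-no_key" is used; the sample lines are made up for illustration.

```python
from collections import Counter


def count_records(lines):
    """Count trace records per (type_id, cf_id).

    Assumes whitespace-separated fields in the documented order:
    type_id cf_id value_size time_in_microseconds [key]
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        type_id, cf_id = int(fields[0]), int(fields[1])
        counts[(type_id, cf_id)] += 1
    return counts


# Made-up sample records; the last one has no key column.
sample = [
    "0 1 0 1532381594728669 6B657931",
    "1 1 100 1532381594728700 6B657932",
    "0 1 0 1532381594728800",
]
record_counts = count_records(sample)
print(record_counts)
```

To run it on a real converted trace, pass an open file object instead of the sample list, for example `count_records(open(path))`.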



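To specify the trace file path:

  -trace_path=<path to the trace>

To specify the directory that stores the output files:

  -output_dir=<the path to the output directory>

To analyze the accessed keys together with the existing key space, users need to specify the directory that stores the key space files. Each key space file should be named "<column family id>.txt", with one key per line. Usually, users can dump all existing keys with the ldb tool, using "./ldb scan". To specify the directory:

  -key_space_dir=<the path to the key space directory>

To specify the prefix used for all the output files:

  -output_prefix=<the prefix, like "trace1">

To suppress printing of the results:

  -no_print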


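Currently, the trace_analyzer tool provides several analysis options to characterize the workload. Some results are printed directly (options prefixed with "-print"), while others are written to files (options prefixed with "-output"). Users can specify any combination of options to analyze the trace. Note that some analysis options consume a large amount of memory (for example, -output_time_series, -print_correlation, and -key_space_dir). If memory is insufficient, try running them in separate runs.

To print the statistics of the whole trace, such as total requests and keys:

  -print_overall_stats

To output the key access count statistics to files:

  -output_key_stats

In some workloads, keys share a common part. For example, in MyRocks, the first X bytes of a key are the table index_num. We can use the first X bytes to cut the keys into different prefix ranges. When the number of bytes used to cut the key space is specified, trace_analyzer generates a file in which each record represents one prefix cut and stores the corresponding KeyID and prefix content. If -key_space_dir is specified, there are two separate files: one for the accessed keys and one for the whole key space. Usually, the prefix cut files are used together with accessed_key_stats.txt and whole_key_stats.txt, respectively.

  -output_prefix_cut=<number of bytes as prefix>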
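The idea of a prefix cut can be illustrated with a short Python sketch. This is not the trace_analyzer implementation; it assumes that a key ID is simply the 0-based position of the key in sorted order, and the function name and sample keys are hypothetical.

```python
def prefix_cuts(sorted_keys, prefix_len):
    """Return (start_key_id, prefix) for each run of keys sharing a prefix.

    Assumes sorted_keys is already sorted and that the key ID of a key is
    its 0-based position in that order (an assumption for illustration).
    """
    cuts = []
    last_prefix = None
    for key_id, key in enumerate(sorted_keys):
        prefix = key[:prefix_len]
        if prefix != last_prefix:
            # A new prefix range starts at this key ID.
            cuts.append((key_id, prefix))
            last_prefix = prefix
    return cuts


# Made-up keys whose first 4 bytes play the role of a table index.
keys = sorted([b"0001aaa", b"0001bbb", b"0002aaa", b"0003aaa", b"0003zzz"])
cuts = prefix_cuts(keys, 4)
print(cuts)
```

Each returned pair marks where a new prefix range begins, which is the kind of boundary the plotting scripts draw as vertical lines.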


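To output the access time of each key, in seconds, as time series data:

  -output_time_series

To output the queries per second (QPS) statistics:

  -output_qps_stats

For each query type of each column family, one file is generated that contains the number of queries in each second. In addition, a separate file contains the QPS of each query type across all column families, together with the overall QPS. The average QPS and the peak QPS are printed.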
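The per-second counting can be sketched in a few lines of Python. This is an illustration only, assuming timestamps in microseconds and time measured from the first trace record; the function name and sample timestamps are hypothetical.

```python
from collections import Counter


def qps_series(timestamps_us):
    """Bucket microsecond timestamps into per-second query counts.

    Time starts at the earliest record; seconds with no queries count as 0.
    """
    if not timestamps_us:
        return []
    start = min(timestamps_us)
    buckets = Counter((ts - start) // 1_000_000 for ts in timestamps_us)
    return [buckets.get(sec, 0) for sec in range(max(buckets) + 1)]


# Made-up timestamps: three queries in second 0, one in second 1, one in second 3.
ts = [1_000_000, 1_200_000, 1_900_000, 2_100_000, 4_000_000]
series = qps_series(ts)
average_qps = sum(series) / len(series)
peak_qps = max(series)
print(series, average_qps, peak_qps)
```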


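  -print_top_k_access

The top K accessed keys are printed together with their access counts. In addition, if the prefix_cut option is specified, the top K accessed prefixes by total access count are printed, as well as the top K prefixes with the highest average access count per key.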
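As a rough illustration of the top-K report, the following Python sketch ranks keys and prefixes by total access count. It is an assumption-laden stand-in for the tool's logic: the function name, the string keys, and the 2-character prefix are all made up.

```python
from collections import Counter


def top_k_access(accessed_keys, k, prefix_len=0):
    """Return the top-k keys and, if prefix_len > 0, the top-k prefixes,
    both ranked by total access count."""
    key_counts = Counter(accessed_keys)
    top_keys = key_counts.most_common(k)
    top_prefixes = []
    if prefix_len > 0:
        prefix_counts = Counter()
        for key, count in key_counts.items():
            prefix_counts[key[:prefix_len]] += count
        top_prefixes = prefix_counts.most_common(k)
    return top_keys, top_prefixes


# Made-up access sequence; the first two characters act as the prefix.
accesses = ["t1_a", "t1_a", "t1_b", "t2_a", "t2_a", "t2_a", "t2_a", "t3_a"]
top_keys, top_prefixes = top_k_access(accesses, k=2, prefix_len=2)
print(top_keys, top_prefixes)
```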


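  -output_value_distribution
  -value_interval

Because value sizes vary widely, users may only want to know how many values fall into each value size range. Users can specify value_interval=x to count the values in the ranges [0,x), [x,2x), and so on.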
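The bucketing by value_interval can be sketched as follows. This is a minimal Python illustration rather than the tool's code; it assumes half-open ranges [n*x, (n+1)*x), and the sample sizes are made up.

```python
from collections import Counter


def value_size_distribution(value_sizes, value_interval=8):
    """Count values per size range [n*interval, (n+1)*interval)."""
    buckets = Counter(size // value_interval for size in value_sizes)
    return {
        (b * value_interval, (b + 1) * value_interval): count
        for b, count in sorted(buckets.items())
    }


# Made-up value sizes in bytes, bucketed with the default interval of 8.
dist = value_size_distribution([3, 7, 8, 15, 16, 100], value_interval=8)
print(dist)
```

Only non-empty ranges appear in the result, which keeps the output short when value sizes are sparse.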


To output the key size distribution:

  -output_key_distribution



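Here we use the open-source plotting tool GNUPLOT as an example to generate graphs. More details about GNUPLOT can be found in its documentation. Users can write GNUPLOT commands directly to draw the graphs or, more simply, use the following shell scripts to generate the GNUPLOT source files (before using the scripts, make sure the file names and other placeholder values are replaced with valid ones).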


  #!/bin/bash
  # Draw the heat map of the accessed keys
  # The query type
  ops="iterator"
  # The column family ID
  cf="9"
  # The column family name if known; if not, replace it with some prefix
  cf_name="rev:cf-assoc-deleter-id1-type"
  form="accessed"
  # The column number that will be plotted
  use="4"
  # The upper bound of the Y-axis
  y=2
  # The upper bound of the X-axis
  x=29233
  echo "set output '${cf_name}-${ops}-${form}-key_heatmap.png'" > plot-${cf_name}-${ops}-${form}
  echo "set term png size 2000,500" >>plot-${cf_name}-${ops}-${form}
  echo "set title 'CF: ${cf_name} ${form} Key Space Heat Map'">>plot-${cf_name}-${ops}-${form}
  echo "set xlabel 'Key Sequence'">>plot-${cf_name}-${ops}-${form}
  echo "set ylabel 'Key access count'">>plot-${cf_name}-${ops}-${form}
  echo "set yrange [0:$y]">>plot-${cf_name}-${ops}-${form}
  echo "set xrange [0:$x]">>plot-${cf_name}-${ops}-${form}
  # If the prefix cut is available, draw the prefix cuts as vertical lines
  while read f1 f2
  do
    echo "set arrow from $f1,0 to $f1,$y nohead lc rgb 'red'" >> plot-${cf_name}-${ops}-${form}
  done < "trace.1532381594728669-${ops}-${cf}-${form}_key_prefix_cut.txt"
  echo "plot 'trace.1532381594728669-${ops}-${cf}-${form}_key_stats.txt' using ${use} notitle w dots lt 2" >>plot-${cf_name}-${ops}-${form}
  gnuplot plot-${cf_name}-${ops}-${form}

  #!/bin/bash
  # Draw the time series of the key accesses
  # The query type
  ops="iterator"
  # The upper bound of the X-axis
  x=29233
  # The column family ID
  cf="8"
  # The column family name if known; if not, replace it with some prefix
  cf_name="rev:cf-assoc-deleter-id1-type"
  # The type of the output file
  form="time_series"
  # The columns that will be plotted (X is access_keyid, Y is time_in_sec)
  use="3:2"
  # The total time of the tracing duration, in seconds
  y=88000
  echo "set output '${cf_name}-${ops}-${form}-key_heatmap.png'" > plot-${cf_name}-${ops}-${form}
  echo "set term png size 3000,3000" >>plot-${cf_name}-${ops}-${form}
  echo "set title 'CF: ${cf_name} time series'">>plot-${cf_name}-${ops}-${form}
  echo "set xlabel 'Key Sequence'">>plot-${cf_name}-${ops}-${form}
  echo "set ylabel 'Time in second'">>plot-${cf_name}-${ops}-${form}
  echo "set yrange [0:$y]">>plot-${cf_name}-${ops}-${form}
  echo "set xrange [0:$x]">>plot-${cf_name}-${ops}-${form}
  # If the prefix cut is available, draw the prefix cuts as vertical lines
  while read f1 f2
  do
    echo "set arrow from $f1,0 to $f1,$y nohead lc rgb 'red'" >> plot-${cf_name}-${ops}-${form}
  done < "trace.1532381594728669-${ops}-${cf}-accessed_key_prefix_cut.txt"
  echo "plot 'trace.1532381594728669-${ops}-${cf}-${form}.txt' using ${use} notitle w dots lt 2" >>plot-${cf_name}-${ops}-${form}
  gnuplot plot-${cf_name}-${ops}-${form}


  #!/bin/bash
  # Draw the QPS over time
  # The query type
  ops="iterator"
  # The upper bound of the QPS
  y=5
  # The column family ID
  cf="9"
  # The column family name if known; if not, replace it with some prefix
  cf_name="rev:cf-assoc-deleter-id1-type"
  # The type of the output file
  form="qps_stats"
  # The column number that will be plotted
  use="1"
  # The total time of the tracing duration, in seconds
  x=88000
  echo "set output '${cf_name}-${ops}-${form}-IO_per_second.png'" > plot-${cf_name}-${ops}-${form}
  echo "set term png size 2000,1200" >>plot-${cf_name}-${ops}-${form}
  echo "set title 'CF: ${cf_name} QPS Over Time'">>plot-${cf_name}-${ops}-${form}
  echo "set xlabel 'Time in second'">>plot-${cf_name}-${ops}-${form}
  echo "set ylabel 'QPS'">>plot-${cf_name}-${ops}-${form}
  echo "set yrange [0:$y]">>plot-${cf_name}-${ops}-${form}
  echo "set xrange [0:$x]">>plot-${cf_name}-${ops}-${form}
  echo "plot 'trace.1532381594728669-${ops}-${cf}-${form}.txt' using ${use} notitle with linespoints" >>plot-${cf_name}-${ops}-${form}
  gnuplot plot-${cf_name}-${ops}-${form}


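We can use different tools, scripts, and models to fit the workload statistics. Typically, users can fit models to the distributions of the key access counts and the prefix access counts. The QPS can also be modeled. Here we use Matlab as an example to fit the key access count, the prefix access count, and the QPS.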



  % This script is used to fit the key access count distribution
  % to the two-term exponential distribution and get the parameters
  % The input file has the suffix: accessed_key_stats.txt
  fileID = fopen('trace.1531329742187378-get-4-accessed_key_stats.txt');
  txt = textscan(fileID,'%f %f %f %f');
  fclose(fileID);
  % Sort the access counts of the keys in descending order
  t2=sort(txt{4},'descend');
  % The number of access counts that are used to fit the data
  % The value depends on the accuracy demand of your model fitting
  % and must never be greater than the size of t2
  count=30000;
  % Generate the x values (rank of the key)
  x=1:1:count;
  x=x';
  % Take the top 'count' entries and normalize them
  y=t2(1:count);
  y=y/(sum(y));
  figure;
  % Fit the data to the exp2 model
  f=fit(x,y,'exp2')
  % Plot the original data and the fitted line to compare
  plot(f,x,y);


  % This script is used to fit the key access count distribution
  % to the two-term exponential distribution and get the parameters
  % The input file has the suffix: key_count_distribution.txt
  fileID = fopen('trace-get-9-accessed_key_count_distribution.txt');
  input = textscan(fileID,'%s %f %s %f');
  fclose(fileID);
  % Get the number of keys that have access count x
  t2=sort(input{4},'descend');
  % The number of access counts that are used to fit the data
  % The value depends on the accuracy demand of your model fitting
  % and must never be greater than the size of t2
  count=100;
  % Generate the access count x
  x=1:1:count;
  x=x';
  % Take the top 'count' entries and normalize them
  y=t2(1:count);
  y=y/(sum(y));
  figure;
  % Fit the data to the exp2 model
  f=fit(x,y,'exp2')
  % Plot the original data and the fitted line to compare
  plot(f,x,y);


  % This script is used to fit the prefix average access count distribution
  % to the two-term exponential distribution and get the parameters
  % The input file has the suffix: accessed_key_prefix_cut.txt
  fileID = fopen('trace-get-4-accessed_key_prefix_cut.txt');
  txt = textscan(fileID,'%f %f %f %f %s');
  fclose(fileID);
  % The per key access (average) of each prefix, sorted
  t2=sort(txt{4},'descend');
  % The number of access counts that are used to fit the data
  % The value depends on the accuracy demand of your model fitting
  % and must never be greater than the size of t2
  count=1000;
  % Generate the x values (rank of the prefix)
  x=1:1:count;
  x=x';
  % Take the top 'count' entries (Matlab indices start at 1) and normalize
  y=t2(1:count);
  y=y/(sum(y));
  % Fit the data to the exp2 model
  figure;
  f=fit(x,y,'exp2')
  % Plot the original data and the fitted line to compare
  plot(f,x,y);


  % This script is used to fit the QPS of one query type in one column
  % family to the sin'x' model. 'x' can be 1 to 10. With a higher value
  % of 'x', you can get a more accurate fit of the QPS. However,
  % the model will be more complex and will sometimes be overfitted.
  % The suggestion is to use sin1 or sin2
  % The input file should have the suffix: qps_stats.txt
  fileID = fopen('trace-get-4-qps_stats.txt');
  txt = textscan(fileID,'%f');
  fclose(fileID);
  t1=txt{1};
  % The input is the queries per second. If you use the QPS directly,
  % you may get a high level of noise. Here, 'n' is the number of QPS
  % samples that you want to combine into one average value, such that
  % you reduce the data to queries per n*seconds.
  n=10;
  s1 = size(t1, 1);
  M = s1 - mod(s1, n);
  t2 = reshape(t1(1:M), n, []);
  y = transpose(sum(t2, 1) / n);
  % At this point, you need to move the data down to the x-axis;
  % the offset is the average 'ave'. So the model will be
  % s(x) = a1*sin(b1*x+c1) + a2*sin(b2*x+c2) + ave
  ave = mean(y);
  y=y-ave;
  % Adjust the matrix
  count = size(y,1);
  x=1:1:count;
  x=x';
  % Fit the model to 'sin2' in this example and draw the points and
  % the fitted line to compare
  figure;
  s = fit(x,y,'sin2')
  plot(s,x,y);



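In the previous section, users can use Matlab's fit function to fit the traced workload to different models, so that a set of parameters and functions can be used to profile the workload. We focus on four variables to profile the workload: 1) value size; 2) KV-pair access; 3) QPS; 4) iterator scan length.

According to our current research, the value size and the iterator scan length follow the Generalized Pareto Distribution.

The probability density function is: $f(x) = \frac{1}{\sigma}\left(1 + k\,\frac{x-\theta}{\sigma}\right)^{-1-\frac{1}{k}}$.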
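To make the Generalized Pareto Distribution concrete, here is a small Python sketch of its density, its cumulative distribution, and inverse-CDF sampling. It assumes k > 0 and is only an illustration of the formula, not the code used by db_bench; the function names are hypothetical, and the parameter values are the example value-size parameters listed in this document.

```python
import random


def gpd_pdf(x, sigma, k, theta=0.0):
    """Density f(x) of the Generalized Pareto Distribution (assumes k > 0)."""
    if x < theta:
        return 0.0
    z = (x - theta) / sigma
    return (1.0 / sigma) * (1.0 + k * z) ** (-1.0 - 1.0 / k)


def gpd_cdf(x, sigma, k, theta=0.0):
    """Cumulative probability P(X <= x) for the same distribution."""
    if x < theta:
        return 0.0
    z = (x - theta) / sigma
    return 1.0 - (1.0 + k * z) ** (-1.0 / k)


def gpd_quantile(u, sigma, k, theta=0.0):
    """Inverse CDF: the x at which the cumulative probability equals u."""
    return theta + sigma * ((1.0 - u) ** (-k) - 1.0) / k


# Draw a few synthetic value sizes by feeding uniform numbers to the quantile.
rng = random.Random(0)
samples = [gpd_quantile(rng.random(), sigma=226.409, k=0.923) for _ in range(5)]
print(samples)
```

Feeding uniform random numbers through the quantile function is the standard inverse-transform way to sample from a distribution with a closed-form CDF.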


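For KV-pair access, the probability density function is: $f(x) = ax^{b} + c$. The sine function is the best fit for the QPS: $F(x) = A\sin(Bx + C) + D$.

Example parameters:

1. Value size: $\sigma = 226.409$, $k = 0.923$, $\theta = 0$
2. KV-pair access: $a = 0.001636$, $b = -0.7094$, $c = 3.217 \times 10^{-9}$
3. QPS: $A = 147.9$, $B = 8.3 \times 10^{-5}$, $C = -1.734$, $D = 1064.2$
4. Iterator scan length: $\sigma = 1.747$, $k = 0.0819$, $\theta = 0$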
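The two remaining models are simple to evaluate directly. The Python sketch below plugs the example QPS and KV-pair access parameters into the formulas; it assumes the sine model's x is a time in seconds, which the text does not state, and the function names are hypothetical.

```python
import math


def qps_model(t, a, b, c, d):
    """Sine model of the QPS: F(t) = A*sin(B*t + C) + D."""
    return a * math.sin(b * t + c) + d


def access_model(x, a, b, c):
    """Power model of the KV-pair access: f(x) = a*x^b + c (x > 0)."""
    return a * x ** b + c


# Example QPS parameters from the text.
A, B, C, D = 147.9, 8.3e-5, -1.734, 1064.2
# One full cycle of the sine takes 2*pi/B units of x.
period = 2 * math.pi / B
qps_at_start = qps_model(0, A, B, C, D)
print(period, qps_at_start)
```

The model's QPS always stays between D - A and D + A, so A can be read as the amplitude of the load swing and D as the baseline load.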

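We developed a benchmark in db_bench called "mixgraph", which can use the four sets of parameters to generate a synthetic workload. The synthetic workload is statistically similar to the original one. Note that only workloads that fit the models used for these four variables can be used in mixgraph. For example, if the value size follows a power distribution rather than the Generalized Pareto Distribution, mixgraph cannot be used to generate the workload.

To enable mixgraph:

  ./db_bench --benchmarks="mixgraph"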


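To set the parameters of the value size distribution (Generalized Pareto Distribution):

  -value_k=<> -value_sigma=<> -value_theta=<>

To set the parameters of the KV-pair access distribution:

  -key_dist_a=<> -key_dist_b=<>

To set the parameters of the QPS (sine function):

  -sine_a=<> -sine_b=<> -sine_c=<> -sine_d=<> -sine_mix_rate_interval_milliseconds=<>

To set the parameters of the iterator scan length distribution (Generalized Pareto Distribution):

  -iter_k=<> -iter_sigma=<> -iter_theta=<>

To set the ratios of the Get, Put, and Seek queries:

  -mix_get_ratio=<> -mix_put_ratio=<> -mix_seek_ratio=<>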
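What a query ratio means can be shown with a short Python sketch that picks a query type at random according to the three ratios. This is only an illustration of ratio-weighted selection, not how db_bench implements mixgraph; the function name is hypothetical, and the ratios are the ones from the example command in this document.

```python
import random
from collections import Counter

QUERY_TYPES = ["Get", "Put", "Seek"]


def pick_query_type(rng, get_ratio, put_ratio, seek_ratio):
    """Pick one query type with probability proportional to its ratio."""
    weights = [get_ratio, put_ratio, seek_ratio]
    return rng.choices(QUERY_TYPES, weights=weights, k=1)[0]


# Simulate 10000 queries with a fixed seed so the run is repeatable.
rng = random.Random(42)
mix = Counter(pick_query_type(rng, 0.806, 0.159, 0.035) for _ in range(10000))
print(mix)
```

Over many queries, the observed share of each type approaches its configured ratio.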


To set the total number of queries:

  -reads=<>

To set the total number of keys in the key space:

  -num=<>



An example:

  ./db_bench --benchmarks="mixgraph" -value_k=0.1033 -value_sigma=39 -key_dist_a=0.002312 -key_dist_b=0.3467 \
    -sine_mix_rate_interval_milliseconds=500 -sine_a=350 -sine_b=0.0105 -sine_d=2300 -iter_k=2.517 -iter_sigma=14.236 \
    -mix_get_ratio=0.806 -mix_put_ratio=0.159 -mix_seek_ratio=0.035 -reads=1000000 -num=5000000