Installing parquet-tools from the Jar package


  • Install parquet-tools

:::info Official documentation

https://github.com/apache/parquet-mr/tree/apache-parquet-1.10.1/parquet-tools

:::

```shell
wget https://github.com/apache/parquet-mr/archive/refs/tags/apache-parquet-1.10.1.tar.gz
tar xvf apache-parquet-1.10.1.tar.gz
cd parquet-mr-apache-parquet-1.10.1/parquet-tools

# For local mode, add -Plocal so the hadoop client dependencies are bundled,
# allowing files to be inspected locally with plain java
mvn clean package -Plocal

# Hadoop mode is the default build and excludes the hadoop client dependencies
mvn clean package
```
  • Command usage

```shell
# Local mode
java -jar parquet-tools-<VERSION>.jar <command> my_parquet_file.parquet

# Hadoop mode
hadoop jar parquet-tools-<VERSION>.jar <command> my_parquet_file.parquet
```
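The two modes differ only in the launcher in front of the Jar. As a small illustration (the jar name, subcommand, and file paths are placeholders), the argument vector for either mode can be assembled like this:

```python
def parquet_tools_cmd(jar, command, path, hadoop=False, extra_args=()):
    """Build the argv for invoking parquet-tools.

    hadoop=False -> local mode:  java -jar <jar> <command> <path>
    hadoop=True  -> hadoop mode: hadoop jar <jar> <command> <path>
    """
    launcher = ["hadoop", "jar"] if hadoop else ["java", "-jar"]
    return launcher + [jar, command, *list(extra_args), path]

# Local mode (requires the -Plocal build):
print(" ".join(parquet_tools_cmd("parquet-tools-1.10.1.jar", "cat", "my_parquet_file.parquet")))
# Hadoop mode (default build; the file lives in HDFS):
print(" ".join(parquet_tools_cmd("parquet-tools-1.10.1.jar", "cat", "/user/root/my_parquet_file.parquet", hadoop=True)))
```

The list returned here could be passed to `subprocess.run` directly; it is only a sketch of the command shape, not a wrapper shipped with the tool.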

:::color4 The examples below use hadoop mode. Instead of building the Jar ourselves, we download the officially published hadoop-mode Jar.

In hadoop mode, the files to inspect must be stored in HDFS.

:::

  • Download parquet-tools

```shell
wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.10.1/parquet-tools-1.10.1.jar
```
  • View the help

```shell
# Print the full contents of a file
$ hadoop jar parquet-tools-1.10.1.jar cat -h
parquet-cat:
Prints the content of a Parquet file. The output contains only the data, no
metadata is displayed
usage: parquet-cat [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
    -j,--json      Show records in JSON format.
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout
```

```shell
# Print the first few records of a file
$ hadoop jar parquet-tools-1.10.1.jar head -h
parquet-head:
Prints the first n record of the Parquet file
usage: parquet-head [option...] <input>
where option is one of:
       --debug           Enable debug output
    -h,--help            Show this help string
    -n,--records <arg>   The number of records to show (default: 5)
       --no-color        Disable color output even if supported
where <input> is the parquet file to print to stdout
```

```shell
# Print the schema of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar schema -h
parquet-schema:
Prints the schema of Parquet file(s)
usage: parquet-schema [option...] <input>
where option is one of:
    -d,--detailed  Show detailed information about the schema.
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file containing the schema to show
```

```shell
# Print the metadata of a Parquet file, including key-value properties (similar to an Avro schema),
# compression ratios, encodings, compression used, and row group information
$ hadoop jar parquet-tools-1.10.1.jar meta -h
parquet-meta:
Prints the metadata of Parquet file(s)
usage: parquet-meta [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout
```

```shell
# Print both the contents and the metadata of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar dump -h
parquet-dump:
Prints the content and metadata of a Parquet file
usage: parquet-dump [option...] <input>
where option is one of:
    -c,--column <arg>   Dump only the given column, can be specified more than once
    -d,--disable-data   Do not dump column data
       --debug          Enable debug output
    -h,--help           Show this help string
    -m,--disable-meta   Do not dump row group and page metadata
    -n,--disable-crop   Do not crop the output based on console width
       --no-color       Disable color output even if supported
where <input> is the parquet file to print to stdout
```

```shell
# Merge multiple Parquet files into one
$ hadoop jar parquet-tools-1.10.1.jar merge -h
parquet-merge:
Merges multiple Parquet files into one. The command doesn't merge row groups,
just places one after the other. When used to merge many small files, the
resulting file will still contain small row groups, which usually leads to bad
query performance.
usage: parquet-merge [option...] <input> [<input> ...] <output>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the source parquet files/directory to be merged
      <output> is the destination parquet file
```

```shell
# Print the number of rows in a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar rowcount -h
parquet-rowcount:
Prints the count of rows in Parquet file(s)
usage: parquet-rowcount [option...] <input>
where option is one of:
    -d,--detailed  Detailed rowcount of each matching file
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to count rows to stdout
```

```shell
# Print the size of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar size -h
parquet-size:
Prints the size of Parquet file(s)
usage: parquet-size [option...] <input>
where option is one of:
    -d,--detailed       Detailed size of each matching file
       --debug          Enable debug output
    -h,--help           Show this help string
       --no-color       Disable color output even if supported
    -p,--pretty         Pretty size
    -u,--uncompressed   Uncompressed size
where <input> is the parquet file to get size & human readable size to stdout
```
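All of these commands expect a genuine Parquet file as `<input>`. When one of them fails on a file, a quick sanity check needs none of them: per the Parquet format, a file begins and ends with the 4-byte magic `PAR1`. A minimal stdlib-only sketch:

```python
def looks_like_parquet(path):
    """Return True if the file carries the Parquet 'PAR1' magic at both ends."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)              # jump to end of file
        if f.tell() < 12:         # leading magic + footer length + trailing magic, at minimum
            return False
        f.seek(-4, 2)
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

This only checks the magic bytes, so it can rule a file out but cannot prove the footer metadata is intact.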
  • View the schema

```shell
$ hadoop jar parquet-tools-1.10.1.jar schema /user/root/myparquet.parquet
2022-08-10 16:31:53,138 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
message record {
  optional binary _hoodie_commit_time (UTF8);
  optional binary _hoodie_commit_seqno (UTF8);
  optional binary _hoodie_record_key (UTF8);
  optional binary _hoodie_partition_path (UTF8);
  optional binary _hoodie_file_name (UTF8);
  required int32 status_id;
  optional binary status (UTF8);
  optional binary active (UTF8);
  optional binary status_cn (UTF8);
  optional binary status_en (UTF8);
}
```
  • View the contents

```shell
# hadoop jar parquet-tools-1.10.1.jar cat /user/root/myparquet.parquet
2022-08-10 16:32:35,883 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-08-10 16:32:36,369 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 4 records.
2022-08-10 16:32:36,369 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2022-08-10 16:32:36,412 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2022-08-10 16:32:36,413 INFO compress.CodecPool: Got brand-new decompressor [.gz]
2022-08-10 16:32:36,427 INFO hadoop.InternalParquetRecordReader: block read in memory in 57 ms. row count = 4
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_0
_hoodie_record_key = 12
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 12
status = released
active = not ceased
status_cn = 已释放
status_en = released
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_1
_hoodie_record_key = 2
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 2
status = stopped
active = not ceased
status_cn = 已关机
status_en = stopped
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_2
_hoodie_record_key = 4
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 4
status = suspended
active = not ceased
status_cn = 已挂起
status_en = suspended
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_3
_hoodie_record_key = 6
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 6
status = in-use
active = active
status_cn = 使用中
status_en = in-use
```
  • View the metadata

```shell
$ hadoop jar parquet-tools-1.10.1.jar meta /user/root/myparquet.parquet
extra:                  hoodie_min_record_key = 12
extra:                  parquet.avro.schema = {"type":"record","name":"record","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"status_id","type":"int"},{"name":"status","type":["null","string"],"default":null},{"name":"active","type":["null","string"],"default":null},{"name":"status_cn","type":["null","string"],"default":null},{"name":"status_en","type":["null","string"],"default":null}]}
extra:                  writer.model.name = avro
extra:                  hoodie_max_record_key = 6

file schema:            record
--------------------------------------------------------------------------------
_hoodie_commit_time:    OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_commit_seqno:   OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_record_key:     OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_partition_path: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_file_name:      OPTIONAL BINARY O:UTF8 R:0 D:1
status_id:              REQUIRED INT32 R:0 D:0
status:                 OPTIONAL BINARY O:UTF8 R:0 D:1
active:                 OPTIONAL BINARY O:UTF8 R:0 D:1
status_cn:              OPTIONAL BINARY O:UTF8 R:0 D:1
status_en:              OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1:            RC:4 TS:775 OFFSET:4
--------------------------------------------------------------------------------
_hoodie_commit_time:    BINARY GZIP DO:4 FPO:63 SZ:110/70/0.64 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 20220804160725319, max: 20220804160725319, num_nulls: 0]
_hoodie_commit_seqno:   BINARY GZIP DO:0 FPO:114 SZ:83/130/1.57 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 20220804160725319_0_0, max: 20220804160725319_0_3, num_nulls: 0]
_hoodie_record_key:     BINARY GZIP DO:0 FPO:197 SZ:61/50/0.82 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 12, max: 6, num_nulls: 0]
_hoodie_partition_path: BINARY GZIP DO:258 FPO:301 SZ:94/54/0.57 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: , num_nulls: 0]
_hoodie_file_name:      BINARY GZIP DO:352 FPO:456 SZ:155/124/0.80 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet, max: 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet, num_nulls: 0]
status_id:              INT32 GZIP DO:0 FPO:507 SZ:54/39/0.72 VC:4 ENC:BIT_PACKED,PLAIN ST:[min: 2, max: 12, num_nulls: 0]
status:                 BINARY GZIP DO:0 FPO:561 SZ:90/76/0.84 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: in-use, max: suspended, num_nulls: 0]
active:                 BINARY GZIP DO:651 FPO:711 SZ:112/74/0.66 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: active, max: not ceased, num_nulls: 0]
status_cn:              BINARY GZIP DO:0 FPO:763 SZ:91/82/0.90 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 使用中, max: 已释放, num_nulls: 0]
status_en:              BINARY GZIP DO:0 FPO:854 SZ:90/76/0.84 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: in-use, max: suspended, num_nulls: 0]
```
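Everything `meta` prints comes from the file footer. Per the Parquet format, the last 8 bytes of a file are a 4-byte little-endian length of the serialized footer metadata followed by the `PAR1` magic, so the footer's size can be recovered with the stdlib alone (a sketch, not a full footer parser):

```python
import struct

def parquet_footer_size(path):
    """Return the byte length of a Parquet file's serialized footer metadata.

    Layout at end of file: <footer bytes> <4-byte LE footer length> PAR1
    """
    with open(path, "rb") as f:
        f.seek(-8, 2)                         # last 8 bytes: length + magic
        length_bytes, magic = f.read(4), f.read(4)
    if magic != b"PAR1":
        raise ValueError("not a Parquet file: trailing magic missing")
    return struct.unpack("<I", length_bytes)[0]
```

Decoding the footer itself would additionally require a Thrift (compact protocol) parser, which is what the tools above bundle.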

Installing parquet-tools via pip

  • Install parquet-tools via pip (Ubuntu 20)

```shell
python3 -m pip install parquet-tools
```
  • View the help

```shell
# parquet-tools --help
usage: parquet-tools [-h] {show,csv,inspect} ...

parquet CLI tools

positional arguments:
  {show,csv,inspect}
    show              Show human readable format. see `show -h`
    csv               Cat csv style. see `csv -h`
    inspect           Inspect parquet file. see `inspect -h`

optional arguments:
  -h, --help          show this help message and exit
```
  • View the schema

```shell
# parquet-tools inspect e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet

############ file meta data ############
created_by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
num_columns: 10
num_rows: 4
num_row_groups: 1
format_version: 1.0
serialized_size: 433901

############ Columns ############
_hoodie_commit_time
_hoodie_commit_seqno
_hoodie_record_key
_hoodie_partition_path
_hoodie_file_name
status_id
status
active
status_cn
status_en

############ Column(_hoodie_commit_time) ############
name: _hoodie_commit_time
path: _hoodie_commit_time
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -57%)

############ Column(_hoodie_commit_seqno) ############
name: _hoodie_commit_seqno
path: _hoodie_commit_seqno
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 36%)

############ Column(_hoodie_record_key) ############
name: _hoodie_record_key
path: _hoodie_record_key
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -20%)

############ Column(_hoodie_partition_path) ############
name: _hoodie_partition_path
path: _hoodie_partition_path
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -74%)

############ Column(_hoodie_file_name) ############
name: _hoodie_file_name
path: _hoodie_file_name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -25%)

############ Column(status_id) ############
name: status_id
path: status_id
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: -38%)

############ Column(status) ############
name: status
path: status
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -13%)

############ Column(active) ############
name: active
path: active
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -51%)

############ Column(status_cn) ############
name: status_cn
path: status_cn
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -9%)

############ Column(status_en) ############
name: status_en
path: status_en
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -13%)
```
  • View the contents

```shell
# parquet-tools show e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet
+-----------------------+------------------------+----------------------+--------------------------+----------------------------------------------------------------------+-------------+------------+------------+-------------+-------------+
| _hoodie_commit_time   | _hoodie_commit_seqno   | _hoodie_record_key   | _hoodie_partition_path   | _hoodie_file_name                                                      | status_id   | status     | active     | status_cn   | status_en   |
|-----------------------+------------------------+----------------------+--------------------------+----------------------------------------------------------------------+-------------+------------+------------+-------------+-------------|
| 20220808184420366     | 20220808184420366_1_0  | 11                   |                          | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 11          | associated | active     | 已分配      | associated  |
| 20220808184420366     | 20220808184420366_1_1  | 1                    |                          | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 1           | running    | active     | 运行中      | running     |
| 20220808184420366     | 20220808184420366_1_2  | 3                    |                          | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 3           | terminated | not ceased | 已删除      | terminated  |
| 20220808184420366     | 20220808184420366_1_3  | 10                   |                          | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 10          | poweroffed | not ceased | 已断电      | poweroffed  |
+-----------------------+------------------------+----------------------+--------------------------+----------------------------------------------------------------------+-------------+------------+------------+-------------+-------------+
```