Installing parquet-tools via the Jar package
:::color4
parquet-mr project repository
parquet-tools is a submodule of parquet-mr; the module path below only exists in versions 1.10.1 and earlier
https://github.com/apache/parquet-mr/tree/apache-parquet-1.10.1/parquet-tools
:::
- Install parquet-tools
:::info Official documentation
https://github.com/apache/parquet-mr/tree/apache-parquet-1.10.1/parquet-tools
:::
wget https://github.com/apache/parquet-mr/archive/refs/tags/apache-parquet-1.10.1.tar.gz
tar xvf apache-parquet-1.10.1.tar.gz
cd parquet-mr-apache-parquet-1.10.1/parquet-tools
# For local mode, build with -Plocal so the hadoop client dependencies are bundled and files can be read locally with plain java
mvn clean package -Plocal
# The default build targets hadoop mode and excludes the hadoop client dependencies
mvn clean package
- Command usage
# Local mode
java -jar parquet-tools-<VERSION>.jar <command> my_parquet_file.parquet
# Hadoop mode
hadoop jar parquet-tools-<VERSION>.jar <command> my_parquet_file.parquet
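As a concrete sketch of the two invocation styles (the jar version, subcommand, and file paths here are illustrative; in hadoop mode the path refers to a file on HDFS):
# Local mode: jar built with -Plocal, reading a file on the local filesystem
java -jar parquet-tools-1.10.1.jar schema my_parquet_file.parquet
# Hadoop mode: the same subcommand against a file stored in HDFS
hadoop jar parquet-tools-1.10.1.jar schema /user/root/my_parquet_file.parquet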
:::color4 The examples below use hadoop mode; instead of building the jar ourselves, we download the officially compiled hadoop-mode Jar.
In hadoop mode the file must be stored in HDFS.
:::
- Download parquet-tools
wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.10.1/parquet-tools-1.10.1.jar
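Since hadoop mode reads from HDFS, a locally generated Parquet file has to be uploaded first. A minimal sketch, assuming a local file myparquet.parquet and the HDFS path used in the examples below:
# Upload the local Parquet file to HDFS so parquet-tools (hadoop mode) can read it
hdfs dfs -put myparquet.parquet /user/root/myparquet.parquet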
- View the help
# Print the entire content of the file
$ hadoop jar parquet-tools-1.10.1.jar cat -h
parquet-cat:
Prints the content of a Parquet file. The output contains only the data, no
metadata is displayed
usage: parquet-cat [option...] <input>
where option is one of:
    --debug          Enable debug output
 -h,--help           Show this help string
 -j,--json           Show records in JSON format.
    --no-color       Disable color output even if supported
where <input> is the parquet file to print to stdout
# Print the first few records of the file
$ hadoop jar parquet-tools-1.10.1.jar head -h
parquet-head:
Prints the first n record of the Parquet file
usage: parquet-head [option...] <input>
where option is one of:
    --debug             Enable debug output
 -h,--help              Show this help string
 -n,--records <arg>     The number of records to show (default: 5)
    --no-color          Disable color output even if supported
where <input> is the parquet file to print to stdout
# Print the schema of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar schema -h
parquet-schema:
Prints the schema of Parquet file(s)
usage: parquet-schema [option...] <input>
where option is one of:
 -d,--detailed       Show detailed information about the schema.
    --debug          Enable debug output
 -h,--help           Show this help string
    --no-color       Disable color output even if supported
where <input> is the parquet file containing the schema to show
# Print the metadata of a Parquet file, including key-value properties (such as the Avro schema),
# compression ratios, encodings, compression used, and row group information
$ hadoop jar parquet-tools-1.10.1.jar meta -h
parquet-meta:
Prints the metadata of Parquet file(s)
usage: parquet-meta [option...] <input>
where option is one of:
    --debug          Enable debug output
 -h,--help           Show this help string
    --no-color       Disable color output even if supported
where <input> is the parquet file to print to stdout
# Print the content and metadata of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar dump -h
parquet-dump:
Prints the content and metadata of a Parquet file
usage: parquet-dump [option...] <input>
where option is one of:
 -c,--column <arg>      Dump only the given column, can be specified more than
                        once
 -d,--disable-data      Do not dump column data
    --debug             Enable debug output
 -h,--help              Show this help string
 -m,--disable-meta      Do not dump row group and page metadata
 -n,--disable-crop      Do not crop the output based on console width
    --no-color          Disable color output even if supported
where <input> is the parquet file to print to stdout
# Merge multiple Parquet files into one
$ hadoop jar parquet-tools-1.10.1.jar merge -h
parquet-merge:
Merges multiple Parquet files into one. The command doesn't merge row groups,
just places one after the other. When used to merge many small files, the
resulting file will still contain small row groups, which usually leads to bad
query performance.
usage: parquet-merge [option...] <input> [<input> ...] <output>
where option is one of:
    --debug          Enable debug output
 -h,--help           Show this help string
    --no-color       Disable color output even if supported
where <input> is the source parquet files/directory to be merged
      <output> is the destination parquet file
# Print the number of rows in a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar rowcount -h
parquet-rowcount:
Prints the count of rows in Parquet file(s)
usage: parquet-rowcount [option...] <input>
where option is one of:
 -d,--detailed       Detailed rowcount of each matching file
    --debug          Enable debug output
 -h,--help           Show this help string
    --no-color       Disable color output even if supported
where <input> is the parquet file to count rows to stdout
# Print the size of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar size -h
parquet-size:
Prints the size of Parquet file(s)
usage: parquet-size [option...] <input>
where option is one of:
 -d,--detailed          Detailed size of each matching file
    --debug             Enable debug output
 -h,--help              Show this help string
    --no-color          Disable color output even if supported
 -p,--pretty            Pretty size
 -u,--uncompressed      Uncompressed size
where <input> is the parquet file to get size & human readable size to stdout
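As a quick usage sketch of the simpler subcommands against the sample file used in the examples below (output omitted; the flags are the ones listed in the help text above):
# Count the rows in the file
hadoop jar parquet-tools-1.10.1.jar rowcount /user/root/myparquet.parquet
# Show the compressed size in human readable form; add -u for the uncompressed size
hadoop jar parquet-tools-1.10.1.jar size -p /user/root/myparquet.parquet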
- View the schema
$ hadoop jar parquet-tools-1.10.1.jar schema /user/root/myparquet.parquet
2022-08-10 16:31:53,138 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
message record {
  optional binary _hoodie_commit_time (UTF8);
  optional binary _hoodie_commit_seqno (UTF8);
  optional binary _hoodie_record_key (UTF8);
  optional binary _hoodie_partition_path (UTF8);
  optional binary _hoodie_file_name (UTF8);
  required int32 status_id;
  optional binary status (UTF8);
  optional binary active (UTF8);
  optional binary status_cn (UTF8);
  optional binary status_en (UTF8);
}
- View the content
# hadoop jar parquet-tools-1.10.1.jar cat /user/root/myparquet.parquet
2022-08-10 16:32:35,883 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-08-10 16:32:36,369 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 4 records.
2022-08-10 16:32:36,369 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2022-08-10 16:32:36,412 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2022-08-10 16:32:36,413 INFO compress.CodecPool: Got brand-new decompressor [.gz]
2022-08-10 16:32:36,427 INFO hadoop.InternalParquetRecordReader: block read in memory in 57 ms. row count = 4
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_0
_hoodie_record_key = 12
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 12
status = released
active = not ceased
status_cn = 已释放
status_en = released

_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_1
_hoodie_record_key = 2
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 2
status = stopped
active = not ceased
status_cn = 已关机
status_en = stopped

_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_2
_hoodie_record_key = 4
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 4
status = suspended
active = not ceased
status_cn = 已挂起
status_en = suspended

_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_3
_hoodie_record_key = 6
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 6
status = in-use
active = active
status_cn = 使用中
status_en = in-use
- View the metadata
$ hadoop jar parquet-tools-1.10.1.jar meta /user/root/myparquet.parquet
extra: hoodie_min_record_key = 12
extra: parquet.avro.schema = {"type":"record","name":"record","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"status_id","type":"int"},{"name":"status","type":["null","string"],"default":null},{"name":"active","type":["null","string"],"default":null},{"name":"status_cn","type":["null","string"],"default":null},{"name":"status_en","type":["null","string"],"default":null}]}
extra: writer.model.name = avro
extra: hoodie_max_record_key = 6

file schema: record
--------------------------------------------------------------------------------
_hoodie_commit_time: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_commit_seqno: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_record_key: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_partition_path: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_file_name: OPTIONAL BINARY O:UTF8 R:0 D:1
status_id: REQUIRED INT32 R:0 D:0
status: OPTIONAL BINARY O:UTF8 R:0 D:1
active: OPTIONAL BINARY O:UTF8 R:0 D:1
status_cn: OPTIONAL BINARY O:UTF8 R:0 D:1
status_en: OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:4 TS:775 OFFSET:4
--------------------------------------------------------------------------------
_hoodie_commit_time: BINARY GZIP DO:4 FPO:63 SZ:110/70/0.64 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 20220804160725319, max: 20220804160725319, num_nulls: 0]
_hoodie_commit_seqno: BINARY GZIP DO:0 FPO:114 SZ:83/130/1.57 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 20220804160725319_0_0, max: 20220804160725319_0_3, num_nulls: 0]
_hoodie_record_key: BINARY GZIP DO:0 FPO:197 SZ:61/50/0.82 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 12, max: 6, num_nulls: 0]
_hoodie_partition_path: BINARY GZIP DO:258 FPO:301 SZ:94/54/0.57 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: , num_nulls: 0]
_hoodie_file_name: BINARY GZIP DO:352 FPO:456 SZ:155/124/0.80 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet, max: 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet, num_nulls: 0]
status_id: INT32 GZIP DO:0 FPO:507 SZ:54/39/0.72 VC:4 ENC:BIT_PACKED,PLAIN ST:[min: 2, max: 12, num_nulls: 0]
status: BINARY GZIP DO:0 FPO:561 SZ:90/76/0.84 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: in-use, max: suspended, num_nulls: 0]
active: BINARY GZIP DO:651 FPO:711 SZ:112/74/0.66 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: active, max: not ceased, num_nulls: 0]
status_cn: BINARY GZIP DO:0 FPO:763 SZ:91/82/0.90 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 使用中, max: 已释放, num_nulls: 0]
status_en: BINARY GZIP DO:0 FPO:854 SZ:90/76/0.84 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: in-use, max: suspended, num_nulls: 0]
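The head and dump subcommands follow the same pattern. A minimal sketch using the flags from the help output above (commands only, output omitted):
# Print only the first 2 records
hadoop jar parquet-tools-1.10.1.jar head -n 2 /user/root/myparquet.parquet
# Dump the row group and page metadata of a single column, without the column data
hadoop jar parquet-tools-1.10.1.jar dump -d -c status_id /user/root/myparquet.parquet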
Installing parquet-tools via pip
- Install parquet-tools via pip (Ubuntu 20)
python3 -m pip install parquet-tools
- View the help
# parquet-tools --help
usage: parquet-tools [-h] {show,csv,inspect} ...

parquet CLI tools

positional arguments:
  {show,csv,inspect}
    show              Show human readable format. see `show -h`
    csv               Cat csv style. see `csv -h`
    inspect           Inspect parquet file. see `inspect -h`

optional arguments:
  -h, --help          show this help message and exit
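Besides show and inspect, the csv subcommand prints the data in CSV form, which can be redirected to a file. A minimal sketch (the output file name status.csv is hypothetical):
# Export the Parquet content as CSV
parquet-tools csv e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet > status.csv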
- View the schema
# parquet-tools inspect e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet

############ file meta data ############
created_by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
num_columns: 10
num_rows: 4
num_row_groups: 1
format_version: 1.0
serialized_size: 433901


############ Columns ############
_hoodie_commit_time
_hoodie_commit_seqno
_hoodie_record_key
_hoodie_partition_path
_hoodie_file_name
status_id
status
active
status_cn
status_en

############ Column(_hoodie_commit_time) ############
name: _hoodie_commit_time
path: _hoodie_commit_time
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -57%)

############ Column(_hoodie_commit_seqno) ############
name: _hoodie_commit_seqno
path: _hoodie_commit_seqno
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 36%)

############ Column(_hoodie_record_key) ############
name: _hoodie_record_key
path: _hoodie_record_key
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -20%)

############ Column(_hoodie_partition_path) ############
name: _hoodie_partition_path
path: _hoodie_partition_path
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -74%)

############ Column(_hoodie_file_name) ############
name: _hoodie_file_name
path: _hoodie_file_name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -25%)

############ Column(status_id) ############
name: status_id
path: status_id
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: -38%)

############ Column(status) ############
name: status
path: status
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -13%)

############ Column(active) ############
name: active
path: active
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -51%)

############ Column(status_cn) ############
name: status_cn
path: status_cn
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -9%)

############ Column(status_en) ############
name: status_en
path: status_en
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -13%)
- View the content
# parquet-tools show e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet
+---------------------+-----------------------+--------------------+------------------------+-----------------------------------------------------------------------+-----------+------------+------------+-----------+------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                                     | status_id | status     | active     | status_cn | status_en  |
|---------------------+-----------------------+--------------------+------------------------+-----------------------------------------------------------------------+-----------+------------+------------+-----------+------------|
| 20220808184420366   | 20220808184420366_1_0 | 11                 |                        | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 11        | associated | active     | 已分配    | associated |
| 20220808184420366   | 20220808184420366_1_1 | 1                  |                        | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 1         | running    | active     | 运行中    | running    |
| 20220808184420366   | 20220808184420366_1_2 | 3                  |                        | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 3         | terminated | not ceased | 已删除    | terminated |
| 20220808184420366   | 20220808184420366_1_3 | 10                 |                        | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 10        | poweroffed | not ceased | 已断电    | poweroffed |
+---------------------+-----------------------+--------------------+------------------------+-----------------------------------------------------------------------+-----------+------------+------------+-----------+------------+
