Installing parquet-tools via the Jar package
:::color4
parquet-mr project repository
parquet-tools is a subproject of parquet-mr; project path (only versions 1.10.1 and earlier include this module):
https://github.com/apache/parquet-mr/tree/apache-parquet-1.10.1/parquet-tools
:::
- Build parquet-tools from source
:::info Official documentation
https://github.com/apache/parquet-mr/tree/apache-parquet-1.10.1/parquet-tools
:::
wget https://github.com/apache/parquet-mr/archive/refs/tags/apache-parquet-1.10.1.tar.gz
tar xvf apache-parquet-1.10.1.tar.gz
cd parquet-mr-apache-parquet-1.10.1/parquet-tools
# For local mode, build with -Plocal so the hadoop client dependencies are bundled; the file can then be inspected locally with plain java
mvn clean package -Plocal
# The default build targets hadoop mode and excludes the hadoop client dependencies
mvn clean package
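With the standard Maven layout, the built jar should land under target/ inside the module directory (the exact path is an assumption; adjust to your checkout):
# Smoke-test a locally built (-Plocal) jar
java -jar target/parquet-tools-1.10.1.jar cat -h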
- Command usage
# Local mode
java -jar parquet-tools-<VERSION>.jar <command> my_parquet_file.parquet
# Hadoop mode
hadoop jar parquet-tools-<VERSION>.jar <command> my_parquet_file.parquet
:::color4 The examples below use hadoop mode; rather than compiling the Jar ourselves, we download the officially published hadoop-mode build.
In hadoop mode, the file to inspect must be stored on HDFS (a quick upload sketch follows).
:::
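Since hadoop mode reads from HDFS, upload the file there first. A minimal sketch using the standard HDFS shell (file and path names are placeholders):
# Upload a local parquet file to HDFS and verify it arrived
hdfs dfs -put myparquet.parquet /user/root/
hdfs dfs -ls /user/root/myparquet.parquet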
- Download parquet-tools
wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.10.1/parquet-tools-1.10.1.jar
- View the help
# Print the full content of a file
$ hadoop jar parquet-tools-1.10.1.jar cat -h
parquet-cat:
Prints the content of a Parquet file. The output contains only the data, no
metadata is displayed
usage: parquet-cat [option...] <input>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
-j,--json Show records in JSON format.
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
# Print the first few records of a file
$ hadoop jar parquet-tools-1.10.1.jar head -h
parquet-head:
Prints the first n record of the Parquet file
usage: parquet-head [option...] <input>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
-n,--records <arg> The number of records to show (default: 5)
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
# Print the schema of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar schema -h
parquet-schema:
Prints the schema of Parquet file(s)
usage: parquet-schema [option...] <input>
where option is one of:
-d,--detailed Show detailed information about the schema.
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the parquet file containing the schema to show
# Print the metadata of a Parquet file, including key-value properties (such as an embedded Avro schema),
# compression ratios, encodings, compression used, and row group information
$ hadoop jar parquet-tools-1.10.1.jar meta -h
parquet-meta:
Prints the metadata of Parquet file(s)
usage: parquet-meta [option...] <input>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
# Print the content and metadata of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar dump -h
parquet-dump:
Prints the content and metadata of a Parquet file
usage: parquet-dump [option...] <input>
where option is one of:
-c,--column <arg> Dump only the given column, can be specified more than
once
-d,--disable-data Do not dump column data
--debug Enable debug output
-h,--help Show this help string
-m,--disable-meta Do not dump row group and page metadata
-n,--disable-crop Do not crop the output based on console width
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
# Merge multiple Parquet files into one
$ hadoop jar parquet-tools-1.10.1.jar merge -h
parquet-merge:
Merges multiple Parquet files into one. The command doesn't merge row groups,
just places one after the other. When used to merge many small files, the
resulting file will still contain small row groups, which usually leads to bad
query performance.
usage: parquet-merge [option...] <input> [<input> ...] <output>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the source parquet files/directory to be merged
<output> is the destination parquet file
# Print the number of rows in a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar rowcount -h
parquet-rowcount:
Prints the count of rows in Parquet file(s)
usage: parquet-rowcount [option...] <input>
where option is one of:
-d,--detailed Detailed rowcount of each matching file
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the parquet file to count rows to stdout
# Print the size of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar size -h
parquet-size:
Prints the size of Parquet file(s)
usage: parquet-size [option...] <input>
where option is one of:
-d,--detailed Detailed size of each matching file
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
-p,--pretty Pretty size
-u,--uncompressed Uncompressed size
where <input> is the parquet file to get size & human readable size to stdout
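Putting the help output above to work, a few typical invocations look like this (flags exactly as documented above; the HDFS path matches the example file used below):
# Show only the first 2 records
hadoop jar parquet-tools-1.10.1.jar head -n 2 /user/root/myparquet.parquet
# Count rows
hadoop jar parquet-tools-1.10.1.jar rowcount /user/root/myparquet.parquet
# Print compressed and uncompressed sizes in human-readable form
hadoop jar parquet-tools-1.10.1.jar size -p -u /user/root/myparquet.parquet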
- View the schema
$ hadoop jar parquet-tools-1.10.1.jar schema /user/root/myparquet.parquet
2022-08-10 16:31:53,138 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
message record {
optional binary _hoodie_commit_time (UTF8);
optional binary _hoodie_commit_seqno (UTF8);
optional binary _hoodie_record_key (UTF8);
optional binary _hoodie_partition_path (UTF8);
optional binary _hoodie_file_name (UTF8);
required int32 status_id;
optional binary status (UTF8);
optional binary active (UTF8);
optional binary status_cn (UTF8);
optional binary status_en (UTF8);
}
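The _hoodie_* fields are metadata columns that Apache Hudi adds to every record it writes; the remaining fields (status_id through status_en) are the table's own columns.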
- View the content
$ hadoop jar parquet-tools-1.10.1.jar cat /user/root/myparquet.parquet
2022-08-10 16:32:35,883 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-08-10 16:32:36,369 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 4 records.
2022-08-10 16:32:36,369 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2022-08-10 16:32:36,412 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2022-08-10 16:32:36,413 INFO compress.CodecPool: Got brand-new decompressor [.gz]
2022-08-10 16:32:36,427 INFO hadoop.InternalParquetRecordReader: block read in memory in 57 ms. row count = 4
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_0
_hoodie_record_key = 12
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 12
status = released
active = not ceased
status_cn = 已释放
status_en = released
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_1
_hoodie_record_key = 2
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 2
status = stopped
active = not ceased
status_cn = 已关机
status_en = stopped
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_2
_hoodie_record_key = 4
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 4
status = suspended
active = not ceased
status_cn = 已挂起
status_en = suspended
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_3
_hoodie_record_key = 6
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 6
status = in-use
active = active
status_cn = 使用中
status_en = in-use
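The -j flag documented above prints the same records as JSON instead, which is easier to pipe into other tools:
# Print records in JSON format
hadoop jar parquet-tools-1.10.1.jar cat -j /user/root/myparquet.parquet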
- View the metadata
$ hadoop jar parquet-tools-1.10.1.jar meta /user/root/myparquet.parquet
extra: hoodie_min_record_key = 12
extra: parquet.avro.schema = {"type":"record","name":"record","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"status_id","type":"int"},{"name":"status","type":["null","string"],"default":null},{"name":"active","type":["null","string"],"default":null},{"name":"status_cn","type":["null","string"],"default":null},{"name":"status_en","type":["null","string"],"default":null}]}
extra: writer.model.name = avro
extra: hoodie_max_record_key = 6
file schema: record
--------------------------------------------------------------------------------
_hoodie_commit_time: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_commit_seqno: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_record_key: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_partition_path: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_file_name: OPTIONAL BINARY O:UTF8 R:0 D:1
status_id: REQUIRED INT32 R:0 D:0
status: OPTIONAL BINARY O:UTF8 R:0 D:1
active: OPTIONAL BINARY O:UTF8 R:0 D:1
status_cn: OPTIONAL BINARY O:UTF8 R:0 D:1
status_en: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:4 TS:775 OFFSET:4
--------------------------------------------------------------------------------
_hoodie_commit_time: BINARY GZIP DO:4 FPO:63 SZ:110/70/0.64 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 20220804160725319, max: 20220804160725319, num_nulls: 0]
_hoodie_commit_seqno: BINARY GZIP DO:0 FPO:114 SZ:83/130/1.57 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 20220804160725319_0_0, max: 20220804160725319_0_3, num_nulls: 0]
_hoodie_record_key: BINARY GZIP DO:0 FPO:197 SZ:61/50/0.82 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 12, max: 6, num_nulls: 0]
_hoodie_partition_path: BINARY GZIP DO:258 FPO:301 SZ:94/54/0.57 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: , num_nulls: 0]
_hoodie_file_name: BINARY GZIP DO:352 FPO:456 SZ:155/124/0.80 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet, max: 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet, num_nulls: 0]
status_id: INT32 GZIP DO:0 FPO:507 SZ:54/39/0.72 VC:4 ENC:BIT_PACKED,PLAIN ST:[min: 2, max: 12, num_nulls: 0]
status: BINARY GZIP DO:0 FPO:561 SZ:90/76/0.84 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: in-use, max: suspended, num_nulls: 0]
active: BINARY GZIP DO:651 FPO:711 SZ:112/74/0.66 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: active, max: not ceased, num_nulls: 0]
status_cn: BINARY GZIP DO:0 FPO:763 SZ:91/82/0.90 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 使用中, max: 已释放, num_nulls: 0]
status_en: BINARY GZIP DO:0 FPO:854 SZ:90/76/0.84 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: in-use, max: suspended, num_nulls: 0]
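A rough key to the per-column lines above, inferred from the numbers in this output: RC = row count, TS = total uncompressed size, DO = dictionary page offset, FPO = first data page offset, SZ = compressed size/uncompressed size/ratio, VC = value count, ENC = encodings used, ST = statistics (min, max, num_nulls). For example, SZ:83/130/1.57 on _hoodie_commit_seqno reads as 130 bytes of raw data gzipped down to 83.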
Installing parquet-tools via pip
- Install parquet-tools via pip (Ubuntu 20)
python3 -m pip install parquet-tools
- View the help
$ parquet-tools --help
usage: parquet-tools [-h] {show,csv,inspect} ...
parquet CLI tools
positional arguments:
{show,csv,inspect}
show Show human readable format. see `show -h`
csv Cat csv style. see `csv -h`
inspect Inspect parquet file. see `inspect -h`
optional arguments:
-h, --help show this help message and exit
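Besides show and inspect, the csv subcommand listed above dumps the records in CSV form; for example (the output file name is just a placeholder):
# Dump records as CSV
parquet-tools csv e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet > status.csv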
- View the schema
$ parquet-tools inspect e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet
############ file meta data ############
created_by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
num_columns: 10
num_rows: 4
num_row_groups: 1
format_version: 1.0
serialized_size: 433901
############ Columns ############
_hoodie_commit_time
_hoodie_commit_seqno
_hoodie_record_key
_hoodie_partition_path
_hoodie_file_name
status_id
status
active
status_cn
status_en
############ Column(_hoodie_commit_time) ############
name: _hoodie_commit_time
path: _hoodie_commit_time
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -57%)
############ Column(_hoodie_commit_seqno) ############
name: _hoodie_commit_seqno
path: _hoodie_commit_seqno
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 36%)
############ Column(_hoodie_record_key) ############
name: _hoodie_record_key
path: _hoodie_record_key
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -20%)
############ Column(_hoodie_partition_path) ############
name: _hoodie_partition_path
path: _hoodie_partition_path
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -74%)
############ Column(_hoodie_file_name) ############
name: _hoodie_file_name
path: _hoodie_file_name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -25%)
############ Column(status_id) ############
name: status_id
path: status_id
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: -38%)
############ Column(status) ############
name: status
path: status
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -13%)
############ Column(active) ############
name: active
path: active
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -51%)
############ Column(status_cn) ############
name: status_cn
path: status_cn
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -9%)
############ Column(status_en) ############
name: status_en
path: status_en
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -13%)
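Note the negative space_saved figures: with only 4 rows per column, the gzip codec overhead outweighs any savings, so most columns are larger compressed than uncompressed. The hadoop-mode meta output above shows the same effect in its SZ fields (e.g. 110 bytes compressed vs 70 raw for _hoodie_commit_time).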
- View the content
$ parquet-tools show e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet
+-----------------------+------------------------+----------------------+--------------------------+----------------------------------------------------------------------+-------------+------------+------------+-------------+-------------+
| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | status_id | status | active | status_cn | status_en |
|-----------------------+------------------------+----------------------+--------------------------+----------------------------------------------------------------------+-------------+------------+------------+-------------+-------------|
| 20220808184420366 | 20220808184420366_1_0 | 11 | | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet | 11 | associated | active | 已分配 | associated |
| 20220808184420366 | 20220808184420366_1_1 | 1 | | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet | 1 | running | active | 运行中 | running |
| 20220808184420366 | 20220808184420366_1_2 | 3 | | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet | 3 | terminated | not ceased | 已删除 | terminated |
| 20220808184420366 | 20220808184420366_1_3 | 10 | | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet | 10 | poweroffed | not ceased | 已断电 | poweroffed |
+-----------------------+------------------------+----------------------+--------------------------+----------------------------------------------------------------------+-------------+------------+------------+-------------+-------------+