Installing parquet-tools from the Jar package


  • Install parquet-tools

:::info Official documentation

https://github.com/apache/parquet-mr/tree/apache-parquet-1.10.1/parquet-tools

:::

```shell
wget https://github.com/apache/parquet-mr/archive/refs/tags/apache-parquet-1.10.1.tar.gz
tar xvf apache-parquet-1.10.1.tar.gz
cd parquet-mr-apache-parquet-1.10.1/parquet-tools

# For local mode, add -Plocal so the hadoop client dependencies are bundled,
# allowing files to be inspected locally with plain java
mvn clean package -Plocal

# Hadoop mode is the default build and excludes the hadoop client dependencies
mvn clean package
```
  • Command usage

```shell
# Local mode
java -jar parquet-tools-<VERSION>.jar <command> my_parquet_file.parquet

# Hadoop mode
hadoop jar parquet-tools-<VERSION>.jar <command> my_parquet_file.parquet
```
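The two modes differ only in the launcher in front of the Jar. As a small illustration (the jar name, subcommand, and file paths are placeholders), the argument vector for either mode can be assembled like this:

```python
def parquet_tools_cmd(jar, command, path, hadoop=False, extra_args=()):
    """Build the argv for invoking parquet-tools.

    hadoop=False -> local mode:  java -jar <jar> <command> <path>
    hadoop=True  -> hadoop mode: hadoop jar <jar> <command> <path>
    """
    launcher = ["hadoop", "jar"] if hadoop else ["java", "-jar"]
    return launcher + [jar, command, *list(extra_args), path]

# Local mode (requires the -Plocal build):
print(" ".join(parquet_tools_cmd("parquet-tools-1.10.1.jar", "cat", "my_parquet_file.parquet")))
# Hadoop mode (default build; the file lives in HDFS):
print(" ".join(parquet_tools_cmd("parquet-tools-1.10.1.jar", "cat", "/user/root/my_parquet_file.parquet", hadoop=True)))
```

The list returned here could be passed to `subprocess.run` directly; it is only a sketch of the command shape, not a wrapper shipped with the tool.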

:::color4 The examples below use hadoop mode. Instead of building the Jar ourselves, we download the officially published hadoop-mode Jar.

In hadoop mode, the files to inspect must be stored in HDFS.

:::

  • Download parquet-tools

```shell
wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.10.1/parquet-tools-1.10.1.jar
```
  • View the help

```shell
# Print the full contents of a file
$ hadoop jar parquet-tools-1.10.1.jar cat -h
parquet-cat:
Prints the content of a Parquet file. The output contains only the data, no
metadata is displayed
usage: parquet-cat [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
    -j,--json      Show records in JSON format.
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout
```

```shell
# Print the first few records of a file
$ hadoop jar parquet-tools-1.10.1.jar head -h
parquet-head:
Prints the first n record of the Parquet file
usage: parquet-head [option...] <input>
where option is one of:
       --debug           Enable debug output
    -h,--help            Show this help string
    -n,--records <arg>   The number of records to show (default: 5)
       --no-color        Disable color output even if supported
where <input> is the parquet file to print to stdout
```

```shell
# Print the schema of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar schema -h
parquet-schema:
Prints the schema of Parquet file(s)
usage: parquet-schema [option...] <input>
where option is one of:
    -d,--detailed  Show detailed information about the schema.
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file containing the schema to show
```

```shell
# Print the metadata of a Parquet file, including key-value properties (similar to an Avro schema),
# compression ratios, encodings, compression used, and row group information
$ hadoop jar parquet-tools-1.10.1.jar meta -h
parquet-meta:
Prints the metadata of Parquet file(s)
usage: parquet-meta [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout
```

```shell
# Print both the contents and the metadata of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar dump -h
parquet-dump:
Prints the content and metadata of a Parquet file
usage: parquet-dump [option...] <input>
where option is one of:
    -c,--column <arg>   Dump only the given column, can be specified more than once
    -d,--disable-data   Do not dump column data
       --debug          Enable debug output
    -h,--help           Show this help string
    -m,--disable-meta   Do not dump row group and page metadata
    -n,--disable-crop   Do not crop the output based on console width
       --no-color       Disable color output even if supported
where <input> is the parquet file to print to stdout
```

```shell
# Merge multiple Parquet files into one
$ hadoop jar parquet-tools-1.10.1.jar merge -h
parquet-merge:
Merges multiple Parquet files into one. The command doesn't merge row groups,
just places one after the other. When used to merge many small files, the
resulting file will still contain small row groups, which usually leads to bad
query performance.
usage: parquet-merge [option...] <input> [<input> ...] <output>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the source parquet files/directory to be merged
      <output> is the destination parquet file
```

```shell
# Print the number of rows in a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar rowcount -h
parquet-rowcount:
Prints the count of rows in Parquet file(s)
usage: parquet-rowcount [option...] <input>
where option is one of:
    -d,--detailed  Detailed rowcount of each matching file
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to count rows to stdout
```

```shell
# Print the size of a Parquet file
$ hadoop jar parquet-tools-1.10.1.jar size -h
parquet-size:
Prints the size of Parquet file(s)
usage: parquet-size [option...] <input>
where option is one of:
    -d,--detailed       Detailed size of each matching file
       --debug          Enable debug output
    -h,--help           Show this help string
       --no-color       Disable color output even if supported
    -p,--pretty         Pretty size
    -u,--uncompressed   Uncompressed size
where <input> is the parquet file to get size & human readable size to stdout
```
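All of these commands expect a genuine Parquet file as `<input>`. When one of them fails on a file, a quick sanity check needs none of them: per the Parquet format, a file begins and ends with the 4-byte magic `PAR1`. A minimal stdlib-only sketch:

```python
def looks_like_parquet(path):
    """Return True if the file carries the Parquet 'PAR1' magic at both ends."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)              # jump to end of file
        if f.tell() < 12:         # leading magic + footer length + trailing magic, at minimum
            return False
        f.seek(-4, 2)
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

This only checks the magic bytes, so it can rule a file out but cannot prove the footer metadata is intact.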
  • View the schema

```shell
$ hadoop jar parquet-tools-1.10.1.jar schema /user/root/myparquet.parquet
2022-08-10 16:31:53,138 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
message record {
  optional binary _hoodie_commit_time (UTF8);
  optional binary _hoodie_commit_seqno (UTF8);
  optional binary _hoodie_record_key (UTF8);
  optional binary _hoodie_partition_path (UTF8);
  optional binary _hoodie_file_name (UTF8);
  required int32 status_id;
  optional binary status (UTF8);
  optional binary active (UTF8);
  optional binary status_cn (UTF8);
  optional binary status_en (UTF8);
}
```
  • View the contents

```shell
# hadoop jar parquet-tools-1.10.1.jar cat /user/root/myparquet.parquet
2022-08-10 16:32:35,883 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-08-10 16:32:36,369 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 4 records.
2022-08-10 16:32:36,369 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2022-08-10 16:32:36,412 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2022-08-10 16:32:36,413 INFO compress.CodecPool: Got brand-new decompressor [.gz]
2022-08-10 16:32:36,427 INFO hadoop.InternalParquetRecordReader: block read in memory in 57 ms. row count = 4
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_0
_hoodie_record_key = 12
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 12
status = released
active = not ceased
status_cn = 已释放
status_en = released
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_1
_hoodie_record_key = 2
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 2
status = stopped
active = not ceased
status_cn = 已关机
status_en = stopped
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_2
_hoodie_record_key = 4
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 4
status = suspended
active = not ceased
status_cn = 已挂起
status_en = suspended
_hoodie_commit_time = 20220804160725319
_hoodie_commit_seqno = 20220804160725319_0_3
_hoodie_record_key = 6
_hoodie_partition_path =
_hoodie_file_name = 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet
status_id = 6
status = in-use
active = active
status_cn = 使用中
status_en = in-use
```
  • View the metadata

```shell
$ hadoop jar parquet-tools-1.10.1.jar meta /user/root/myparquet.parquet
extra:                  hoodie_min_record_key = 12
extra:                  parquet.avro.schema = {"type":"record","name":"record","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"status_id","type":"int"},{"name":"status","type":["null","string"],"default":null},{"name":"active","type":["null","string"],"default":null},{"name":"status_cn","type":["null","string"],"default":null},{"name":"status_en","type":["null","string"],"default":null}]}
extra:                  writer.model.name = avro
extra:                  hoodie_max_record_key = 6

file schema:            record
--------------------------------------------------------------------------------
_hoodie_commit_time:    OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_commit_seqno:   OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_record_key:     OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_partition_path: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_file_name:      OPTIONAL BINARY O:UTF8 R:0 D:1
status_id:              REQUIRED INT32 R:0 D:0
status:                 OPTIONAL BINARY O:UTF8 R:0 D:1
active:                 OPTIONAL BINARY O:UTF8 R:0 D:1
status_cn:              OPTIONAL BINARY O:UTF8 R:0 D:1
status_en:              OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1:            RC:4 TS:775 OFFSET:4
--------------------------------------------------------------------------------
_hoodie_commit_time:    BINARY GZIP DO:4 FPO:63 SZ:110/70/0.64 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 20220804160725319, max: 20220804160725319, num_nulls: 0]
_hoodie_commit_seqno:   BINARY GZIP DO:0 FPO:114 SZ:83/130/1.57 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 20220804160725319_0_0, max: 20220804160725319_0_3, num_nulls: 0]
_hoodie_record_key:     BINARY GZIP DO:0 FPO:197 SZ:61/50/0.82 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 12, max: 6, num_nulls: 0]
_hoodie_partition_path: BINARY GZIP DO:258 FPO:301 SZ:94/54/0.57 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: , num_nulls: 0]
_hoodie_file_name:      BINARY GZIP DO:352 FPO:456 SZ:155/124/0.80 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet, max: 9f35356b-63ef-4237-8228-d5a82042c437_0-4-0_20220804160725319.parquet, num_nulls: 0]
status_id:              INT32 GZIP DO:0 FPO:507 SZ:54/39/0.72 VC:4 ENC:BIT_PACKED,PLAIN ST:[min: 2, max: 12, num_nulls: 0]
status:                 BINARY GZIP DO:0 FPO:561 SZ:90/76/0.84 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: in-use, max: suspended, num_nulls: 0]
active:                 BINARY GZIP DO:651 FPO:711 SZ:112/74/0.66 VC:4 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: active, max: not ceased, num_nulls: 0]
status_cn:              BINARY GZIP DO:0 FPO:763 SZ:91/82/0.90 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 使用中, max: 已释放, num_nulls: 0]
status_en:              BINARY GZIP DO:0 FPO:854 SZ:90/76/0.84 VC:4 ENC:BIT_PACKED,PLAIN,RLE ST:[min: in-use, max: suspended, num_nulls: 0]
```
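Everything `meta` prints comes from the file footer. Per the Parquet format, the last 8 bytes of a file are a 4-byte little-endian length of the serialized footer metadata followed by the `PAR1` magic, so the footer's size can be recovered with the stdlib alone (a sketch, not a full footer parser):

```python
import struct

def parquet_footer_size(path):
    """Return the byte length of a Parquet file's serialized footer metadata.

    Layout at end of file: <footer bytes> <4-byte LE footer length> PAR1
    """
    with open(path, "rb") as f:
        f.seek(-8, 2)                         # last 8 bytes: length + magic
        length_bytes, magic = f.read(4), f.read(4)
    if magic != b"PAR1":
        raise ValueError("not a Parquet file: trailing magic missing")
    return struct.unpack("<I", length_bytes)[0]
```

Decoding the footer itself would additionally require a Thrift (compact protocol) parser, which is what the tools above bundle.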

Installing parquet-tools via pip

  • Install parquet-tools via pip (Ubuntu 20)

```shell
python3 -m pip install parquet-tools
```
  • View the help

```shell
# parquet-tools --help
usage: parquet-tools [-h] {show,csv,inspect} ...

parquet CLI tools

positional arguments:
  {show,csv,inspect}
    show              Show human readable format. see `show -h`
    csv               Cat csv style. see `csv -h`
    inspect           Inspect parquet file. see `inspect -h`

optional arguments:
  -h, --help          show this help message and exit
```
  • View the schema

```shell
# parquet-tools inspect e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet

############ file meta data ############
created_by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
num_columns: 10
num_rows: 4
num_row_groups: 1
format_version: 1.0
serialized_size: 433901

############ Columns ############
_hoodie_commit_time
_hoodie_commit_seqno
_hoodie_record_key
_hoodie_partition_path
_hoodie_file_name
status_id
status
active
status_cn
status_en

############ Column(_hoodie_commit_time) ############
name: _hoodie_commit_time
path: _hoodie_commit_time
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -57%)

############ Column(_hoodie_commit_seqno) ############
name: _hoodie_commit_seqno
path: _hoodie_commit_seqno
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 36%)

############ Column(_hoodie_record_key) ############
name: _hoodie_record_key
path: _hoodie_record_key
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -20%)

############ Column(_hoodie_partition_path) ############
name: _hoodie_partition_path
path: _hoodie_partition_path
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -74%)

############ Column(_hoodie_file_name) ############
name: _hoodie_file_name
path: _hoodie_file_name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -25%)

############ Column(status_id) ############
name: status_id
path: status_id
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: -38%)

############ Column(status) ############
name: status
path: status
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -13%)

############ Column(active) ############
name: active
path: active
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -51%)

############ Column(status_cn) ############
name: status_cn
path: status_cn
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -9%)

############ Column(status_en) ############
name: status_en
path: status_en
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: -13%)
```
  • View the contents

```shell
# parquet-tools show e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet
+-----------------------+------------------------+----------------------+--------------------------+----------------------------------------------------------------------+-------------+------------+------------+-------------+-------------+
| _hoodie_commit_time   | _hoodie_commit_seqno   | _hoodie_record_key   | _hoodie_partition_path   | _hoodie_file_name                                                      | status_id   | status     | active     | status_cn   | status_en   |
|-----------------------+------------------------+----------------------+--------------------------+----------------------------------------------------------------------+-------------+------------+------------+-------------+-------------|
| 20220808184420366     | 20220808184420366_1_0  | 11                   |                          | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 11          | associated | active     | 已分配      | associated  |
| 20220808184420366     | 20220808184420366_1_1  | 1                    |                          | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 1           | running    | active     | 运行中      | running     |
| 20220808184420366     | 20220808184420366_1_2  | 3                    |                          | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 3           | terminated | not ceased | 已删除      | terminated  |
| 20220808184420366     | 20220808184420366_1_3  | 10                   |                          | e24e1f19-e478-424e-80e6-2cbe1fc198f1_1-4-0_20220808184420366.parquet  | 10          | poweroffed | not ceased | 已断电      | poweroffed  |
+-----------------------+------------------------+----------------------+--------------------------+----------------------------------------------------------------------+-------------+------------+------------+-------------+-------------+
```