HDFS - An Introduction to HDF5 Data Files - Big Data

HDF5 Structure
Downloading and Installing HDF5 (omitted)
Reading and Writing HDF5 Files with Python

Reposted from https://zhuanlan.zhihu.com/p/104145585

HDF5 Structure

HDF5 files usually carry the .h5 or .hdf5 extension, and previewing their contents requires dedicated software. The HDF5 file structure has two primary objects: groups and datasets.

Groups are similar to folders. Every HDF5 file has a root group '/', which acts as the top-level folder.
Datasets are similar to NumPy arrays.
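
To make the folder and array analogy concrete, here is a minimal sketch using h5py. The file name and the names "measurements" and "temps" are made up for illustration; the point is only that the file object behaves as the root group and that a dataset reports an array-like shape.

```python
import os
import tempfile

import h5py
import numpy as np


def describe_objects(path):
    """Create a small file and report what kind of object each name refers to."""
    with h5py.File(path, "w") as f:
        # "measurements" and "temps" are hypothetical names for this sketch.
        grp = f.create_group("measurements")                      # folder-like
        grp.create_dataset("temps", data=np.arange(6).reshape(2, 3))  # array-like

        return {
            "root_name": f.name,                          # name of the root group
            "root_is_group": isinstance(f, h5py.Group),   # the file acts as a group
            "group_is_group": isinstance(f["measurements"], h5py.Group),
            "dataset_is_dataset": isinstance(
                f["measurements/temps"], h5py.Dataset
            ),
            "dataset_shape": f["measurements/temps"].shape,
        }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(describe_objects(os.path.join(tmp, "objects_demo.h5")))
```

Paths inside the file use '/' as a separator, so "measurements/temps" reads much like a file path inside a folder.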

Each dataset has two parts: the raw data values and the metadata, that is, data that describes and gives information about other data (here, the raw data).

+-- Dataset
|   +-- (Raw) Data Values (eg: a 4 x 5 x 6 matrix)
|   +-- Metadata
|   |   +-- Dataspace (eg: Rank = 3, Dimensions = {4, 5, 6})
|   |   +-- Datatype (eg: Integer)
|   |   +-- Properties (eg: Chunked, Compressed)
|   |   +-- Attributes (eg: attr1 = 32.4, attr2 = "hello", ...)
|

The structure above shows that:

Dataspace gives the rank and dimensions of the raw data
Datatype gives the data type
Properties describes how the dataset is chunked and compressed in storage
- Chunked: Better access time for subsets; extendible
- Chunked & Compressed: Improves storage efficiency, transmission speed
Attributes holds any other user-defined attributes of the dataset
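
The four kinds of metadata listed above can be read back through h5py. The sketch below is illustrative only: the dataset name "volume", the chunk shape (2, 5, 6), and the choice of gzip compression are assumptions, while the 4 x 5 x 6 shape and the two attributes mirror the diagram.

```python
import os
import tempfile

import h5py
import numpy as np


def inspect_metadata(path):
    """Write one chunked, compressed dataset and read its metadata back."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "volume",                                 # hypothetical dataset name
            data=np.zeros((4, 5, 6), dtype="int32"),
            chunks=(2, 5, 6),                         # assumed chunk shape
            compression="gzip",                       # assumed compression filter
        )
        dset.attrs["attr1"] = 32.4
        dset.attrs["attr2"] = "hello"

    with h5py.File(path, "r") as f:
        dset = f["volume"]
        return {
            "rank": dset.ndim,                  # Dataspace: rank
            "dimensions": dset.shape,           # Dataspace: dimensions
            "datatype": str(dset.dtype),        # Datatype
            "chunks": dset.chunks,              # Properties: chunking
            "compression": dset.compression,    # Properties: compression
            "attributes": {k: dset.attrs[k] for k in dset.attrs},  # Attributes
        }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for key, value in inspect_metadata(os.path.join(tmp, "meta.h5")).items():
            print(key, ":", value)
```

None of these properties requires reading the raw data values; they all come from the dataset's metadata.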

The structure of a whole HDF5 file looks like this:

+-- /
|   +-- group_1
|   |   +-- dataset_1_1
|   |   |   +-- attribute_1_1_1
|   |   |   +-- attribute_1_1_2
|   |   |   +-- ...
|   |   |
|   |   +-- dataset_1_2
|   |   |   +-- attribute_1_2_1
|   |   |   +-- attribute_1_2_2
|   |   |   +-- ...
|   |   |
|   |   +-- ...
|   |
|   +-- group_2
|   |   +-- dataset_2_1
|   |   |   +-- attribute_2_1_1
|   |   |   +-- attribute_2_1_2
|   |   |   +-- ...
|   |   |
|   |   +-- dataset_2_2
|   |   |   +-- attribute_2_2_1
|   |   |   +-- attribute_2_2_2
|   |   |   +-- ...
|   |   |
|   |   +-- ...
|   |
|   +-- ...
|
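
A tree like this can be listed programmatically. The following sketch assumes h5py's visititems method, which calls a function for every group and dataset below the root. The helper names build_sample and list_tree, and the "@" notation for attributes, are inventions for this example.

```python
import os
import tempfile

import h5py
import numpy as np


def build_sample(path):
    """Write a small file shaped like the diagram (names are illustrative)."""
    with h5py.File(path, "w") as f:
        for group_name in ("group_1", "group_2"):
            grp = f.create_group(group_name)
            dset = grp.create_dataset("dataset", data=np.arange(3))
            dset.attrs["attribute"] = 1


def list_tree(path):
    """Return one line per group, dataset, and attribute below the root."""
    lines = []

    def visit(name, obj):
        # name is the path relative to the root; obj is a Group or a Dataset.
        kind = "group" if isinstance(obj, h5py.Group) else "dataset"
        lines.append(f"{kind}: /{name}")
        for key in obj.attrs:
            lines.append(f"attribute: /{name}@{key}")

    with h5py.File(path, "r") as f:
        f.visititems(visit)  # calls visit(name, obj) for every object below '/'
    return lines


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "tree.h5")
        build_sample(sample)
        print("\n".join(list_tree(sample)))
```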

Downloading and Installing HDF5 (omitted)

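Note: when you install h5py, the Python library for HDF5, with conda install h5py or pip install h5py, some HDF5 binaries (such as h5dump, h5cc/h5c++, and h5fc) and library files may be installed as well. That installation can be incomplete, in which case the HDF5 C/C++ compiler wrappers h5cc/h5c++ and the Fortran compiler wrapper h5fc do not work properly.
Fix: if h5c++ fails to compile C++ files, run which h5c++ in a terminal. If the binary it reports lives in Python's bin directory, locate the h5c++ installed by brew or another package manager (usually in /usr/local/bin), or the one unpacked from the official download, and add alias h5c++=/usr/local/bin/h5c++ (with no spaces around the = sign) to the .bashrc file in your home directory (~), or to the configuration file of whichever shell you use, such as .zshrc for zsh.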

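To compile with clang++ or g++ instead of h5c++, you only need to add the flags for the header files (-I) and library files (-L). First confirm that h5c++ itself compiles correctly, then run h5c++ -show in a terminal. It prints CXX_COMPILER followed by CXX_FLAGS, for example:
g++ -I/usr/local/opt/szip/include -L/usr/local/Cellar/hdf5/1.10.6/lib /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5_hl_cpp.a /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5_cpp.a /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5_hl.a /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5.a -L/usr/local/opt/szip/lib -lsz -lz -ldl -lm
You can therefore compile a C++ file with CXX_COMPILER + XXX.cpp + CXX_FLAGS. Because of link-order dependencies (a library generally has to appear after the source file that uses it), CXX_FLAGS normally goes last and XXX.cpp goes before it; otherwise compilation may fail.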
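
The ordering rule can be sketched as a small helper that rearranges the text printed by h5c++ -show. This is only an illustration: build_compile_command is a hypothetical function, the shortened flag string is a stand-in for real output, and the sketch builds the command without running any compiler.

```python
import shlex


def build_compile_command(show_output, source, output="a.out"):
    """Place the source file between the compiler and its flags.

    show_output is the text printed by `h5c++ -show`: the compiler name
    followed by its include, library, and link flags.
    """
    tokens = shlex.split(show_output)
    compiler, flags = tokens[0], tokens[1:]
    # Source first, flags last, so libraries follow the code that uses them.
    return [compiler, source, "-o", output, *flags]


if __name__ == "__main__":
    # Shortened stand-in for real `h5c++ -show` output (an assumption).
    show = "g++ -I/usr/local/opt/szip/include -L/usr/local/opt/szip/lib -lsz -lz -ldl -lm"
    print(" ".join(build_compile_command(show, "XXX.cpp", "XXX")))
```

The resulting list could be passed to subprocess.run on a machine where the compiler and the HDF5 libraries are actually installed.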

Reading and Writing HDF5 Files with Python

h5py, the Python library for HDF5, is fairly easy to use. Here is a simple example:
/h5py_example.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-
#
# Created by WW on Jan. 26, 2020
# All rights reserved.
#
import h5py
import numpy as np


def main():
    #===========================================================================
    # Create an HDF5 file.
    f = h5py.File("h5py_example.hdf5", "w")    # mode = {'w', 'r', 'a'}
    # Create two groups under root '/'.
    g1 = f.create_group("bar1")
    g2 = f.create_group("bar2")
    # Create a dataset under root '/'.
    d = f.create_dataset("dset", data=np.arange(16).reshape([4, 4]))
    # Add two attributes to dataset 'dset'
    d.attrs["myAttr1"] = [100, 200]
    d.attrs["myAttr2"] = "Hello, world!"
    # Create a group and a dataset under group "bar1".
    c1 = g1.create_group("car1")
    d1 = g1.create_dataset("dset1", data=np.arange(10))
    # Create a group and a dataset under group "bar2".
    c2 = g2.create_group("car2")
    d2 = g2.create_dataset("dset2", data=np.arange(10))
    # Save and exit the file.
    f.close()
    ''' h5py_example.hdf5 file structure
    +-- '/'
    |   +-- group "bar1"
    |   |   +-- group "car1"
    |   |   |   +-- None
    |   |   |   
    |   |   +-- dataset "dset1"
    |   |
    |   +-- group "bar2"
    |   |   +-- group "car2"
    |   |   |   +-- None
    |   |   |
    |   |   +-- dataset "dset2"
    |   |   
    |   +-- dataset "dset"
    |   |   +-- attribute "myAttr1"
    |   |   +-- attribute "myAttr2"
    |   |   
    |   
    '''
    #===========================================================================
    # Read HDF5 file.
    f = h5py.File("h5py_example.hdf5", "r")    # mode = {'w', 'r', 'a'}
    # Print the keys of groups and datasets under '/'.
    print(f.filename, ":")
    print(list(f.keys()), "\n")
    #===================================================
    # Read dataset 'dset' under '/'.
    d = f["dset"]
    # Print the data of 'dset'.
    print(d.name, ":")
    print(d[:])
    # Print the attributes of dataset 'dset'.
    for key in d.attrs.keys():
        print(key, ":", d.attrs[key])
    print()
    #===================================================
    # Read group 'bar1'.
    g = f["bar1"]
    # Print the keys of groups and datasets under group 'bar1'.
    print([key for key in g.keys()])
    # Three methods to print the data of 'dset1'.
    print(f["/bar1/dset1"][:])        # 1. absolute path
    print(f["bar1"]["dset1"][:])    # 2. relative path: file[][]
    print(g['dset1'][:])        # 3. relative path: group[]
    # Delete a dataset.
    # Note: deletion requires the file to be opened in mode 'a' (read/write),
    # not 'r' as above.
    '''
    del g["dset1"]
    '''
    # Save and exit the file
    f.close()
if __name__ == "__main__":
    main()
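
The deletion step that the script leaves commented out can be sketched separately. The version below is a minimal illustration, not part of the original script: delete_dataset is a hypothetical helper, and it opens the file in mode 'a' inside a with block so the file is closed even if an error occurs.

```python
import os
import tempfile

import h5py
import numpy as np


def delete_dataset(path, name):
    """Delete `name` from the file at `path`; return True if it existed."""
    with h5py.File(path, "a") as f:  # 'a' = read/write, create if missing
        if name not in f:
            return False
        del f[name]                  # unlinks the object from its parent group
        return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "h5py_example.hdf5")
        # Intermediate group "bar1" is created automatically.
        with h5py.File(path, "w") as f:
            f.create_dataset("bar1/dset1", data=np.arange(10))
        print(delete_dataset(path, "bar1/dset1"))
        with h5py.File(path, "r") as f:
            print(list(f["bar1"].keys()))
```

Deleting only unlinks the dataset, so the file on disk usually does not shrink; the h5repack tool that ships with HDF5 can rewrite the file to reclaim the space.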