I. Snappy

1. snzip: a compression/decompression tool built on Snappy

  • Version used: 1.0.4
  • Supported file formats
    • framing-format
    • old framing-format
    • hadoop-snappy format (the file format used by Hadoop's Snappy compression)
    • raw format
    • snappy-java
    • snappy-in-java
  • snzip project page: https://github.com/kubo/snzip

For download, installation, and usage details, see the GitHub documentation.

  1. Install
     tar xvfz snzip-1.0.4.tar.gz
     cd snzip-1.0.4
     ./configure --prefix=/usr/local/snappy
     make
     make install
  2. Add snzip to the shell environment
     vim ~/.bashrc                        # add the three lines below
     # snzip
     export SNZIP_HOME=/usr/local/snappy
     export PATH=${SNZIP_HOME}/bin:$PATH
     source ~/.bashrc                     # reload the settings
  3. snzip -h
     general options:
       -c       write to standard output, keep original files unchanged
       -d       decompress
       -k       keep (don't delete) input files
       -t name  file format name (one of the supported formats listed below)
       -h       give this help
     raw_format option:
       -s size  size of input data when compressing.
                The default value is the file size if available
     tuning options:
       -b num   internal block size in bytes
       -B num   internal block size. 'num'-th power of two.
       -R num   size of read buffer in bytes
       -W num   size of write buffer in bytes
       -T       trace for debug
     supported formats (values for -t):
       NAME            SUFFIX  URL
       ----            ------  ---
       framing2        sz      https://github.com/google/snappy/blob/master/framing_format.txt
       hadoop-snappy   snappy  https://code.google.com/p/hadoop-snappy/
       iwa             iwa     https://github.com/obriensp/iWorkFileFormat/blob/master/Docs/index.md#snappy-compression
       framing         sz      https://github.com/google/snappy/blob/0755c815197dacc77d8971ae917c86d7aa96bf8e/framing_format.txt
       snzip           snz     https://github.com/kubo/snzip
       snappy-java     snappy  https://github.com/xerial/snappy-java
       snappy-in-java  snappy  https://github.com/dain/snappy
       comment-43      snappy  http://code.google.com/p/snappy/issues/detail?id=34#c43
  4. Compress to the format that Hadoop supports
     snzip -t hadoop-snappy -k file_name    # compress, keeping the original file
     snzip -d compressed_file.snappy        # decompress
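
Several formats in the table above share the .snappy suffix, so the suffix alone does not say which one a file uses. The following is a minimal Python sketch, not part of snzip, that looks at the leading bytes of a file and checks two signatures: the stream identifier chunk defined in framing_format.txt (framing2) and the magic header written by snappy-java. The names `sniff_snappy_format` and `sniff_file` are made up for illustration, and any format not covered by these two signatures (for example raw) is simply reported as unknown.

```python
# Leading-byte signatures for two of the formats in the table above.
FRAMING2_MAGIC = b"\xff\x06\x00\x00sNaPpY"   # stream identifier chunk (framing_format.txt)
SNAPPY_JAVA_MAGIC = b"\x82SNAPPY\x00"        # magic header used by snappy-java streams


def sniff_snappy_format(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if header.startswith(FRAMING2_MAGIC):
        return "framing2"
    if header.startswith(SNAPPY_JAVA_MAGIC):
        return "snappy-java"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read a few leading bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return sniff_snappy_format(f.read(16))
```

A result of "unknown" only means that neither signature matched; it does not prove the file is not Snappy data.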

2. Python compression/decompression interface (not compatible with HDFS-native Snappy)

  1. Install
     # Dependencies
     # Ubuntu:
     sudo apt-get install libsnappy-dev
     # CentOS:
     sudo yum install libsnappy-devel
     # macOS (Homebrew):
     brew install snappy
     # Install the package
     pip install python-snappy
     python -m snappy --help
  2. Compress/decompress a file
     python -m snappy -c uncompressed_file compressed_file.snappy
     python -m snappy -d compressed_file.snappy uncompressed_file
  3. Compress/decompress a stream
     cat uncompressed_data | python -m snappy -c > compressed_data.snappy
     cat compressed_data.snappy | python -m snappy -d > uncompressed_data
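
The package can also be called from Python code instead of the command line. A minimal sketch, assuming python-snappy is installed as described above: `snappy.compress` and `snappy.decompress` operate on raw Snappy blocks held in memory. The `roundtrip` helper is illustrative only; it returns None when the module cannot be imported, so the script still runs on a machine without the package.

```python
try:
    import snappy  # provided by the python-snappy package
except ImportError:
    snappy = None


def roundtrip(data: bytes):
    """Compress then decompress `data` in memory; None if python-snappy is missing."""
    if snappy is None:
        return None
    compressed = snappy.compress(data)    # raw Snappy block, no framing header
    return snappy.decompress(compressed)


if __name__ == "__main__":
    original = b"hello snappy " * 100
    restored = roundtrip(original)
    if restored is None:
        print("python-snappy is not installed")
    else:
        print("round trip ok:", restored == original)
```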

3. Java compression/decompression interface

  1. Note: to use it on macOS, unpack the jar first
     # Copy libsnappyjava.jnilib to libsnappyjava.dylib
     cp org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib org/xerial/snappy/native/Mac/x86_64/libsnappyjava.dylib
     # Repackage
     jar cf snappy-java-1.0.4.1.jar org
  2. Load the local jar in pom.xml
     <dependencies>
       <dependency>
         <groupId>org.xerial.snappy</groupId>
         <artifactId>snappy-java</artifactId>
         <version>1.0.4.1</version>
         <scope>system</scope>
         <systemPath>${basedir}/lib/snappy-java-1.0.4.1.jar</systemPath>
       </dependency>
     </dependencies>

II. Spark configuration

  # Configure Snappy for Spark
  export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
  export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
  export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/snappy-java-1.0.4.1.jar
  spark-sql --jars file:///etc/hive/auxlib/json-serde-1.3.7-jar-with-dependencies.jar,file:///usr/lib/hadoop/lib/snappy-java-1.0.4.1.jar
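
Before starting Spark it can help to confirm that the native directory added to the library paths above actually contains a Snappy library. A minimal Python sketch, assuming the `$HADOOP_HOME/lib/native` layout used in the exports; the helper name `find_snappy_libs` and the `libsnappy*` file pattern are assumptions for illustration and may not match every Hadoop distribution.

```python
import glob
import os


def find_snappy_libs(hadoop_home: str) -> list:
    """Return sorted paths matching libsnappy* under <hadoop_home>/lib/native."""
    native_dir = os.path.join(hadoop_home, "lib", "native")
    return sorted(glob.glob(os.path.join(native_dir, "libsnappy*")))


if __name__ == "__main__":
    home = os.environ.get("HADOOP_HOME", "")
    libs = find_snappy_libs(home) if home else []
    if libs:
        print("\n".join(libs))
    else:
        print("no libsnappy* found under $HADOOP_HOME/lib/native")
```

An empty result suggests checking whether Hadoop was built or packaged with native Snappy support before looking at the Spark settings.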