Reading Snappy-Compressed Data with SparkSQL

  • Version: Spark 2.1

1. Configuration

  • Requires the native libraries under $HADOOP_HOME/lib/native (after setting the environment variables, exit the terminal and restart the services)
    # Environment setup
    export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native:/usr/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib
    export SPARK_HOME=/usr/local/spark
    export SPARK_CONF_DIR=$SPARK_HOME/conf
    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
    export SPARK_CLASSPATH=$SPARK_CLASSPATH
    export PATH=$SPARK_HOME/bin:$PATH
    # Launch command
    spark-sql --jars file://$HADOOP_HOME/lib/snappy-java-1.0.4.1.jar,file:///etc/hive/auxlib/json-serde-1.3.7-jar-with-dependencies.jar \
      --name spark-sql-server \
      --master yarn \
      --deploy-mode client \
      --driver-cores 2 \
      --driver-memory 4g \
      --executor-cores 2 \
      --executor-memory 4g \
      --num-executors 2 \
      --conf spark.eventLog.enabled=false \
      --conf spark.eventLog.dir=hdfs://dw1:8020/tmp/spark \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec \
      --conf net.topology.script.file.name=/etc/hadoop/conf.cloudera.yarn/topology.py \
      --conf spark.sql.parquet.compression.codec=snappy
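    # Parameter notes
    -- spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec: codec for Spark's internal data (shuffle output, broadcast variables, RDD partitions)
    -- spark.sql.parquet.compression.codec=snappy: codec used when writing Parquet files; on read, the codec is taken from the file's own metadata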
    # Test query
    SELECT common FROM ods.ods_browser_click WHERE p_dt='2017-05-07' AND p_hours='00' LIMIT 1;
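
Before launching, it may help to confirm that the native directory configured above actually contains a Snappy shared library. The following is a minimal sketch, not part of the original setup: the helper name `has_snappy_lib` is hypothetical, and the fallback path simply reuses the HADOOP_HOME value from the environment setup. If the `hadoop` CLI is on the PATH, `hadoop checknative -a` gives a fuller report.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the directory passed as $1
# contains at least one libsnappy shared object.
has_snappy_lib() {
  for f in "$1"/libsnappy.so*; do
    # If the glob matched nothing, $f is the literal pattern and -e fails.
    [ -e "$f" ] && return 0
  done
  return 1
}

# Assumption: same HADOOP_HOME as in the environment setup above.
NATIVE_DIR="${HADOOP_HOME:-/opt/cloudera/parcels/CDH/lib/hadoop}/lib/native"

if has_snappy_lib "$NATIVE_DIR"; then
  echo "snappy native library found in $NATIVE_DIR"
else
  echo "snappy native library missing in $NATIVE_DIR"
fi

# Optional fuller report from Hadoop itself, when the CLI is available.
if command -v hadoop >/dev/null 2>&1; then
  hadoop checknative -a || echo "hadoop reports at least one native library as unavailable"
fi
```

Running the same check on each worker node shows whether the library is present there as well, not only on the machine that launches spark-sql.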

2. Problem

This setup does not work reliably with spark.master=yarn. The main symptom is that the worker nodes cannot load the native libraries under $HADOOP_HOME/lib/native.

    spark-sql \
      --name spark-sql-server \
      --master yarn \
      --deploy-mode client \
      --driver-cores 2 \
      --driver-memory 4g \
      --executor-cores 2 \
      --executor-memory 4g \
      --num-executors 2 \
      --conf spark.eventLog.enabled=false \
      --conf spark.eventLog.dir=hdfs://dw1:8020/tmp/spark \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec \
      --conf net.topology.script.file.name=/etc/hadoop/conf.cloudera.yarn/topology.py \
      --conf spark.sql.parquet.compression.codec=snappy
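
One possible way to make the native directory visible to the YARN executors is to pass it explicitly through Spark's `spark.driver.extraLibraryPath` and `spark.executor.extraLibraryPath` settings. The sketch below is an assumption, not a fix confirmed by the original write-up: the helper name `native_lib_flags` is hypothetical, and it presumes the native directory exists at the same path on every node. It only prints the resulting command so it can be inspected before use.

```shell
#!/bin/sh
# Hypothetical helper: prints the extra --conf flags that point the
# driver and the executors at the native library directory given as $1.
native_lib_flags() {
  printf '%s ' \
    --conf "spark.driver.extraLibraryPath=$1" \
    --conf "spark.executor.extraLibraryPath=$1"
}

# Assumption: same HADOOP_HOME as in the environment setup,
# installed at the same location on every worker node.
NATIVE_DIR="${HADOOP_HOME:-/opt/cloudera/parcels/CDH/lib/hadoop}/lib/native"

# Print the command instead of running it, so the flags can be reviewed first.
echo spark-sql --master yarn --deploy-mode client $(native_lib_flags "$NATIVE_DIR")
```

The printed flags can be appended to the launch command shown earlier. Whether this resolves the issue depends on the cluster layout, so the test query is worth rerunning afterwards.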