What is ETL & Data Science
    Data Pipeline
    Spark for ETL and Data Science - Figure 1

    How to do ETL in Spark
    What is ETL

    • Extract
      • Read raw data from single/multiple sources (no schema, uncompressed, dirty)
    • Transform
      • Transform raw data (Filter/Aggregation/Normalization/Join)
    • Load
      • Write the processed data into sinks (compressed, structured, cleaned, well-organized)
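
    The three steps can be sketched without Spark, using only the Python standard library. This is a minimal illustration under stated assumptions, not how Spark implements ETL: the sample data, the column names city and amount, and the function names extract, transform and load are all hypothetical.

```python
import csv
import gzip
import io
import json
from collections import defaultdict

# Hypothetical raw input: no schema, uncompressed, with dirty rows.
RAW_CSV = """city,amount
Beijing,10
Shanghai,abc
beijing ,5
,7
Shanghai,20
"""


def extract(text):
    """Extract: read raw rows as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize keys, filter bad rows, aggregate per city."""
    totals = defaultdict(int)
    for row in rows:
        city = (row.get("city") or "").strip().title()  # normalization
        amount = (row.get("amount") or "").strip()
        if not city or not amount.isdigit():  # filter out dirty rows
            continue
        totals[city] += int(amount)  # aggregation
    return [{"city": c, "total": t} for c, t in sorted(totals.items())]


def load(records):
    """Load: write cleaned records as gzip-compressed JSON lines."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as sink:
        for record in records:
            sink.write((json.dumps(record) + "\n").encode("utf-8"))
    return buffer.getvalue()


cleaned = transform(extract(RAW_CSV))
compressed = load(cleaned)
print(cleaned)
```

    The sink here is an in-memory buffer so the sketch stays self-contained; a real job would write to a file system or table instead.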

    Spark for ETL and Data Science - Figure 2
    Why Spark for ETL

    • Architecture
    • Performance
    • Ecosystem
    • API

    Spark for ETL and Data Science - Figure 3
    ETL Example in Spark
    spark.read.csv("source_path")
    .filter(…)
    .agg(…)
    .write.mode("append")
    .orc("sink_path")

    Spark for ETL and Data Science - Figure 4
    Handling bad records

    Text formats (CSV, JSON) support 3 parsing modes

    • PERMISSIVE
      • sets the other fields to null when it meets a corrupted record and puts the malformed string into a new field configured by spark.sql.columnNameOfCorruptRecord.
    • DROPMALFORMED
      • ignores the whole corrupted record.
    • FAILFAST
      • throws an exception when it meets a corrupted record.
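
    The behaviour of the three modes can be imitated in plain Python on JSON lines. This is a simplified sketch of the semantics described above, not Spark's parser: the function parse_json_lines and the sample records are hypothetical, and _corrupt_record is used because it is the default value of spark.sql.columnNameOfCorruptRecord.

```python
import json

# Default value of spark.sql.columnNameOfCorruptRecord in Spark.
CORRUPT_COLUMN = "_corrupt_record"
MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")


def parse_json_lines(lines, fields, mode="PERMISSIVE"):
    """Parse JSON lines, imitating the three parsing modes (not Spark code)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("record is not a JSON object")
        except ValueError as error:
            if mode == "FAILFAST":
                # Stop at the first corrupted record.
                raise ValueError(f"malformed record: {line!r}") from error
            if mode == "DROPMALFORMED":
                # Skip the whole corrupted record.
                continue
            # PERMISSIVE: null out the fields and keep the raw string.
            row = {name: None for name in fields}
            row[CORRUPT_COLUMN] = line
        else:
            row = {name: record.get(name) for name in fields}
            if mode == "PERMISSIVE":
                row[CORRUPT_COLUMN] = None
        rows.append(row)
    return rows


# Hypothetical input: the second line is truncated and therefore corrupted.
SAMPLE = ['{"id": 1, "name": "a"}', '{"id": 2, "name"', '{"id": 3, "name": "c"}']

permissive = parse_json_lines(SAMPLE, ["id", "name"])
dropped = parse_json_lines(SAMPLE, ["id", "name"], mode="DROPMALFORMED")
print(permissive[1])
print(len(dropped))
```

    PERMISSIVE keeps every input line and makes the bad ones inspectable, which suits sources you do not control; FAILFAST is the stricter choice when a corrupted record should stop the job.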


    Handling corrupted records
    Spark for ETL and Data Science - Figure 5

    Keep in mind

    • You have no control over the source data (format/scale/schema)
    • You have no control over the hardware/network (fault tolerance)

    How to do Data Science in Spark
    Data Science via Spark
    Spark for ETL and Data Science - Figure 6

    Spark SQL Operation

    • Add or update columns
    • Drop columns
    • Where | Filter
    • Group by
    • Aggregation
    • Join
    • Union
    • UDF

    Visualization

    • Zeppelin Notebook (SQL)
    • PySpark (Python)
    • SparkR (R)

    Zeppelin Notebook
    Spark for ETL and Data Science - Figure 7
    PySpark

    • Matplotlib
    • Pandas
    • Bokeh
    • Seaborn
    • Plotnine
    • HoloViews

    Spark for ETL and Data Science - Figure 8
    R

    • R built-in
    • ggplot2
    • googleVis

    Spark for ETL and Data Science - Figure 9
    Three types of Machine Learning

    • Supervised Learning
      • Labeled data is available
      • Classification / Regression
    • Unsupervised Learning
      • No labeled data is available
    • Reinforcement Learning
      • The model continuously learns and relearns from its actions and the effects/rewards of those actions
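
    The difference between the first two types can be shown on a toy one-dimensional data set. This is a minimal pure-Python sketch, not Spark ML: the data, the labels small and large, and the functions fit_centroids, predict and two_means are hypothetical, and two_means assumes the points contain at least two distinct values. Reinforcement learning is not covered here.

```python
from statistics import mean

# Hypothetical toy data: two obvious groups on a line.
POINTS = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
LABELS = ["small", "small", "small", "large", "large", "large"]


def fit_centroids(points, labels):
    """Supervised: labels are given, so average the points of each class."""
    return {
        c: mean(p for p, l in zip(points, labels) if l == c)
        for c in sorted(set(labels))
    }


def predict(centroids, x):
    """Classify x as the class whose centroid is nearest."""
    return min(centroids, key=lambda c: abs(centroids[c] - x))


def two_means(points, iterations=10):
    """Unsupervised: no labels, so discover two groups from the data alone."""
    low, high = min(points), max(points)  # deterministic starting centers
    for _ in range(iterations):
        near_low = [p for p in points if abs(p - low) <= abs(p - high)]
        near_high = [p for p in points if abs(p - low) > abs(p - high)]
        low, high = mean(near_low), mean(near_high)
    return low, high


centroids = fit_centroids(POINTS, LABELS)
print(predict(centroids, 1.1), predict(centroids, 4.0))
print(two_means(POINTS))
```

    Both functions end up with centers near 1 and 5 on this data, but only the supervised version can name the groups, because only it was given labels.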

    Machine Learning Basics
    Spark for ETL and Data Science - Figure 10
    Spark ML Pipeline
    Spark for ETL and Data Science - Figure 11
    Demo via Spark on Zeppelin

    Apache Zeppelin
    Spark for ETL and Data Science - Figure 12
    Demo code screenshots
    Spark for ETL and Data Science - Figures 13 to 31
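    Q&A

    My chat cache was cleared, so I could not record today's Q&A session. If anyone has a record of it, please send it to me and I will add it here.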