3.23 Understanding Offline Data Analysis


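First clean the data with Spark Core and upload the result to HDFS. Then create an external table in the Hive warehouse over the uploaded files, create a second Hive table, and use a SQL statement to aggregate the contents of the external table and write the result into the new table.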

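Task requirements

Data analysis with Spark
1. Remove the header row from the dataset;
2. Add a column named province to the dataset and fill it with a random province value for every row, e.g. "北京", "上海", "广州", "深圳";
3. Strip the hour from the time field, keeping only year-month-day;
4. Write the output to HDFS at: hdfs://master:9000/员工姓名 (员工姓名 is a placeholder for the employee's name)
Data analysis with Hive
1. Create a Hive external table user_action_external_hive whose location points to the directory holding the output of the first step, hdfs://master:9000/员工姓名;
2. Use Hive to count, per region, how many times users performed each kind of action on the site (total views, total add-to-cart actions, total favorites, total purchases), and write the result into a new Hive table named user_action_stat.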

Step 1

    Write the program with Spark Core as the task requires. Pay attention to how the delimiter is written: **the tab character is \t, not /t**

package work_Test
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random
object sale_user_text {
  def main(args: Array[String]): Unit = {
    // Initialize Spark
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("sale_user_text")
    val cs = new SparkContext(sparkConf)
    // Values to pick the random province/city from
    val city = Array("北京","上海","重庆","天津")
    // Input file location
    val data = cs.textFile("D:\\新建文件夹\\input\\sale_user.csv",1)
    // Drop the header row, which starts with the first column name
    val filterRDD: RDD[String] = data.filter { line => !line.startsWith("user_id") }
    // Split each line into an array and re-join the fields with tabs.
    // The time field is cut at the space with split(" ") so only the date remains;
    // the city is picked from the array above with a random index from Random.nextInt.
    val mapRDD = filterRDD.map { line =>
      val arr = line.split(",")
      arr(0) + "\t" + arr(1) + "\t" + arr(2) + "\t" + arr(3) + "\t" + arr(4).split(" ")(0) + "\t" + city(Random.nextInt(city.length))
    }
    // Output location
    mapRDD.saveAsTextFile("D:\\新建文件夹\\output\\output3")
    // Stop the SparkContext
    cs.stop()
  }
}
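
The per-line logic of the job above can also be pulled out into a plain function, which makes it easy to check without starting Spark. The following is a minimal sketch, not the original program: the names `LineCleaner`, `isHeader` and `cleanLine` are hypothetical, the input is assumed to have at least five comma-separated fields with `time` in the fifth position, and the province list is taken from the task statement rather than from the code above.

```scala
import scala.util.Random

object LineCleaner {
  // Province values from the task statement (assumption: the job above uses a different list).
  val provinces: Array[String] = Array("北京", "上海", "广州", "深圳")

  // The header row is assumed to start with the first column name, "user_id".
  def isHeader(line: String): Boolean = line.startsWith("user_id")

  // Returns None for rows with fewer than five fields instead of
  // throwing ArrayIndexOutOfBoundsException.
  def cleanLine(line: String, rnd: Random): Option[String] = {
    val fields = line.split(",")
    if (fields.length < 5) None
    else {
      // Keep everything before the first space, i.e. the date without the hour.
      val date = fields(4).takeWhile(_ != ' ')
      val province = provinces(rnd.nextInt(provinces.length))
      // Join the first four fields, the date and the province with tabs.
      Some((fields.take(4) :+ date :+ province).mkString("\t"))
    }
  }
}
```

In the Spark job this could presumably be used as `data.filter(l => !LineCleaner.isHeader(l)).flatMap(l => LineCleaner.cleanLine(l, Random))`, so that malformed rows are dropped rather than failing the task.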

Step 2

Upload the cleaned file to HDFS. Normally the Spark program from step 1 has already done this, provided its output path points to HDFS.

hadoop fs -put po000000 /input

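Write the statements in a txt file first, then paste them into the Hive client.<br />
external marks the table as an external table; row format delimited fields terminated by '\t' means the fields of each row in the warehouse are separated by \t.<br />
location '/input'; gives the HDFS directory that holds the files.<br />
If the columns come back as null after loading, check right away whether the delimiter written by the step 1 program matches the one declared here.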

-- Create the external table
create external table user_action(
user_id int,
googs_id int,
act_id int,
cata_id int,
time string,
prov string)
row format delimited fields terminated by '\t' location '/input';

Create a new table to hold the analysis results

create  table user_action_sta(
prov string,
user_id int,
viewCount int,
addCount int,
acllCount int,
buyCount int)
row format delimited fields terminated by '\t' ;

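Use a SQL statement to fill the new table with prov, user_id and the per-type action counts from the user_action table.
The counts rely on if applied to act_id: in sum(if(act_id=1,1,0)), a row contributes 1 when act_id is 1 and 0 otherwise, so the sum is the total viewCount.
group by prov,user_id; adds up the rows that share the same prov and user_id, i.e. the actions of one user.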

insert overwrite table user_action_sta
select prov,
       user_id,
       sum(if(act_id=1,1,0)),
       sum(if(act_id=2,1,0)),
       sum(if(act_id=3,1,0)),
       sum(if(act_id=4,1,0))
from user_action group by prov,user_id;
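
To see what `sum(if(act_id=k,1,0))` combined with `group by` computes, the same aggregation can be sketched on an in-memory Scala collection. This only illustrates the logic and is not how Hive executes the query. The `Action` and `Stat` types and the object name `ActionStats` are hypothetical, and the mapping of `act_id` values 1 to 4 to view, add-to-cart, favorite and purchase is an assumption based on the column order of `user_action_sta`.

```scala
object ActionStats {
  // One row of user_action, reduced to the columns the query uses.
  final case class Action(prov: String, userId: Int, actId: Int)
  // One output row: the four conditional counts.
  final case class Stat(view: Int, add: Int, fav: Int, buy: Int)

  // Equivalent of sum(if(act_id = k, 1, 0)): count the rows whose actId equals k.
  private def countOf(rows: Seq[Action], k: Int): Int = rows.count(_.actId == k)

  // Equivalent of "group by prov, user_id" followed by the four conditional sums.
  def aggregate(rows: Seq[Action]): Map[(String, Int), Stat] =
    rows.groupBy(r => (r.prov, r.userId)).map { case (key, group) =>
      key -> Stat(countOf(group, 1), countOf(group, 2), countOf(group, 3), countOf(group, 4))
    }
}
```

Grouping by `r.prov` alone instead of the `(prov, userId)` pair would give one row per region, which is closer to the wording of the task statement.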

Problems encountered

Caused by: java.lang.ArrayIndexOutOfBoundsException: 1

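This was a silly mistake. An index of 1 being out of bounds usually means split returned a single-element array, i.e. the delimiter passed to split did not match the one actually present in the line.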
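
For reference, the snippet below reproduces this kind of failure in plain Scala. It is a sketch under the assumption that the exception came from indexing the result of `split`; the sample line and the names `SplitPitfall` and `fieldAt` are hypothetical.

```scala
object SplitPitfall {
  // A sample tab-separated row (made up for illustration).
  val line: String = "1001\t2002\t1"

  // Correct: "\t" is the tab escape, so the line splits into three fields.
  val byTab: Array[String] = line.split("\t")

  // Wrong: "/t" is a literal slash followed by the letter t. Nothing in the
  // line matches, so split returns the whole line as a single element and
  // accessing index 1 would throw ArrayIndexOutOfBoundsException.
  val bySlashT: Array[String] = line.split("/t")

  // Safe accessor: returns None instead of throwing when the index is missing.
  def fieldAt(fields: Array[String], i: Int): Option[String] =
    if (i >= 0 && i < fields.length) Some(fields(i)) else None
}
```

Checking `fields.length` before indexing, or using an accessor like `fieldAt`, turns the crash into a value that can be logged or filtered out.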