In Spark, one column holds a list per row, and we want to reduce each list to the product of its elements: given the third column in the table below, compute the rightmost column.
+---+---+---------+---+
|  r|  o|        w| s2|
+---+---+---------+---+
|  1|  2|      [3]|  3|
|  1|  2|[4, 5, 6]|120|
+---+---+---------+---+
Two methods are shown below.
from functools import reduce
import operator

import numpy as np

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master('local').getOrCreate()

# input: column 'w' holds a list; output: the product of its elements per row
df1 = spark.createDataFrame(((1, 2, [3]), (1, 2, [4, 5, 6])), ['r', 'o', 'w'])

# Method 1: fold the list with reduce and operator.mul
# https://stackoverflow.com/questions/595374/whats-the-function-like-sum-but-for-multiplication-product#answer-48648756
def prod(iterable):
    return reduce(operator.mul, iterable, 1)

p1 = udf(lambda s: prod(s), IntegerType())
df4 = df1.withColumn('s', p1(df1.w))

# Method 2: numpy.prod; tolist() converts the numpy scalar back to a plain Python int
# https://stackoverflow.com/questions/51283931/apply-function-on-list-in-pyspark-column#answer-51284282
p2 = udf(lambda x: np.prod(x).tolist() if x is not None else None, IntegerType())
df5 = df1.withColumn('s2', p2(df1.w))
Collecting the result (df5) gives the following output:
Out[3]: [Row(r=1, o=2, w=[3], s2=3), Row(r=1, o=2, w=[4, 5, 6], s2=120)]
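As a side note, on Spark 2.4 and later the same row-wise product can be computed without a Python UDF by folding the array with the aggregate higher-order function inside a SQL expression. The sketch below assumes that Spark version, reuses df1 from the snippet above, and uses an illustrative column name s3:

from pyspark.sql.functions import expr

# aggregate(array, start, merge) folds each row's array with a SQL lambda,
# staying inside the JVM (no Python UDF serialization round trip).
# The start value is cast to bigint because createDataFrame stores the
# Python ints in 'w' as bigint, and the accumulator type must match.
df6 = df1.withColumn('s3', expr('aggregate(w, cast(1 as bigint), (acc, x) -> acc * x)'))
df6.show()

Because the fold runs as a built-in expression, it avoids the per-row Python overhead that both UDF versions pay.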
Keywords:
element wise multiply
product of row wise list
apply map reduce agg
