In Spark, one column holds a list in each row, and we want to reduce each row's list to a single product: from the third column (w) in the table below, compute the rightmost column (s2), e.g. [4, 5, 6] -> 4*5*6 = 120.
+---+---+---------+---+
| r| o| w| s2|
+---+---+---------+---+
| 1| 2| [3]| 3|
| 1| 2|[4, 5, 6]|120|
+---+---+---------+---+
Two approaches are shown below.
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
spark=SparkSession.builder.master('local').getOrCreate()
sc=spark.sparkContext
# input: the third column ('w') holds a list per row
# output: a new column holding the product of each row's list
df1=spark.createDataFrame(((1,2,[3]),(1,2,[4,5,6])),['r','o','w'])
# https://stackoverflow.com/questions/595374/whats-the-function-like-sum-but-for-multiplication-product#answer-48648756
from functools import reduce
import operator
def prod(iterable):
    return reduce(operator.mul, iterable, 1)
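# e.g. prod([4, 5, 6]) returns 120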
p1=udf(lambda s:prod(s),IntegerType())
df4=df1.withColumn('s',p1(df1.w))
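# df4 now carries an extra column 's' with the product, e.g. Row(r=1, o=2, w=[4, 5, 6], s=120)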
# https://stackoverflow.com/questions/51283931/apply-function-on-list-in-pyspark-column#answer-51284282
import numpy as np
p2=udf(lambda x:np.prod(x).tolist() if x is not None else None,IntegerType())
df5=df1.withColumn('s2',p2(df1.w))
The output (of df5.collect(), since the new column is named s2) is:
Out[3]: [Row(r=1, o=2, w=[3], s2=3), Row(r=1, o=2, w=[4, 5, 6], s2=120)]
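As a side note (not one of the two methods above), on Spark 2.4+ the same per-row product can be computed without a Python UDF by folding the array with the built-in aggregate higher-order function. A minimal sketch, where df6 and s3 are just placeholder names:
from pyspark.sql import functions as F
# aggregate(w, 1, (acc, x) -> acc * x): start from 1 and multiply in each element of w
df6 = df1.withColumn('s3', F.expr('aggregate(w, 1, (acc, x) -> acc * x)'))
df6.show()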
Keywords:
element wise multiply
product of row wise list
apply map reduce agg