In Spark, one column holds a list in each row, and we want to reduce each row's list to a single product: from the third column of the table below, obtain the rightmost column.

    +---+---+---------+---+
    |  r|  o|        w| s2|
    +---+---+---------+---+
    |  1|  2|      [3]|  3|
    |  1|  2|[4, 5, 6]|120|
    +---+---+---------+---+
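
    The per-row operation is just the product of the list's elements. As a quick plain-Python check of the target values in the table, here is a minimal sketch (it assumes Python 3.8+, where math.prod is available):

    import math

    # each list collapses to the value shown in the s2 column above
    assert math.prod([3]) == 3
    assert math.prod([4, 5, 6]) == 120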

    Two approaches are given below.

    from pyspark.sql import SparkSession, Row
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.master('local').getOrCreate()
    sc = spark.sparkContext

    # input: column 'w' holds a list per row
    # output: a new column with the product of each list's elements
    df1 = spark.createDataFrame(((1, 2, [3]), (1, 2, [4, 5, 6])), ['r', 'o', 'w'])

    # Method 1: fold the list with reduce and operator.mul
    # https://stackoverflow.com/questions/595374/whats-the-function-like-sum-but-for-multiplication-product#answer-48648756
    from functools import reduce
    import operator

    def prod(iterable):
        return reduce(operator.mul, iterable, 1)

    p1 = udf(lambda s: prod(s), IntegerType())
    df4 = df1.withColumn('s', p1(df1.w))

    # Method 2: numpy.prod inside the UDF
    # https://stackoverflow.com/questions/51283931/apply-function-on-list-in-pyspark-column#answer-51284282
    import numpy as np
    p2 = udf(lambda x: np.prod(x).tolist() if x is not None else None, IntegerType())
    df5 = df1.withColumn('s2', p2(df1.w))

    The output is as follows (these are the rows of df5):

    Out[3]: [Row(r=1, o=2, w=[3], s2=3), Row(r=1, o=2, w=[4, 5, 6], s2=120)]
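
    For completeness, newer Spark can also do this without a Python UDF: the built-in higher-order function pyspark.sql.functions.aggregate folds the array inside the JVM. This is a minimal sketch, assuming Spark >= 3.1 and reusing df1 from above; df6 and the column name s3 are chosen here just for illustration.

    from pyspark.sql import functions as F

    # start the fold at 1 (cast to long to match the array's element type)
    # and multiply each element into the accumulator
    df6 = df1.withColumn(
        's3',
        F.aggregate('w', F.lit(1).cast('long'), lambda acc, x: acc * x)
    )
    df6.show()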

    Keywords:
    element wise multiply
    product of row wise list
    apply map reduce agg