Java class name: com.alibaba.alink.operator.batch.statistics.SummarizerBatchOp
Python class name: SummarizerBatchOp

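Overview

The whole-table summarizer computes statistics over an entire table: count, numValidValue (number of valid values), numMissingValue (number of missing values), sum, mean, standardDeviation, variance, min, max, normL1 (L1 norm), and normL2 (L2 norm).
Call collectSummary to obtain a TableSummary and read individual results from it, or print the summary directly.
In addition, the statistics of the output table of any BatchOp can be obtained directly, as described in the Usage section.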
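To make the statistics listed above concrete, here is a minimal pure-Python sketch of how each one can be computed for a single column. It is not Alink's implementation. It assumes that `None` marks a missing value, that missing values are excluded from everything except `count`, and that the variance uses the sample (n - 1) denominator, which is consistent with the sample output on this page. The function name `summarize_column` and the input data are hypothetical.

```python
import math


def summarize_column(values):
    """Compute summary statistics for one column; None marks a missing value."""
    valid = [v for v in values if v is not None]
    n = len(valid)
    total = sum(valid)
    mean = total / n if n else float("nan")
    # Sample variance (n - 1 denominator): an assumption of this sketch.
    # Returning 0.0 for fewer than two valid values is also a choice made here.
    variance = sum((v - mean) ** 2 for v in valid) / (n - 1) if n > 1 else 0.0
    return {
        "count": len(values),
        "numValidValue": n,
        "numMissingValue": len(values) - n,
        "sum": total,
        "mean": mean,
        "variance": variance,
        "standardDeviation": math.sqrt(variance),
        "min": min(valid) if valid else None,
        "max": max(valid) if valid else None,
        "normL1": sum(abs(v) for v in valid),
        "normL2": math.sqrt(sum(v * v for v in valid)),
    }


# Illustrative data, not taken from the Alink documentation.
stats = summarize_column([1.0, None, 3.0])
print(stats["count"], stats["numValidValue"], stats["numMissingValue"])
print(stats["mean"], stats["variance"], stats["normL1"], stats["normL2"])
```

The dictionary keys mirror the method names of TableSummary so that the sketch can be compared with the real operator's output column by column.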

Usage

  • Print the summary.

        summary = summarizer.linkFrom(source).collectSummary()
        print(summary)

  • Read individual statistics.

        summary = summarizer.linkFrom(source).collectSummary()
        print(summary.sum('f_double'))
        print(summary.mean('f_double'))
        print(summary.variance('f_double'))
        print(summary.standardDeviation('f_double'))
        print(summary.min('f_double'))
        print(summary.max('f_double'))
        print(summary.normL1('f_double'))
        print(summary.normL2('f_double'))
        print(summary.numValidValue('f_double'))
        print(summary.numMissingValue('f_double'))

  • Print statistics of an Op's output table.

        source.lazyPrintStatistics()
        BatchOperator.execute()

  • Get the TableSummary of an Op's output table.

        summary = source.collectStatistics()

Parameters

| Name | Display name | Description | Type | Required? | Value range | Default |
| --- | --- | --- | --- | --- | --- | --- |
| selectedCols | Selected column names | Names of the columns to compute statistics for | String[] | | | null |

Code Examples

Python code

    from pyalink.alink import *
    import pandas as pd

    useLocalEnv(1)

    # Four rows with one string, two integer, one double, and one boolean column.
    df = pd.DataFrame([
        ["a", 1, 1, 2.0, True],
        ["c", 1, 2, -3.0, True],
        ["a", 2, 2, 2.0, False],
        ["c", 0, 0, 0.0, False]
    ])

    source = BatchOperator.fromDataframe(df, schemaStr='f_string string, f_long long, f_int int, f_double double, f_boolean boolean')

    # Summarize only the numeric columns.
    summarizer = SummarizerBatchOp()\
        .setSelectedCols(["f_long", "f_int", "f_double"])

    summary = summarizer.linkFrom(source).collectSummary()
    print(summary)

Java code

    package com.alibaba.alink.operator.batch.statistics;

    import org.apache.flink.types.Row;

    import com.alibaba.alink.operator.batch.source.MemSourceBatchOp;
    import com.alibaba.alink.operator.common.statistics.basicstatistic.TableSummary;
    import com.alibaba.alink.testutil.AlinkTestBase;
    import org.junit.Assert;
    import org.junit.Test;

    import java.util.Arrays;

    public class SummarizerBatchOpTest extends AlinkTestBase {

        @Test
        public void test() {
            // Each column contains one null so that missing-value counts are exercised.
            Row[] testArray =
                new Row[] {
                    Row.of("a", 1L, 1, 2.0, true),
                    Row.of(null, 2L, 2, -3.0, true),
                    Row.of("c", null, null, 2.0, false),
                    Row.of("a", 0L, 0, null, null),
                };

            String[] colNames = new String[] {"f_string", "f_long", "f_int", "f_double", "f_boolean"};

            MemSourceBatchOp source = new MemSourceBatchOp(Arrays.asList(testArray), colNames);

            SummarizerBatchOp summarizer = new SummarizerBatchOp()
                .setSelectedCols("f_double", "f_int");

            summarizer.linkFrom(source);

            TableSummary srt = summarizer.collectSummary();

            System.out.println(srt.toString());

            // JUnit convention: expected value first, actual value second.
            Assert.assertEquals(2, srt.getColNames().length);
            Assert.assertEquals(4, srt.count());
            Assert.assertEquals(1, srt.numMissingValue("f_double"), 10e-4);
            Assert.assertEquals(3, srt.numValidValue("f_double"), 10e-4);
            Assert.assertEquals(2.0, srt.max("f_double"), 10e-4);
            Assert.assertEquals(0.0, srt.min("f_int"), 10e-4);
            Assert.assertEquals(0.3333333333333333, srt.mean("f_double"), 10e-4);
            Assert.assertEquals(8.333333333333334, srt.variance("f_double"), 10e-4);
            Assert.assertEquals(2.886751345948129, srt.standardDeviation("f_double"), 10e-4);
            Assert.assertEquals(7.0, srt.normL1("f_double"), 10e-4);
            Assert.assertEquals(4.123105625617661, srt.normL2("f_double"), 10e-4);
        }
    }
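The values asserted in the Java test can be cross-checked by hand. The following sketch recomputes them for the `f_double` column in plain Python, without Alink. It assumes that the null cell is skipped and that the variance uses the n - 1 denominator, which is what the asserted value 8.3333 implies for three valid values.

```python
import math

# f_double column from the Java test; None stands for the null cell.
f_double = [2.0, -3.0, 2.0, None]

valid = [v for v in f_double if v is not None]
num_missing = len(f_double) - len(valid)
mean = sum(valid) / len(valid)
# Sample variance (n - 1 denominator), inferred from the asserted value.
variance = sum((v - mean) ** 2 for v in valid) / (len(valid) - 1)
std_dev = math.sqrt(variance)
norm_l1 = sum(abs(v) for v in valid)
norm_l2 = math.sqrt(sum(v * v for v in valid))

print(len(f_double), len(valid), num_missing)
print(mean, variance, std_dev, norm_l1, norm_l2)
```

Under these assumptions, each value agrees with the corresponding assertion within the test's tolerance of 10e-4.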

Results

Summary:

| colName | count | missing | sum | mean | variance | min | max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| f_long | 4 | 0 | 4 | 1 | 0.6667 | 0 | 2 |
| f_int | 4 | 0 | 5 | 1.25 | 0.9167 | 0 | 2 |
| f_double | 4 | 0 | 1 | 0.25 | 5.5833 | -3 | 2 |
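The table corresponds to the data of the Python example. As a sanity check, the sketch below rebuilds the same rows with Python's standard `statistics` module instead of Alink. `statistics.variance` computes the sample variance, so matching values suggest, but do not prove, that the operator uses the same definition.

```python
import statistics

# Numeric columns of the DataFrame used in the Python example.
columns = {
    "f_long": [1, 1, 2, 0],
    "f_int": [1, 2, 2, 0],
    "f_double": [2.0, -3.0, 2.0, 0.0],
}

rows = []
for name, values in columns.items():
    rows.append((
        name,
        len(values),                            # count
        sum(values),                            # sum
        statistics.mean(values),                # mean
        round(statistics.variance(values), 4),  # sample variance, 4 decimals
        min(values),
        max(values),
    ))

for row in rows:
    print(row)
```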