Java class name: com.alibaba.alink.operator.batch.statistics.SummarizerBatchOp
Python class name: SummarizerBatchOp
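Overview
The table summarizer computes statistics over a whole table: count (number of rows), numValidValue (number of valid values), numMissingValue (number of missing values), sum, mean, standardDeviation (standard deviation), variance, min (minimum), max (maximum), normL1 (L1 norm), and normL2 (L2 norm).
Call collectSummary to obtain a TableSummary, then read individual results from it; the TableSummary can also be printed directly.
In addition, statistics for the output table of any BatchOp can be obtained directly, as shown in the usage notes below.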
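To make the listed statistics concrete, here is a minimal plain-Python sketch of how they could be computed for a single numeric column. `summarize_column` is a hypothetical helper, not part of Alink. It assumes that `None` marks a missing value, that count includes missing rows, and that variance uses the sample divisor (n - 1) over valid values; these assumptions are consistent with the assertions in the Java test in this document, but the actual Alink implementation may differ in edge cases (for example, columns with fewer than two valid values).

```python
import math


def summarize_column(values):
    """Compute the statistics named in the overview for one numeric column.

    Hypothetical helper for illustration only; None stands for a missing value.
    """
    valid = [v for v in values if v is not None]
    n = len(valid)
    total = sum(valid)
    mean = total / n if n else float("nan")
    # Assumption: sample variance (divisor n - 1) over the valid values.
    # Returning NaN for n < 2 is this sketch's choice, not documented behavior.
    variance = (
        sum((v - mean) ** 2 for v in valid) / (n - 1) if n > 1 else float("nan")
    )
    return {
        "count": len(values),                 # all rows, including missing ones
        "numValidValue": n,
        "numMissingValue": len(values) - n,
        "sum": total,
        "mean": mean,
        "variance": variance,
        "standardDeviation": math.sqrt(variance) if n > 1 else float("nan"),
        "min": min(valid) if valid else None,
        "max": max(valid) if valid else None,
        "normL1": sum(abs(v) for v in valid),
        "normL2": math.sqrt(sum(v * v for v in valid)),
    }


# The f_double column used by the Java test in this document.
stats = summarize_column([2.0, -3.0, 2.0, None])
for name, value in stats.items():
    print(name, value)
```

For this column the sketch yields a mean of about 0.3333, a variance of about 8.3333, an L1 norm of 7.0, and an L2 norm of about 4.1231, which are the values the Java test asserts.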
Usage
- Print the summary
summary = summarizer.linkFrom(source).collectSummary()
print(summary)
- Retrieve individual statistics
summary = summarizer.linkFrom(source).collectSummary()
print(summary.sum('f_double'))
print(summary.mean('f_double'))
print(summary.variance('f_double'))
print(summary.standardDeviation('f_double'))
print(summary.min('f_double'))
print(summary.max('f_double'))
print(summary.normL1('f_double'))
print(summary.normL2('f_double'))
print(summary.numValidValue('f_double'))
print(summary.numMissingValue('f_double'))
- Print statistics for an operator's output table
source.lazyPrintStatistics()
BatchOperator.execute()
- Get the TableSummary of an operator's output table
summary = source.collectStatistics()
Parameters
| Name | Chinese name | Description | Type | Required? | Value range | Default |
| --- | --- | --- | --- | --- | --- | --- |
| selectedCols | 选中的列名数组 | Names of the columns to compute statistics for | String[] | | | null |
Code examples
Python code
from pyalink.alink import *
import pandas as pd
useLocalEnv(1)
df = pd.DataFrame([
["a", 1, 1,2.0, True],
["c", 1, 2, -3.0, True],
["a", 2, 2,2.0, False],
["c", 0, 0, 0.0, False]
])
source = BatchOperator.fromDataframe(df, schemaStr='f_string string, f_long long, f_int int, f_double double, f_boolean boolean')
summarizer = SummarizerBatchOp()\
.setSelectedCols(["f_long", "f_int", "f_double"])
summary = summarizer.linkFrom(source).collectSummary()
print(summary)
Java code
package com.alibaba.alink.operator.batch.statistics;
import org.apache.flink.types.Row;
import com.alibaba.alink.operator.batch.BatchOperator;
import com.alibaba.alink.operator.batch.source.MemSourceBatchOp;
import com.alibaba.alink.operator.common.statistics.basicstatistic.TableSummary;
import com.alibaba.alink.testutil.AlinkTestBase;
import org.junit.Assert;
import org.junit.Test;
import java.util.Arrays;
public class SummarizerBatchOpTest extends AlinkTestBase {

    @Test
    public void test() {
        // Sample rows; null marks a missing value.
        Row[] testArray =
            new Row[] {
                Row.of("a", 1L, 1, 2.0, true),
                Row.of(null, 2L, 2, -3.0, true),
                Row.of("c", null, null, 2.0, false),
                Row.of("a", 0L, 0, null, null),
            };

        String[] colNames = new String[] {"f_string", "f_long", "f_int", "f_double", "f_boolean"};
        MemSourceBatchOp source = new MemSourceBatchOp(Arrays.asList(testArray), colNames);

        // Summarize only the selected columns.
        SummarizerBatchOp summarizer = new SummarizerBatchOp()
            .setSelectedCols("f_double", "f_int");
        summarizer.linkFrom(source);

        TableSummary srt = summarizer.collectSummary();
        System.out.println(srt.toString());

        // JUnit convention: expected value first, then the actual value.
        Assert.assertEquals(2, srt.getColNames().length);
        Assert.assertEquals(4, srt.count());
        Assert.assertEquals(1, srt.numMissingValue("f_double"), 10e-4);
        Assert.assertEquals(3, srt.numValidValue("f_double"), 10e-4);
        Assert.assertEquals(2.0, srt.max("f_double"), 10e-4);
        Assert.assertEquals(0.0, srt.min("f_int"), 10e-4);
        Assert.assertEquals(0.3333333333333333, srt.mean("f_double"), 10e-4);
        Assert.assertEquals(8.333333333333334, srt.variance("f_double"), 10e-4);
        Assert.assertEquals(2.886751345948129, srt.standardDeviation("f_double"), 10e-4);
        Assert.assertEquals(7.0, srt.normL1("f_double"), 10e-4);
        Assert.assertEquals(4.123105625617661, srt.normL2("f_double"), 10e-4);
    }
}
Output
Summary:
| colName | count | missing | sum | mean | variance | min | max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| f_long | 4 | 0 | 4 | 1 | 0.6667 | 0 | 2 |
| f_int | 4 | 0 | 5 | 1.25 | 0.9167 | 0 | 2 |
| f_double | 4 | 0 | 1 | 0.25 | 5.5833 | -3 | 2 |
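As a sanity check on the table above, the same figures can be reproduced without Alink. The sketch below uses only the Python standard library on the sample rows from the Python example. `check_column` is a hypothetical helper, and the use of `statistics.variance` (sample variance, divisor n - 1) is an assumption that happens to match the reported values.

```python
import statistics

# Sample rows and column names copied from the Python example.
rows = [
    ["a", 1, 1, 2.0, True],
    ["c", 1, 2, -3.0, True],
    ["a", 2, 2, 2.0, False],
    ["c", 0, 0, 0.0, False],
]
col_names = ["f_string", "f_long", "f_int", "f_double", "f_boolean"]


def check_column(name):
    """Recompute the columns of the output table for one numeric column."""
    idx = col_names.index(name)
    values = [row[idx] for row in rows]
    valid = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(valid),
        "sum": sum(valid),
        "mean": statistics.mean(valid),
        # Sample variance (divisor n - 1), rounded like the table.
        "variance": round(statistics.variance(valid), 4),
        "min": min(valid),
        "max": max(valid),
    }


for name in ["f_long", "f_int", "f_double"]:
    print(name, check_column(name))
```

The recomputed sums, means, and rounded variances agree with the three rows of the table, which suggests the reported variance is the sample variance and not the population variance.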