Java class name: com.alibaba.alink.operator.batch.feature.ChiSqSelectorBatchOp
Python class name: ChiSqSelectorBatchOp

Description

Selects features from table data using a chi-square test of each candidate column against the label column.

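Parameters

| Name | Display name | Description | Type | Required? | Value range | Default |
|---|---|---|---|---|---|---|
| labelCol | Label column name | Name of the label column in the input table | String | | | |
| selectedCols | Selected column names | Names of the columns to evaluate | String[] | | | |
| fdr | False discovery rate threshold | Threshold on the false discovery rate | Double | | | 0.05 |
| fpr | p-value threshold | Threshold on the p-value | Double | | | 0.05 |
| fwe | Error rate threshold | Threshold on the family-wise error rate | Double | | | 0.05 |
| numTopFeatures | Number of top features | Number of top-ranked features to keep | Integer | | | 50 |
| percentile | Selection percentage | Fraction of the features to keep | Double | | | 0.1 |
| selectorType | Selector type | Selection strategy, one of the five listed values | String | | "NumTopFeatures", "PERCENTILE", "FPR", "FDR", "FWE" | "NumTopFeatures" |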
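
To make the five selector types more concrete, here is a minimal pure-Python sketch of how each one could turn a set of p-values into a selection. It assumes semantics similar to other chi-square selectors (top-k, top fraction, plain p-value cutoff, Benjamini-Hochberg for FDR, Bonferroni-style cutoff for FWE). This page does not spell out Alink's exact rules, so treat the function `select_features` and its tie-breaking as hypothetical illustrations, not as the operator's implementation. The sample p-values are the ones printed in the example output on this page.

```python
def select_features(p_values, selector_type="NumTopFeatures",
                    num_top_features=50, percentile=0.1,
                    fpr=0.05, fdr=0.05, fwe=0.05):
    """Hypothetical helper: return the names of the selected features.

    p_values maps feature name -> p-value. The rules below are assumptions
    modelled on common chi-square selectors, not taken from Alink's source.
    """
    # Rank by ascending p-value; sorted() is stable, so ties keep input order.
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    n = len(ranked)
    kind = selector_type.lower()

    if kind == "numtopfeatures":
        chosen = ranked[:num_top_features]
    elif kind == "percentile":
        chosen = ranked[:int(n * percentile)]
    elif kind == "fpr":
        chosen = [kv for kv in ranked if kv[1] < fpr]
    elif kind == "fdr":
        # Benjamini-Hochberg: keep everything up to the largest rank k
        # whose p-value satisfies p_(k) <= fdr * k / n.
        cutoff = 0
        for k, (_, p) in enumerate(ranked, start=1):
            if p <= fdr * k / n:
                cutoff = k
        chosen = ranked[:cutoff]
    elif kind == "fwe":
        # Bonferroni-style correction: compare each p-value with fwe / n.
        chosen = [kv for kv in ranked if kv[1] < fwe / n]
    else:
        raise ValueError("unknown selector type: " + selector_type)

    return [name for name, _ in chosen]


# p-values copied from the example output on this page.
sample = {"f_long": 0.1353, "f_int": 0.3679, "f_double": 0.3679, "f_string": 1.0}

print(select_features(sample, "NumTopFeatures", num_top_features=2))
print(select_features(sample, "PERCENTILE", percentile=0.5))
print(select_features(sample, "FPR", fpr=0.2))
print(select_features(sample, "FDR", fdr=0.5))
print(select_features(sample, "FWE", fwe=0.2))
```

Under these assumptions only the threshold that matches the chosen selector type has any effect, which is why the operator exposes one parameter per type.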

Code Examples

Python Code

    from pyalink.alink import *
    import pandas as pd

    useLocalEnv(1)

    # Sample data: four candidate feature columns and a boolean label column.
    df = pd.DataFrame([
        ["a", 1, 1, 2.0, True],
        ["c", 1, 2, -3.0, True],
        ["a", 2, 2, 2.0, False],
        ["c", 0, 0, 0.0, False]
    ])

    source = BatchOperator.fromDataframe(df, schemaStr='f_string string, f_long long, f_int int, f_double double, f_boolean boolean')

    # Keep the 2 top-ranked features, tested against the label column.
    selector = ChiSqSelectorBatchOp()\
        .setSelectedCols(["f_string", "f_long", "f_int", "f_double"])\
        .setLabelCol("f_boolean")\
        .setNumTopFeatures(2)

    selector.linkFrom(source)

    # Collect the model summary and print the names of the selected columns.
    modelInfo: ChisqSelectorModelInfo = selector.collectModelInfo()
    print(modelInfo.getColNames())

Java Code

    import org.apache.flink.types.Row;

    import com.alibaba.alink.operator.batch.BatchOperator;
    import com.alibaba.alink.operator.batch.feature.ChiSqSelectorBatchOp;
    import com.alibaba.alink.operator.batch.source.MemSourceBatchOp;
    import com.alibaba.alink.operator.common.feature.ChisqSelectorModelInfo;
    import org.junit.Test;

    import java.util.Arrays;
    import java.util.List;

    public class ChiSqSelectorBatchOpTest {
        @Test
        public void testChiSqSelectorBatchOp() throws Exception {
            // Sample data: four candidate feature columns and a boolean label column.
            List<Row> df = Arrays.asList(
                Row.of("a", 1L, 1, 2.0, true),
                Row.of("c", 1L, 2, -3.0, true),
                Row.of("a", 2L, 2, 2.0, false),
                Row.of("c", 0L, 0, 0.0, false)
            );
            BatchOperator<?> source = new MemSourceBatchOp(df,
                "f_string string, f_long long, f_int int, f_double double, f_boolean boolean");

            // Keep the 2 top-ranked features, tested against the label column.
            ChiSqSelectorBatchOp selector = new ChiSqSelectorBatchOp()
                .setSelectedCols("f_string", "f_long", "f_int", "f_double")
                .setLabelCol("f_boolean")
                .setNumTopFeatures(2);
            selector.linkFrom(source);

            // Collect the model summary and print it.
            ChisqSelectorModelInfo modelInfo = selector.collectModelInfo();
            System.out.println(modelInfo.toString());
        }
    }

Output

    ------------------------- ChisqSelectorModelInfo -------------------------
    Number of Selector Features: 2
    Number of Features: 4
    Type of Selector: NumTopFeatures
    Number of Top Features: 2
    Selector Indices:
    | ColName|ChiSquare|PValue| DF|Selected|
    |--------|---------|------|---|--------|
    |  f_long|        4|0.1353|  2|    true|
    |   f_int|        2|0.3679|  2|    true|
    |f_double|        2|0.3679|  2|   false|
    |f_string|        0|     1|  1|   false|
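
The ChiSquare, PValue and DF columns can be checked by hand with Pearson's chi-square test of independence, treating every distinct value of a column as a category. The following standard-library sketch does that for the four sample rows. The helpers `chi_square` and `chi2_p_value` are illustrative names and not part of Alink, and the p-value helper only covers the closed-form cases needed here (one degree of freedom, or an even number of degrees of freedom).

```python
import math
from collections import Counter


def chi_square(feature, label):
    """Pearson chi-square statistic and degrees of freedom for two columns."""
    n = len(feature)
    observed = Counter(zip(feature, label))   # contingency table cells
    feature_totals = Counter(feature)         # row totals
    label_totals = Counter(label)             # column totals

    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n
            stat += (observed[(f, l)] - expected) ** 2 / expected

    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return stat, dof


def chi2_p_value(stat, dof):
    """Upper-tail chi-square probability, closed forms only (sketch)."""
    half = stat / 2.0
    if dof == 1:
        return math.erfc(math.sqrt(half))
    if dof > 0 and dof % 2 == 0:
        # exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    raise NotImplementedError("this sketch handles dof == 1 or even dof only")


# The four sample rows from the examples, column by column.
columns = {
    "f_string": ["a", "c", "a", "c"],
    "f_long": [1, 1, 2, 0],
    "f_int": [1, 2, 2, 0],
    "f_double": [2.0, -3.0, 2.0, 0.0],
}
label = [True, True, False, False]

for name, values in columns.items():
    stat, dof = chi_square(values, label)
    print(name, stat, round(chi2_p_value(stat, dof), 4), dof)
```

For this data the sketch should agree with the table: f_long has the largest statistic and the smallest p-value, f_int and f_double tie, and f_string is independent of the label.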