Java class name: com.alibaba.alink.operator.batch.dataproc.AggLookupBatchOp
Python class name: AggLookupBatchOp

Description

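Use the aggregate lookup when you need to look up several values and then take the sum, average, maximum, or minimum of the results, or concatenate them. The component takes two inputs, in this order: the model table and the raw data table. The model table has two columns, in order a String column and a DenseVector column. The raw data table may have any number of rows and columns, and every column is of type String. By default, words in the raw data are separated by spaces.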
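The lookup-and-aggregate idea can be sketched in plain Python. This is only an illustration, not Alink's implementation: the helper names `parse_vector` and `agg_lookup` are hypothetical, skipping tokens that are missing from the model is an assumption, and the meaning of `CONCAT(col, n)` (concatenate the vectors of the first `n` tokens) is inferred from the sample output on this page. The vectors are a subset of the sample model used in the code examples.

```python
# Plain-Python illustration of lookup-and-aggregate; this is not Alink code.

def parse_vector(text):
    """Turn a comma-separated string such as "0.1,0.2" into a list of floats."""
    return [float(x) for x in text.split(",")]


def agg_lookup(model, sentence, delimiter=" ", concat_n=2):
    """Look up every token of `sentence` in `model` and aggregate the vectors.

    Assumption: tokens that are missing from the model are skipped.
    Returns None when no token is found.
    """
    vectors = [model[tok] for tok in sentence.split(delimiter) if tok in model]
    if not vectors:
        return None
    columns = list(zip(*vectors))  # one tuple per vector dimension
    total = [sum(col) for col in columns]
    return {
        # vectors of the first `concat_n` tokens, joined end to end
        "concat": [x for vec in vectors[:concat_n] for x in vec],
        "sum": total,
        "avg": [s / len(vectors) for s in total],
        "max": [max(col) for col in columns],
        "min": [min(col) for col in columns],
    }


model = {
    "the": parse_vector("0.6343,0.8561,0.1249,0.4701"),
    "training": parse_vector("0.2753,0.2444,0.3699,0.6048"),
    "speed": parse_vector("0.9504,0.3168,0.7484,0.6965"),
    "is": parse_vector("0.8221,0.0487,-0.0065,0.4088"),
    "significantly": parse_vector("-0.0465,0.6597,0.0906,0.7137"),
    "improved": parse_vector("0.1910,0.0723,0.8216,0.4367"),
}

features = agg_lookup(model, "the training speed is significantly improved")
print([round(x, 4) for x in features["sum"]])
print(features["concat"])
```

Each aggregate is computed element-wise over the vector dimensions, so every result except `concat` has the same length as a single word vector.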

Usage

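In machine learning, this component can be used for data processing and feature generation when working with embedding results, for example loading a trained word-vector model to vectorize text and generate features. In the example below, modelOp is a word-vector dictionary and inOp holds three English sentences. The output is the sentence feature vectors obtained by concatenation, simple summation, averaging, and similar methods.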

Parameters

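| Name | Label | Description | Type | Required? | Value range | Default |
| --- | --- | --- | --- | --- | --- | --- |
| clause | Operation clause | Operation clause | String | | | |
| delimiter | Delimiter | Used to split the string | String | | | " " |
| modelFilePath | Model file path | File path of the model | String | | | null |
| reservedCols | Reserved column names | Columns reserved by the algorithm | String[] | | | null |
| numThreads | Number of threads | Number of threads used by the component | Integer | | | 1 |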
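The clause used in the code examples on this page is a comma-separated list of items of the form `FUNC(column[, n]) as alias`. The sketch below shows one way to read such a string. The format is inferred only from that example clause, and `parse_clause` is a hypothetical helper written for illustration; Alink parses the clause internally and may accept forms that this sketch rejects.

```python
import re

# Hypothetical pattern for one clause item, e.g. "CONCAT(sentence,2) as concat".
CLAUSE_ITEM = re.compile(
    r"^\s*(?P<func>\w+)\(\s*(?P<col>\w+)\s*(?:,\s*(?P<arg>\d+)\s*)?\)"
    r"\s+as\s+(?P<alias>\w+)\s*$",
    re.IGNORECASE,
)


def parse_clause(clause):
    """Split a clause into (function, column, argument, alias) tuples."""
    items = []
    # Split only on commas that are not inside parentheses.
    for part in re.split(r",(?![^()]*\))", clause):
        match = CLAUSE_ITEM.match(part)
        if match is None:
            raise ValueError(f"unrecognized clause item: {part!r}")
        arg = match.group("arg")
        items.append((
            match.group("func").upper(),
            match.group("col"),
            int(arg) if arg else None,
            match.group("alias"),
        ))
    return items


clause = ("CONCAT(sentence,2) as concat, AVG(sentence) as avg, "
          "SUM(sentence) as sum,MAX(sentence) as max,MIN(sentence) as min")
parsed = parse_clause(clause)
for item in parsed:
    print(item)
```

Read this way, each alias becomes the name of one output column, which matches the column headers shown in the results on this page.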

Code Examples

Python Code

    from pyalink.alink import *
    import pandas as pd

    useLocalEnv(1)

    # Raw data: one sentence per row.
    df = pd.DataFrame([
        ["the quality of the word vectors increases"],
        ["amount of the training data increases"],
        ["the training speed is significantly improved"]
    ])
    inOp = BatchOperator.fromDataframe(df, schemaStr='sentence string')

    # Model data: each word and its vector.
    df2 = pd.DataFrame([
        ["the", "0.6343,0.8561,0.1249,0.4701"],
        ["training", "0.2753,0.2444,0.3699,0.6048"],
        ["of", "0.3160,0.3675,0.1649,0.4116"],
        ["increases", "1.0372,0.6092,0.1050,0.2630"],
        ["word", "0.9911,0.6338,0.4570,0.8451"],
        ["vectors", "0.8780,0.4500,0.5455,0.7495"],
        ["speed", "0.9504,0.3168,0.7484,0.6965"],
        ["significantly", "-0.0465,0.6597,0.0906,0.7137"],
        ["quality", "0.9745,0.7521,0.8874,0.5192"],
        ["is", "0.8221,0.0487,-0.0065,0.4088"],
        ["improved", "0.1910,0.0723,0.8216,0.4367"],
        ["data", "0.8985,0.0117,0.8083,0.9636"],
        ["amount", "0.9786,0.1470,0.7385,0.8856"]
    ])
    modelOp = BatchOperator.fromDataframe(df2, schemaStr="id string, vec string")

    # The model table is the first input, the raw data table the second.
    AggLookupBatchOp() \
        .setClause("CONCAT(sentence,2) as concat, AVG(sentence) as avg, SUM(sentence) as sum,MAX(sentence) as max,MIN(sentence) as min") \
        .setDelimiter(" ") \
        .setReservedCols([]) \
        .linkFrom(modelOp, inOp) \
        .print()

Java Code

    import org.apache.flink.types.Row;

    import com.alibaba.alink.operator.batch.BatchOperator;
    import com.alibaba.alink.operator.batch.dataproc.AggLookupBatchOp;
    import com.alibaba.alink.operator.batch.source.MemSourceBatchOp;
    import org.junit.Test;

    import java.util.Arrays;
    import java.util.List;

    public class AggLookupBatchOpTest {
        @Test
        public void testAggLookupBatchOp() throws Exception {
            // Raw data: one sentence per row.
            List<Row> df = Arrays.asList(
                Row.of("the quality of the word vectors increases"),
                Row.of("amount of the training data increases"),
                Row.of("the training speed is significantly improved")
            );
            BatchOperator<?> inOp = new MemSourceBatchOp(df, "sentence string");

            // Model data: each word and its vector.
            List<Row> df2 = Arrays.asList(
                Row.of("the", "0.6343,0.8561,0.1249,0.4701"),
                Row.of("training", "0.2753,0.2444,0.3699,0.6048"),
                Row.of("of", "0.3160,0.3675,0.1649,0.4116"),
                Row.of("increases", "1.0372,0.6092,0.1050,0.2630"),
                Row.of("word", "0.9911,0.6338,0.4570,0.8451"),
                Row.of("vectors", "0.8780,0.4500,0.5455,0.7495"),
                Row.of("speed", "0.9504,0.3168,0.7484,0.6965"),
                Row.of("significantly", "-0.0465,0.6597,0.0906,0.7137"),
                Row.of("quality", "0.9745,0.7521,0.8874,0.5192"),
                Row.of("is", "0.8221,0.0487,-0.0065,0.4088"),
                Row.of("improved", "0.1910,0.0723,0.8216,0.4367"),
                Row.of("data", "0.8985,0.0117,0.8083,0.9636"),
                Row.of("amount", "0.9786,0.1470,0.7385,0.8856")
            );
            BatchOperator<?> modelOp = new MemSourceBatchOp(df2, "id string, vec string");

            // The model table is the first input, the raw data table the second.
            AggLookupBatchOp aggLookupBatchOp = new AggLookupBatchOp()
                .setClause(
                    "CONCAT(sentence,2) as concat, AVG(sentence) as avg, SUM(sentence) as sum,"
                        + "MAX(sentence) as max,MIN(sentence) as min")
                .setDelimiter(" ")
                .linkFrom(modelOp, inOp);

            // Select the output columns by the aliases defined in the clause.
            aggLookupBatchOp.select(new String[] {"concat"})
                .print();
            aggLookupBatchOp.select(new String[] {"avg", "sum", "max", "min"})
                .print();
        }
    }

Results

| concat |
| --- |
| 0.6343 0.8561 0.1249 0.4701 0.9745 0.7521 0.8874 0.5192 |
| 0.9786 0.147 0.7385 0.8856 0.316 0.3675 0.1649 0.4116 |
| 0.6343 0.8561 0.1249 0.4701 0.2753 0.2444 0.3699 0.6048 |

| avg | sum | max | min |
| --- | --- | --- | --- |
| 0.7807 0.6464 0.3442 0.5326 | 5.4654 4.5248 2.4096 3.7286 | 1.0372 0.8561 0.8874 0.8451 | 0.316 0.3675 0.105 0.263 |
| 0.6899 0.3726 0.3852 0.5997 | 4.1399 2.2359 2.3115 3.5987 | 1.0372 0.8561 0.8083 0.9636 | 0.2753 0.0117 0.105 0.263 |
| 0.4710 0.3663 0.3581 0.5550 | 2.8266 2.1980 2.1489 3.3306 | 0.9504 0.8561 0.8216 0.7137 | -0.0465 0.0487 -0.0065 0.4088 |