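The transformation operators covered so far cannot access an event's timestamp or the current watermark, yet some applications depend on exactly that information. For example, a map operator implemented with MapFunction cannot read the timestamp (event time) of the element it is processing.
For this reason, the DataStream API provides a family of low-level transformation operators that can access timestamps and watermarks and register timers. They can also emit special events, such as timeout events.
ProcessFunction is used to build event-driven applications and to implement custom business logic. Flink provides 8 ProcessFunctions:
- ProcessFunction: commonly used
- KeyedProcessFunction: commonly used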
- CoProcessFunction
- ProcessJoinFunction
- BroadcastProcessFunction
- KeyedBroadcastProcessFunction
- ProcessWindowFunction
- ProcessAllWindowFunction
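1. KeyedProcessFunction
KeyedProcessFunction operates on a KeyedStream. It is invoked for every element of the stream and emits zero, one, or more elements. All ProcessFunctions implement the RichFunction interface, so they all have open(), close(), and getRuntimeContext(). KeyedProcessFunction additionally provides two methods:
- processElement(v: IN, ctx: Context, out: Collector[OUT]): called for every element in the stream; results are emitted through the Collector. The Context gives access to the element's timestamp, the element's key, and the TimerService. The Context can also emit records to other streams (side outputs).
- onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[OUT]): a callback invoked when a previously registered timer fires. The timestamp parameter is the time the timer was set to fire at, and the Collector emits the results. Like the Context argument of processElement, OnTimerContext provides contextual information.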
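The relationship between registering a timer and the later onTimer callback can be sketched without Flink at all. The block below is a plain-Java model, not Flink's implementation: `TimerModel`, `register`, `delete`, and `advanceTo` are hypothetical names chosen for illustration. It assumes the behavior that timers are scoped to a key, that a key holds at most one timer per timestamp, and that due timers fire in timestamp order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java model of keyed timers; hypothetical, not Flink's TimerService. */
class TimerModel {
    // timestamp -> keys that have a timer registered at that timestamp
    private final TreeMap<Long, TreeSet<String>> timers = new TreeMap<>();

    /** Registers a timer; a second registration for the same key and timestamp is a no-op. */
    void register(String key, long timestamp) {
        timers.computeIfAbsent(timestamp, t -> new TreeSet<>()).add(key);
    }

    /** Deletes a timer; deleting a timer that does not exist is harmless. */
    void delete(String key, long timestamp) {
        TreeSet<String> keys = timers.get(timestamp);
        if (keys != null) {
            keys.remove(key);
            if (keys.isEmpty()) {
                timers.remove(timestamp);
            }
        }
    }

    /** Advances time and returns "key@timestamp" for every timer that came due, oldest first. */
    List<String> advanceTo(long now) {
        List<String> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.firstKey() <= now) {
            Map.Entry<Long, TreeSet<String>> due = timers.pollFirstEntry();
            for (String key : due.getValue()) {
                fired.add(key + "@" + due.getKey()); // stands in for one onTimer call
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        TimerModel timers = new TimerModel();
        timers.register("a", 100L);
        timers.register("a", 100L); // duplicate, ignored
        timers.register("b", 50L);
        timers.register("a", 200L);
        timers.delete("a", 200L);   // cancelled before it comes due
        System.out.println(timers.advanceTo(150L)); // prints [b@50, a@100]
    }
}
```

Each entry returned by `advanceTo` corresponds to one onTimer invocation in the real API, with the timestamp being the value passed as the `timestamp` parameter.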
2. TimerService and Timers
Example:
package com.wells.flink.demo.process;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/**
 * Description ProcessAPI
 * Created by wells on 2020-05-24 17:45:41
 */
public class ProcessAPITest {
    public static void main(String[] args) throws Exception {
        String host = "localhost";
        int port = 9999;

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataStreamSource<String> dataStreamSource = env.socketTextStream(host, port);
        dataStreamSource.print("socketSource");

        SingleOutputStreamOperator<String> processStreamSource = dataStreamSource.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String line) throws Exception {
                String[] split = line.split(" ", -1);
                return new Tuple2<String, Integer>(split[0], Integer.parseInt(split[1]));
            }
        }).keyBy(0).process(new MyKeyedProcessFunction());
        processStreamSource.print("processSource");

        env.execute();
    }
}

class MyKeyedProcessFunction extends KeyedProcessFunction<Tuple, Tuple2<String, Integer>, String> {
    // State holding the number from the previous record of this key (default 0)
    private transient ValueState<Integer> lastNumber;
    // State holding the timestamp of the registered timer so it can be deleted later (default 0 = no timer)
    private transient ValueState<Long> currentTimerTs;

    @Override
    public void open(Configuration parameters) throws Exception {
        lastNumber = getRuntimeContext().getState(new ValueStateDescriptor<Integer>("lastNumber", Integer.class, 0));
        currentTimerTs = getRuntimeContext().getState(new ValueStateDescriptor<Long>("currentTimerTs", Long.class, 0L));
    }

    // Called for every element in the stream
    @Override
    public void processElement(Tuple2<String, Integer> value, Context ctx, Collector<String> out) throws Exception {
        // Read the previous number from state
        Integer preNumber = lastNumber.value();
        // Store the current number for the next call
        lastNumber.update(value.f1);
        // Timestamp of the currently registered timer (0 if none)
        Long currTimerTs = currentTimerTs.value();

        // If the number decreased and no timer is registered yet, register one
        if (value.f1 < preNumber && currTimerTs == 0) {
            // The timer fires 10 s (processing time) after the decrease was observed
            long timerTs = ctx.timerService().currentProcessingTime() + 10000L;
            ctx.timerService().registerProcessingTimeTimer(timerTs);
            currentTimerTs.update(timerTs);
        } else if (value.f1 > preNumber || preNumber == 0) {
            // If the number increased, or the previous number is still the default 0, delete the timer
            ctx.timerService().deleteProcessingTimeTimer(currTimerTs);
            currentTimerTs.clear();
        }
    }

    // Called when the timer fires, i.e. the number did not increase within 10 s of a decrease
    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        out.collect("number has not increased within 10s after decreasing");
        currentTimerTs.clear();
    }
}
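To see what the example does for a concrete input sequence, the register/delete decisions can be replayed with a fake clock. The following is a minimal single-key sketch in plain Java that mirrors the branches of processElement and onTimer above. `DropWatcher`, `onElement`, and `fireDueTimer` are hypothetical names, the `now` argument stands in for currentProcessingTime(), and the caller drives the clock by hand instead of Flink's runtime.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java replay of the example's timer logic for one key; names are hypothetical. */
class DropWatcher {
    private static final long DELAY_MS = 10_000L;

    private int lastNumber = 0; // mirrors the lastNumber state (default 0)
    private long timerTs = 0L;  // mirrors currentTimerTs; 0 means "no timer registered"
    private final List<String> alerts = new ArrayList<>();

    /** Mirrors processElement; 'now' stands in for currentProcessingTime(). */
    void onElement(int value, long now) {
        fireDueTimer(now); // a timer that is already due fires before the element is handled
        int previous = lastNumber;
        lastNumber = value;
        if (value < previous && timerTs == 0L) {
            timerTs = now + DELAY_MS; // register a timer 10 s after the drop
        } else if (value > previous || previous == 0) {
            timerTs = 0L;             // delete the pending timer, if any
        }
    }

    /** Mirrors onTimer: emits one alert once the clock reaches the registered timestamp. */
    void fireDueTimer(long now) {
        if (timerTs != 0L && now >= timerTs) {
            alerts.add("timer set for " + timerTs + " fired: no increase within 10s of the drop");
            timerTs = 0L;
        }
    }

    List<String> alerts() {
        return alerts;
    }

    public static void main(String[] args) {
        DropWatcher watcher = new DropWatcher();
        watcher.onElement(5, 0L);      // first element, no timer
        watcher.onElement(3, 1_000L);  // drop: timer registered for 11000
        watcher.fireDueTimer(11_000L); // clock reaches the timer
        System.out.println(watcher.alerts().size()); // prints 1
    }
}
```

Had a larger number arrived before the 10 seconds were up (for example 7 at time 5000), the second branch would have cleared the timer and no alert would be produced.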
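3. Side Outputs
A stream can be divided inside a ProcessFunction by emitting selected records to a side output. The older split operator can also do this, though it has since been deprecated in favor of side outputs.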
package com.wells.flink.demo.process;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

/**
 * Description Side output stream
 * Created by wells on 2020-05-24 19:06:57
 */
public class SideOutputTest {
    public static void main(String[] args) throws Exception {
        String host = "localhost";
        int port = 9999;

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataStreamSource<String> dataStreamSource = env.socketTextStream(host, port);
        dataStreamSource.print("socketSource");

        SingleOutputStreamOperator<String> processStreamSource = dataStreamSource.map(new MapFunction<String, Integer>() {
            @Override
            public Integer map(String line) throws Exception {
                return Integer.parseInt(line);
            }
        }).process(new MyProcessFunction());
        processStreamSource.print("processSource");

        // Read the side output using a tag with the same id and type as the one used in MyProcessFunction
        processStreamSource.getSideOutput(new OutputTag<String>("numberDecrease", TypeInformation.of(String.class)))
                .print("numberDecrease");

        env.execute();
    }
}

class MyProcessFunction extends ProcessFunction<Integer, String> {
    private transient OutputTag<String> outputTag;

    @Override
    public void open(Configuration parameters) throws Exception {
        outputTag = new OutputTag<String>("numberDecrease", TypeInformation.of(String.class));
    }

    // Called for every element in the stream
    @Override
    public void processElement(Integer value, Context ctx, Collector<String> out) throws Exception {
        // Negative numbers go to the side output; all other numbers go to the main output
        if (value < 0) {
            ctx.output(outputTag, String.valueOf(value));
        } else {
            out.collect(String.valueOf(value));
        }
    }
}
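The routing performed by MyProcessFunction can be modeled in a few lines of plain Java. This is only a sketch of the idea, not Flink code: `SideOutputModel` and its methods are hypothetical, and a tag is reduced to its string id on the assumption (suggested by the example constructing the tag twice with the same id) that the id is what identifies a side output.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of a main output plus tagged side outputs; names are hypothetical. */
class SideOutputModel {
    private final List<String> mainOutput = new ArrayList<>();
    private final Map<String, List<String>> sideOutputs = new HashMap<>();

    /** Stands in for out.collect(value). */
    void collect(String value) {
        mainOutput.add(value);
    }

    /** Stands in for ctx.output(tag, value); the tag is represented by its id only. */
    void output(String tagId, String value) {
        sideOutputs.computeIfAbsent(tagId, id -> new ArrayList<>()).add(value);
    }

    /** Same routing rule as the example: negatives to the side output, the rest to the main output. */
    void process(int value) {
        if (value < 0) {
            output("numberDecrease", String.valueOf(value));
        } else {
            collect(String.valueOf(value));
        }
    }

    List<String> mainOutput() {
        return mainOutput;
    }

    /** Stands in for getSideOutput(tag); an id nobody wrote to yields an empty list. */
    List<String> sideOutput(String tagId) {
        List<String> records = sideOutputs.get(tagId);
        return records == null ? Collections.<String>emptyList() : records;
    }

    public static void main(String[] args) {
        SideOutputModel model = new SideOutputModel();
        for (int value : new int[] {3, -1, 0, -7}) {
            model.process(value);
        }
        System.out.println(model.mainOutput());                 // prints [3, 0]
        System.out.println(model.sideOutput("numberDecrease")); // prints [-1, -7]
    }
}
```

In the real job, declaring the OutputTag once as a shared constant would avoid repeating the id and type in both the function and the main method.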