Iterative algorithms appear in many areas of data analysis, such as machine learning and graph analysis. A Flink program implements an iterative algorithm by defining a step function and embedding it in a special iteration operator. Two variants of this operator exist in practice: Bulk Iterate and Delta Iterate. Both repeatedly invoke the step function on the current iteration state until a termination condition is reached.
|   | Bulk Iterations | Delta Iterations |
|---|---|---|
| Iteration input | Partial Solution | Workset and Solution Set |
| Step function | Arbitrary data flow | Arbitrary data flow |
| State update | Next Partial Solution | Next Workset and changes to the Solution Set |
| Iteration result | Last Partial Solution | Solution Set after the last iteration |
| Termination condition | Maximum number of iterations (default), or a custom convergence check | Maximum number of iterations or an empty Workset (default), or a custom convergence check |
Bulk Iterations
Overview
To create a BulkIteration, call iterate(int) on the DataSet from which the iteration should start; the integer argument is the maximum number of iterations. This returns an IterativeDataSet, which can be transformed inside the iteration with the regular operators. When the body of one iteration is defined, call closeWith(DataSet) to mark the data set that is fed back into the next iteration. Besides terminating after the maximum number of iterations, you can specify an additional termination criterion with closeWith(DataSet, DataSet): the iteration ends as soon as the second DataSet is empty.
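Outside of Flink, the semantics of iterate/closeWith can be illustrated as a plain loop: a step function is applied to the current partial solution until the maximum iteration count is reached or an optional termination criterion is satisfied. A minimal plain-Java sketch (no Flink dependency; all names here are illustrative, not Flink API):

```java
import java.util.function.Function;
import java.util.function.Predicate;

/**
 * Plain-Java sketch of bulk-iteration semantics: repeatedly apply a step
 * function to the partial solution until maxIterations is reached or the
 * termination criterion holds (analogous to closeWith(DataSet, DataSet)
 * terminating when the second data set becomes empty).
 */
class BulkIterationSketch {
    static <T> T iterate(T initial,
                         int maxIterations,
                         Function<T, T> stepFunction,
                         Predicate<T> terminate) {
        T partialSolution = initial;
        for (int i = 0; i < maxIterations; i++) {
            partialSolution = stepFunction.apply(partialSolution);
            if (terminate.test(partialSolution)) {
                break; // custom convergence criterion met
            }
        }
        return partialSolution; // the final partial solution is the result
    }
}
```

For example, `BulkIterationSketch.iterate(1.0, 10, x -> x / 2, x -> x < 0.1)` halves the value each round and stops as soon as it drops below 0.1.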
Demo
Let's look at the basic code structure of a bulk iteration through Flink's KMeans implementation. The k-means clustering algorithm is an iterative cluster-analysis algorithm. Its steps are: to divide the data into K groups, first pick K objects as the initial cluster centers; then compute the distance between every object and each cluster center and assign every object to its closest center. A cluster center together with the objects assigned to it represents one cluster. After the points are assigned, each cluster center is recomputed from the objects currently in its cluster. This process repeats until some termination condition is met. In this example, the initial cluster centers are given explicitly, and the termination condition is simply reaching the maximum number of iterations.
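Before reading the full Flink program, the per-iteration step function can be sketched in plain Java, stripped of the Flink API (illustrative code, with 2-D points represented as `double[]`): assign each point to its nearest centroid, then move each centroid to the mean of its assigned points.

```java
/**
 * Plain-Java sketch of one k-means iteration: assign each point to its
 * nearest centroid, then recompute each centroid as the mean of its points.
 */
class KMeansStep {
    static double[][] step(double[][] points, double[][] centroids) {
        double[][] sums = new double[centroids.length][2];
        long[] counts = new long[centroids.length];
        for (double[] p : points) {
            // find the nearest centroid (squared Euclidean distance suffices)
            int nearest = 0;
            double best = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double dx = p[0] - centroids[c][0], dy = p[1] - centroids[c][1];
                double d = dx * dx + dy * dy;
                if (d < best) { best = d; nearest = c; }
            }
            sums[nearest][0] += p[0];
            sums[nearest][1] += p[1];
            counts[nearest]++;
        }
        // recompute each centroid as the mean of its assigned points
        double[][] next = new double[centroids.length][2];
        for (int c = 0; c < centroids.length; c++) {
            if (counts[c] == 0) { next[c] = centroids[c]; continue; } // keep empty clusters in place
            next[c][0] = sums[c][0] / counts[c];
            next[c][1] = sums[c][1] / counts[c];
        }
        return next;
    }
}
```

In the Flink program below, the same logic is split across SelectNearestCenter (assignment), CountAppender plus CentroidAccumulator (summing), and CentroidAverager (the division).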
```java
import java.io.IOException;
import java.io.Serializable;
import java.net.URL;
import java.util.Collection;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.FunctionAnnotation;
import org.apache.flink.api.java.operators.IterativeDataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileSystem;

public class BulkIteration {

    public static void main(String[] args) throws Exception {
        final ParameterTool params = ParameterTool.fromArgs(args);
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().setGlobalJobParameters(params);

        // read the data points and the initial centroids
        DataSet<Point> points = getPointDataSet(env);
        DataSet<Centroid> centroids = getCentroidDataSet(env);

        // set the maximum number of KMeans iterations
        IterativeDataSet<Centroid> loop = centroids.iterate(params.getInt("iterations", 10));

        DataSet<Centroid> newCentroids = points
                // compute the closest centroid for each point
                .map(new SelectNearestCenter())
                // every partition receives the full set of centroids
                .withBroadcastSet(loop, "centroids")
                // append a count to each assigned point
                .map(new CountAppender())
                // group by centroid
                .groupBy(0)
                .reduce(new CentroidAccumulator())
                // compute the new centroids
                .map(new CentroidAverager());

        // feed the new centroids back into the next iteration
        DataSet<Centroid> finalCentroids = loop.closeWith(newCentroids);

        DataSet<Tuple2<Integer, Point>> clusteredPoints = points
                // assign each point to its final cluster
                .map(new SelectNearestCenter())
                .withBroadcastSet(finalCentroids, "centroids");

        if (params.has("output")) {
            clusteredPoints.writeAsCsv(params.get("output"), "\n", ",", FileSystem.WriteMode.OVERWRITE);
        } else {
            System.out.println("Printing result to stdout. Use --output to specify output path.");
            clusteredPoints.printOnTaskManager("centroids:");
        }
        env.execute("KMeans Example");
    }

    private static DataSet<Centroid> getCentroidDataSet(ExecutionEnvironment env) throws IOException {
        URL fileUrl = BulkIteration.class.getClassLoader().getResource("centers");
        return env.readCsvFile(fileUrl.getPath())
                .fieldDelimiter(" ")
                .pojoType(Centroid.class, "id", "x", "y");
    }

    private static DataSet<Point> getPointDataSet(ExecutionEnvironment env) throws IOException {
        URL fileUrl = BulkIteration.class.getClassLoader().getResource("points");
        return env.readCsvFile(fileUrl.getPath())
                .fieldDelimiter(" ")
                .pojoType(Point.class, "x", "y");
    }

    public static class Point implements Serializable {
        public double x, y;

        public Point() {}

        public Point(double x, double y) {
            this.x = x;
            this.y = y;
        }

        public Point add(Point other) {
            x += other.x;
            y += other.y;
            return this;
        }

        public Point div(long val) {
            x /= val;
            y /= val;
            return this;
        }

        public double euclideanDistance(Point other) {
            return Math.sqrt((x - other.x) * (x - other.x) + (y - other.y) * (y - other.y));
        }

        public void clear() {
            x = y = 0.0;
        }

        @Override
        public String toString() {
            return x + " " + y;
        }
    }

    public static class Centroid extends Point {
        public int id;

        public Centroid() {}

        public Centroid(int id, double x, double y) {
            super(x, y);
            this.id = id;
        }

        public Centroid(int id, Point p) {
            super(p.x, p.y);
            this.id = id;
        }

        @Override
        public String toString() {
            return id + " " + super.toString();
        }
    }

    /** Finds the closest cluster center for a data point. */
    @FunctionAnnotation.ForwardedFields("*->1")
    public static final class SelectNearestCenter extends RichMapFunction<Point, Tuple2<Integer, Point>> {
        private Collection<Centroid> centroids;

        /** Reads the full centroid data set from the broadcast variable on each partition. */
        @Override
        public void open(Configuration parameters) throws Exception {
            this.centroids = getRuntimeContext().getBroadcastVariable("centroids");
        }

        @Override
        public Tuple2<Integer, Point> map(Point p) throws Exception {
            double minDistance = Double.MAX_VALUE;
            int closestCentroidId = -1;
            for (Centroid centroid : centroids) {
                // compute the distance to this centroid
                double distance = p.euclideanDistance(centroid);
                // keep the closest centroid seen so far
                if (distance < minDistance) {
                    minDistance = distance;
                    closestCentroidId = centroid.id;
                }
            }
            // emit each point together with the id of its closest centroid
            return new Tuple2<>(closestCentroidId, p);
        }
    }

    /** Appends a count of 1 to each (centroid id, point) pair. */
    @FunctionAnnotation.ForwardedFields("f0;f1")
    public static final class CountAppender
            implements MapFunction<Tuple2<Integer, Point>, Tuple3<Integer, Point, Long>> {
        @Override
        public Tuple3<Integer, Point, Long> map(Tuple2<Integer, Point> t) {
            return new Tuple3<>(t.f0, t.f1, 1L);
        }
    }

    /** Sums the coordinates and counts of the points assigned to each cluster center. */
    @FunctionAnnotation.ForwardedFields("0")
    public static final class CentroidAccumulator implements ReduceFunction<Tuple3<Integer, Point, Long>> {
        @Override
        public Tuple3<Integer, Point, Long> reduce(Tuple3<Integer, Point, Long> val1,
                                                   Tuple3<Integer, Point, Long> val2) {
            return new Tuple3<>(val1.f0, val1.f1.add(val2.f1), val1.f2 + val2.f2);
        }
    }

    /** Computes the new position of each cluster center. */
    @FunctionAnnotation.ForwardedFields("0->id")
    public static final class CentroidAverager implements MapFunction<Tuple3<Integer, Point, Long>, Centroid> {
        @Override
        public Centroid map(Tuple3<Integer, Point, Long> value) {
            return new Centroid(value.f0, value.f1.div(value.f2));
        }
    }
}
```
Delta Iterations
Overview
In a bulk iteration, all of the input data participates in every iteration to produce the next result. Some algorithms, however, do not change every element of the solution in every iteration, and delta iterations target exactly this class of algorithms. A delta iteration carries two data sets through the iteration: the Workset and the Solution Set. After each iteration, only the part of the previous Workset that still needs processing is fed back; elements that no longer participate in the computation are dropped. The updated Solution Set is carried forward as well, and the result of the iteration is the Solution Set after the last iteration. To create a DeltaIteration, call iterateDelta(DataSet, int, int) (or iterateDelta(DataSet, int, int[])) on the initial solution set; the arguments are the initial workset, the maximum number of iterations, and the key position(s). The returned DeltaIteration object exposes the two iteration data sets via iteration.getWorkset() and iteration.getSolutionSet(), to which further operators can be attached.
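The mechanics described above can be illustrated as a plain loop, independent of Flink (all names here are illustrative, not Flink API): a step function computes a delta from the current workset and solution set; the delta is merged into the solution set by key and also becomes the next workset (as with `closeWith(delta, delta)`); the loop ends when the workset is empty or the maximum number of iterations is reached.

```java
import java.util.HashMap;
import java.util.Map;

/** Step function: computes the changed entries from the current workset and solution set. */
interface DeltaStep<K, V> {
    Map<K, V> step(Map<K, V> workset, Map<K, V> solutionSet);
}

/** Plain-Java sketch of delta-iteration semantics. */
class DeltaIterationSketch {
    static <K, V> Map<K, V> iterateDelta(Map<K, V> initialSolutionSet,
                                         Map<K, V> initialWorkset,
                                         int maxIterations,
                                         DeltaStep<K, V> stepFunction) {
        Map<K, V> solutionSet = new HashMap<>(initialSolutionSet);
        Map<K, V> workset = new HashMap<>(initialWorkset);
        for (int i = 0; i < maxIterations && !workset.isEmpty(); i++) {
            Map<K, V> delta = stepFunction.step(workset, solutionSet);
            solutionSet.putAll(delta); // changes are merged into the solution set by key
            workset = delta;           // the delta is fed back as the next workset
        }
        return solutionSet;            // result: the solution set after the last iteration
    }
}
```

For example, with a step function that decrements every positive workset value by one, the workset shrinks each round until no entry changes, at which point it becomes empty and the iteration stops.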
Demo
This example uses a delta iteration to propagate minimum values through a connected graph. Every vertex has an ID and a value; in each step a vertex propagates its value to its neighboring vertices, and whenever a vertex receives a value smaller than its current one, it adopts the received value. The goal of the algorithm is to assign the smallest vertex ID in each connected subgraph to every vertex of that subgraph. Initially, each vertex's value is its own ID; over multiple iterations, the values converge to the minimum ID reachable in the neighborhood. If a vertex's value does not change in an iteration, it does not take part in the next one. This is the strength of delta iterations: the Workset keeps shrinking until it is empty, at which point the iteration ends, while the Solution Set is continually updated and becomes the final result.
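Stripped of the Flink API, the algorithm can be sketched in plain Java (illustrative code; vertices are `long` IDs, edges are `long[2]` pairs). The solution set maps each vertex to its smallest known component ID, and the workset holds only the vertices whose value changed in the previous round:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Plain-Java sketch of connected components via delta-iteration semantics:
 * only changed vertices re-enter the workset, which shrinks toward empty.
 */
class MinIdPropagationSketch {
    static Map<Long, Long> run(List<long[]> edges, List<Long> vertices) {
        // build an adjacency list, treating each edge as undirected
        Map<Long, List<Long>> adj = new HashMap<>();
        for (long[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        // solution set: every vertex starts with its own id; initial workset: all vertices
        Map<Long, Long> solution = new HashMap<>();
        for (long v : vertices) solution.put(v, v);
        List<Long> workset = new ArrayList<>(vertices);
        while (!workset.isEmpty()) {
            List<Long> nextWorkset = new ArrayList<>();
            for (long v : workset) {
                long id = solution.get(v);
                for (long n : adj.getOrDefault(v, List.of())) {
                    if (id < solution.get(n)) {  // candidate smaller than the neighbor's value
                        solution.put(n, id);     // update the solution set
                        nextWorkset.add(n);      // only changed vertices re-enter the workset
                    }
                }
            }
            workset = nextWorkset;               // shrinks until empty, ending the iteration
        }
        return solution;
    }
}
```

The Flink program below expresses the same propagate/minimize/filter steps as joins and an aggregation over the Workset and Solution Set.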
```java
import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.functions.FunctionAnnotation;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.examples.java.graph.util.ConnectedComponentsData;
import org.apache.flink.util.Collector;

@SuppressWarnings("serial")
public class DeltaIteration {

    public static void main(String[] args) throws Exception {
        final ParameterTool params = ParameterTool.fromArgs(args);
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        final int maxIterations = params.getInt("iterations", 10);
        env.getConfig().setGlobalJobParameters(params);

        DataSet<Long> vertices = getVertexDataSet(env, params);

        // add the reverse of every directed edge so that the edge set describes an undirected graph
        DataSet<Tuple2<Long, Long>> edges = getEdgeDataSet(env, params).flatMap(new UndirectEdge());

        // initially assign each vertex its own id as the smallest known value
        DataSet<Tuple2<Long, Long>> verticesWithInitialId = vertices.map(new DuplicateValue<Long>());

        // open the delta iteration
        org.apache.flink.api.java.operators.DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
                verticesWithInitialId.iterateDelta(verticesWithInitialId, maxIterations, 0);

        // adopt the smallest neighboring value as the new vertex value
        DataSet<Tuple2<Long, Long>> changes = iteration.getWorkset()
                .join(edges).where(0).equalTo(0)
                // propagate the value of each vertex to all of its neighbors
                .with(new NeighborWithComponentIDJoin())
                .groupBy(0)
                // pick the smallest of all neighboring values
                .aggregate(Aggregations.MIN, 1)
                .join(iteration.getSolutionSet()).where(0).equalTo(0)
                // update the vertex only if the smallest neighboring value is smaller than its current value
                .with(new ComponentIdFilter());

        // close the iteration
        DataSet<Tuple2<Long, Long>> result = iteration.closeWith(changes, changes);

        if (params.has("output")) {
            result.writeAsCsv(params.get("output"), "\n", " ");
            env.execute("Connected Components Example");
        } else {
            System.out.println("Printing result to stdout. Use --output to specify output path.");
            result.print();
        }
    }

    /** Initializes the (vertex, value) tuples. */
    @FunctionAnnotation.ForwardedFields("*->f0")
    public static final class DuplicateValue<T> implements MapFunction<T, Tuple2<T, T>> {
        @Override
        public Tuple2<T, T> map(T vertex) {
            return new Tuple2<T, T>(vertex, vertex);
        }
    }

    /** Emits the reverse of every edge, turning the directed graph into an undirected one. */
    public static final class UndirectEdge implements FlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {
        Tuple2<Long, Long> invertedEdge = new Tuple2<Long, Long>();

        @Override
        public void flatMap(Tuple2<Long, Long> edge, Collector<Tuple2<Long, Long>> out) {
            invertedEdge.f0 = edge.f1;
            invertedEdge.f1 = edge.f0;
            out.collect(edge);
            out.collect(invertedEdge);
        }
    }

    /** Sends the value of a vertex to all of its neighbors. */
    @FunctionAnnotation.ForwardedFieldsFirst("f1->f1")
    @FunctionAnnotation.ForwardedFieldsSecond("f1->f0")
    public static final class NeighborWithComponentIDJoin
            implements JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>> {
        @Override
        public Tuple2<Long, Long> join(Tuple2<Long, Long> vertexWithComponent, Tuple2<Long, Long> edge) {
            return new Tuple2<Long, Long>(edge.f1, vertexWithComponent.f1);
        }
    }

    /** Emits a new (vertex, value) tuple if the candidate value is smaller than the current one. */
    @FunctionAnnotation.ForwardedFieldsFirst("*")
    public static final class ComponentIdFilter
            implements FlatJoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>> {
        @Override
        public void join(Tuple2<Long, Long> candidate, Tuple2<Long, Long> old,
                         Collector<Tuple2<Long, Long>> out) {
            if (candidate.f1 < old.f1) {
                out.collect(candidate);
            }
        }
    }

    private static DataSet<Long> getVertexDataSet(ExecutionEnvironment env, ParameterTool params) {
        if (params.has("vertices")) {
            return env.readCsvFile(params.get("vertices")).types(Long.class)
                    .map(new MapFunction<Tuple1<Long>, Long>() {
                        @Override
                        public Long map(Tuple1<Long> value) {
                            return value.f0;
                        }
                    });
        } else {
            System.out.println("Executing Connected Components example with default vertices data set.");
            System.out.println("Use --vertices to specify file input.");
            return ConnectedComponentsData.getDefaultVertexDataSet(env);
        }
    }

    private static DataSet<Tuple2<Long, Long>> getEdgeDataSet(ExecutionEnvironment env, ParameterTool params) {
        if (params.has("edges")) {
            return env.readCsvFile(params.get("edges")).fieldDelimiter(" ").types(Long.class, Long.class);
        } else {
            System.out.println("Executing Connected Components example with default edges data set.");
            System.out.println("Use --edges to specify file input.");
            return ConnectedComponentsData.getDefaultEdgeDataSet(env);
        }
    }
}
```
