Apache Flink

{{Short description|Framework and distributed processing engine}}

{{Infobox software

| title = Apache Flink

| name = Apache Flink

| logo = Apache Flink logo.svg

| screenshot =

| caption =

| developer = Apache Software Foundation

| released = {{Start date and age|2011|05|df=yes}}

| latest release version = {{wikidata|property|edit|reference|P348}}

| latest release date = {{start date and age|{{wikidata|qualifier|P348|P577}}}}{{cite web|url=https://flink.apache.org/downloads.html|title=All stable Flink releases|website=flink.apache.org|publisher=Apache Software Foundation|access-date=2021-12-20}}

| programming language = Java and Scala

| operating system = Cross-platform

| genre = {{hlist|Data analytics|machine learning algorithms}}

| license = Apache License 2.0

}}

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala.{{cite web|url=https://flink.apache.org/|title=Apache Flink: Scalable Batch and Stream Data Processing|work=apache.org}}{{cite web|url=https://github.com/apache/flink|title=apache/flink|work=GitHub|date=29 January 2022}} Flink executes arbitrary dataflow programs in a data-parallel and pipelined (hence task parallel) manner.Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. 2014. The Stratosphere platform for big data analytics. The VLDB Journal 23, 6 (December 2014), 939-964. [https://dx.doi.org/10.1007/s00778-014-0357-y DOI] Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs.{{cite web|url=http://www.infoworld.com/article/2919602/hadoop/flink-hadoops-new-contender-for-mapreduce-spark.html|title=Apache Flink: New Hadoop contender squares off against Spark|author=Ian Pointer|date=7 May 2015|work=InfoWorld}}{{cite web|url=http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/|title=On Apache Flink. Interview with Volker Markl.|work=odbms.org}} Furthermore, Flink's runtime supports the execution of iterative algorithms natively.Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl. 2012. Spinning fast iterative data flows. Proc. VLDB Endow. 5, 11 (July 2012), 1268-1279. [https://dx.doi.org/10.14778/2350229.2350245 DOI]

Flink provides a high-throughput, low-latency streaming engine{{Cite news|url=https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at|title=Benchmarking Streaming Computation Engines at Yahoo!|newspaper=Yahoo Engineering|access-date=2017-02-23}} as well as support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics.{{Cite arXiv|last1=Carbone|first1=Paris|last2=Fóra|first2=Gyula|last3=Ewen|first3=Stephan|last4=Haridi|first4=Seif|last5=Tzoumas|first5=Kostas|date=2015-06-29|title=Lightweight Asynchronous Snapshots for Distributed Dataflows|eprint=1506.08603|class=cs.DC}} Programs can be written in Java, Python,{{cite web|url=https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/batch/python.html|title=Apache Flink 1.2.0 Documentation: Python Programming Guide|website=ci.apache.org|language=en|access-date=2017-02-23}} and SQL{{cite web|url=https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/table_api.html|title=Apache Flink 1.2.0 Documentation: Table and SQL|website=ci.apache.org|language=en|access-date=2017-02-23}} and are automatically compiled and optimizedFabian Hueske, Mathias Peters, Matthias J. Sax, Astrid Rheinländer, Rico Bergmann, Aljoscha Krettek, and Kostas Tzoumas. 2012. Opening the black boxes in data flow optimization. Proc. VLDB Endow. 5, 11 (July 2012), 1256-1267. [https://dx.doi.org/10.14778/2350229.2350244 DOI] into dataflow programs that are executed in a cluster or cloud environment.Daniel Warneke and Odej Kao. 2009. Nephele: efficient parallel data processing in the cloud. In Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS '09). ACM, New York, NY, USA, Article 8, 10 pages. [http://doi.acm.org/10.1145/1646468.1646476 DOI]

Flink does not provide its own data-storage system, but provides data-source and sink connectors to systems such as Apache Doris, Amazon Kinesis, Apache Kafka, HDFS, Apache Cassandra, and ElasticSearch.{{cite web|url=https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/index.html|title=Apache Flink 1.2.0 Documentation: Streaming Connectors|website=ci.apache.org|language=en|access-date=2017-02-23}}

Development

Apache Flink is developed under the Apache License 2.0{{cite web|url=https://git1-us-west.apache.org/repos/asf?p=flink.git;a=blob;f=LICENSE;hb=HEAD|title=ASF Git Repos - flink.git/blob - LICENSE|work=apache.org|access-date=2015-04-12|archive-url=https://web.archive.org/web/20171023175448/https://git1-us-west.apache.org/repos/asf?p=flink.git;a=blob;f=LICENSE;hb=HEAD|archive-date=2017-10-23|url-status=dead}} by the Apache Flink Community within the Apache Software Foundation. The project is driven by 119{{cite web|title=Apache Flink Committee Information|url=https://projects.apache.org/committee.html?flink|access-date=2025-04-10}} committers and over 340 contributors.

Overview

Apache Flink's dataflow programming model provides event-at-a-time processing on both finite and infinite datasets. At a basic level, Flink programs consist of streams and transformations. “Conceptually, a stream is a (potentially never-ending) flow of data records, and a transformation is an operation that takes one or more streams as input, and produces one or more output streams as a result.”{{cite web|url=https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/programming-model.html#programs-and-dataflows|title=Apache Flink 1.2.0 Documentation: Dataflow Programming Model|website=ci.apache.org|language=en|access-date=2017-02-23}}
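The stream-and-transformation model can be illustrated without Flink itself. Below is a plain-Scala sketch that treats a finite sequence of timestamped records as a "stream" and applies one transformation: a per-word count over tumbling windows. The record values, field names, and the 5-second window size are illustrative assumptions, not Flink API.

```scala
// A "stream" element: a record with an event timestamp (milliseconds) and a payload.
case class Record(timestampMs: Long, word: String)

object StreamSketch {
  // One transformation: assign each record to a tumbling window of `windowMs`
  // milliseconds, then count records per (window, word) pair.
  def windowedCounts(stream: Seq[Record], windowMs: Long): Map[(Long, String), Int] =
    stream
      .groupBy(r => (r.timestampMs / windowMs, r.word)) // window index = timestamp / size
      .map { case (key, records) => key -> records.size } // count records in each group

  def main(args: Array[String]): Unit = {
    val stream = Seq(Record(1000, "flink"), Record(2000, "flink"), Record(6000, "flink"))
    // Two 5-second windows: window 0 holds two records, window 1 holds one.
    println(windowedCounts(stream, 5000))
  }
}
```

In Flink proper, the input would be unbounded and the windows would be emitted incrementally as event time advances, but the grouping logic is the same idea.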

Apache Flink includes two core APIs: a DataStream API for bounded or unbounded streams of data and a DataSet API for bounded data sets. Flink also offers a Table API, which is a SQL-like expression language for relational stream and batch processing that can be easily embedded in Flink's DataStream and DataSet APIs. The highest-level language supported by Flink is SQL, which is semantically similar to the Table API and represents programs as SQL query expressions.

= Programming Model and Distributed Runtime =

Upon execution, Flink programs are mapped to streaming dataflows. Every Flink dataflow starts with one or more sources (a data input, e.g., a message queue or a file system) and ends with one or more sinks (a data output, e.g., a message queue, file system, or database). An arbitrary number of transformations can be performed on the stream. These streams can be arranged as a directed, acyclic dataflow graph, allowing an application to branch and merge dataflows.

Flink offers ready-built source and sink connectors with Apache Kafka, Amazon Kinesis,{{Cite web |url=https://digitalcloud.training/amazon-kinesis/ |title=Kinesis Data Streams: processing streaming data in real time|date=5 January 2022 }} HDFS, Apache Cassandra, and more.

Flink programs run as a distributed system within a cluster and can be deployed in standalone mode as well as on YARN, Mesos, or Docker-based setups, along with other resource-management frameworks.{{cite web|url=https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/runtime.html|title=Apache Flink 1.2.0 Documentation: Distributed Runtime Environment|website=ci.apache.org|language=en|access-date=2017-02-24}}

= State: Checkpoints, Savepoints, and Fault-tolerance =

Apache Flink includes a lightweight fault-tolerance mechanism based on distributed checkpoints. A checkpoint is an automatic, asynchronous snapshot of the state of an application and of its position in a source stream. In the case of a failure, a Flink program with checkpointing enabled will, upon recovery, resume processing from the last completed checkpoint, ensuring that Flink maintains exactly-once state semantics within an application. The checkpointing mechanism also exposes hooks that let application code include external systems in a checkpoint (for example, opening and committing transactions with a database system).
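Checkpointing is enabled on the execution environment. The sketch below uses the DataStream API of the Flink 1.x line that this section's citations describe; the 10-second interval is an illustrative choice, and exactly-once is already the default mode, set explicitly here only for clarity.

```scala
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Snapshot all operator state and source positions every 10 seconds.
    env.enableCheckpointing(10000)
    // Exactly-once is the default mode; shown explicitly for clarity.
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // ... define sources, transformations, and sinks here ...
  }
}
```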

Flink also includes a mechanism called savepoints, which are manually triggered checkpoints.{{cite web|url=https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/runtime.html#savepoints|title=Apache Flink 1.2.0 Documentation: Distributed Runtime Environment - Savepoints|website=ci.apache.org|language=en|access-date=2017-02-24}} A user can generate a savepoint, stop a running Flink program, then resume the program from the same application state and position in the stream. Savepoints enable updates to a Flink program or a Flink cluster without losing the application's state. As of Flink 1.2, savepoints also make it possible to restart an application with a different parallelism, allowing users to adapt to changing workloads.
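Operationally, savepoints are driven through the flink command-line client. A sketch of the workflow follows, using Flink 1.x-era flags; the job id, savepoint directory, and jar name are placeholders.

```shell
# Trigger a savepoint for a running job.
flink savepoint <jobId> hdfs:///flink/savepoints

# Cancel the job while taking a final savepoint (Flink 1.x syntax).
flink cancel -s hdfs:///flink/savepoints <jobId>

# Resume a (possibly updated) program from the savepoint.
flink run -s hdfs:///flink/savepoints/savepoint-<id> my-job.jar
```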

= DataStream API =

Flink's DataStream API enables transformations (e.g. filters, aggregations, window functions) on bounded or unbounded streams of data. The DataStream API includes more than 20 different types of transformations and is available in Java and Scala.{{cite web|url=https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html|title=Apache Flink 1.2.0 Documentation: Flink DataStream API Programming Guide|website=ci.apache.org|language=en|access-date=2017-02-24}}

A simple example of a stateful stream-processing program is an application that emits word counts from a continuous input stream, grouping the data into 5-second windows:

<syntaxhighlight lang="scala">
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class WordCount(word: String, count: Int)

object WindowWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("localhost", 9999)

    val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { WordCount(_, 1) }
      .keyBy("word")
      .timeWindow(Time.seconds(5))
      .sum("count")

    counts.print()

    env.execute("Window Stream WordCount")
  }
}
</syntaxhighlight>

= DataSet API =

Flink's DataSet API enables transformations (e.g., filters, mapping, joining, grouping) on bounded datasets. The DataSet API includes more than 20 different types of transformations.{{cite web|url=https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/batch/index.html|title=Apache Flink 1.2.0 Documentation: Flink DataSet API Programming Guide|website=ci.apache.org|language=en|access-date=2017-02-24}} The API is available in Java and Scala, with an experimental Python API. Flink's DataSet API is conceptually similar to the DataStream API. The DataSet API was deprecated as of Flink version 2.0.{{cite web|url=https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.0/#api|title=Deprecated Flink version 2.0 APIs|access-date=2025-04-10}}
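A minimal batch counterpart of the streaming example above, written against the (since-deprecated) Scala DataSet API of the Flink 1.x line; the inline input is illustrative, where a real job would typically read from a file or other source.

```scala
import org.apache.flink.api.scala._

object BatchWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // A small inline dataset; in practice this would come from e.g. env.readTextFile(...).
    val text = env.fromElements("to be or not to be")

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map((_, 1))
      .groupBy(0) // group by the word field
      .sum(1)     // sum the per-word counts

    counts.print()
  }
}
```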

= Table API and SQL =

Flink's Table API is a SQL-like expression language for relational stream and batch processing that can be embedded in Flink's Java and Scala DataSet and DataStream APIs. The Table API and SQL interface operate on a relational Table abstraction. Tables can be created from external data sources or from existing DataStreams and DataSets. The Table API supports relational operators such as selection, aggregation, and joins on Tables.

Tables can also be queried with regular SQL. The Table API and SQL offer equivalent functionality and can be mixed in the same program. When a Table is converted back into a DataSet or DataStream, the logical plan, which was defined by relational operators and SQL queries, is optimized using Apache Calcite and is transformed into a DataSet or DataStream program.{{cite web|url=https://flink.apache.org/news/2016/05/24/stream-sql.html|title=Stream Processing for Everyone with SQL and Apache Flink|website=flink.apache.org|date=24 May 2016 |language=en|access-date=2020-01-08}}
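As a sketch of how a DataSet, the Table abstraction, and SQL fit together: the entry points below follow the Flink 1.x-era Table API described by this section's citations (the API surface has changed substantially in later releases), and the table name, columns, and data are illustrative.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._

object TableSqlSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Flink 1.x-era entry point; later versions use TableEnvironment.create(...).
    val tableEnv = TableEnvironment.getTableEnvironment(env)

    // Register a DataSet as a relational Table with named fields.
    val orders = env.fromElements((1, "books", 10.0), (2, "games", 20.0), (3, "books", 5.0))
    tableEnv.registerDataSet("Orders", orders, 'id, 'category, 'amount)

    // Query the Table with regular SQL; the logical plan is optimized
    // (via Apache Calcite) and executed as a DataSet program.
    val result = tableEnv.sqlQuery(
      "SELECT category, SUM(amount) AS total FROM Orders GROUP BY category")
    result.toDataSet[(String, Double)].print()
  }
}
```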

History

In 2010, the research project "Stratosphere: Information Management on the Cloud"{{cite web|url=http://stratosphere.eu/|title=Stratosphere|work=stratosphere.eu}} led by Volker Markl (funded by the German Research Foundation (DFG)){{cite web|url=https://gepris.dfg.de/gepris/projekt/132320961 |title= Stratosphere - Information Management on the Cloud |publisher=Deutsche Forschungsgemeinschaft (DFG) |accessdate=2023-12-01}} was started as a collaboration of Technische Universität Berlin, Humboldt-Universität zu Berlin, and Hasso-Plattner-Institut Potsdam. Flink started from a fork of Stratosphere's distributed execution engine and it became an Apache Incubator project in March 2014.{{cite web|url=https://wiki.apache.org/incubator/StratosphereProposal|title=Stratosphere|work=apache.org}} In December 2014, Flink was accepted as an Apache top-level project.{{cite web|url=https://projects.apache.org/project.html?flink|title=Project Details for Apache Flink|work=apache.org}}{{cite web|url=https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces69|title=The Apache Software Foundation Announces Apache™ Flink™ as a Top-Level Project : The Apache Software Foundation Blog|work=apache.org|date=12 January 2015 }}{{cite web|url=http://siliconangle.com/blog/2015/02/09/will-the-mysterious-apache-flink-find-its-sweet-spot-in-the-enterprise|title=Will the mysterious Apache Flink find a sweet spot in the enterprise?|work=siliconangle.com|date=9 February 2015}}[http://www.heise.de/developer/meldung/Big-Data-Apache-Flink-wird-Top-Level-Projekt-2516177.html (in German)]

{| class="wikitable"
|-
! Version
! Original release date
! Latest version
! Release date
|-
{{Version|o|0.9}}
| 2015-06-24
| 0.9.1
| 2015-09-01
|-
{{Version|o|0.10}}
| 2015-11-16
| 0.10.2
| 2016-02-11
|-
{{Version|o|1.0}}
| 2016-03-08
| 1.0.3
| 2016-05-11
|-
{{Version|o|1.1}}
| 2016-08-08
| 1.1.5
| 2017-03-22
|-
{{Version|o|1.2}}
| 2017-02-06
| 1.2.1
| 2017-04-26
|-
{{Version|o|1.3}}
| 2017-06-01
| 1.3.3
| 2018-03-15
|-
{{Version|o|1.4}}
| 2017-12-12
| 1.4.2
| 2018-03-08
|-
{{Version|o|1.5}}
| 2018-05-25
| 1.5.6
| 2018-12-26
|-
{{Version|o|1.6}}
| 2018-08-08
| 1.6.3
| 2018-12-22
|-
{{Version|o|1.7}}
| 2018-11-30
| 1.7.2
| 2019-02-15
|-
{{Version|o|1.8}}
| 2019-04-09
| 1.8.3
| 2019-12-11
|-
{{Version|o|1.9}}
| 2019-08-22
| 1.9.2
| 2020-01-30
|-
{{Version|o|1.10}}
| 2020-02-11
| 1.10.3
| 2021-01-29
|-
{{Version|o|1.11}}
| 2020-07-06
| 1.11.6
| 2021-12-16
|-
{{Version|o|1.12}}
| 2020-12-10
| 1.12.7
| 2021-12-16
|-
{{Version|o|1.13}}
| 2021-05-03
| 1.13.6
| 2022-02-18
|-
{{Version|o|1.14}}
| 2021-09-29
| 1.14.6
| 2022-09-28
|-
{{Version|o|1.15}}
| 2022-05-05
| 1.15.4
| 2023-03-15
|-
{{Version|o|1.16}}
| 2022-10-28
| 1.16.3
| 2023-11-29
|-
{{Version|o|1.17}}
| 2023-03-23
| 1.17.2
| 2023-11-29
|-
{{Version|o|1.18}}
| 2023-10-24
| 1.18.1
| 2024-01-19
|-
{{Version|co|1.19}}
| 2024-03-18
| 1.19.2
| 2025-02-12
|-
{{Version|co|1.20 (LTS)}}
| 2024-08-02
| 1.20.1
| 2025-02-12
|-
{{Version|c|2.0}}
| 2025-03-19
| 2.0.0
| 2025-03-19
|-
| colspan="4" | {{Version |l |show=111100}}
|}

Release Dates

  • 03/2025: Apache Flink 2.0 (03/2025: v2.0.0)
  • 08/2024: Apache Flink 1.20 (02/2025: v1.20.1)
  • 03/2024: Apache Flink 1.19 (06/2024: v1.19.1, 02/2025: v1.19.2)
  • 10/2023: Apache Flink 1.18 (01/2024: v1.18.1)
  • 03/2023: Apache Flink 1.17 (05/2023: v1.17.1; 11/2023: v1.17.2)
  • 10/2022: Apache Flink 1.16 (01/2023: v1.16.1; 05/2023: v1.16.2; 11/2023: v1.16.3)
  • 05/2022: Apache Flink 1.15 (07/2022: v1.15.1; 08/2022: v1.15.2; 11/2022: v1.15.3; 03/2023: v1.15.4)
  • 09/2021: Apache Flink 1.14 (12/2021: v1.14.2; 01/2022: v1.14.3; 03/2022: v1.14.4; 06/2022: v1.14.5; 09/2022: v1.14.6)
  • 05/2021: Apache Flink 1.13 (05/2021: v1.13.1; 08/2021: v1.13.2; 10/2021: v1.13.3; 12/2021: v1.13.5; 02/2022: v1.13.6)
  • 12/2020: Apache Flink 1.12 (01/2021: v1.12.1; 03/2021: v1.12.2; 04/2021: v1.12.3; 05/2021: v1.12.4; 08/2021: v1.12.5; 12/2021: v1.12.7)
  • 07/2020: Apache Flink 1.11 (07/2020: v1.11.1; 09/2020: v1.11.2; 12/2020: v1.11.3; 08/2021: v1.11.4; 12/2021: v1.11.6)
  • 02/2020: Apache Flink 1.10 (05/2020: v1.10.1; 08/2020: v1.10.2; 01/2021: v1.10.3)
  • 08/2019: Apache Flink 1.9 (10/2019: v1.9.1; 01/2020: v1.9.2)
  • 04/2019: Apache Flink 1.8 (07/2019: v1.8.1; 09/2019: v1.8.2; 12/2019: v1.8.3)
  • 11/2018: Apache Flink 1.7 (12/2018: v1.7.1; 02/2019: v1.7.2)
  • 08/2018: Apache Flink 1.6 (09/2018: v1.6.1; 10/2018: v1.6.2; 12/2018: v1.6.3; 02/2019: v1.6.4)
  • 05/2018: Apache Flink 1.5 (07/2018: v1.5.1; 07/2018: v1.5.2; 08/2018: v1.5.3; 09/2018: v1.5.4; 10/2018: v1.5.5; 12/2018: v1.5.6)
  • 12/2017: Apache Flink 1.4 (02/2018: v1.4.1; 03/2018: v1.4.2)
  • 06/2017: Apache Flink 1.3 (06/2017: v1.3.1; 08/2017: v1.3.2; 03/2018: v1.3.3)
  • 02/2017: Apache Flink 1.2 (04/2017: v1.2.1)
  • 08/2016: Apache Flink 1.1 (08/2016: v1.1.1; 09/2016: v1.1.2; 10/2016: v1.1.3; 12/2016: v1.1.4; 03/2017: v1.1.5)
  • 03/2016: Apache Flink 1.0 (04/2016: v1.0.1; 04/2016: v1.0.2; 05/2016: v1.0.3)
  • 11/2015: Apache Flink 0.10 (11/2015: v0.10.1; 02/2016: v0.10.2)
  • 06/2015: Apache Flink 0.9 (09/2015: v0.9.1)
  • 04/2015: Apache Flink 0.9-milestone-1

Apache Incubator Release Dates

  • 01/2015: Apache Flink 0.8-incubating
  • 11/2014: Apache Flink 0.7-incubating
  • 08/2014: Apache Flink 0.6-incubating (09/2014: v0.6.1-incubating)
  • 05/2014: Stratosphere 0.5 (06/2014: v0.5.1; 07/2014: v0.5.2)

Pre-Apache Stratosphere Release Dates

  • 01/2014: Stratosphere 0.4 (version 0.3 was skipped)
  • 08/2012: Stratosphere 0.2
  • 05/2011: Stratosphere 0.1 (08/2011: v0.1.1)

The 1.14.1, 1.13.4, 1.12.6, 1.11.5 releases, which were supposed to only contain a Log4j upgrade to 2.15.0, were skipped because {{CVE|2021-45046|link=no}} was discovered during the release publication.{{cite web|url=https://flink.apache.org/news/2021/12/16/log4j-patch-releases.html|title=Apache Flink Log4j emergency releases|website=flink.apache.org|date=16 December 2021 |publisher=Apache Software Foundation|access-date=2021-12-22}}

See also

{{Portal|Free and open-source software}}

References

{{Reflist|colwidth=30em}}