Cascading (software)

{{Short description|Software abstraction layer for Apache Hadoop and Apache Flink}}

{{Infobox software

| name = Cascading

| latest release version = 3.3.0

| latest release date = {{start date and age|2018|03|24}}{{cite web

| url = https://github.com/Cascading/cascading/releases

| title = Releases · Cascading/cascading

| website = github.com

| access-date = 2021-03-29

}}

| latest preview version = 4.0-wip-120

| latest preview date = {{start date and age|2021|03|27}}{{cite web

| url = https://github.com/cwensel/cascading/releases

| title = Releases · cwensel/cascading

| website = github.com

| access-date = 2021-03-29

}}

| programming language = Java

| license = Apache License v2{{cite web

| url = https://github.com/Cascading/cascading/blob/3.3/LICENSE.txt

| title = cascading/LICENSE.txt at 3.3 · Cascading/cascading

| website = github.com

| access-date = 2021-03-29

}}

| repo = {{url|https://github.com/Cascading/cascading}}

| website = {{url|https://www.cascading.org/}}

}}

Cascading is a software abstraction layer for Apache Hadoop and Apache Flink. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs. It is open source and available under the Apache License. Commercial support is available from Driven, Inc.{{Cite web|url=https://www.driven.io/support/|title=Cascading and Driven | Support|website=Driven}}

Cascading was originally authored by Chris Wensel, who later founded Concurrent, Inc., which has since been rebranded as Driven.{{Cite web|url=https://www.integrate.io/|title=Integrate.io - One Platform To Support Your Entire Data Journey|website=Integrate.io}} Cascading is being actively developed by the community{{citation needed|date=October 2014}} and a number of add-on modules are available.{{Cite web |url=http://www.cascading.org/modules.html |title=Cascading modules |access-date=2011-08-22 |archive-url=https://web.archive.org/web/20110811082852/http://www.cascading.org/modules.html |archive-date=2011-08-11 |url-status=dead }}

Architecture

To use Cascading, Apache Hadoop must also be installed, and the Hadoop job .jar must contain the Cascading .jars. Cascading consists of a data processing API, an integration API, a process planner, and a process scheduler.

Cascading leverages the scalability of Hadoop but abstracts standard data processing operations away from underlying map and reduce tasks.[http://codeascraft.etsy.com/2010/02/24/analyzing-etsys-data-with-hadoop-and-cascading/ Blog post by Etsy describing their use of Cascading with Hadoop]{{Better source needed|date=October 2013}} Developers use Cascading to create a .jar file that describes the required processes. It follows a ‘source-pipe-sink’ paradigm: data is captured from sources, passes through reusable ‘pipes’ that perform data analysis processes, and the results are stored in output files or ‘sinks’. Pipes are created independently of the data they will process. Once pipes are tied to data sources and sinks, the result is called a ‘flow’. Flows can be grouped into a ‘cascade’, and the process scheduler ensures that a given flow does not execute until all its dependencies are satisfied. Pipes and flows can be reused and reordered to support different business needs.{{Cite web|url=http://www.cascading.org/1.2/userguide/pdf/userguide.pdf|archiveurl=https://web.archive.org/web/20110206053054/http://www.cascading.org/1.2/userguide/pdf/userguide.pdf|url-status=dead|title=Cascading User Guide|archivedate=February 6, 2011}}
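
The source-pipe-sink idea and the dependency rule for cascades can be illustrated with a small in-memory model in plain Java. This is only a sketch of the concepts and does not use Cascading's API: the names FlowSketch, Flow, WORD_COUNT, MOST_FREQUENT and runCascade are hypothetical, sources and sinks are ordinary Java collections rather than files on a cluster, and the dependency graph is assumed to be acyclic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Toy model only: none of these types belong to Cascading's API.
final class FlowSketch {

    /** A "flow": one pipe bound to a source and a sink, plus the flows it depends on. */
    static final class Flow<I, O> {
        final String name;
        final List<Flow<?, ?>> dependsOn = new ArrayList<>();
        private final Supplier<I> source;
        private final Function<I, O> pipe;
        private final Consumer<O> sink;

        Flow(String name, Supplier<I> source, Function<I, O> pipe, Consumer<O> sink) {
            this.name = name;
            this.source = source;
            this.pipe = pipe;
            this.sink = sink;
        }

        /** Reads from the source, applies the pipe, writes to the sink. */
        void run() {
            sink.accept(pipe.apply(source.get()));
        }
    }

    /** Reusable pipe: counts words; knows nothing about where the lines come from. */
    static final Function<List<String>, Map<String, Integer>> WORD_COUNT = lines -> {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    };

    /** Reusable pipe: picks the most frequent word, or "" for an empty map. */
    static final Function<Map<String, Integer>, String> MOST_FREQUENT = counts ->
        counts.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElse("");

    /** A "cascade": runs every flow once, dependencies first (assumes no cycles). */
    static List<String> runCascade(List<Flow<?, ?>> flows) {
        List<String> order = new ArrayList<>();
        Set<Flow<?, ?>> visited = new HashSet<>();
        for (Flow<?, ?> flow : flows) {
            visit(flow, visited, order);
        }
        return order;
    }

    private static void visit(Flow<?, ?> flow, Set<Flow<?, ?>> visited, List<String> order) {
        if (!visited.add(flow)) {
            return; // already handled
        }
        for (Flow<?, ?> dependency : flow.dependsOn) {
            visit(dependency, visited, order);
        }
        flow.run();
        order.add(flow.name);
    }

    static String demo() {
        List<String> lines = Arrays.asList("to be or not to be", "to do");
        Map<String, Integer> counts = new TreeMap<>(); // sink of "count", source of "top"
        List<String> report = new ArrayList<>();

        Flow<List<String>, Map<String, Integer>> count =
            new Flow<>("count", () -> lines, WORD_COUNT, counts::putAll);
        Flow<Map<String, Integer>, String> top =
            new Flow<>("top", () -> counts, MOST_FREQUENT, report::add);
        top.dependsOn.add(count);

        // The flows are listed out of order on purpose; the cascade reorders them.
        List<String> order = runCascade(Arrays.<Flow<?, ?>>asList(top, count));
        return order + " " + report;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [count, top] [to]
    }
}
```

In this model the two pipes are defined once and bound to data only when a flow is constructed, so the same pipe could be reused with a different source or sink. A real Cascading application would instead run its flows as jobs on a Hadoop cluster.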

Developers write the code in a JVM-based language and do not need to learn MapReduce. The resulting program can be regression tested and integrated with external applications like any other Java application.{{Cite web|url=https://www.driven.io/features/|title=Hadoop Application Performance Management - DRIVEN's Features|website=Driven}}

Cascading is most often used for ad targeting, log file analysis, bioinformatics, machine learning, predictive analytics, web content mining, and extract, transform and load (ETL) applications.

Uses of Cascading

Cascading was cited as one of the top five most powerful Hadoop projects by SD Times in 2011,{{cite news

| last = Handy

| first = Alex

| date = 1 June 2011

| title = The top five most powerful Hadoop projects

| url =http://www.sdtimes.com/content/article.aspx?ArticleID=35596&page=1

| newspaper = SD Times

| location =

| publisher =

| accessdate = 26 October 2013

}}{{Unreliable source?|date=October 2013}} as a major open source project relevant to bioinformatics{{cite news

| last = Taylor

| first = Ronald

| date = 21 December 2010

| title = An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics

| url = http://www.biomedcentral.com/1471-2105/11/S12/S1

| newspaper = BioMed Central

| location =

| publisher = Springer Science+Business Media

| accessdate = 26 October 2013

}}{{Unreliable source?|date=October 2013}} and is included in Hadoop: The Definitive Guide, by Tom White.{{Cite book|url=https://books.google.com/books?id=Nff49D7vnJcC&dq=cascading+hadoop&pg=PA548|title=Hadoop: The Definitive Guide|first=Tom|last=White|date=September 24, 2010|publisher="O'Reilly Media, Inc."|isbn=9781449396893 |via=Google Books}} The project has also been cited in presentations, conference proceedings and Hadoop user group meetings as a useful tool for working with Hadoop{{Cite web|url=https://www.slideshare.net/pacoid/getting-started-on-hadoop|title=Getting Started on Hadoop|website=www.slideshare.net}}{{Cite web |url=http://www.smartfrog.org/wiki/download/attachments/6193590/hadoop_and_beyond.pdf?version=1&modificationDate=1238073739000 |title=Julio Guijarro, Steve Loughran and Paolo Castagna, "Hadoop and beyond," HP Labs, Bristol UK, 2008. |access-date=2011-08-22 |archive-url=https://web.archive.org/web/20111001095030/http://www.smartfrog.org/wiki/download/attachments/6193590/hadoop_and_beyond.pdf?version=1&modificationDate=1238073739000 |archive-date=2011-10-01 |url-status=dead }}{{Cite web|url=https://www.slideshare.net/hadoopusergroup/flightcaster-presentation-hadoop|title=Flightcaster Presentation Hadoop|website=www.slideshare.net}}{{Cite web|url=https://www.slideshare.net/chriscurtin/nosql-hadoop-cascading-june-2010|title=NoSQL, Hadoop, Cascading June 2010|website=www.slideshare.net}} and with Apache Spark.{{Cite web|url=https://spark-summit.org/2014/talk/using-cascading-to-build-data-centric-applications-on-spark|title=Using Cascading to Build Data-centric Applications on Spark|date=2014-05-07|website=Spark Summit 2014|access-date=2016-03-25}}

MultiTool on Amazon Web Services was developed using Cascading.{{Cite web|url=http://aws.amazon.com/articles/2293?_encoding=UTF8&jiveRedirect=1|title=Cascading{{Not a typo|.}}Multitool on AWS}}
LogAnalyzer for Amazon CloudFront was developed using Cascading.{{Cite web|url=https://aws.amazon.com/articles/item/|title=AWS Articles|website=Amazon Web Services, Inc.}}
BackType[http://tech.backtype.com/ BackType blog] {{webarchive |url=https://web.archive.org/web/20110825014616/http://tech.backtype.com/ |date=August 25, 2011 }} - social analytics platform
Etsy - marketplace
FlightCaster{{Cite web|url=http://www.informationweek.com/news/software/infrastructure/224000240|title=FlightCaster}} - predicting flight delays
Ion Flux{{Cite web|url=http://www.concurrentinc.com/casestudies/ion_flux|archiveurl=https://web.archive.org/web/20111023203553/http://www.concurrentinc.com/casestudies/ion_flux|url-status=dead|title=Ion Flux|archivedate=October 23, 2011}} - analyzing DNA sequence data
RapLeaf[http://blog.rapleaf.com/dev/2008/09/05/goodbye-mapreduce-hello-cascading/ RapLeaf Blog] {{webarchive |url=https://web.archive.org/web/20110201023302/http://blog.rapleaf.com/dev/2008/09/05/goodbye-mapreduce-hello-cascading/ |date=February 1, 2011 }} - personalization and recommendation systems
Razorfish{{Cite web|url=https://aws.amazon.com/solutions/case-studies/razorfish/|title=Razorfish Case Study|website=Amazon Web Services, Inc.}} - digital advertising

Domain-Specific Languages Built on Cascading

PyCascading{{Cite web|url=https://github.com/twitter/pycascading|title = PyCascading is no longer maintained|website = GitHub|date = 17 September 2021}} - by Twitter, available on GitHub
Cascading.jruby{{Cite web|url=https://github.com/gmarabout/cascading.jruby|title=Cascading.JRuby|date=August 8, 2018|via=GitHub}} - developed by Gregoire Marabout, available on GitHub
Cascalog{{Cite web|url=https://github.com/nathanmarz/cascalog|title=Cascalog|date=June 23, 2023|via=GitHub}} - authored by Nathan Marz, available on GitHub
Scalding{{Cite web|url=https://github.com/twitter/scalding|title=Scalding|date=June 22, 2023|via=GitHub}} - a Scala API for Cascading that makes it easier to transition Cascading/Scalding code to Spark; by Twitter, available on GitHub

References

External links

[http://www.cascading.org/ Official website]

Category:Free software programmed in Java (programming language)

Category:Free system software

Category:Cloud infrastructure

Category:Hadoop