Data Community DC and District Data Labs are excited to be hosting a Fast Data Applications with Spark & Python workshop on November 8th For more info and to sign up, go to http://bit.ly/Zhj0y1. There’s even an early bird discount if you register before October 17th!
Hadoop has made the world of Big Data possible by providing a framework for distributed computing on economical, commercial off-the-shelf hardware. Hadoop 2.0 implements a distributed file system, HDFS, and a computing framework, YARN, that allows distributed applications to easily harness the power of clustered computing on extremely large data sets. Over the past decade, the primary application framework has been MapReduce – a functional programming paradigm that lends itself extremely well to designing distributed applications, but carries with it a lot of computational overhead.
Many excellent analytical applications and algorithms have been written in MapReduce, creating an ecosystem that has made Hadoop continue to grow as an effective tool. However, more complex algorithms, especially machine learning algorithms, often require extremely complex chains of jobs to conform to the MapReduce functional paradigm. Enter Spark, an open source Apache project that uses the cluster resource daemons of Hadoop (particularly HDFS and other Hadoop data stores) but allows developers to break out of the MapReduce paradigm and write distributed applications that are much faster.
Spark also distributes applications to a cluster by using distributed executor processes- Spark developers write applications that are intended to work on local data; however unlike with MapReduce, these executors are in communication with each other and can share data via an external store. Spark is intended to work with Hadoop data stores, but can be run in a stand alone mode, or if you already have a Hadoop 2.0 cluster- then Spark can be run with YARN. The flexibility that Spark provides means that it can be used to implement more complex algorithms and applications previously unavailable to MapReduce patterns.
Spark can run in memory, making it hundreds of times faster than disk based MapReduce, and provides a programming API in Scala, Java, and Python – making it more accessible to developers. Spark has an interactive command line interface to quickly interact with data on the cluster, and applications for writing SQL-like queries with Spark and a fairly complete Machine Learning library. Importantly, it can also execute Graph algorithms that were previously unable to be ported to MapReduce frameworks. Continue reading