Apache Spark

Apache Spark is an open-source cluster-computing framework . Originally Developed at the University of California, Berkeley ‘s AMPLab , the Spark codebase Was later Donated to the Apache Software Foundation , qui HAS maintained it since. Spark provides an interface for full programming with implicit data parallelism and fault tolerance .

Overview

Apache Spark has as its architectural foundation the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, which is maintained in a fault-tolerant way. [2] In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged [3] even though the RDD API is not deprecated . [4] [5] The RDD technology still underlies the Dataset API. [6]

Spark and Its DTH Were Developed in 2012 in response to limitations in the MapReduce computing cluster paradigm , qui forces Particular linear dataflow structure is distributed programs: MapReduce programs read input data from disk, map a function across the data, Reduce the results of the map, and store reduction results on disk. Spark’s DTH function as a working set for distributed programs That offers a (damaged deliberately) restricted form of distributed shared memory . [7]

Spark facilitates the implementation of both iterative algorithms , and the interactive database / querying of data. The latency of such applications may be reduced by several orders of magnitude compared to a MapReduce implementation (as was common in Apache Hadoop stacks). [2] [8] Among the classes of iterative algorithms are the training algorithms for machine learning systems, which form the initial impetus for developing Apache Spark. [9]

Apache Spark requires a cluster manager and a distributed storage system . For cluster management, standalone Spark supports (native Spark cluster), Hadoop YARN , or Apache Mesos . [10] For distributed storage, including Hadoop Distributed File System (HDFS) , [11] MapR File System (MapR-FS) , [12] Cassandra , [13] OpenStack Swift , Amazon S3 , Kudu, or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is running on a single machine with one executor per CPU core .

Spark Core

Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I / O Functionalities, exposed through an application programming interface (for Java , Python , Scala , and R ) centered on the RDD abstraction (the Java API is available for other JVM languages, goal is also usable for some other non-JVM languages, such as Julia , [14] that can connect to the JVM). This interface mirrors a functional / higher-order model of programming: a “driver” program invokes parallel operations such as map, filteror reduce on an RDD by passing a function to Spark, which then schedules the function of execution in parallel on the cluster. [2] These operations, and additional ones Such As join , take DTH as input and Produce new DTH. RDDs are immutable and their operations are lazy ; It is possible to keep track of the “lineage” of each RDD (so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, Java, or Scala objects.

Aside from the RDD-oriented functional style of programming, Spark Provides two restricted forms of shared variables: broadcast variable reference read-only data That needs to be disponible all nodes, while accumulators can be used to program reductions in an imperative style. [2]

A typical example of RDD-centric functional programming is the following Scala program that computes the frequencies of all the words in a set of text files and the most common ones. Each map , flatMap (a variant of map ) and reduceByKey takes an anonymous function that performs a simple operation on a single item (or a pair of items), and applies its argument to transform an RDD into a new RDD.

val conf = new SparkConf (). setAppName ( "wiki_test" ) // create a spark config object
val sc = new SparkContext ( // Add a count to one to each token, then sum the counts per word type .wordFreq . sortBy ( s => - s . _2 ) . map ( x => ( x . _2 , x . _1 )). top ( 10 )conf) // Create a spark context
val data = sc.textFile("/path/to/somedir") // Read files from "somedir" into an RDD of (filename, content) pairs.
val tokens = data.flatMap(_.split(" ")) // Split each file into a list of tokens (words).
val wordFreq = tokens.map((_, 1)).reduceByKey(_ + _)
 // Get the top 10 words. Swap word and count to spell by count.

Spark SQL

Spark SQL is a component of Spark Core that introduces DataFrames, [a] which provides support for structured and semi-structured data . Spark SQL provides a domain-specific language (DSL) to manipulate DataFrames in Scala , Java , or Python . It also provides SQL language support, with command-line interfaces and ODBC / JDBC server. Although DataFrames lacks the compile-time type-checking afforded by RDDs, as of Spark 2.0, the strongly typed DataSet is fully supported by Spark SQL as well.

import org.apache.spark.sql.SQLContext
val url = "jdbc: mysql: //yourIP: yourPort / test? user = yourUsername; password = yourPassword" // URL for your database server.
val sqlContext = new org . apache . spark . sql . SQLContext ( sc ) // Create a sql context object
df . printSchema () // Looks the schema of this DataFrame. val countsByAge = df . groupBy ( "age" ). count ()val df = sqlContext
 .read
 .format("jdbc")
 .option("url", url)
 .option("dbtable", "people")
 .load()
 // Counts people by age

Spark Streaming

Spark Streaming Leverages Spark Core’s fast scheduling capability to perform streaming analytics . It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, thus facilitating easy implementation of lambda architecture . [16] [17] However, this convenience comes with the penalty of latency equal to the mini-batch duration. Other streaming data engines That process event by event Rather than in batches include mini Storm and the streaming component of Flink . [18] Spark Streaming has support built-in to consume fromKafka , Flume , Twitter , ZeroMQ , Kinesis , and TCP / IP sockets . [19]

In Spark 2.x, a separate technology based on Datasets, called Structured Streaming, which has a higher-level interface is also provided to support streaming. [20]

MLlib Machine Learning Library

Spark MLlib is a distributed machine learning framework on Spark Core, which is due to large part to the distributed memory-based Spark architecture, is as much as fast as the disk-based implementation used by Apache Mahout(according to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, and before the Mahout itself gained a Spark interface), and scales better than Vowpal Wabbit . [21] Many common machine learning and statistical algorithms have been implemented with MLlib which simplifies large scale machine learning pipelines , including:

  • summary statistics , correlations , stratified sampling , hypothesis testing , random data generation [22]
  • classification and regression : support vector machines , logistic regression , linear regression , decision trees, naive Bayes classification
  • collaborative filtering techniques including alternating least squares (ALS)
  • cluster analysis methods Including k-means , and Latent Dirichlet Allocation (LDA)
  • Technical dimensionality reduction Such As singular value decomposition (SVD), and principal component analysis (PCA)
  • feature extraction and transformation functions
  • optimization algorithms such as stochastic gradient descent , limited-memory BFGS (L-BFGS)

GraphX

GraphX is a distributed graph processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX ​​is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph database . [23] GraphX ​​provides two separate APIs for massively parallel algorithms implementation (such as PageRank ): a Pregel abstraction, and a more general MapReduce API style. [24]Unlike its predecessor Bagel, which was formally deprecated in Spark 1.6, GraphX ​​has full support for graphs (graphs where properties can be attached to edges and vertices). [25]

GraphX ​​can be viewed as the Spark in-memory version of Apache Giraph , which uses Hadoop disk-based MapReduce. [26]

Like Apache Spark, GraphX ​​originally started as a research project at UC Berkeley ‘s AMPLab and Databricks, and was later donated to the Apache Software Foundation and the Spark project. [27]

History

Spark was first started by Matei Zaharia at UC Berkeley’s AMPLab in 2009, and open sourced in 2010 under a BSD license .

In 2013, the project was donated to the Apache Software Foundation and switched to Apache 2.0 . In February 2014, Spark became a Top-Level Apache Project . [28]

In November 2014, Spark founder Mr. Zaharia’s company Databricks set a new world record in large scale sorting using Spark. [29] [ third-party source needed ]

Spark had in excess of 1000 contributors in 2015, [30] making it one of the most active projects in the Apache Software Foundation [31] and one of the most active open source big data projects.

Given the popularity of the platform by 2014, paid programs like General Assembly and free fellowships like The Data Incubator -have started Offering customized training courses [32]

Version Original release date Latest version Release date
0.5 2012-06-12 0.5.1 2012-10-07
0.6 2012-10-14 0.6.2 2013-02-07 [33]
0.7 2013-02-27 0.7.3 2013-07-16
0.8 2013-09-25 0.8.1 2013-12-19
0.9 2014-02-02 0.9.2 2014-07-23
1.0 2014-05-30 1.0.2 2014-08-05
1.1 2014-09-11 1.1.1 2014-11-26
1.2 2014-12-18 1.2.2 2015-04-17
1.3 2015-03-13 1.3.1 2015-04-17
1.4 2015-06-11 1.4.1 2015-07-15
1.5 2015-09-09 1.5.2 2015-11-09
1.6 2016-01-04 1.6.3 2016-11-07
2.0 2016-07-26 2.0.2 2016-11-14
2.1 2016-12-28 2.1.1 2017-05-02
2.2 2017-07-11 2.2.1 2017-12-01
Legend:
Old version
Older version, still supported
Latest version
Latest preview version

See also

  • List of concurrent and parallel programming APIs / Frameworks

Notes

  1. Jump up^ Called SchemaRDDs before Spark 1.3 [15]

References

  1. Jump up^ “Spark Release 2.0.0” . MLlib in R: SparkR now offers MLlib APIs [..] Python: PySpark now offers many more MLlib algorithms “
  2. ^ Jump up to:d Zaharia, Matei; Chowdhury, Mosharaf; Franklin, Michael J .; Shenker, Scott; Stoica, Ion. Spark: Cluster Computing with Working Sets (PDF) . USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
  3. Jump up^ “Spark 2.2.0 Quick Start” . apache.org . 2017-07-11 . Retrieved 2017-10-19 . We highly recommend you to use Dataset, which has better performance than RDD
  4. Jump up^ “Spark 2.2.0 deprecation list” . apache.org . 2017-07-11 . Retrieved 2017-10-10 .
  5. Jump up^ Damji, Jules (2016-07-14). “A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets: When to use them and why” . databricks.com . Retrieved 2017-10-19 .
  6. Jump up^ Chambers, Bill (2017-08-10). “11”. Spark: The Definitive Guide(“Rough Cut” pre-print ed.). O’Reilly Media . virtually all Spark code run, where DataFrames or Datasets, compiles down to an RDD
  7. Jump up^ Zaharia, Matei; Chowdhury, Mosharaf; Das, Tathagata; Dave, Ankur; Ma, Justin; McCauley, Murphy; J., Michael; Shenker, Scott; Stoica, Ion. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (PDF) . USENIX Symp. Networked Systems Design and Implementation.
  8. Jump up^ Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion (June 2013). “Shark: SQL and Rich Analytics at Scale”(PDF) .
  9. Jump up^ Harris, Derrick (28 June 2014). “4 reasons why Spark could jolt Hadoop into hyperdrive” . Gigaom .
  10. Jump up^ “Cluster Mode Overview – Spark 1.2.0 Documentation – Cluster Manager Types” . apache.org . Apache Foundation. 2014-12-18 . Retrieved 2015-01-18 .
  11. Jump up^ Figure showing Spark in relation to other open-source Software projects including Hadoop
  12. Jump up^ MapR ecosystem support matrix
  13. Jump up^ Doan, DuyHai (2014-09-10). “Re: cassandra + spark / pyspark” . Cassandra User (Mailing list) . Retrieved 2014-11-21 .
  14. Jump up^ https://github.com/dfdx/Spark.jl
  15. Jump up^ https://spark.apache.org/releases/spark-release-1-3-0.html
  16. Jump up^ “Applying the Lambda Architecture with Spark, Kafka, and Cassandra | Pluralsight” . www.pluralsight.com . Retrieved 2016-11-20 .
  17. Jump up^ Shapira, Gwen (29 August 2014). “Building Lambda Architecture with Spark Streaming” . cloudera.com . Cloudera . Retrieved 17 June 2016 . re-use the same aggregates we wrote for our batch application on a real-time data stream
  18. Jump up^ “Benchmarking Streaming Computing Engines: Storm, Flink and Spark Streaming” (PDF) . IEEE. May 2016.
  19. Jump up^ Kharbanda, Arush (17 March 2015). “Getting Data into Spark Streaming” . sigmoid.com . Sigmoid (Sunnyvale, California IT product company) . Retrieved 7 July 2016 .
  20. Jump up^ Zaharia, Matei (2016-07-28). “Structured Streaming In Apache Spark: A new high-level API for streaming” . databricks.com . Retrieved 2017-10-19 .
  21. Jump up^ Sparks, Evan; Talwalkar, Ameet (2013-08-06). “Spark Meetup: MLbase, Distributed Machine Learning with Spark” . slideshare.net . Spark User Meetup, San Francisco, California . Retrieved 10 February 2014 .
  22. Jump up^ “MLlib | Apache Spark” . spark.apache.org . Retrieved 2016-01-18 .
  23. Jump up^ Malak, Michael (14 June 2016). “Finding Graph Isomorphisms In GraphX ​​And ​​GraphFrames: Graph Processing vs. Graph Database” . slideshare.net . sparksummit.org . Retrieved 11 July 2016 .
  24. Jump up^ Malak, Michael (1 July 2016). Spark GraphX ​​in Action . Manning. p. 89. ISBN  9781617292521 . Pregel and its little sibling aggregateMessages () are the cornerstones of graph processing in GraphX. … algorithms that require more flexibility for the implementation of aggregateMessages ()
  25. Jump up^ Malak, Michael (14 June 2016). “Finding Graph Isomorphisms In GraphX ​​And ​​GraphFrames: Graph Processing vs. Graph Database” . slideshare.net . sparksummit.org . Retrieved 11 July 2016 .
  26. Jump up^ Malak, Michael (1 July 2016). Spark GraphX ​​in Action . Manning. p. 9. ISBN  9781617292521 . Giraph is limited to slow Hadoop Map / Reduce
  27. Jump up^ Gonzalez, Joseph; Xin, Reynold; Dave, Ankur; Crankshaw, Daniel; Franklin, Michael; Stoica, Ion (Oct 2014). “GraphX: Graph Processing in a Distributed Dataflow Framework” (PDF) .
  28. Jump up^ “The Apache Software Foundation Announces Apache & # 8482 Spark & ​​# 8482 as a Top-Level Project” . apache.org . Apache Software Foundation. February 27, 2014 . Retrieved 4 March 2014 .
  29. Jump up^ Spark officiellement sets a new record in wide-scale sorting
  30. Jump up^ HUB open Spark development activity
  31. Jump up^ “The Apache Software Foundation Announces Apache & # 8482 Spark & ​​# 8482 as a Top-Level Project” . apache.org . Apache Software Foundation. February 27, 2014 . Retrieved 4 March 2014 .
  32. Jump up^ “NY gets new bootcamp for data scientists: It’s free, but harder to get into than Harvard” . Beat Venture . Retrieved 2016-02-21 .
  33. Jump up^ “Spark News” . apache.org . Retrieved 2017-03-30 .