What is Apache Spark™?

Apache Spark™ is a high performing, sophisticated, open source analytics engine built around speed, performance and ease of use. Developed in 2009 at UC Berkeley’s AMPLab and open sourced in 2010, Spark is a fast and general cluster computing system for big data, providing an optimized engine that supports general computation graphs for data analysis.

Spark has advanced DAG execution engine and supports cyclic data flow and in-memory computing. Spark run workload is 100x faster than the competing engine—Hadoop.

Apache Spark Ecosystem

Spark has an increasing number of use cases in various industries—including retail, healthcare, finance, advertising, and education. It powers many new-age companies and has contributors from 300+ companies. It is continuing to gain traction in the big data space.

Apache Spark Ecosystem components make it more popular than other big data frameworks. It is a platform for many use cases ranging from real-time data analytics, structured data processing, and graph processing.

Programming Language Support: Apache Spark supports popular languages for data analysis like Python and R, as well as enterprise-friendly Java and Scala. Whether you are an application developer or a data scientist, anyone can harness the scalability and speed of Apache Spark.

Spark Components

Spark components have a set of powerful, higher-level libraries that can be seamlessly used across the application.

Spark SQL is a module for working with the structured data. Spark SQL provides a standard interface for reading from and writing to other datastores. It also provides powerful integration with the rest of the Spark ecosystem (for example, integrating SQL query processing with machine learning). A server mode provides industry-standard JDBC and ODBC connectivity for business intelligence tools.

An early addition to Apache Spark, Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. It enables powerful, interactive, and analytic applications streaming across both real-time and historical data and readily integrates with a wide variety of popular data sources. Spark Streaming extended the Apache Spark concept of batch processing into streaming by breaking the stream down into a continuous series of microbatches.

A general processing engine for the Spark platform provides in-memory computing capabilities to deliver fast execution of a wide variety of applications. Spark Core component is the foundation for parallel and distributed processing of large datasets. It provides distributed task dispatching, scheduling, and basic I/O functionality. It also handles node failures and re-computes missing pieces.

Spark supports many data sources including (with the Spark Cassandra connector) Apache Cassandra, Apache Kafka, Kinesis and Flumes.

Cluster Management in Apache Spark

Apache Spark applications can run in 3 different cluster managers. These are:

  1. Standalone Cluster: Used if only Spark is running. In the standalone cluster mode Spark manages its own cluster. Each application runs an executor on every node within the cluster.
  2. Apache Mesos: A dedicated cluster and resource manager that provides rich resource scheduling capabilities.
  3. YARN: Comes with most of the Hadoop distributions and is the only cluster manager in Spark that supports security. It allows dynamic sharing and central configuration of the same pool of cluster resources between various frameworks that run on YARN.

Apache Spark Architecture

Spark has defined and layered architecture where all the spark components and layers are loosely coupled and integrated with various extensions and libraries.

It uses a master/worker architecture. There is a driver that talks to a single coordinator called master that manages workers in which executors run.

Spark Advantages

Spark Engine is responsible for scheduling, distributing, and monitoring the data application across the cluster and has many advantages. It detects the pattern and provides actionable insight into the big data in real-time. It is the most active open big data tool reshaping the big data market.

Spark lets you write the application in different programming languages and for the developer working on it, this functionally helps make their life easier. Spark Streaming can handle real-time stream processing along with integrations of other frameworks which concludes that spark streaming ability is easy, fault-tolerant, and integrated.

Since Spark cluster can be deployed as a standalone or in the cluster mode it can access data from a diverse source. Also, the project has many developers contributing and has an active mailing list and JIRA for issue tracking.

Additionally, Spark it lets you seamlessly combine various libraries to create a workflow and manage analytics. The in-memory analytics accelerates machine learning algorithms and reduces data read and write round trip from/to disk.

Spark Elasticsearch

ElasticSearch is a JSON database popular with log processing systems. Since Spark 2.1, Spark has included native Elasticsearch support, known as Elasticsearch-Hadoop. Elasticsearch-Hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, it allows Elasticsearch to be used in Spark in two ways: through the dedicated support available since 2.1, or through the Map/Reduce bridge since 2.0. Elassandra (Elasticsearch + Cassandra) is a fork of Elasticsearch modified to run on top of Apache Cassandra to provide advanced search features on Cassandra tables.

Apache Spark Use Case

Apache Spark’s ability to process streaming data determines its key use case. With so much data being generated on a daily basis, it has become essential for companies to be able to stream and analyze in real time.

Apache Spark allows for entirely new use cases to enhance the value of big data. Some of the areas where Spark can be used include:

  • Supply chain optimization and maintenance
  • Optimisation in advertising and marketing to understand the probability of users clicking on available ads to maximize revenue/engagement.
  • Fraud detection by conducting real-time inspections of data packets to trace malicious activity and anomalous behaviour of users.
  • Apache Spark is fast enough to perform exploratory queries without sampling.
  • Fog computing increasingly requires low latency, massively parallel processing of machine learning and extremely complex graph analytics algorithms. Spark due to its key component qualifies as a fog computing solution.

Spark simplifies intensive, high volume, real-time, or archived data processing tasks and seamlessly integrates relevant complex capabilities such as machine learning and graph algorithms. Its unlimited potential is only limited by imagination.


NB: This article originally appeared in a slightly longer form and permission has been kindly given by Instaclustr®, one of the founding members of the OpenSI, for it reuse.

Tagged in:

More News

What is Apache Cassandra®?

This is the second in our OpenSI series where we look at an open source project in more detail. This week the focus is on Apache Cassandra®.

Read post

Developer Training and Certification for Apache Kafka®

Advance your software engineering or system administration career with this short 6-week course, where you’ll acquire the fundamental skills required to develop and operate one of the world’s most advanced and influential technologies.

Read post