Introduction to Apache Spark

5 min read · Feb 11, 2023

In this tutorial, we will cover Spark and the Spark framework, its architecture and how it works, Resilient Distributed Datasets (RDDs), RDD operations, the programming languages Spark supports, and a comparison of Spark with MapReduce.

Spark is a fast cluster computing system that is compatible with Hadoop. It can work with any Hadoop-supported storage system, such as HDFS or S3. Spark uses in-memory computing to improve efficiency: it keeps intermediate results in memory rather than writing them to disk between steps. Spark also uses caching to handle repetitive queries. Spark can be up to 100x faster than Hadoop MapReduce for in-memory workloads. Spark is developed in Scala.
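To make the caching idea concrete, here is a toy sketch in plain Python. It is not Spark's API: the `ToyDataset` class, its `cache` and `collect` methods, and the `expensive_squares` function are hypothetical names chosen for illustration. The sketch only shows why caching helps with repeated queries: without it, the same result is recomputed every time it is requested.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for a lazily computed dataset (not a Spark class)."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute                    # recipe used to (re)build the data
        self._cached: Optional[List[int]] = None   # in-memory copy, if any
        self._use_cache = False
        self.compute_calls = 0                     # how often the recipe actually ran

    def cache(self) -> "ToyDataset":
        # Mark the dataset so the first computed result is kept in memory.
        self._use_cache = True
        return self

    def collect(self) -> List[int]:
        if self._cached is not None:
            return self._cached                    # served from memory
        self.compute_calls += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def expensive_squares() -> List[int]:
    # Stands in for a costly multi-stage job.
    return [n * n for n in range(5)]


uncached = ToyDataset(expensive_squares)
cached = ToyDataset(expensive_squares).cache()

# Run the same "query" three times against each dataset.
for _ in range(3):
    uncached_total = sum(uncached.collect())
    cached_total = sum(cached.collect())

print(uncached.compute_calls)  # recomputed on every query
print(cached.compute_calls)    # computed once, then reused
```

Both datasets return the same totals; the only difference is how many times the underlying computation runs.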

Like Hadoop, Spark is a Big Data framework, but it supports in-memory processing. Hadoop MapReduce reads and writes data directly from and to disk, which wastes a significant amount of time on disk I/O. To address this, Spark stores intermediate results in memory, which reduces disk I/O and speeds up processing.
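The difference described above can be sketched as a toy model in plain Python, with no Hadoop or Spark involved. The function names `run_with_disk` and `run_in_memory`, the three stages, and the JSON files are all assumptions made for illustration. The sketch only counts how many disk reads and writes each approach performs for the same pipeline.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple

Stage = Callable[[List[int]], List[int]]


def run_with_disk(data: List[int], stages: List[Stage], workdir: str) -> Tuple[List[int], int]:
    """Write each stage's output to a file and read it back for the next stage."""
    disk_ops = 0
    for i, stage in enumerate(stages):
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:   # write the intermediate result to disk
            json.dump(stage(data), f)
        disk_ops += 1
        with open(path) as f:        # read it back before the next stage
            data = json.load(f)
        disk_ops += 1
    return data, disk_ops


def run_in_memory(data: List[int], stages: List[Stage]) -> Tuple[List[int], int]:
    """Pass each stage's output directly to the next stage, with no files."""
    for stage in stages:
        data = stage(data)
    return data, 0


# A hypothetical three-stage pipeline: double, filter, increment.
stages: List[Stage] = [
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x for x in xs if x > 4],
    lambda xs: [x + 1 for x in xs],
]

with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_ops = run_with_disk([1, 2, 3, 4], stages, workdir)

memory_result, memory_ops = run_in_memory([1, 2, 3, 4], stages)

print(disk_result, disk_ops)
print(memory_result, memory_ops)
```

Both runs produce the same result, but the disk-based run pays for one write and one read per stage, which is the overhead that in-memory processing avoids.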

Spark Framework

Programming: Spark offers APIs in Scala, Java, Python, and R.
Libraries: Spark offers libraries for specific tasks, such as Spark SQL, MLlib, GraphX, and Spark Streaming.
Engine: Spark has its own execution engine, “Spark Core”, which executes all Spark jobs.
Management: Spark uses YARN, Mesos, and Spark…

Written by Avinash Navlani