Introduction to Apache Spark

Avinash Navlani
5 min read · Feb 11, 2023

In this tutorial, we will cover Spark, the Spark framework, its architecture and how it works, Resilient Distributed Datasets (RDDs), RDD operations, the programming languages Spark supports, and a comparison of Spark with MapReduce.

Spark is a fast cluster computing system that is compatible with Hadoop. It can work with any Hadoop-supported storage system, such as HDFS or S3. Spark uses in-memory computing to improve efficiency: intermediate results are kept in memory rather than saved to disk. Spark also uses caching to handle repeated queries. Spark can be up to 100x faster than Hadoop MapReduce on in-memory workloads. Spark is developed in Scala.
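The caching idea can be illustrated without a cluster. The following is a toy, pure-Python model and not Spark's API: the class `ToyDataset` and its methods are hypothetical names invented for this sketch. It evaluates a transformation only when a result is requested and, once `cache()` has been called, keeps the computed result in memory so that repeated queries do not recompute it.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """Toy stand-in for a cacheable dataset (hypothetical, not a Spark class)."""

    def __init__(self, source: Iterable[int], transform: Callable[[int], int]):
        self._source = list(source)
        self._transform = transform
        self._use_cache = False
        self._cached: Optional[List[int]] = None
        self.compute_count = 0  # how many times the full transformation ran

    def cache(self) -> "ToyDataset":
        # Request that the result be kept in memory after the first computation.
        self._use_cache = True
        return self

    def _materialize(self) -> List[int]:
        if self._use_cache and self._cached is not None:
            return self._cached  # served from memory, no recomputation
        self.compute_count += 1
        result = [self._transform(x) for x in self._source]
        if self._use_cache:
            self._cached = result
        return result

    def total(self) -> int:
        return sum(self._materialize())

    def maximum(self) -> int:
        return max(self._materialize())


# Two queries without caching: the transformation runs twice.
uncached = ToyDataset(range(1, 6), lambda x: x * x)
uncached.total()
uncached.maximum()

# Two queries with caching: the transformation runs once.
cached = ToyDataset(range(1, 6), lambda x: x * x).cache()
cached.total()
cached.maximum()

print(uncached.compute_count, cached.compute_count)  # prints: 2 1
```

In Spark itself, the analogous request is made by calling `cache()` (or `persist()`) on an RDD or DataFrame; the sketch above only mimics the effect on a single machine.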

Spark is another Big Data framework, and it supports in-memory processing. Hadoop MapReduce reads and writes data directly from and to disk, which wastes a significant amount of time on disk I/O. To tackle this, Spark stores intermediate results in memory, which reduces disk I/O and speeds up processing.
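As a rough illustration of the difference, the sketch below runs the same two-stage computation in two ways: once writing the intermediate result to a temporary file and reading it back, and once passing it along in memory. It is a single-machine toy in plain Python, not how Hadoop or Spark are implemented, and the function names `stage_one`, `stage_two`, `run_via_disk`, and `run_in_memory` are made up for this example.

```python
import json
import os
import tempfile
from typing import List


def stage_one(numbers: List[int]) -> List[int]:
    # First processing step: double every value.
    return [n * 2 for n in numbers]


def stage_two(numbers: List[int]) -> int:
    # Second processing step: add everything up.
    return sum(numbers)


def run_via_disk(numbers: List[int]) -> int:
    """Write the intermediate result to a temp file, then read it back."""
    intermediate = stage_one(numbers)
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(intermediate, handle)  # extra disk write
        with open(path) as handle:
            reloaded = json.load(handle)  # extra disk read
    finally:
        os.remove(path)
    return stage_two(reloaded)


def run_in_memory(numbers: List[int]) -> int:
    """Pass the intermediate result straight to the next stage."""
    return stage_two(stage_one(numbers))


data = list(range(10))
print(run_via_disk(data), run_in_memory(data))  # prints: 90 90
```

Both paths produce the same answer; the disk-based path simply pays for an additional write and read between stages, which is the cost that keeping intermediate results in memory avoids.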

Spark Framework

  • Programming: Spark offers APIs in Scala, Java, Python, and R.
  • Libraries: Spark offers libraries for particular tasks, such as Spark SQL, MLlib, GraphX, and Streaming.
  • Engine: Spark has its own execution engine, “Spark Core”, which executes all Spark jobs.
  • Management: Spark uses YARN, Mesos, and Spark…
