Introduction to Apache Spark

Avinash Navlani
5 min read · Feb 11, 2023

In this tutorial, we will cover Spark and the Spark framework, its architecture and how it works, Resilient Distributed Datasets (RDDs) and RDD operations, the programming languages used with Spark, and a comparison of Spark with MapReduce.

Spark is a fast cluster computing system that is compatible with Hadoop. It can work with any Hadoop-supported storage system, such as HDFS or S3. Spark uses in-memory computing to improve efficiency: intermediate results are kept in memory where possible instead of being written to disk between steps. Spark also uses caching to speed up repeated queries. Spark is up to 100 times…
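The effect of caching on repeated queries can be illustrated with a small toy model. This is a minimal sketch in plain Python, not Spark's API: the `ToyDataset` class, its `compute_calls` counter, and the `expensive_squares` function are all hypothetical names made up for illustration. In Spark itself, the comparable step is calling `cache()` or `persist()` on an RDD or DataFrame.

```python
# Toy model in plain Python (NOT the Spark API): a lazily computed dataset
# whose records can be kept in memory and reused by later queries.

class ToyDataset:
    """Hypothetical stand-in for a dataset produced by an expensive step."""

    def __init__(self, compute):
        self._compute = compute    # function that produces the records
        self._use_cache = False    # whether to keep records in memory
        self._cached = None        # in-memory copy, filled on first use
        self.compute_calls = 0     # how many times the records were recomputed

    def cache(self):
        """Mark the dataset so its records stay in memory after first use."""
        self._use_cache = True
        return self

    def collect(self):
        """Return all records, recomputing them unless a cached copy exists."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        records = list(self._compute())
        if self._use_cache:
            self._cached = records
        return records

    def count(self):
        return len(self.collect())

    def total(self):
        return sum(self.collect())


def expensive_squares():
    # Stands in for a costly intermediate computation (assumption for the demo).
    return (n * n for n in range(1, 6))


# Without caching, every query recomputes the intermediate result.
uncached = ToyDataset(expensive_squares)
uncached.count()
uncached.total()

# With caching, the first query computes the records and later ones reuse them.
cached = ToyDataset(expensive_squares).cache()
cached.count()
cached.total()

print("recomputations without cache:", uncached.compute_calls)
print("recomputations with cache:", cached.compute_calls)
```

Running two queries against the uncached dataset triggers two recomputations, while the cached dataset computes its records once and serves the second query from memory. Real Spark adds partitioning, fault tolerance, and configurable storage levels on top of this basic idea.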

Avinash Navlani

Sr Data Scientist | Analytics Consulting | Data Science Communicator | Helping Clients to Improve Products & Services with Data