SPARK

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.


Spark Architecture

1. Key Components of Spark Architecture
1.1. Driver Program
The Driver is the main process that controls the execution of a Spark application.
It creates the SparkContext (or SparkSession in newer versions) to coordinate tasks across the cluster.
It converts user code into a Directed Acyclic Graph (DAG) and schedules execution.

1.2. Cluster Manager
Responsible for resource allocation across the cluster.
Types of cluster managers Spark can use:
Standalone Cluster Manager (Built-in)
Apache YARN (For Hadoop clusters)
Apache Mesos (For multi-framework clusters)
Kubernetes (For containerized Spark applications)

1.3. Executors
Worker processes that execute tasks assigned by the driver.
Each executor runs multiple tasks in parallel.
Executors store data in memory to improve performance.

1.4. Tasks
A task is a unit of work executed by an executor.
Multiple tasks run in parallel across executors.

2. Spark Execution Flow
Application Submission: The user submits a Spark job.
SparkContext Initialization: The driver program starts and creates the SparkContext/SparkSession.
DAG Creation & Scheduling: The driver transforms user code into a DAG (Directed Acyclic Graph).
Task Distribution: The DAG is split into stages and tasks, which are assigned to executors.
Execution & Computation: Executors process data in parallel and store intermediate results.
Result Collection: The final results are sent back to the driver.

3. Spark Components Interaction
RDD (Resilient Distributed Dataset): The fundamental data structure in Spark that is immutable and fault-tolerant.
DAG (Directed Acyclic Graph): A logical execution plan representing dependencies between tasks.
Transformations & Actions:
Transformations (e.g., map(), filter()) create new RDDs.
Actions (e.g., collect(), count()) trigger execution.

4. Spark Execution Modes
Local Mode: Runs on a single machine (for testing/debugging).
Cluster Mode: Distributes execution across multiple nodes. Within it, the deploy mode determines where the driver runs:
Client Mode: The driver runs on the client machine, while executors run in the cluster.
Cluster Mode (deploy mode): Both the driver and the executors run inside the cluster.


Components of Spark

1. Spark Core
2. Spark SQL