Posts

Showing posts from January, 2025
SPARK

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Spark-Architecture:

1. Key Components of Spark Architecture

1.1. Driver Program
The Driver is the main process that controls the execution of a Spark application. It creates the SparkContext (or SparkSession in newer versions) to coordinate tasks across the cluster. It converts user code into a Directed Acyclic Graph (DAG) and schedules execution.

1.2. Cluster Manager
Responsible for resource allocation across the cluster. Types of cluster managers Spark can use:
Standalone Cluster Manager (Built-in)
Apache YARN (For Hadoop clusters)
Apache Mesos (For...
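
To make the Driver's role a little more concrete, here is a small pure-Python toy model of the idea that transformations are only recorded as a graph and nothing runs until an action is called. This is not Spark's API: the `Node` class and the `parallelize`, `lineage`, and `collect` names are hypothetical stand-ins chosen for illustration, and the sketch handles only a linear chain of steps, which is the simplest case of a DAG.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """One step in the recorded plan (hypothetical toy class, not a Spark API)."""
    name: str
    fn: Callable[[list], list]
    parent: Optional["Node"] = None

    def map(self, f):
        # Transformation: record a new step, do not execute anything yet.
        return Node("map", lambda rows: [f(r) for r in rows], self)

    def filter(self, pred):
        # Transformation: record a new step, do not execute anything yet.
        return Node("filter", lambda rows: [r for r in rows if pred(r)], self)

    def lineage(self):
        # Walk back to the source and return the step names in execution order.
        node, steps = self, []
        while node is not None:
            steps.append(node.name)
            node = node.parent
        return list(reversed(steps))

    def collect(self):
        # Action: only now is the recorded chain executed, source first.
        chain, node = [], self
        while node is not None:
            chain.append(node)
            node = node.parent
        rows = []
        for step in reversed(chain):
            rows = step.fn(rows)
        return rows


def parallelize(data):
    # Source step: ignores its input and produces the initial rows.
    return Node("source", lambda _rows: list(data))


plan = parallelize(range(1, 6)).map(lambda x: x * 10).filter(lambda x: x > 20)
print(plan.lineage())  # ['source', 'map', 'filter']
print(plan.collect())  # [30, 40, 50]
```

In this toy, building `plan` does no work at all; the data is touched only inside `collect()`. A real Driver additionally splits the graph into stages and tasks and hands them to executors through the cluster manager, which this sketch does not attempt to model.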