Data computed by Spark can come from multiple data sources, such as local files and HDFS, but most of it comes from HDFS, which can read data at large scale for parallel computing. After being computed, data can be stored back into HDFS.
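For example, a minimal Scala sketch of this read-compute-write cycle (the HDFS paths, class name, and word-count logic are illustrative assumptions, not part of the original text):

    import org.apache.spark.sql.SparkSession

    object HdfsReadWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("HdfsReadWrite").getOrCreate()
        val sc = spark.sparkContext

        // Read a text file from HDFS; it is split into partitions for parallel computing.
        val lines = sc.textFile("hdfs:///tmp/input/fileA.txt")

        // Compute in parallel on the Executors: a simple word count.
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Store the computed data back into HDFS.
        counts.saveAsTextFile("hdfs:///tmp/output/wordcount")

        spark.stop()
      }
    }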
Spark involves two roles: Driver and Executor. The Driver schedules tasks, and the Executors run them.
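As a rough sketch of this division of labor (reusing the SparkContext sc from the example above; the numbers are arbitrary): the functions passed to transformations run as tasks on the Executors, while scheduling and the results of actions stay in the Driver.

    // Built in the Driver: the RDD lineage and the task scheduling plan.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 8)

    // The closure given to map() is shipped to the Executors and runs there as tasks.
    val squares = numbers.map(x => x.toLong * x)

    // reduce() is an action: Executors compute per-partition partial sums,
    // and the Driver combines them into the final result.
    val total = squares.reduce(_ + _)
    println(s"Sum of squares: $total")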
Figure 1 shows the process of reading a file.
Figure 1 File reading process

The file reading process is as follows:
1. The Driver interconnects with HDFS to obtain the information of the file to be read.
2. HDFS returns the detailed block information about this file.
3. The Driver sets a degree of parallelism based on the amount of block data and creates tasks to read the blocks of the file.
4. The Driver sends the tasks to the Executors.
5. The Executors run the tasks and read the blocks as partitions of an RDD; the data is then computed.
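This block-driven parallelism is visible in the API; a small sketch reusing sc from the first example (the path and the value 16 are illustrative assumptions):

    // By default the partitions of a file-based RDD follow the HDFS block splits,
    // and one read task is created per partition.
    val blocks = sc.textFile("hdfs:///tmp/input/fileA.txt")
    println(s"Partitions (read tasks): ${blocks.getNumPartitions}")

    // A higher lower bound on parallelism can be requested at read time.
    val moreParallel = sc.textFile("hdfs:///tmp/input/fileA.txt", minPartitions = 16)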
Figure 2 shows the process of writing data to a file.
Figure 2 File writing process

The file writing process is as follows:
1. The Driver creates the directory to which the file is to be written.
2. Based on the RDD distribution, the Driver computes the number of write tasks and sends these tasks to the Executors.
3. The Executors run the tasks and write the computed data to files in that directory.
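The one-write-task-per-partition behavior shows up directly in the output layout; a sketch reusing the counts RDD from the first example (the partition count and path are illustrative assumptions):

    // Four partitions means four write tasks, producing part-00000 ... part-00003
    // under the directory created by the Driver.
    counts.repartition(4).saveAsTextFile("hdfs:///tmp/output/wordcount_4parts")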
Spark computing and scheduling can be implemented in Yarn mode: Spark uses the computing resources provided by the Yarn cluster and runs tasks in a distributed way. Spark on Yarn has two modes: Yarn-cluster and Yarn-client.
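The mode is chosen when the application is submitted; for example (the class and JAR names are illustrative assumptions):

    # Yarn-cluster mode: the Driver runs on the cluster, inside ApplicationMaster.
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.HdfsReadWrite myapp.jar

    # Yarn-client mode: the Driver runs in the submitting client process.
    spark-submit --master yarn --deploy-mode client \
      --class com.example.HdfsReadWrite myapp.jar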
Figure 3 shows the running framework of Spark on Yarn-cluster.
Figure 3 Spark on Yarn-cluster operation framework

The Spark on Yarn-cluster implementation process is as follows:
1. The client generates the application information and submits it to ResourceManager.
2. ResourceManager allocates the first container to the Spark application and starts ApplicationMaster, which contains the Driver, in that container.
3. ApplicationMaster applies to ResourceManager for the resources needed to run the executors.
4. ResourceManager allocates containers to ApplicationMaster, which communicates with the related NodeManagers and starts the executors in the obtained containers. After an executor is started, it registers with the Driver and applies for tasks.
5. The Driver allocates tasks to the executors.
6. The executors run the tasks and report their operating status to the Driver.
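The executor containers that ApplicationMaster requests in this process can be sized at submission time; for example (all values are illustrative assumptions):

    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.HdfsReadWrite \
      --num-executors 4 \
      --executor-cores 2 \
      --executor-memory 4g \
      --driver-memory 2g \
      myapp.jar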
Figure 4 shows the running framework of Spark on Yarn-client.
Figure 4 Spark on Yarn-client operation framework

The Spark on Yarn-client implementation process is as follows:
In Yarn-client mode, the Driver is deployed and started on the client. The Yarn-client mode is incompatible with clients of earlier versions; you are advised to use the Yarn-cluster mode.
1. The client submits the Spark application request to ResourceManager.
2. ResourceManager starts ApplicationMaster on a suitable node. In this mode, ApplicationMaster only applies for resources for the executors; the Driver keeps running on the client.
3. ApplicationMaster applies to ResourceManager for the containers needed to run tasks.
4. ResourceManager allocates the containers to ApplicationMaster, which communicates with the related NodeManagers and starts the executors in the obtained containers. After the executors are started, they register with the Driver and apply for tasks.
5. The Driver allocates tasks to the executors, which run them and report their operating status. Running containers are not suspended, and their resources are not released.
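Because the Driver stays on the client in this mode, interactive tools depend on it; for example, spark-shell on Yarn always runs with client-side semantics (cluster deploy mode is not applicable to shells), so the Driver is the local shell process:

    # The shell JVM on the client is the Driver; executors run in Yarn containers.
    spark-shell --master yarn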