In the process of building a data product, you typically need to apply many resource-intensive analytical operations to a medium-to-large dataset efficiently. Apache Spark is a strong fit for this scenario: it speeds up job execution by caching data in memory and enabling parallelism across a distributed data environment.
Components involved in a typical Spark implementation:
1. Initialize a Spark session from a Scala program
2. Ingest data from the data lake through Hive queries
3. Apply business logic using Scala constructs or Hive queries
4. Load the results into HDFS or Hive targets
5. Execute the Spark program through spark-submit
Read on for a sample flow.
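The following is a minimal sketch of steps 1 through 4 in Scala. The table names (`sales_db.transactions`, `sales_db.customer_totals`), the HDFS output path, and the business rule (total spend per customer) are hypothetical placeholders, not part of any real pipeline:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SampleFlow {
  def main(args: Array[String]): Unit = {
    // 1. Initialize the Spark session with Hive support so Hive tables are visible
    val spark = SparkSession.builder()
      .appName("SampleFlow")
      .enableHiveSupport()
      .getOrCreate()

    // 2. Ingest data from the data lake through a Hive query
    //    (sales_db.transactions is a hypothetical source table)
    val transactions = spark.sql(
      "SELECT customer_id, amount, txn_date FROM sales_db.transactions")

    // Cache the ingested data in memory, since it may be reused downstream
    transactions.cache()

    // 3. Apply business logic with Scala/DataFrame constructs:
    //    total spend per customer (an illustrative rule)
    val customerTotals = transactions
      .groupBy("customer_id")
      .agg(sum("amount").alias("total_spend"))

    // 4a. Load the result into a Hive target table
    customerTotals.write
      .mode("overwrite")
      .saveAsTable("sales_db.customer_totals")

    // 4b. Alternatively, write to an HDFS path as Parquet
    customerTotals.write
      .mode("overwrite")
      .parquet("hdfs:///data/output/customer_totals")

    spark.stop()
  }
}
```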
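For step 5, the packaged job can be launched with spark-submit. The jar name, main class, and resource settings below are illustrative assumptions; tune them for your cluster:

```bash
spark-submit \
  --class SampleFlow \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  sample-flow.jar
```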