Spark as a compute engine is very widely accepted by most industries. Most of the old data platforms based on MapReduce jobs have been migrated to Spark-based jobs, and some are in the phase of migration. In short, batch computation is being done using Spark. As a result, organizations’ infrastructure and expertise have been developed around Spark.
So, the now question is: can Spark solve the problem of batch consumption of data inherited from Kafka? The answer is yes.
The advantages of doing this are: having a unified batch computation platform, reusing existing infrastructure, expertise, monitoring, and alerting.
Click through to get to the starting point on this as well as a few tips to avoid stumbling blocks.
For security purposes, Databricks Apache Spark clusters are deployed in an isolated VPC dedicated to Databricks within the customer’s account. In order to run their data workloads, there is a need to have secure connectivity between the Databricks Spark Clusters and the above data sources.
It is straightforward for Databricks clusters located within the Databricks VPC to access data from AWS S3 which is not a VPC specific service. However, we need a different solution to access data from sources deployed in other VPCs such as AWS Redshift, RDS databases, streaming data from Kinesis or Kafka. This blog will walk you through some of the options you have available to access data from these sources securely and their cost considerations for deployments on AWS. In order to establish a secure connection to these data sources, we will have to configure the Databricks VPC with either one of the following two available options :
Read on for those two options.
Create a List[Column] with column names.
scala> var selectExpr : List[Column] = List("Type","Item","Price")
<console>:25: error: not found: type Column var selectExpr : List[Column] = List("Type","Item","Price") ^
If you are getting the same error Please take a look into this page .
Read on for the demo.
Service (application) exposes the JMX metrics at some port which will be captured by Jolokia java agent. Then Jolokia exposes those metrics at some port which is easily accessible through a rest endpoint (we call it Jolokia URL). Then we have JMX2Graphte which polls the metrics from Jolokia URL and push it to Graphite. Then Grafana reads the Graphite metrics and creates a beautiful dashboard for us along with the alerts.
So this is the working of the proposed monitoring solution. Now let’s discuss the components of the monitoring solution.
There’s a bit of code/configuration in here as well, so check it out.
Before going further let’s discuss on the below parameters which I have given for a Job.
Click through to see what these do and why Leela chose these settings. The Spark documentation has the full list of settings but it’s good to hear explanations from practitioners.
The Datasets API provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. You can define Dataset objects and then manipulate them using functional transformations (map, flatMap, filter, and so on) similar to an RDD. The benefits are that, unlike RDDs, these transformations are now applied on a structured and strongly typed distributed collection that allows Spark to leverage Spark SQL’s execution engine for optimization.
Read on for more details and a few examples of how to operate DataFrames and Datasets.
The Automatic Scaling feature in Amazon EMR lets customers dynamically scale clusters in and out, based on cluster usage or other job-related metrics. These features help you use resources efficiently, but they can also cause EC2 instances to shut down in the middle of a running job. This could result in the loss of computation and data, which can affect the stability of the job or result in duplicate work through recomputing.
To gracefully shut down nodes without affecting running jobs, Amazon EMR uses Apache Hadoop‘s decommissioning mechanism, which the Amazon EMR team developed and contributed back to the community. This works well for most Hadoop workloads, but not so much for Apache Spark. Spark currently faces various shortcomings while dealing with node loss. This can cause jobs to get stuck trying to recover and recompute lost tasks and data, and in some cases eventually crashing the job.
Auto-scaling doesn’t always mean scaling up.
SQLContext is your gateway to SparkSQL. Here is how you create a SQLContext using the SparkContext.
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Once you have the SQLContext you can start working with DataFrame, DataSet etc.
Knowing the right entry point is important.
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.
One use of Spark SQL is to execute SQL queries. When running SQL from within another programming language the results will be returned as a Dataset/DataFrame.
Before exploring these APIs, let’s understand the need for these APIs.
I like the piece about RDDs being better at explaining the how than the what.
Now our example dataframe is ready.
Create a List[String] with column names.
scala> var selectExpr : List[String] = List("Type","Item","Price") selectExpr: List[String] = List(Type, Item, Price)
Now our list of column names is also created.
Lets select these columns from our dataframe.
Use .head and .tail to select the whole values mentioned in the List()
Click through for a demo.