Spark Application Master
To access the Spark UI for the running application and get more detailed information on its execution, use the Application Master link and navigate through the tabs, which cover jobs, stages, executors and so on.
These methods also apply to on-prem Spark clusters, although the resource locations might be a little different.
In Spark 2.1, we drastically improve the initial latency of queries that touch a small fraction of table partitions. In some cases, queries that took tens of minutes on a fresh Spark cluster now execute in seconds. Our improvements cut down on table memory overheads, and make the SQL experience starting cold comparable to that on a “hot” cluster with table metadata fully cached in memory.
This looks like a nice improvement in Spark.
In a similar spirit to how sparklyr allowed us to reuse our functions from the dplyr package to manipulate Spark DataFrames, the RxSpark API allows a data scientist to develop code that can be deployed in a multitude of environments. This allows the developer to shift their focus from writing code that’s specific to a certain environment, and instead focus on the complex analysis of their data science problem. We call this flexibility Write Once, Deploy Anywhere, or WODA for the acronym lovers.
For a deeper dive into the RevoScaleR package, I recommend you take a look at the online course, Analyzing Big Data with Microsoft R Server. Much of this blogpost follows along the last section of the course, on deployment to Spark.
R isn’t just for small, one-off jobs anymore.
By leveraging Spark for distribution, we can achieve the same results much more quickly and with the same amount of code. By keeping data in HDFS throughout the process, we were able to ingest the same data as before in about 36 seconds. Let’s take a look at the Spark code that produced results equivalent to those of the bash script shown above. Note that a more parameterized version of this code, and of all code referenced in this article, can be found below in the Resources section.
Read the whole thing.
There are several settings you can adjust. Basically, there are two main files in the ZEPPELIN_DIR\conf :
In the first one you can configure some interpreter settings. In the second you can configure aspects related to the website, such as the Zeppelin server port (I am using 8080, but most probably yours is already used by another application).
This is a very clear walkthrough. Jupyter is still easier to install, but Paul’s blog post lowers that Zeppelin installation learning curve.
The roadmap for implementation was pretty straightforward:
Collect the raw data set of the lyrics (~65k sentences in total):
- Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
- Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
Create training set, i.e. label (0 for metal | 1 for pop) + features (represented as double vectors)
Train a logistic regression model, which is the obvious choice for this classification
This is a supervised learning problem, and is pretty fun to walk through.
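The roadmap above can be sketched in a few lines. This is a minimal plain-Python stand-in, not the Spark code from the linked post: the six sentences are made up for illustration, the features are simple bag-of-words counts stored as double vectors, and the logistic regression is trained with basic stochastic gradient descent. All function names here are hypothetical.

```python
import math

def tokenize(sentence):
    return sentence.lower().split()

def build_vocab(sentences):
    # Map each distinct word to a column index in the feature vector.
    vocab = {}
    for s in sentences:
        for w in tokenize(s):
            vocab.setdefault(w, len(vocab))
    return vocab

def featurize(sentence, vocab):
    # Bag-of-words counts as a vector of doubles; unknown words are ignored.
    vec = [0.0] * len(vocab)
    for w in tokenize(sentence):
        if w in vocab:
            vec[vocab[w]] += 1.0
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(x, weights, bias):
    return sigmoid(sum(wj * xj for wj, xj in zip(weights, x)) + bias)

def train(X, y, lr=0.5, epochs=200):
    # Stochastic gradient descent on the logistic loss.
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = score(xi, weights, bias) - yi
            weights = [wj - lr * err * xj for wj, xj in zip(weights, xi)]
            bias -= lr * err
    return weights, bias

def predict(sentence, vocab, weights, bias):
    # Label convention from the roadmap: 0 for metal, 1 for pop.
    return 1 if score(featurize(sentence, vocab), weights, bias) >= 0.5 else 0

# Hypothetical toy training set (not real lyrics data).
metal = ["darkness and fire reign", "blood and steel in the night",
         "the beast of darkness rises"]
pop = ["baby i love you", "dance with me tonight baby",
       "love me on the dance floor"]

sentences = metal + pop
labels = [0] * len(metal) + [1] * len(pop)
vocab = build_vocab(sentences)
X = [featurize(s, vocab) for s in sentences]
weights, bias = train(X, labels)
```

Calling `predict("fire and darkness", vocab, weights, bias)` then classifies a new sentence with the trained model. On a real data set of ~65k sentences, a distributed implementation would replace the Python loops.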
The idea behind Spot instances is to allow you to bid on spare Amazon EC2 compute capacity. You choose the max price you’re willing to pay per EC2 instance hour. If your bid meets or exceeds the Spot market price, you win the Spot instances. However, unlike traditional bidding, when your Spot instances start running, you pay the live Spot market price (not your bid amount). Spot prices fluctuate based on the supply and demand of available EC2 compute capacity and are specific to different regions and availability zones.
So, although you may have bid 0.55 cents per hour for an r3.2xlarge instance, you’ll end up paying only 0.10 cents an hour if that’s what the going rate is for the region and availability zone.
Databricks uses spot pricing for Community Edition clusters to control costs. Click through for a very interesting discussion of spot pricing and how they take advantage of it.
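The billing rule described above (you win the instance when your bid meets or exceeds the market price, and you then pay the market price rather than your bid) can be written as a small sketch. The function names and the hourly price series are hypothetical, and the model is simplified: it assumes one price per hour and that the instance is lost as soon as the market price rises above the bid.

```python
def spot_hourly_charge(bid, market_price):
    """Return what you pay for one hour, or None if the bid is too low."""
    if bid >= market_price:
        return market_price  # you pay the market price, not your bid
    return None

def simulate_spot_cost(bid, hourly_prices):
    """Total paid and hours run until the market price first exceeds the bid."""
    total, hours = 0.0, 0
    for price in hourly_prices:
        charge = spot_hourly_charge(bid, price)
        if charge is None:
            break  # outbid: the instance stops running
        total += charge
        hours += 1
    return total, hours
```

For example, `simulate_spot_cost(0.55, [0.10, 0.12, 0.60, 0.10])` runs for two hours at the market rate and stops in the third hour, when the price passes the bid.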
Praveen Sripati has a two-part series on getting aggregates by year in various Spark languages. In part one, he looks at Python:
Hadoop – The Definitive Guide revolves around the example of finding the maximum temperature for a particular year from the weather data set. The code for the same is here and the data here. Below is the Spark code implemented in Python for the same.
In the previous blog, we looked at how to find the maximum temperature for each year from the weather dataset. Below is the code for the same using Spark SQL, which is a layer on top of Spark. SQL on Spark was supported using Shark, which is being replaced by Spark SQL. Here is a nice blog from Databricks on the future of SQL on Spark.
There’s no Scala example here, but it’s pretty straightforward as well.
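As a rough illustration of the maximum-temperature-per-year aggregation discussed above, here is a minimal Python sketch. It is not Praveen's code, and it assumes a simplified, hypothetical input format of one `year,temperature` pair per line rather than the fixed-width weather records used in the book. The first function runs in plain Python; the second shows how the same logic might map onto the RDD API, assuming an existing `SparkContext` is passed in as `sc`.

```python
def parse_record(line):
    # Assumed format: "1949,111" -> ("1949", 111)
    year, temp = line.strip().split(",")
    return year, int(temp)

def max_temp_by_year(lines):
    """Plain-Python version: keep the largest temperature seen per year."""
    result = {}
    for year, temp in map(parse_record, lines):
        result[year] = max(temp, result.get(year, temp))
    return result

def max_temp_by_year_spark(sc, path):
    """Same aggregation on an RDD; sc is an existing SparkContext."""
    return dict(
        sc.textFile(path)      # one record per line
          .map(parse_record)   # (year, temp) pairs
          .reduceByKey(max)    # largest temp per year
          .collect()
    )
```

The Spark version expresses the per-year maximum as a `reduceByKey` over `(year, temperature)` pairs, which is the same shape as the dictionary update in the plain-Python loop.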
Before you begin, you must have the following:
An Azure Subscription: See Get Azure free trial.
A HDInsight cluster: See Get Started with HDInsight on Linux.
Visual Studio 2015: See Get Visual Studio 2015.
Check it out. Using Spark on .NET is pretty easy.
Spark’s distributed data-sharing concept is called “Resilient Distributed Datasets,” or RDD. RDDs are fault-tolerant collections of objects partitioned across a cluster that can be queried in parallel and used in a variety of workload types. RDDs are created by applying operations called “transformations” with map, filter, and groupBy clauses. They can persist in memory for rapid reuse. If an RDD’s data does not fit in memory, Spark will spill it to disk.
If you’re not familiar with Spark, now’s as good a time as any to learn.