You will start by learning the Microsoft Azure services required to deploy a secure, elastic Cloudera Enterprise cluster. These core services include security, networking, virtual machine management, and storage, just to name a few.
Then, you’ll learn best practices and patterns for cloud-based clusters, including tips and caveats for security and workload management.
Next, you’ll learn how to bootstrap a cluster using Cloudera Manager, which allows you to deploy a cluster on premises or in the cloud. The module covers how to deploy both development (Path A) and production-grade (Path B) clusters.
This is a free course, so if you’re looking for a way to fill your Thanksgiving weekend, this is definitely an option.
As most of our deployments use PowerShell, I wrote some cmdlets to work easily with the Databricks API in my scripts. These included managing clusters (create, start, stop, …), deploying content/notebooks, adding secrets, executing jobs/notebooks, etc. After some time I ended up with 20+ individual scripts, which was not really maintainable any more. So I packed them into a PowerShell module and published it to the PowerShell Gallery (https://www.powershellgallery.com/packages/DatabricksPS) for everyone to use!
This looks like a pretty good module if you work with Databricks.
Kafka Connect is modular in nature, providing a very powerful way of handling integration requirements. Some key components include:
- Connectors – the JAR files that define how to integrate with the data store itself
- Converters – handling serialization and deserialization of data
- Transforms – optional in-flight manipulation of messages
One of the more frequent sources of mistakes and misunderstanding around Kafka Connect involves the serialization of data, which Kafka Connect handles using converters. Let’s take a good look at how these work, and illustrate some of the common issues encountered.
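To make the converter piece concrete, here is a minimal sketch of the relevant settings, which can live in the worker configuration or be overridden per connector (the Avro lines assume the Confluent distribution and a locally running Schema Registry):

```properties
# Serialize record keys as plain strings.
key.converter=org.apache.kafka.connect.storage.StringConverter

# Serialize record values as JSON without an embedded schema.
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false

# Alternative: Avro with a Schema Registry (Confluent distribution).
#value.converter=io.confluent.connect.avro.AvroConverter
#value.converter.schema.registry.url=http://localhost:8081
```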
Read on for a good overview of the topic.
Data Serialization – Serialization plays an important role in the performance of any application. Spark provides two serialization libraries:
Java Serialization: By default, Spark uses Java’s ObjectOutputStream framework, which can work with any class that implements java.io.Serializable. This serialization is flexible but slow, and it creates large serialized formats for many classes.
Kryo Serialization: Spark can use the Kryo library to serialize objects. It is much faster and more compact, but it does not support all serializable types, so we must register the classes we want serialized. Kryo then uses indices instead of full class names to identify data types, which reduces the size of the serialized data and improves performance. We can enable it by setting the property spark.serializer to org.apache.spark.serializer.KryoSerializer in our Spark configuration. This serializer has a major impact on performance when we are shuffling or caching a large amount of data. To learn more about this serializer, refer to the Kryo documentation.
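As a quick sketch of what enabling Kryo looks like in practice (MyEvent is a hypothetical application class):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical application class we expect to shuffle or cache.
case class MyEvent(id: Long, payload: String)

// Switch the serializer to Kryo and register our classes up front,
// so Kryo can write small integer indices instead of full class names.
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent]))

val spark = SparkSession.builder().config(conf).getOrCreate()
```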
There are some good tips in here.
Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
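To make this concrete, here is a minimal, self-contained sketch (the lookup data is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-example").getOrCreate()
val sc = spark.sparkContext

// A read-only lookup table we want cached on every executor
// rather than shipped with each task.
val countryNames = Map("us" -> "United States", "de" -> "Germany")
val broadcastNames = sc.broadcast(countryNames)

// Tasks read the cached copy via .value instead of closing over the map.
val codes = sc.parallelize(Seq("us", "de", "us"))
codes.map(code => broadcastNames.value.getOrElse(code, "unknown"))
  .collect()
  .foreach(println)
```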
There’s some good stuff on accumulators and the SparkSession object in there as well.
Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries (further details here).
Although the project’s GitHub page says that the 1.0.5 Magellan library is available for Apache Spark 2.3+ clusters, I learned through a very difficult process that the only way to make it work in Azure Databricks is with an Apache Spark 2.2.1 cluster running Scala 2.11. The cluster I used for this experiment consisted of a Standard_DS3_v2 driver type with 14 GB of memory, 4 cores, and autoscaling enabled.
In terms of datasets, I used the NYC Taxicab dataset to create the geometry points and the Magellan NYC Neighbourhoods GeoJSON dataset to extract the polygons. Both datasets were stored in blob storage and added to Azure Databricks as a mount point.
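For reference, mounting blob storage in a Databricks Scala notebook typically looks something like this sketch (the storage account, container, and secret scope names are all hypothetical):

```scala
// dbutils is available in Databricks notebooks; the account key is
// pulled from a (hypothetical) secret scope rather than hard-coded.
dbutils.fs.mount(
  source = "wasbs://datasets@mystorageaccount.blob.core.windows.net",
  mountPoint = "/mnt/datasets",
  extraConfigs = Map(
    "fs.azure.account.key.mystorageaccount.blob.core.windows.net" ->
      dbutils.secrets.get(scope = "storage", key = "account-key"))
)
```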
It sounds like this is much faster than using U-SQL to perform the same task.
We recently implemented a Spark streaming application, which consumes data from multiple Kafka topics. The data consumed from Kafka comprises different types of telemetry events generated by mobile devices. We decided to host the Spark cluster using the Amazon EMR service, which manages a fleet of EC2 instances to run our data-processing pipelines.
As part of preparing the cluster and application for deployment to production, we needed to implement monitoring so we could track the streaming application and the Spark infrastructure itself. At a high level, we wanted to ensure that we could monitor the different components of the application, understand performance parameters, and get alerted when things go wrong.
In this post, we’ll walk through how we aggregated relevant metrics in Datadog from our Spark streaming application running on a YARN cluster in EMR.
Check it out. If this is interesting, Priya’s blog has the full series.
Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns.
The upcoming Apache Spark 2.4 release extends this powerful functionality of pivoting data to our SQL users as well. In this blog, using temperature recordings from Seattle, we’ll show how we can use this common SQL Pivot feature to achieve complex data transformations.
The syntax is quite similar to the PIVOT syntax that SQL Server uses.
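For comparison, the DataFrame-side pivot that has been available since Spark 1.6 looks like this rough sketch (the readings are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("pivot-example").getOrCreate()
import spark.implicits._

// Hypothetical monthly temperature readings.
val temps = Seq(
  (2018, "JUN", 74.0), (2018, "JUL", 82.0),
  (2018, "JUL", 78.0), (2018, "AUG", 80.0)
).toDF("year", "month", "temp")

// Rotate the unique values of `month` into their own columns.
temps.groupBy("year")
  .pivot("month")
  .agg(avg("temp"))
  .show()
```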
Since we are going to try algorithms like Logistic Regression, we will have to convert the categorical variables in the dataset into numeric variables. There are two ways we can do this, both sketched after the list below.
- Category Indexing
- One-Hot Encoding
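As a rough sketch of both techniques with Spark ML (the column names are invented, and it assumes Spark 3.x, where OneHotEncoder is an estimator; on Spark 2.3/2.4 you would reach for OneHotEncoderEstimator instead):

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("encoding-example").getOrCreate()
import spark.implicits._

// A hypothetical categorical column.
val df = Seq("checking", "savings", "checking", "none").toDF("account_type")

// 1) Category indexing: map each distinct string to a numeric index.
val indexed = new StringIndexer()
  .setInputCol("account_type")
  .setOutputCol("account_type_index")
  .fit(df)
  .transform(df)

// 2) One-hot encoding: expand the index into a sparse 0/1 vector so a
//    linear model does not read a false ordering into the indices.
new OneHotEncoder()
  .setInputCol("account_type_index")
  .setOutputCol("account_type_vec")
  .fit(indexed)
  .transform(indexed)
  .show(truncate = false)
```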
Click through for the code and explanation.
Roughly two years ago there was a spate of attacks against the open source database solution MongoDB, as well as Hadoop. These attacks were ransomware: the attacker wiped or encrypted data and then demanded money to restore that data. Just as with the recent attacks, the only Hadoop clusters affected were those that were directly connected to the internet and had no security features enabled. Cloudera published a blog post about this threat in January 2017. That blog post laid out how to ensure that your Hadoop cluster is not directly connected to the internet and encouraged the reader to enable Cloudera’s security and governance features.
That blog post has renewed relevance today with the advent of XBash and DemonBot.
The origin story of XBash and DemonBot illustrates how security researchers view the Hadoop ecosystem and the lifecycle of a vulnerability. Back in 2016 at the Hack.lu conference in Luxembourg, two security researchers gave a talk entitled Hadoop Safari: Hunting for Vulnerabilities. They described Hadoop and its security model and then suggested some “attacks” against clusters that had no security features enabled. These attacks are akin to breaking into a house while the front door is wide open.
Their advice is simple, but simple is good here: it means you should be able to implement the advice without much trouble.