
Category: Hadoop

Cloudera Data Platform in Azure Marketplace

Ram Venkatesh announces availability for Cloudera Data Platform in the Azure Marketplace:

Cloudera Data Platform (CDP) is now available on Microsoft Azure Marketplace – so joint customers can easily deploy the world’s first enterprise data cloud on Microsoft Azure.

Last week we announced the availability of Cloudera Data Platform (CDP) on Azure Marketplace. CDP is an integrated data platform that is easy to secure, manage, and deploy. With its availability on the Azure Marketplace, joint customers of Cloudera and Microsoft will be able to easily discover and provision CDP Public Cloud across all Azure regions. Additionally, by procuring CDP through the Azure Marketplace, these customers can leverage integrated billing, i.e. the cost of CDP will be part of a single Azure bill, making procurement simple and friction-free.

The new Cloudera has been cloud-first to the point of being cloud-only, which is an interesting shift for a company formed by the merger of two on-prem companies.

Comments closed

Ensuring Ordering of Kafka Data

Zeke Dean takes us through an explanation of Kafka partitioning and message delivery behaviors:

Apache Kafka provides developers with a uniquely powerful, open source and versatile distributed streaming platform – but it also has some rather complex nuances to understand when trying to store and retrieve data in your preferred order.

Kafka captures streaming data by publishing records to a category or feed name called a topic. Kafka consumers can then subscribe to topics to retrieve that data. For each topic, the Kafka cluster creates and updates a partitioned log. Kafka sends all messages from a particular producer to the same partition, storing each message in the order it arrives. Partitions therefore function as a structured commit log, containing sequences of records that are both ordered and immutable.

Click through for the explanation.
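
To make the partitioning behavior concrete, here's a minimal sketch of my own (not from the post) using the kafka-python client: records that share a key hash to the same partition, so they keep their relative order, while ordering across different keys is not guaranteed.

```python
# Minimal sketch with the kafka-python client: records that share a key are
# hashed to the same partition, so Kafka preserves their relative order.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",            # assumed local broker
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: v.encode("utf-8"),
)

# All events for account-42 land in one partition and stay in send order.
for i in range(5):
    producer.send("transactions", key="account-42", value=f"event-{i}")
producer.flush()

# A consumer reading that partition sees event-0 .. event-4 in order;
# ordering across different keys (and thus partitions) is not guaranteed.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.partition, message.offset, message.key, message.value)
```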

Comments closed

Distributed Model Training with Dask and SciKit-Learn

Matthieu Lamairesse shows us how we can use Dask to perform distributed ML model training:

Dask is an open-source parallel computing framework written natively in Python (initially released 2014). It has a significant following and support largely due to its good integration with the popular Python ML ecosystem triumvirate that is NumPy, Pandas and Scikit-learn. 

Why Dask over other distributed machine learning frameworks? 

In the context of this article it’s about Dask’s tight integration with Scikit-learn’s joblib parallel computing library that allows us to distribute Scikit-learn code with (almost) no code change, making it a very interesting framework to accelerate ML training. 

Click through for an interesting article and an example of using this on Cloudera’s ML platform.
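
As a rough illustration of what that "(almost) no code change" looks like, here's my own sketch assuming a local Dask cluster rather than Cloudera's setup: the only addition to ordinary Scikit-learn code is wrapping the fit in a Dask joblib backend.

```python
# Sketch of the joblib/Dask integration: cross-validation fits fan out to
# Dask workers instead of local processes.
from dask.distributed import Client
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client()  # local Dask cluster; pass a scheduler address for a real one

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10, None]},
    cv=3,
)

# The only change from plain Scikit-learn: run fit() inside the Dask backend.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_, search.best_score_)
```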

Comments closed

CI/CD with Databricks

Sumit Mehrotra takes us through the continuous integration story around Databricks:

Development environment – Now that you have delivered a fully configured data environment to the product (or services) team in your organization, the data scientists have started working on it. They are using the data science notebook interface that they are familiar with to do exploratory analysis. The data engineers have also started working in the environment and they like working in the context of their IDEs. They would prefer a connection between their favorite IDE and the data environment that allows them to use the familiar interface of their IDE to code and, at the same time, use the power of the data environment to run through unit tests, all in the context of their IDE.

Any disciplined engineering team would take their code from the developer’s desktop to production, running through various quality gates and feedback loops. As a start, the team needs to connect their data environment to their code repository on a service like git so that the code base is properly versioned and the team can work collaboratively on the codebase.

This is more of a conceptual post than a direct how-to guide, but it does a good job of getting you on the right path.
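
For a flavor of the kind of unit test the post has in mind, here's a hypothetical PySpark transformation with a pytest test. All of the names are illustrative, not from the post; in a Databricks environment you might back the session with databricks-connect instead of local mode, and the same test would run in a CI pipeline.

```python
# Hypothetical transformation plus a pytest unit test that can run locally
# or in a CI pipeline.
import pytest
from pyspark.sql import SparkSession, functions as F


def add_revenue(df):
    """Example transformation under test: revenue = quantity * unit_price."""
    return df.withColumn("revenue", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="session")
def spark():
    # Local session for tests; a databricks-connect session could be swapped in.
    return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()


def test_add_revenue(spark):
    df = spark.createDataFrame([(2, 10.0), (3, 5.0)], ["quantity", "unit_price"])
    result = [row["revenue"] for row in add_revenue(df).collect()]
    assert result == [20.0, 15.0]
```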

Comments closed

Tuning EMR Performance with Dr. Elephant and Sparklens

Nivas Shankar and Mert Hocanin show us how to use a couple of products to tune Hive and Spark jobs:

Data engineers and ETL developers often spend a significant amount of time running and tuning Apache Spark jobs with different parameters to evaluate performance, which can be challenging and time-consuming. Dr. Elephant and Sparklens help you tune your Spark and Hive applications by monitoring your workloads and providing suggested changes to optimize performance parameters, like required Executor nodes, Core nodes, Driver Memory and Hive (Tez or MapReduce) jobs on Mapper, Reducer, Memory, Data Skew configurations. Dr. Elephant gathers job metrics, runs analysis on them, and presents optimization recommendations in a simple way for easy consumption and corrective actions. Similarly, Sparklens makes it easy to understand the scalability limits of Spark applications and compute resources, and runs efficiently with well-defined methods instead of learning by trial and error, which saves both developer and compute time.

This post demonstrates how to install Dr. Elephant and Sparklens on an Amazon EMR cluster and run workloads to demonstrate these tools’ capabilities. Amazon EMR is a managed Hadoop service offered by AWS to easily and cost-effectively run Hadoop and other open-source frameworks on AWS.

Even if you aren’t using ElasticMapReduce, Dr. Elephant and Sparklens are quite useful products.
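
As a hedged sketch of how Sparklens attaches to a PySpark job, the package coordinates and listener class below come from the Sparklens README rather than the post, so double-check the version against your Spark build (you may also need to point spark.jars.repositories at the spark-packages repository).

```python
# Attach the Sparklens listener so it can report on executor utilization
# when the application finishes.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sparklens-profiling")
    .config("spark.jars.packages", "qubole:sparklens:0.3.2-s_2.11")
    .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
    .getOrCreate()
)

# Run the workload you want profiled; Sparklens prints its analysis
# (ideal executor counts, driver vs. executor time, etc.) at shutdown.
spark.range(0, 10_000_000).selectExpr("sum(id)").show()
spark.stop()
```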

Comments closed

Developing Shiny Apps in Databricks

Yifan Cao, Hossein Falaki, and Cyirelle Simeone announce something cool:

We are excited to announce that you can now develop and test Shiny applications in Databricks! Inside the RStudio Server hosted on Databricks clusters, you can now import the Shiny package and interactively develop Shiny applications. Once completed, you can publish the Shiny application to an external hosting service, while continuing to leverage Databricks to access data securely and at scale.

That’s really cool. Databricks dashboards are nice for simple stuff, but when you really need visualization power, having Shiny available is great.

Comments closed

Monitoring Data Quality on Streaming Data

Abraham Pabbathi and Greg Wood want to check data quality on Spark Streaming data:

While the emergence of streaming in the mainstream is a net positive, there is some baggage that comes along with this architecture. In particular, there has historically been a tradeoff: high-quality data, or high-velocity data? In reality, this is not a valid question; quality must be coupled to velocity for all practical means — to achieve high velocity, we need high quality data. After all, low quality at high velocity will require reprocessing, often in batch; low velocity at high quality, on the other hand, fails to meet the needs of many modern problems. As more companies adopt streaming as a lynchpin for their processing architectures, both velocity and quality must improve.

In this blog post, we’ll dive into one data management architecture that can be used to combat corrupt or bad data in streams by proactively monitoring and analyzing data as it arrives without causing bottlenecks.

This was one of the sticking points of the lambda architecture: new data could still be incomplete and possibly wrong, but until it reached the batch layer, you wouldn't know that.
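
As a very rough sketch of the streaming-quality idea (my own example with made-up paths and schema, not the authors' pipeline), you can validate records as they arrive and route failures to a quarantine sink instead of blocking the main flow:

```python
# Validate records in a Structured Streaming job and split them into clean
# and quarantine outputs; paths and schema are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-quality").getOrCreate()

events = (
    spark.readStream
    .schema("id STRING, amount DOUBLE, event_time TIMESTAMP")
    .json("/data/incoming/")                      # hypothetical landing path
)

# Evaluates to false (not NULL) for any record failing a check, so nothing
# slips through the crack between the two filters.
is_valid = (
    F.col("id").isNotNull()
    & F.col("event_time").isNotNull()
    & F.col("amount").isNotNull()
    & (F.col("amount") >= 0)
)

good = events.filter(is_valid)
bad = events.filter(~is_valid)

good.writeStream.format("parquet") \
    .option("checkpointLocation", "/chk/good").start("/data/clean/")
bad.writeStream.format("parquet") \
    .option("checkpointLocation", "/chk/bad").start("/data/quarantine/")

spark.streams.awaitAnyTermination()
```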

Comments closed

Sqoop Scheduling and Security

Jon Moirsi continues a series on Sqoop:

In previous articles, I’ve walked through using Sqoop to import data to HDFS.  I’ve also detailed how to perform full and incremental imports to Hive external and Hive managed tables.

In this article I’m going to show you how to automate execution of Sqoop jobs via Cron.

However, before we get to scheduling we need to address security.  In prior examples I’ve used -P to prompt the user for login credentials interactively.  With a scheduled job, this isn’t going to work.  Fortunately Sqoop provides us with the “password-alias” arg which allows us to pass in passwords stored in a protected keystore.

That particular keystore tie-in works quite smoothly in my experience.
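
Purely as illustration (the article's own examples are shell commands), here's a hypothetical Python wrapper that a cron entry could call, assuming the alias was created up front with hadoop credential create:

```python
#!/usr/bin/env python3
# Hypothetical wrapper a cron entry could invoke, for example:
#   0 2 * * * /opt/jobs/run_sqoop_import.py
# It assumes the password alias was stored beforehand with something like:
#   hadoop credential create sqldb.password.alias \
#       -provider jceks://hdfs/user/etl/sqoop.jceks
import subprocess

SQOOP_IMPORT = [
    "sqoop", "import",
    "-Dhadoop.security.credential.provider.path=jceks://hdfs/user/etl/sqoop.jceks",
    "--connect", "jdbc:sqlserver://dbserver:1433;database=Sales",
    "--username", "etl_user",
    "--password-alias", "sqldb.password.alias",   # no plaintext password on the command line
    "--table", "Orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
]

result = subprocess.run(SQOOP_IMPORT, capture_output=True, text=True)
if result.returncode != 0:
    raise SystemExit(f"Sqoop import failed:\n{result.stderr}")
print(result.stdout)
```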

Comments closed

Connecting Kafka to Elasticsearch

Danny Kay and Liz Bennett build an example of writing Kafka topic data to Elasticsearch:

The Elasticsearch sink connector helps you integrate Apache Kafka® and Elasticsearch with minimum effort. You can take data you’ve stored in Kafka and stream it into Elasticsearch to then be used for log analysis or full-text search. Alternatively, you can perform real-time analytics on this data or use it with other applications like Kibana.

For some background on what Elasticsearch is, you can read this blog post by Sarwar Bhuiyan. You can also learn more about Kafka Connect in this blog post by Tiffany Chang and in this presentation from Robin Moffatt.

This is a demo-heavy walkthrough, so check it out.
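
If you want a feel for what registering the sink looks like, here's a hedged sketch that posts a connector definition to the Kafka Connect REST API; the hostnames, topic, and settings are examples rather than values from the demo.

```python
# Register the Elasticsearch sink connector via Kafka Connect's REST API.
import json
import requests

connector = {
    "name": "orders-elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "orders",
        "connection.url": "http://elasticsearch:9200",
        "key.ignore": "true",       # let the connector generate document IDs
        "schema.ignore": "true",    # index the JSON as-is rather than requiring a schema
        "tasks.max": "1",
    },
}

response = requests.post(
    "http://localhost:8083/connectors",   # Kafka Connect worker's REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
response.raise_for_status()
print(response.json())
```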

Comments closed

Implicit Type Conversions with Spark SQL

Manoj Pandey walks us through an unexpected error with Spark SQL:

While working on some data analysis I saw one Spark SQL query was not getting me expected results. The table had a good amount of data; I was filtering on a value, but some records were missing. So, I checked online and found that Spark SQL works differently compared to SQL Server, in this case while comparing columns or variables of two different datatypes.

Read on to learn more about the issue. This is the downside of Feasel’s Law: just because both system interfaces are SQL doesn’t mean that they’re equivalent or that the assertions and assumptions you can make for one follow through to the next.
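
Here's a toy reproduction of the general behavior (my own data, not Manoj's): in Spark's default non-ANSI mode, comparing a string column to an integer triggers an implicit cast, and values that can't be cast become NULL and silently drop out of the results, where SQL Server would instead raise a conversion error.

```python
# Toy reproduction of Spark SQL's implicit casting on comparisons.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("implicit-casts").getOrCreate()

spark.createDataFrame(
    [("a", "100"), ("b", "100.0"), ("c", "1e2"), ("d", "unknown")],
    ["id", "amount"],
).createOrReplaceTempView("payments")

# The string column is implicitly cast for the comparison; 'unknown' becomes
# NULL and quietly disappears from the results instead of raising an error.
spark.sql("SELECT id, amount FROM payments WHERE amount = 100").show()

# Being explicit about the conversion at least makes the behavior visible.
spark.sql("SELECT id, amount FROM payments WHERE CAST(amount AS DOUBLE) = 100.0").show()
```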

Comments closed