
Category: Hadoop

Cloudera Data Platform in Azure Marketplace

Ram Venkatesh announces availability for Cloudera Data Platform in the Azure Marketplace:

Cloudera Data Platform (CDP) is now available on Microsoft Azure Marketplace – so joint customers can easily deploy the world’s first enterprise data cloud on Microsoft Azure.

Last week we announced the availability of Cloudera Data Platform (CDP) on Azure Marketplace. CDP is an integrated data platform that is easy to secure, manage, and deploy. With its availability on the Azure Marketplace, joint customers of Cloudera and Microsoft will be able to easily discover and provision CDP Public Cloud across all Azure regions. Additionally, by procuring CDP through the Azure Marketplace, these customers can leverage integrated billing, i.e. the cost of CDP will be part of a single Azure bill, making procurement simple and friction-free.

The new Cloudera has been cloud-first to the point of being cloud-only, which is an interesting shift for a company formed by the merger of two on-prem companies.

Comments closed

Ensuring Ordering of Kafka Data

Zeke Dean takes us through an explanation of Kafka partitioning and message delivery behaviors:

Apache Kafka provides developers with a uniquely powerful, open source and versatile distributed streaming platform – but it also has some rather complex nuances to understand when trying to store and retrieve data in your preferred order.

Kafka captures streaming data by publishing records to a category or feed name called a topic. Kafka consumers can then subscribe to topics to retrieve that data. For each topic, the Kafka cluster creates and updates a partitioned log. Kafka sends all messages from a particular producer to the same partition, storing each message in the order it arrives. Partitions therefore function as a structured commit log, containing sequences of records that are both ordered and immutable.

Click through for the explanation.
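
To make the partitioning behavior concrete, here's a minimal sketch of my own (not from the post) using the kafka-python client: records that share a key hash to the same partition, so they keep their relative order, while ordering across different keys is not guaranteed.

```python
# Minimal sketch with the kafka-python client: records that share a key are
# hashed to the same partition, so Kafka preserves their relative order.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",            # assumed local broker
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: v.encode("utf-8"),
)

# All events for account-42 land in one partition and stay in send order.
for i in range(5):
    producer.send("transactions", key="account-42", value=f"event-{i}")
producer.flush()

# A consumer reading that partition sees event-0 .. event-4 in order;
# ordering across different keys (and thus partitions) is not guaranteed.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.partition, message.offset, message.key, message.value)
```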

Comments closed

Distributed Model Training with Dask and SciKit-Learn

Matthieu Lamairesse shows us how we can use Dask to perform distributed ML model training:

Dask is an open-source parallel computing framework written natively in Python (initially released 2014). It has a significant following and support largely due to its good integration with the popular Python ML ecosystem triumvirate that is NumPy, Pandas and Scikit-learn. 

Why Dask over other distributed machine learning frameworks? 

In the context of this article it’s about Dask’s tight integration with Scikit-learn’s joblib parallel computing library that allows us to distribute Scikit-learn code with (almost) no code change, making it a very interesting framework to accelerate ML training. 

Click through for an interesting article and an example of using this on Cloudera’s ML platform.
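
As a rough illustration of what that "(almost) no code change" looks like, here's my own sketch assuming a local Dask cluster rather than Cloudera's setup: the only addition to ordinary Scikit-learn code is wrapping the fit in a Dask joblib backend.

```python
# Sketch of the joblib/Dask integration: cross-validation fits fan out to
# Dask workers instead of local processes.
from dask.distributed import Client
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client()  # local Dask cluster; pass a scheduler address for a real one

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10, None]},
    cv=3,
)

# The only change from plain Scikit-learn: run fit() inside the Dask backend.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_, search.best_score_)
```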

Comments closed

CI/CD with Databricks

Sumit Mehrotra takes us through the continuous integration story around Databricks:

Development environment – Now that you have delivered a fully configured data environment to the product (or services) team in your organization, the data scientists have started working on it. They are using the data science notebook interface that they are familiar with to do exploratory analysis. The data engineers have also started working in the environment and they like working in the context of their IDEs. They would prefer a connection between their favorite IDE and the data environment that allows them to use the familiar interface of their IDE to code and, at the same time, use the power of the data environment to run through unit tests, all in the context of their IDE.

Any disciplined engineering team would take their code from the developer’s desktop to production, running through various quality gates and feedback loops. As a start, the team needs to connect their data environment to their code repository on a service like git so that the code base is properly versioned and the team can work collaboratively on the codebase.

This is more of a conceptual post than a direct how-to guide, but it does a good job of getting you on the right path.
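
For a flavor of the kind of unit test the post has in mind, here's a hypothetical PySpark transformation with a pytest test. All of the names are illustrative, not from the post; in a Databricks environment you might back the session with databricks-connect instead of local mode, and the same test would run in a CI pipeline.

```python
# Hypothetical transformation plus a pytest unit test that can run locally
# or in a CI pipeline.
import pytest
from pyspark.sql import SparkSession, functions as F


def add_revenue(df):
    """Example transformation under test: revenue = quantity * unit_price."""
    return df.withColumn("revenue", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="session")
def spark():
    # Local session for tests; a databricks-connect session could be swapped in.
    return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()


def test_add_revenue(spark):
    df = spark.createDataFrame([(2, 10.0), (3, 5.0)], ["quantity", "unit_price"])
    result = [row["revenue"] for row in add_revenue(df).collect()]
    assert result == [20.0, 15.0]
```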

Comments closed

Tuning EMR Performance with Dr. Elephant and Sparklens

Nivas Shankar and Mert Hocanin show us how to use a couple of products to tune Hive and Spark jobs:

Data engineers and ETL developers often spend a significant amount of time running and tuning Apache Spark jobs with different parameters to evaluate performance, which can be challenging and time-consuming. Dr. Elephant and Sparklens help you tune your Spark and Hive applications by monitoring your workloads and providing suggested changes to optimize performance parameters, like required Executor nodes, Core nodes, Driver Memory and Hive (Tez or MapReduce) jobs on Mapper, Reducer, Memory, Data Skew configurations. Dr. Elephant gathers job metrics, runs analysis on them, and presents optimization recommendations in a simple way for easy consumption and corrective actions. Similarly, Sparklens makes it easy to understand the scalability limits of Spark applications and compute resources, and runs efficiently with well-defined methods instead of learning by trial and error, which saves both developer and compute time.

This post demonstrates how to install Dr. Elephant and Sparklens on an Amazon EMR cluster and run workloads to demonstrate these tools’ capabilities. Amazon EMR is a managed Hadoop service offered by AWS to easily and cost-effectively run Hadoop and other open-source frameworks on AWS.

Even if you aren’t using ElasticMapReduce, Dr. Elephant and Sparklens are quite useful products.
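
As a hedged sketch of how Sparklens attaches to a PySpark job, the package coordinates and listener class below come from the Sparklens README rather than the post, so double-check the version against your Spark build (you may also need to point spark.jars.repositories at the spark-packages repository).

```python
# Attach the Sparklens listener so it can report on executor utilization
# when the application finishes.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sparklens-profiling")
    .config("spark.jars.packages", "qubole:sparklens:0.3.2-s_2.11")
    .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
    .getOrCreate()
)

# Run the workload you want profiled; Sparklens prints its analysis
# (ideal executor counts, driver vs. executor time, etc.) at shutdown.
spark.range(0, 10_000_000).selectExpr("sum(id)").show()
spark.stop()
```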

Comments closed

Developing Shiny Apps in Databricks

Yifan Cao, Hossein Falaki, and Cyirelle Simeone announce something cool:

We are excited to announce that you can now develop and test Shiny applications in Databricks! Inside the RStudio Server hosted on Databricks clusters, you can now import the Shiny package and interactively develop Shiny applications. Once completed, you can publish the Shiny application to an external hosting service, while continuing to leverage Databricks to access data securely and at scale.

That’s really cool. Databricks dashboards are nice for simple stuff, but when you really need visualization power, having Shiny available is great.

Comments closed

Monitoring Data Quality on Streaming Data

Abraham Pabbathi and Greg Wood want to check data quality on Spark Streaming data:

While the emergence of streaming in the mainstream is a net positive, there is some baggage that comes along with this architecture. In particular, there has historically been a tradeoff: high-quality data, or high-velocity data? In reality, this is not a valid question; quality must be coupled to velocity for all practical means — to achieve high velocity, we need high quality data. After all, low quality at high velocity will require reprocessing, often in batch; low velocity at high quality, on the other hand, fails to meet the needs of many modern problems. As more companies adopt streaming as a lynchpin for their processing architectures, both velocity and quality must improve.

In this blog post, we’ll dive into one data management architecture that can be used to combat corrupt or bad data in streams by proactively monitoring and analyzing data as it arrives without causing bottlenecks.

This was one of the sticking points of the lambda architecture: new data could still be incomplete and possibly wrong, but until it reached the batch layer, you wouldn't know that.
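
As a very rough sketch of the streaming-quality idea (my own example with made-up paths and schema, not the authors' pipeline), you can validate records as they arrive and route failures to a quarantine sink instead of blocking the main flow:

```python
# Validate records in a Structured Streaming job and split them into clean
# and quarantine outputs; paths and schema are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-quality").getOrCreate()

events = (
    spark.readStream
    .schema("id STRING, amount DOUBLE, event_time TIMESTAMP")
    .json("/data/incoming/")                      # hypothetical landing path
)

# Evaluates to false (not NULL) for any record failing a check, so nothing
# slips through the crack between the two filters.
is_valid = (
    F.col("id").isNotNull()
    & F.col("event_time").isNotNull()
    & F.col("amount").isNotNull()
    & (F.col("amount") >= 0)
)

good = events.filter(is_valid)
bad = events.filter(~is_valid)

good.writeStream.format("parquet") \
    .option("checkpointLocation", "/chk/good").start("/data/clean/")
bad.writeStream.format("parquet") \
    .option("checkpointLocation", "/chk/bad").start("/data/quarantine/")

spark.streams.awaitAnyTermination()
```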

Comments closed

Sqoop Scheduling and Security

Jon Moirsi continues a series on Sqoop:

In previous articles, I’ve walked through using Sqoop to import data to HDFS.  I’ve also detailed how to perform full and incremental imports to Hive external and Hive managed tables.

In this article I’m going to show you how to automate execution of Sqoop jobs via Cron.

However, before we get to scheduling we need to address security.  In prior examples I’ve used -P to prompt the user for login credentials interactively.  With a scheduled job, this isn’t going to work.  Fortunately Sqoop provides us with the “password-alias” arg which allows us to pass in passwords stored in a protected keystore.

That particular keystore tie-in works quite smoothly in my experience.
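
Purely as illustration (the article's own examples are shell commands), here's a hypothetical Python wrapper that a cron entry could call, assuming the alias was created up front with hadoop credential create:

```python
#!/usr/bin/env python3
# Hypothetical wrapper a cron entry could invoke, for example:
#   0 2 * * * /opt/jobs/run_sqoop_import.py
# It assumes the password alias was stored beforehand with something like:
#   hadoop credential create sqldb.password.alias \
#       -provider jceks://hdfs/user/etl/sqoop.jceks
import subprocess

SQOOP_IMPORT = [
    "sqoop", "import",
    "-Dhadoop.security.credential.provider.path=jceks://hdfs/user/etl/sqoop.jceks",
    "--connect", "jdbc:sqlserver://dbserver:1433;database=Sales",
    "--username", "etl_user",
    "--password-alias", "sqldb.password.alias",   # no plaintext password on the command line
    "--table", "Orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
]

result = subprocess.run(SQOOP_IMPORT, capture_output=True, text=True)
if result.returncode != 0:
    raise SystemExit(f"Sqoop import failed:\n{result.stderr}")
print(result.stdout)
```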

Comments closed

Connecting Kafka to Elasticsearch

Danny Kay and Liz Bennett build an example of writing Kafka topic data to Elasticsearch:

The Elasticsearch sink connector helps you integrate Apache Kafka® and Elasticsearch with minimum effort. You can take data you’ve stored in Kafka and stream it into Elasticsearch to then be used for log analysis or full-text search. Alternatively, you can perform real-time analytics on this data or use it with other applications like Kibana.

For some background on what Elasticsearch is, you can read this blog post by Sarwar Bhuiyan. You can also learn more about Kafka Connect in this blog post by Tiffany Chang and in this presentation from Robin Moffatt.

This is a demo-heavy walkthrough, so check it out.
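
If you want a feel for what registering the sink looks like, here's a hedged sketch that posts a connector definition to the Kafka Connect REST API; the hostnames, topic, and settings are examples rather than values from the demo.

```python
# Register the Elasticsearch sink connector via Kafka Connect's REST API.
import json
import requests

connector = {
    "name": "orders-elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "orders",
        "connection.url": "http://elasticsearch:9200",
        "key.ignore": "true",       # let the connector generate document IDs
        "schema.ignore": "true",    # index the JSON as-is rather than requiring a schema
        "tasks.max": "1",
    },
}

response = requests.post(
    "http://localhost:8083/connectors",   # Kafka Connect worker's REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
response.raise_for_status()
print(response.json())
```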

Comments closed

Implicit Type Conversions with Spark SQL

Manoj Pandey walks us through an unexpected error with Spark SQL:

While working on some data analysis I saw one Spark SQL query was not getting me expected results. The table had a good amount of data; I was filtering on a value, but some records were missing. So, I checked online and found that Spark SQL works differently compared to SQL Server, in this case while comparing columns or variables of two different datatypes.

Read on to learn more about the issue. This is the downside of Feasel’s Law: just because both system interfaces are SQL doesn’t mean that they’re equivalent or that the assertions and assumptions you can make for one follow through to the next.
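
Here's a toy reproduction of the general behavior (my own data, not Manoj's): in Spark's default non-ANSI mode, comparing a string column to an integer triggers an implicit cast, and values that can't be cast become NULL and silently drop out of the results, where SQL Server would instead raise a conversion error.

```python
# Toy reproduction of Spark SQL's implicit casting on comparisons.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("implicit-casts").getOrCreate()

spark.createDataFrame(
    [("a", "100"), ("b", "100.0"), ("c", "1e2"), ("d", "unknown")],
    ["id", "amount"],
).createOrReplaceTempView("payments")

# The string column is implicitly cast for the comparison; 'unknown' becomes
# NULL and quietly disappears from the results instead of raising an error.
spark.sql("SELECT id, amount FROM payments WHERE amount = 100").show()

# Being explicit about the conversion at least makes the behavior visible.
spark.sql("SELECT id, amount FROM payments WHERE CAST(amount AS DOUBLE) = 100.0").show()
```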

Comments closed