Press "Enter" to skip to content

Category: Hadoop

Running Confluent Platform with .NET

Niels Berglund shows how you can install Confluent Platform as a Docker container and use the .NET client against it:

What we see in Figure 16 are the various project-related files, including the source file Program.cs. What is missing now is a Kafka client. For .NET there exist a couple of clients, and theoretically you can use any one of them. However, in practice there is only one, and that is the Confluent Kafka DotNet client. The reason I say this is that it has the best parity with the original Java client. The client has NuGet packages, and you install it via VS Code's integrated terminal: dotnet add package Confluent.Kafka --version 1.0.1.1.

Definitely use the Confluent client. The others were from a time when there was no official driver; most aren’t even maintained anymore.
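
If you just want to smoke-test the Dockerized broker before wiring up the .NET project, the Confluent clients share the same librdkafka core across languages. Here's a minimal producer sketch using the confluent-kafka Python package; the broker address and topic name are assumptions, not anything from Niels's post.

# pip install confluent-kafka
from confluent_kafka import Producer

# Assumes a broker listening on localhost:9092 and a topic named "test-topic".
producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Invoked once per message when the broker acknowledges (or rejects) it.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}")

producer.produce("test-topic", key="1", value="hello from Python", callback=on_delivery)
producer.flush()  # block until all outstanding messages are delivered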

Comments closed

When Not to Use Spark

Ramandeep Kaur gives us several cases when it makes sense not to use Apache Spark:

There can be use cases where Spark would be the inevitable choice. Spark is considered an excellent tool for use cases like ETL over a large dataset, analyzing a large set of data files, machine learning and data science on a large dataset, connecting BI/visualization tools, etc.
But it's no panacea, right?

Let’s consider the cases where using Spark would be no less than a nightmare.

No tool is perfect at everything. Click through for a few use cases where the Spark experience degrades quickly.

Comments closed

Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow:

Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info.

Databricks Runtime 5.3 and 5.3 ML and above support automatic MLflow tracking for MLlib tuning in Python.

With this feature, PySpark CrossValidator and TrainValidationSplit will automatically log to MLflow, organizing runs in a hierarchy and logging hyperparameters and the evaluation metric.  For example, calling CrossValidator.fit() will log one parent run.  Under this run, CrossValidator will log one child run for each hyperparameter setting, and each of those child runs will include the hyperparameter setting and the evaluation metric.  Comparing these runs in the MLflow UI helps with visualizing the effect of tuning each hyperparameter.

Hyperparameter tuning is critical for some of the more complex algorithms like random forests, gradient boosting, and neural networks.
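
For reference, here's roughly what the PySpark side of that looks like; a minimal sketch, assuming a training DataFrame named training_df with hypothetical feature columns f1 and f2 plus a binary label column. On Databricks Runtime 5.3 ML and above, the fit() call at the end is what gets logged automatically as one parent run with a child run per parameter combination.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Assemble the (hypothetical) feature columns into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, lr])

# Grid search over a small, user-specified set of hyperparameter values.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

# On Databricks Runtime 5.3 ML and above, this call is tracked in MLflow:
# one parent run, plus a child run per hyperparameter combination with its
# parameters and evaluation metric.
cv_model = cv.fit(training_df)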

Comments closed

TensorFrames: Spark Plus TensorFlow

Adi Polak gives us an introduction to TensorFrames:

In all TensorFrames functionality, the DataFrame is sent together with the computation graph. The DataFrame represents the distributed data, meaning that on every machine there is a chunk of the data that will go through the graph operations/transformations. This will happen on every machine with the relevant data. Tungsten binary format is the actual binary in-memory data that goes through the transformation: first to an Apache Spark Java object, and from there it is sent to the TensorFlow Java API for graph calculations. This all happens in the Spark worker process, which can spin up many tasks, meaning various calculations run at the same time over the in-memory data.

An interesting bit of turnabout here is that the Scala API is the underdeveloped one; normally for Spark, the Python API is the Johnny-Come-Lately version.
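
To make the DataFrame-plus-graph idea concrete, here's a small sketch along the lines of the TensorFrames examples, using the Python API with TensorFlow 1.x-style graphs; treat the function names and the SparkSession setup as illustrative rather than canonical.

import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(x=float(i)) for i in range(10)])

with tf.Graph().as_default():
    # Placeholder bound to the "x" column; each worker feeds it a block of rows.
    x = tfs.block(df, "x")
    # The graph operation applied to every block of in-memory data.
    z = tf.add(x, 3, name="z")
    # New DataFrame with a "z" column computed by TensorFlow on the workers.
    df2 = tfs.map_blocks(z, df)

df2.show()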

Comments closed

Explaining Hadoop Struggles

Alex Woodie has an editorial taking us through why the major Hadoop distributors (Cloudera and MapR) are struggling so much lately:

The future is not clear for either Cloudera or MapR. While there are similarities in the two companies' positions, there are big differences too.

Cloudera does not have a permanent CEO at the moment, and it still hasn’t shipped the new converged Hadoop distribution, dubbed Cloudera Data Platform (CDP), that will replace the old Cloudera and Hortonworks distributions. During its first quarter ended April 30, Cloudera said customers are holding off investing in the old Hadoop products since they know the new CDP is due by the end of the year. That fact led Cloudera to dramatically lower its revenue expectations for the year, which upset stockholders, who pushed Cloudera’s stock (NYSE: CLDR) down 40% the following day.

The way I’m phrasing it is that the Hadoop ecosystem is strong (with the successes of companies like Databricks and Confluent), but core Hadoop companies are struggling.

Comments closed

MLflow 1.0 Released

Clemens Mewald and Matei Zaharia announce the release of MLflow 1.0:

Today we are excited to announce the release of MLflow 1.0. Since its launch one year ago, MLflow has been deployed at thousands of organizations to manage their production machine learning workloads, and has become generally available on services like Managed MLflow on Databricks. The MLflow community has grown to over 100 contributors, and the MLflow PyPI package download rate has reached close to 600K times a month. The 1.0 release not only marks the maturity and stability of the APIs, but also adds a number of frequently requested features and improvements.

The release is publicly available starting today. Install MLflow 1.0 from PyPI, read our documentation to get started, and provide feedback on GitHub. Below we describe just a few of the new features in MLflow 1.0. Please refer to the release notes for a full list.

And it looks like they’re going to keep pushing on it from there.
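
If you want to kick the tires, the install-and-track loop is short. A minimal sketch, assuming MLflow 1.0 from PyPI and placeholder parameter and metric values:

# pip install mlflow==1.0.0
import mlflow

# Runs land in a local ./mlruns directory by default; point the tracking URI
# at a server (e.g. Managed MLflow on Databricks) to log centrally instead.
with mlflow.start_run(run_name="quickstart"):
    mlflow.log_param("alpha", 0.5)       # placeholder hyperparameter
    mlflow.log_metric("rmse", 0.78)      # placeholder evaluation metric
    for step in range(5):
        # The step argument is one of the 1.0 additions for metric x-coordinates.
        mlflow.log_metric("loss", 1.0 / (step + 1), step=step)

# Browse the results locally with: mlflow ui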

Comments closed

Kafka In Front of ELK

Daniel Berman sets up a simple Elasticsearch-Logstash-Kibana (ELK) stack and throws Kafka in front of it:

To perform the steps below, I set up a single Ubuntu 16.04 machine on AWS EC2 using local storage. In real-life scenarios you will probably have all these components running on separate machines.

I started the instance in the public subnet of a VPC and then set up a security group to enable access from anywhere using SSH and TCP 5601 (for Kibana). Finally, I added a new elastic IP address and associated it with the running instance.

The example logs used for the tutorial are Apache access logs.

This is a great walkthrough on setup and basic configuration. If you don’t have something in place to manage logs, the ELK stack is fine.
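
The moving part Kafka adds to the picture is a buffer: log lines get produced into a topic, and Logstash consumes from that topic (via its kafka input) before indexing into Elasticsearch. As a rough, hypothetical sketch of the producing side — topic name, broker address, and log path are all assumptions — something like this would follow an Apache access log and publish each new line:

import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Follow the access log like `tail -f` and publish each new line to Kafka.
with open("/var/log/apache2/access.log") as log:
    log.seek(0, 2)  # jump to the end of the file
    while True:
        line = log.readline()
        if not line:
            producer.poll(0.5)  # serve delivery callbacks while idle
            time.sleep(0.5)
            continue
        producer.produce("apache-logs", value=line.rstrip("\n"))
        producer.poll(0)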

Comments closed

HBase Updates in CDH 6.2

Krishna Maheshwari announces updates to the Cloudera Distribution of Hadoop:

Starting with CDH 6.2, Cloudera now includes the ability to use Intel's newly released Optane Memory as an alternate destination for the 2nd tier of the bucket cache.  This deployment configuration enables you to have ~3x the size of the cache for constant cost (as compared to off-heap cache on DRAM). It does incur some additional latency compared to the traditional off-heap configuration, but our testing indicates that by allowing more (if not all) of the data's working set to fit in the cache, this setup results in a net performance improvement when the data is ultimately stored on HDFS (using HDDs).

When deploying to the cloud or using on-prem object storage, the performance improvement will be even better as object storage tends to be very expensive for random reads of small amounts of data.

There aren’t too many changes to HBase in the blog post, but the two mentioned are pretty good ones.

Comments closed

Connecting PolyBase to Spark

I have a blog post connecting PolyBase to a Spark cluster:

If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark:

import org.apache.spark.sql.functions._

spark.sql("""
SELECT
    INT(SUMLEV) AS SummaryLevel,
    INT(COUNTY) AS CountyID,
    INT(PLACE) AS PlaceID,
    BOOLEAN(PRIMGEO_FLAG) AS IsPrimaryGeography,
    NAME AS Name,
    POPTYPE AS PopulationType,
    INT(YEAR) AS Year,
    INT(POPULATION) AS Population
FROM NorthCarolinaPopulation
WHERE
    POPULATION <> 'A'
""")
  .write.format("orc").saveAsTable("NorthCarolinaPopulationTyped")

It’s not all perfect, though: I also cover driver problems that I ran into here with Spark and Hive.

Comments closed

An Introduction to Azure Databricks

Brad Llewellyn has an introduction to Azure Databricks:

So, what is Azure Databricks?  To answer this question, let’s start all the way at the bottom of the hole and climb up.  So, what is Hadoop?  Apache Hadoop is an open-source, distributed storage and computing ecosystem designed to handle incredibly large volumes of data and complex transformations.  It is becoming more common as organizations are starting to integrate massive data sources, such as social media, financial transactions and the Internet of Things.  However, Hadoop solutions are extremely complex to manage and develop.  So, many people have worked together to create platforms that layer on top of Hadoop to provide a simpler way to solve certain types of problems.  Apache Spark is one of these platforms.  You can read more about Apache Hadoop here and here.

It’s Hadoop turtles all the way down.

Comments closed