Press "Enter" to skip to content

Category: Hadoop

Distributed Model Training with Cloudera ML

Zuling Kang and Anand Patil show us how to train models across several nodes using Cloudera Machine Learning:

Deep learning models are generally trained using the stochastic gradient descent (SGD) algorithm. For each iteration of SGD, we sample a mini-batch from the training set, feed it into the model, calculate the gradient of the loss function between the predicted values and the real values, and update the model parameters (or weights). As is well known, SGD iterations have to be executed sequentially, so it is not possible to speed up the training process by parallelizing iterations. However, since processing a single iteration for models trained on commonly used datasets like CIFAR-10 or ImageNet takes a long time, even on the most sophisticated GPU, we can still try to parallelize the feedforward computation as well as the gradient calculation within each iteration to speed up model training.

In practice, we split the mini-batch of training data into several parts, say 4, 8, or 16 (in this article, we will use the term sub-batch to refer to these split parts), and each training worker takes one sub-batch. The training workers then perform the feedforward pass, gradient computation, and model update on their own sub-batches, just as in the monolithic training mode. After these steps, a process called model averaging is invoked, averaging the model parameters of all the workers participating in the training, so that the model parameters are exactly the same when a new training iteration begins. The next training iteration then starts again from the data sampling and splitting step.
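To make the loop concrete, here is a toy NumPy sketch of the sub-batch-and-model-averaging scheme described above. The linked article works in TensorFlow; this is only a schematic illustration with a linear model and synthetic data, not the authors' code.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of mean squared error for a simple linear model y_hat = X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def train_step(workers_w, batch_X, batch_y, lr=0.05, n_workers=4):
    """One distributed SGD iteration with model averaging."""
    # 1. Split the mini-batch into one sub-batch per worker.
    X_parts = np.array_split(batch_X, n_workers)
    y_parts = np.array_split(batch_y, n_workers)

    # 2. Each worker updates its own copy of the weights on its sub-batch.
    updated = [w - lr * gradient(w, Xp, yp)
               for w, Xp, yp in zip(workers_w, X_parts, y_parts)]

    # 3. Model averaging: every worker starts the next iteration from the mean.
    w_avg = np.mean(updated, axis=0)
    return [w_avg.copy() for _ in range(n_workers)]

# Synthetic data and a short training run.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w

workers = [np.zeros(5) for _ in range(4)]
for _ in range(200):
    idx = rng.choice(len(X), size=64, replace=False)  # sample a mini-batch
    workers = train_step(workers, X[idx], y[idx])

print(workers[0])  # approaches true_w
```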

Read on for the high-level explanation, followed by some Python code working in TensorFlow.


Azure Synapse Analytics in Preview

Simon Whiteley clarifies a Build announcement:

Today’s the day! There’s much buzz & excitement as we FINALLY get to see Azure Synapse Analytics in public preview, ready for us all to get our hands on it. There’s a raft of other announcements that come hand in hand with it too.

What’s that? You thought Azure Synapse Analytics was already available? You’ve been using it all year and don’t see what the fuss is about??

I’m expecting this to be the common reaction. The marketing story for Synapse has been… interesting… to say the least. I’ve been asked several times in the last week exactly what the new story is and, given today’s news, I thought I’d clarify.

The big-picture version of Azure Synapse Analytics is the one I’ve been interested in for a while, so it’s nice to see movement here.


The Roadmap for Zookeeper-less Kafka

Colin McCabe explains the mechanics behind KIP-500:

So what is the problem with ZooKeeper? Actually, the problem is not with ZooKeeper itself but with the concept of external metadata management.

Having two systems leads to a lot of duplication. Kafka, after all, is a replicated distributed log with a pub/sub API on top. ZooKeeper is a replicated distributed log with a filesystem API on top. Each has its own way of doing network communication, security, monitoring, and configuration. Having two systems roughly doubles the total complexity of the result for the operator. This leads to an unnecessarily steep learning curve and increases the risk of some misconfiguration causing a security breach.

Storing metadata externally is not very efficient. We run at least three additional Java processes, and sometimes more. In fact, we often see Kafka clusters with just as many ZooKeeper nodes as Kafka nodes! Additionally, the data in ZooKeeper needs to be reflected on the Kafka controller, which leads to double caching.

Read on to see how they’re looking to cut out ZooKeeper dependencies. It’s an interesting story of post hoc dependency removal.


Writing a Custom Serializer Class for Kafka

Ramandeep Kaur shows how to create custom classes to serialize and deserialize data in Apache Kafka:

Need?

Basically, we use serializers to prepare the message for transmission from the producer to the broker. In other words, a serializer tells the producer how to convert the message into a byte array before transmitting it to the broker. Similarly, the consumer uses a deserializer to convert the byte array back into an object.

Click through for an example.
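For a rough feel of the mechanics in Python, here is a minimal sketch using the kafka-python client's value_serializer and value_deserializer hooks; the article itself works with Kafka's Java Serializer/Deserializer interfaces, and the topic name and broker address below are made up for illustration.

```python
import json
from dataclasses import dataclass, asdict

from kafka import KafkaProducer, KafkaConsumer  # kafka-python client

@dataclass
class Order:
    order_id: int
    customer: str
    amount: float

def order_serializer(order: Order) -> bytes:
    """Convert an Order into the byte array that actually goes over the wire."""
    return json.dumps(asdict(order)).encode("utf-8")

def order_deserializer(raw: bytes) -> Order:
    """Convert the byte array read from the broker back into an Order."""
    return Order(**json.loads(raw.decode("utf-8")))

# Producer side: the client calls order_serializer on each message before sending.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # hypothetical broker address
    value_serializer=order_serializer,
)
producer.send("orders", Order(1, "alice", 42.50))
producer.flush()

# Consumer side: the client calls order_deserializer on each received message.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=order_deserializer,
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.value)  # an Order instance
    break
```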


Spark UDFs and Error Handling

Bipin Patwardhan takes us through an error-handling scenario when writing a Spark User-Defined Function:

A couple of weeks ago, at my workplace, I wrote a metadata-driven data validation framework for Spark. After the initial euphoria of having created the framework in Scala/Spark and Python/Spark, I started reviewing the framework. During the review, I noted that the User Defined Functions (UDF) I had written were prone to throw an error in certain situations.

I then explored various options to make the UDFs fail-safe.

UDFs are like any other code: you want them to be as robust to failure as you can get them (or at least robust enough at the margin).
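As a rough illustration of one common approach (not necessarily the one Bipin settles on), here is a minimal PySpark sketch that wraps the UDF body in a try/except so that a bad row becomes a NULL instead of failing the whole job; the column names are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("failsafe-udf").getOrCreate()

def parse_age(raw):
    """Parse a string into an int, returning None instead of raising on bad input."""
    try:
        return int(raw.strip())
    except (AttributeError, TypeError, ValueError):
        return None  # a bad row becomes NULL rather than killing the job

parse_age_udf = F.udf(parse_age, IntegerType())

df = spark.createDataFrame(
    [("alice", "34"), ("bob", "not-a-number"), ("carol", None)],
    ["name", "age_raw"],  # hypothetical columns
)

df.withColumn("age", parse_age_udf(F.col("age_raw"))).show()
```

Returning None keeps the pipeline running, and the resulting NULLs can be filtered out or routed to an error table afterwards.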


Bot-Building with ksqlDB

Robin Moffatt has an interesting project for us:

But what if you didn’t need any datastore other than Kafka itself? What if you could ingest, filter, enrich, aggregate, and query data with just Kafka? With ksqlDB we can do just this, and I want to show you exactly how.

We’re going to build a simple system that captures Wi-Fi packets, processes them, and serves up on-demand information about the devices connecting to Wi-Fi. The “secret sauce” here is ksqlDB’s ability to build stateful aggregates that can be directly accessed using pull queries. This is going to power a very simple bot for the messaging platform Telegram, which takes a unique device name as input and returns statistics about its Wi-Fi probe activities to the user.
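For a rough sense of how a bot could consume those stateful aggregates, here is a small Python sketch that issues a pull query against ksqlDB's REST /query endpoint. The table name, column names, and port are assumptions for illustration only, not taken from Robin's post.

```python
import requests

KSQLDB_URL = "http://localhost:8088/query"  # default ksqlDB REST endpoint (assumed)

def device_stats(device_name: str):
    """Run a pull query against a (hypothetical) stateful aggregate table."""
    # wifi_device_stats and its columns are made up; Robin's post defines its own
    # streams and tables over the captured Wi-Fi packets.
    pull_query = (
        "SELECT device_name, probe_count, last_probe "
        f"FROM wifi_device_stats WHERE device_name = '{device_name}';"
    )
    resp = requests.post(
        KSQLDB_URL,
        headers={"Content-Type": "application/vnd.ksql.v1+json"},
        json={"ksql": pull_query, "streamsProperties": {}},
    )
    resp.raise_for_status()
    return resp.json()  # header row plus any matching rows

if __name__ == "__main__":
    print(device_stats("my-phone"))
```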

Click through for the tutorial.


Using the SSIS Hadoop Components

Hadi Fadlallah walks us through the HDFS file source and destination components:

To test these components, we will create an SSIS package and add three connection managers:

1. Hadoop Connection Manager: to connect to the Hadoop cluster (check the previous article)
2. OLE DB Connection Manager: to connect to the SQL Server instance where the AdventureWorks2017 database is stored
3. Flat File Connection Manager: to export data from the HDFS Source

I wonder if they ever fixed the 4K screen resolution problem (kind of tells you how often I use SSIS anymore…). That was one of the things which made these components unusable for me on any modern screen.


Don’t Install Hadoop on Windows

Hadi speaks truth:

A few days ago, I published the installation guides for Hadoop, Hive, and Pig on Windows 10. And yesterday, I finished installing and configuring the ecosystem. The only conclusion I came away with is: "Think 1000 times before installing Hadoop and related technologies on Windows!"

The biggest problem is that Microsoft got flaky about this. Back in 2012-2013, they backed running Hadoop on Windows as part of getting HDInsight up and running. I even remember the HDInsight emulator, which could run on a local desktop. By 2014 or so, they shifted direction and decided it wasn't worth the effort. Apache Spark (which does have pretty decent Windows support, at least for development) really wants Hive, though you can fake it with winutils.
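For what that workaround looks like in practice, here is a minimal sketch: point HADOOP_HOME at a folder containing bin\winutils.exe before starting a local SparkSession for development work on Windows. The C:\hadoop path is hypothetical, and winutils.exe has to be downloaded separately.

```python
import os
from pyspark.sql import SparkSession

# Hypothetical path to a folder containing bin\winutils.exe.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"

# With winutils in place, a local SparkSession can run on Windows for development
# without a full Hadoop installation.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("windows-dev")
    .getOrCreate()
)

spark.range(5).show()
```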
