Category: Hadoop

Your Data’s Not That Big

Published 2018-08-21 by Kevin Feasel

Larry White throws a bit of cold water on the distributed computing movement:

Someone recently told me about a data analysis application written in Python. He managed five Java engineers who built the cluster management and pipeline infrastructure needed to make the analysis run in the 12 hours allotted. They used Python, he said, because it was “easy,” which it was, if you ignore all the work needed to make it go fast. It seemed pretty clear to me that it could have been written in Java to run on a single machine with a much smaller staff.

One definition of “big data” is “Data that is too big to fit on one machine.” By that definition what is “big data” for one language is plain-old “data” for another. Java, with it’s efficient memory management, high performance, and multi-threading can get a lot done on one machine. To do data science in Java, however, you need data science tools: Tablesaw is an open-source (Apache 2) Java data science platform that lets users work with data on a single machine. It’s a dataframe and visualization framework. Most data science currently done in clusters could be done on a single machine using Tablesaw paired with a Java machine learning library like Smile.

But you don’t have to take my word for that.

There are some interesting thoughts in this post, but there are limits to what a single machine can do.

Comments closed

Faster User-Defined Functions In SparkR

Published 2018-08-20 by Kevin Feasel

Liang Zhang and Hossein Falaki note a major performance improvement for functions in SparkR using the latest version of the Databricks Runtime:

SparkR offers four APIs that run a user-defined function in R to a SparkDataFrame

dapply()

dapplyCollect()

gapply()

gapplyCollect()

dapply() allows you to run an R function on each partition of the SparkDataFrame and returns the result as a new SparkDataFrame, on which you may apply other transformations or actions. gapply() allows you to apply a function to each grouped partition consisting of a key and the corresponding rows in a SparkDataFrame. dapplyCollect() and gapplyCollect()are shortcuts if you want to call collect() on the result.

The following diagram illustrates the serialization and deserialization performed during the execution of the UDF. The data gets serialized twice and deserialized twice in total, all of which are row-wise.

By vectorizing data serialization and deserialization in Databricks Runtime 4.3, we encode and decode all the values of a column at once. This eliminates the primary bottleneck which row-wise serialization, and significantly improves SparkR’s UDF performance. Also, the benefit from the vectorization is more drastic for larger datasets.

It looks like they get some pretty serious gains from this change.

Comments closed

Last-Click Attribution With Databricks Delta

Published 2018-08-16 by Kevin Feasel

Caryl Yuhas and Denny Lee give us an example of building a last-click digital marketing attribution model with Databricks Delta:

The first thing we will need to do is to establish the impression and conversion data streams. The impression data stream provides us a real-time view of the attributes associated with those customers who were served the digital ad (impression) while the conversion stream denotes customers who have performed an action (e.g. click the ad, purchased an item, etc.) based on that ad.

With Structured Streaming in Databricks, you can quickly plug into the stream as Databricks supports direct connectivity to Kafka (Apache Kafka, Apache Kafka on AWS, Apache Kafka on HDInsight) and Kinesis as noted in the following code snippet (this is for impressions, repeat this step for conversions)

This is definitely an interesting approach to the problem. Check it out.

Comments closed

Working With Kafka At Scale

Published 2018-08-15 by Kevin Feasel

Tony Mancill has some tips for working with large-scale Kafka clusters:

Unless you have architectural needs that require you to do otherwise, use random partitioning when writing to topics. When you’re operating at scale, uneven data rates among partitions can be difficult to manage. There are three main reasons for this:

First, consumers of the “hot” (higher throughput) partitions will have to process more messages than other consumers in the consumer group, potentially leading to processing and networking bottlenecks.
Second, topic retention must be sized for the partition with the highest data rate, which can result in increased disk usage across other partitions in the topic.
Third, attaining an optimum balance in terms of partition leadership is more complex than simply spreading the leadership across all brokers. A “hot” partition might carry 10 times the weight of another partition in the same topic.

There’s some interesting advice in here.

Comments closed

Kafka Blindness

Published 2018-08-15 by Kevin Feasel

George Vetticaden and Houshang Livian look at a common problem with Apache Kafka installations:

Over the last 12 months, the product team has been talking to our largest Kafka customers who are using this technology to implement a diverse set of use cases. We posed to them the following question:

What are your key challenges with using Kafka in production? What do you need to be successful with this powerful technology?

The most common response was the need for better tools to monitor and manage Kafka in production. Specifically, users wanted better visibility in understanding what is going on in the cluster across the four key entities with Kafka: producers, topics, brokers, and consumers. In fact, because we heard this same response over and over from the users we interviewed, we gave it a name: The Kafka Blindness.

Kafka’s Omnipresence has led to Kafka blindness – the enterprise’s struggle to monitor, troubleshoot and see whats happening in their Kafka clusters.

It looks like the folks at Hortonworks are building tooling around visualizing Kafka topic status. There are a bunch of these tools out there (each one typically with its own focus and blind spots), so we’ll see how theirs stacks up.

Comments closed

Scaling Kafka With Kafka-Kit

Published 2018-08-15 by Kevin Feasel

Jamie Alquiza announces Kafka-Kit:

Kafka-Kit is a collection of tools that handle partition to broker mappings, failed broker replacements, storage based partition rebalancing, and replication auto-throttling. The two primary tools are topicmappr and autothrottle.

These tools cover two categories of our Kafka operations: data placement and replication auto-throttling.

It looks like an interesting project, and is available on GitHub.

Comments closed

Getting Started With Azure Databricks

Published 2018-08-15 by Kevin Feasel

David Peter Hansen has a quick walkthrough of Azure Databricks:

RUN MACHINE LEARNING JOBS ON A SINGLE NODE

A Databricks cluster has one driver node and one or more worker nodes. The Databricks runtime includes common used Python libraries, such as scikit-learn. However, they do not distribute their algorithms.

Running a ML job only on the driver might not be what we are looking for. It is not distributed and we could as well run it on our computer or in a Data Science Virtual Machine. However, some machine learning tasks can still take advantage of distributed computation and it a good way to take an existing single-node workflow and transition it to a distributed workflow.

This great example notebooks that uses scikit-learn shows how this is done.

Read the whole thing.

Comments closed

Running Apache Kafka On Kubernetes

Published 2018-08-10 by Kevin Feasel

Rohit Bakhshi walks us through how to install Kafka on a Kubernetes cluster:

Now available on GitHub in developer preview are open-source Helm Chart deployment templates for Confluent Platform components. These templates enable developers to quickly provision Apache Kafka, Apache ZooKeeper, Confluent Schema Registry, Confluent REST Proxy, and Kafka Connect on Kubernetes, using official Confluent Platform Docker images.

Helm is an open-source packaging tool that helps you install applications and services on Kubernetes. Helm uses a packaging format called charts. A chart is a collection of YAML templates that describe a related set of Kubernetes resources.

For stateful components like Kafka and ZooKeeper, the Helm Charts use both StatefulSets to provide an identity to each pod in the form of an ordinal index, and Persistent Volumes that are always mounted for the pod. For stateless components, like REST Proxy, the Helm Charts utilize Deployments instead to provide an identity to each pod. Each component’s charts utilize Services to provide access to each pod.

Read on for more.

Comments closed

Databricks Delta: Data Skipping And ZORDER Clustering

Published 2018-08-08 by Kevin Feasel

Adrian Ionescu explains a couple of concepts which can help make selective queries with Databricks much faster:

The general use-case for these features is to improve the performance of needle-in-the-haystack kind of queries against huge data sets. The typical RDBMS solution, namely secondary indexes, is not practical in a big data context due to scalability reasons.

If you’re familiar with big data systems (be it Apache Spark, Hive, Impala, Vertica, etc.), you might already be thinking: (horizontal) partitioning.

Quick reminder: In Spark, just like Hive, partitioning works by having one subdirectory for every distinct value of the partition column(s). Queries with filters on the partition column(s) can then benefit from partition pruning, i.e., avoid scanning any partition that doesn’t satisfy those filters.

The main question is: What columns do you partition by?
And the typical answer is: The ones you’re most likely to filter by in time-sensitive queries.
But… What if there are multiple (say 4+), equally relevant columns?

Read the whole thing.

Comments closed

Combining Apache Kafka With TensorFlow

Published 2018-08-06 by Kevin Feasel

Kai Waehner has an example of an application which uses Apache Kafka to stream car sensor data to TensorFlow on Google ML Engine:

A great benefit of Confluent MQTT Proxy is simplicity for realizing IoT scenarios without the need for a MQTT Broker. You can forward messages directly from the MQTT devices to Kafka via the MQTT Proxy. This reduces efforts and costs significantly. This is a perfect solution if you “just” want to communicate between Kafka and MQTT devices.

If you want to see the other part of the story (integration with sink applications like Elasticsearch / Grafana), please take a look at the Github project “KSQL for streaming IoT data“. This realizes the integration with ElasticSearch and Grafana via Kafka Connect and the Elastic connector.

Check it out and then take a gander at Kai’s GitHub repo.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31