Category: Hadoop

Databricks MLflow

Published 2018-06-06 by Kevin Feasel

Matai Zaharia announces a new Databricks offering:

MLflow is inspired by existing ML platforms, but it is designed to be open in two senses:

Open interface: MLflow is designed to work with any ML library, algorithm, deployment tool or language. It’s built around REST APIs and simple data formats (e.g., a model can be viewed as a lambda function) that can be used from a variety of tools, instead of only providing a small set of built-in functionality. This also makes it easy to add MLflow to your existing ML code so you can benefit from it immediately, and to share code using any ML library that others in your organization can run.

Open source: We’re releasing MLflow as an open source project that users and library developers can extend. In addition, MLflow’s open format makes it very easy to share workflow steps and models across organizations if you wish to open source your code.

Mlflow is still currently in alpha, but we believe that it already offers a useful framework to work with ML code, and we would love to hear your feedback. In this post, we’ll introduce MLflow in detail and explain its components.

Even in alpha, it looks nice.

Comments closed

Using Kafka To Go From Batch To Stream

Published 2018-06-05 by Kevin Feasel

Stephane Maarek has started a series on transforming a batch process into a streaming process using Apache Kafka. Part one introduces the topic and two of the four microservices:

Before jumping straight in, it’s very important to map out the current process and see how we can improve each component. Below are my personal assumptions:

When a user writes a review, it gets POSTed to a Web Service (REST Endpoint), which will store that review into some kind of database table.
Every 24 hours, a batch job (could be Spark) would take all the new reviews and apply a spam filter to filter fraudulent reviews from legitimate ones.
New valid reviews are published to another database table (which contains all the historic valid reviews).
Another batch job or a SQL query computes new stats for courses. Stats include all-time average rating, all-time count of reviews, 90 days average rating, and 90 days count of reviews.
The website displays these metrics through a REST API when the user navigates a website.

Part two finishes up the story:

In the previous section, we learned about the early concepts of Kafka Streams, to take a stream and split it in two based on a spam evaluation function. Now, we need to perform some stateful computations such as aggregations, windowing in order to compute statistics on our stream of reviews.

Thankfully we can use some pre-defined operators in the High-Level DSL that will transform a KStream into a KTable. A KTable is basically a table that gets new events every time a new element arrives in the upstream KStream. The KTable then has some level of logic to update itself. Any KTable updates can then be forwarded downstream. For a quick overview of KStream and KTable, I recommend the quickstart on the Kafka website.

This is a nice introduction to Kafka Streams using a realistic example.

Comments closed

Analytics Platform System V7 Released

Published 2018-06-05 by Kevin Feasel

Microsoft has released a new version of their Analytics Platform System:

Microsoft is pleased to announce that the Analytics Platform System (APS) appliance update 7 (AU7) is now generally available. APS is Microsoft’s scale-out Massively Parallel Processing (MPP) system based on SQL Server for data warehouse specific workloads on-premises.

Customers will get significantly improved query performance and enhanced security features with this release. APS AU7 builds on appliance update 6 (APS 2016) release as a foundation. Upgrading to APS appliance update 6 is a prerequisite to upgrade to appliance update 7.

This is useful for the six customers which can afford the licensing for APS.

Comments closed

Hortonworks Released HDP 2.6.5

Published 2018-05-31 by Kevin Feasel

Mitra Mohsenian and Roni Fontaine announce Hortonworks Data Platform 2.6.5:

We are excited to make several product announcements including the general availability of :

HDP 2.6.5

Apache Kafka 1.0

Apache Spark 2.3

Apache Ambari 2.6.2

SmartSense 1.4.5

HDP 2.6.5 is an important release for Hortonworks given it is the first release that enables Apache Kafka 1.0 and Apache Spark 2.3

It looks like Ubuntu 18.04 isn’t supported just yet, but I imagine that’s coming.

Comments closed

Using Burrow To Monitor Kafka

Published 2018-05-31 by Kevin Feasel

Gaurav Garg shows us how to install and configure Burrow, a tool for monitoring Apache Kafka clusters:

According to Burrow’s GitHub page: Burrow is a Kafka monitoring tool that keeps track of consumer lag. It does not provide any user interface to monitor. It provides several HTTP request endpoints to get information about Kafka clusters and consumer groups. Burrow also has a notifier system that can notify you (via email or at an HTTP endpoint) if a consumer group has met certain criteria.

Burrow is designed in a modular way that separates the work done into multiple subsystems. Below are the subsystems in Burrow.

Clusters: This component periodically updates the topic list and the last committed offset for each partition.

Consumers: This component fetches the information about consumer groups like consumer lag, etc.

Storage: This component stores all the information in a system.

Evaluator: Gets information from storage and checks the status of consumer groups, like if it’s consuming messages at a slow rate using consumer lag evaluation rules.

Notifier: Requests the status of a consumer group and sends a notification if certain criteria are met via email, etc.

HTTP server: Provides HTTP endpoints to fetch information about a cluster and consumers.

This looks like a good tool to hook into an existing monitoring solution.

Comments closed

A SQL Client For Apache Flink

Published 2018-05-30 by Kevin Feasel

Alex Woodie points out that Apache Flink now has a SQL client built in:

Apache Flink has contained SQL functionality since Flink version 1.1, which introduced a SQL API based on Apache Calcite and a table API, too. While the combined SQL and Table API today provides valuable ways for developers to apply well-understood relational data and SQL constructs to the world of stream data processing, its usefulness is somewhat limited.

For starters, only Scala and Java experts can avail themselves of API, according to the description of the new SQL client, which is codenamed FLIP-24. What’s more, any table program that was written with the SQL and Table API had to be packaged with Apache Maven, a Java-based project management tool, and submitted to the Flink cluster before running.

With the launch of the SQL CLI Client in Flink version 1.5, the Flink community is taking its support for SQL in a new direction. According to the FLIP-24 project page, providing an interactive shell will not only make Flink accessible to non-programmers, including data scientists, but it will also eliminate the need for a full IDE to program Flink apps. With millions of SQL-loving data analysts out there, the benefits could certainly be vast.

Good stuff. Feasel’s Law in action.

Comments closed

Stream-To-Stream Joins In Spark

Published 2018-05-25 by Kevin Feasel

Ayush Tiwari shows how to join a pair of streams in Apache Spark 2.3:

In Spark 2.3, it added support for stream-stream joins, i.e, we can join two streaming Datasets/DataFrames and in this blog we are going to see how beautifully spark now give support for joining the two streaming dataframes.

I this example, I am going to use
Apache Spark 2.3.0
Apache Kafka 0.11.0.1
Scala 2.11.8

Click through for the demo.

Comments closed

Spark: DataFrame To RDD For Data Cleansing

Published 2018-05-24 by Kevin Feasel

Gilad Moscovitch walks us through a common data cleansing problem with Spark data frames:

A problem can arise when one of the inner fields of the json, has undesired non-json values in some of the records.

For instance, an inner field might contains HTTP errors, that would be interpreted as a string, rather than as a struct.

As a result, our schema would look like:

root

|– headers: struct (nullable = true)

| |– items: array (nullable = true)

| | |– element: struct (containsNull = true)

|– requestBody: string (nullable = true)

Instead of

root

|– headers: struct (nullable = true)

| |– items: array (nullable = true)

| | |– element: struct (containsNull = true)

|– requestBody: struct (nullable = true)

| |– items: array (nullable = true)

| | |– element: struct (containsNull = true)

When trying to explode a “string” type, we will get a miss type error:

org.apache.spark.sql.AnalysisException: Can’t extract value from requestBody#10

Click through to see how to handle this scenario cleanly.

Comments closed

Visualization Over Kafka And KSQL

Published 2018-05-24 by Kevin Feasel

Shant Hovsepian shows off a data visualization tool which can read Kafka Streams data:

KSQL is a game-changer not only for application developers but also for non-technical business users. How? The SQL interface opens up access to Kafka data to analytics platforms based on SQL. Business analysts who are accustomed to non-coding, drag-and-drop interfaces can now apply their analytical skills to Kafka. So instead of continually building new analytics outputs due to evolving business requirements, IT teams can hand a comprehensive analytics interface directly to the business analysts. Analysts get a self-service environment where they can independently build dashboards and applications.

Arcadia Data is a Confluent partner that is leading the charge for integrating visual analytics and BI technology directly with KSQL. We’ve been working to combine our existing analytics stack with KSQL to provide a platform that requires no complicated new skills for your analysts to visualize streaming data. Just as they will create semantic layers, build dashboards, and deploy analytical applications on batch data, they can now do the same on streaming data. Real-time analytics and visualizations for business users have largely been a misnomer until now. For example, some architectures enabled visualizations for end users by staging Kafka data into a separate data store, which added latency. KSQL removes that latency to let business users see the most recent data directly in Kafka and react immediately.

Click through for a couple repos and demos.

Comments closed

Installing Kafka On Ubuntu

Published 2018-05-22 by Kevin Feasel

Gaurav Garg has an article on installing Apache Kafka on a fresh Ubuntu installation:

For beginners, the default configurations of the Kafka broker are good enough, but for production-level setup, one must understand each configuration. I am going to explain some of these configurations.

broker.id: The ID of the broker instance in a cluster.

zookeeper.connect: The ZooKeeper address (can list multiple addresses comma-separated for the ZooKeeper cluster). Example: localhost:2181,localhost:2182.

zookeeper.connection.timeout.ms: Time to wait before going down if, for some reason, the broker is not able to connect.

Read the whole thing.

Comments closed