Press "Enter" to skip to content

Category: Hadoop

Machine Learning and Delta Lake

Brenner Heintz and Denny Lee walk us through solving data engineering problems with Delta Lake:

As a result, companies tend to have a lot of raw, unstructured data that they’ve collected from various sources sitting stagnant in data lakes. Without a way to reliably combine historical data with real-time streaming data, and add structure to the data so that it can be fed into machine learning models, these data lakes can quickly become convoluted, unorganized messes that have given rise to the term “data swamps.”

Before a single data point has been transformed or analyzed, data engineers have already run into their first dilemma: how to bring together processing of historical (“batch”) data, and real-time streaming data. Traditionally, one might use a lambda architecture to bridge this gap, but that presents problems of its own stemming from lambda’s complexity, as well as its tendency to cause data loss or corruption.
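Delta Lake's answer is to let one table serve as both a batch and a streaming source and sink, so you don't need two parallel pipelines. Here's a rough PySpark sketch of the idea; the paths and schema are made up, and it assumes a Spark session with the Delta libraries available:

from pyspark.sql import SparkSession

# Sketch only: one Delta table used for both streaming ingest and batch queries.
# Paths, schema, and checkpoint location are placeholders.
spark = SparkSession.builder.appName("delta-batch-and-stream").getOrCreate()

raw_stream = (spark.readStream
              .format("json")
              .schema("device_id STRING, reading DOUBLE, event_time TIMESTAMP")
              .load("/mnt/raw/events/"))

# Stream new events into the Delta table as they arrive.
(raw_stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/delta/events/_checkpoints")
 .start("/mnt/delta/events"))

# The same table is queryable as an ordinary batch DataFrame, so historical
# and freshly streamed data live in one place -- no lambda architecture.
history = spark.read.format("delta").load("/mnt/delta/events")
history.groupBy("device_id").count().show()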

Read the whole thing.

Comments closed

Cloudera Stream Processing

Dinesh Chandrasekhar announces the new iteration of Cloudera’s streaming data processor:

Cloudera Stream Processing (CSP) is a product within the Cloudera DataFlow platform that packs Kafka along with some key streaming components that empower enterprises to handle some of the most complex and sophisticated streaming use cases. CSP provides advanced messaging, real-time processing and analytics on real-time streaming data using Apache Kafka. CSP also supports key management and monitoring capabilities powered by Cloudera Streams Management (CSM).

Sounds like they’re taking the Kafka route over Spark Streaming, Flink, Airflow, etc.

Comments closed

Kafka 2.3 and Kafka Connect Improvements

Robin Moffatt goes over improvements in Kafka Connect with the release of Apache Kafka 2.3:

A Kafka Connect cluster is made up of one or more worker processes, and the cluster distributes the work of connectors as tasks. When a connector or worker is added or removed, Kafka Connect will attempt to rebalance these tasks. Before version 2.3 of Kafka, the cluster stopped all tasks, recomputed where to run all tasks, and then started everything again. Each rebalance halted all ingest and egress work for usually short periods of time, but also sometimes for a not insignificant duration of time.

Now with KIP-415, Apache Kafka 2.3 instead uses incremental cooperative rebalancing, which rebalances only those tasks that need to be started, stopped, or moved. For more details, there are available resources that you can read, listen to, and watch, or you can hear the lead engineer on the work, Konstantine Karantasis, talk about it in person at the upcoming Kafka Summit.
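If you want to watch a rebalance happen on your own cluster, the Connect REST API will show you which worker each task lands on. A quick sketch in Python (the worker address is an assumption; the endpoints are the standard Connect REST ones):

import requests

# Sketch: list each connector's tasks and the worker currently running them.
# Assumes a Connect worker with its REST interface on localhost:8083.
BASE = "http://localhost:8083"

for name in requests.get(f"{BASE}/connectors").json():
    status = requests.get(f"{BASE}/connectors/{name}/status").json()
    print(name, status["connector"]["state"])
    for task in status["tasks"]:
        print("  task", task["id"], task["state"], "on", task["worker_id"])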

Looks like some nice improvements here.

Comments closed

The Databricks File System

Brad Llewellyn takes us through the Azure Databricks File System:

Today, we’re going to talk about the Databricks File System (DBFS) in Azure Databricks. If you haven’t read the previous posts in this series, Introduction, Cluster Creation, and Notebooks, they may provide some useful context. You can find the files from this post in our GitHub Repository. Let’s move on to the core of this post, DBFS.

As we mentioned in the previous post, there are three major concepts for us to understand about Azure Databricks: Clusters, Code, and Data. For this post, we’re going to talk about the storage layer underneath Azure Databricks, DBFS. Since Azure Databricks manages Spark clusters, it requires an underlying Hadoop Distributed File System (HDFS). This is exactly what DBFS is. Basically, HDFS is the low-cost, fault-tolerant, distributed file system that makes the entire Hadoop ecosystem work. We may dig deeper into HDFS in a later post. For now, you can read more about HDFS here and here.
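If you want to poke around DBFS yourself, a couple of lines in a notebook go a long way. A small sketch, assuming you're inside a Databricks notebook (where dbutils and the sample /databricks-datasets mount exist):

# Sketch: basic DBFS interaction from within a Databricks notebook.
# dbutils and display() only exist inside Databricks.
display(dbutils.fs.ls("/databricks-datasets"))

# Read a file through Spark using a dbfs:/ path...
df = spark.read.text("dbfs:/databricks-datasets/README.md")
df.show(5, truncate=False)

# ...and write results back out to DBFS.
df.write.mode("overwrite").parquet("dbfs:/tmp/readme_copy")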

Click through for more detail on DBFS.

Comments closed

An Apache Flume Overview

Daniel Berman takes us through an overview of Apache Flume:

Apache Flume was developed by Cloudera to provide a way to quickly and reliably stream large volumes of log files generated by web servers into Hadoop. There, applications can perform further analysis on data in a distributed environment. Initially, Apache Flume was developed to handle only log data. Later, it was equipped to handle event data as well.

Click through to get a code-free, high-level understanding of Flume and where it can work for you.

Comments closed

Apache Kafka Tutorials

Michael Drogalis announces Tutorials for Apache Kafka:

For beginners, Kafka Tutorials reveals the “shape” of the problems that event streaming can solve. It makes it easier to recognize the domain of things that you might use event streaming for. Moreover, each tutorial reliably takes you from zero to working code by following each of the steps.

For the experienced, it’s a crucial reference guide that makes your work easier. Easily look up how to join a stream and a table together when you’re rusty, or quickly recall how to merge discrete streams together. Over time, we’ll introduce more advanced material that makes use of the entire stack.

Check it out, including a discussion of their YAML renderer.

Comments closed

KSQL UDFs

Mitch Seymour takes us through user-defined functions in Kafka’s flavor of SQL:

One of KSQL’s most powerful features is allowing users to build their own KSQL functions for processing real-time streams of data. These functions can be invoked on individual messages (user-defined functions or UDFs) or used to perform aggregations on groups of messages (user-defined aggregate functions or UDAFs).

The previous blog post How to Build a UDF and/or UDAF in KSQL 5.0 discussed some key steps for building and deploying a custom KSQL UDF/UDAF. Now with Confluent Platform 5.3.0, creating custom KSQL functions is even easier when you leverage Maven, a tool for building and managing dependencies in Java projects.

Read on to see just how easy it is.

Comments closed

Parsing Rows Manually with Spark .NET

Ed Elliott shows how we can solve a challenging problem when newlines are in the wrong place:

So the first thing we need to do is to read in the whole file in one chunk; if we just do a standard read, the file will get broken into rows based on the newline character:

var file = spark.Read().Option("wholeFile", true).Text(@"C:\git\files\newline-as-data.txt");

This solution is a bit complex. As Ed points out, you’re better off reshaping the file before you try to process it. If it’s a structured file like the example Ed has, a regular expression can do the trick.
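If you do go the reshaping route, the idea is to collapse the stray newlines before Spark ever sees the file. A rough sketch in Python rather than .NET (the record layout here is hypothetical; the point is that a lookahead for the start of a genuine row lets you repair the breaks up front):

import re

# Sketch: assume every real record starts a line with a numeric id followed
# by a comma; any other newline is data and gets replaced with a space.
with open(r"C:\git\files\newline-as-data.txt", encoding="utf-8") as f:
    raw = f.read()

fixed = re.sub(r"\n(?!\d+,)", " ", raw)

with open(r"C:\git\files\newline-as-data-fixed.txt", "w", encoding="utf-8") as f:
    f.write(fixed)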

Comments closed

Dimensional Load with Databricks

Leo Furlong shows how we can load an Azure SQL Data Warehouse dimension with Databricks:

Ingesting data into the Data Lake occurs in steps 1 and 2 in our architecture.  Azure Data Factory (ADF) provides an excellent mechanism for loading data from source applications into a Data Lake stored in Azure Data Lake Store Gen2.  In fact, Microsoft offers a template in the ADF Template gallery which provides a metadata driven approach for doing so.  The template comes with a control table example in a SQL Server Database, a data source dataset and a data destination dataset.  More on this template can be found here in the official documentation.

I appreciate that this is a full walkthrough of the process, not just one step.
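The Databricks portion of that walkthrough ultimately comes down to reading the staged files out of the lake and writing the finished dimension into Azure SQL Data Warehouse. A rough sketch with the sqldw connector (every name, path, and secret below is a placeholder):

# Sketch: load staged data from the lake and write a dimension table to
# Azure SQL Data Warehouse via the Databricks sqldw connector.
dw_password = dbutils.secrets.get(scope="kv", key="dw-password")

dim_customer = (spark.read
                .parquet("abfss://staging@mydatalake.dfs.core.windows.net/customer/")
                .selectExpr("CustomerKey", "CustomerName", "City", "Country"))

(dim_customer.write
 .format("com.databricks.spark.sqldw")
 .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;"
                "database=mydw;user=loader;password=" + dw_password)
 .option("tempDir", "wasbs://tempdata@mystorageacct.blob.core.windows.net/tmp")
 .option("forwardSparkAzureStorageCredentials", "true")
 .option("dbTable", "dbo.DimCustomer")
 .mode("overwrite")
 .save())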

Comments closed

Stream Processing with Kafka

Satish Sharma wraps up a series on Kafka and Kafka Streams:

Consider a hypothetical fleet management company that needs a dashboard to get insight into its day-to-day activities related to vehicles. Each vehicle in this fleet management company is fitted with a GPS-based geolocation emitter, which emits location data containing the following information:

1. Vehicle Id: A unique id is given to each vehicle on registration with the company.
2. Latitude and Longitude: geolocation information of vehicle.
3. Availability: The value of this field signifies whether the vehicle is available to take a booking or not.
4. Current Status (Online/Offline): Denotes whether the vehicle is on duty or not.
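
For a feel of what the emitter side might look like, here's a small sketch using the kafka-python client (broker address, topic name, and the exact event shape are assumptions, not Satish's code):

import json, random, time
from kafka import KafkaProducer

# Sketch: publish vehicle-location events like the ones described above.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    event = {
        "vehicleId": f"VH-{random.randint(1, 50):03d}",
        "latitude": round(random.uniform(12.8, 13.2), 6),
        "longitude": round(random.uniform(77.4, 77.8), 6),
        "availability": random.choice([True, False]),
        "status": random.choice(["Online", "Offline"]),
    }
    # Key by vehicle id so every event for a vehicle lands on the same partition.
    producer.send("vehicle-locations", key=event["vehicleId"].encode(), value=event)
    time.sleep(1)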

Read through the article and then check out Satish’s GitHub repo for more.

Comments closed