Cloudera Accessing Azure Data Lake Store

The Azure Data Lake team has announced that you can now access Azure Data Lake Store using a Cloudera cluster:

The Azure Data Lake (ADL) vision from the beginning has been to transform business data into intelligence by providing analytics on any data at cloud scale. ADL enterprise customers gain insights on their business data using a wide range of tools and platforms. Today’s release of Cloudera Enterprise 5.11 brings another very valuable and widely-used Hadoop computation platform to the set of platforms that can leverage ADLS. No matter what big data analytics platform you choose, Azure Data Lake Store provides a single high throughput enterprise-scale hierarchical file system data lake repository for big data.

Anyone with an Azure subscription can now deploy Cloudera clusters with ADLS. To get started, you can use the Cloudera Enterprise Data Hub template or the Cloudera Director template on Azure Marketplace to create a Cloudera cluster. Once the cluster is up, see here for more information on how to set up your Cloudera cluster with ADLS today!

That’s an interesting development.

Real-Time Weather With HDF

Balaji Kandregula shows how to use Hortonworks Data Flow components to process weather events in real time:

It’s live weather reporting using HDF, Kafka, and Solr.

Here are the environment requirements for implementing:

  • HDF (for HDF 2.0, you need Java 1.8).
  • Kafka.
  • Spark.
  • Solr.
  • Banana.

Now let’s get on to the steps!

There are a lot of moving parts there, but the pieces do plug in well enough and there are a lot of screen shots to guide you along the way.

Data Lake Zoning

Parth Patel, et al, explain that there ought to be several zones of data within a data lake:

Within a Data Lake, zones allow the logical and/or physical separation of data that keeps the environment secure, organized, and Agile. Typically, the use of 3 or 4 zones is encouraged, but fewer or more may be leveraged. A generic 4-zone system might include the following:

  1. Transient Zone — Used to hold ephemeral data, such as temporary copies, streaming spools, or other short-lived data before being ingested.
  2. Raw Zone – The zone in which raw data will be maintained. This is also the zone where sensitive data must be encrypted, tokenized, or otherwise secured.
  3. Trusted Zone – After Data Quality, Validation, or other processing is performed on data in the Raw Zone, it becomes the “source of truth” in this zone for downstream systems.
  4. Refined Zone – Manipulated and enriched data is kept in this zone. This is used to store the output from tools like Hive or external tools that will write into to the Data Lake.

Your particular situation may differ but I’d consider this to be good advice no matter where or how you’re storing data, such as a classical data warehouse or an ODS.

Choosing A Hadoop Data Format

Silvia Oliveros has a set of considerations to help you choose a file format for your data in Hadoop:

What does your pipeline look like, and what steps are involved?

Some of the file formats were optimized to work in certain situations. For example, Sequence files were designed to easily share data between Map Reduce (MR) jobs, so if your pipeline involves MR jobs then Sequence files make an excellent option. In the same vein, columnar data formats such as Parquet and ORC were designed to optimize query times; if the final stage of your pipeline needs to be optimized, using a columnar file format will increase speed while querying data.

At first, I’d suggest just using delimited files, as it’s easiest that way.  Once you have developed a bit of Hadoop maturity, then it makes sense to think about whether rowstore formats (like Parquet and Avro) or columnstore formats (like ORC) make sense for a particular data set.

Spark Deep Learning On AWS

Joseph Spisak, et al, show how to configure and use BigDL in Amazon Web Services’s ElasticMapReduce:

Classify text using BigDL

In this tutorial, we demonstrate how to solve a text classification problem based on the example found here. This example uses a convolutional neural network to classify posts in the 20 Newsgroup dataset into 20 categories.

We’ve provided a companion Jupyter notebook example on GitHub that you can open in the Jupyter dashboard to execute the code sections.

There’s a lot to this tutorial.

Pipeline Architecture With Kafka

Alexandra Wang describes how Pandora Media has used Apache Kafka for real-time ad serving using Kafka Connect:

Our ad server publishes billions of messages per day to Kafka. We soon realized that writing a proprietary Kafka consumer able to handle that amount of data with the desired offset management logic would be non-trivial, especially when requiring exactly-once-delivery semantics. We found that the Kafka Connect API paired with the HDFS connector developed by Confluent would be perfect for our use case.

We’ve also found it painful not having a central authority on data structures that can share their respective schemas across all services and applications. Without a central registry for message schemas, data serialization and deserialization for a variety of applications are troublesome and the pipeline is fragile when schema evolution happens. We found Schema Registry is a great solution for this problem.

To address the above two problems, we integrated the Kafka Connect API and Schema Registry into our Kafka-centered data pipeline.

Well worth reading, especially the difficulties that they’ve had during maintenance periods and in lower environments.

Using On HDInsight

Xiaoyong Zhu shows how to set up on Azure HDInsight:

H2O Flow is an interactive web-based computational user interface where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks. With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work – all within Flow’s browser-based environment. In this blog, we will only focus on its visualization part.

H2O FLOW web service lives in the Spark driver and is routed through the HDInsight gateway, so it can only be accessed when the spark application/Notebook is running

You can click the available link in the Jupyter Notebook, or you can directly access this URL:

Setup is pretty easy.

HDInsight 3.6 Available

Ashish Thapliyal points out some Hive improvements in HDInsight 3.6:

2 Create a new Hive table from scratch or alter Table

Create a new table by, clicking on the ‘+’ icon, which opens the create table wizard. Enter table name, column name and choose a data type from the dropdown. You can pick folloiwng advanced hive settings directly from the UI

  • Transactional : Turn on transaction support in Hive, by checking this flag. Note that the table must be bucketed and stored using an ACID compliant format (such as ORC).

  • Location : Hive stores the table data for managed tables in the Hive warehouse directory in HDFS which is configured in hive-site.xml with property hive.metastore.warehouse.dir. The default location is /apps/hive/warehouse. The location can be changed using the Location text field.

  • File Format : The default file format for CREATE TABLE statement is ORC. choose a format from the file format dropdown.

  • Row Format : Select a row format such as Field terminator, Lines terminator, and Stored File type.

  • Table can be altered to add new columns or change the column name or column datatype.

  • Tables can also be renames and altred

Read on for more improvements, including a graphical plan viewer and improved autocomplete.


Carter Shanklin notes that Hive now has the ability to run MERGE statements:

As scalable as Apache Hadoop is, many workloads don’t work well in the Hadoop environment because they need frequent or unpredictable updates. Updates using hand-written Apache Hive or Apache Spark jobs are extremely complex.  Not only are developers responsible for the update logic, they must also implement all rollback logic, detect and resolve write conflicts and find some way to isolate downstream consumers from in-progress updates. Hadoop has limited facilities for solving these problems and people who attempted it usually ended up limiting updates to a single writer and disabling all readers while updates are in progress.

This approach is too complicated and can’t meet reasonable SLAs for most applications. For many, Hadoop became just a place for analytics offload — a place to copy data and run complex analytics where they can’t interfere with the “real” work happening in the EDW.

This post mostly describes the gains rather than showing code, but it does show that Hive developers are looking at expanding beyond common Hadoop warehousing scenarios.

Architecting Kafka Streams

Bill Bejeck walks through a scenario in which one might use Kafka Streams:

Now, you’ve defined your source and we can start creating processors that’ll do the work on the data. The first goal is to mask the credit card numbers recorded in the incoming purchase records. The first processor is used to convert credit card numbers from 1234-5678-9123-2233 to xxxx-xxxx-xxxx-2233. The Stream.mapValues method performs the masking. The KStream.mapValues method returns a new KStream instance that changes the values, as specified by the given ValueMapper, as records flow through the stream. This particular KStream instance is the parent processor for any other processors you define. Our new parent processor provides the masked credit card numbers to any downstream processors with Purchase objects.

Unfortunately, this article seems like a mixture of high-level and low-level information that appeals more to people who already know how Kafka Streams works, but it is nevertheless interesting.


April 2017
« Mar