Category: Hadoop

Azure Databricks and Delta Lake

Brad Llewellyn starts a new series on Delta Lake in Azure Databricks:

Saving the data in Delta format is as simple as replacing the .format(“parquet”) function with .format(“delta”).  However, we see a major difference when we look at the table creation.  When creating a table using Delta, we don’t have to specify the schema, because the schema is already strongly defined when we save the data.  We also see that Delta tables can be easily queried using the same SQL we’re used to.  Next, let’s compare what the raw files look like by examining the blob storage container that we are storing them in.

There are some good demos in this post and it promises to be a nice series.
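
If you want to try the pattern Brad describes before clicking through, a minimal PySpark sketch looks something like this. The source path and table name are made up for illustration; this assumes a cluster with Delta Lake available (as on Azure Databricks).

```python
# Minimal sketch on a Databricks cluster; paths and table name are placeholders.
df = spark.read.format("csv").option("header", "true").load("/mnt/raw/orders.csv")

# Same write as Parquet, just swapping .format("parquet") for .format("delta")
df.write.format("delta").mode("overwrite").save("/mnt/delta/orders")

# No schema in the DDL: Delta already stores the schema alongside the data
spark.sql("CREATE TABLE orders USING DELTA LOCATION '/mnt/delta/orders'")
spark.sql("SELECT COUNT(*) FROM orders").show()
```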

Converting Databricks Notebooks to ipynb

Dave Wentzel shows how we can convert a Databricks notebook (in DBC format) to a normal Jupyter notebook (in ipynb format):

Databricks natively stores its notebook files by default as DBC files, a closed, binary format. A .dbc file has a nice benefit of being self-contained: one DBC file can consist of an entire folder of notebooks and supporting files. But other than that, DBC files are frankly obnoxious.

Read on to see how to convert between these two formats.
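
As a related aside (and not the route Dave takes, since his post starts from DBC exports): if you still have access to the workspace, the Databricks Workspace API can export a notebook straight to ipynb. A rough Python sketch, where the workspace URL, token, and notebook path are all placeholders:

```python
import base64
import requests

host = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
token = "dapi..."                                        # placeholder personal access token
path = "/Users/someone@example.com/MyNotebook"           # placeholder notebook path

resp = requests.get(
    f"{host}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": path, "format": "JUPYTER"},
)
resp.raise_for_status()

# The API returns the notebook as base64-encoded ipynb content
with open("MyNotebook.ipynb", "wb") as f:
    f.write(base64.b64decode(resp.json()["content"]))
```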

Querying Pulsar Streams with Apache Flink

Sijie Guo and Markos Sfikas show how we can interact with Apache Pulsar using Apache Flink:

The latest integration between Flink 1.9.0 and Pulsar addresses most of the previously mentioned shortcomings. The contribution of Alibaba’s Blink to the Flink repository adds many enhancements and new features to the processing framework that make the integration with Pulsar significantly more powerful and impactful. Flink 1.9.0 brings Pulsar schema integration into the picture, makes the Table API a first-class citizen and provides an exactly-once streaming source and at-least-once streaming sink with Pulsar. Lastly, with schema integration, Pulsar can now be registered as a Flink catalog, making running Flink queries on top of Pulsar streams a matter of a few commands. In the following sections, we will take a closer look at the new integrations and provide examples of how to query Pulsar streams using Flink SQL.

Read on to see this integration in action.
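
To give a flavor of the "few commands" claim: once the pulsar-flink connector is on the classpath and Pulsar is registered as a Flink catalog (the post walks through that setup in the SQL client), a query over a topic is just SQL. The sketch below is only an illustration through a recent PyFlink Table API rather than the 1.9-era SQL client, and the catalog, topic, and field names are all assumptions.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assumes a catalog named "pulsar" has already been registered (as described
# in the post), so Pulsar topics show up as tables; "tweets" and "word" are
# made-up names for illustration.
t_env.use_catalog("pulsar")

t_env.sql_query("""
    SELECT word, COUNT(*) AS cnt
    FROM tweets
    GROUP BY word
""").execute().print()
```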

JupyterLab Integration for Databricks

Bernhard Walter announces an integration between JupyterLab and Databricks:

This blog post starts with a quick overview of what using a remote Databricks cluster from your local JupyterLab looks like. It then provides an end-to-end example of working with JupyterLab Integration, followed by an explanation of the differences from Databricks Connect. If you want to try it yourself, the last section explains the installation.

I like this a lot, as it fights back a bit against the balkanization of data science: it means I don’t need to keep one set of notebooks here and another set of notebooks there and a third set of notebooks somewhere else.

KSQL to ksqlDB

Jay Kreps announces a new name for KSQL:

Today marks a new release of KSQL, one so significant that we’re giving it a new name: ksqlDB. Like KSQL, ksqlDB remains freely available and community licensed, and you can get the code directly on GitHub. I’ll first share about what we’ve added in this release, then talk about why I think it is so important and explain the new naming.

There are two new major features we’re adding: pull queries and connector management.

This looks really interesting.
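
For a sense of what a pull query is: instead of a continuously running push query, you ask a materialized table for the current value of a key and get a single response back. Here's a hedged Python sketch against the REST API; the server URL, table, and key are made up, and the endpoint details may have shifted since this release, so treat it as an approximation.

```python
import requests

# Placeholder server, table, and key; a pull query reads the current state
# of a materialized table rather than streaming results forever.
resp = requests.post(
    "http://localhost:8088/query",
    headers={"Accept": "application/vnd.ksql.v1+json"},
    json={
        "ksql": "SELECT * FROM user_totals WHERE ROWKEY = 'alice';",
        "streamsProperties": {},
    },
)
print(resp.json())
```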

Visualizing Kafka Data Using D3

Mihalis Tsoukalos extracts, explores, and visualizes data (with D3) from a Kafka topic:

Now that you have your data in JSON format, you will use D3.js to visualize it. Since the JavaScript code is embedded in the HTML file, the final version of the D3.js code can be found in visualize-spatial.html.

D3 is extremely powerful, though that power comes with a fairly steep learning curve.
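
Mihalis builds his own extraction pipeline in the post; purely as a generic illustration of the "get the topic into a JSON file that D3 can load" step, something like this works in Python. The topic name and broker address are placeholders.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "spatial-data",                          # placeholder topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,                # stop iterating once the topic is drained
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

records = [msg.value for msg in consumer]

# D3 can then pull this file in with d3.json("data.json")
with open("data.json", "w") as f:
    json.dump(records, f)
```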

Columnar File Formats in Hadoop

Matthew Rathbone gives us an overview of the benefits behind the ORC and Parquet file formats:

People throw this term around a lot, but I don’t think it is always clear exactly what this means in practice.

The textbook definition is that columnar file formats store data by column, not by row. CSV, TSV, JSON, and Avro are traditional row-based file formats; Parquet and ORC are columnar file formats.

Read on for a comparison and example. In the SQL Server world, think columnstore versus rowstore indexes and you won’t be too far off.
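
A quick way to see the "store by column" benefit for yourself is column pruning: a columnar reader only touches the bytes for the columns you ask for. A small PyArrow sketch with made-up data:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "customer_id": [1, 2, 3, 4],
    "region": ["east", "west", "east", "south"],
    "amount": [10.5, 42.0, 7.25, 13.0],
})
pq.write_table(table, "sales.parquet")

# Only the "region" and "amount" column chunks are read from disk;
# a row-based format like CSV would have to scan every row in full.
subset = pq.read_table("sales.parquet", columns=["region", "amount"])
print(subset)
```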

Profiling Hive Jobs on Tez

Dmitry Tolpeko takes us through Hive query diagnostics:

I was asked to diagnose and tune a long and complex ad-hoc Hive query that spent more than 4 hours in the reduce stage. The fetch from the map tasks and the merge phase completed fairly quickly (within 10 minutes), and the reducers spent most of their time iterating the input rows and performing the aggregations defined by the query – MIN, SUM, COUNT, PERCENTILE_APPROX, and others – on the specific columns.

After the merge phase a Tez reducer does not output many log records to help you diagnose the performance issues and find the bottlenecks. In this article I will describe how you can profile an already running Tez task without restarting the job.

Click through for the process, as well as the root cause of the problem.
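
Dmitry has his own profiling approach in the post. As a generic fallback, you can get a rough picture of where a running reducer is spending its time by sampling its JVM with jstack on the worker node and counting the top frames. A sketch, assuming JDK tools are on the node; the PID is a placeholder you would find yourself (for example via ps or the Tez UI):

```python
import subprocess
import time
from collections import Counter

pid = "12345"          # placeholder: PID of the Tez child JVM on the worker node
top_frames = Counter()

for _ in range(30):    # take ~30 samples, one per second
    out = subprocess.run(["jstack", pid], capture_output=True, text=True).stdout
    runnable = False
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("java.lang.Thread.State:"):
            runnable = "RUNNABLE" in line
        elif runnable and line.startswith("at "):
            top_frames[line[3:]] += 1      # record only the topmost frame per thread
            runnable = False
    time.sleep(1)

for frame, hits in top_frames.most_common(10):
    print(hits, frame)
```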

Securing Data on ElasticMapReduce

Duncan Chen takes us through data encryption options when using ElasticMapReduce:

Data encryption is an effective solution to bolster data security. You can make sure that only authorized users or applications read your sensitive data by encrypting your data and managing access to the encryption key. One of the main reasons that customers from regulated industries such as healthcare and finance choose Amazon EMR is because it provides them with a compliant environment to store and access data securely.

This post provides a detailed walkthrough of two new encryption options to help you secure your EMR cluster that handles sensitive data. The first option is native EBS encryption to encrypt volumes attached to EMR clusters. The second option is Amazon S3 encryption, which allows you to use different encryption modes and customer master keys (CMKs) for individual S3 buckets with Amazon EMR.

Click through for more details on each.
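
Both options get applied through an EMR security configuration that you attach when launching the cluster. A hedged boto3 sketch of the at-rest piece follows; the KMS key ARN is a placeholder, and the JSON field names reflect my reading of the EMR security-configuration schema, so verify them against the documentation before using this.

```python
import json
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Field names follow the EMR security-configuration schema as I understand it.
security_conf = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": False,
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",  # placeholder CMK
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",  # placeholder CMK
                "EnableEbsEncryption": True,
            },
        },
    }
}

emr.create_security_configuration(
    Name="emr-at-rest-encryption",
    SecurityConfiguration=json.dumps(security_conf),
)
```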
