Press "Enter" to skip to content

Category: Hadoop

Querying Pulsar Streams with Apache Flink

Sijie Guo and Markos Sfikas show how we can interact with Apache Pulsar using Apache Flink:

The latest integration between Flink 1.9.0 and Pulsar addresses most of the previously mentioned shortcomings. The contribution of Alibaba’s Blink to the Flink repository adds many enhancements and new features to the processing framework that make the integration with Pulsar significantly more powerful and impactful. Flink 1.9.0 brings Pulsar schema integration into the picture, makes the Table API a first-class citizen and provides an exactly-once streaming source and at-least-once streaming sink with Pulsar. Lastly, with schema integration, Pulsar can now be registered as a Flink catalog, making running Flink queries on top of Pulsar streams a matter of a few commands. In the following sections, we will take a closer look at the new integrations and provide examples of how to query Pulsar streams using Flink SQL.

Read on to see this integration in action.
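To give a flavor of the idea, here is a rough Python sketch of registering Pulsar as a Flink catalog and querying a topic with Flink SQL via pyflink. The service URLs, the connector property names, and the clicks topic are illustrative assumptions, and this uses a newer pyflink API than the Flink 1.9 release the post discusses, so treat it as a shape rather than copy-paste code.

# Minimal pyflink sketch: register Pulsar as a catalog and query a topic with Flink SQL.
# Connector property names ('type', 'service-url', 'admin-url') and the topic name are
# assumptions about the pulsar-flink connector; adjust for your own setup.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register Pulsar as a Flink catalog so its topics show up as queryable tables.
t_env.execute_sql("""
    CREATE CATALOG pulsar WITH (
        'type' = 'pulsar',
        'service-url' = 'pulsar://localhost:6650',
        'admin-url'   = 'http://localhost:8080'
    )
""")
t_env.execute_sql("USE CATALOG pulsar")

# With schema integration, the topic's schema becomes the table schema.
result = t_env.execute_sql(
    "SELECT user_id, COUNT(*) AS events FROM `clicks` GROUP BY user_id"
)
result.print()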


JupyterLab Integration for Databricks

Bernhard Walter announces an integration between JupyterLab and Databricks:

This blog post starts with a quick overview of what using a remote Databricks cluster from your local JupyterLab looks like. It then provides an end-to-end example of working with JupyterLab Integration, followed by an explanation of the differences from Databricks Connect. If you want to try it yourself, the last section explains the installation.

I like this a lot, as it fights back a bit against the balkanization of data science: it means I don’t need to keep one set of notebooks here and another set of notebooks there and a third set of notebooks somewhere else.


KSQL to ksqlDB

Jay Kreps announces a new naming for KSQL:

Today marks a new release of KSQL, one so significant that we’re giving it a new name: ksqlDB. Like KSQL, ksqlDB remains freely available and community licensed, and you can get the code directly on GitHub. I’ll first share about what we’ve added in this release, then talk about why I think it is so important and explain the new naming.

There are two new major features we’re adding: pull queries and connector management.

This looks really interesting.
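Pull queries are the headline feature for me: instead of a push query that streams updates forever, you ask for the current value for a key in a materialized table and get a single answer back. As a hedged sketch of what that could look like from Python against ksqlDB's REST API (the table name, key column, and server address are placeholders, and the response shape can vary by version):

import requests  # third-party: pip install requests

# Default ksqlDB REST endpoint on a local server (assumed).
KSQLDB_URL = "http://localhost:8088/query"

# A pull query returns the current state for a key and then completes,
# unlike a push query, which streams changes indefinitely.
payload = {
    "ksql": "SELECT * FROM user_totals WHERE user_id = 'alice';",  # hypothetical table/column
    "streamsProperties": {},
}

resp = requests.post(
    KSQLDB_URL,
    json=payload,
    headers={"Accept": "application/vnd.ksql.v1+json"},
)
resp.raise_for_status()

# For a pull query the body is a complete JSON array: a header element followed by row elements.
for item in resp.json():
    print(item)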


Visualizing Kafka Data Using D3

Mihalis Tsoukalos extracts, explores, and visualizes data (with D3) from a Kafka topic:

Now that you have your data in JSON format, you will use D3.js in order to visualize it. As JavaScript code is embedded in HTML files, the final version of the D3.js code can be found in visualize-spatial.html, which contains the following code:

D3 is extremely powerful, though that power comes with a fairly steep learning curve.
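Before the D3 step, you need the topic's contents dumped out as JSON. Mihalis has his own extraction pipeline; purely as an illustrative sketch (the topic name, broker address, and output file are made up), here is one way to pull messages off a topic into a JSON file with the kafka-python client so d3.json() can load them later.

import json
from kafka import KafkaConsumer  # pip install kafka-python

# Read the topic from the beginning and stop once no new messages arrive for 5 seconds.
consumer = KafkaConsumer(
    "sensor-readings",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    enable_auto_commit=False,
    consumer_timeout_ms=5000,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

records = [msg.value for msg in consumer]

# Write the messages out as a JSON array for the visualization step.
with open("readings.json", "w") as f:
    json.dump(records, f, indent=2)

print(f"Wrote {len(records)} records")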


Columnar File Formats in Hadoop

Matthew Rathbone gives us an overview of the benefits behind the ORC and Parquet file formats:

People throw this term around a lot, but I don’t think it is always clear exactly what this means in practice.

The textbook definition is that columnar file formats store data by column, not by row. CSV, TSV, JSON, and Avro are traditional row-based file formats. Parquet and ORC are columnar file formats.

Read on for a comparison and example. In the SQL Server world, think columnstore versus rowstore indexes and you won’t be too far off.
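To make the "by column, not by row" point concrete, here is a small sketch using pandas and pyarrow: write the same frame to CSV and to Parquet, then read back a single column. With the columnar file, only that column's data needs to be read; with the row-based CSV, every row still has to be parsed. The column names and data are invented for illustration.

import pandas as pd
import pyarrow.parquet as pq  # pip install pyarrow

# A toy frame; imagine millions of rows with many wide columns.
df = pd.DataFrame({
    "user_id": range(1_000),
    "country": ["US", "GR", "DE", "JP"] * 250,
    "amount": [i * 0.5 for i in range(1_000)],
})

df.to_csv("sales.csv", index=False)          # row-based
df.to_parquet("sales.parquet", index=False)  # columnar (pyarrow under the hood)

# Columnar read: only the 'amount' column is materialized from disk.
amounts_parquet = pq.read_table("sales.parquet", columns=["amount"]).to_pandas()

# Row-based read: the whole file is parsed even though we keep one column.
amounts_csv = pd.read_csv("sales.csv", usecols=["amount"])

print(amounts_parquet["amount"].sum(), amounts_csv["amount"].sum())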


Profiling Hive Jobs on Tez

Dmitry Tolpeko takes us through Hive query diagnostics:

I was asked to diagnose and tune a long and complex ad-hoc Hive query that spent more than 4 hours on the reduce stage. The fetch from the map tasks and the merge phase completed fairly quickly (within 10 minutes), and the reducers spent most of their time iterating the input rows and performing the aggregations defined by the query (MIN, SUM, COUNT, PERCENTILE_APPROX, and others) on the specific columns.

After the merge phase a Tez reducer does not output many log records to help you diagnose the performance issues and find the bottlenecks. In this article I will describe how you can profile an already running Tez task without restarting the job.

Click through for the process, as well as the root cause of the problem.
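I won't spoil Dmitry's exact technique, but one generic way to profile an already-running JVM task without restarting anything is to take repeated thread dumps and tally which frames keep showing up at the top of the stack. A rough sketch of that "poor man's profiler" idea in Python follows; the pid and sampling parameters are placeholders, and this is not necessarily the approach used in the article.

import subprocess
import time
from collections import Counter

def sample_stacks(pid: int, samples: int = 30, interval: float = 1.0) -> Counter:
    """Run jstack repeatedly against a running JVM and count the top frame
    of each RUNNABLE thread; hot methods bubble up in the counts."""
    hits = Counter()
    for _ in range(samples):
        dump = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, check=True
        ).stdout
        for block in dump.split("\n\n"):          # one block per thread
            if "RUNNABLE" not in block:
                continue
            frames = [ln.strip() for ln in block.splitlines() if ln.strip().startswith("at ")]
            if frames:
                hits[frames[0]] += 1              # top-of-stack frame for this thread
        time.sleep(interval)
    return hits

if __name__ == "__main__":
    # 12345 is a placeholder for the Tez child JVM's pid on the worker node.
    for frame, count in sample_stacks(12345).most_common(15):
        print(count, frame)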


Securing Data on ElasticMapReduce

Duncan Chen takes us through data encryption options when using ElasticMapReduce:

Data encryption is an effective solution to bolster data security. You can make sure that only authorized users or applications read your sensitive data by encrypting your data and managing access to the encryption key. One of the main reasons that customers from regulated industries such as healthcare and finance choose Amazon EMR is because it provides them with a compliant environment to store and access data securely.

This post provides a detailed walkthrough of two new encryption options to help you secure your EMR cluster that handles sensitive data. The first option is native EBS encryption to encrypt volumes attached to EMR clusters. The second option is Amazon S3 encryption, which allows you to use different encryption modes and customer master keys (CMKs) for individual S3 buckets with Amazon EMR.

Click through for more details on each.
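Both options get wired up through an EMR security configuration. As a hedged sketch of what that looks like with boto3: the KMS key ARNs below are placeholders, and the JSON field names follow the EMR security-configuration schema as I recall it, so verify them against the current documentation.

import json
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Security configuration enabling SSE-KMS for S3 plus EBS/local-disk encryption.
security_config = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": False,
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/s3-key-placeholder",
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/ebs-key-placeholder",
                "EnableEbsEncryption": True,
            },
        },
    }
}

emr.create_security_configuration(
    Name="encrypted-emr-config",
    SecurityConfiguration=json.dumps(security_config),
)
# Reference the configuration by name in run_job_flow when launching the cluster.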


Databricks + Azure Synapse Analytics

David Meyer and Clinton Ford explain how you can integrate Azure Databricks with Azure Synapse Analytics:

In the last two years since it first became available, thousands of companies have adopted Azure Databricks, making it one of the fastest growing data and AI services on Microsoft Azure. Customers now process over 2 exabytes per month with millions of server-hours spinning up every day. All of this is driven by organizations like Electrolux, Shell, and renewables.AI that are using Azure Databricks to process data at massive scale for data science and analytics.

Within this amazing adoption is a specific solution architecture to highlight called the Modern Data Warehouse (MDW). Earlier this year we wrote about the performance and scale benefits of this solution, and part of the pattern’s success has been our close integration to Azure SQL Data Warehouse with a high-performance connector that was jointly engineered to make it fast and easy to move data between the two services.

Something interesting about Synapse is that its implementation of Spark is not the same as the Databricks implementation (perhaps for licensing reasons). But that doesn’t stop us from using Databricks to process and curate data for Synapse Analytics.
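The jointly engineered connector the post mentions is exposed in Databricks as a Spark data source. Here is a hedged PySpark sketch of pushing a curated DataFrame into a Synapse (SQL DW) table; the JDBC URL, storage account, credentials, and table names are placeholders, and the option names reflect my understanding of the com.databricks.spark.sqldw connector rather than a verified recipe.

# Intended to run inside a Databricks notebook or job, where the sqldw connector is available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # in Databricks this returns the provided session

curated = spark.table("curated.daily_sales")  # hypothetical curated table

(curated.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=dw;user=loader;password=<password>")
    .option("tempDir", "abfss://staging@mystorageacct.dfs.core.windows.net/sqldw-temp")  # PolyBase staging area
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.DailySales")
    .mode("overwrite")
    .save())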


Azure Synapse Analytics, Née Azure SQL DW

John Macintire explains Azure Synapse Analytics:

A cloud native, distributed SQL processing engine is at the foundation of Azure Synapse and is what enables the service to support the most demanding enterprise data warehousing workloads. This week at Ignite we introduced a number of exciting features to make data warehousing with Azure Synapse easier and allow organizations to use SQL for a broader set of analytics use cases.

There’s a fair amount of marketing-speak in here, but the gist is Azure SQL Data Warehouse + Spark + on-demand serverless queries (so you can, among other things, write T-SQL against your HDFS data). I think it has a better chance of long-lasting success than Azure SQL Data Warehouse.
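To ground the "T-SQL against your HDFS data" point: the serverless piece lets you point OPENROWSET at files sitting in the data lake and query them like a table. A hedged sketch of doing that from Python via pyodbc follows; the endpoint, storage path, and credentials are placeholders, and the exact syntax and availability should be checked against the Synapse documentation.

import pyodbc

# Placeholder connection string to a Synapse serverless (on-demand) SQL endpoint.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=sqladminuser;PWD=<password>"
)

# Query Parquet files in the data lake directly with T-SQL.
sql = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageacct.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""

for row in conn.cursor().execute(sql):
    print(row)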
