Press "Enter" to skip to content

Category: Hadoop

Databricks + Azure Synapse Analytics

David Meyer and Clinton Ford explain how you can integrate Azure Databricks with Azure Synapse Analytics:

In the last two years since it first became available, thousands of companies have adopted Azure Databricks, making it one of the fastest growing data and AI services on Microsoft Azure. Customers now process over 2 exabytes per month with millions of server-hours spinning up every day. All of this is driven by organizations like Electrolux, Shell, and renewables.AI that are using Azure Databricks to process data at massive scale for data science and analytics.

Within this amazing adoption is a specific solution architecture to highlight called the Modern Data Warehouse (MDW). Earlier this year we wrote about the performance and scale benefits of this solution, and part of the pattern’s success has been our close integration with Azure SQL Data Warehouse via a high-performance connector that was jointly engineered to make it fast and easy to move data between the two services.

Something interesting about Synapse is that its implementation of Spark is not the same as the Databricks implementation (perhaps for licensing reasons). But that doesn’t stop us from using Databricks to process and curate data for Synapse Analytics.
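
To make that concrete, here is a minimal sketch of what writing curated data out to Synapse looks like from the Databricks side with the built-in connector; the storage account, workspace, and table names are all placeholders.

```scala
// Minimal sketch: writing a curated DataFrame from Azure Databricks to
// Azure Synapse (formerly Azure SQL DW) with the built-in connector.
// Storage account, workspace, and table names below are placeholders.
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder.getOrCreate()

val curated: DataFrame =
  spark.read.parquet("abfss://curated@mystorageacct.dfs.core.windows.net/sales/")

curated.write
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;database=dw")
  .option("tempDir", "abfss://staging@mystorageacct.dfs.core.windows.net/tmp/") // staging area the connector uses for bulk loads
  .option("forwardSparkAzureStorageCredentials", "true") // reuse the cluster's storage credentials for staging
  .option("dbTable", "dbo.FactSales")
  .mode(SaveMode.Overwrite)
  .save()
```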


Azure Synapse Analytics, Nee Azure SQL DW

John Macintire explains Azure Synapse Analytics:

A cloud native, distributed SQL processing engine is at the foundation of Azure Synapse and is what enables the service to support the most demanding enterprise data warehousing workloads. This week at Ignite we introduced a number of exciting features to make data warehousing with Azure Synapse easier and allow organizations to use SQL for a broader set of analytics use cases.

There’s a fair amount of marketing-speak in here, but the gist is Azure SQL Data Warehouse + Spark + on-demand serverless queries (so you can, among other things, write T-SQL against your HDFS data). I think it has a better chance of long-lasting success than Azure SQL Data Warehouse.


Troubleshooting Tez Performance

Dmitry Tolpeko digs through Tez logs to figure out a performance issue:

Why did it take so long to run the job? Is there any way to improve its performance?

Tez Application Master Log
I am going to use the Tez AM log to investigate vertex performance and find possible bottlenecks.
Note that there is the Timeline Server REST API that you can use to get the statistics for Tez jobs, but the application master log is “event-driven,” shows the exact order of all events, and contains much more detail in general.

Click through for the process.


Using Lenses and GitOps to Migrate Kafka to HDInsight

Andrew Stevenson takes us through using Lenses and GitOps to migrate from a self-managed Kafka cluster to HDInsight Kafka:

Let’s dig deeper with an example. I have a Self-Managed Kafka cluster and I want to migrate to HDInsight Kafka.

First, we will concentrate on topics. I may have 1000’s of topics. How do I ensure that the configuration (the metadata) is migrated efficiently?

I could do this manually, but this is error-prone, time-consuming and importantly also lacks governance and auditing. A better approach would be to automate this, which is what we can achieve with Lenses and a GitOps approach.

Click through to see how to automate this.
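
Lenses handles the export and re-apply declaratively, but to get a feel for what “topic metadata” covers here, this is a rough sketch (not the Lenses/GitOps tooling itself) of pulling topic names and their non-default configs with the plain Kafka AdminClient; the broker address is a placeholder.

```scala
// Rough sketch (not Lenses/GitOps itself): dump topic names and their
// non-default configs with the plain Kafka AdminClient. This is the kind
// of metadata a migration has to carry over. Broker is a placeholder.
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
import org.apache.kafka.common.config.ConfigResource
import scala.jdk.CollectionConverters._

object DumpTopicConfigs extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "source-kafka:9092")

  val admin = AdminClient.create(props)
  try {
    val topics = admin.listTopics().names().get().asScala.filterNot(_.startsWith("_"))
    val resources = topics.map(t => new ConfigResource(ConfigResource.Type.TOPIC, t)).toList

    admin.describeConfigs(resources.asJava).all().get().asScala.foreach {
      case (resource, config) =>
        println(s"topic: ${resource.name}")
        config.entries().asScala
          .filterNot(_.isDefault) // keep only settings that were explicitly changed
          .foreach(e => println(s"  ${e.name} = ${e.value}"))
    }
  } finally admin.close()
}
```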


Kafka Consumer Delivery Semantics

Sylvester Daniel continues a series on Apache Kafka:

enable.auto.commit
This defines how offsets are committed to Kafka — by default, enable.auto.commit is set to true. When this property is set to true, you may also want to set how frequently offsets should be committed using auto.commit.interval.ms.

By default, auto.commit.interval.ms is set to 5,000 ms (5 seconds). When enable.auto.commit is set to true, consumer delivery semantics is “at most once,” and commits are async.

Click through for more information on what this means, as well as several other important settings.
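
For contrast, here is a small sketch of the usual at-least-once pattern: turn enable.auto.commit off and commit only after the records have been processed. Broker and topic names are placeholders.

```scala
// Sketch of at-least-once consumption: auto commit is off and offsets are
// committed only after processing. Broker and topic names are placeholders.
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer
import scala.jdk.CollectionConverters._

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false") // we commit manually below

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List("orders").asJava)

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
  consumer.commitSync() // commit only after processing: at-least-once
}
```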


Joining RDDs in Spark

Brad Llewellyn takes us through more Spark RDD and DataFrame exercises, including joins:

We can make use of the built-in .join() function for RDDs.  Similar to the .aggregateByKey() function we saw in the previous post, the .join() function for RDDs requires a 2-element tuple, with the first element being the key and the second element being the value.  So, we need to use the .map() function to restructure our RDDs to store the keys in the first element and the original array/tuple in the second element.  After the join, we end up with an awkward nested structure of arrays and tuples that we need to restructure using another .map() function, leading to a lengthy code snippet.

This is a place where DataFrames make so much more sense.
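
A quick sketch of the contrast, with made-up data and column names:

```scala
// Sketch contrasting an RDD join (key/value tuples required) with the
// equivalent DataFrame join. Data and column names are made up.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("join-demo").getOrCreate()
import spark.implicits._

val orders    = Seq((1, 25.0), (2, 40.0))     // (customerId, amount)
val customers = Seq((1, "Alice"), (2, "Bob")) // (customerId, name)

// RDD version: both sides must be (key, value) pairs before .join(),
// and the nested result needs another .map() to flatten it.
val ordersRdd    = spark.sparkContext.parallelize(orders)
val customersRdd = spark.sparkContext.parallelize(customers)
val joinedRdd    = ordersRdd.join(customersRdd) // RDD[(Int, (Double, String))]
joinedRdd.map { case (id, (amount, name)) => (id, name, amount) }
  .collect()
  .foreach(println)

// DataFrame version: the join condition is explicit and the result stays flat.
val ordersDf    = orders.toDF("customerId", "amount")
val customersDf = customers.toDF("customerId", "name")
ordersDf.join(customersDf, "customerId").show()
```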


Azure AD Credential Passthrough and Databricks

Anna Shrestinian, et al, explain how Azure Databricks enables Azure Active Directory credential passthrough when working with Azure Data Lake Storage Gen2:

Azure Data Lake Storage (ADLS) Gen2, which became generally available earlier this year, is quickly becoming the standard for data storage in Azure for analytics consumption. ADLS Gen2 enables a hierarchical file system that extends Azure Blob Storage capabilities and provides enhanced manageability, security and performance.

The hierarchical file system provides granular access control to ADLS Gen2. Role-based access control (RBAC) could be used to grant role assignments to top-level resources, and POSIX-compliant access control lists (ACLs) allow for finer-grained permissions at the folder and file level. These features allow users to securely access their data within Azure Databricks using the Azure Blob File System (ABFS) driver, which is built into the Databricks Runtime.

There are some tradeoffs involved, particularly around using High Concurrency clusters (or limiting yourself to one user account), but it’s a nice bit of added value when you’re a heavy Azure user.
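
Once passthrough is enabled on the cluster, reads look like any other ABFS access; here is a minimal sketch, with the storage account, container, and path as placeholders.

```scala
// Sketch: on an Azure Databricks cluster with AAD credential passthrough
// enabled, ADLS Gen2 reads go through the ABFS driver under the signed-in
// user's identity. Storage account, container, and path are placeholders.
// `spark` is the SparkSession a Databricks notebook already provides.
val events = spark.read
  .option("header", "true")
  .csv("abfss://raw@mystorageacct.dfs.core.windows.net/events/2019/")

// This succeeds only if the notebook user's RBAC/ACLs allow reading that path.
events.groupBy("eventType").count().show()
```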


A New Notebook Tool: Polynote

Jeremy Smith, et al, announce a new product:

We are pleased to announce the open-source launch of Polynote: a new, polyglot notebook with first-class Scala support, Apache Spark integration, multi-language interoperability including Scala, Python, and SQL, as-you-type autocomplete, and more.

Polynote provides data scientists and machine learning researchers with a notebook environment that allows them the freedom to seamlessly integrate our JVM-based ML platform — which makes heavy use of Scala — with the Python ecosystem’s popular machine learning and visualization libraries. It has seen substantial adoption among Netflix’s personalization and recommendation teams, and it is now being integrated with the rest of our research platform.

There are some nice pieces to it, especially around language interop.


JDBC Resource Pools and Kerberos

Guy Shilo has a tip for us around JDBC connectivity when your Hadoop cluster is configured for Kerberos:

This is a quick tip about connecting to Hive or Impala via JDBC.

Accessing Hive or Impala using their JDBC driver is very convenient. Client programs like Beeline or JetBrains DataGrip use it as the main way of accessing Hive/Impala, and many people also use it in their own programs.

Things get a little trickier when the cluster is kerberized. In this case, you should add a few extra parameters to the JDBC connect string.

Read on to see what to do.
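
The shape of the connect string is worth remembering; here is a minimal sketch against a kerberized HiveServer2, with the hostname and realm as placeholders and a valid Kerberos ticket (kinit or keytab login) assumed beforehand.

```scala
// Sketch: JDBC against a kerberized HiveServer2. Hostname, port, and realm
// are placeholders; a valid Kerberos ticket (kinit or keytab) is assumed.
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")

// The extra ;principal=... parameter tells the driver which Hive service
// principal to request a Kerberos service ticket for.
val url = "jdbc:hive2://hive-server.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM"

val conn = DriverManager.getConnection(url)
val rs   = conn.createStatement().executeQuery("SHOW DATABASES")
while (rs.next()) println(rs.getString(1))
conn.close()
```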


Spark Transformations and Actions

Divyansh Jain differentiates the key sets of functions in Spark:

Now there is a point to be noted here, and that is when you apply a transformation on any RDD, it will not perform the operation immediately. It will create a DAG (Directed Acyclic Graph) using the applied operation, the source RDD, and the function used for the transformation. And it will keep on building this graph using the references until you apply an action operation on the last lined-up RDD. That is why transformations in Spark are lazy.

Read on for more details.
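
A tiny sketch of that laziness in action:

```scala
// Sketch: transformations only build up the lineage/DAG; nothing executes
// until an action is called.
val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()
val numbers = spark.sparkContext.parallelize(1 to 1000000)

val evens   = numbers.filter(_ % 2 == 0)   // transformation: no job runs yet
val squared = evens.map(n => n.toLong * n) // still just describing the computation

println(squared.toDebugString)             // prints the lineage that has been built

val total = squared.count()                // action: this is what actually launches a job
println(s"count of even squares: $total")
```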
