Press "Enter" to skip to content

Category: Cloud

Variable Header Counts and Azure Data Factory

Mark Kromer shows how you can convince Azure Data Factory to skip a variable number of lines before beginning processing:

A common requirement that I see from customers who are processing text files in data lakes with Azure Data Factory, is to read and process files where there are variable numbers of lines that precede both the data headers and the data that needs to be processed. ADF already has facilities that handle the ability to switch headers off or on as well as the ability to specify parameterized skip line counts. However, in many cases, files that are received for processing have variable numbers of superfluous lines that need to be skipped.

In ADF, between pipeline activities and data flows, there are a number of ways to handle this scenario. In this post, I am going to demonstrate one such technique. 

Read on to see which technique Mark chose and how to get it working.

Comments closed

HBase and S3

Krishna Maheshwari, et al, explain how we can allow Apache HBase to use S3 for storage:

Cloudera Data Platform (CDP) provides an out-of-the-box solution that allows Apache HBase deployments to use Amazon Simple Storage Service (S3) as its main persistence layer for saving table data. Amazon S3 is an object store which offers a high degree of durability with a pay-per-use cost structure. There is no server-side component to run or manage for S3 — all that is needed is the S3 client library and AWS credentials. However, HBase requires a consistent and atomic filesystem which means that it cannot directly use S3 because it is an eventually consistent object store. Both CDH and HDP have only provided HBase solely using HDFS because there have been long-standing impediments that prevented HBase from natively using S3. To address these issues, we’ve built an out-of-the-box solution which we are delivering for the first time via CDP. When you launch an Operational Database (HBase) cluster on CDP, HBase StoreFiles (the backing files for HBase tables) are stored in S3 and HBase write-ahead-logs (WAL) are stored in an HDFS instance run alongside HBase per usual.

I hadn’t thought of using S3, but it’s an interesting post.

Comments closed

Azure Data Factory Pipeline Hierarchies

Paul Andrew explains the idea of pipeline hierarchies with respect to Azure Data Factory:

Next, even if the concept isn’t new, where I’d like to call out two big differences in my approach to orchestration with ADF comes from working within Microsoft Azure. The highly scalable cloud platform presents some new challenges that SSIS simply didn’t. For me these are:

– Needing to consider our wider solution and what things now cost. I’m fairly sure I’ve said it before. When working with ‘Pay-as-you-go’ services we need to think about designing for cost/consumption as well as all our other data transformation and output requirements. In Azure it is so easy to just leave resources running night and day, when only a short window of compute is needed.
– We need to consider the scale out capabilities of the other services that ADF is going to invoke. Or, to put it another way, how much parallel activity execution do we want ADF to achieve? As you may know the ADF ForEach activity by default allows us to execution inner activities in parallel, but is that enough?

It’s a very interesting idea; read the whole thing.

Comments closed

Azure Kubernetes Service Max Volume Count

Chris Taylor explains an error message in Azure Kubernetes Service:

Whilst playing around with my session for Techorama.nl I encountered an error I hadn’t seen previously whilst deploying SQL Server on Linux in Azure Kubernetes Service (AKS)

0/1 nodes are available: 1 node(s) exceed max volume count

The yaml I used was only slightly modified (mainly names) from scripts used on minikube and docker-desktop so I was a little confused as to why I was getting this in AKS.

Read on to understand what’s happening here and how you can fix it.

Comments closed

Registering SignalR to the Cosmos DB Change Feed

Hasan Savran shows us how we can hook up SignalR to view the Cosmos DB Change Feed:

SignalR allows server code to send asynchronous notifications to client-side web applications. By using it, Azure Functions can send real-time messages to your web applications. Prices can get change whenever data changes in database. Notices can be sent if user needs to be notified. Numbers in dashboard can change dynamically when data changes in Cosmos DB. You can do all those with Azure Cosmos DB + Azure Functions and SignalR. This combination works like David Copperfield magic.

There’s a bit of work involved but Hasan shows us how to get it done.

Comments closed

Diagnosing TCP SACKs-Related Slowdown in Databricks

Chris Stevens, et al, walk us through troubleshooting a slowdown after using Linux images which have been patched for the TCP SACKs vulnerabilities:

In order to figure out why the straggler task took 15 minutes, we needed to catch it in the act. We reran the benchmark while monitoring the Spark UI, knowing that all but one of the tasks for the save operation would complete within a few minutes. Sorting the tasks in that stage by the Status column, it did not take long for there to be only one task in the RUNNING state. We had found our skewed task and the IP address in the Host column pointed us at the executor experiencing the regression.

This is a nice case study of network troubleshooting, so of course there are Wireshark screenshots in it.

Comments closed

Accessing Data in Azure Data Lake Storage Gen 2

James Serra gives us several methods to access data in Azure Data Lake Storage Gen 2:

With data lakes becoming popular, and Azure Data Lake Store (ADLS) Gen2 being used for many of them, a common question I am asked about is “How can I access data in ADLS Gen2 instead of a copy of the data in another product (i.e. Azure SQL Data Warehouse)?”. The benefits of accessing ADLS Gen2 directly is less ETL, less cost, to see if the data in the data lake has value before making it part of ETL, for a one-time report, for a data scientist who wants to use the data to train a model, or for using a compute solution that points to ADLS Gen2 to clean your data. While these are all valid reasons, you still want to have a relational database (see Is the traditional data warehouse dead?). The trade-off in accessing data directly in ADLS Gen2 is slower performance, limited concurrency, limited data security (no row-level, column-level, dynamic data masking, etc) and the difficulty in accessing it compared to accessing a relational database.

Since ADLS Gen2 is just storage, you need other technologies to copy data to it or to read data in it.

Read on for the solution.

Comments closed

Disambiguating Azure SQL Database Classes

Arun Sirpal explains the different types of Azure SQL Database available to us:

I want to do a quick summary post of the many different types of Azure SQL Database available and I am not talking about elastic pools, VMs etc, more so the singleton type.

Azure SQL Database (I call normal mode) – A choice between the DTU model (Basic, Standard and Premium) and vCore (General Purpose and Business Critical). Within this space there are two different architecture types used by Microsoft under the covers.

As the product expands, we get more and more options, and Arun clarifies where each fits.

Comments closed

Upgrading Azure Kubernetes Service

Chris Taylor has a point updates to jump in Azure Kubernetes Service:

As it is late at night my brain wasn’t working as it should be but thought I’d put a quick blog out there to say that if you are on v1.11.5 and want to upgrade to >= v1.13.10 then you have to do this in a 2 stage process by upgrading to v1.12.8 first:

Fortunately, upgrading is pretty easy using the Azure command line or even the Azure portal.

Comments closed

Azure Data Factory and Schema Drift

Mark Kromer walks us through two techniques we can use in Azure Data Factory to deal with schema drift:

Azure Data Factory’s Mapping Data Flows have built-in capabilities to handle complex ETL scenarios that include the ability to handle flexible schemas and changing source data. We call this capability “schema drift“.

When you build transformations that need to handle changing source schemas, your logic becomes tricky. In ADF, you can either build data flows that always look for patterns in the source and utilize generic transformation functions, or you can add a Derived Column that defines your flow’s canonical model.

Click through for the discussion and comparison. Schema drift has been the bane of Integration Services’s existence, so it’s good to see them tackling the idea in Azure Data Factory.

Comments closed