Press "Enter" to skip to content

Category: Cloud

From Kafka Into Azure Data Explorer

Anagha Khanolkar walks us through a data movement scenario:

Here is an end-to-end, hands-on lab showcasing the connector in action. You can see an overview of the lab below. In our lab example, we’re going to stream the Chicago crimes public dataset to Kafka on Confluent Cloud on Azure using Spark on Azure Databricks. Then, we will use the Kusto connector to stream the data from Kafka to Azure Data Explorer.
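For a rough sense of the first leg of that pipeline, here is a minimal PySpark sketch of writing a dataset from Azure Databricks to a Kafka topic on Confluent Cloud over SASL_SSL. This is not the lab's code: the file path, topic name, bootstrap server, and API credentials are all placeholders, and the lab itself streams the data rather than doing a one-off batch write.

```python
# Hedged sketch: write a dataset from Databricks to a Kafka topic on Confluent Cloud.
# `spark` is the SparkSession Databricks provides; path, topic, broker, and
# credentials below are placeholders, not values from the lab.
from pyspark.sql import functions as F

crimes_df = spark.read.csv("/mnt/source/chicago_crimes.csv", header=True)  # hypothetical mount

(crimes_df
    .select(
        F.col("ID").cast("string").alias("key"),          # Kafka message key
        F.to_json(F.struct("*")).alias("value"))          # whole row as JSON payload
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "<confluent-bootstrap-server>:9092")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config",
            'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
            'required username="<api-key>" password="<api-secret>";')
    .option("topic", "crimes")
    .save())
```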

There’s also a lab to try this out, though the estimated spend is a bit high.


Azure Synapse Analytics Query Options

James Serra has a breakdown of what can query what in Azure Synapse Analytics:

The public preview version of Azure Synapse Analytics has three compute options and four types of storage that it can access (mentioned in my blog at SQL on-demand in Azure Synapse Analytics). This gives twelve possible combinations for querying data. Not all of these combinations are currently supported, and some have a few quirks, which I list below.

Read on for a table which breaks down current functionality as well as expected GA functionality.


The Cost of Verifying Backups to Azure

Matt Robertshaw reminds us that TANSTAAFL:

Within two weeks of backups being written to Blob Storage, we observed a significant upward trend in cost associated with a Storage Account.  When compared to the previous month, there was an increase of c. £270.  After some further analysis, we were able to associate this increase with “bandwidth” charges.  This didn’t feel right – you don’t pay anything to upload data to Azure (ingress), you only pay when downloading data from Azure (egress).

Using Azure Monitor, we profiled the ingress and egress rates for the affected Storage Account and noticed the following pattern:

Each day, c. 150GB of backups were being written to blob storage (in blue), but shortly after, the same amount was being downloaded (in red).  Over this period, we calculated 4TB of data had been uploaded and then downloaded again.  Based on Microsoft’s latest Bandwidth pricing, whilst the first 5GB of egress per month is free, the next 5GB – 10TB is charged at £0.065 per GB.  Some simple maths confirms it to be the additional £270 we observed.
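A quick back-of-the-envelope check of that figure, using the quoted rate and the 5 GB monthly free allowance:

```python
# Sanity-check the egress charge described above (rate and volume as quoted).
downloaded_gb = 4 * 1024      # ~4 TB re-downloaded for backup verification
free_gb = 5                   # first 5 GB of egress per month is free
rate_per_gb = 0.065           # GBP per GB for the 5 GB - 10 TB band

cost = (downloaded_gb - free_gb) * rate_per_gb
print(f"Estimated egress cost: £{cost:.2f}")   # ~£266, in line with the ~£270 observed
```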

Read on for three possible solutions. My preference for an on-prem solution would be to verify locally and then push to Blob Storage / S3. Backups tend to be faster that way as well, as your disk is likely faster than your network.


Using an Azure VM’s D Drive for tempdb

William Assaf shows how you can use the temporary D drive on an Azure VM to host tempdb in SQL Server:

Moving your SQL Server instance’s TempDB files to the D: volume is recommended for performance, as long as the TempDB files fit on the D: volume that has been allocated, based on your VM size.
When the D: is lost due to deallocation, as expected, the subfolder you created for the TempDB files (if applicable) and the NTFS permissions granting SQL Server permission to the folder are no longer present. SQL Server will be unable to create the TempDB files in the subfolder and will not start. Even if you put the TempDB data and log files in the root of D:, after deallocation, that’s still not a solution, as the NTFS permissions to the root of D: won’t exist. In either case, SQL Server will be unable to create the TempDB files and will not start.
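One general mitigation (a hedged sketch of the idea only, not William's script) is a startup task that recreates the folder, re-grants the service account's permissions, and then starts SQL Server. The folder path and service/account names below are assumptions:

```python
# Hedged sketch of a VM startup task: recreate the tempdb folder on D:,
# re-grant the SQL Server service account rights, then start the service.
# Paths and names are placeholders; run as an account with local admin rights.
import os
import subprocess

TEMPDB_DIR = r"D:\SQLTemp"                    # hypothetical tempdb folder
SERVICE_ACCOUNT = r"NT SERVICE\MSSQLSERVER"   # default instance service account

os.makedirs(TEMPDB_DIR, exist_ok=True)

# Grant full control with object/container inheritance on the new folder.
subprocess.run(
    ["icacls", TEMPDB_DIR, "/grant", f"{SERVICE_ACCOUNT}:(OI)(CI)F"],
    check=True,
)

# Start SQL Server once the folder and permissions exist again.
subprocess.run(["net", "start", "MSSQLSERVER"], check=True)
```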

Read on for a few options and William’s thoughts on the relative merits of each.


Power BI (Lack of) Performance with SharePoint

Matt Allington is not impressed:

On the face of it, it seems like a great idea to leverage SharePoint as a storage location for CSV and Excel files.

– Everyone has easy access to the files for editing and storage
– SharePoint manages version control, check in, check out etc
– SharePoint can facilitate shared editing of files
– You can build a Power BI report that will refresh online without the need to install a gateway.

Unfortunately, despite the benefits, the experience is not great.  Power BI performance with SharePoint as a data source is simply terrible.  Ultimately, the problems come down to performance in two areas.

Read on to learn more about these two issues and what you can do instead.


Passing an Array of Arrays as a Parameter in Azure Data Factory

Rayis Imayev has a list for us:

In my previous blog post – Setting default values for Array parameters/variables in Azure Data Factory, I had helped myself to remember that arrays could be passed as parameters to my Azure Data Factory (ADF) pipelines. This time I’m helping myself to remember that an array of other arrays can also exist as ADF pipeline parameters’ values.
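As a hedged illustration of what an array-of-arrays value can look like when you trigger a pipeline run programmatically, here is a sketch using the Azure Python SDK (newer azure-mgmt-datafactory versions). The subscription, resource group, factory, pipeline, and parameter names are all placeholders; Rayis's post covers the ADF-side expressions themselves.

```python
# Hedged sketch: start an ADF pipeline run and pass an array of arrays to an
# Array-typed pipeline parameter. All names/IDs below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

run = adf_client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory>",
    pipeline_name="<pipeline>",
    parameters={"InputArrays": [["a", "b"], ["c", "d"], ["e"]]},  # array of arrays
)
print(run.run_id)
```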

Read on for the example.


Indexing S3 Data with CDP Data Hub

Eva Nahari, et al, show how to perform indexing and serving of S3 data in Cloudera Data Platform:

This blog post will present a simple “hello world” kind of example on how to get data that is stored in S3 indexed and served by an Apache Solr service hosted in a Data Discovery and Exploration cluster in CDP. For the curious: DDE is a pre-templated, Solr-optimized cluster deployment option in CDP, recently released in tech preview. We will only cover AWS and S3 environments in this blog. Azure and ADLS deployment options are also available in tech preview, but will be covered in a future blog post.

We will depict the simplest scenario to make it easy to get started. There are, of course, more advanced data pipeline setups and richer schemas possible, but this is a good starting point for a beginner.

Read on for the instructions.


Spark SQL in Delta Lake

Kundan Kumarr walks us through some of the basic SQL operations you can perform with Delta Lake in Apache Spark:

Nowadays, Delta Lake is a buzzword in the Big Data world, especially among Spark developers, because it resolves a lot of issues found in the Big Data domain. Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It is evolving day by day and adds cool features in every release. On 19th June 2020, Delta Lake version 0.7.0 was released, and this is the first release on Spark 3.x. This release includes important features that can make a Spark developer’s work easier.

One of the interesting features in this release is support for metastore-defined tables and SQL DDL. So now we can define Delta tables in the Hive metastore and use the table name in all SQL operations. We can run SQL DDL to create tables, insert into tables, explicitly alter the schema of the tables, and so on. So in this blog, we will learn how we can perform SQL DDL/DML/DQL operations in Delta Lake 0.7.0.
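To give a flavor of what that looks like, here is a minimal PySpark sketch of metastore-backed Delta tables driven entirely through SQL. The table and column names are made up; Kundan's post has the full set of examples.

```python
# Hedged sketch: SQL DDL/DML against a metastore-defined Delta table
# (Delta Lake 0.7.0 on Spark 3.x). Table and column names are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("delta-sql-demo")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        id BIGINT,
        event_type STRING,
        event_date DATE
    ) USING DELTA
""")

spark.sql("INSERT INTO events VALUES (1, 'click', current_date())")
spark.sql("ALTER TABLE events ADD COLUMNS (source STRING)")
spark.sql("SELECT event_type, COUNT(*) FROM events GROUP BY event_type").show()
```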

Click through for the examples.


Building an Azure Function to Automate CHECKDB

Arun Sirpal shows us how to build an Azure Function:

The title is a mouthful and so is this post. In the past I have linked to blog posts from Microsoft that say consistency checks for Azure SQL Database are the responsibility of Microsoft. (https://azure.microsoft.com/en-gb/blog/data-integrity-in-azure-sql-database/)

However, Paul Randal’s thoughts on it (via his Insider email) got me thinking, and they form the core of this post. If you want to run DBCC CHECKDB against Azure SQL Database (which I know people do), how can you do this? There are many ways, but for this blog post, enter Azure Functions. There are many moving parts to this, but once set up and coded, it is a very satisfying experience. Let’s dig in. I am NOT going to copy and paste every little element of the high-level guide from Microsoft; there is no point in that. Instead, I will show you the links you need to set up the relevant function app project, and the tailored bits around CHECKDB form the bulk of this post.
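For a rough idea of the shape of such a function (a hedged sketch, not Arun's implementation), a timer-triggered Azure Function in Python could run the check via pyodbc. The connection string setting name is an assumption, the usual function.json timer binding is omitted, and a long-running CHECKDB may need a longer function timeout than the consumption plan default.

```python
# Hedged sketch: timer-triggered Azure Function (Python) that runs DBCC CHECKDB
# against an Azure SQL Database. The app setting name and connection details are
# placeholders; the ODBC driver must be available in the Functions environment.
import logging
import os

import azure.functions as func
import pyodbc


def main(mytimer: func.TimerRequest) -> None:
    conn_str = os.environ["SQL_CONNECTION_STRING"]  # hypothetical app setting / Key Vault reference
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        cursor = conn.cursor()
        # Corruption raises an error from the driver; NO_INFOMSGS suppresses the noise.
        cursor.execute("DBCC CHECKDB WITH NO_INFOMSGS")
        logging.info("DBCC CHECKDB completed without reported errors.")
```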

This isn’t necessary to do, but if you want to learn how Azure Functions work, it’s a good example of working through the mechanics.
