Category: Cloud

The Cost of Verifying Backups to Azure

Published 2020-09-16 by Kevin Feasel

Matt Robertshaw reminds us that TANSTAAFL:

Within two weeks of backups being written to Blog Storage, we observed a significant upward trend in cost associated with a Storage Account. When compared to the previous month, there was an increase of c. £270. After some further analysis, we were able to associate this increase with “bandwidth” charges. This didn’t feel right – you don’t pay anything to upload data to Azure (ingress), you only pay when downloading data from Azure (egress).
Using Azure Monitor, we profiled the ingress and egress rates for the affected Storage Account and noticed the following pattern:
Each day, c. 150GB of backups were being written to blob storage (in blue), but shortly after, the same amount was being downloaded (in red). Over this period, we calculated 4TB of data had been uploaded and then downloaded again. Based on Microsoft’s latest Bandwidth pricing, whilst the first 5GB of egress per month is free, the next 5GB – 10TB is charged at £0.065 per month. Some simple maths confirms it to be the additional £270 we observed.

Read on for three possible solutions. My preference for an on-prem solution would be to verify locally and then push to Blob Storage / S3. Backups tend to be faster that way as well, as your disk is likely faster than your network.

Comments closed

Using an Azure VM’s D Drive for tempdb

Published 2020-09-15 by Kevin Feasel

William Assaf shows how you can use the temporary D drive on an Azure VM to host tempdb in SQL Server:

Moving your SQL Server instance’s TempDB files to the D: volume is recommended for performance, as long as the TempDB files fit it the D: that has been allocated, based on your VM size.
When the D: is lost due to deallocation, as expected, the subfolder you created for the TempDB files (if applicable) and the NTFS permissions granting SQL Server permission to the folder are no longer present. SQL Server will be unable to create the TempDB files in the subfolder and will not start. Even if you put the TempDB data and log files in the root of D:, after deallocation, that’s still not a solution, as the NTFS permissions to the root of D: won’t exist. In either case, SQL Server will be unable to create the TempDB files and will not start.

Read on for a few options and William’s thoughts on the relative merits of each.

Comments closed

Power BI (Lack of) Performance with Sharepoint

Published 2020-09-15 by Kevin Feasel

Matt Allington is not impressed:

On the face of it, it seems like a great idea to leverage SharePoint as a storage location for CSV and Excel files.
– Everyone has easy access to the files for editing and storage
– SharePoint manages version control, check in, check out etc
– SharePoint can facilitate shared editing of files
– You can build a Power BI report that will refresh online without the need to install a gateway.
Unfortunately, despite the benefits, the experience is not great. Power BI performance with SharePoint as a data source is simply terrible. Ultimately, the problems come down to performance in 2 areas.

Read on to learn more about these two issues and what you can do instead.

Comments closed

Passing an Array of Arrays as a Parameter in Azure Data Factory

Published 2020-09-14 by Kevin Feasel

Rayis Imayev has a list for us:

In my previous blog post – Setting default values for Array parameters/variables in Azure Data Factory, I had helped myself to remember that arrays could be passed as parameters to my Azure Data Factory (ADF) pipelines. This time I’m helping myself to remember that an array of other arrays can also exist as ADF pipeline parameters’ values.

Read on for the example.

Comments closed

Indexing S3 Data with CDP Data Hub

Published 2020-09-11 by Kevin Feasel

Eva Nahari, et al, show how to perform indexing and serving of S3 data in Cloudera Data Platform:

This blog post will present a simple “hello world” kind of example on how to get data that is stored in S3 indexed and served by an Apache Solr service hosted in a Data Discovery and Exploration cluster in CDP. For the curious: DDE is a pre-templeted Solr-optimized cluster deployment option in CDP, and recently released in tech preview. We will only cover AWS and S3 environments in this blog. Azure and ADLS deployment options are also available in tech preview, but will be covered in a future blog post.
We will depict the simplest scenario to make it easy to get started. There are of course more advanced data pipeline setups and more rich schemas possible, but this is a good starting point for a beginner.

Read on for the instructions.

Comments closed

Spark SQL in Delta Lake

Published 2020-09-11 by Kevin Feasel

Kundan Kumarr walks us through some of the basic SQL operations you can perform with Delta Lake in Apache Spark:

Nowadays Delta lake is a buzz word in the Big Data world, especially among the spark developers because it relegates lots of issues found in the Big Data domain. Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It is evolving day by day and adds cool features in its every release. On 19th June 2020, Delta lake version 0.7.0 was released and this is the first release on Spark 3.x. This release involves important key features that can make the spark developer’s work easy.
One of the interesting key features in this release is the support for metastore-defined tables and SQL DDLs. So now we can define Delta tables in the Hive metastore and use the table name in all SQL operations. We can perform SQL DDLs to create tables, insert into tables, explicitly alter the schema of the tables, and so on. So in this blog, we will learn how we can perform SQL DDLs/DMLS/DQL in Delta Lake 0.7.0.

Click through for the examples.

Comments closed

Building an Azure Function to Automate CHECKDB

Published 2020-09-11 by Kevin Feasel

Arun Sirpal shows us how to build an Azure Function:

The title is a mouthful and so is this post. In the past I have linked to blog posts from Microsoft that say consistency checks for Azure SQL Database is the responsibility of Microsoft. (https://azure.microsoft.com/en-gb/blog/data-integrity-in-azure-sql-database/)
However, Paul Randal got me thinking about his thoughts on it (via his insider email). Forming the core of this post. If you desire to run DBCC CHECKDB against Azure SQL Database (which I know people do) – how can you do this? There are many ways, but for this blog post – Enter Azure functions. There are many moving parts to this, but once setup and coded it is a very satisfying experience. Let’s dig in. I am NOT going to copy and paste every little element of the high-level guide from Microsoft, there is no point in that but I will show you the links that you need to setup the relevant function app project then the tailored bits around CHECKDB forms the bulk of this post.

This isn’t necessary to do, but if you want to learn how Azure Functions work, it’s a good example of working through the mechanics.

Comments closed

Reading S3 Data into Kafka

Published 2020-09-10 by Kevin Feasel

Ramu Kakarla takes us through using Apache Camel to load Amazon S3 data into a Kafka topic:

Apache Camel is an open-source framework for message-oriented middleware with a rule-based routing and mediation engine that provides a Java object-based implementation of the Enterprise Integration Patterns using an application programming interface to configure routing and mediation rules

Read on to see how this fits together.

Comments closed

ACID Transactions with Hive LLAP in ElasticMapReduce

Published 2020-09-09 by Kevin Feasel

Suthan Phillips and Chao Gao walk us through ACID transactions when using Hive on Amazon’s ElasticMapReduce platform:

ACID (atomicity, consistency, isolation, and durability) properties make sure that the transactions in a database are atomic, consistent, isolated, and reliable.
Amazon EMR 6.1.0 adds support for Hive ACID transactions so it complies with the ACID properties of a database. With this feature, you can run INSERT, UPDATE, DELETE, and MERGE operations in Hive managed tables with data in Amazon Simple Storage Service (Amazon S3). This is a key feature for use cases like streaming ingestion, data restatement, bulk updates using MERGE, and slowly changing dimensions.
This post demonstrates how to enable Hive ACID transactions in Amazon EMR, how to create a Hive transactional table, how it can achieve atomic and isolated operations, and the concepts, best practices, and limitations of using Hive ACID in Amazon EMR.

Click through for a demonstration.

Comments closed

Automating Azure Resource Deletion

Published 2020-09-09 by Kevin Feasel

Drew Skwiers-Koballa has put together a runbook to remove Azure resources tagged in a certain way:

Microsoft puts a lot of effort into making it easy for you to log on to portal.azure.com and spin up a VM, SQL Database, Function, or other instance. The Visual Studio Enterprise subscription with $150 credit per month can go quickly if you’re not careful to stop or remove big ticket items as soon as you’re done with them. Even if you have a virtually unlimited Azure account connected to a credit card, expense account, or a trust fund – you probably don’t want to accidentally leave something running longer than it is needed.
With an Azure Automation account and a Powershell workflow runbook I’m able to use resource tags to set resources for autodeletion by date or immediately that evening.

Click through for that runbook.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31