Category: Cloud

2) Enable the Delta cache – spark.databricks.io.cache.enabledtrue
There is a very good resource available on configuring this Spark config setting: https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/delta-cache
And this will be very helpful in your Databricks notebook’s queries when you try to access a similar dataset multiple times. Once you read this dataset for the first time, Spark places it into internal local storage cache and will speed up the process of further referencing it for you.

Click through for several more along these lines.

Comments closed

Fun with Metaphors: Data Lakehouses

Published 2020-02-05 by Kevin Feasel

Ben Lorica, et al, have a new metaphor to try out:

Over the past few years at Databricks, we’ve seen a new data management paradigm that emerged independently across many customers and use cases: the lakehouse. In this post we describe this new paradigm and its advantages over previous approaches.

The Data Lake’s Aristotelian counterpart is the Data Swamp. I’m working on a similar comp for the Data Lakehouse (Data Swampboat? Data Swamphouse is too easy), but in the meantime, that one person who goes and slaughters your application’s performance by butchering the data in your Data Lakehouse? That’s a Data Jason.

1 Comment

Azure SQL Hyperscale Auto-Scaling

Published 2020-02-05 by Kevin Feasel

Davide Mauri explains how automatically to scale Azure SQL Hyperscale:

Azure SQL Hyperscale is the latest architectural evolution of Azure SQL, that has been natively designed to take advantage of the cloud. One of the main key features of this new architecture is the complete separation of Compute Nodes and Storage Nodes. This allow for independent scale of each service, making Hyperscale more flexible and elastic.
In this article I will describe how it is possible to implement a solution to automatically scale your Azure SQL Hyperscale database up or down, to dynamically and automatically adapt to different workload levels without the requiring manual .

Davide has some test measures of how much downtime you see and give you a couple thoughts on how you can track when it’s time to scale up or down.

Comments closed

Resource Limitations with Azure Data Factory

Published 2020-02-03 by Kevin Feasel

Paul Andrew has a public service announcement for us:

As far as I can tell Microsoft do an excellent job at managing data centre capacity so I completely understand the reason for having limitations on resources in place. There is no such thing as a limitless cloud platform.
Note; in a lot of cases (as you’ll see in the below table for Data Factory) the MAX limitations are only soft restrictions that can easily be lifted via a support ticket. Please check before raising alerts and project risks.

Click through for the limits, and “contact support” definitely is good advice if you’re expecting to push past those limits.

Comments closed

Parameterizing a Data Factory Linked Service to a REST API

Published 2020-01-31 by Kevin Feasel

Meagan Longoria had to parameterize a linked service connecting to a REST API recently:

In order to pass dynamic values to a linked service, we need to parameterize the linked service, the dataset, and the activity.
I have a pipeline where I log the pipeline start to a database with a stored procedure, lookup a username in Key Vault, copy data from a REST API to data lake storage, and log the end of the pipeline with a stored procedure. My username and password are stored in separate secrets in Key Vault, so I had to do a lookup with a web activity to get the username. The password is retrieved using Key Vault inside the linked service. Data Factory doesn’t currently support retrieving the username from Key Vault so I had to roll my own Key Vault lookup there.

Click through for the instructions.

Comments closed

Moving Data Around in Azure Synapse Analytics

Published 2020-01-31 by Kevin Feasel

Niko Neugebauer looks at some techniques for copying data into a table in an Azure Synapse Analytics SQL Pool:

First of all, let us list some of them (and I am not even attempting on providing all of them, of course):
– INSERT INTO … SELECT FROM … (the most well known one)
– SELECT INTO … FROM … (the most well-known to perform well, since it will create a HEAP while copying most of the properties from the original table(s))
– CREATE TABLE … AS SELECT … (the old way, which must be like 10 years old on PDW/APS & Azure SQL DW, but that has never gotten into a Box Product or Azure SQL Database)
– Polybase (that will use the External Tables & externally allocated data to transfer into Azure SQL DW)
– BCP (good old tested friend that will give you a pain in the neck until you dominate it)
– OPENROWSET / BULK INSERT (some very good and very old friends with complicated histories (who remembers all the code pages?, settings and uncertain future mostly because of their original restrictions, I guess)
– COPY INTO … (the brand new command that will allow you under very neat privileges to copy data from the external storage accounts, much like BULK INSERT)
In this blog post I will simply focus on those features that have not been ported (hopefully just yet): CTAS & COPY INTO.

Read on to see how these two work. Also, I too have wanted CTAS in on-premises SQL Server for years.

Comments closed

ACID Transactions with Cosmos DB

Published 2020-01-28 by Kevin Feasel

Hasan Savran shows how you can use the Cosmos DB SDK to create ACID transactions:

What about Azure Cosmos DB? It’s a NoSQL database, probably you can’t do ACID Transactions right? WRONG! Azure Cosmos DB has been supporting ACID transaction for some time now. We were able to create ACID transactions by using stored procedures of Cosmos DB. Last year (2019) Cosmos DB team introduced ACID transactions to Cosmos DB SDK. Now, we can create transactions by using C# just like writing transactions by using SQLClient class for SQL Server!
To create an ACID transaction in Cosmos DB SDK, we need to use TransactionalBatch object. You need add all operations in transaction to TransactionalBatch object. All the operations attached to the TransactionalBatch object must share the same partition key. In the following example, I created three objects and attach them to TransactionalBatch object. To start the transaction, I ran the ExecuteAsync() function. This function runs the transaction and returns the responses for each operation.

I’d think you would need to set a strong consistency level as well.

Comments closed

Managed Identity with Azure Functions

Published 2020-01-28 by Kevin Feasel

Taiob Ali shows how you can safely store credentials which your Azure Function apps need:

With the announcement of Powershell support in Azure Functions, it has become easier for data professionals to use functions to manage cloud resources such as Azure SQL Database, Managed Instances. A common challenge when using functions is how to manage the credentials in function code for authenticating databases. Keeping the credentials secure is an important task. Ideally, the credentials should never appear in the code or in the source control.
Manged Identity can solve this problem as Azure SQL Database and Managed Instance both support Azure AD authentication. You can read mode about Managed Identity here.
In this article, I will show how to set up Azure Function App to use Managed Identity to authenticate functions against Azure SQL Database.

The example connects Azure SQL DB, but this is a general-purpose solution.

Comments closed

Improving Join Performance on ADF Data Flows

Published 2020-01-27 by Kevin Feasel

Mark Kromer has a few tips on improving ADF data flow join performance:

When you include literal values in your join conditions, Spark may see that as a requirement to perform a full cartesian product first, then filter out the joined values. But if you ensure that you (1) have column values from both sides of your join condition, you can avoid this Spark-induced cartesian product and improve the performance of your joins and data flows. (2) Avoid use of literal conditions to represent the results of one side of your join condition.
In other words, avoid this for your join condition:source1@movieId == '1'Instead, implement that with a dummy derived column.

There are several good tips in this post.

Comments closed

Migrating Oracle Exadata Workloads to Azure

Published 2020-01-23 by Kevin Feasel

Kellyn Pot’vin-Gorman shows the process of moving from an Exadata system to Oracle on Azure:

An Exadata is an engineered system- database nodes, secondary cell nodes, (also referred to as storage nodes/cell disks), InfiniBand for fast network connectivity between the nodes, specialized cache, along with software features such as Real Application Clusters, (RAC), hybrid columnar compression, (HCC), storage indexes, (indexes in memory) offloading technology that has logic built into it to move object scans and other intensive workloads to cell nodes from the primary database nodes. There are considerable other features, but understanding that Exadata is an ENGINEERED system, not a hardware solution is important and its both a blessing and a curse for those databases supported by one. The database engineer must understand both Exadata architecture and software along with database administration. There is an added tier of performance knowledge, monitoring and patching that is involved, including knowledge of the Cell CLI, the command line interface for the cell nodes. I could go on for hours on more details, but let’s get down to what is required when I am working on a project to migrate an Exadata to Azure.

Click through for the process.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31