Category: Data Loading

Loading Data into Azure Synapse Analytics Data Explorer Pools

Published 2022-02-11 by Kevin Feasel

In today’s post, we will look at the newest Azure Synapse Analytics pool type available: the Data Explorer pool. As of the time of this post, these pools are in a public preview, so things may change between now and when they go GA.

This ended up taking a lot more time than I expected, mostly because I was fighting some dumb permissions errors.

Comments closed

Quick Insertion into SQL Server from a Spreadsheet

Published 2021-04-28 by Kevin Feasel

Kevin Wilkie gives a quick way to load data from Excel (or any other spreadsheet):

One of the items I do before creating the table in the database is to review all of the data that is in the spreadsheet to make sure that:
1. I understand the data that is going into the database table.
2. Nothing that is just obviously wrong is trying to be pushed into the database. For example, the data I was talking about earlier that was one column over from what it should have been. If you see data that is all 0’s and 1’s up until a certain row, then you have descriptions or names – you probably have some bad data.
The other important part of pushing the data into the database from a spreadsheet is working with the CONCATENATE function of Excel. Let’s go into that now.

Click through for the process, as well as additional explanation.

Comments closed

Incremental Refresh of Any Power BI Data Source

Published 2020-06-12 by Kevin Feasel

Gilbert Quevauvilliers wants incrementally to refresh all the sources:

The pattern that I am talking about is the following below, which will be used as my example below.
– Connect to a data source which can query fold.
> In my example I have installed and configured an Azure SQL Serverless DB
> In this database I have a date table.
– Configure date table to use Incremental refreshing as per the blog post
> Incremental refresh in Power BI
– Create a function which will then use the Date value as part of the parameter
> In my example I have got CSV files which have Exchange rate information from Azure Blob Storage.
> The file names of the CSV files is the date.
– Invoke the Function within the date table to extract the required information.
I know what you might be thinking, that as soon as I add in the column with the function it breaks the Query Folding. That is what I thought too.
The great news is that Incremental refreshing DOES STILL WORK!

Read on for the demonstration.

Comments closed

Bulk Insertion With An Identity Column

Published 2018-12-17 by Kevin Feasel

Kenneth Fisher gives us a quick post on bulk insertion against tables with identity columns:

TL;DR; BULK INSERT doesn’t have an easy way to specify a column list so the columns in the insert file must match the columns in the table unless you use a format file or a staging table.
As simple as they appear on the surface identity columns are fairly complicated things. BULK INSERT, on the other hand, is pretty complicated even on the surface (at least as far as I’m concerned). Because of this, the combination can get even worse. When you do an insert into a table that has an identity column you usually just ignore that column and it’s automatically filled in. However, that’s not true with BULK INSERT. Unless you use a format file, the number of columns (and their order) in the file has to match the table.

Read the whole thing.

Comments closed

Using The Spark Connector To Speed Up Data Loads

Published 2018-05-14 by Kevin Feasel

Denzil Riberio explains how you can use the Spark connector for Azure SQL DB and SQL Server to speed up inserting data from Spark into SQL Server 15x over the native JDBC client:

Since the load was taking longer than expected, we examined the sys.dm_exec_requests DMV while load was running, and saw that there was a fair amount of latch contention on various pages, which wouldn’t not be expected if data was being loaded via a bulk API.

Examining the statements being executed, we saw that the JDBC driver uses sp_prepare followed by sp_execute for each inserted row; therefore, the operation is not a bulk insert. One can further example the Spark JDBC connector source code, it builds a batch consisting of singleton insert statements, and then executes the batch via the prep/exec model.

It’s the power of bulk insertion.

Comments closed

Automatic Retry With Optimistic Concurrency

Published 2017-09-19 by Kevin Feasel

Vladimir Khorikov explains an anti-pattern when dealing with a model using optimistic concurrency (for example, memory-optimized tables):

Alright, back to the original question. So, how to combine optimistic locking and automatic retry? In other words, when the application gets an error from the database saying that the versions of a Product don’t match, how to retry the same operation again?

The short answer is: nohow. You don’t want to do that because it defeats the very purpose of having an optimistic lock in the first place.

Remember that the locking mechanism is a way to ensure that all changes are taken into consideration when changing a record in the database. In other words, someone should review the new version of the record and make an informed decision as to whether they still want to submit the update. And that should be the same client who originated the initial request, you can’t make that decision for them.

Plenty of systems do this sort of data merging automatically, but I get Vladimir’s point: if someone else pulled the rug out from under you, it might change your decision on what that data should look like.

Comments closed

MERGE In Hive

Published 2017-08-17 by Kevin Feasel

Carter Shanklin introduces the MERGE operator in Hive:

USE CASE 2: UPDATE HIVE PARTITIONS.

A common strategy in Hive is to partition data by date. This simplifies data loads and improves performance. Regardless of your partitioning strategy you will occasionally have data in the wrong partition. For example, suppose customer data is supplied by a 3rd-party and includes a customer signup date. If the provider had a software bug and needed to change customer signup dates, suddenly records are in the wrong partition and need to be cleaned up.

It has been interesting to see Hive morph over the past few years from a batch warehousing system to something approaching a relational warehouse. This is one additional step in that direction.

Comments closed

Thinking About The Data Lake

Published 2017-07-03 by Kevin Feasel

James Serra explains at a high level what the data lake metaphor is and how it works:

The data lake introduces a new data analysis paradigm shift:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data. With the old way, you have to know the questions to ask. The new way supports situations when you don’t know the questions to ask.

This solves the two biggest reasons why many EDW projects fail:

Too much time spent modeling when you don’t know all of the questions your data needs to answer
Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

There are some good details here. My addition would be to reiterate the importance of a good data governance policy.

Comments closed

Data Wrangling With Power Query

Published 2017-06-07 by Kevin Feasel

Eugene Meidinger parses a complex report using Power Query:

Hmm, so it looks like I made a mistake. I hope my honesty won’t lose me any izzat, or ability to command respect. I think it’s important to see how people really learn and really solve problems. So, I’m including my screw ups in this post.

Apparently, I created a linked table and I can’t see how to edit the the Power Query portion for that. A linked table is a nice way to pull raw data from the Excel workbook. It’s great for reference tables, but doesn’t solve our problem.

Come for the data analysis, stay for the spelling bee. This is part one of a two-parter, focusing on techniques to get the data in a digestible format; part two will do interesting things with the data.

Comments closed

Using Azure Data Factory With Biml

Published 2017-03-06 by Kevin Feasel

Meagan Longoria has a multi-part series on using Biml to script Azure Data Factory tasks to migrate data from an on-prem SQL Server instance to Azure Data Lake Store. Here’s part 1:

My Azure Data Factory is made up of the following components:

Gateway – Allows ADF to retrieve data from an on premises data source
Linked Services – define the connection string and other connection properties for each source and destination
Datasets – Define a pointer to the data you want to process, sometimes defining the schema of the input and output data
Pipelines – combine the data sets and activities and define an execution schedule

Click through for the Biml.

Comments closed