Loading Azure SQL Data Warehouse

Grant Fritchey enumerates various pains he went through to load data into Azure SQL Data Warehouse:

Let’s bottom line this. Data is dirty. Any ETL process is likely to expose the trashy data, so what are the real issues here?

- First up, Powershell is my bestest buddy. Scripting everything through Posh right out of the gate made a huge difference in my ability to constantly reset.
- The fact that our most common processes spit out UTF16, while not a problem, is something you need to get on top of (see Posh above).
- AZCopy works really well and the command line is easy to implement, but you’ll need to go through the process of setting up all the appropriate connections.
- Polybase is wicked cool. Yes, I had to hop through hoops to get the data clean, but, what you should note in the above descriptions, at no point did I need to go and recreate my EXTERNAL TABLE from Azure SQL Data Warehouse. I just kept updating the files through my scripts and as soon as they were loaded up to blob storage, I ran a query against them. No additional work required. That’s nice.
- You’re going to want to go through your data types on your data before you start the migration.
- You’re going to have to plan for a lot of time cleaning your data if you haven’t already.
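To make the UTF16 point concrete, here is a minimal sketch of re-encoding an export file before uploading it to blob storage. Grant did this with PowerShell scripts; the Python below is a stand-in, not his code. It assumes the external file format expects UTF-8 text, and the function name `convert_utf16_to_utf8` is hypothetical.

```python
from pathlib import Path


def convert_utf16_to_utf8(src: Path, dst: Path, chunk_chars: int = 1 << 20) -> int:
    """Re-encode a UTF-16 text file as UTF-8 and return the number of characters written.

    Assumption: the source file starts with a byte-order mark, which is what
    the "utf-16" codec uses to pick the byte order (the BOM itself is dropped).
    """
    total = 0
    # newline="" on both handles keeps the original line endings untouched,
    # so row terminators in a delimited file are not silently rewritten.
    with open(src, "r", encoding="utf-16", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        while True:
            chunk = fin.read(chunk_chars)  # stream in chunks; exports can be large
            if not chunk:
                break
            fout.write(chunk)
            total += len(chunk)
    return total
```

Streaming in chunks keeps memory use flat for large exports. The converted file can then be uploaded with AZCopy as described in the quote.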

Grant’s method is the right way, especially for early tests. In practice, for the type of data you’d put into Polybase, you might want to create the external table so that it tolerates a certain number of rejected rows. Grant didn’t specify the REJECT_TYPE and REJECT_VALUE attributes when creating his external table, so the default of 0 rejected rows applied, meaning a single bad row fails the query. In a warehouse with billions of rows, hand-fixing all of that data is a nasty proposition, and if you’re writing queries whose results likely won’t change if a few (dozen?) records get dropped, rejecting bad data might be a good way to keep some of your sanity.
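As a rough illustration of the reject options, the sketch below builds a CREATE EXTERNAL TABLE statement that includes REJECT_TYPE and REJECT_VALUE. It is not Grant's DDL. The helper `external_table_ddl` and the table, column, data source, and file format names in the usage example are all hypothetical, and the helper does no escaping, so it assumes trusted inputs.

```python
def external_table_ddl(table, columns, location, data_source, file_format,
                       reject_type="VALUE", reject_value=0,
                       reject_sample_value=None):
    """Return T-SQL for a PolyBase external table with explicit reject settings.

    columns is a list of (name, sql_type) pairs. All identifiers are inserted
    as-is (assumption: they come from a trusted script, not user input).
    """
    reject_type = reject_type.upper()
    if reject_type not in ("VALUE", "PERCENTAGE"):
        raise ValueError("reject_type must be VALUE or PERCENTAGE")

    options = [
        f"LOCATION = '{location}'",
        f"DATA_SOURCE = {data_source}",
        f"FILE_FORMAT = {file_format}",
        f"REJECT_TYPE = {reject_type}",
        f"REJECT_VALUE = {reject_value}",
    ]
    if reject_type == "PERCENTAGE":
        # Percentage-based rejection also needs a sample size to evaluate against.
        if reject_sample_value is None:
            raise ValueError("PERCENTAGE requires reject_sample_value")
        options.append(f"REJECT_SAMPLE_VALUE = {reject_sample_value}")

    cols = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    opts = ",\n".join(f"    {opt}" for opt in options)
    return f"CREATE EXTERNAL TABLE {table} (\n{cols}\n)\nWITH (\n{opts}\n);"


if __name__ == "__main__":
    # Hypothetical names throughout; tolerate up to 50 bad rows per query.
    print(external_table_ddl(
        table="dbo.SalesExternal",
        columns=[("SaleId", "INT"), ("Amount", "DECIMAL(18, 2)")],
        location="/sales/",
        data_source="AzureBlobSales",
        file_format="PipeDelimitedText",
        reject_value=50,
    ))
```

With REJECT_TYPE = VALUE, a query against the external table fails only once the number of rejected rows exceeds REJECT_VALUE, which is the trade-off described above: a few dropped records in exchange for not hand-cleaning every file.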

