Thinking About The Data Lake

James Serra explains at a high level what the data lake metaphor is and how it works:

The data lake introduces a new data analysis paradigm shift:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data.  With the old way, you have to know the questions to ask.  The new way supports situations when you don’t know the questions to ask.

This solves the two biggest reasons why many EDW projects fail:

  • Too much time spent modeling when you don’t know all of the questions your data needs to answer

  • Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

There are some good details here.  My addition would be to reiterate the importance of a good data governance policy.

Data Wrangling With Power Query

Eugene Meidinger parses a complex report using Power Query:

Hmm, so it looks like I made a mistake. I hope my honesty won’t lose me any izzat, or ability to command respect. I think it’s important to see how people really learn and really solve problems. So, I’m including my screw ups in this post.

Apparently, I created a linked table and I can’t see how to edit the Power Query portion for that. A linked table is a nice way to pull raw data from the Excel workbook. It’s great for reference tables, but doesn’t solve our problem.

Come for the data analysis, stay for the spelling bee.  This is part one of a two-parter, focusing on techniques to get the data in a digestible format; part two will do interesting things with the data.

Using Azure Data Factory With Biml

Meagan Longoria has a multi-part series on using Biml to script Azure Data Factory tasks to migrate data from an on-prem SQL Server instance to Azure Data Lake Store.  Here’s part 1:

My Azure Data Factory is made up of the following components:

  • Gateway – Allows ADF to retrieve data from an on premises data source

  • Linked Services – define the connection string and other connection properties for each source and destination

  • Datasets – Define a pointer to the data you want to process, sometimes defining the schema of the input and output data

  • Pipelines – combine the data sets and activities and define an execution schedule

Click through for the Biml.

BCP For UTF-8

Sanjay Mishra notes that SQL Server 2016 and 2014 SP2 support UTF-8 for BCP and BULK INSERT:

This requirement has been addressed in SQL Server 2016 (and backported to SQL Server 2014 SP2). To test this, I obtained a UTF-8 dataset from http://www.columbia.edu/~fdc/utf8/. The dataset is a translation of the sentence “I can eat glass and it doesn’t hurt me” in several languages. A few lines of sample data are shown here:

(As an aside, it is entirely possible to load Unicode text such as above into SQL Server even without this improvement, as long as the source text file uses a Unicode encoding other than UTF-8.)

I ran into this problem before, where developers wanted to bulk load UTF-8 but had to settle for an inferior solution.
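For reference, here is a minimal sketch of the improvement (the path, table, and column names are hypothetical): on SQL Server 2016, or 2014 SP2 and later, CODEPAGE = '65001' tells BULK INSERT to read the source file as UTF-8, and bcp accepts the same code page through its -C argument.

CREATE TABLE dbo.UTF8Demo (Sentence nvarchar(400));

BULK INSERT dbo.UTF8Demo
FROM 'C:\SampleFiles\utf8_sentences.txt'
WITH
(
    CODEPAGE = '65001',      -- UTF-8; rejected by versions before this improvement
    ROWTERMINATOR = '0x0a'   -- adjust to the file's actual line endings
);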

Copying A File Using SQL Server

Slava Murygin makes me want to add a “wacky ideas” category with this one:

At first, you have to read the file you want to copy into a SQL Server. You have to choose a database to perform that action. It can be Test database or you can create a new database to perform that action or it can be even TempDB. There are only two requirements for the database:
– It must not be a production Database;
– Database should have enough of space to accommodate the file you want to copy.

The idea is that if the database engine’s service account has rights to a file that you yourself don’t have permission to read, you can bulk load the file’s contents into a table as a binary blob and then use bcp to write those contents back out to your local system.  Sure, it becomes your company’s most expensive file copy tool, but I love the mad ingeniousness behind it.
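If you want to see the shape of the trick, a rough sketch (with hypothetical paths and table names, and very much not a recommended file copy mechanism) looks something like this:

-- The Database Engine's service account reads the file, not the caller.
CREATE TABLE dbo.FileCopy (FileContents varbinary(max));

INSERT INTO dbo.FileCopy (FileContents)
SELECT BulkColumn
FROM OPENROWSET(BULK N'C:\RestrictedShare\report.xlsx', SINGLE_BLOB) AS src;

From there, something like bcp with queryout and a format file that strips the length prefix can write the varbinary back out to a file you can actually reach.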

Data Cleansing In SQLite

Allison Tharp wants to clear out kinda-sorta duplicates from a SQLite table:

However, now I have a lot of database entries that are unneeded.  I thought I would take the time to clean this up (even though I’ll no longer use the data and could easily just delete the tables).  For the BGG Hotness, I have the tables: hotgame, hotperson, and hotcompany.  I have 7,350 rows in each of those tables, since I collected data on 50 rankings every hour for just over 6 days.  However, since the BGG hotness rankings only update daily, I really only need 300 rows (50 rankings * 6 days = 300 rows).

I now think the rankings update between 3 and 4, so I want to only keep the entries from 4:00 AM.  I use the following SELECT statement to make sure I’m in the ballpark with where the data is that I want to keep:

There are several ways to solve this problem; this one is easy and works.  The syntax won’t work for all database platforms, but does the trick for SQLite.
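As a sketch of the idea (the timestamp column name is hypothetical, and Allison’s actual schema may differ), the cleanup in SQLite could look like this:

-- Keep only the rows captured during the 4 AM collection run.
DELETE FROM hotgame    WHERE strftime('%H', collected_at) <> '04';
DELETE FROM hotperson  WHERE strftime('%H', collected_at) <> '04';
DELETE FROM hotcompany WHERE strftime('%H', collected_at) <> '04';

This assumes the timestamps are stored in a format strftime() understands, such as ISO 8601 text.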

Bulk Loading Text Files With Line Feeds

Steve Jones runs into a scenario in which he wants to bulk load a file not in standard Windows CRLF format:

That’s not good. I suspected this was because of the format of the file, so I added a row terminator.

BULK insert MyTable
from 'C:\SampleFiles\input.txt'
with ( ROWTERMINATOR = '\r')

That didn’t help. I suspected this was because of the terminators for some reason. I also tried the newline (\n) terminator, and both, but nothing worked.

Since I was worried about formatting, I decided to look at the file. My first choice here is XVI32, and when I opened the file, I could see that only a line feed (0x0A) was used.
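For files like this, one common workaround (not necessarily the one Steve landed on) is to pass the line feed explicitly as a hex byte so BULK INSERT stops assuming CRLF:

BULK INSERT MyTable
FROM 'C:\SampleFiles\input.txt'
WITH ( ROWTERMINATOR = '0x0a' );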

Little annoyances like this make me more appreciative of Integration Services (and its mess of little annoyances…).

Scrubbing Data

Tom Norman has a series going on scrubbing data before moving it to lower environments.

Part 1:

Have you ever heard, “but it works on my machine”? Is this because of data perfection in Development and QA or having specific failure conditions? Can you think of all the data scenarios that accompany Production data? What about performance? Why did the application fail? What happens if I add this index?

Here are the reasons I believe you should get a scrubbed version of your production database into your Development, QA and UAT environments.

Part 2:

All of us have Production database servers and hopefully you also have additional database servers for Development, QA and UAT. Some IT shops will also have a Continuous Integration server and maybe other servers. If you only have Production servers this needs to be addressed and is outside the scope of this post. In the locations where I have worked, we also have a Scrub server. The question is, when a script executes, do you know what environment the query is executing in? Most scripts will not care what environment the script executes in but other scripts could cause damage in a Production environment. For example, if the script is removing email addresses so you don’t spam your clients with automated email messages, you would not want the script to execute in a Production environment.

So how do you make your database server environmentally aware?


The concept of a dedicated scrub server is interesting; it’s not something I’ve thought about before.  I’m looking forward to seeing the rest of the series.
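As a purely illustrative answer to Tom’s question (the server names and table below are made up, not his implementation), one simple pattern is to guard scrub scripts with an explicit server-name check before doing anything destructive:

-- Refuse to run anywhere that isn't a known non-production server.
IF @@SERVERNAME NOT IN (N'DEVSQL01', N'QASQL01', N'UATSQL01', N'SCRUBSQL01')
BEGIN
    RAISERROR(N'This scrub script may only run on non-production servers.', 16, 1);
    RETURN;
END;

-- Safe to proceed: for example, neutralize client email addresses.
UPDATE dbo.Client
SET EmailAddress = CONCAT(N'client', ClientID, N'@example.invalid');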

Check Bulk Insert Errors

Tom Staab points out bulk insert allows up to 10 errors by default:

The issue was that the last row in a text file contained the row count, so he needed to bulk import all but that last row.

My solution was to set maxerrors to 1 so the import would ignore the last row due to the error. Any other row with an error would still fail the import. This reminded me of one of my least favorite defaults in SQL Server, so I decided to write about it here as well. A lot of people don’t realize this, but by default a bulk insert will only fail after 10 errors. Why not 0? I’ve never understood that. If you don’t change the default and then bulk insert 20 rows of data from a file, it will only fail if over half of the rows cause an error.

Keep track of error incidence and what that means for your data.  The default of 10 errors does seem rather strange.
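As a sketch of the fix described in the post (the file and table names are hypothetical): the trailing row-count line fails to parse, MAXERRORS = 1 tolerates exactly that one bad row, and any additional error still fails the load.

BULK INSERT dbo.ImportTarget
FROM 'C:\Imports\data_with_trailer.txt'
WITH
(
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    MAXERRORS = 1   -- the default is 10, which can quietly swallow real data problems
);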
