Meagan Longoria has a multi-part series on using Biml to script Azure Data Factory tasks to migrate data from an on-prem SQL Server instance to Azure Data Lake Store. Here’s part 1:
My Azure Data Factory is made up of the following components:
Gateway – Allows ADF to retrieve data from an on premises data source
Linked Services – define the connection string and other connection properties for each source and destination
Datasets – Define a pointer to the data you want to process, sometimes defining the schema of the input and output data
Pipelines – combine the data sets and activities and define an execution schedule
Click through for the Biml.
This requirement has been addressed in SQL Server 2016 (and backported to SQL Server 2014 SP2). To test this, I obtained a UTF-8 dataset from http://www.columbia.edu/~fdc/utf8/. The dataset is translation of the sentence “I can eat glass and it doesn’t hurt me” in several languages. A few lines of sample data are shown here:
(As an aside, it is entirely possible to load Unicode text such as above into SQL Server even without this improvement, as long as the source text file uses a Unicode encoding other than UTF-8.)
I ran into this problem before, where developers wanted to bulk load UTF-8 but had to settle for an inferior solution.
At first, you have to read the file you want to copy into a SQL Server. You have to choose a database to perform that action. It can be Test database or you can create a new database to perform that action or it can be even TempDB. There is only two requirements for the database:
– It must not be a production Database;
– Database should have enough of space to accommodate the file you want to copy.
The idea is that if the database engine’s service account has rights to a file you want to access but don’t have permissions to access, you can bulk copy the contents as a binary blob and then grab the contents and write the results to your local system using bcp. Sure, it becomes your company’s most expensive file copy tool, but I love the mad ingeniousness behind it.
However, now I have a lot of database entries that are unneeded. I thought I would take the time to clean this up (even though I’ll no longer use the data and could easily just delete the tables). For the BGG Hotness, I have the tables: hotgame, hotperson, and hotcompany. I have 7,350 rows in each of those tables, since I collected data on 50 rankings every hour for just over 6 days. However, since the BGG hotness rankings only update daily, I really only need 300 rows (50 rankings * 6 days = 300 rows).
I know think the rankings update between 3 and 4, so I want to only keep the entries from 4:00 AM. I use the following SELECT statement to make sure I’m in the ballpark with where the data is that I want to keep:
There are several ways to solve this problem; this one is easy and works. The syntax won’t work for all database platforms, but does the trick for SQLite.
That’s not good. I suspected this was because of the format of the file, so I added a row terminator.
BULK insert MyTable
with ( ROWTERMINATOR = ‘\r’)
That didn’t help. I suspected this was because of the terminators for some reason. I also tried the newline (\n) terminator, and both, but nothing worked.
Since I was worried about formatting, I decided to look at the file. My first choice here is XVI32, and when I opened the file, I could see that only a line feed (0x0A) was used.
Little annoyances like this make me more appreciative of Integration Services (and its mess of little annoyances…).
Tom Norman has a series going on scrubbing data before moving it to lower environments.
Have you ever heard, “but it works on my machine”? Is this because of data perfection in Development and QA or having specific failure conditions? Can you think of all the data scenarios that accompany Production data? What about performance? Why did the application fail? What happens if I add this index?
Here are the reasons I believe you should get a scrubbed version of your production database into your Development, QA and UAT environments.
All of us have Production database servers and hopefully you also have additional database servers for Development, QA and UAT. Some IT shops will also have a Continuous Integration server and maybe other servers. If you only have Production servers this needs to be addressed and is outside the scope of this post. In the locations where I have worked, we also have a Scrub server. The question is, when a script executes, do you know what environment the query is executing in? Most scripts will not care what environment the script executes in but other scripts could cause damage in a Production environment. For example, if the script is removing email addresses so you don’t spam your clients with automated email messages, you would not want the script to execute in a Production environment.
So how do you make your database server environmentally aware?
The concept of a dedicated scrub server is interesting; it’s not something I’ve thought about before. I’m looking forward to seeing the rest of the series.
The issue was that the last row in a text file contained the row count, so he needed to bulk import all but that last row.
My solution was to set maxerrors to 1 so the import would ignore the last row due to the error. Any other row with an error would still fail the import. This reminded me of one of my least favorite defaults in SQL Server, so I decided to write about it here as well. A lot of people don’t realize this, but by default a bulk insert will only fail after 10 errors. Why not 0? I’ve never understood that. If you don’t change the default and then bulk insert 20 rows of data from a file, it will only fail if over half of the rows cause an error.
Keep track of error incidence and what that means for your data. The default of 10 errors does seem rather strange.