Press "Enter" to skip to content

Category: Data Loading

Data Cleansing In SQLite

Allison Tharp wants to clear out kinda-sorta duplicates from a SQLite table:

However, now I have a lot of database entries that are unneeded.  I thought I would take the time to clean this up (even though I’ll no longer use the data and could easily just delete the tables).  For the BGG Hotness, I have the tables: hotgame, hotperson, and hotcompany.  I have 7,350 rows in each of those tables, since I collected data on 50 rankings every hour for just over 6 days.  However, since the BGG hotness rankings only update daily, I really only need 300 rows (50 rankings * 6 days = 300 rows).

I now think the rankings update between 3 and 4, so I want to only keep the entries from 4:00 AM.  I use the following SELECT statement to make sure I’m in the ballpark with where the data is that I want to keep:

There are several ways to solve this problem; this one is easy and works.  The syntax won’t work for all database platforms, but does the trick for SQLite.
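Allison’s post has the statements she actually ran; purely as a sketch of the general idea (the hotgame table is hers, but the retrieved_at timestamp column is a placeholder of mine, not her schema), a SQLite clean-up along these lines would keep only the 4:00 AM rows:

-- Sanity check: how many rows would survive (assumes retrieved_at holds ISO-8601 text)
SELECT COUNT(*) FROM hotgame WHERE strftime('%H', retrieved_at) = '04';

-- Then remove everything else
DELETE FROM hotgame WHERE strftime('%H', retrieved_at) <> '04';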


Bulk Loading Text Files With Line Feeds

Steve Jones runs into a scenario in which he wants to bulk load a file not in standard Windows CRLF format:

That’s not good. I suspected this was because of the format of the file, so I added a row terminator.

BULK insert MyTable
from 'C:\SampleFiles\input.txt'
with ( ROWTERMINATOR = '\r')

That didn’t help. I suspected this was because of the terminators for some reason. I also tried the newline (\n) terminator, and both, but nothing worked.

Since I was worried about formatting, I decided to look at the file. My first choice here is XVI32, and when I opened the file, I could see that only a line feed (0x0A) was used.

Little annoyances like this make me more appreciative of Integration Services (and its mess of little annoyances…).
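If you run into the same thing, one option that generally handles LF-only files (reusing Steve’s sample table and path from the excerpt above) is to hand BULK INSERT the line feed as a hex byte rather than \n or \r:

BULK INSERT MyTable
FROM 'C:\SampleFiles\input.txt'
WITH (ROWTERMINATOR = '0x0a');  -- 0x0a is the line feed character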


Scrubbing Data

Tom Norman has a series going on scrubbing data before moving it to lower environments.

Part 1:

Have you ever heard, “but it works on my machine”? Is this because of data perfection in Development and QA or having specific failure conditions? Can you think of all the data scenarios that accompany Production data? What about performance? Why did the application fail? What happens if I add this index?

Here are the reasons I believe you should get a scrubbed version of your production database into your Development, QA and UAT environments.

Part 2:

All of us have Production database servers and hopefully you also have additional database servers for Development, QA and UAT. Some IT shops will also have a Continuous Integration server and maybe other servers. If you only have Production servers this needs to be addressed and is outside the scope of this post. In the locations where I have worked, we also have a Scrub server. The question is, when a script executes, do you know what environment the query is executing in? Most scripts will not care what environment the script executes in but other scripts could cause damage in a Production environment. For example, if the script is removing email addresses so you don’t spam your clients with automated email messages, you would not want the script to execute in a Production environment.

So how do you make your database server environmentally aware?


The concept of a dedicated scrub server is interesting; it’s not something I’ve thought about before.  I’m looking forward to seeing the rest of the series.
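Tom’s posts show how he actually implements it; as a loose illustration of the concept only (the dbo.EnvironmentInfo table and its EnvironmentName column are placeholders of my own, not his design), a guard at the top of a scrub script might look like this:

-- Assumes each server carries a one-row table identifying its own environment
IF EXISTS (SELECT 1 FROM dbo.EnvironmentInfo WHERE EnvironmentName = 'Production')
BEGIN
    RAISERROR('Scrub scripts must not run against Production.', 16, 1);
    RETURN;
END;

-- Otherwise it is safe to continue, e.g. clearing out client e-mail addresses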


Check Bulk Insert Errors

Tom Staab points out that BULK INSERT allows up to 10 errors by default:

The issue was that the last row in a text file contained the row count, so he needed to bulk import all but that last row.

My solution was to set maxerrors to 1 so the import would ignore the last row due to the error. Any other row with an error would still fail the import. This reminded me of one of my least favorite defaults in SQL Server, so I decided to write about it here as well. A lot of people don’t realize this, but by default a bulk insert will only fail after 10 errors. Why not 0? I’ve never understood that. If you don’t change the default and then bulk insert 20 rows of data from a file, it will only fail if over half of the rows cause an error.

Keep track of error incidence and what that means for your data.  The default of 10 errors does seem rather strange.
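For reference, the setting in question is MAXERRORS; a minimal example of tightening it (the table and file names here are made up) would be:

BULK INSERT dbo.ImportStaging
FROM 'C:\Data\import.txt'
WITH (MAXERRORS = 0);  -- fail on the first bad row instead of tolerating ten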
