Since the load was taking longer than expected, we examined the sys.dm_exec_requests DMV while the load was running and saw a fair amount of latch contention on various pages, which wouldn’t be expected if data were being loaded via a bulk API.
Examining the statements being executed, we saw that the JDBC driver uses sp_prepare followed by sp_execute for each inserted row; therefore, the operation is not a bulk insert. Examining the Spark JDBC connector source code confirms this: it builds a batch consisting of singleton insert statements and then executes the batch via the prepare/execute model.
Such is the power of bulk insertion.
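The difference between the two patterns is easy to demonstrate outside SQL Server. This is a minimal, illustrative sketch using Python and SQLite (not the Spark/JDBC stack from the post): one prepared-statement execution per row, versus handing the driver all the rows at once.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
rows = [(i, f"row {i}") for i in range(10_000)]

# Row-by-row: one statement execution per row, analogous to the
# sp_prepare / sp_execute pattern the excerpt describes.
start = time.perf_counter()
for r in rows:
    conn.execute("INSERT INTO t VALUES (?, ?)", r)
row_by_row = time.perf_counter() - start

conn.execute("DELETE FROM t")

# Batched: give the driver the whole set at once, closer in spirit
# to a bulk API.
start = time.perf_counter()
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
batched = time.perf_counter() - start

print(f"row-by-row: {row_by_row:.3f}s, batched: {batched:.3f}s")
```

Even in an in-memory SQLite database the batched path is noticeably cheaper; against a real network round trip per row, the gap grows dramatically.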
Alright, back to the original question: how do you combine optimistic locking and automatic retry? In other words, when the application gets an error from the database saying that the versions of a Product don’t match, how do you retry the same operation?
The short answer is: you don’t. Retrying automatically defeats the very purpose of having an optimistic lock in the first place.
Remember that the locking mechanism is a way to ensure that all changes are taken into consideration when changing a record in the database. In other words, someone should review the new version of the record and make an informed decision as to whether they still want to submit the update. And that should be the same client who originated the initial request; you can’t make that decision for them.
Plenty of systems do this sort of data merging automatically, but I get Vladimir’s point: if someone else pulled the rug out from under you, it might change your decision on what that data should look like.
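The version-check pattern itself is simple. Here is a minimal sketch using Python and SQLite rather than the post’s stack; the table, column, and function names are illustrative, not taken from the original article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)"
)
conn.execute("INSERT INTO product VALUES (1, 'Widget', 1)")

def update_product(conn, product_id, new_name, expected_version):
    """Update only if the version the client originally read is still current."""
    cur = conn.execute(
        "UPDATE product SET name = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_name, product_id, expected_version),
    )
    return cur.rowcount == 1  # 0 rows touched means someone else got there first

# Client A read version 1 and updates successfully.
assert update_product(conn, 1, "Widget v2", 1)

# Client B also read version 1; its update is rejected. Rather than
# silently retrying, surface the conflict so the client can re-read
# the row and decide whether it still wants its change.
ok = update_product(conn, 1, "Widget B", 1)
print("conflict" if not ok else "updated")  # → conflict
```

The key design point matches the excerpt: the function reports the conflict back to the caller instead of looping until the write sticks.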
USE CASE 2: UPDATE HIVE PARTITIONS.
A common strategy in Hive is to partition data by date. This simplifies data loads and improves performance. Regardless of your partitioning strategy, you will occasionally have data in the wrong partition. For example, suppose customer data is supplied by a third party and includes a customer signup date. If the provider had a software bug and needed to change customer signup dates, suddenly records are in the wrong partitions and need to be cleaned up.
It has been interesting to see Hive morph over the past few years from a batch warehousing system to something approaching a relational warehouse. This is one additional step in that direction.
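The cleanup described above starts with finding the misplaced rows. A conceptual sketch in Python (not HiveQL, and with made-up field names) of detecting rows whose partition no longer matches the value they were partitioned on:

```python
from datetime import date

# Rows as they landed in the warehouse: the partition each row was written
# to, plus the (possibly corrected) signup date inside the row itself.
# Field names here are illustrative, not from the linked post.
rows = [
    {"partition_dt": date(2017, 3, 1), "signup_dt": date(2017, 3, 1)},
    {"partition_dt": date(2017, 3, 1), "signup_dt": date(2017, 3, 4)},  # date corrected upstream
    {"partition_dt": date(2017, 3, 2), "signup_dt": date(2017, 3, 2)},
]

# Any row whose signup date disagrees with its partition needs to be
# rewritten into the correct partition.
misplaced = [r for r in rows if r["partition_dt"] != r["signup_dt"]]
print(len(misplaced))  # → 1
```

In Hive terms, the equivalent is a query comparing the partition column against the date column, feeding an insert into the correct partitions and a delete (or partition rewrite) for the old ones.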
The data lake introduces a paradigm shift in data analysis:
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
This allows you to avoid a lot of up-front work before you are able to analyze data. With the old way, you have to know the questions to ask. The new way supports situations when you don’t know the questions to ask.
This solves the two biggest reasons why many EDW projects fail:
Too much time spent modeling when you don’t know all of the questions your data needs to answer
Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value
There are some good details here. My addition would be to reiterate the importance of a good data governance policy.
Hmm, so it looks like I made a mistake. I hope my honesty won’t lose me any izzat, or ability to command respect. I think it’s important to see how people really learn and really solve problems, so I’m including my screw-ups in this post.
Apparently, I created a linked table, and I can’t see how to edit the Power Query portion for that. A linked table is a nice way to pull raw data from the Excel workbook. It’s great for reference tables, but it doesn’t solve our problem.
Come for the data analysis, stay for the spelling bee. This is part one of a two-parter, focusing on techniques to get the data in a digestible format; part two will do interesting things with the data.
Meagan Longoria has a multi-part series on using Biml to script Azure Data Factory tasks to migrate data from an on-prem SQL Server instance to Azure Data Lake Store. Here’s part 1:
My Azure Data Factory is made up of the following components:
Gateway – allows ADF to retrieve data from an on-premises data source
Linked Services – define the connection string and other connection properties for each source and destination
Datasets – define a pointer to the data you want to process, sometimes defining the schema of the input and output data
Pipelines – combine the datasets and activities and define an execution schedule
Click through for the Biml.
This requirement has been addressed in SQL Server 2016 (and backported to SQL Server 2014 SP2). To test this, I obtained a UTF-8 dataset from http://www.columbia.edu/~fdc/utf8/. The dataset is a translation of the sentence “I can eat glass and it doesn’t hurt me” in several languages. A few lines of sample data are shown here:
(As an aside, it is entirely possible to load Unicode text such as above into SQL Server even without this improvement, as long as the source text file uses a Unicode encoding other than UTF-8.)
I ran into this problem before, where developers wanted to bulk load UTF-8 but had to settle for an inferior solution.
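The reason the encoding matters is that the same Unicode text produces entirely different bytes depending on how it is encoded, so the loader has to be told which encoding the source file uses. A small Python sketch, using a sentence in the same spirit as the dataset above (the exact sample line is my own stand-in):

```python
# The same text, two very different byte streams. A loader that assumes
# one encoding will mangle a file written in the other.
text = "Я могу есть стекло, оно мне не вредит"  # "I can eat glass…" in Russian

utf8 = text.encode("utf-8")        # what the new CODEPAGE support reads
utf16 = text.encode("utf-16-le")   # the Unicode encoding SQL Server could already load

print(len(utf8), len(utf16))  # different byte counts for identical text
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-le") == text
```

Round-tripping works in both encodings, but the byte streams are incompatible, which is exactly why a UTF-8 file could not simply be fed to a loader expecting UTF-16.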
First, you have to read the file you want to copy into a SQL Server database. It can be a test database, a new database created for this purpose, or even TempDB. There are only two requirements for the database:
– It must not be a production database;
– It must have enough space to accommodate the file you want to copy.
The idea is that if the database engine’s service account has rights to a file that you don’t have permission to access, you can bulk copy the contents as a binary blob, then grab the contents and write the results to your local system using bcp. Sure, it becomes your company’s most expensive file copy tool, but I love the mad ingeniousness behind it.
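The round trip itself is straightforward. Here is a hypothetical demonstration with SQLite standing in for SQL Server (no bcp involved): read a file into a table as a binary blob, then write the blob back out as a copy.

```python
import os
import sqlite3
import tempfile

# Create a stand-in source file; in the real trick this would be a file
# only the database engine's service account can read.
src = tempfile.NamedTemporaryFile(delete=False)
src.write(b"secret file contents")
src.close()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_copy (contents BLOB)")

# Step 1: load the file into the table as a single binary blob.
with open(src.name, "rb") as f:
    conn.execute("INSERT INTO file_copy VALUES (?)", (f.read(),))

# Step 2: select the blob back out and write it to a new file,
# which is what bcp does in the original article.
blob = conn.execute("SELECT contents FROM file_copy").fetchone()[0]
dst = src.name + ".copy"
with open(dst, "wb") as f:
    f.write(blob)

with open(dst, "rb") as f:
    print(f.read() == b"secret file contents")  # → True
os.remove(src.name)
os.remove(dst)
```

An expensive copy command, as the excerpt says, but the mechanics are just blob-in, blob-out.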
However, now I have a lot of database entries that are unneeded. I thought I would take the time to clean this up (even though I’ll no longer use the data and could easily just delete the tables). For the BGG Hotness, I have the tables: hotgame, hotperson, and hotcompany. I have 7,350 rows in each of those tables, since I collected data on 50 rankings every hour for just over 6 days. However, since the BGG hotness rankings only update daily, I really only need 300 rows (50 rankings * 6 days = 300 rows).
I now think the rankings update between 3:00 and 4:00 AM, so I want to keep only the entries from 4:00 AM. I use the following SELECT statement to make sure I’m in the ballpark with where the data is that I want to keep:
There are several ways to solve this problem; this one is easy and works. The syntax won’t work for all database platforms, but does the trick for SQLite.
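The SQLite-specific trick is strftime, which extracts the hour from a timestamp string. A self-contained sketch via Python’s sqlite3 module, with a toy schema (the post’s actual table and column names differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hotgame (ranking INTEGER, name TEXT, collected_at TEXT)"
)
# Three hourly snapshots of the same ranking.
conn.executemany(
    "INSERT INTO hotgame VALUES (?, ?, ?)",
    [
        (1, "Gloomhaven", "2017-03-01 03:00:00"),
        (1, "Gloomhaven", "2017-03-01 04:00:00"),
        (1, "Gloomhaven", "2017-03-01 05:00:00"),
    ],
)

# Keep only the 4:00 AM snapshot; strftime('%H', …) pulls the
# two-digit hour out of the timestamp string.
conn.execute("DELETE FROM hotgame WHERE strftime('%H', collected_at) <> '04'")
remaining = conn.execute("SELECT COUNT(*) FROM hotgame").fetchone()[0]
print(remaining)  # → 1
```

As the note above says, strftime in a WHERE clause is SQLite-flavored; other platforms spell the same idea differently (DATEPART, EXTRACT, and so on).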
That’s not good. I suspected this was because of the format of the file, so I added a row terminator.
BULK INSERT MyTable
WITH ( ROWTERMINATOR = '\r' )
That didn’t help. I suspected the terminators were still the problem somehow, so I also tried the newline (\n) terminator, and then both together, but nothing worked.
Since I was worried about formatting, I decided to look at the file. My first choice here is XVI32, and when I opened the file, I could see that only a line feed (0x0A) was used.
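A hex editor is the definitive check, but a few lines of Python can do the same triage: count the terminator bytes to see whether a file uses \r\n, \n alone, or \r alone. The byte string here stands in for reading the real file; the filename is illustrative.

```python
# Stand-in for: data = open("MyFile.txt", "rb").read()
data = b"row1\nrow2\nrow3\n"

crlf = data.count(b"\r\n")
lf = data.count(b"\n") - crlf   # bare line feeds (0x0A)
cr = data.count(b"\r") - crlf   # bare carriage returns (0x0D)
print(f"CRLF={crlf} LF={lf} CR={cr}")  # → CRLF=0 LF=3 CR=0
```

For a file like this one, terminated by bare line feeds, BULK INSERT accepts the hex form ROWTERMINATOR = '0x0a'.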
Little annoyances like this make me more appreciative of Integration Services (and its mess of little annoyances…).