Database Containment Checks

Parikshit Savjani discusses a nice little performance optimization in the latest versions of SQL Server 2014 and 2016:

Starting SQL 2014 SP1 CU8, SQL 2014 SP2 CU1 and SQL 2016 CU1, the spinlock to check the database containment property is replaced by the “load acquire and store release” lock semantics, which is a non-blocking lock-free synchronization mechanism between the concurrent threads. This avoids exclusive spinlocks and thereby avoids the spinlock collisions between the concurrent threads executing stored procedures from same database as described earlier. This change improves the overall concurrency and scalability of the system especially if all the worker threads are simultaneously executing a stored procedure from same database.

On extremely busy systems, this might be a reason to update.

Documenting A Data Warehouse

Jesse Seymour discusses a few forms of documentation for a data warehouse:

Extended properties are a great way to internally document the data warehouse.  The key advantage here is that the values of these extended properties can be retrieved with a T-SQL query.  This allows us to access this information with a view as needed.  My favorite method of using this is to create an SSRS report that end users can run to look up the attributes and comments I store in the extended property.  Data warehouse tools take some of the pain out of the process.  Unfortunately, not all tools support use of extended properties.  Make sure your tool does or consider changing tools.  Be sure to document the names and use cases for each property you create.  Consistency is the key to the value here.

I’ve never been a big fan of extended properties, mostly because I typically don’t work with tools which expose that information easily.  Regardless, there are other important forms of documentation, so read on.

RDBMS To Hive Via Kafka

Kevin Feasel

2016-08-26

ETL, Hadoop

Rajesh Nadipalli shows how to use Kafka to read relational database data and feed it to Hive:

Processes that publish messages to a Kafka topic are called “producers.” “Topics” are feeds of messages in categories that Kafka maintains. The transactions from RDBMS will be converted to Kafka topics. For this example, let’s consider a database for a sales team from which transactions are published as Kafka topics. The following steps are required to set up the Kafka producer

I’d call this a non-trivial but still straightforward exercise.  Step 1 from the SQL Server side could be reading from transaction logs (which would be the least-intrusive), but you could also set up something like change tracking and fire off messages when important tables’ records change.

Data Frames

Kevin Feasel

2016-08-26

R

Meltem Ballan introduces data frames in R:

Our first data frame constrained of seven vectors, Customer_Id, loan_type, First_Name, Last_name, Gender, Zip_code and amount.

NOTE: R is case sensitive. That is why I have used lower and upper case for you to practice.

After we run the lines we want to see how our first data frame looks. Following command will suffice that need:

>View(my_first_df)

If you’re coming from a SQL background, data frames are tables.  Well-formed (“clean”) data frames more or less follow first normal form.

Collecting Wait Stats

Kendra Little on collecting wait stats as part of a baseline:

I do love wait stats!

If you listened to the performance tuning methodology I outlined in an earlier episode, you saw how important I think wait stats are for troubleshooting performance.

If you missed that episode, it’s called Lost in Performance Tuning. (I’ve got an outline of the discussion in the blog post, as always.)

I agree with Kendra’s advice that buying a vendor tool is the right choice here, whenever it’s possible.  It’s fairly likely that you’ll spend more money creating (and maintaining) your own scaled-down version of a vendor tool than biting the bullet and paying for a packaged product.

Dynamic Index Generation

Brent Ozar generates 999 indexes:

The CHARACTER_MAXIMUM_LENGTH <> -1 OR IS NULL stuff is because I don’t want to try to index NVARCHAR(MAX) fields, but I do want to index NVARCHAR(50) fields.

The ORDER BY is because I wanted to get a wide variety of leading fields. If I just ordered by c1.COLUMN_NAME, I wouldn’t get all of the first fields covered in the first 999 indexes. (And I’m not even using a large table.)

Sometimes I think I’ve worked on systems which used this script to build indexes.  But then I read the index names:  “dta.”  And it all makes sense.

Training Data With Azure ML

Koos van Strien discusses training data sets and cross-validating results:

When choosing a train and testset, you’ll implicitly introduce a new bias: it could be that the model you just trained predicts well for this particular testset, when trained for this particular trainset. To reduce this bias, you could “cross-validate” your results.

Cross-validation (often abbreviated as just “cv”) splits the dataset into n folds. Each fold is used once as a testset, using all other folds together as a training set. So in our pizza example with 100 records, with 5 folds we will have 5 test runs:

This isn’t Azure ML-specific, and is good reading.

Last Item In Each Group

Reza Rad shows how to get the last item in each group using Power Query:

Scenario that I want to solve as an example is this:

FactInternetSales has sales transaction information for each customer, by each product, each order date and some other information. We want to have a grouped table by customer, which has the number of sales transaction by each customer, total sales amount for that customer, the first and the last sales amount for that customer. First and last defined by the first and last order date for the transaction.

In T-SQL, this sounds like the job of window functions.  In Power BI, we write M.

Using The OUTPUT Clause

Kevin Feasel

2016-08-25

T-SQL

Steve Jones looks at the OUTPUT clause:

I often see people struggling to use triggers for auditing, or having issues with building them to handle multi row updates. However, there’s another choice: the OUTPUT clause.

Not many people use this clause, but it’s a great way to access the virtual inserted and deleted tables in your code.

My favorite post on the topic is still Adam Machanic’s.

Confirming Checkpoints

Arun Sirpal shows how to log when checkpointing runs:

Via Configuration manager I enabled trace flags 3502 and 3605 – both needed to get the checkpoint information and write it to the error log.

I then shutdown the machine, on start-up I looked into the error log.

1
EXEC XP_READERRORLOG

Notice the ‘s’ in front of the spid<number>? Well that means the checkpoint was done via the automatic process; if you do a manual checkpoint it won’t see this letter.

I did not know that the “s” indicated that this was an automated process.

Categories

October 2018
MTWTFSS
« Sep  
1234567
891011121314
15161718192021
22232425262728
293031