Curated SQL – A Fine Slice Of SQL Server

Keep That Data Raw

Published 2017-09-18 by Kevin Feasel

Archana Madhavan argues that you should retain your raw data:

When your pipeline already has to read every line of your data, it’s tempting to make it perform some fancy transformations. But you should steer clear of these add-ons so that you:

Avoid flawed calculations. If you have thousands of machines running your pipeline in real-time, sure, it’s easy to collect your data — but not so easy to tell if those machines are performing the right calculations.
Won’t limit yourself to the aggregates you decided on in the past. If you’re performing actions on your data as it streams by, you only get one shot. If you change your mind about what you want to calculate, you can only get those new stats going forward — your old data is already set in stone.
Won’t break the pipeline. If you start doing fancy stuff on the pipeline, you’re eventually going to break it. So you may have a great idea for a new calculation, but if you implement it, you’re putting the hundreds of other calculations used by your coworkers in jeopardy. When a pipeline breaks down, you may never get that data.
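To make the second point concrete, here is a minimal Python sketch. The event fields and values are invented for illustration and do not come from the article. It contrasts a pipeline that keeps only an aggregate with one that retains raw events and can therefore answer a question nobody asked up front.

```python
from collections import defaultdict

# Hypothetical raw events; field names and values are illustrative only.
raw_events = [
    {"user": "a", "page": "home", "ms": 120},
    {"user": "b", "page": "home", "ms": 340},
    {"user": "a", "page": "cart", "ms": 200},
    {"user": "c", "page": "cart", "ms": 80},
]


def streaming_page_counts(events):
    """Aggregate on the fly: only page view counts survive."""
    counts = defaultdict(int)
    for event in events:
        counts[event["page"]] += 1
    return dict(counts)


def average_ms_per_page(events):
    """A statistic decided on later; it can only be computed from raw events."""
    totals = defaultdict(lambda: [0, 0])  # page -> [sum of ms, event count]
    for event in events:
        totals[event["page"]][0] += event["ms"]
        totals[event["page"]][1] += 1
    return {page: total / n for page, (total, n) in totals.items()}


page_counts = streaming_page_counts(raw_events)
avg_ms = average_ms_per_page(raw_events)
```

If only `page_counts` had been stored, `avg_ms` could not be recovered for past data; with the raw events on hand, it is one more pass over the same input.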

The problem is that even if the cost of storage is much cheaper than before, there’s a fairly long tail before you get into potential revenue generation. I like the idea, but selling it is hard when you generate a huge amount of data.

Long-Term Storage In Kafka

Published 2017-09-18 by Kevin Feasel

Jay Kreps shows us that you can use Kafka as a primary data store:

The short answer is that it’s not insane, people do this all the time, and Kafka was actually designed for this type of usage. But first, why might you want to do this? There are actually a number of use cases; here are a few:

You may be building an application using event sourcing and need a store for the log of changes. Theoretically you could use any system to store this log, but Kafka directly solves a lot of the problems of an immutable log and “materialized views” computed off of that. The New York Times does this for all their article data as the heart of their CMS.
You may have an in-memory cache in each instance of your application that is fed by updates from Kafka. A very simple way of building this is to make the Kafka topic log compacted, and have the app simply start fresh at offset zero whenever it restarts to populate its cache.
Stream processing jobs do computation off a stream of data coming via Kafka. When the logic of the stream processing code changes, you often want to recompute your results. A very simple way to do this is just to reset the offset for the program to zero to recompute the results with the new code. This sometimes goes by the somewhat grandiose name of The Kappa Architecture.
Kafka is often used to capture and distribute a stream of database updates (this is often called Change Data Capture or CDC). Applications that consume this data in steady state just need the newest changes; however, new applications need to start with a full dump or snapshot of data. Performing a full dump of a large production database is often a very delicate and time-consuming operation. Enabling log compaction on the topic containing the stream of changes allows consumers of this data to simply reload by resetting to offset zero.
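The cache-rebuild and CDC cases above both lean on log compaction. As a rough mental model only, here is a pure-Python simulation; it does not use any Kafka client, the keys and values are made up, and it simplifies tombstone handling by dropping deleted keys immediately, whereas a real broker retains tombstones for a configurable period.

```python
def compact(log):
    """Keep only the latest record per key, preserving offset order.

    Simplified model of log compaction: a None value is treated as a
    tombstone, and the key is removed from the compacted log right away.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = sorted(
        (offset, key, value) for key, (offset, value) in latest.items()
    )
    return [(o, k, v) for o, k, v in survivors if v is not None]


def rebuild_cache(compacted):
    """Replay the compacted log from the start to populate an in-memory cache."""
    cache = {}
    for _offset, key, value in compacted:
        cache[key] = value
    return cache


# Hypothetical change stream: (key, value) pairs in arrival order.
change_log = [
    ("user:1", "alice"),
    ("user:2", "bob"),
    ("user:1", "alice_v2"),
    ("user:3", "carol"),
    ("user:2", None),  # tombstone: user:2 was deleted
]

compacted = compact(change_log)
cache = rebuild_cache(compacted)
```

A consumer that restarts and reads `compacted` from the beginning ends up with the same state as one that saw every change, which is why resetting to offset zero can stand in for a full database dump.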

This is a great article, especially the part about how Kafka is not the data storage system; there are reasons you’d want data in other formats as well (like relational databases, which are great for random access queries).

tSQLt And VS Database Tests

Published 2017-09-18 by Kevin Feasel

Gavin Campbell combines tSQLt with a Visual Studio database test project:

There are a few ways of getting the tSQLt objects deployed to where they are needed for testing. The way I use most often is basically this one, whereby we create a .dacpac of just the tSQLt objects (or use one we made earlier!), and create a second database project with a Database Reference of type “Same Database” to the project we are trying to test, and a reference to our tSQLt .dacpac. The .dacpac file needs to be somewhere in our source folders, as it will be added by path. We also need a reference to the master database, as this is required to build the tSQLt project.

I see the two tools as serving completely different purposes: tSQLt is a decent unit test framework, whereas a Visual Studio database project is a decent integration test framework. It’s interesting that Gavin was able to combine them here but aside from having a common test runner, my inclination would be to keep them separated.

Long Live The DBA

Published 2017-09-18 by Kevin Feasel

Kellyn Pot’vin-Gorman notes that the “Gone will be the DBA” trend has hit Oracle as well:

Any DBA who specializes in optimization knows that hardware offers around 15% overall opportunity for improvement. My favorite quote from Cary Millsap, “You can’t hardware your way out of a software problem” is quite fitting, too. A hardware upgrade can offer a quick increase in performance, only for the problem to seemingly return after a period of time. As we’ve discussed in previous posts, the natural life of a database is growth: growth in data, growth in processing, growth in users. This growth requires more resources and if the environment is not performing as optimally and efficiently as possible, more resources will always be required.

Someday I will write my “No, the DBA isn’t going anywhere” opus, but today is not that day. Anyhow, this is a good post for anyone worried that automation will kill the DBA.

Linked Servers And Columnstore

Published 2017-09-18 by Kevin Feasel

Niko Neugebauer continues his columnstore series by looking at how columnstore indexes interact with linked servers:

Let us make sure everything is fine for data transfer and as we are using our source server (SQL Server 2014) with Linked Server to SQL Server 2016, let us insert a couple of ObjectIds to the T1 table that we have created in the [Test] database:

INSERT INTO [.\SQL16].Test.dbo.T1 (C1)
SELECT so.object_id
FROM sys.objects so;

This statement will result in the error message that you can find below, telling us something about Cursors (????):

Msg 35370, Level 16, State 1, Line 1
Cursors are not supported on a table which has a clustered columnstore index.

WHAT ? SORRY ? THERE ARE NO CURSORS HERE !
We have no cursors, we just have a Clustered Columnstore Index on our table!

Read on to see how to get around this error, to the extent that you can.

Using RAISERROR For Debug Info

Published 2017-09-18 by Kevin Feasel

Doug Lane exhorts people to use RAISERROR instead of PRINT when printing messages:

It wasn’t until a few years ago, when I started contributing to the First Responder Kit at Brent Ozar Unlimited, that I noticed every status message in the kit scripts was thrown with something other than PRINT.

Strange, I thought, since those scripts like to report on what statements are running. Turns out, they avoided PRINT because it has some serious drawbacks:

PRINT doesn’t necessarily output anything at the moment it’s called.

PRINT statements won’t show up in Profiler.

PRINT can’t be given variable information without CAST or CONVERT.

Those are important limitations, as Doug shows.

So You Want To Wait…

Published 2017-09-18 by Kevin Feasel

If you need your queries to be slower, Kenneth Fisher has you covered:

And in case you run into a development team that complains that when they time their code the duration is all over the place, this little gem will make sure their query will always take the same amount of time (assuming normal run time is under 90 seconds).

It’s the T-SQL equivalent of speed-up loops.

Adding Public Holidays To A Date Dimension

Published 2017-09-18 by Kevin Feasel

Reza Rad continues his series on Power BI date dimensions:

To get public holidays live, you first need an API that is giving you up-to-date information. There are some web pages that have the list of public holidays. I have already explained in another blog post how to use a web page and query public holidays from there. That method uses custom functions as well; you can read about that here.

The method of reading data from a web page has an issue already: the Web.Page function from Power Query is used to pull data from that page, and this function needs a gateway configuration to work. There is another function, Xml.Document, that can work even without the gateway. For this reason, we’ll use Xml.Document and get data from an API that provides the result set as XML.

WebCal.fi is a great free website with calendars for 36 countries which I do recommend for this example. This website provides the calendars in XML format. There are other websites that give you the calendar details through a paid subscription. However, this website is a great free one which can be used for this example. WebCal.fi is created by User Point Inc.
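The overall shape of the approach (fetch holidays as XML, then flag matching rows in the date dimension) can be sketched outside Power BI. The following uses Python's standard library in place of Power Query's Xml.Document, and the XML payload is entirely hypothetical: the element and attribute names are assumptions for illustration and are not WebCal.fi's actual format.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

# Hypothetical payload; element and attribute names are assumptions,
# not the real structure returned by any particular calendar service.
SAMPLE_XML = """
<holidays>
  <holiday date="2017-12-25" name="Christmas Day"/>
  <holiday date="2017-12-26" name="Boxing Day"/>
</holidays>
"""


def parse_holidays(xml_text):
    """Return a {date: holiday name} mapping from the XML text."""
    root = ET.fromstring(xml_text.strip())
    return {
        date.fromisoformat(node.get("date")): node.get("name")
        for node in root.iter("holiday")
    }


def build_date_dimension(start, end, holidays):
    """Build one row per day, flagging the days found in the holiday mapping."""
    rows = []
    day = start
    while day <= end:
        rows.append(
            {
                "date": day,
                "is_holiday": day in holidays,
                "holiday_name": holidays.get(day),
            }
        )
        day += timedelta(days=1)
    return rows


holidays = parse_holidays(SAMPLE_XML)
dim = build_date_dimension(date(2017, 12, 24), date(2017, 12, 27), holidays)
```

In a real setup the XML text would come from an HTTP call rather than a string literal, and the parsing would need to match whatever structure the chosen service actually returns.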

This was an interesting approach to the problem, one I did not expect when first reading the article. I figured it’d be some sort of date calculation script.

Creating A Simple Kafka Streams Application

Published 2017-09-15 by Kevin Feasel

Bill Bejeck has built a simple Kafka Streams application for us:

This blog post will quickly get you off the ground and show you how Kafka Streams works. We’re going to make a toy application that takes incoming messages and upper-cases the text of those messages, effectively yelling at anyone who reads the message. This application is called the yelling application.

Before diving into the code, let’s take a look at the processing topology you’ll assemble for this “yelling” application. We’ll build a processing graph topology, where each node in the graph has a particular function.

His entire application is 20 lines of code but it does function as a valid Kafka Streams app and works well as a demo.
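The topology idea (a source node feeding a processor node feeding a sink node) can be pictured without Kafka at all. What follows is a plain-Python analogy only: it is not the Kafka Streams API (which is a Java library) and not Bill's code, and the function names are made up for illustration.

```python
def source(messages):
    """Source node: emits each incoming (key, value) record."""
    yield from messages


def upper_case(records):
    """Processor node: upper-cases the value of every record."""
    for key, value in records:
        yield key, value.upper()


def sink(records):
    """Sink node: collects results (a real app would write to an output topic)."""
    return list(records)


# Hypothetical incoming messages with no keys.
incoming = [(None, "hello"), (None, "kafka streams")]

# Wire the nodes together: source -> processor -> sink.
yelled = sink(upper_case(source(incoming)))
```

Each node does one job and hands its output to the next, which is the graph structure the post describes before getting into the actual Streams code.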

Introduction To Bayesian Statistics

Published 2017-09-15 by Kevin Feasel

Kennie Nybo Pontoppidan has just completed a course on Bayesian statistics:

Last month I finished a four-week course on Bayesian statistics. I have always wondered why people deemed it hard, and why I heard that the computations quickly became complicated. The course wasn’t that hard, and it gave a nice introduction to prior/posterior distributions and in many cases also how to interpret the parameters in the prior distribution as extra data points.

An interesting aspect of Bayesian statistics is that it is a mathematically rigorous model, with no magic numbers such as the 5% threshold for p-values. And I like the way it naturally caters to sequential hypothesis testing where the sample size of each iteration is not fixed in advance. Instead, data are evaluated and used to update the model as they are collected.
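Both ideas (prior parameters as extra data points, and updating as data arrive) show up in the textbook Beta-Binomial model. Here is a small Python sketch under that assumption; the prior and the batch counts are invented, and this is not material from the course itself.

```python
def update(prior, successes, failures):
    """Conjugate Beta-Binomial update: add observed counts to the prior's parameters."""
    alpha, beta = prior
    return (alpha + successes, beta + failures)


def posterior_mean(params):
    """Mean of a Beta(alpha, beta) distribution."""
    alpha, beta = params
    return alpha / (alpha + beta)


# Beta(2, 2) prior: loosely, as if 2 successes and 2 failures were already seen.
prior = (2, 2)

# Illustrative batches of (successes, failures); batch sizes are not fixed in advance.
batches = [(3, 1), (0, 2), (5, 4)]

# Update after every batch as the data come in.
sequential = prior
for s, f in batches:
    sequential = update(sequential, s, f)

# Update once with all the data pooled together.
all_at_once = update(
    prior,
    sum(s for s, _ in batches),
    sum(f for _, f in batches),
)
```

The sequential and pooled updates land on the same posterior parameters, so in this model there is no penalty for looking at the data as it is collected.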

Check out Kennie’s explanation as well as the course. I also went through Bayes’ Theorem not too long ago, which is a good introduction to the topic if you’re unfamiliar with Bayes’s Law.
