Index-Based Statistics Updates

Michael Bourgon has a script to get information on statistics updates for stats based off of indexes:

Quickie, based off an earlier post. (http://thebakingdba.blogspot.com/2012/02/tuning-statistics-when-were-they.html)

Get the last 4 stat updates for every statistic based on an index. The filter is on the auto_created; flip that to get all the system

This does use the DBCC SHOW_STATISTICS command, which reminds me of a rant (though not about Michael’s code; it’s about the need to use this DBCC command rather than having a nice DMV which returns all of the relevant information).

Making Wide World Importers Bigger

Kevin Feasel

2016-08-15

Data

Koen Verbeeck wants bigger fact tables for Wide World Importers:

Microsoft released a new sample database a couple of months back: Wide World Importers. It’s quite great: not every (unnecessary feature) is included but only features you’d actually use, lots of sample scripts are provided and – most importantly – you can generate data until the current date. One small drawback: it’s quite tiny. Especially the data warehouse is really small. The biggest table, Fact.Order, has about 266,000 rows and uses around 280MB on disk. Your numbers may vary, because I have generated data until the current date (12th of August 2016) and I generated data with more random samples per day. So most likely, other versions of WideWorldImportersDW might be even smaller. That’s right. Even smaller.

260 thousand rows is nothing for a fact table.  I was hoping that the data generator would allow for a bigger range of results, from “I only want a few thousand records” like it does up to “I need a reason to buy a new hard drive.”  Koen helps out by giving us a script to expand the primary fact table.

Connecting Spark and Riak

Kevin Feasel

2016-08-15

Riak, Spark

Pavel Hardak discusses the Riak Connector for Apache Spark:

Modeled using principles from the “AWS Dynamo” paper, Riak KV buckets are good for scenarios which require frequent, small data-sized operations in near real-time, especially workloads with reads, writes, and updates — something which might cause data corruption in some distributed databases or bring them to “crawl” under bigger workloads. In Riak, each data item is replicated on several nodes, which allows the database to process a huge number of operations with very low latency while having unique anti-corruption and conflict-resolution mechanisms. However, integration with Apache Spark requires a very different mode of operation — extracting large amounts of data in bulk, so that Spark can do its “magic” in memory over the whole data set. One approach to solve this challenge is to create a myriad of Spark workers, each asking for several data items. This approach works well with Riak, but it creates unacceptable overhead on the Spark side.

This is interesting in that it ties together two data platforms whose strengths are almost the opposite:  one is great for fast, small writes of single records and the other is great for operating on large batches of data.

Kinesis Analytics

Ryan Nienhuis shows how to write SQL against Amazon Kinesis queues:

In your application code, you interact primarily with in-application streams. For instance, a source in-application stream represents your configured Amazon Kinesis stream or Firehose delivery stream in the application, which by default is named “SOURCE_SQL_STREAM_001”. A destination in-application stream represents your configured destinations, which by default is named “DESTINATION_SQL_STREAM”. When interacting with in-application streams, the following is true:

  • The SELECT statement is used in the context of an INSERT statement. That is, when you select rows from one in-application stream, you insert results into another in-application stream.
  • The INSERT statement is always used in the context of a pump. That is, you use pumps to write to an in-application stream. A pump is the mechanism used to make an INSERT statement continuous.

There are two separate SQL statements in the template you selected in the first walkthrough. The first statement creates a target in-application stream for the SQL results; the second statement creates a PUMP for inserting into that stream and includes the SELECT statement.

This is worth looking into if you use AWS and have a need for streaming data.

Temporal Database Theory

Kennie Pontoppidan reads and reviews a book on temporal database theory:

I have chosen to blog about Richard T. Snodgrass’ book “Developing time-oriented database applications in SQL.” I heard about this book last year around this time, when I started to investigate the new temporal feature “System versioned tables” in SQL Server 2016. I believe it was my old colleague Peter Gram from Miracle who pointed out the book to me, and usually when Peter recommends a book, I buy it and (eventually) read it. It was also about time (no pun intended), since I’m giving a talk on “All things time-related” for two SQL Saturdays during the next few months, and I needed to spice up the presentation with some new material.

In this blog post, I will quickly scratch down a few of the takeaways, I have taken from the book already.

Sounds like an interesting read.

Power BI Storage Regions

Adam Saxton explains where Power BI stores your data and how they choose the region:

This selection is what drives the location of where your data will be stored. Power BI will pick a data region closest to this selection. This selection CANNOT BE CHANGED! You will want to think about the location that makes the most sense for you.

For example, if the majority of your organization’s users are in Australia, and you are in the United States, it probably makes more sense to select Australia for the country. You may also have legal requirements that your organization’s data needs to be in a specific country.

The “cannot be changed” part means that this decision is a lot more important than you might first realize.

The Case For Self-Service BI

Matt Allington makes the case for self-service BI:

Success or failure of Enterprise BI can be shown as a continuum.

The 5 sample points I call out (from best to worst) are:

  1. It adds lots of value to lots of people.
  2. It’s OK, lots of “export to Excel”
  3. Some use, but not worth the cost
  4. It is a failure and it is written off
  5. It is a failure but you keep it.

Note what I list as the worst possible outcome.  The solution is no good, and no one does anything about it.  This is much worse than writing it off as a failure as you can’t move on if you don’t accept you have a problem.

This is a provocative article with some good comments.  I’ve mixed emotions about this, as I see Matt’s point and agree with him in the hypothetical scenario, but it’s really easy for business users to get the wrong answers from self-service tools (e.g., introducing hidden cartesian products or not applying all business rules to calculations) and give up on the product.  That might be a function of me doing it wrong and I’ll cop to that if so, but I think that self-service BI needs a “You must be this tall to ride” sign.

NOT IN NULL

Kevin Feasel

2016-08-15

T-SQL

Jen McCown warns against NULL in NOT IN statements:

The IF statement asks: “is 1 in the collection of values (2, NULL)?” And we must say, “ah, no idea. That can’t be proven. Therefore we can’t return true.”

The value 1 cannot be determined to not be in the set of (2, NULL).

NULL is a strange bird.

Missing Libraries With SQL Server R Services

Kevin Feasel

2016-08-15

R

Tomaz Kastrun has a script to check and install missing packages in SQL Server R Services code:

Result in this case will be successful with correct R results and sp_execute_external_script will not return error for missing libraries.

I added a “fake” library called test123 for testing purposes if all the libraries will be installed successfully.

At the end the script generated xp_cmdshell command (in one line)

This is a rather clever solution to a problem which I’d rather not exist.  There really ought to be a better way for authorized users programmatically to install packages.

Latch Promotion

Ewald Cress discusses latch promotion threshold calculations:

Now I wish I could use the phrase “cycle-based promotion threshold” in a tone that suggests we were all born knowing the context, but to be honest, I don’t yet have all the pieces. Here is the low-down as it stands in SQL Server 2014:

  • Everything I’m describing applies only to page latches.

  • A cycle-based promotion simply means one that is triggered by the observation that the average acquire time for a given page latch (i.e. the latch for a given page) has exceeded a threshold.

  • Because the times involved are so short, they are measured not in time units but in CPU ticks.

  • There exists a global flag that can enable cycle-based promotions, although I do not know what controls that flag.

  • If cycle-based promotion is disabled, there is another path to promotion; this will be be discussed in Part 4.

I don’t think I’d ever seen the informational message Ewald mentions, so this was a brand new topic to me.

Categories

August 2016
MTWTFSS
« Jul Sep »
1234567
891011121314
15161718192021
22232425262728
293031