Day: July 14, 2016

Indexes And Stats

Brent Ozar looks at a case when adding a suggested index monkeys with stats:

The query runs faster, make no mistake – but check out the estimates:

  • Estimated number of rows = 1
  • Actual number of rows = 165,367

Those estimates are built by SQL Server’s cardinality estimator (CE), and there have been major changes to it over the last couple of versions. You can control which CE you’re using by changing the database’s compatibility level. This particular StackOverflow database is running in 2016 compat mode – so what happens if we switch it back to 2012 compat mode?

Based on this result, there might be further optimizations available.  Read on for more of Brent’s thoughts.

Documenting Replication Using PowerShell

Shane O’Neill learned a bit of PowerShell in the process of documenting replication:

Now, due to an unfortunate incident when I was a Software Support Engineer that involved a 3 week old backup and a production database, I prefer not to use the GUI if I can help it.

I’m not joking about that, either: if there is ANY way that I can accomplish something with scripts instead of the GUI, I will take it!

Especially when the need was to document the properties of over 100 articles, I was particularly not looking forward to opening the article properties window for each of the articles and copying them out individually.

Check it out.

Use Source Control

James Anderson wants you to use source control:

SSC and SSDT require the use of compare tools to build deployment scripts. This is referred to as a state based migration. I’d done deployments like this in the past and saw that people reviewing the release found it difficult to review these scripts when the changes were more than trivial. For this reason, I decided to look at some migration based solutions. Migration solutions generate scripts during the development process that will be used to deploy changes to production. This allows the developer to break the changes down into small manageable individual scripts which in turn makes code reviews easier and deployments feel controlled. These scripts sit in the VS project and are therefore source controlled in the same way as the database.

James recommends Git here.  I’m not Git’s biggest fan, but it’s much, much better than not having any source control at all.
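
To make the migration-based idea concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for SQL Server, and the `schema_version` table, the `apply_migrations` function, and the three sample scripts are all hypothetical names chosen for illustration; they are not taken from James's post or from any particular migration tool.

```python
import sqlite3

# Hypothetical migrations: in a real project each entry would be its own
# small, source-controlled script, numbered in the order it was written.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
    (3, "CREATE INDEX ix_customer_email ON customer (email)"),
]


def apply_migrations(conn, migrations):
    """Run every migration not yet recorded; return the versions applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    applied = []
    for version, sql in sorted(migrations):
        if version in done:
            continue  # already deployed to this database
        conn.execute(sql)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        applied.append(version)
    conn.commit()
    return applied


conn = sqlite3.connect(":memory:")
first_run = apply_migrations(conn, MIGRATIONS)
second_run = apply_migrations(conn, MIGRATIONS)  # nothing left to apply
print(first_run, second_run)
```

Because each script is small and applied exactly once, a reviewer reads three short changes rather than one large generated diff, which is the reviewing benefit the quote describes.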

The Basics Of Notebooks

I have a quick walkthrough of notebooks:

Remember chemistry class in high school or college?  You might remember having to keep a lab notebook for your experiments.  The purpose of this notebook was two-fold:  first, so you could remember what you did and why you did each step; second, so others could repeat what you did.  A well-done lab notebook has all you need to replicate an experiment, and independent replication is a huge part of what makes hard sciences “hard.”

Take that concept and apply it to statistical analysis of data, and you get the type of notebook I’m talking about here.  You start with a data set, perform cleansing activities, potentially prune elements (e.g., getting rid of rows with missing values), calculate descriptive statistics, and apply models to the data set.

I didn’t realize just how useful notebooks were until I started using them regularly.
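
As a rough illustration of that workflow, the sketch below walks through the same steps in plain Python: start with a data set, prune rows with missing values, calculate descriptive statistics, and fit a simple model. The readings are made up for the example, and the least-squares line is just one possible choice of model, not something prescribed by the walkthrough.

```python
import statistics

# Made-up sample data: (temperature_c, reaction_time_s).
# None marks a missing reading.
raw = [
    (20.0, 41.0),
    (25.0, 35.5),
    (None, 33.0),
    (30.0, 30.0),
    (35.0, None),
    (40.0, 21.5),
]

# Step 1: cleanse and prune, dropping any row with a missing value.
complete = [(t, r) for t, r in raw if t is not None and r is not None]
temps = [t for t, _ in complete]
times = [r for _, r in complete]

# Step 2: descriptive statistics on the cleaned data.
summary = {
    "n": len(complete),
    "mean_time": statistics.mean(times),
    "stdev_time": statistics.stdev(times),
}

# Step 3: apply a model, here an ordinary least-squares line
# reaction_time = intercept + slope * temperature.
mean_t = statistics.mean(temps)
mean_r = statistics.mean(times)
slope = sum((t - mean_t) * (r - mean_r) for t, r in complete) / sum(
    (t - mean_t) ** 2 for t in temps
)
intercept = mean_r - slope * mean_t

print(summary)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

In a notebook, each of these steps would sit in its own cell next to a note on why it was done, which is what lets someone else repeat the analysis.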

Going From Pig To Spark

Philippe de Cuzey introduces Spark to people already familiar with Pig:

I like to think of Pig as a high-level Map/Reduce commands pipeline. As a former SQL programmer, I find it quite intuitive, and at my organization our Hadoop jobs are still mostly developed in Pig.

Pig has a lot of qualities: it is stable, scales very well, and integrates natively with the Hive metastore HCatalog. By describing each step atomically, it minimizes conceptual bugs that you often find in complicated SQL code.

But sometimes, Pig has some limitations that make it a poor programming paradigm to fit your needs.

Philippe includes a couple of examples in Pig, PySpark, and SparkSQL.  Even if you aren’t familiar with Pig, this is a good article to help familiarize yourself with Spark.

Comparing Nullable Columns

Daniel Hutmacher shows an elegant way to compare multiple nullable columns on two tables:

Because we’ve added OR conditions into the mix, we’re forced to use the Nested Loop join, which loops over table B for every single row in A. That’s a lot of index scans and it comes with a hefty price tag.

Here’s an absolutely eye-wateringly beautiful pattern that I found on the Interwebs (though I forgot where) the other day.

This is an interesting use of INTERSECT.  Check it out.
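
For a feel of why INTERSECT helps, here is a small sketch that assumes the pattern takes the form `EXISTS (SELECT a.cols INTERSECT SELECT b.cols)`. It runs against an in-memory SQLite database from Python rather than SQL Server, and the tables `a` and `b` with columns `c1` and `c2` are invented for the example. It relies on set operators treating NULLs as equal when comparing rows, whereas `=` never matches a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER, c1 TEXT, c2 TEXT);
    CREATE TABLE b (id INTEGER, c1 TEXT, c2 TEXT);
    INSERT INTO a VALUES (1, 'x', 'y'), (2, 'x', NULL), (3, NULL, NULL);
    INSERT INTO b VALUES (10, 'x', 'y'), (20, 'x', NULL), (30, NULL, 'z');
    """
)

# Plain equality: any comparison involving NULL is unknown, so the
# ('x', NULL) rows never match each other.
plain = conn.execute(
    """
    SELECT a.id, b.id
    FROM a JOIN b ON a.c1 = b.c1 AND a.c2 = b.c2
    ORDER BY a.id
    """
).fetchall()

# INTERSECT compares whole rows and treats NULLs as equal, so no
# "OR (x IS NULL AND y IS NULL)" conditions are needed.
null_safe = conn.execute(
    """
    SELECT a.id, b.id
    FROM a JOIN b
      ON EXISTS (SELECT a.c1, a.c2 INTERSECT SELECT b.c1, b.c2)
    ORDER BY a.id
    """
).fetchall()

print(plain)
print(null_safe)
```

This only demonstrates the matching semantics; the join strategy and performance that Daniel discusses are specific to SQL Server's optimizer and are not reproduced here.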

Shortest Paths

Sanjay Mishra and Arvind Shyamsundar tie R to SQL Server to solve a pathfinding problem faster:

In such a case, if a developer were to implement Dijkstra’s algorithm to compute the shortest path within the database using T-SQL, then they could use approaches like the one at Hans Oslov’s blog. Hans offers a clever implementation using recursive CTEs, which functionally does the job well. This is a fairly complex problem for the T-SQL language, and Hans’ implementation does a great job of modelling a graph data structure in T-SQL. However, given that T-SQL is mostly a transaction and query processing language, this implementation isn’t very performant, as you can see below.

The important thing to remember is that these technologies tend to complement each other rather than supplant one another.
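
For readers who have not seen Dijkstra's algorithm outside of T-SQL, the sketch below shows the usual priority-queue formulation in a general-purpose language. It is written in Python rather than the R used in the linked post, and the `shortest_path` function and the small sample graph are made up for illustration.

```python
import heapq


def shortest_path(graph, source, target):
    """Return (distance, path) from source to target, or (None, []) if unreachable.

    graph maps each node to a dict of {neighbor: non-negative edge weight}.
    """
    dist = {source: 0}
    prev = {}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            break  # the first time the target is popped, its distance is final
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in dist:
        return None, []
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


# Hypothetical directed graph with weighted edges.
graph = {
    "A": {"B": 7, "C": 9, "F": 14},
    "B": {"C": 10, "D": 15},
    "C": {"D": 11, "F": 2},
    "D": {"E": 6},
    "F": {"E": 9},
}

distance, path = shortest_path(graph, "A", "E")
print(distance, path)
```

The heap and the dictionaries are the kind of in-memory structures that a procedural language handles naturally and a recursive CTE has to emulate, which fits the point about using each technology where it is strongest.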

Azure SQL Data Warehouse GA

James Serra notes that Azure SQL Data Warehouse is now generally available:

In brief, SQL DW is a fully managed data-warehouse-as-a-service that you can provision in minutes and scale up in seconds.  With SQL DW, storage and compute scale independently.  You can dynamically deploy, grow, shrink, and even pause compute, allowing for cost savings.  Also, SQL DW uses the power and familiarity of T-SQL so you can integrate query results across relational data in your data warehouse and non-relational data in Azure blob storage or Hadoop using PolyBase.  SQL DW offers an availability SLA of 99.9%, making it the only public cloud data warehouse service that offers an availability SLA to customers.

The pricing calculator now reflects GA prices.

Desired Enhancements

Jason Brimhall has a list of some nice SQL Server updates and bug fixes, followed by some things he’d like to see:

Digging a little deeper on this one. I would really love to see an enhancement to Resource Governor. Not just any enhancement will do. I need it to be enhanced so it will also affect the reporting services engine and the integration services engine in addition to the database engine. I want to be able to use RG to prevent certain reports from over consuming resources within the SSRS engine. Or for that matter, I want to make sure certain SSIS packages do not consume too much memory. If I can implement constraints on resources for these two engines it would be a huge improvement.

Check it out.
