Press "Enter" to skip to content

Author: Kevin Feasel

When Join Order Matters

Bert Wagner takes a look at one of the lesser-appreciated tricks in performance tuning:

I had a great question submitted to me (thank you Brandman!) that I thought would make for a good blog post:

…I’ve been wondering if it really matters from a performance standpoint where I start my queries. For example, if I join from A-B-C, would I be better off starting at table B and then going to A & C?

The short answer: Yes. And no.

One of my favorite query tuning books is SQL Tuning by Dan Tow.  Parts of it are rather dated at this point—like pretty much anything involving a rule-based optimizer—but the gist still works well.  What it comes down to is finding the best single table from which to drive your query (based on table size, filters, etc.) and selecting the appropriate join order afterward.  It’s a fairly time-consuming effort, but for the 0.5-1% of queries which really need it, it can be the difference between an awful plan and a good one.
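
If you want to experiment with this on a specific query, one option is to compare the optimizer’s default plan against a plan with a forced join order. Here is a minimal sketch, assuming hypothetical tables A, B, and C that join on a shared key from B; with the FORCE ORDER hint, SQL Server joins the tables in the order they appear in the query, so writing B first makes it the driving table.

-- Hypothetical tables A, B, and C; the column and filter names are placeholders.
-- FORCE ORDER makes the optimizer join in the written order: B, then A, then C.
SELECT a.ColumnFromA, b.ColumnFromB, c.ColumnFromC
FROM dbo.B AS b
    INNER JOIN dbo.A AS a
        ON a.BID = b.BID
    INNER JOIN dbo.C AS c
        ON c.BID = b.BID
WHERE b.SelectiveFilter = 1
OPTION (FORCE ORDER);

Comparing the estimated costs and actual I/O of the forced plan against the default plan gives a quick read on whether starting from B actually helps, without permanently hinting the query.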


T-SQL Tuesday Roundup

Ewald Cress has what might have been the largest T-SQL Tuesday ever:

Firstly, I want to thank every person who took part. SIXTY TWO blog posts got generated, including a few first-time #tsql2sday contributors as well as first-time bloggers. I am fairly glowing to have been a part of it, and I hope the other contributors are too.

Secondly, from my own experience in writing a post, I know it feels terrible when you start worrying about who to pick. There are many people I could have included, but I hope I have made my appreciation for them clear elsewhere. Not that I want to speak on your behalf, but I’ll assume that the same applies to many other contributors.

That’s a lot of reading.


Importing SSMS Registered Servers Into SQL Operations Studio

Drew Furgiuele has a hankering for SQL Operations Studio and wants to invite a few servers to the party:

One barrier to entry is that the initial setup can be a little daunting, especially if you use local connection groups or central management servers to keep track of registered connections in SQL Server Management Studio. You’d be in for a lot of manual clicking and typing if you have a lot of saved connections. But there’s a better way: you can import all that saved information right into SQL Operations Studio, and it’s pretty painless, too. Buckle up, because this involves a little knowledge of how settings are saved in Operations Studio, and how we can quickly get saved connection information out of SSMS and into your new application. Spoiler alert: we’re going to use PowerShell.

I’d love to see CMS support in SQL Operations Studio.  In the meantime, this is a more or less reasonable alternative, depending upon how many servers you have and how frequently they change.


Clusterless Availability Groups For Scaling Out Reads

Sean Gallardy shows a good use case for Availability Groups in scaling out reads:

Read-Scale availability groups are ones where we don’t want the availability group for high availability or disaster recovery; instead, we want to use it to create multiple copies of our databases spanning multiple servers, allowing a large read-only workload to be spread across them. There are various scenarios where this might be extremely valuable, and in previous versions of SQL Server it was possible, though there was a requirement of using Windows Server Failover Clustering (WSFC). Read-Scale availability groups do not require the WSFC component and do not give high availability or disaster recovery; they only act as a mechanism (availability groups) to facilitate the synchronization of the databases across multiple servers.

To reiterate, this is not used for high availability or disaster recovery, but rather to scale your databases across multiple servers for read workloads.

The remainder of the post shows how to set up an Availability Group without the corresponding Windows Server Failover Clustering components.
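
As a rough idea of the syntax involved, here is a minimal sketch for SQL Server 2017 or later; the availability group name, database, server names, and endpoints are all hypothetical, and the mirroring endpoints and database seeding still need to be handled on each replica.

-- CLUSTER_TYPE = NONE: no WSFC and no automatic failover; the AG only synchronizes data.
CREATE AVAILABILITY GROUP [ReadScaleAG]
WITH (CLUSTER_TYPE = NONE)
FOR DATABASE [SalesDB]
REPLICA ON
    N'SQLNODE1' WITH (
        ENDPOINT_URL = N'tcp://sqlnode1.contoso.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,  -- manual is required when there is no cluster
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL)
    ),
    N'SQLNODE2' WITH (
        ENDPOINT_URL = N'tcp://sqlnode2.contoso.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL)
    );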


Picking A Python IDE

Kevin Jacobs reviews a few Python IDEs from the perspective of a data scientist:

Ladies and gentlemen, this is one of the most perfect IDEs for editing your Python code! At least in my opinion. Jupyter Notebook is a web-based code editor and can quickly generate visualizations. You can mix code and text containing no, simple, or complex mathematics. One thing I am missing here is support for code completion, but there are tons of plugins available, so this should be no problem. It is also easy to turn your notebook into a presentation. For collaboration with non-technical teams, this is a great tool.

Conclusion: perfect Python IDE for data science! Less support for code inspection.

Click through for reviews of three IDEs.


Data Type Conversions In 4 Database Systems

Eleni Markou has samples for converting strings to dates, numerals, or currency in SQL Server, Postgres, Redshift, and BigQuery:

The TO_DATE function in PostgreSQL is used to convert strings into dates. Its syntax is TO_DATE(text, text) and the return type is a date.

In contrast with MS SQL Server, which has strictly specified date formats, in Redshift any format constructed using the patterns in the table found in the corresponding documentation can be correctly interpreted.

When using TO_DATE(), one has to pay attention: even if an invalid date is passed, it will be converted into a nominally valid date without raising any error.

There are a few other tricks in SQL Server for some of these (for example, on 2012 or newer, I’d use TRY_CONVERT rather than CONVERT).  That said, it’s a good overview of how to translate skills from one relational system to another.
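
As a quick illustration of the TRY_CONVERT point, the failing expressions below come back as NULL, where a plain CONVERT would raise a conversion error and stop the batch:

-- TRY_CONVERT (SQL Server 2012 and newer) returns NULL on failure instead of erroring.
SELECT
    TRY_CONVERT(date, '2018-02-28')          AS ValidDate,    -- 2018-02-28
    TRY_CONVERT(date, '2018-02-30')          AS InvalidDate,  -- NULL: no February 30th
    TRY_CONVERT(decimal(10, 2), '$1,234.50') AS BadDecimal,   -- NULL: decimal rejects $ and commas
    TRY_CONVERT(money, '$1,234.50')          AS GoodMoney;    -- 1234.50: money accepts them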


Handling Imbalanced Data

Tom Fawcett shows us how to handle a tricky classification problem:

The primary problem is that these classes are imbalanced: the red points are greatly outnumbered by the blue.

Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20%. In reality, datasets can get far more imbalanced than this. Here are some examples:

  1. About 2% of credit card accounts are defrauded per year. (Most fraud detection domains are heavily imbalanced.)
  2. Medical screening for a condition is usually performed on a large population of people without the condition, to detect a small minority with it (e.g., HIV prevalence in the USA is ~0.4%).
  3. Disk drive failures are approximately 1% per year.
  4. The conversion rate of online ads has been estimated to lie between 10^-3 and 10^-6.
  5. Factory production defect rates typically run about 0.1%.

Many of these domains are imbalanced because they are what I call needle-in-a-haystack problems, where machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases.

Read on for some good advice on how to handle imbalanced data.


Using DMVs To Plan Out Your Indexes

Eric Blinn explains how to use two particular DMVs to see which index changes you might want to make:

Missing Indexes

This group of DMVs records every scan and large key lookup.  When the optimizer declares that there isn’t an index to support a query request, it generally performs a scan.  When this happens, a row is created in the missing index DMV showing the table and columns that were scanned.  If that exact same index is requested a second time, by the same query or another similar query, then the counters are increased by 1.  That value will continue to grow if the workload continues to call for the index that doesn’t exist.  It also records the cost of the query with the table scan and a suspected percentage improvement if only that missing index had existed.  The query below calculates those values together to determine a value number.

Click through for sample scripts for this and the index usage stats DMV.  The tricky part is to synthesize the results of these DMVs into the minimum number of viable indexes.  Unlike the optimizer—which is only concerned with making the particular query that ran faster—you have knowledge of all of the queries in play and can find commonalities.
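
For reference, the general shape of a query against the missing index DMVs looks something like the sketch below. This is not the author’s exact script, and the weighting formula at the end is just one common convention for ranking suggestions.

-- Combine the missing index DMVs and weight each suggestion by cost, impact, and request count.
SELECT
    mid.statement AS TableName,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks + migs.user_scans AS TimesRequested,
    migs.avg_total_user_cost,
    migs.avg_user_impact,  -- the suspected percentage improvement mentioned above
    migs.avg_total_user_cost * migs.avg_user_impact
        * (migs.user_seeks + migs.user_scans) AS ValueEstimate
FROM sys.dm_db_missing_index_details AS mid
    INNER JOIN sys.dm_db_missing_index_groups AS mig
        ON mig.index_handle = mid.index_handle
    INNER JOIN sys.dm_db_missing_index_group_stats AS migs
        ON migs.group_handle = mig.index_group_handle
ORDER BY ValueEstimate DESC;

Keep in mind that these counters reset when the instance restarts, so they only reflect the workload since the last restart.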


What Update Locks Do

Guy Glantser explains the process around updating data in SQL Server, particularly the benefit of having update locks:

In order to update a row, SQL Server first needs to find that row, and only then can it perform the update. So every UPDATE operation is actually split into two phases – first read, and then write. During the read phase, the resource is locked for read, and then it is converted to a lock for write. This is better than just locking for write all the way from the beginning, because during the read phase, other sessions might also need to read the resource, and there is no reason to block them until we start the write phase. We already know that the SHARED lock is used for read operations (phase 1), and that the EXCLUSIVE lock is used for write operations (phase 2). So what is the UPDATE lock used for?

If we used a SHARED lock for the duration of the read phase, then we might run into a deadlock when multiple sessions run the same UPDATE statement concurrently.

Read on for more details.
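
The same two-phase pattern shows up when you split the read and the write yourself. A minimal sketch, assuming a hypothetical dbo.Accounts table: asking for an update lock on the read keeps two sessions from both holding shared locks on the same row and then deadlocking when each tries to convert to an exclusive lock.

BEGIN TRANSACTION;

DECLARE @Balance money;

-- Phase 1: read. UPDLOCK takes a U lock, so other readers can still take S locks,
-- but no other session can also take a U or X lock on this row.
SELECT @Balance = Balance
FROM dbo.Accounts WITH (UPDLOCK)
WHERE AccountID = 42;

-- Phase 2: write. The U lock converts to an X lock for the actual modification.
UPDATE dbo.Accounts
SET Balance = @Balance - 100.00
WHERE AccountID = 42;

COMMIT TRANSACTION;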
