Press "Enter" to skip to content

Curated SQL Posts

Don’t Over-Index

Kenneth Fisher shows the downside cost of non-clustered indexes:

I want to make a couple of final points. I realize 99 indexes is a lot. It’s to emphasize the differences. However they were also fairly small indexes and this is a single table where a normal database might easily have hundreds. So take these results as an example. They aren’t going to match real life but will hopefully show you how all of this can play out.

Indexes are awesome but you want to be smart about adding them. My personal rule of thumb, with no scientific evidence behind it, is 5 indexes or less and I’m pretty easy. 5-10 indexes and you’ll have to convince me. I’m going to be reviewing the existing indexes and see what I can get rid of, or maybe I can combine something. Past 10 indexes and it had best be for a query that’s running a 100+ times a minute or something for the CEO.

Read on for demo code and specific results.

Comments closed

SQL Server R Services Memory Usage

Ginger Grant looks at how SQL Server R Services handles memory allocation:

While R is an open source language, there are a number of different versions of R and each handles memory a little differently. Knowing which version is being used is important, especially when the code is going to be migrated to a server. As part of a SQL Server implementation, there are three different versions of R which come into play. The first is standard open source R, commonly known as CRAN R. This is the standard open source version of R which runs code in memory and is single threaded. The next version which will be installed as part of a SQL Server Installation is Microsoft R Open. This version of R was written to take advantage of the Intel Math Kernel Libraries [MLK]. Using the libraries speeds up many statistical calculations which use matrix operations. It also adds multi-threading capability to R as the rewrite provides the ability to use all available cores and processors and process in parallel. More information on how it works and how much faster Microsoft R Open is compared to standard R is available here. To use Microsoft R Open, once it is installed, in Rstudio should automatically start using it. To check out what version of R that is in use, within R Studio, go to Tools->Global Options and look at the R version.

If you’re concerned about R Services taking up too much server memory, you should look at Resource Governor.

Comments closed

Securing Solr Collections

Jan Kunigk and Paul Wilkinson show how to secure Solr collections:

The policy shown below establishes four Sentry roles based on the admin, operators, users, and techusersgroups.

  • Administrators are entitled to all actions.

  • Operators are granted update and query privileges.

  • Users are granted query privileges.

  • Tech users are granted update privileges.

These are pretty straightforward role-based access controls.  The authors also look at accessing the data via Flume and a couple other technologies.

Comments closed

Data Curation

Christina Prevalsky makes the case for data curation:

The gaining popularity of self-service analytical tools such as Tableau increases the necessity of having curated data in your database. These tools aim to allow the end users to intuitively query data “at the speed of thought” from the data warehouse and visualize the results quickly. That type of capability allows users to go through several different iterations of the data to really explore the data and generate unique insights. These tools do not work well when the underlying database tables have not been curated properly.

This is a difficult and lengthy process, but it’s vital; data minus context is a lot less relevant than you’d hope.

Comments closed

Optimizing S3 For High Concurrency

Aaron Friedman looks at how to optimize highly-concurrent, distributed workloads writing data to S3 buckets:

S3 is a massively scalable key-based object store that is well-suited for storing and retrieving large datasets. Due to its underlying infrastructure, S3 is excellent for retrieving objects with known keys. S3 maintains an index of object keys in each region and partitions the index based on the key name. For best performance, keys that are often read together should not have sequential prefixes. Keys should be distributed across many partitions rather than on the same partition.

For large datasets like genomics, population-level analyses of these data can require many concurrent S3 reads by many Spark executors. To maximize performance of high-concurrency operations on S3, we need to introduce randomness into each of the Parquet object keys to increase the likelihood that the keys are distributed across many partitions.

Reading the title, I wanted it to be an article on knobs to turn in S3 to maximize read performance.  It’s still an article well worth reading, but focuses from the other side:  how to write to S3 without stepping on your own toes.

Comments closed

Working With Topics In Kafka

I show how to do the basics of creating, deleting, and pushing messages on topics in Apache Kafka:

There are three important things here:  first, our Zookeeper port is 2181.  Zookeeper is great for centralized configuration and coordination; if you want to learn more, check out this Sean Mackrory post.

The second bit of important information is how long our retention period is.  Right now, it’s set to 7 days, and that’s our default.  Remember that messages in a Kafka topic don’t go away simply because some consumer somewhere accessed them; they stay in the log until we say they can go.

Finally, we have a set of listeners.  For the sandbox, the only listener is on port 6667.  We connect to listeners from our outside applications, so knowing those addresses and ports is vital.

This is still quick-start level stuff, but I’m building up to custom development, honest!

Comments closed

Unit Testing With SSDT And tSQLt

Steph Locke looks at adding tSQLt to an existing SQL Server Data Tools database project:

Whilst I won’t be showing code in this, there is a companion sample database project. This is on GitHub and each key stage is shown by a branch of work. This means you can jump in at most stages and work from there. If you need some git advice, check out my quick git in 5 presentation.

The core DB is a super hero themed database.

Database unit testing involves a lot of compromises (because most of what we care about as database professionals tends to be integration tests rather than unit tests), but having imperfect tests generally beats having no tests, so check this out.

Comments closed

Windows Server 2016 Licensing

Brent Ozar looks at an ugly aspect of Windows Server 2016 licensing:

Windows Server 2016 is now licensed by the core, but that’s not really the big problem. The bigger problem is buried in the licensing guide (PDF):

  • “A minimum of 16 core licenses is required for each server.”
  • “A minimum of 8 core licenses is required for each physical processor.”

When most sysadmins see that, they’ll think, “Okay, so I shouldn’t bother buying a server smaller than 2 8-core processors.”

There are plenty of scenarios in which this doesn’t hurt (much):  mainly when you need a hefty server with more than 16 cores, or when you are running in a virtualized environment and can split that hardware across a number of logical servers.

As for me, this is one reason why I’m looking forward to SQL Server on Linux.

Comments closed

Format Your Code

Grant Fritchey looks at some poorly-formatted code:

You are going to cause all sorts of problems if you write code like this. First off, if you really do have these three queries within the same stored procedure, how hard will it be to confuse which is table ‘a’ in each of the queries when you go back to edit them? Pretty easy.

It gets worse though. I know that none of us will ever write a query that exceeds 26 tables in a JOIN… well, except that one time… and that other time. In fact, it happens. Oh, it’s not always a good thing, but it’s a thing. How do we respond to that? I’ve seen this:

Some of this stuff is egregious.  Some of it is debatable.  But either way, well-formatted code helps prevent bugs and aid understanding of queries.

Comments closed

Syncing Logins Between AG Replicas

Daniel Hutmacher has a new series on Availability Group synchronization, starting with logins:

You’ll need a linked server from your secondary replica to the primary replica. This linked server should feature the absolute bare-minimum of permissions, preferably with only the “be made using the login’s current security context” selected.

The account running the procedure (or the mapped login in the linked server) will need SELECT access to the following DMVs on the remote (primary) server:

  • master.sys.server_principals

  • master.sys.sql_logins

  • master.sys.server_role_members

  • master.sys.server_permissions

The code itself is a download from Daniel’s website; go check it out.

Comments closed