Press "Enter" to skip to content

Curated SQL Posts

Columnstore Basics

Sunil Agarwal has a couple of posts explaining columnstore indexes.  First, how columnstore indexes differ from classic B-tree indexes:

  • Index Fragmentation: For rowstore-based indexes, an index is considered fragmented if (a) the physical order of pages is out of sync with the index-key order, or (b) the data pages (clustered index) or index pages (nonclustered index) are partially filled. A fragmented index leads to significantly more physical I/O and can put more pressure on memory, which can ultimately slow down queries. Most organizations run a periodic index maintenance job to defragment indexes. For details, please refer to https://msdn.microsoft.com/en-us/library/ms189858.aspx#Fragmentation for best practices on maintaining B-tree indexes. For a columnstore index, the index is considered fragmented if (a) 10% or more of the rows in a compressed rowgroup are marked as deleted, or (b) one or more smaller compressed rowgroups can be combined into a larger compressed rowgroup such that the resulting rowgroup has no more than 1 million rows. Note that if a compressed rowgroup has fewer than 1 million rows due to dictionary size, it is not considered fragmented because there is nothing that can be done to increase its size. Also recall that a columnstore index consists of zero or more delta rowgroups, as shown in the picture below.
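A hedged way to check that deleted-row threshold in SQL Server 2016 is to query the sys.dm_db_column_store_row_group_physical_stats DMV; the table name below is hypothetical and the sketch only looks at compressed rowgroups:

    -- Share of deleted rows per compressed rowgroup (>= 10% suggests fragmentation)
    SELECT OBJECT_NAME(object_id)              AS table_name,
           row_group_id,
           total_rows,
           deleted_rows,
           100.0 * deleted_rows / total_rows   AS deleted_pct
    FROM sys.dm_db_column_store_row_group_physical_stats
    WHERE object_id = OBJECT_ID('dbo.FactSales')
      AND state_desc = 'COMPRESSED';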

Also, clustered and non-clustered columnstores:

SQL Server 2016 provides two flavors of columnstore index: the clustered (CCI) and nonclustered (NCCI) columnstore index. As shown in the simplified picture below, both indexes are organized as columns, but an NCCI is created on top of an existing rowstore table (shown on the right side of the picture), while a table with a CCI does not have a separate rowstore table. Both kinds of table can also have one or more B-tree nonclustered indexes.
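As a rough illustration of the difference, here is a minimal T-SQL sketch with hypothetical table and index names: a CCI becomes the table's primary storage, while an NCCI is layered over a subset of columns of an existing rowstore table:

    -- Clustered columnstore: the table itself is stored as columnstore
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
        ON dbo.FactSales;

    -- Nonclustered columnstore: added on top of an existing rowstore table
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_Orders
        ON dbo.Orders (OrderDate, CustomerID, Quantity, UnitPrice);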

If you haven’t looked at columnstore indexes yet, 2016 is a great time to start.


Fashion Design And Genetic Algorithms

Daragh Sibley, et al, discuss using genetic algorithms to help design clothing:

Recombination. Having selected a set of high-performing blouses, we can now consider how they should be recombined to form a new child. While a traditional genetic algorithm would stochastically search all combinations over many market generations, we can shortcut that process by algorithmically looking for features that have been historically preferred by our target client segment.

To achieve this, we find statistical regularities between the population of blouses’ attributes (or configurations of attributes) and client feedback. For instance, we can model the relationship between attributes of our existing blouses and client feedback via:

Genetic algorithms (and Koza-style genetic programming) have long been a favorite topic of mine.  Applying a GA to fashion was not something that had come to mind, but it is a very interesting application.


Altering Columns In Large Tables

Kenneth Fisher discusses a problem he had with altering a column on a large table:

My first attempt was just a straight ALTER TABLE ALTER COLUMN. After about an hour I got back a log-full error. I then tried a 200 GB log and a 350 GB log. These failed at 3 and 5 hours, respectively. While this was going on, I checked on #sqlhelp to see if anyone knew any way to minimize the log usage so my command would finish.

The primary suggestions were:

  • Add a new column to the end of the table, populate it in batches, then remove the old column.
  • Create a new table, populate it, index it, drop the old table, rename the new table.

I will say that I have used suggestion #1 several times, particularly in zero down-time situations.  Once you’re done backfilling the column, you can drop the old one and rename the new one in a single transaction.  Read on for Kenneth’s solution.
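Here is a minimal sketch of that add-and-backfill pattern, with hypothetical table and column names; it assumes the old column is NOT NULL so the batched loop eventually finishes:

    -- 1. Add the new column as nullable so the ALTER is a metadata-only change
    ALTER TABLE dbo.BigTable ADD NewCol BIGINT NULL;

    -- 2. Backfill in small batches to keep each transaction's log usage modest
    DECLARE @rows INT = 1;
    WHILE @rows > 0
    BEGIN
        UPDATE TOP (50000) dbo.BigTable
        SET NewCol = CAST(OldCol AS BIGINT)
        WHERE NewCol IS NULL;
        SET @rows = @@ROWCOUNT;
    END;

    -- 3. Swap the columns in a single short transaction
    BEGIN TRAN;
        ALTER TABLE dbo.BigTable DROP COLUMN OldCol;
        EXEC sp_rename 'dbo.BigTable.NewCol', 'OldCol', 'COLUMN';
    COMMIT;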


Top-Down ETL With Powershell

Max Trinidad continues his series on top-down ETL using Powershell:

After all the previous functions have been loaded, just type the following one-liner:

Process-PSObjectToSQL -SQLDataObj $SQLData;

This sample script can serve as a starting template for loading data into SQL Server.

This sample SQL data load will fail. Here’s where Try/Catch/Finally works for you, trapping what went wrong and providing the additional information you need to troubleshoot and fix the problem.

Parts one and two are available as well.


Parallelism Configuration Options

Kendra Little discusses max degree of parallelism and cost threshold for parallelism:

When you run a query, SQL Server estimates how “expensive” it is in a fake costing unit, let’s call it Estimated QueryBucks.

If a query’s Estimated QueryBucks is over the “Cost Threshold for Parallelism” setting in SQL Server, it qualifies to potentially use multiple processors to run the query.

The number of processors it can use is defined by the instance level “Max Degree of Parallelism” setting.

When writing TSQL, you can specify maxdop for individual statements as a query hint, to say that if that query qualifies to go parallel, it should use the number of processors specified in the hint and ignore the server level setting. (You could use this to make it use more processors, or to never go parallel.)
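As a quick illustration, both settings live at the instance level via sp_configure, and the hint overrides them per statement; the values and table name below are hypothetical:

    -- Instance-level settings (both are advanced options)
    EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
    EXEC sp_configure 'cost threshold for parallelism', 50;  RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 4;  RECONFIGURE;

    -- Per-statement override: this query will not go parallel even if it qualifies
    SELECT COUNT(*)
    FROM dbo.SomeLargeTable
    OPTION (MAXDOP 1);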

Read the whole thing, or watch/listen to the video.


Storm 1.0 Microbenchmarks

Roshan Naik and Sapin Amin have Storm 1.0 benchmarks on a small cluster:

Numbers suggest that Storm has come a long way in terms of performance, but it still has room to go faster. Here are some of the broad areas that should improve performance in the future:

  • An effort to rewrite much of Storm’s Clojure code in Java is underway. Profiling has shown many hotspots in Clojure code.

  • Better scheduling of workers. Yahoo is experimenting with a Load Aware Scheduler for Storm to be smarter about the way in which topologies are scheduled on the cluster.

  • Based on microbenchmarking and discussions with other Storm developers, there appears to be potential for streamlining the internal queueing for faster message transfer.

  • Operator coalescing (executing consecutive spouts/bolts in a single thread when possible) is another area that would reduce intertask messaging and improve throughput.

Even with these potential improvements, Storm has come a long way—their benchmarks show around 5x throughput and a tiny fraction of the latency of Storm 0.9.1.


New JDBC Driver

Microsoft has released a new version of their SQL Server JDBC driver:

Table-Valued Parameters (TVPs)

TVP support allows a client application to send parameterized data to the server more efficiently by sending multiple rows to the server with a single call. You can use the JDBC Driver 6.0 to encapsulate rows of data in a client application and send the data to the server in a single parameterized command.
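On the SQL Server side, a TVP is backed by a user-defined table type and a READONLY parameter of that type; here is a minimal sketch (with hypothetical object names) of what a JDBC Driver 6.0 client would bind against:

    -- Table type describing the rows the client sends in one call
    CREATE TYPE dbo.OrderLineType AS TABLE
    (
        OrderID   INT NOT NULL,
        ProductID INT NOT NULL,
        Quantity  INT NOT NULL
    );
    GO

    -- Procedure accepting the TVP; the parameter must be READONLY
    CREATE PROCEDURE dbo.InsertOrderLines
        @Lines dbo.OrderLineType READONLY
    AS
    BEGIN
        INSERT INTO dbo.OrderLines (OrderID, ProductID, Quantity)
        SELECT OrderID, ProductID, Quantity
        FROM @Lines;
    END;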

There are a couple of interesting features in this driver which could help your Hadoop cluster play nice with SQL Server.


Limitations Of Statistics

Erik Darling discusses statistics and limitations of statistics outside of indexes:

Head rush moment: SQL may use information from histograms outside of the index it chooses for cardinality estimation.

Back to earth: If you just create a bunch of statistics instead of indexes, you’re (at best) using your Clustered Index for everything (which is still bad), or you’re using a HEAP for everything (which is usually worse). You’re still generally better off creating good indexes for your workload. They’ll get statistics objects created and associated with them, and if SQL thinks another column is interesting, it will create a single column statistics object for it, as long as you haven’t turned off auto create stats.

Sure, you can put on your black cloak and goat mask and create some multi-column or filtered statistics, but in the words of a wise man (Doug), you end up with more stats to maintain and understanding query behavior gets more difficult.
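For reference, manually creating those multi-column or filtered statistics is straightforward; the sketch below uses hypothetical tables and also shows how to confirm that auto create statistics is still on:

    -- Multi-column statistics (the histogram is still built only on the leading column)
    CREATE STATISTICS st_City_State
        ON dbo.Customers (City, StateProvince);

    -- Filtered statistics over a hot subset of rows
    CREATE STATISTICS st_ActiveOrders_CustomerID
        ON dbo.Orders (CustomerID)
        WHERE Status = 'Active';

    -- Check that auto create statistics has not been turned off
    SELECT name, is_auto_create_stats_on
    FROM sys.databases
    WHERE name = DB_NAME();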

Erik argues that at the margin, keeping existing statistics up to date tends to be a better solution than creating new statistics.


Biml Object Model

Bill Fellows begins a series on exploring the Biml Object Model:

The most important precursor to doing this is that you will need a licensed installation of Mist. Full stop. The assemblies we’re going to use have security built into them to tell whether they are licensed, and you cannot use the assemblies shipped with BidsHelper or BimlExpress, as they’re hardwired to those specific apps.

We’re going to use two classes: AstRootNode and AstPackageNode.

Ast, what is that? Abstract Syntax Tree – it’s a compiler theory thing.

I’m interested in seeing where this series goes.
