Press "Enter" to skip to content

Author: Kevin Feasel

Dot Plots

Devin Knight continues his custom visualization series:

In this module you will learn how to use the Dot Plot Power BI Custom Visual.  The Dot Plot is often used when visualizing a distribution of values or a count of an occurrence across different categorical data you may have.  Watch this module to learn more!

This particular visualization seems a bit distracting for my tastes, but check out Devin’s video.

Comments closed

Descriptive Statistics With SQL Server And R

Mala Mahadevan digs into descriptive statistics:

With R integration into SQL Server 2016 we can pull an R script and integrate it rather easily. I will be covering all 3 approaches. I am using a small dataset – a single table with 915 rows, with a SQL Server 2016 installation and R Studio. The complexities of doing this type of analysis in the real world with bigger datasets involve setting various options for performance and dealing with memory issues – because R is very memory intensive and single threaded.

My table and the data it contains can be created with scripts here. For this specific post I used just one column in the table – age. For further posts I will be using the other fields such as country and gender.

Mala compares T-SQL versus R for calculating minimum, maximum, mean, and mode.  She wraps the post up by showing how to call her R code via T-SQL using SQL Server R Services.

Comments closed

Columnstore Basics

Sunil Agarwal has a couple of posts explaining columnstore indexes.  First, how columnstore indexes differ from classic B-tree indexes:

  • Index Fragmentation: For rowstore based indexes, it is considered fragmented if (a) the physical order of pages in out of sync with the index-key order. (b) the data pages (clustered index) or index pages (for nonclustered index) are partially filled. A fragmented index will lead to significantly higher physical IOs and can potentially put more pressure on memory which can ultimately slowdown queries. Most organizations run a periodic index maintenance job to defragment indexes. For details, please refer to  https://msdn.microsoft.com/en-us/library/ms189858.aspx#Fragmentation best practices on how to maintain btree indexes. For columnstore index, an index fragmentation is considered fragmented if (a) there are 10% or more rows marked as deleted in a compressed rowgroup (b) one or more smaller compressed rowgroups can be combined to create a larger compressed rowgroup such that the resultant compressed rowgroup has less than or equal to 1 million rows. Note, if a compressed rowgroup has less than 1 million rows due to dictionary size, it is not considered fragmented because there is nothing that can be done to increase its size.  Also recall that a columnstore index consists of zero or more delta rowgroups as shown the in the picture below.

Also, clustered and non-clustered columnstores:

SQL Server 2016 provides two flavors of columnstore index; clustered (CCI) and nonclustered (NCCI) columnstore index. As shown in the simplified picture below, both indexes are organized as columns but NCCI is created on an existing rowstore table as shown on the right side in the picture below while a table with CCI does not have a rowstore table. Both tables can have one or more btree nonclustered indexes.

If you haven’t looked at columnstore indexes yet, 2016 is a great time to start.

Comments closed

Fashion Design And Genetic Algorithms

Daragh Sibley, et al, discuss using genetic algorithms to help design clothing:

Recombination. Having selected a set of high performing blouses we can now consider how they should be recombined to form a new child. While a traditional genetic algorithm would stochastically search all combinations over many market generations, we can shortcut that process by algorithmically looking for features that have been historically preferred by our target client segment.

To achieve this, we find statistical regularities between the population of blouses’ attributes (or configurations of attributes) and client feedback. For instance, we can model the relationship between attributes of our existing blouses and client feedback via:

Genetic algorithms (and Koza-style genetic programming) have long been a favorite topic of mine.  Integrating GA with fashion was not something that came to mind, but is a very interesting solution.

Comments closed

Altering Columns In Large Tables

Kenneth Fisher discusses a problem he had with altering a column on a large table:

My first attempt was just a straight ALTER TABLE ALTER COLUMN. After about an hour I got back a log full error. I then tried a 200 GB log and a 350 GB log. These failed at 3 and 5 hours. While this was going on I checked on #sqlhelp to see if anyone knew any way to minimize the log useage so my command would finish.

The primary suggestions were:

  • Add a new column to the end of the table, populate it in batches, then remove the old column.
  • Create a new table, populate it, index it, drop the old table, re-name the new table.

I will say that I have used suggestion #1 several times, particularly in zero down-time situations.  Once you’re done backfilling the column, you can drop the old one and rename the new one in a single transaction.  Read on for Kenneth’s solution.

Comments closed

Top-Down ETL With Powershell

Max Trinidad continues his series on top-down ETL using Powershell:

After all the previous functions has been loaded, just type the following one-liner:

Process-PSObjectToSQL -SQLDataObj $SQLData;

This sample script code can serve as a startup Template to load data into SQL Server.

This sample SQL data load will fail. Here’s when the Try/Catch/Finally will work for you in trapping what went wrong. Adding the necessary code to provide that additional information to troubleshoot and fix the problem.

Parts one and two available as well.

Comments closed

Parallelism Configuration Options

Kendra Little discusses max degree of parallelism and cost threshold for parallelism:

When you run a query, SQL Server estimates how “expensive” it is in a fake costing unit, let’s call it Estimated QueryBucks.

If a query’s Estimated QueryBucks is over the “Cost Threshold for Parallelism” setting in SQL Server, it qualifies to potentially use multiple processors to run the query.

The number of processors it can use is defined by the instance level “Max Degree of Parallelism” setting.

When writing TSQL, you can specify maxdop for individual statements as a query hint, to say that if that query qualifies to go parallel, it should use the number of processors specified in the hint and ignore the server level setting. (You could use this to make it use more processors, or to never go parallel.)

Read the whole thing, or watch/listen to the video.

Comments closed

Storm 1.0 Microbenchmarks

Roshan Naik and Sapin Amin have Storm 1.0 benchmarks on a small cluster:

Numbers suggest that Storm has come a long way in terms of performance but it still has room go faster. Here are some of the broad areas that should improve performance in future:

  • An effort to rewrite much of Storm’s Clojure code in Java is underway. Profiling has shown many hotspots in Clojure code.

  • Better scheduling of workers. Yahoo is experimenting with a Load Aware Scheduler for Storm to be smarter about the way in which topologies are scheduled on the cluster.

  • Based on microbenchmarking and discussions with other Storm developers there appears potential for streamlining the internal queueing for faster message transfer.

  • Operator coalescing (executing consecutive spouts/bolts in a single thread when possible) is another area that reduces intertask messaging and improve throughput.

Even with these potential improvements, Storm has come a long way—their benchmarks show around 5x throughput and a tiny fraction of the latency of Storm 0.9.1.

Comments closed

New JDBC Driver

Microsoft has released a new version of their SQL Server JDBC driver:

Table-Valued Parameters (TVPs)

TVP support allows a client application to send parameterized data to the server more efficiently by sending multiple rows to the server with a single call. You can use the JDBC Driver 6.0 to encapsulate rows of data in a client application and send the data to the server in a single parameterized command.

There are a couple of interesting features in this driver which could help your Hadoop cluster play nice with SQL Server.

Comments closed