rqdatatable — Wrangling Lots Of Data, Fast

Kevin Feasel

2018-06-04

R

John Mount explains the motivation behind rqdatatable and puts together a performance test:

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable).

Teaching rquery and fully benchmarking it is a big task, so in this note we will limit ourselves to a single example and benchmark. Our intent is to use this example to promote rquery and rqdatatable, but frankly the biggest result of the benchmarking is how far out of the pack data.tableitself stands at small through large problem sizes. This is already known, but it is a much larger difference and at more scales than the typical non-data.table user may be aware of.

Click through for the benchmark and information on how to grab the package before it goes into CRAN.

Python And The Tidyverse

Kevin Feasel

2018-06-04

Python

Leo at Locke Data looks at a couple Python packages which implement Tidyverse concepts:

The Dplython README provides some clear examples of how the package can be used. Below is an summary of the common functions:

  • select() – used to get specific columns of the data-frame.

  • sift() – used to filter out rows based on the value of a variable in that row.

  • sample_n() and sample_frac() – used to provide a random sample of rows from the data-frame.

  • arrange() – used to sort results.

  • mutate() – used to create new columns based on existing columns.

I think the Tidyverse is immediately accessible for data platform professionals, so it’s good to see these concepts making their way to Python as well as R.

Forwarded Records Without User Table Heaps

Erik Darling unravels a conundrum:

When people think about Heaps and the problems they can cause, they don’t often think of temp tables and table variables as Heaps.

Of course, without a clustered index, any table is a Heap.

This isn’t an argument for or against indexing temp tables, but while working with a client we came across something strange!

sp_BlitzFirst was reporting hundreds of thousands of Forwarded Records, but there were no Heaps in user databases.

When we dug in closer to what queries were doing, we found lots of places where temp tables had an insert/update pattern.

Click through for a demonstration and an explanation of why this can be trouble.

Cached Datasets And ssisUnit

Bartosz Ratajczyk shows how to store result sets in a test file with ssisUnit:

The ssisUnit GUI does not support creating the persisted dataset. If you switch the IsResultsStored flag to true on the dataset’s properties, it gives a warning “The expected dataset’s (<the dataset name>) stored results does not contain data. Populate the Results data table before executing.” during the test run.

To find out more about it, take a look at the source code.

This is a nice explanation of a current limitation in the tool and a workaround.

The SQL Server Execution Plan Reference

Hugo Kornelis has embarked on a major project:

I didn’t choose the term “Execution Plan Reference” by accident. The core of the EPR will be a full description of all that is known about every operator known to exist in execution plans: what it does, how it functions, what properties it can have, how those properties affect its behavior, and any additional information that may be relevant to understand the operator. This section will be one page for each operator. Of course, some operators are fairly simple while others are more complex, so those pages will vary in length.

Apart from that core content, I planned some generic pages. It makes no sense to repeat the explanation for properties such as Estimated Number of Rows or Number of Executions on each operator’s page, so instead I wanted to have a single page to list and describe all common properties. I also wanted an introduction page that explains the basics of reading an execution plan, lists the properties for plans as a whole, and debunks some common misconceptions.

And there will be articles with additional background. Instead of having to explain what exactly an “anti semi join” is on each of the four pages for the four join operators, I decided to create a single page describing all the logical join types. When working on Hash Match, the page, was already very long and complex before I even had a chance to start on the details of the “dynamic destaging” process that handles memory spills, so I decided to leave that for a future page. As I continue to work on the EPR, I will probably continue to create or plan separate pages for content that applies to multiple operators, or that I consider too deep and too advanced for the operator’s normal page.

This is a huge undertaking, but even in its current state, the Execution Plan Reference looks great and has tremendous potential.

Parallel Transaction Log Redo For Availability Groups

Dong Cao explains the two different models for transaction log redo when using availability groups:

When availability group was initially released with SQL Server 2012, the transaction log redo was handled by a single redo thread for each database in an AG secondary replica. This redo model is also called as serial redo. In SQL Server 2016, the redo model was enhanced with multiple parallel redo worker threads per database to share the redo workload. In addition, each database has a new helper worker thread for handling the dirty page disk flush IO. This new redo model is called parallel redo.

With the new parallel redo model that is the default setting since SQL Server 2016, workloads with highly concurrent small transactions are expected to achieve better redo performance. When the transaction redo operation is CPU intensive, such as when data encryption and/or data compression are enabled, parallel redo has even higher redo throughput (Redone Bytes/sec) compared to serial redo. Moreover, indirect checkpoint allows parallel redo to offload more disk IO (and IO waits for slow disk) to its helper worker thread and frees main redo thread to enumerate more received log records in secondary replica. It further speeds up the redo performance.

Read on to learn more about these two models, including positives and negatives for each and how to switch from one to the other.

sp_BlitzInMemoryOLTP

Ned Otter announces a new tool for troubleshooting memory-optimized databases:

Instance level evaluates the following:

  • the version/edition of SQL server

  • SQL Server ‘max memory’ setting

  • memory clerks

  • XTP memory consumers, aggregated

  • XTP memory consumers, detailed

  • the value of the committed_target_kb column from sys.dm_os_sys_info

  • whether or not instance-level collection of execution statistics has been enabled for all natively compiled stored procedures (because this can kill their performance….)

  • when running Enterprise, if there are any resource groups defined, and which memory-optimized databases are bound to them

  • XTP and buffer pool memory allocations, because In-Memory OLTP can affect on-disk workloads

  • summary of memory used by XTP

There’s a lot of useful information you can get out of this procedure.  Click through for the full documentation.

Dumping Error Log Data Into A Table

Kenneth Fisher has a script to take information out of the error log and put it into a table:

This is one of my favorite scripts. It pulls all of the data from the error log and dumps it into a temp table. I did one version here but it was pretty specific to I/O errors. I also gave some homework to find/build a script like it. So if you didn’t do the homework here is a nice little (more generic) script that I personally find extraordinarily handy. There are two parameters at the top that will restrict the data pulled in, and a query against #LogInfo at the bottom that filters out backupand logon entries since they tend to get in the way of what I’m looking for. That said, if you want to create a report on logons (or backups) you’ll want to modify the query. I do leave #LogInfo open and the end so you can run multiple queries against it.

Click through for the script.

Categories

June 2018
MTWTFSS
« May Jul »
 123
45678910
11121314151617
18192021222324
252627282930