
Author: Kevin Feasel

Python And The Tidyverse

Leo at Locke Data looks at a couple Python packages which implement Tidyverse concepts:

The Dplython README provides some clear examples of how the package can be used. Below is a summary of the common functions:

  • select() – used to get specific columns of the data-frame.

  • sift() – used to filter out rows based on the value of a variable in that row.

  • sample_n() and sample_frac() – used to provide a random sample of rows from the data-frame.

  • arrange() – used to sort results.

  • mutate() – used to create new columns based on existing columns.

I think the Tidyverse is immediately accessible for data platform professionals, so it’s good to see these concepts making their way to Python as well as R.
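
To give a sense of how those functions chain together, here is a minimal sketch in the dplython style. The diamonds sample data set and the X placeholder are taken from the package's README rather than from the excerpt above, so treat this as illustrative:

# Minimal sketch based on the Dplython README; the diamonds sample data set
# and the X placeholder syntax are assumptions drawn from that README.
from dplython import (X, diamonds, select, sift, arrange, mutate, head)

result = (diamonds >>
          sift(X.carat > 1) >>                   # filter rows by a condition
          select(X.carat, X.cut, X.price) >>     # keep specific columns
          mutate(carat_bin=X.carat.round()) >>   # derive a new column
          arrange(X.price) >>                    # sort the results
          head(5))
print(result)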

Forwarded Records Without User Table Heaps

Erik Darling unravels a conundrum:

When people think about Heaps and the problems they can cause, they don’t often think of temp tables and table variables as Heaps.

Of course, without a clustered index, any table is a Heap.

This isn’t an argument for or against indexing temp tables, but while working with a client we came across something strange!

sp_BlitzFirst was reporting hundreds of thousands of Forwarded Records, but there were no Heaps in user databases.

When we dug in closer to what queries were doing, we found lots of places where temp tables had an insert/update pattern.

Click through for a demonstration and an explanation of why this can be trouble.
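
As a rough sketch of the pattern (not Erik's demo, and the table name here is made up), inserting narrow rows into a temp table heap and then widening them with an update is enough to generate forwarded records:

-- Illustrative only: narrow rows go in first, then an update widens them.
CREATE TABLE #Stage (id INT, payload VARCHAR(4000));

INSERT #Stage (id, payload)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 'x'
FROM sys.all_columns AS ac1
CROSS JOIN sys.all_columns AS ac2;

-- The wider rows no longer fit in their original slots, so they move to new
-- pages and leave forwarding pointers behind on the heap.
UPDATE #Stage SET payload = REPLICATE('x', 2000);

SELECT forwarded_record_count
FROM sys.dm_db_index_physical_stats(DB_ID('tempdb'), OBJECT_ID('tempdb..#Stage'), 0, NULL, 'DETAILED');

DROP TABLE #Stage;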

Cached Datasets And ssisUnit

Bartosz Ratajczyk shows how to store result sets in a test file with ssisUnit:

The ssisUnit GUI does not support creating the persisted dataset. If you switch the IsResultsStored flag to true on the dataset’s properties, it gives a warning “The expected dataset’s (<the dataset name>) stored results does not contain data. Populate the Results data table before executing.” during the test run.

To find out more about it, take a look at the source code.

This is a nice explanation of a current limitation in the tool and a workaround.

The SQL Server Execution Plan Reference

Hugo Kornelis has embarked on a major project:

I didn’t choose the term “Execution Plan Reference” by accident. The core of the EPR will be a full description of all that is known about every operator known to exist in execution plans: what it does, how it functions, what properties it can have, how those properties affect its behavior, and any additional information that may be relevant to understand the operator. This section will be one page for each operator. Of course, some operators are fairly simple while others are more complex, so those pages will vary in length.

Apart from that core content, I planned some generic pages. It makes no sense to repeat the explanation for properties such as Estimated Number of Rows or Number of Executions on each operator’s page, so instead I wanted to have a single page to list and describe all common properties. I also wanted an introduction page that explains the basics of reading an execution plan, lists the properties for plans as a whole, and debunks some common misconceptions.

And there will be articles with additional background. Instead of having to explain what exactly an “anti semi join” is on each of the four pages for the four join operators, I decided to create a single page describing all the logical join types. When working on Hash Match, the page was already very long and complex before I even had a chance to start on the details of the “dynamic destaging” process that handles memory spills, so I decided to leave that for a future page. As I continue to work on the EPR, I will probably continue to create or plan separate pages for content that applies to multiple operators, or that I consider too deep and too advanced for the operator’s normal page.

This is a huge undertaking, but even in its current state, the Execution Plan Reference looks great and has tremendous potential.

Parallel Transaction Log Redo For Availability Groups

Dong Cao explains the two different models for transaction log redo when using availability groups:

When availability groups were initially released with SQL Server 2012, the transaction log redo was handled by a single redo thread for each database in an AG secondary replica. This redo model is also called serial redo. In SQL Server 2016, the redo model was enhanced with multiple parallel redo worker threads per database to share the redo workload. In addition, each database has a new helper worker thread for handling the dirty page disk flush IO. This new redo model is called parallel redo.

With the new parallel redo model, which is the default setting since SQL Server 2016, workloads with highly concurrent small transactions are expected to achieve better redo performance. When the transaction redo operation is CPU intensive, such as when data encryption and/or data compression are enabled, parallel redo has even higher redo throughput (Redone Bytes/sec) compared to serial redo. Moreover, indirect checkpoint allows parallel redo to offload more disk IO (and IO waits for slow disk) to its helper worker thread and frees the main redo thread to enumerate more received log records in the secondary replica. It further speeds up the redo performance.

Read on to learn more about these two models, including positives and negatives for each and how to switch from one to the other.
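
For reference, the switch between the two models is a trace flag; 3459 is the one I believe reverts an instance to serial redo, but confirm that against the article before relying on it:

-- Hedged sketch: trace flag 3459 is the commonly documented way to fall back
-- from parallel redo to serial redo. Verify the flag and its scope first.
DBCC TRACEON (3459, -1);     -- revert to serial redo, instance-wide
-- DBCC TRACEOFF (3459, -1); -- return to the default parallel redo model

-- Redo worker activity on a secondary replica is visible here:
SELECT session_id, command, status, wait_type
FROM sys.dm_exec_requests
WHERE command LIKE '%REDO%';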

sp_BlitzInMemoryOLTP

Ned Otter announces a new tool for troubleshooting memory-optimized databases:

Instance level evaluates the following:

  • the version/edition of SQL server

  • SQL Server ‘max memory’ setting

  • memory clerks

  • XTP memory consumers, aggregated

  • XTP memory consumers, detailed

  • the value of the committed_target_kb column from sys.dm_os_sys_info

  • whether or not instance-level collection of execution statistics has been enabled for all natively compiled stored procedures (because this can kill their performance….)

  • when running Enterprise, if there are any resource groups defined, and which memory-optimized databases are bound to them

  • XTP and buffer pool memory allocations, because In-Memory OLTP can affect on-disk workloads

  • summary of memory used by XTP

There’s a lot of useful information you can get out of this procedure.  Click through for the full documentation.
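
If you want to spot-check one of those instance-level items yourself, the committed_target_kb figure comes straight from a DMV; the procedure call below is just a guess at the simplest invocation, so check the documentation for its actual parameters:

-- Simplest possible call; the procedure's parameters are not listed in the
-- excerpt above, so this may need arguments depending on the version.
EXEC dbo.sp_BlitzInMemoryOLTP;

-- One of the instance-level checks: the memory target SQL Server is working toward.
SELECT committed_kb, committed_target_kb
FROM sys.dm_os_sys_info;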

Dumping Error Log Data Into A Table

Kenneth Fisher has a script to take information out of the error log and put it into a table:

This is one of my favorite scripts. It pulls all of the data from the error log and dumps it into a temp table. I did one version here but it was pretty specific to I/O errors. I also gave some homework to find/build a script like it. So if you didn’t do the homework here is a nice little (more generic) script that I personally find extraordinarily handy. There are two parameters at the top that will restrict the data pulled in, and a query against #LogInfo at the bottom that filters out backup and logon entries since they tend to get in the way of what I’m looking for. That said, if you want to create a report on logons (or backups) you’ll want to modify the query. I do leave #LogInfo open at the end so you can run multiple queries against it.

Click through for the script.
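
Kenneth's script is the one to grab, but the general shape of the technique looks something like this. This is a simplified sketch built around xp_readerrorlog, which is my guess at the mechanism rather than a copy of his code:

-- Read the current error log into a temp table, then filter out the noisy
-- backup and logon entries, per the post's description.
CREATE TABLE #LogInfo (
    LogDate     DATETIME,
    ProcessInfo NVARCHAR(50),
    LogText     NVARCHAR(MAX)
);

DECLARE @StartDate DATETIME = DATEADD(DAY, -1, GETDATE());
DECLARE @EndDate   DATETIME = GETDATE();

-- Parameters: log file number (0 = current), log type (1 = SQL Server),
-- two search strings, then the start and end of the date range.
INSERT #LogInfo (LogDate, ProcessInfo, LogText)
EXEC sys.xp_readerrorlog 0, 1, NULL, NULL, @StartDate, @EndDate;

SELECT LogDate, ProcessInfo, LogText
FROM #LogInfo
WHERE ProcessInfo NOT IN ('Backup', 'Logon')
ORDER BY LogDate DESC;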

Stack Overflow Developer Survey Data Available

Julia Silge has a post looking at the Stack Overflow 2018 developer survey:

Starting today, you can access the public data release for Stack Overflow’s 2018 Developer Survey. Over 100,000 developers from around the world shared their opinions about everything from their favorite technologies to job preferences, and this data is now available for you to analyze yourself. This year, we are partnering with Kaggle to publish and highlight this dataset. This means you can access the data both here on our site and on Kaggle Datasets, and that on Kaggle, you can explore the dataset using Kernels. Kaggle is awarding two $1,000 awards over the next two weeks to authors of top Kernels on the Stack Overflow dataset.

Looks like an interesting data set.

A Non-Relational Database Taxonomy

Thomas Henson has a taxonomy of non-relational databases:

Columnar Database

The first type of NoSQL database is the Columnar database, which is optimized for reading and writing columns of data as opposed to rows of data. Column-oriented storage for database tables helps drive down the input/output requirements for the database. Since the I/O profile is lowered, the overall storage footprint is lowered as well. One main feature of Columnar databases is their ability to compress data. Instead of data being written in the traditional row orientation, Columnar databases use column orientation. Each column will be associated with a column key. Check out this example from my HBase Blog Post.

He then goes on to describe the other three types.  I agree with the taxonomy he uses.
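
A toy illustration of the layout difference (nothing HBase-specific, and the sample records are invented) shows why the columnar approach reads less and compresses better:

# The same three records, laid out row-wise versus column-wise. Grouping a
# column's values together lets a columnar store read only the columns a
# query needs and compress runs of similar values.
rows = [
    {"id": 1, "country": "US", "clicks": 10},
    {"id": 2, "country": "US", "clicks": 12},
    {"id": 3, "country": "DE", "clicks": 7},
]

columns = {
    "id":      [1, 2, 3],
    "country": ["US", "US", "DE"],   # repeated values sit next to each other
    "clicks":  [10, 12, 7],
}

# A query that only needs total clicks touches one list, not every record.
print(sum(columns["clicks"]))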

Configuring SQL Agent For Linux

Steve Jones shows how to install and configure the SQL Agent for Linux:

When I first started playing with this version, I noticed that SQL Agent was disabled. That’s not great, since SQL Agent is a great tool for various tasks in SQL Server. I can’t start the agent from here, as the underlying implementation is different, and I’m not really a host OS admin when connecting in SSMS.

After checking which patch level I was at (CU6), I changed to my Linux console, and ran the configuration utility. For Linux, this is mssql-conf.

It’s not an overly complicated process, but it is a bit different from Windows, so check it out.
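
Assuming a default package install, the relevant commands on the Linux side look roughly like this; the mssql-conf path, the sqlagent.enabled setting, and the service name are the standard ones, so adjust if your install differs:

# Enable SQL Agent via mssql-conf, then restart the service for it to take effect.
sudo /opt/mssql/bin/mssql-conf set sqlagent.enabled true
sudo systemctl restart mssql-server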
