rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R, via a thin adaption of data.table supplied by rqdatatable.

Fully benchmarking rquery is a big task, so in this note we will limit ourselves to a single example and benchmark. Our intent is to use this example to promote rqdatatable, but frankly the biggest result of the benchmarking is how far out of the pack data.table itself stands at small through large problem sizes. This is already known, but it is a much larger difference and at more scales than the typical non-data.table user may be aware of.
Click through for the benchmark and information on how to grab the package before it goes to CRAN.
The Dplython README provides some clear examples of how the package can be used. Below is a summary of the common functions:
select() – used to get specific columns of the data-frame.
sift() – used to filter out rows based on the value of a variable in that row.
sample_n() and sample_frac() – used to provide a random sample of rows from the data-frame.
arrange() – used to sort results.
mutate() – used to create new columns based on existing columns.
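To make these verbs concrete, here is a hypothetical, minimal pure-Python sketch of the same operations working on a plain list of dicts. This is an illustration of the concepts only, not Dplython’s actual implementation, and the sample data is invented.

```python
# Toy implementations of the Dplython-style verbs over a list of dicts.
import random

def select(rows, *cols):
    """Keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]

def sift(rows, predicate):
    """Keep rows for which the predicate is true."""
    return [r for r in rows if predicate(r)]

def sample_n(rows, n):
    """Random sample of n rows."""
    return random.sample(rows, n)

def arrange(rows, key):
    """Sort rows by a key function."""
    return sorted(rows, key=key)

def mutate(rows, **new_cols):
    """Add new columns computed from existing columns."""
    return [{**r, **{name: fn(r) for name, fn in new_cols.items()}}
            for r in rows]

data = [{"x": 1, "y": 10}, {"x": 3, "y": 5}, {"x": 2, "y": 8}]

# Filter, derive a column, sort, then project — the classic verb pipeline.
result = arrange(
    mutate(sift(data, lambda r: r["x"] > 1), total=lambda r: r["x"] + r["y"]),
    key=lambda r: r["x"],
)
picked = select(result, "x", "total")
```

The real library chains these with a `>>` pipe operator over DataFrames; the point here is just that each verb is a simple transformation from one table-shaped value to another.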
I think the Tidyverse is immediately accessible for data platform professionals, so it’s good to see these concepts making their way to Python as well as R.
When people think about Heaps and the problems they can cause, they don’t often think of temp tables and table variables as Heaps.
Of course, without a clustered index, any table is a Heap.
This isn’t an argument for or against indexing temp tables, but while working with a client we came across something strange!
sp_BlitzFirst was reporting hundreds of thousands of Forwarded Records, but there were no Heaps in user databases.
When we dug in closer to what queries were doing, we found lots of places where temp tables had an insert/update pattern.
Click through for a demonstration and an explanation of why this can be trouble.
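To build intuition for why an insert/update pattern on a heap produces forwarded records, here is a hypothetical Python toy model (emphatically not SQL Server internals): each “page” has a fixed byte budget, and an update that widens a row past what its page can absorb moves the row elsewhere and leaves a forwarding stub behind.

```python
# Toy model (NOT SQL Server internals): a "heap page" holds rows up to a
# fixed byte budget. Updating a row so it no longer fits on its page moves
# the row to another page and leaves a forwarding stub behind, which every
# later read of that row must chase.

PAGE_SIZE = 100  # pretend byte budget per page

class ToyHeap:
    def __init__(self):
        self.pages = [[]]     # each page: a list of [row_id, payload-or-stub]
        self.forwarded = 0    # forwarding stubs left behind so far

    def _used(self, page):
        # Bytes used by real payloads; stubs are ignored for simplicity.
        return sum(len(slot[1]) for slot in page
                   if not isinstance(slot[1], tuple))

    def insert(self, row_id, payload):
        # First-fit placement; grow the heap with a new page if nothing fits.
        for i, page in enumerate(self.pages):
            if self._used(page) + len(payload) <= PAGE_SIZE:
                page.append([row_id, payload])
                return i
        self.pages.append([[row_id, payload]])
        return len(self.pages) - 1

    def update(self, row_id, payload):
        for page in self.pages:
            for slot in page:
                if slot[0] == row_id and not isinstance(slot[1], tuple):
                    if self._used(page) - len(slot[1]) + len(payload) <= PAGE_SIZE:
                        slot[1] = payload                        # fits in place
                    else:
                        new_page = self.insert(row_id, payload)  # move the row
                        slot[1] = ("FWD", new_page)              # leave a stub
                        self.forwarded += 1
                    return

heap = ToyHeap()
for i in range(10):
    heap.insert(i, "x" * 10)   # narrow rows: all ten fit on the first page
for i in range(10):
    heap.update(i, "x" * 40)   # widen each row: most no longer fit in place
```

The insert-then-widen pattern is exactly what drives `heap.forwarded` up. A clustered index avoids forwarding pointers entirely (it handles growth with page splits instead), which is one reason this pattern matters for temp tables.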
The ssisUnit GUI does not support creating the persisted dataset. If you switch the IsResultsStored flag to true on the dataset’s properties, it gives a warning during the test run: “The expected dataset’s (&lt;the dataset name&gt;) stored results does not contain data. Populate the Results data table before executing.”
To find out more about it, take a look at the source code.
This is a nice explanation of a current limitation in the tool and a workaround.
I didn’t choose the term “Execution Plan Reference” by accident. The core of the EPR will be a full description of all that is known about every operator known to exist in execution plans: what it does, how it functions, what properties it can have, how those properties affect its behavior, and any additional information that may be relevant to understand the operator. This section will be one page for each operator. Of course, some operators are fairly simple while others are more complex, so those pages will vary in length.
Apart from that core content, I planned some generic pages. It makes no sense to repeat the explanation for properties such as Estimated Number of Rows or Number of Executions on each operator’s page, so instead I wanted to have a single page to list and describe all common properties. I also wanted an introduction page that explains the basics of reading an execution plan, lists the properties for plans as a whole, and debunks some common misconceptions.
And there will be articles with additional background. Instead of having to explain what exactly an “anti semi join” is on each of the four pages for the four join operators, I decided to create a single page describing all the logical join types. When working on Hash Match, the page was already very long and complex before I even had a chance to start on the details of the “dynamic destaging” process that handles memory spills, so I decided to leave that for a future page. As I continue to work on the EPR, I will probably continue to create or plan separate pages for content that applies to multiple operators, or that I consider too deep and too advanced for the operator’s normal page.
This is a huge undertaking, but even in its current state, the Execution Plan Reference looks great and has tremendous potential.
When availability groups were initially released with SQL Server 2012, transaction log redo was handled by a single redo thread for each database in an AG secondary replica. This redo model is also called serial redo. In SQL Server 2016, the redo model was enhanced with multiple parallel redo worker threads per database to share the redo workload. In addition, each database has a new helper worker thread for handling dirty page disk flush IO. This new redo model is called parallel redo.
With the new parallel redo model, which is the default setting since SQL Server 2016, workloads with highly concurrent small transactions are expected to achieve better redo performance. When the transaction redo operation is CPU intensive, such as when data encryption and/or data compression are enabled, parallel redo has even higher redo throughput (Redone Bytes/sec) compared to serial redo. Moreover, indirect checkpoint allows parallel redo to offload more disk IO (and IO waits on slow disks) to its helper worker thread, freeing the main redo thread to enumerate more received log records on the secondary replica. This further speeds up redo performance.
Read on to learn more about these two models, including positives and negatives for each and how to switch from one to the other.
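As a rough illustration of the division of labor described above, here is a hypothetical Python sketch (not SQL Server code): a main “redo” loop applies log records and hands dirty-page flushes to a helper thread through a queue, so the redo loop never blocks on the (simulated) disk IO.

```python
# Toy model of the redo/helper split: the main loop stands in for the redo
# thread, and a helper thread drains dirty-page flush requests from a queue
# so the main loop never waits on "disk IO".
import queue
import threading

flush_queue = queue.Queue()
flushed = []

def flush_helper():
    # Helper worker: drain flush requests until it sees the shutdown sentinel.
    while True:
        page = flush_queue.get()
        if page is None:
            break
        flushed.append(page)  # stand-in for the actual disk write

helper = threading.Thread(target=flush_helper)
helper.start()

log_records = [("page", n) for n in range(100)]
for record in log_records:
    # "Apply" the log record, then offload the dirty page to the helper,
    # leaving the main loop free to pick up the next log record immediately.
    flush_queue.put(record[1])

flush_queue.put(None)  # shut the helper down
helper.join()
```

The real feature involves multiple parallel redo workers per database as well; this sketch only shows why pushing IO waits onto a helper lets the main redo thread keep consuming log records.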
Instance level evaluates the following:
the version/edition of SQL Server
SQL Server ‘max memory’ setting
XTP memory consumers, aggregated
XTP memory consumers, detailed
the value of the committed_target_kb column from sys.dm_os_sys_info
whether or not instance-level collection of execution statistics has been enabled for all natively compiled stored procedures (because this can kill their performance)
when running Enterprise, if there are any resource groups defined, and which memory-optimized databases are bound to them
XTP and buffer pool memory allocations, because In-Memory OLTP can affect on-disk workloads
summary of memory used by XTP
There’s a lot of useful information you can get out of this procedure. Click through for the full documentation.
This is one of my favorite scripts. It pulls all of the data from the error log and dumps it into a temp table. I did one version here but it was pretty specific to I/O errors. I also gave some homework to find/build a script like it. So if you didn’t do the homework, here is a nice little (more generic) script that I personally find extraordinarily handy. There are two parameters at the top that will restrict the data pulled in, and a query against #LogInfo at the bottom that filters out backup and logon entries, since they tend to get in the way of what I’m looking for. That said, if you want to create a report on logons (or backups), you’ll want to modify the query. I do leave #LogInfo open at the end so you can run multiple queries against it.
Click through for the script.
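As a rough analogue of what such a script does, here is a hypothetical Python sketch (the real script is T-SQL against the SQL Server error log): read log lines into a list, restrict them by a date parameter, then filter out backup and logon noise. The function name and sample data are invented for illustration.

```python
# Hypothetical sketch: load dated log lines (the "temp table"), restrict by
# date range, then filter out backup and logon entries.
from datetime import datetime

def read_error_log(lines, start=None, end=None):
    """Parse 'YYYY-MM-DD message' lines; keep those within [start, end]."""
    out = []
    for line in lines:
        stamp, _, text = line.partition(" ")
        when = datetime.strptime(stamp, "%Y-%m-%d")
        if (start is None or when >= start) and (end is None or when <= end):
            out.append((when, text))
    return out

log = [
    "2024-01-01 Login succeeded for user 'app'.",
    "2024-01-02 BACKUP DATABASE successfully processed.",
    "2024-01-03 I/O error on file 'data.mdf'.",
]

log_info = read_error_log(log, start=datetime(2024, 1, 1))

# Filter out the backup and logon chatter, keeping whatever is left.
interesting = [text for _, text in log_info
               if "BACKUP" not in text and "Login" not in text]
```

Keeping the parsed list around (like leaving #LogInfo open) means you can run several different filters against one read of the log.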