Using Cohen’s D for Experiments

Nina Zumel takes us through Cohen’s D, a useful tool for determining effect sizes in experiments:

Cohen’s d is a measure of effect size for the difference of two means that takes the variance of the population into account. It’s defined as
d = | μ1 – μ2 | / σpooled
where σpooled is the pooled standard deviation over both cohorts.

Read the whole thing.

Comparing Iterator Performance in R

Kevin Feasel

2019-06-21

R

Ulrik Stervbo has a performance comparison for for, apply, and map functions in R:

It is usually said, that for– and while-loops should be avoided in R. I was curious about just how the different alternatives compare in terms of speed.

The first loop is perhaps the worst I can think of – the return vector is initialized without type and length so that the memory is constantly being allocated.

The performance of map isn’t great, though the benefits to me are less about performance and more about readability. H/T R-bloggers

Calculating Value Above Replacement

Rob Collie explains how to calculate Value Above Replacement:

Let’s say you’re trying to improve the fuel efficiency of your car, and you have a budget of $100 to make those improvements.  Furthermore, let’s say that you can only spend your budget on three different kinds of improvements:  better tires, better spark plugs, and a better exhaust system.

The simple answer, of course, is to buy the Best of each part – the best Tires, the best Sparkplugs, and the best Exhaust System.

But what if Reality rears its head and rudely insists that the Best Costs More?  What if there are a range of options available in each bucket, with varying price AND effectiveness?  Then what?

Read on for the solution which has not nearly enough baseball.

An Intro to SQL Server Big Data Clusters

Mohammad Darab has a series on Big Data Clusters. Part zero explains what they are:

The *absolute* unique feature that BDCs offer that *no* other company or product offers is: Data Virtualization

The new and enhanced SQL Server 2019 Polybase feature comes with connectors to many different data sources: Oracle, Teradata, Apache Spark, MongoDB, Azure Cosmos DB, and even ODBC connectivity to IBM’s DB2, SAP HANA and excel (see image below)

Part one shows how to set one up:

So far, Microsoft does not have a simple way to create a Big Data Cluster. It’s a bit cumbersome of a process and the learning curve is a bit steep. However, Microsoft is currently working on making it easier to deploy a Big Data Cluster via Notebook in Azure Data Studio and eventually some type of “deployment wizard.” But for now, the only option is to do it the long way.

The series will continue, but check out the setup work.

Replaying Workloads with WorkloadTools

Gianluca Sartori shows an example of using the WorkloadTools application to replay a workload, including where the analytics server cannot directly access the production database:

Regardless of the method that you decided to use, at the end of the replays, you will have two distinct sets of tables containing the workload analysis data, sitting in different schemas in the same database or in completely different databases.

WorkloadViewer will let you visualize performance over time, as we have seen for a single workload analysis, but this time it will be able to show you data from both workloads, so that you can compare them.

This sort of production load testing is both important and difficult; WorkloadTools makes it easier.

Troubleshooting Query Performance Changes

Erin Stellato walks us through a troubleshooting guide when users complain about poorly-performing queries:

This is tale of troubleshooting…

When you unexpectedly or intermittently encounter a change in query performance, it can be extremely frustrating and troublesome for a DBA or developer. If you’re not using Query Store, a third-party application, or your own method to capture query data, then you probably don’t have information about what query performance looked like when things were good…you just know how it’s running now. I was working with a customer of Tim’s last week that had been chasing this exact problem for, in their words, years. They had recently upgraded to SQL Server 2016, and the problem was still occurring.

Strangely, “blame the network” didn’t appear in Erin’s post, so I don’t know if it’s comprehensive.

Workarounds for Updating Stats on Secondaries

Niko Neugebauer wants statistics updates on tables running on readable Availability Group secondary nodes:

Let’s list the basic known details for the possible solution(for the Enterprise Edition of the Sql Server that is):
– We can make the secondary replica readable and read the same data on it. (Not that you should do that by default, but if you really know what you are doing …)
– We can copy our object into the TempDB (yeah, your Multi-TB table is probably not the best candidate for this operation), or maybe into some other writable DB.
– We can write results in the shared folder between the replicas (let’s say in a text file into a File Share)
– We can export the BLOB object of the statistics out of the SQL Server
– We can import the BLOB object of the statistics into the statistics

Read the whole thing.

Hidden Patterns in SQL Agent Outputs

Solomon Rutzky has found patterns which cause SQL Agent step outputs not to write any details:

And that’s when it happened: I started getting a value back from QUOTENAME(ORIGINAL_DB_NAME()), and not just the square brackets with nothing between them. The name of a database was now between the square brackets. Why wasn’t that showing up before?

Solomon knows the how but not the why in this case.

Categories

June 2019
MTWTFSS
« May Jul »
 12
3456789
10111213141516
17181920212223
24252627282930