P-Hacking and Multiple Comparison Bias

Patrick David has a great article on hypothesis testing, p-hacking, and multiple comparison bias:

The most important part of hypothesis testing is being clear what question we are trying to answer. In our case we are asking:
“Could the most extreme value happen by chance?”
The most extreme value we define as the greatest absolute AMVR deviation from the mean. This question forms our null hypothesis.

Give this one a careful read and try out the code. This is an important topic for anyone who analyzes data to understand.

SPLIT And FLATTEN In Snowflake DB

Koen Verbeeck takes us through a couple of Snowflake functions which behave quite differently from their SQL Server equivalent:

I’m doing a little series on some of the nice features/capabilities in Snowflake (the cloud data warehouse). In each part, I’ll highlight something that I think it’s interesting enough to share. It might be some SQL function that I’d really like to be in SQL Server, it might be something else.
Let’s start with SPLIT. Splitting string is something most of us have to do from time to time. In SQL Server, we had to wait until SQL Server 2016 when the table-valued function STRING_SPLIT came along. The syntax of both functions is exactly the same. You specify the string to split and the delimiter.

It turns out that SPLIT doesn’t quite split the same way that STRING_SPLIT does in SQL Server. It gets closer when you add FLATTEN but someone coming from a SQL Server background will still need to pay attention.

Get Windows Failover Cluster Errors

John Morehouse walks us through the Get-ClusterLog cmdlet in Powershell:

Sometimes you know that a problem occurred, but the tools are not giving you the right information.  If you ever look at the Cluster Failover Manager for a Windows Cluster, sometimes that can happen.  The user interface won’t show you any errors, but you KNOW there was an issue.  This is when knowing how to use other tools to extract information from the cluster log becomes useful.
You can choose to use either Powershell or a command at the command prompt.  I tend to lean towards Powershell. I find it easier to utilize and gives me the most flexibility.

Click through for an example, including of a method which filters out everything but error messages.

Updating Power BI Reports Based On Website Changes

Kasper de Jonge shows how you can scrape a webpage using an M function and check for specific updates:

A quick hack today. Got this question from someone who needed to be able to update a report and show users that something had changed. This is easy when you have access to a database and can add data to it but in this case that was not possible.
So I came up with a hacky (and great :P) way to do this. and wanted to share it in case it came handy in your box of tricks :).

It is a little bit hacky, but much less so on a website with a last updated date visible someplace.

Azure Databricks Security

Tristan Robinson looks at what’s currently available in terms of security on Azure Databricks:

You’ll notice that as part of this I’m retrieving the secrets/GUIDS I need for the connection from somewhere else – namely the Databricks-backed secrets store. This avoids exposing those secrets in plain text in your notebook – again this would not be ideal. The secret access is then based on an ACL (access control list) so I can only connect to Data Lake if I’m granted access into the secrets. While it is also possible to connect Databricks up to the Azure Key Vault and use this for secrets store instead, when I tried to configure this I was denied based on permissions. After research I was unable to overcome the issue. This would be more ideal to use but unfortunately there is limited support currently and the fact the error message contained spelling mistakes suggests to me the functionality is not yet mature.

To be charitable, there appears to be room for implementation improvement.

The Thought Behind Metrics

Shannon Holck takes a book as a jumping-off point for failure by metric success:

There were some great use cases in the book.  Doctors that stopped taking cases that were difficult because it would ruin their surgical success metric.  Police that stopped responding to calls because it would ruin their case closure rate if they couldn’t solve it.

Muller states “The problem is not measurement, but excessive measurement, and inappropriate measurements – not metrics, but metric fixation.”

Shannon’s case study and recommendations were interesting.

Data Lake Organization Tips

Melissa Coates has some great advice for people working with data lakes:

Q: Partitioning by date is common. Where should the dates go in the folder hierarchy?

Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder levels (such as by subject area), but we rarely set up security based on time elements.

Optimal for folder security: \SubjectArea\DataSource\YYYY\MM\DD\FileData_YYYY_MM_DD.csv

Tedious for folder security: \YYYY\MM\DD\SubjectArea\DataSource\FileData_YYYY_MM_DD.csv

Click through for all of Melissa’s advice in FAQ form.

Calculating Median In SQL Server 2019

Tomaz Kastrun shows that batch aggregation mode on window functions allow PERCENTILE_CONT finally to become useful:

Next query, for median calculation was a window function query.


SELECT DISTINCT

PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c1)

OVER (PARTITION BY (SELECT 1)) AS MedianCont

FROM t1


To my surprise, the performance was even worse, and at this time, I have to say, I was running this on SQL Server 2017 with CU7. But luckily, I had a SQL Server 2019 CTP 2.0 also installed and here, with no further optimization the query ran little over 1 second.

I’ve warned people away from this function for all but tiny data sets because of how poorly it performs. With SQL Server 2019, I might be able to recommend it.

Categories

January 2019
MTWTFSS
« Dec Feb »
 123456
78910111213
14151617181920
21222324252627
28293031