Day: October 3, 2019

Multiple Hypothesis Testing with R

Roland Stevenson shows how we can perform multiple hypothesis tests on data, as well as potential issues:

Both results show that evaluating two tests on the same family of data will lead to a ~10% chance that a researcher will claim a “significant” result if they look for either test to reject the null. Any claim that there is a maximum 5% false positive rate would be mistaken. As an exercise, verify that doing the same on \(m=4\) tests will lead to an ~18% chance!

A bad testing platform would be one that claims a maximum 5% false positive rate when any one of multiple tests on the same family of data shows significance at the 5% level. Clearly, if a researcher is going to claim that the FWER is no more than \(\alpha\), then they must control for the FWER and carefully consider how individual tests reject the null.
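
The ~10% and ~18% figures in the excerpt can be checked with a few lines of arithmetic. The sketch below is not taken from the linked post: it assumes the m tests are independent and each is run at level alpha, so the family-wise error rate is 1 - (1 - alpha)^m. The function names are illustrative, and the Bonferroni correction is shown only as one common way to control the FWER.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests,
    each run at significance level alpha (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test level under a Bonferroni correction (one possible control)."""
    return alpha / m


if __name__ == "__main__":
    for m in (1, 2, 4):
        uncorrected = familywise_error_rate(0.05, m)
        corrected = familywise_error_rate(bonferroni_alpha(0.05, m), m)
        print(f"m={m}: uncorrected FWER={uncorrected:.4f}, "
              f"Bonferroni FWER={corrected:.4f}")
```

Under these assumptions, two tests give an FWER of 0.0975 and four tests give about 0.1855, matching the excerpt. With the per-test level divided by m, the FWER stays at or below 0.05.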

This is worth taking some time to read carefully. H/T R-Bloggers

Migrating Databricks Workspaces

Gerhard Brueckl has made DatabricksPS better:

I do not know what the problem was, and I did not have time to investigate; I needed to come up with a proper solution in time. So I had a look at what needs to be done for a manual export. Basically, there are 5 types of content within a Databricks workspace:

– Workspace items (notebooks and folders)
– Clusters
– Jobs
– Secrets
– Security (users and groups)

For all of them, Databricks provides an appropriate REST API to manage them and also to export and import them. This was fantastic news for me, as I knew I could use my existing PowerShell module DatabricksPS to do all the stuff without having to reinvent the wheel.

I’ve used DatabricksPS and really like it for cases where I’d have to loop with the Databricks REST API—for example, when uploading files.

Backing Up Cosmos DB

Josh Smith takes us through backing up Cosmos DB yourself:

Unfortunately, if you are restricting access to your Cosmos DB service based on IP address (a reasonable security measure), then Data Factory won’t work as of this writing: Azure Data Factory doesn’t operate like a trusted Azure service and presents as an IP address from somewhere in the data center where it is spun up. Thankfully, they are working on this. In the meantime, however, the next best thing is to use the Cosmos DB migration tool (scripts below) to dump the contents to a location where they can be retained as long as needed. Be aware that, in addition to the RU cost of returning the data, if you bring these backups out of the data center where the Cosmos DB lives, you’ll also incur egress charges on the data.

Having a plan for this kind of thing is important, even if you normally rely on service-provided automated backups.

Building an Azure Usage Report with PowerShell

June Castillote shows us how we can use PowerShell to get usage data from Azure for our subscriptions:

In the section above, it would be common for the command to return many thousands of objects, especially for long date ranges. To prevent overwhelming the API, the Get-UsageAggregates command only returns a maximum of 1000 results. If you’ve saved the $usageData variable as covered in the previous section, you can confirm this by running the command $usageData.UsageAggregations.count.

What if there are more than 1000 results? You’re going to have to do a little more work.
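
The "little more work" is usually a paging loop. The sketch below is not the linked post's PowerShell. It is a generic Python illustration that assumes each call returns one page of results plus a continuation marker, with no marker meaning the last page. `fetch_page` and `make_fake_fetcher` are hypothetical stand-ins so the loop can run without touching Azure.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# One page of records plus an optional marker for the next page (assumed shape).
Page = Tuple[List[dict], Optional[str]]


def iter_all_records(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Keep requesting pages until the service stops returning a marker."""
    token: Optional[str] = None
    while True:
        items, token = fetch_page(token)
        yield from items
        if not token:
            break


def make_fake_fetcher(total: int, page_size: int = 1000) -> Callable[[Optional[str]], Page]:
    """Hypothetical in-memory service that caps each response at page_size."""
    records = [{"id": i} for i in range(total)]

    def fetch_page(token: Optional[str]) -> Page:
        start = int(token) if token else 0
        end = min(start + page_size, total)
        next_token = str(end) if end < total else None
        return records[start:end], next_token

    return fetch_page


if __name__ == "__main__":
    all_records = list(iter_all_records(make_fake_fetcher(2500)))
    print(len(all_records))
```

The same pattern carries over to any API with a per-call cap: save the marker from each response, pass it to the next call, and stop once none comes back.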

Knowing how much you’re spending is critical in an OpEx world like Azure or AWS.

Generating Anonymous Data

Daniel Hutmacher has a nice web API to generate fake customer data:

I’ve been working on a little gadget for a while now, and today I finally got around to completing it and so now I’ve published it for everyone to try out. It’s a web API (wait, wait, don’t go away – it’s for database people!) that creates a randomized list of names, addresses, etc.

In this post, I’ll show you how easy it is to use this service to anonymize a development or test database so you don’t have all that personally identifiable information floating around.

Read the whole thing and check out his service. Also, Daniel was the one who spurred me on to update the theme here to get rid of some problems, so you can thank him for that too.

Percentages of Totals in Snowflake

Koen Verbeeck shows how you can use the RATIO_TO_REPORT() function in Snowflake to determine the current row’s percentage of the total:

This episode talks about a new window function Snowflake recently introduced: RATIO_TO_REPORT. The function returns the ratio of the value of the current row to the sum of the values within the set. Or in other words, some sort of “percentage of total”. Nothing we couldn’t calculate before, but a bit of syntactic sugar so we don’t have to write two expressions.
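
To make the "ratio of the current row to the sum of the set" concrete, here is a small sketch of the arithmetic in Python rather than SQL. It is not Snowflake's implementation. The helper name mirrors the SQL function for readability, and returning None when the total is zero is a choice made for this sketch, not a statement about Snowflake's behavior.

```python
from typing import List, Optional


def ratio_to_report(values: List[float]) -> List[Optional[float]]:
    """Return each value divided by the sum of all values in the set.

    A zero total yields None for every row (a choice for this sketch).
    """
    total = sum(values)
    if total == 0:
        return [None] * len(values)
    return [v / total for v in values]


if __name__ == "__main__":
    sales = [100, 300, 600]  # made-up sample amounts
    for amount, share in zip(sales, ratio_to_report(sales)):
        print(f"{amount:>5} -> {share:.1%} of total")
```

The "two expressions" the excerpt mentions are visible here: the row value and the total over the set. The window function folds them into one call.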

Click through to see how to use it and a contrast with the ANSI SQL approach.

Naming Temporary Columns in DAX

Marco Russo and Alberto Ferrari team up to share a standard for naming temporary columns in DAX:

The formula works just fine, but it violates one of the golden rules of DAX code: you always prefix a column reference with its table name, and you never use the table name when referencing a measure. Therefore, when reading DAX code, [Sales Amt] is a measure reference, whereas 'Product'[Sales Amt] is a column reference.

Nevertheless, in our DAX example ProdSalesAmt is a column of a temporary table (SalesByProduct) created by the FilteredSalesAmount measure. As such, ProdSalesAmt is a temporary column that does not originate from any column in the model and does not have a table name you can use as a prefix. This situation creates ambiguity in the code: it is not easy to discriminate between a column reference and a measure reference. Therefore, the code is harder to read and more error prone.

Read on for their standard, which is pretty easy to follow.

Azure Data Studio Process Explorer

Dave Bland shows us the process explorer in Azure Data Studio:

Notice that the pids all point to the azuredatastudio.exe processes. Azure Data Studio provides just a bit more information than Task Manager. Please be careful changing the state of a service. In other words, be careful stopping a process unless it is a last-resort approach to fixing an issue.

The first thing I thought when looking at it wasn’t the Task Manager; it was Chrome’s process explorer.
