The Use And Misuse Of P Values

John Mount and Nina Zumel explain what p-values are and how people routinely misuse them:

The many things I happen to have issues with in common mis-use of p-values include:

  1. p-hacking. This includes censored data bias, repeated measurement bias, and even outright fraud.

  2. “Statsmanship” (the deliberate use of statistical terminology for obscurity, not for clarity). For example: saying p instead of saying what you are testing such as “significance of a null hypothesis”.

  3. Logical fallacies. This is the (false) claim that p being low implies that the probability that your model is good is high. At best a low-p eliminates a null hypothesis (or even a family of them). But saying such disproof “proves something” is just saying “the butler did it” because you find the cook innocent (a simple case of a fallacy of an excluded middle).

  4. Confusion of population and individual statistics. This is the use of deviation of sample means (which typically decreases as sample size goes up) when deviation of individual differences (which typically does not decrease as sample size goes up) is what is appropriate . This is one of the biggest scams in data science and marketing science: showing that you are good at predicting aggregate (say, the mean number of traffic deaths in the next week in a large city) and claiming this means your model is good at predicting per-individual risk. Some of this comes from the usual statistical word games: saying “standard error” (instead of “standard error of the mean or population”) and “standard deviation” (“instead of standard deviation of individual cases”); with some luck somebody won’t remember which is which and be too afraid to ask.

Even if you know what p-values are, this is definitely worth reading, as it’s so easy to misuse p-values (even when I’m not on my Bayesian post hurling tomatoes at frequentists).

Using Multiple Cosmos DB APIs

Vincent-Philippe Lauzon shows how to access graph data stored in Cosmos DB using the DocumentDB API:

Now here’s a little secret:  although we choose the “model” (e.g. Gremlin) at the Cosmos DB account level, we can use other models to query the data.

Not all combination are possible, but many are.  Specifically, we can query a Gremlin graph using DocumentDB / SQL query language.

The graph is then projected into documents.

We will explore that in this article.

Why is that interesting?  Because there are a lot of tools out there we might be familiar with to manipulate DocumentDB (or MongoDB).  Having to possibility to look at a Graph with other APIs extends our toolset from Gremlin-based ones.

That is interesting.

Automatically E-Mailing R Reports

Tim Ali shows how to use SendGrid and Displayr to generate and e-mail out automated reports in R:

To automatically write and send email reports we need to have three tools:

  • A programming language. If the analysis is non-trivial, R is usually the best way forward.

  • An app that can automatically run the analyses at specified times. I’ve used Displayr. Of course, if you have the time you can avoid this commercial product and set up your own servers. (Disclaimer, I work for Displayr.) You can read this post to see how to get your data into Displayr and sign up to Displayr here if you do not already have an account.

  • An app to actually send the emails. I’ve chosen an email delivery application called SendGrid because R can work with its API.

To be honest, I’d probably use a different mechanism for deploying reports, but if you just need to put something together with a bit of chewing gum and bailing wire, the tools are there.  H/T R-Bloggers

Fiscal Year Columns In A Power BI Date Dimension

Reza Rad takes a date dimension in Power BI and adds fiscal year details:

As you can see in the image above; June 2017 considered as fiscal year 2017. However, July 2017 is part of fiscal year 2018. So the simple logic can be like this:

if (calendar month >= fiscal year start)

then fiscal year = calendar year

else fiscal year = calendar year + 1

This code is pseudo code. don’t write that exactly in M! Let’s now implement it in M;

If you have to deal with multiple fiscal years (e.g., state and federal government fiscal years), the process is the same, only repeated.

DAX Is Still Important

Reid Havens explains why he teaches DAX in his Power BI courses:

The Check Formula button is an easily overlooked feature. However, before I hit ok and save my DAX Measure I ALWAYS press this button first! But what exactly does this button do? Well it’s checking your DAX syntax and making sure everything is written correctly. Now you COULD simply hit OK after writing your DAX and see if it errors, this is true. However when doing that your data model is actually attempting to calculate the DAX measure in the background as well. Not a big deal with a few thousand rows, but if you’re working with a model that has millions of rows then that could take a long time for it to calculate, and then error!

The smart thing to do is to check your DAX syntax using the Check Formula button BEFORE hitting ok. Checking your DAX syntax doesn’t run your calculation and returns a rewarding No errors in formula output if everything was written correctly. Such a simple thing that can save you SO MUCH TIME! I highly recommend as a best practice to always use this before hitting ok and saving your measures, you’ll thank me later.

It makes for interesting reading.

Invalid Class Error Trying To Access WMI Class

Claudio Silva troubleshoots an error which gives the user a red herring:

This can return more than one line with different ComputerManagement (like ComputerManagement10). It depends on the versions you have installed on the host. The number “10” refers to the SQL Server 2008.
Now I can uncomment the last command and run it. The result is:

Get-CimInstance : Invalid class
At line:1 char:1
+ Get-CimInstance -CimSession $CIMsession -Namespace $(“rootMicrosoftSQLServerC …
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : MetadataError: (:) [Get-CimInstance], CimException
+ FullyQualifiedErrorId : HRESULT 0x80041010,Microsoft.Management.Infrastructure.CimCmdlets.GetCimInstanceCommand
+ PSComputerName : HOST001

Ok, a different error message. Let’s dig in it. I logged in on the host and confirmed that I have a SQL Server 2008 R2 instance installed. This means that I’m not accessing a lower version than 2005 like the initial warning message was suggesting.

Read the whole thing.

Parent-Child Relationships And Native Compilation

Ned Otter looks at different ways to insert data with parent-child relationships using natively compiled, memory-optimized procedures:

This blog post demonstrates various approaches when using native compilation to insert rows into parent/child tables.

First, let’s create tables named Parent and Child, and relate them with a FOREIGN KEY constraint. Note that the Parent table uses the IDENTITY property for the PRIMARY KEY column.

Ned’s first example uses @@IDENTITY and shows that this doesn’t work.  But there’s an easy version which is superior to @@IDENTITY and is supported:  SCOPE_IDENTITY().  Even outside of memory-optimized tables, you want to use SCOPE_IDENTITY() over @@IDENTITY anyhow because of the risk of triggers changing the “current” identity value.  Here’s a quick example I ginned up using SCOPE_IDENTITY:

CREATE PROCEDURE dbo.TestParentChild
@SomeChar CHAR(1),
@SomeChildChar CHAR(1)
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')	INSERT INTO dbo.Parent	(	SomeChar	)	VALUES	(	@SomeChar	);	INSERT INTO dbo.Child	(	ParentId,	SomeChildChar	)	VALUES	(	SCOPE_IDENTITY(),	@SomeChildChar	);	SELECT	p.Id,	p.SomeChar	FROM dbo.Parent p;	SELECT	c.Id,	c.ParentId,	c.SomeChildChar	FROM dbo.Child c;
EXEC dbo.TestParentChild	@SomeChar = 'A',	@SomeChildChar = 'B';
EXEC dbo.TestParentChild	@SomeChar = 'Z',	@SomeChildChar = 'Y';
EXEC dbo.TestParentChild	@SomeChar = 'W',	@SomeChildChar = 'L';

The results are what you’d expect.

More On Certificates Versus Trusted Assemblies

Solomon Rutzky compares using the new Trusted Assemblies functionality in SQL Server 2017 versus generating certificates for loading CLR objects:

Considering that “Trusted Assemblies” is entirely worthless within the context of the regular (i.e. non-Azure) SQL Server, does the same argument of “just use the existing Certificates and Module Signing functionality” also apply here?

Well, this just happens to be the one area where there might possibly be, in the worst-case scenario, some argument made for keeping this feature. The problem with applying the same Certificate / ADD SIGNATURE logic to Azure SQL Database is that you can’t create a signature-based Login (from either a Certificate or an Asymmetric Key) in that environment. Ouch! That is definitely a nail-in-the-coffin for the Certificate idea. Ok, so assuming that “Trusted Assemblies” would work in this scenario, is it an acceptable solution to the problem?

This has been a thought-provoking series so far.  If you agree with his conclusions, Solomon has a Connect item he’d like you to upvote.

Grouping Sets Of Tables In Biml ETL Loads

Ben Weissman puts together clusters of tables for data loads:

The table meta.containers could technically also be a temporary table. We’ve decided against that so you can see what’s happening behind the scenes.

Let’s focus on the meta.tables table for now. It has three columns:

– TableName – guess what we’ll store in there
– Container – this one will hold the information, which container we want this table to be loaded it, which will be automatically populated by our stored procedure
– Cost – this column will hold the load cost of this specific table. In our opinion, this should ideally be the average time it took SSIS to load this table in the recent past. If you don’t have that information available, it might as well something like the size of this table in either Gigabytes or Rows. The more accurate this column is, the better your results will be.

The only tricky part in Ben’s code is figuring out appropriate values for Cost, but if you’ve got rough timing measures or even good priors, you can get to a reasonable solution quickly.  And if time is of the essence, you can model, simulate, and apply results as part of an analytics project.

Chart Style Controls

Wolfgang Strasser shows off a new feature in Power BI:

The theme documentation provides a list of available visual names, cardNames and property names.

At this point some further explanation is needed for the hierarchy within the theme definition:

  • visualName corresponds to available PBI visuals like treeMap, card, columnChart,…

  • styleName (as of today I am not sure whereto this corresponds to PBI Desktop language.. :-)) maybe someone can further explain this to me

  • cardName corresponds to the formatting card/option within Power BI Desktop. Attention here: the name in the theme JSON file is defined different than the User Interface name + do not forget case-sensitivity! (i.e.  general => General; categoryAxis => X-Axis, valueAxis => Y-Axis, ..). See the documentation for the rest of the mapping.

This is good news if it makes it easier for developers to write CVD-friendly reports.


September 2017
« Aug