An Example of p-Hacking

Vincent Granville explains why using p-values for model-worthiness can lead you to a bad outcome:

Recently, p-values have been criticized and even banned by some journals, because they are used by researchers, who cherry-pick observations and repeat experiments until they obtain a p-value worth publishing to obtain grant money, get tenure, or for political reasons.  Even the American Statistical Association wrote a long article about why to avoid p-values, and what you should do instead: see here.  For data scientists, obvious alternatives include re-sampling techniques: see here and here. One advantage is that they are model-independent, data-driven, and easy to understand. 

Here we explain how the manipulation and treachery works, using a simple simulated data set consisting of purely random, non-correlated observations. Using p-values, you can tell anything you want about the data, even the fact that the features are highly correlated, when they are not. The data set consists of 16 variables and 30 observations, generated using the RAND function in Excel. You can download the spreadsheet here.

And for a more academic treatment of the problem, I love this paper by Andrew Gelman and Eric Loken, particularly because it points out that you don’t have to have malicious intent to end up doing the wrong thing.

Predicting Intermittent Demand

Bruno Rodrigues shows one technique for forecasting intermittent data:

Now, it is clear that this will be tricky to forecast. There is no discernible pattern, no trend, no seasonality… nothing that would make it “easy” for a model to learn how to forecast such data.

This is typical intermittent demand data. Specific methods have been developed to forecast such data, the most well-known being Croston, as detailed in this paper. A function to estimate such models is available in the {tsintermittent} package, written by Nikolaos Kourentzes who also wrote another package, {nnfor}, which uses Neural Networks to forecast time series data. I am going to use both to try to forecast the intermittent demand for the {RDieHarder} package for the year 2019.

Read the whole thing. H/T R-Bloggers

TensorFrames: Spark Plus TensorFlow

Adi Polak gives us an introduction to TensorFrames:

In all TensorFrames functionality, the DataFrame is sent together with the computations graph. The DataFrame represents the distributed data, meaning in every machine there is a chunk of the data that will go through the graph operations/ transformations. This will happen in every machine with the relevant data. Tungsten binary format is the actual binary in-memory data that goes through the transformation, first to Apache Spark Java object and from there it is sent to TensorFlow Jave API for graph calculations. This all happens in the Spark Worker process, the Spark worker process can spin many tasks which mean various calculation at the same time over the in-memory data.

An interesting bit of turnabout here is that the Scala API is the underdeveloped one; normally for Spark, the Python API is the Johnny-Come-Lately version.

Linear Regression With Python In Power BI

Emanuele Meazzo builds a linear regression in Power BI using a Python visual:

As a prerequisite, of course, you’ll need to have python installed in your machine, I recommend having an external IDE like Visual Studio Code to write your Python code as the PowerBI window offers zero assistance to coding.

You can follow this article in order to configure Python Correctly for PowerBI.

Step 2 is to add a Python Visual to the page, and let the magic happen.

Click through for the step-by-step instructions, including quite a bit of Python code and a few warnings and limitations.

T-SQL Tuesday 115 Roundup

Mohammad Darab has a roundup for this month’s T-SQL Tuesday:

It was an absolute honor to host this month’s TSQL Tuesday. I decided on doing the “Dear 20 year old self” as a way for us to reflect on life. It seemed like this topic hit home with a lot of people. I enjoyed reading each one of the posts.

If you don’t find your post in this Round Up, please email me your link and I will update this post!

There were 14 responses this month; click through for the full set.

Testing Inline Scalar UDF Performance

Erik Darling whips up a performance test covering scalar UDF changes in SQL Server 2019:

This is a slightly different take on yesterday’s post, which is also a common problem I see in queries today.

Someone wrote a function to figure out if a user is trusted, or has the right permissions, and sticks it in a predicate — it could be a join or where clause.

If you do need to use scalar UDFs, SQL Server 2019 is a big step forward.

Making Non-SARGable Queries SARGable with an Index

Denis Gobo violates Betteridge’s Law of Headlines:

This question came up the other day from a co-worker, he said he couldn’t change a query but was there a way of making the same query produce a better plan by doing something else perhaps (magic?)

He said his query had a WHERE clause that looked like the following

WHERE RIGHT(SomeColumn,3) = '333'

I then asked if he could change the table, his answer was that he couldn’t mess around with the current columns but he could add a column

Click through to see how Denis was able to solve this problem.

Containers and Data

Randolph West argues that you should keep data and containers separated:

Where it gets interesting is that the SQL Server container is also where the database files are stored by default. I raised a point (which Grant and others have already noted in the past) that persisted storage volumes allow us to throw away a SQL Server container without throwing away the database files, provided that the container is set up to use that persisted storage.

For example, I can map the SQL Server container to a local directory on my Ubuntu or Windows Server, or — as is the case with Kubernetes — a second container can serve as the storage volume. SQL Server is then just a compute engine, or a “service” as Anthony points out in the Twitter thread.

Because every rule has a counter-example (even this one), there are cases when you do want the data to live with the container. For example, a test database for a unit test runner should probably live inside the container rather than being a persisted volume. The reason is that you can blow away the database after a test run and start over for the next run. Putting together a database for a hackathon or user group can be another exception for the same reason. But for pretty much every other purpose, I’d rather have a persisted volume.

Learning R Versus Python

Andy Kirk shares the results of a rather informal Twitter poll:

Yesterday I ran a simple Twitter poll about the relative ease of learning R vs. Python. Although a correct answer to this query will ALWAYS have to be based on nuances like pre-existing skills and the scope of need, this originates from people telling me they encounter job or career profiles that list a need for R and/or Python. If they don’t have either, if they prioritised the pursuit of just one, which would be possible to develop a degree of competency more easily, more quickly and more efficiently?

Andy has also created a Twitter moment from the responses.

My thought, based only on the question itself, is that R would be better than Python because the hypothetical person has no additional programming skills. For someone with additional programming skills, the breakdown for me starts with, if your background is statistics, database development, or functional programming, you probably want R; if your background is object-oriented development or imperative programming, you probably want Python. And then it gets nuanced.

Using the ML.NET Model Builder

I have a post looking at the ML.NET Model Builder:

You have four options from which to choose: two-class classification, multi-class classification, regression, or Choose Your Own Adventure. Today, we’re going to create a two-class classification model. Incidentally, they’re not kidding about things changing in preview—last time I looked at this, they didn’t have multi-class classifiers available.

Once you select Sentiment Analysis (that is, two-class classification of text), you can figure out how to feed data to this trainer.

I think this is fine for developers who are looking to add a machine learning component as a small part of a bigger product. I don’t think it will beat a trained human using R or Python, but it’s an interesting avenue.

Categories

June 2019
MTWTFSS
« May  
 12
3456789
10111213141516
17181920212223
24252627282930