
November 15, 2017

Interpreting P-Value Histograms

David Robinson visualizes and interprets different p-value histograms:

So you’re a scientist or data analyst, and you have a little experience interpreting p-values from statistical tests. But then you come across a case where you have hundreds, thousands, or even millions of p-values. Perhaps you ran a statistical test on each gene in an organism, or on demographics within each of hundreds of counties. You might have heard about the dangers of multiple hypothesis testing before. What’s the first thing you do?

Make a histogram of your p-values. Do this before you perform multiple hypothesis test correction, false discovery rate control, or any other means of interpreting your many p-values. Unfortunately, for some reason, this basic and simple task rarely gets recommended (for instance, the Wikipedia page on the multiple comparisons problem never once mentions this approach). This graph lets you get an immediate sense of how your test behaved across all your hypotheses, and immediately diagnose some potential problems. Here, I’ll walk you through a basic example of interpreting a p-value histogram.

It’s a fun read and informative as well.
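If you want to see the shapes David describes without real data, a quick simulation works. Here is a minimal sketch of my own (not from the post) that mixes true nulls with genuine effects, giving the classic uniform floor plus a spike near zero:

# simulate p-values from many two-sample t-tests
set.seed(42)
null_p <- replicate(9000, t.test(rnorm(20), rnorm(20))$p.value)            # true nulls: roughly uniform
alt_p  <- replicate(1000, t.test(rnorm(20), rnorm(20, mean = 1))$p.value)  # real effects: pile up near zero
hist(c(null_p, alt_p), breaks = 40, main = "P-value histogram", xlab = "p-value")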


A Compendium Of R Errors

Sumendar Karupakala has a bunch of errors you might find in R, as well as their explanations and fixes:

# pull out the animals which are dogs
animaldata[animaldata$Animal.Type == "Dog"] # throws an error
Error in `[.data.frame`(animaldata, animaldata$Animal.Type == "Dog"): undefined columns selected
Traceback:
1. animaldata[animaldata$Animal.Type == "Dog"]
2. `[.data.frame`(animaldata, animaldata$Animal.Type == "Dog")
3. stop("undefined columns selected")

# fixed error
animaldata[animaldata$Animal.Type == "Dog", ] # missed out the comma within the brackets

Some of these are basic syntax errors; others are a bit nastier.
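The dog-filtering error above is worth internalizing: with a single argument inside the brackets, a data frame is indexed by column, so a long logical vector of row matches gets interpreted as column positions. A tiny illustration of my own on a built-in data set:

# one argument inside [ ] selects columns, so a row filter fails
mtcars[mtcars$cyl == 4]    # error: undefined columns selected
# the comma switches to [rows, columns] indexing
mtcars[mtcars$cyl == 4, ]  # rows where cyl is 4, all columns
subset(mtcars, cyl == 4)   # equivalent, and harder to get wrong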


Microservices With Kafka Streams

Ben Stopford walks us through a microservices architecture built on top of Kafka:

So we can use the Kafka Streams API to piece together complex business systems as a collection of asynchronously executing, event-driven services. The differentiator here is the API itself, which is far richer than, say, the Kafka Producer or Consumer. It makes code more readable, provides reusable implementations of common patterns like joins, aggregates, and filters and wraps the whole ecosystem with a transparent level of correctness.

Systems built in this way, in the real world, come in a variety of guises. They can be fine-grained and fast-executing, completing in the context of an HTTP request, or complex and long-running, manipulating the stream of events that map a whole company’s business flow. This post focusses on the former, building up a real-world example of a simple order management system that executes within the context of an HTTP request, and is entirely built with Kafka Streams. Each service is a small function, with well-defined inputs and outputs. As we build this ecosystem up, we will encounter problems such as blending streams and tables, reading our own writes, and managing consistency in a distributed and autonomous environment.

This post stays high-level and covers a lot of ground.  I’m wishy-washy on the idea of microservices, but if you are going to do them, it’s better to do them right.


Validating Database Mail

Frank Gill has a script to validate your Database Mail settings:

In my last post, I shared a script to automate the migration of SQL Server Database Mail settings. In this post, I show how to send test e-mails from all Database Mail profiles on an instance. The migration I was working on contained 21 Database Mail profiles. The following script will send a test e-mail from each profile to confirm successful configuration. I hope you can put this code to use in your migrations.

Click through for the script.
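If you want the flavor of what such a check involves, here is a rough sketch of my own (not Frank’s script), driven from R through DBI and odbc; the server name and recipient address are placeholders:

# connect to the instance and enumerate Database Mail profiles
library(DBI)
con <- dbConnect(odbc::odbc(), Driver = "ODBC Driver 17 for SQL Server",
                 Server = "localhost", Database = "msdb", Trusted_Connection = "yes")
profiles <- dbGetQuery(con, "SELECT name FROM dbo.sysmail_profile;")

# send one test message per profile
for (p in profiles$name) {
  dbExecute(con, sprintf(
    "EXEC dbo.sp_send_dbmail
       @profile_name = N'%s',
       @recipients   = N'dba@example.com',
       @subject      = N'Database Mail test: %s',
       @body         = N'Test message sent during migration validation.';", p, p))
}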


The GDPR And You

William Brewer has some Q&A regarding the General Data Protection Regulation:

The General Data Protection Regulation (GDPR) will affect organisations in countries around the world, not just those in Europe. The GDPR regulates how personal data is stored, moved, handled, and destroyed. Not following the regulation will lead to dire consequences for your organisation. As a data professional or developer, you may have many questions and might be wondering how it will affect the way you will do your job. William Brewer answers common questions about the GDPR that you were too shy to ask.

Grant Fritchey summarizes the rules:

Ever heard of the General Data Protection Regulation? If not, go and read the Wiki. I’ll wait.

I can already hear what you’re thinking. “Grant, this doesn’t apply to me because my company is in the <insert non-EU country here>.” How do I know you’re thinking that? Because every single person with whom I’ve brought this up has had the same response. You might want to go back and re-read it.

As a data professional, you’re going to want to know about this regulation.


Database Projects: Helping Find Obsolete References

Jan Mulkens explains some of those “unresolved reference” warnings in SQL Server Data Tools database projects:

If you’re developing databases in SSDT, like you should, you’re probably getting a lot of build warnings.
One of the warnings you’ll see most often is the “unresolved reference”.
Usually you solve these by adding either the master, the msdb, or some application database as a database reference.
This post is about a warning you might get when out of habit (or because, like me, you didn’t know any better yet) you’re using old system views like sys.sysprocesses. You expect it to work but it simply doesn’t…

Worth reading the whole thing, as well as keeping up-to-date with your DMV and system view usage.


Jitter In Power BI Charts

Rob Collie shows how to incorporate jitter in Power BI scatter charts:

Now, sometimes you may WANT multiple rows to combine into one dot, but in this particular case, I want to see each row of my source data as its own dot.

When adding a new calculated column, there are LOTS of ways to uniquely “stamp” each row with its own distinct value.  I could do this in DAX, but it would require concatenating/combining enough columns together (in this case, probably [Game #], [Qtr], and [Time], since no two rows can “happen” at the same time in the same game).

But for other reasons that you will see shortly, I need the unique identifier to be a number, and I don’t want to go through the contortions of converting text values to numeric. Plus, as you can see, the data is incomplete in the [Time] column (lots of blanks).

There’s a lot here, and the end result is a great addition to your Power BI toolbelt.  But as I’m reading Rob’s post, I’m thinking about how much easier it is to do some of this with ggplot2.
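For comparison, here is roughly what the ggplot2 version of jittering looks like; this is a generic sketch on a built-in data set, not Rob’s Power BI solution:

library(ggplot2)
# spread overlapping points horizontally while keeping the y values honest
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.6)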


Finding The Right Batch Size For Bulk Loads

Dan Guzman has some bulk load batch size considerations:

Bulk load has long been the fastest way to mass insert rows into a SQL Server table, providing orders of magnitude better performance compared to traditional INSERTs. SQL Server database engine bulk load capabilities are leveraged by T-SQL BULK INSERT, INSERT…SELECT, and MERGE statements as well as by SQL Server client APIs like ODBC, OLE DB, ADO.NET, and JDBC. SQL Server tools like BCP and components like SSIS leverage these client APIs to optimize insert performance.

SQL Server 2016 and later improves performance further by turning on bulk load context and minimal logging by default when bulk loading into SIMPLE and BULK LOGGED recovery model databases, which previously required turning on trace flags as detailed in this blog post by Parikshit Savjani of the MSSQL Tiger team. That post also includes links to other great resources that thoroughly cover minimal logging and data loading performance, which I recommend you peruse if you use bulk load often. I won’t repeat all that information here but do want to call attention to the fact that these new bulk load optimizations can result in much more unused space when a small batch size is used compared to SQL Server 2014 and older versions.

Click through for some tips.
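For reference, the knob in question is the BATCHSIZE option on BULK INSERT, and TABLOCK is what allows the minimally logged path Dan mentions. A generic sketch of my own, run from R over the same kind of DBI/odbc connection as in the Database Mail example above, with made-up table and file names:

# con: an open DBI/odbc connection to the instance
# each 100,000-row batch commits separately; TABLOCK enables bulk load context
DBI::dbExecute(con, "
  BULK INSERT dbo.SalesStaging
  FROM 'C:\\load\\sales.csv'
  WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', BATCHSIZE = 100000, TABLOCK);")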


Searching Stored Procedures And Ad Hoc Queries

Bert Wagner has a couple queries to help you find references in T-SQL objects, as well as ad hoc statements which are currently in the plan cache:

Have you ever wanted to find something that was referenced in the body of a SQL query?

Maybe you need to know what queries you will have to modify for an upcoming table rename. Or maybe you want to see how many queries on your server are running SELECT *.

Below are two templates you can use to search across the text of SQL queries on your server.

Click through for the scripts.  Finding references in T-SQL objects (views, procedures, functions, triggers, etc.) is a fairly straightforward process.  Finding references in ad hoc statements is much more hit-or-miss.
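If you want a feel for the two halves before clicking through, the rough shape of each query looks like this (a generic sketch of my own, not Bert’s templates), again assuming an open DBI/odbc connection con and a made-up search string:

# 1. references inside object definitions (views, procedures, functions, triggers)
DBI::dbGetQuery(con, "
  SELECT OBJECT_SCHEMA_NAME(object_id) AS schema_name,
         OBJECT_NAME(object_id)        AS object_name
  FROM sys.sql_modules
  WHERE definition LIKE '%OldTableName%';")

# 2. ad hoc statements still sitting in the plan cache (hit-or-miss by nature)
DBI::dbGetQuery(con, "
  SELECT st.text
  FROM sys.dm_exec_query_stats AS qs
  CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
  WHERE st.text LIKE '%SELECT *%';")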
