Defining Result Sets With ML Services

Dave Mason covers a pain point in SQL Server Machine Learning Services:

The example above is so simple, defining the RESULT SETS poses no problems. But what if the format of the output isn’t known at design time? R (or Python) might take the input data set and add, remove, or change columns conditionally. Further, the input data set might not even be known at design time. How would you define the RESULT SETS at run time?

WITH RESULT SETS needs a MAKE_A_GUESS or FIGURE_IT_OUT option. If there’s some other type of “easy button” for this, I haven’t found it.

It would be nice if the service could the ability to read the data frame columns and use those by default.

Using The GROUPING SETS Operator

Alfonso Hernandez goes into detail with what you can do with GROUPING SETS:

In T-SQL, you summarize data by using the GROUP BY clause within an aggregate query. This clause creates groupings which are defined by a set of expressions. One row per unique combination of the expressions in the GROUP BY clause is returned, and aggregate functions such as COUNT or SUMmay be used on any columns in the query. However, if you want to group the data by multiple combinations of group by expressions, you may take one of two approaches. The first approach is to create one grouped query per combination of expressions and merge the results using the UNION ALLoperator. The other approach is to use the GROUPING SETS operator along with the GROUP BY clause and define each grouping set within a single query.

In this article I’ll demonstrate how to achieve the same results using each method.

Mastering GROUPING SETS makes reporting queries in T-SQL so much more effective.

Thoughts On Dynamic Data Masking

Kellyn Pot’vin-Gorman talks about Dynamic Data Masking:

Anyone with the unmask privilege or DB_OWNER will be able to view the data.  As many development and testing environments grant higher privileges to the users and in SQL Server, it’s not rare for a developer to be the DB_OWNER, (I used to come across this all the time when recoveries were performed by the wrong OS user) this leaves this data still quite vulnerable.  I do like that if you were to take a backup and recover it with masking, the obfuscated data is what is recovered physically.  I’m more concerned about those odd environments where compliance hasn’t been put in place on owners of the database that would still view the originally masked data, but unmasked.

Performance isn’t impacted, (i.e. no referential integrity concerns or execution plans) as the optimizer  performs all steps against the real data, which leads me to wonder what happens with some of the newer monitoring tools that state they can display SQL and bind variable data without accessing the database directly.  Would they “sniff” the masked data or unmasked?  Would it matter who the OS User or roles in the database?

The important thing here is that DDM isn’t really a security product.  It’s a something-or-another product that might be useful to stop shoulder surfing but pretty much nothing else.

Interpreting P-Value Histograms

David Robinson visualizes and interprets different p-value histograms:

So you’re a scientist or data analyst, and you have a little experience interpreting p-values from statistical tests. But then you come across a case where you have hundreds, thousands, or even millions of p-values. Perhaps you ran a statistical test on each gene in an organism, or on demographics within each of hundreds of counties. You might have heard about the dangers of multiple hypothesis testing before. What’s the first thing you do?

Make a histogram of your p-values. Do this before you perform multiple hypothesis test correction, false discovery rate control, or any other means of interpreting your many p-values. Unfortunately, for some reason, this basic and simple task rarely gets recommended (for instance, the Wikipedia page on the multiple comparisons problem never once mentions this approach). This graph lets you get an immediate sense of how your test behaved across all your hypotheses, and immediately diagnose some potential problems. Here, I’ll walk you through a basic example of interpreting a p-value histogram.

It’s a fun read and informative as well.

A Compendium Of R Errors

Sumendar Karupakala has a bunch of errors you might find in R, as well as their explanations and fixes:

#pull out the animals which are dogs
animaldata[animaldata$Animal.Type == “Dog” ] # throuws an error
Error in `[.data.frame`(animaldata, animaldata$Animal.Type == “Dog”): undefined columns selected
Traceback:
1. animaldata[animaldata$Animal.Type == “Dog”]
2. `[.data.frame`(animaldata, animaldata$Animal.Type == “Dog”)
3. stop(“undefined columns selected”)
In [8]:
#fixed error
animaldata[animaldata$Animal.Type == “Dog”, ] # missedout comma with in the bracket

Some of it is basic syntax; others are a bit nastier.

Microservices With Kafka Streams

Ben Stopford walks us through a microservices architecture built on top of Kafka:

So we can use the Kafka Streams API to piece together complex business systems as a collection of asynchronously executing, event-driven services. The differentiator here is the API itself, which is far richer than, say, the Kafka Producer or Consumer. It makes code more readable, provides reusable implementations of common patterns like joins, aggregates, and filters and wraps the whole ecosystem with a transparent level of correctness.

Systems built in this way, in the real world, come in a variety of guises. They can be fine grained and fast executing, completing in the context of an HTTP request, or complex and long-running, manipulating the stream of events that map a whole company’s business flow. This post focusses on the former, building up a real-world example of a simple order management system that executes within the context of a HTTP request, and is entirely built with Kafka Streams. Each service is a small function, with well-defined inputs and outputs. As we build this ecosystem up, we will encounter problems such as blending streams and tables, reading our own writes, and managing consistency in a distributed and autonomous environment.

This post stays high-level and covers a lot of ground.  I’m wishy-washy on the idea of microservices, but if you are going to do them, it’s better to do them right.

Validating Database Mail

Frank Gill has a script to validate that your database mail settings are valid:

In my last post, I shared a script to automate the migration of SQL Server Database Mail settings. In this post, I show how to send test e-mails from all Database Mail profiles on an instance. The migration I was working on contained 21 Database Mail profiles. The following script will send a test e-mail from each profile to confirm successful configuration. I hope you can put this code to use in your migrations.

Click through for the script.

The GDPR And You

William Brewer has some Q&A regarding the General Data Protection Regulation:

The General Data Protection Regulation (GDPR) will affect organisations in countries around the world, not just those in Europe. The GDPR regulates how personal data is stored, moved, handled, and destroyed. Not following the regulation will lead to dire consequences for your organisation. As a data professional or developer, you may have many questions and might be wondering how it will affect the way you will do your job. William Brewer answers common questions about the GDPR that you were too shy to ask.

Grant Fritchey summarizes the rules:

Ever heard of the General Data Protection Regulation? If not, go and read the Wiki. I’ll wait.

I can already hear what you’re thinking. “Grant, this doesn’t apply to me because my company is in the <insert non-EU country here>.” How do I know you’re thinking that? Because every single person with whom I’ve brought this up has had the same response. You might want to go back and re-read it.

As a data professional, you’re going to want to know about this regulation.

Database Projects: Helping Find Obsolete References

Jan Mulkens explains some of those “unresolved reference” warnings in SQL Server Data Tools database projects:

if you’re developing databases in SSDT, like you should, you’re probably getting a lot of build warnings.
One of the warnings you’ll see the most often is the “unresolved reference”.
Usually you solve these by adding either the master, the msdb or some application database as a database reference.
This post is about a warning you might get when out of habit (or, if like me, you didn’t know any better yet) you’re using old system views like sys.sysprocesses. You expect it to work but it simply doesn’t…

Worth reading the whole thing, as well as keeping up-to-date with your DMV and system view usage.

Jitter In Power BI Charts

Rob Collie shows how to incorporate jitter in Power BI scatter charts:

Now, sometimes you may WANT multiple rows to combine into one dot, but in this particular case, I want to see each row of my source data as its own dot.

When adding a new calculated column, there are LOTS of ways to uniquely “stamp” each row with its own distinct value.  I could do this in DAX, but it would require concatenating/combining enough columns together (in this case, probably [Game #], [Qtr], and [Time], since no two rows can “happen” at the same time in the same game.

But for other reasons that you will see shortly, I need the unique identifier to be a number, and I don’t want to go through the contortions of converting text values to numeric, plus as you can see, the data is incomplete in the [Time] column (lots of blanks).

There’s a lot here, and the end result is a great addition to your Power BI toolbelt.  But as I’m reading Rob’s post, I’m thinking about how much easier it is to do some of this with ggplot2.

Categories

November 2017
MTWTFSS
« Oct  
 12345
6789101112
13141516171819
20212223242526
27282930