
Curated SQL Posts

Building Dynamic Row Headers With ML Services

Dave Mason tries to get around his RESULT SETS limitation when using SQL Server Machine Learning Services:

The columns in the data frame clearly have names, but SQL Server isn’t using them. The data frame columns have types in R too (more on this in a moment). Now that makes me wonder about the data types for the data returned by SQL. How is that determined? If SQL isn’t using the column names, can I assume it isn’t making use of the R column types either?

For a point of reference, let’s run some more R code to show the column names and types. As before, the rvest package is used to scrape a web page, with each HTML <table> found becoming a data frame in the “tables” list (line 3). A data frame of table metadata is created by calling data.frame(). The first parameter is a vector of column names (line 4), the second parameter is a vector of column classes (line 5), and the third parameter causes the row “names” to be incrementing digits (line 6).

This is a work in progress as Dave continues his series.
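
As a rough illustration of the behavior Dave describes (my own minimal sketch with made-up data, not his rvest example), the R data frame below has named, typed columns, but SQL Server ignores them; the resultset only gets column names because WITH RESULT SETS supplies its own:

-- Hypothetical stand-in for a table-metadata data frame: the R column names
-- and types are not carried through to SQL Server on their own.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # hypothetical data, for illustration only
        OutputDataSet <- data.frame(
            TableName   = c("Orders", "Customers"),
            ColumnCount = c(12L, 8L),
            stringsAsFactors = FALSE)'
WITH RESULT SETS ((TableName nvarchar(128), ColumnCount int));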

Comments closed

Getting Started With Zeppelin

Sangeeta Gulia shows us how to get started building notebooks with Apache Zeppelin on top of Spark:

There are 3 interpreter modes available in Zeppelin.

1) Shared Mode

In Shared mode, a single SparkContext and a single Scala REPL are shared among all interpreters in the group, so every Note shares that one SparkContext and REPL. In this mode, if NoteA defines a variable ‘a’, then NoteB is able not only to read variable ‘a’ but also to override it.

2) Scoped Mode

In Scoped mode, each Note has its own Scala REPL, so a variable defined in one Note cannot be read or overridden in another Note. However, a single SparkContext still serves all the interpreter groups: all jobs are submitted to that SparkContext, and the fair scheduler schedules them. This can be useful when a user does not want to share a Scala session but does want to keep a single Spark application and leverage its fair scheduler.

3) Isolated Mode

In Isolated mode, each Note has its own SparkContext and Scala REPL.

The default mode of the %spark interpreter is ‘Globally Shared’.

This is mostly a step-by-step on installing Zeppelin, but does go into some detail on how Zeppelin works.

Comments closed

Everyone’s Data Is Dirty

Chirag Shivalker hits the highlights on dirty data:

It might sound a bit abrupt, but clean data is a myth. If your data is dirty, so is everyone else’s. Enterprises are more than dependent on data these days, and that will remain the case in the coming years. They need to collect data in order to analyze it, and that data will not necessarily be 100% clean, pristine, or perfect.

Nearly all companies face the challenge of dirty data in the form of duplicates, incorrect fields, and missing values. This happens due to an omnichannel influx of data, followed by hundreds, if not thousands, of employees wrestling and torturing that data to derive professional outcomes and insights. Don’t forget that even the best data has a tendency to decay within a few weeks.

The saying goes that any analytics project is about 80% data cleansing and feature extraction.  I’d say that number’s probably closer to 90-95%, and dirty data is a big part of that.

Comments closed

Query Store Plan Forcing: You Can’t Always Get What You Want

Kendra Little shows an example where trying to force a Query Store plan results in an oddity:

This is not considered a “failure”

When I check the Query Store DMVs, force_failure_count is 0. The last_force_failure_reason_desc is NONE.

Query Store didn’t fail to apply the narrow plan. Instead, it’s just deciding not to give it to me, now that I’ve forced that plan.

Seems kinda like an adolescent, doesn’t it?

The answer remains a bit of a mystery, but read on to see how Kendra troubleshoots this.
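
If you want to check the forcing state yourself, the pieces are a forced plan plus the Query Store plan catalog view (the query_id and plan_id values below are placeholders; pull your own from Query Store):

-- Placeholder IDs; substitute values from sys.query_store_query / sys.query_store_plan.
EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 73;

-- In Kendra's scenario this still reports no failure: force_failure_count stays
-- at 0 and last_force_failure_reason_desc is NONE, even though the forced plan
-- is not the one being handed out.
SELECT plan_id,
       is_forced_plan,
       force_failure_count,
       last_force_failure_reason_desc
FROM sys.query_store_plan
WHERE query_id = 42;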

Comments closed

What To Do With A Database In Source Control

Ed Elliott with Database Source Control 102:

This post is for a specific type of person. If you:

  • Are new to source control
  • Are getting started on your path to continuous delivery nirvana
  • Have been able to get your database into some sort of source control system
  • Have more than one person checking in code to the database source
  • Are unsure what to do next

Then this post is for you!

This is a nice post with some next-steps for when you have a database in source control but aren’t quite sure what to do next.

Comments closed

Go Download SQL Operations Studio

The SQL Server Team has announced SQL Operations Studio:

We are excited to announce that SQL Operations Studio is now available in preview. SQL Operations Studio is a free, light-weight tool for modern database development and operations for SQL Server on Windows, Linux and Docker, Azure SQL Database and Azure SQL Data Warehouse on Windows, Mac or Linux machines.

Download SQL Operations Studio to get started.

It’s not SSMS, but it is cross-platform.  And I think that over time, it will end up being better than SSMS.

Comments closed

Defining Result Sets With ML Services

Dave Mason covers a pain point in SQL Server Machine Learning Services:

The example above is so simple, defining the RESULT SETS poses no problems. But what if the format of the output isn’t known at design time? R (or Python) might take the input data set and add, remove, or change columns conditionally. Further, the input data set might not even be known at design time. How would you define the RESULT SETS at run time?

WITH RESULT SETS needs a MAKE_A_GUESS or FIGURE_IT_OUT option. If there’s some other type of “easy button” for this, I haven’t found it.

It would be nice if the service had the ability to read the data frame columns and use those by default.
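
To make the pain point concrete, here is a minimal sketch (using R's built-in iris data rather than Dave's example): the column list has to be spelled out at design time, and if the script conditionally changes the shape of its output, the definition no longer matches.

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(
                    SepalLength = iris$Sepal.Length,
                    Species     = as.character(iris$Species),
                    stringsAsFactors = FALSE)'
WITH RESULT SETS ((SepalLength float, Species nvarchar(30)));
-- If the R script later adds, drops, or retypes a column, this WITH RESULT SETS
-- definition is wrong and the call fails; there is no "figure it out" option.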

Comments closed

Using The GROUPING SETS Operator

Alfonso Hernandez goes into detail with what you can do with GROUPING SETS:

In T-SQL, you summarize data by using the GROUP BY clause within an aggregate query. This clause creates groupings which are defined by a set of expressions. One row per unique combination of the expressions in the GROUP BY clause is returned, and aggregate functions such as COUNT or SUM may be used on any columns in the query. However, if you want to group the data by multiple combinations of group by expressions, you may take one of two approaches. The first approach is to create one grouped query per combination of expressions and merge the results using the UNION ALL operator. The other approach is to use the GROUPING SETS operator along with the GROUP BY clause and define each grouping set within a single query.

In this article I’ll demonstrate how to achieve the same results using each method.

Mastering GROUPING SETS makes reporting queries in T-SQL so much more effective.
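
As a quick sketch of the two approaches the article walks through (the table and column names here are hypothetical):

-- Hypothetical dbo.Sales (Region, Product, Amount) table.

-- Approach 1: one grouped query per combination, merged with UNION ALL.
SELECT Region, NULL AS Product, SUM(Amount) AS TotalAmount
FROM dbo.Sales
GROUP BY Region
UNION ALL
SELECT NULL AS Region, Product, SUM(Amount) AS TotalAmount
FROM dbo.Sales
GROUP BY Product;

-- Approach 2: the same result from a single query with GROUPING SETS.
SELECT Region, Product, SUM(Amount) AS TotalAmount
FROM dbo.Sales
GROUP BY GROUPING SETS ((Region), (Product));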

Comments closed

Thoughts On Dynamic Data Masking

Kellyn Pot’vin-Gorman talks about Dynamic Data Masking:

Anyone with the unmask privilege or DB_OWNER will be able to view the data.  Many development and testing environments grant higher privileges to users, and in SQL Server it’s not rare for a developer to be the DB_OWNER (I used to come across this all the time when recoveries were performed by the wrong OS user), so this data is still quite vulnerable.  I do like that if you were to take a backup and recover it with masking, the obfuscated data is what is recovered physically.  I’m more concerned about those odd environments where compliance hasn’t been put in place for database owners, who would still view the originally masked data, but unmasked.

Performance isn’t impacted (i.e., no referential integrity concerns or execution plan changes), as the optimizer performs all steps against the real data, which leads me to wonder what happens with some of the newer monitoring tools that state they can display SQL and bind variable data without accessing the database directly.  Would they “sniff” the masked data or the unmasked data?  Would it matter who the OS user is, or what roles they hold in the database?

The important thing here is that DDM isn’t really a security product.  It’s a something-or-another product that might be useful to stop shoulder surfing but pretty much nothing else.
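
For reference, the setup under discussion looks something like this (table, column, and user names are hypothetical); db_owner members and anyone granted UNMASK still see the real values, which is exactly the gap Kellyn calls out:

-- Hypothetical table and column: mask an email column with the built-in
-- email() masking function.
ALTER TABLE dbo.Customers
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Anyone granted UNMASK (and any db_owner member) still sees the real data.
GRANT UNMASK TO SomeAnalyst;
REVOKE UNMASK FROM SomeAnalyst;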

1 Comment

Interpreting P-Value Histograms

David Robinson visualizes and interprets different p-value histograms:

So you’re a scientist or data analyst, and you have a little experience interpreting p-values from statistical tests. But then you come across a case where you have hundreds, thousands, or even millions of p-values. Perhaps you ran a statistical test on each gene in an organism, or on demographics within each of hundreds of counties. You might have heard about the dangers of multiple hypothesis testing before. What’s the first thing you do?

Make a histogram of your p-values. Do this before you perform multiple hypothesis test correction, false discovery rate control, or any other means of interpreting your many p-values. Unfortunately, for some reason, this basic and simple task rarely gets recommended (for instance, the Wikipedia page on the multiple comparisons problem never once mentions this approach). This graph lets you get an immediate sense of how your test behaved across all your hypotheses, and immediately diagnose some potential problems. Here, I’ll walk you through a basic example of interpreting a p-value histogram.

It’s a fun read and informative as well.
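
David’s examples are in R, but if your p-values happen to live in a table, you can get the same quick look with a bit of T-SQL bucketing (the table and column names here are made up):

-- Hypothetical dbo.TestResults(PValue) table, one row per hypothesis test.
-- Bin the p-values into 20 buckets of width 0.05 and count each bucket;
-- roughly uniform counts with a spike in the lowest bucket is the healthy
-- "some real effects" shape David describes.
SELECT FLOOR(PValue * 20) / 20.0 AS BucketStart,
       COUNT(*)                  AS NumTests
FROM dbo.TestResults
GROUP BY FLOOR(PValue * 20) / 20.0
ORDER BY BucketStart;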

Comments closed