Press "Enter" to skip to content

Using The GROUPING SETS Operator

Alfonso Hernandez goes into detail on what you can do with GROUPING SETS:

In T-SQL, you summarize data by using the GROUP BY clause within an aggregate query. This clause creates groupings which are defined by a set of expressions. One row per unique combination of the expressions in the GROUP BY clause is returned, and aggregate functions such as COUNT or SUM may be used on any columns in the query. However, if you want to group the data by multiple combinations of GROUP BY expressions, you may take one of two approaches. The first approach is to create one grouped query per combination of expressions and merge the results using the UNION ALL operator. The other approach is to use the GROUPING SETS operator along with the GROUP BY clause and define each grouping set within a single query.

In this article I’ll demonstrate how to achieve the same results using each method.

Mastering GROUPING SETS makes reporting queries in T-SQL so much more effective.
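
For a quick sense of the difference, here is a minimal sketch (the table and column names are invented for illustration, not taken from the article), showing the UNION ALL approach and its GROUPING SETS equivalent:

-- One grouped query per combination, merged with UNION ALL
SELECT Region, NULL AS Product, SUM(SalesAmount) AS TotalSales
FROM dbo.Sales
GROUP BY Region
UNION ALL
SELECT NULL, Product, SUM(SalesAmount)
FROM dbo.Sales
GROUP BY Product;

-- The same groupings defined in a single query with GROUPING SETS
SELECT Region, Product, SUM(SalesAmount) AS TotalSales
FROM dbo.Sales
GROUP BY GROUPING SETS ((Region), (Product));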

Thoughts On Dynamic Data Masking

Kellyn Pot’vin-Gorman talks about Dynamic Data Masking:

Anyone with the UNMASK privilege or DB_OWNER will be able to view the data. Many development and testing environments grant higher privileges to their users, and in SQL Server it’s not rare for a developer to be the DB_OWNER (I used to come across this all the time when recoveries were performed by the wrong OS user), so this data is still left quite vulnerable. I do like that if you were to take a backup and recover it with masking, the obfuscated data is what is recovered physically. I’m more concerned about those odd environments where compliance hasn’t been put in place on owners of the database, who would still view the originally masked data, but unmasked.

Performance isn’t impacted (i.e., no referential integrity concerns or execution plan changes), as the optimizer performs all steps against the real data, which leads me to wonder what happens with some of the newer monitoring tools that state they can display SQL and bind variable data without accessing the database directly. Would they “sniff” the masked data or the unmasked? Would it matter which OS user or database roles were involved?

The important thing here is that DDM isn’t really a security product.  It’s a something-or-another product that might be useful to stop shoulder surfing but pretty much nothing else.
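
To make the shoulder-surfing point concrete, here is a minimal sketch (the table, column, and user names are placeholders, not from either post): a masked column hides values from ordinary readers, but anyone with the UNMASK permission or db_owner membership still sees the real data.

-- Mask an email column with the built-in email() masking function
ALTER TABLE dbo.Customer
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- An ordinary reader now sees values like aXXX@XXXX.com;
-- granting UNMASK removes the mask for that user entirely
GRANT UNMASK TO ReportingUser;
REVOKE UNMASK FROM ReportingUser;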

Interpreting P-Value Histograms

David Robinson visualizes and interprets different p-value histograms:

So you’re a scientist or data analyst, and you have a little experience interpreting p-values from statistical tests. But then you come across a case where you have hundreds, thousands, or even millions of p-values. Perhaps you ran a statistical test on each gene in an organism, or on demographics within each of hundreds of counties. You might have heard about the dangers of multiple hypothesis testing before. What’s the first thing you do?

Make a histogram of your p-values. Do this before you perform multiple hypothesis test correction, false discovery rate control, or any other means of interpreting your many p-values. Unfortunately, for some reason, this basic and simple task rarely gets recommended (for instance, the Wikipedia page on the multiple comparisons problem never once mentions this approach). This graph lets you get an immediate sense of how your test behaved across all your hypotheses, and immediately diagnose some potential problems. Here, I’ll walk you through a basic example of interpreting a p-value histogram.

It’s a fun read and informative as well.

A Compendium Of R Errors

Sumendar Karupakala has a bunch of errors you might find in R, as well as their explanations and fixes:

#pull out the animals which are dogs
animaldata[animaldata$Animal.Type == "Dog" ] # throws an error
Error in `[.data.frame`(animaldata, animaldata$Animal.Type == "Dog"): undefined columns selected
Traceback:
1. animaldata[animaldata$Animal.Type == "Dog"]
2. `[.data.frame`(animaldata, animaldata$Animal.Type == "Dog")
3. stop("undefined columns selected")
#fixed error
animaldata[animaldata$Animal.Type == "Dog", ] # missing comma within the brackets

Some of these are basic syntax slips; others are a bit nastier.

Microservices With Kafka Streams

Ben Stopford walks us through a microservices architecture built on top of Kafka:

So we can use the Kafka Streams API to piece together complex business systems as a collection of asynchronously executing, event-driven services. The differentiator here is the API itself, which is far richer than, say, the Kafka Producer or Consumer. It makes code more readable, provides reusable implementations of common patterns like joins, aggregates, and filters, and wraps the whole ecosystem with a transparent level of correctness.

Systems built in this way, in the real world, come in a variety of guises. They can be fine-grained and fast-executing, completing in the context of an HTTP request, or complex and long-running, manipulating the stream of events that map a whole company’s business flow. This post focusses on the former, building up a real-world example of a simple order management system that executes within the context of an HTTP request, and is entirely built with Kafka Streams. Each service is a small function, with well-defined inputs and outputs. As we build this ecosystem up, we will encounter problems such as blending streams and tables, reading our own writes, and managing consistency in a distributed and autonomous environment.

This post stays high-level and covers a lot of ground.  I’m wishy-washy on the idea of microservices, but if you are going to do them, it’s better to do them right.

Validating Database Mail

Frank Gill has a script to confirm that your Database Mail settings are valid:

In my last post, I shared a script to automate the migration of SQL Server Database Mail settings. In this post, I show how to send test e-mails from all Database Mail profiles on an instance. The migration I was working on contained 21 Database Mail profiles. The following script will send a test e-mail from each profile to confirm successful configuration. I hope you can put this code to use in your migrations.

Click through for the script.
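
Frank’s script is the one to use; purely as a rough sketch of the idea (this is not his code, and the recipient address is a placeholder), looping over the profiles and sending a test mail from each looks roughly like this:

-- Send one test e-mail from every Database Mail profile on the instance
DECLARE @profile sysname, @body nvarchar(256);

DECLARE profile_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name FROM msdb.dbo.sysmail_profile;

OPEN profile_cursor;
FETCH NEXT FROM profile_cursor INTO @profile;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @body = N'Test message sent from Database Mail profile: ' + @profile;

    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = @profile,
        @recipients   = N'dba@example.com',   -- placeholder address
        @subject      = N'Database Mail configuration test',
        @body         = @body;

    FETCH NEXT FROM profile_cursor INTO @profile;
END

CLOSE profile_cursor;
DEALLOCATE profile_cursor;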

The GDPR And You

William Brewer has some Q&A regarding the General Data Protection Regulation:

The General Data Protection Regulation (GDPR) will affect organisations in countries around the world, not just those in Europe. The GDPR regulates how personal data is stored, moved, handled, and destroyed. Not following the regulation will lead to dire consequences for your organisation. As a data professional or developer, you may have many questions and might be wondering how it will affect the way you will do your job. William Brewer answers common questions about the GDPR that you were too shy to ask.

Grant Fritchey summarizes the rules:

Ever heard of the General Data Protection Regulation? If not, go and read the Wiki. I’ll wait.

I can already hear what you’re thinking. “Grant, this doesn’t apply to me because my company is in the <insert non-EU country here>.” How do I know you’re thinking that? Because every single person with whom I’ve brought this up has had the same response. You might want to go back and re-read it.

As a data professional, you’re going to want to know about this regulation.

Database Projects: Helping Find Obsolete References

Jan Mulkens explains some of those “unresolved reference” warnings in SQL Server Data Tools database projects:

If you’re developing databases in SSDT, like you should, you’re probably getting a lot of build warnings.
One of the warnings you’ll see most often is the “unresolved reference”.
Usually you solve these by adding master, msdb, or some application database as a database reference.
This post is about a warning you might get when, out of habit (or, like me, because you didn’t know any better yet), you’re using old system views like sys.sysprocesses. You expect it to work but it simply doesn’t…

Worth reading the whole thing, as well as keeping up-to-date with your DMV and system view usage.
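
As a concrete illustration of the kind of swap involved (the column choices are just an example), the old compatibility view and its supported DMV replacement look like this:

-- Old compatibility view; SSDT flags references like this with unresolved reference warnings
SELECT spid, loginame, hostname
FROM sys.sysprocesses;

-- Supported replacement via a dynamic management view
SELECT session_id, login_name, host_name
FROM sys.dm_exec_sessions;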

Jitter In Power BI Charts

Rob Collie shows how to incorporate jitter in Power BI scatter charts:

Now, sometimes you may WANT multiple rows to combine into one dot, but in this particular case, I want to see each row of my source data as its own dot.

When adding a new calculated column, there are LOTS of ways to uniquely “stamp” each row with its own distinct value. I could do this in DAX, but it would require concatenating/combining enough columns together (in this case, probably [Game #], [Qtr], and [Time], since no two rows can “happen” at the same time in the same game).

But for other reasons that you will see shortly, I need the unique identifier to be a number, and I don’t want to go through the contortions of converting text values to numeric, plus as you can see, the data is incomplete in the [Time] column (lots of blanks).

There’s a lot here, and the end result is a great addition to your Power BI toolbelt.  But as I’m reading Rob’s post, I’m thinking about how much easier it is to do some of this with ggplot2.

Finding The Right Batch Size For Bulk Loads

Dan Guzman has some bulk load batch size considerations:

Bulk load has long been the fastest way to mass insert rows into a SQL Server table, providing orders of magnitude better performance compared to traditional INSERTs. SQL Server database engine bulk load capabilities are leveraged by T-SQL BULK INSERT, INSERT…SELECT, and MERGE statements as well as by SQL Server client APIs like ODBC, OLE DB, ADO.NET, and JDBC. SQL Server tools like BCP and components like SSIS leverage these client APIs to optimize insert performance.

SQL Server 2016 and later improves performance further by turning on bulk load context and minimal logging by default when bulk loading into SIMPLE and BULK LOGGED recovery model databases, which previously required turning on trace flags as detailed in this blog post by Parikshit Savjani of the MSSQL Tiger team. That post also includes links to other great resources that thoroughly cover minimal logging and data loading performance, which I recommend you peruse if you use bulk load often. I won’t repeat all that information here but do want to call attention to the fact that these new bulk load optimizations can result in much more unused space when a small batch size is used compared to SQL Server 2014 and older versions.

Click through for some tips.
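
For context, the batch size knob in question looks like this in a plain BULK INSERT (the file path and target table are placeholders):

-- Each batch of 100,000 rows is committed as its own unit; per Dan's post,
-- small batch sizes under the SQL Server 2016+ bulk load optimizations can leave much more unused space
BULK INSERT dbo.OrdersStaging
FROM 'C:\loads\orders.dat'
WITH (
    BATCHSIZE = 100000,
    TABLOCK    -- needed for minimal logging under SIMPLE or BULK_LOGGED recovery
);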
