2017-10-24 – Curated SQL

This means you absolutely must sweat the details up front. Big Data project failures are more often than not predicated by the statement: “We will do this bit now, and figure the rest out later”. But you need to begin with the end in mind.

You need to know the performance that you’ll be able to deliver and what your requirements are. You need to know how to integrate with your corporate Active Directory, and LDAP, and Kerberos services. You need to know your network topology and security requirements as well as the required user roles and responsibilities breakdown. You need to know how you’ll handle high availability, QoS, and multi-tenancy. You need to know how you’ll manage upgrades to the latest versions of your Hadoop distribution or other big data tools, and how you’ll respond to requests for new big data frameworks and new data science tools. If not, you’re just asking for trouble.

The motif in his post is building your own car, which makes sense as an extended metaphor.

Comments closed

The Correct Way To Load Libraries In R

Published 2017-10-24 by Kevin Feasel

Gerald Belton opens a can of worms:

When I was an R newbie, I was taught to load packages by using the command library(package). In my Linear Models class, the instructor likes to use require(package). This made me wonder, are the commands interchangeable? What’s the difference, and which command should I use?

Interchangeable commands . . .

The way most users will use these commands, most of the time, they are actually interchangeable. That is, if you are loading a library that has already been installed, and you are using the command outside of a function definition, then it makes no difference if you use “require” or “library.” They do the same thing.

… Well, almost interchangeable

Read on to understand the differences between the two. I end up doing something very similar to his code snippet for exactly the reason he describes. H/T R-Bloggers

Comments closed

How To Create Difficult Measures In Power BI

Published 2017-10-24 by Kevin Feasel

Matt Allington walks through his process of how he creates measures in Power BI:

Killer Tip 1: Create Good Test Data

The first thing I did was to replicate the test data shown above. As I have mentioned many times, good quality test data is essential to getting a quick correct answer to your problem. The data from the OP looked pretty good as it had covered the relevant scenarios (1 ID had just blue, 1 ID had blue and red, several IDs had no blue – this is good test data).

Starting out with good test data is vital—it helps you clarify exactly what it is you want, and if you come up with edge cases, you have the makings of a good test workbench to ensure that your code actually works, and not just in the simplest scenario.

Comments closed

Don’t Unit Test Private Methods

Published 2017-10-24 by Kevin Feasel

Vladimir Khorikov argues that you should not unit test private methods:

When your tests start to know too much about the internals of the system under test (SUT), that leads to false positives during refactoring. Which means they won’t act as a safety net anymore and instead will impede your refactoring efforts because of the necessity to refactor them along with the implementation details they are bound to. Basically, they will stop fulfilling their main objective: providing you with the confidence in code correctness.

When it comes to unit testing, you need to follow this one rule: test only the public API of the SUT, don’t expose its implementation details in order to enable unit testing. Your tests should use the SUT the same way its regular clients do, don’t give them any special privileges. Here you can read more about what an implementation detail is and how it is different from public API: link.

In the database world, this is one reason why I like using stored procedures: they give the equivalent of a public API for database code, so you can write tests for them.

Comments closed

Azure SQL Database FAQ

Published 2017-10-24 by Kevin Feasel

Dimitri Furman answers some common questions about Azure SQL Database:

Q7. Can I use Windows Authentication in Azure SQL Database?

The short answer is no. Therefore, if you are migrating an application dependent on Windows Authentication from SQL Server to Azure SQL Database, you may have to either switch to SQL Authentication (i.e. use a separate login and password for database access), or use Azure Active Directory Authentication (AAD Authentication).

The latter is conceptually similar to Windows Authentication in the sense that connections from directory principals are authenticated without the need to provide additional secrets, such as a password. Since Azure Active Directory can be federated with the on-premises Active Directory Domain Services, it can effectively authenticate the same Active Directory principals that could access the database prior to migration. However, the authentication flow for AAD Authentication is significantly different, so the analogy with Windows Authentication only goes so far.

There are some good questions in here, especially the one about retry logic; that’s good to have in any situation, but becomes vital when working with a cloud service.

Comments closed

How Statistics In SQL Server Have Changed Over The Years

Published 2017-10-24 by Kevin Feasel

Erin Stellato gives us a version-based timeline of how SQL Server has handled statistics over the years:

SQL Server 2008

Filtered statistics are introduced, and these can be created separately from a filtered index. There are some limitations around filtered indexes with regard to the Query Optimizer (see Tim Chapman’s post The Pains of Filtered Indexes and Paul White’s post Optimizer Limitations with Filtered Indexes) post, and it’s important to understand the behavior of the counter that tracks modifications (and thus can trigger automatic updates). See Kimberly’s post Filtered indexes and filtered stats might become seriously out-of-date for more details, and I also recommend checking out her stored procedure that analyzes data skew and recommends where you can create filtered statistics to provide more information to the Query Optimizer. I’ve implemented this for several large customers that have VLTs and skewed distribution across columns frequently used in predicates.
Two new catalog views, sys.stats and sys.stats_columns, are added to provide easier insight into statistics and included columns. Use these two views instead of sp_helpstats, which is deprecated and provides less information.

This is a very interesting historical look. Most interesting to me was the decreases in the number of steps available.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

Day: October 24, 2017

Build Versus Buy For Hadoop

The Correct Way To Load Libraries In R

Interchangeable commands . . .

… Well, almost interchangeable

How To Create Difficult Measures In Power BI

Killer Tip 1: Create Good Test Data

Don’t Unit Test Private Methods

Azure SQL Database FAQ

Q7. Can I use Windows Authentication in Azure SQL Database?

How Statistics In SQL Server Have Changed Over The Years