Press "Enter" to skip to content

Month: April 2017

More Isn’t Better With Data Collection

Andy Leonard argues that more data is not better data:

The Problem I am Trying To Solve

Is more data better? In his 2012 book, Antifragile, Nassim Nicholas Taleb (fooledbyrandomness.com | @nntaleb) – the first data philosopher I encountered – states:

“The fooled-by-data effect is accelerating. There is a nasty phenomenon called ‘Big Data’ in which researchers have brought cherry-picking to an industrial level. Modernity provides too many variables (but too little data per variable), and the spurious relationships grow much, much faster than real information, as noise is convex and information is concave.” – Nassim Nicholas Taleb, Antifragile, p. 416

According to Taleb, there’s a bias for error embedded in big data; more is not better, it’s worse. I’ve experienced this with business intelligence solutions and spoken about data quality in data warehouse solutions, saying:

“The ratio of good:bad data in a useless / inaccurate data warehouse is surprisingly high; almost always north of 95% and often higher than 99%.”

Taleb states more data includes a disproportionate amount of bad data, and that bigger data results in more spurious correlations. In other words, more is not better – it’s worse.

It’s an idea worth grappling with.  The other side of the argument is that for some problems, you won’t know what you need until you need it.

Guessing At Foreign Key Relationships

Daniel Hutmacher has put together a script to try to find hidden foreign key relationships in a database:

Now, before you go crazy with this stuff, remember, it’s not a magic bullet, but rather some automation help to save you some coding and to help you review your data model. The script doesn’t change the database, it only prints out its suggestions, and this is totally by design.

  • For this to work, you’ll obviously need proper primary keys or unique indexes on your referenced tables.

  • We’re working on the assumption that the referencing and referenced column names are the same. Go ahead and change the script to suit your naming standards (look for the comment in the CTE).

  • The script has no domain knowledge of your database, so some of the suggestions are probably going to be downright silly.

This is a good first pass approach, especially if you have a larger database completely lacking in relational integrity.
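To get a feel for the general technique before reading Daniel’s script, here is a minimal sketch of the idea under a couple of assumptions (single-column primary keys, matching column names and types). It only prints suggestions and is my own rough approximation, not Daniel’s code:

-- Rough sketch (not Daniel's script): suggest foreign keys by matching
-- column names and types against single-column primary keys elsewhere.
WITH single_column_pk AS
(
    SELECT i.object_id, MIN(ic.column_id) AS column_id
    FROM sys.indexes AS i
    JOIN sys.index_columns AS ic
        ON ic.object_id = i.object_id
       AND ic.index_id  = i.index_id
    WHERE i.is_primary_key = 1
    GROUP BY i.object_id
    HAVING COUNT(*) = 1
)
SELECT 'ALTER TABLE ' + QUOTENAME(SCHEMA_NAME(ft.schema_id)) + '.' + QUOTENAME(ft.name)
     + ' ADD FOREIGN KEY (' + QUOTENAME(fc.name) + ') REFERENCES '
     + QUOTENAME(SCHEMA_NAME(pt.schema_id)) + '.' + QUOTENAME(pt.name)
     + ' (' + QUOTENAME(pc.name) + ');' AS suggested_fk
FROM single_column_pk AS pk
JOIN sys.columns AS pc              -- the referenced primary key column
    ON pc.object_id = pk.object_id
   AND pc.column_id = pk.column_id
JOIN sys.tables AS pt
    ON pt.object_id = pk.object_id
JOIN sys.columns AS fc              -- candidate referencing column: same name and type
    ON fc.name = pc.name
   AND fc.system_type_id = pc.system_type_id
   AND fc.object_id <> pc.object_id
JOIN sys.tables AS ft
    ON ft.object_id = fc.object_id
WHERE NOT EXISTS                    -- skip columns already covered by a foreign key
(
    SELECT 1
    FROM sys.foreign_key_columns AS fkc
    WHERE fkc.parent_object_id = fc.object_id
      AND fkc.parent_column_id = fc.column_id
);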

Recompiling Memory-Optimized Procedures On AGs

Ned Otter takes note of how natively compiled stored procedures differ from traditional stored procedures, especially on Availability Group secondary nodes:

As of SQL 2016, the database engine automatically updates statistics for memory-optimized tables (documentation here), but recompilation of native modules must still be performed manually. But hey, that’s way better than SQL 2014, when you couldn’t recompile at all; you had to drop/recreate the native module. And natively compiled stored procedures don’t reside in the plan cache, because they are executed directly by the database engine.

This post attempts to determine if the requirement to manually recompile native modules is any different for AG secondary replicas.

The results are interesting.
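For context, manually recompiling a native module is just a call to sp_recompile. Here is a hedged sketch; the procedure name and the memory-optimized table dbo.Orders are hypothetical names for illustration:

-- Hypothetical natively compiled procedure; dbo.Orders is assumed to be
-- an existing memory-optimized table.
CREATE PROCEDURE dbo.usp_GetOrdersByCustomer
    @CustomerId int
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    SELECT OrderId, OrderDate, Amount
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId;
END;
GO

-- SQL Server 2016 updates statistics on memory-optimized tables automatically,
-- but the native module only picks them up after a manual recompile:
EXEC sys.sp_recompile N'dbo.usp_GetOrdersByCustomer';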

Thinking About NULL

Duncan Greaves on NULL:

NULL exists because the following general conditions apply:

Existence – The attribute does not exist in the domain, or domain understanding is wrong. This means there is a missing entity in our domain model, or entities are mixed in a table. E.g., a table contains hair colour for a car entity, or number of pregnancies for male patients.

Missing – The information was not given at the time the row was created. E.g., a customer may decline to give their age.

Not Yet – Data is contingent upon an unknown event in the future. E.g., termination date or date of death.

Does not apply – Not applicable for this instance of a record. E.g., hair colour for bald people.

Placeholders – Indicates that we know that a bit of data exists, but we don’t know what it is; in this case keeping a NULL is useful for CUBE or ROLLUP queries.

In real-world applications of data structures, NULLs are often unavoidable. However, they confuse users and designers, and DBAs (generally) hate them. NULLs complicate reporting, ETL, business intelligence, and data science initiatives. As such, users need to be aware of the design and query compromises they will have to make.

I think there’s significance in what NULL represents, but it’s a concept with its fair share of complexity.  Read the whole thing.
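As a quick illustration of that placeholder point (my own example, not Duncan’s): in a ROLLUP query, a stored NULL and the NULL generated for the subtotal row look identical unless you check GROUPING().

-- Hypothetical data: Region is NULL where the customer declined to give it.
CREATE TABLE #Sales (Region varchar(20) NULL, Amount money NOT NULL);
INSERT INTO #Sales (Region, Amount)
VALUES ('North', 100), ('South', 250), (NULL, 75);

SELECT Region,
       GROUPING(Region) AS IsSubtotalRow,  -- 1 = rollup total, 0 = real (possibly NULL) value
       SUM(Amount)      AS TotalAmount
FROM #Sales
GROUP BY ROLLUP (Region);

DROP TABLE #Sales;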

Collations

Robert Sheldon has an article on collations:

The ideal solution is to choose a collation when setting up SQL Server that can be used for all your user databases and character columns. Using one collation removes any issues you might encounter when querying the data in different ways. It can also be the best approach in terms of performance if multiple collations impact your queries. However, this approach works only if the same language and collation settings are appropriate for all your users and applications—or at least a good majority of them.

If you support multi-cultural environments, you’ll need to take into account a number of considerations. To begin with, you should pick collations that support the most users, and you should use Unicode data types where possible because they can help avoid code page conversion issues. Just keep in mind the storage requirements that come with Unicode’s two bytes per character.

My inclination is to say Unicode everywhere possible.  There are cases in which Unicode doesn’t fit, but it’s easy to use, and if you have enough data to worry about the extra bytes that Unicode characters take up, Unicode compression is available.
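Here is a minimal sketch of what that advice looks like in practice, with made-up object names: an nvarchar column with an explicit collation, row compression (which brings Unicode compression along for nvarchar(n) columns), and an explicit COLLATE clause when comparing values across collations.

-- Hypothetical table: Unicode text with an explicit column-level collation,
-- plus row compression, which includes Unicode compression for nvarchar(n).
CREATE TABLE dbo.Customers
(
    CustomerId   int IDENTITY(1, 1) PRIMARY KEY,
    CustomerName nvarchar(200) COLLATE Latin1_General_100_CI_AS NOT NULL
)
WITH (DATA_COMPRESSION = ROW);

-- Comparing against text from another collation requires an explicit COLLATE clause.
DECLARE @SearchName nvarchar(200) = N'Müller';
SELECT c.CustomerId, c.CustomerName
FROM dbo.Customers AS c
WHERE c.CustomerName = @SearchName COLLATE Latin1_General_100_CI_AS;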

Microsoft R Open 3.3.3

David Smith reports that Microsoft R Open 3.3.3 is now available:

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.3.3, and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to R 3.3.3, upgrades the installer, and updates the bundled packages.

R 3.3.3 makes just a few minor fixes compared to R 3.3.2 (see the full list of changes here), so you shouldn’t encounter any compatibility issues when upgrading from MRO 3.3.2. For CRAN packages, MRO 3.3.3 points to a CRAN snapshot taken on March 15, 2017, but as always, you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).

Click through for more details.  As a side note, CRAN R 3.4 is scheduled for release this month, so given their recent cadence, I’d guess MRO 3.4 to be out late this year.

Network Navigator Custom Visual

Devin Knight continues his Power BI custom visuals series:

In this module you will learn how to use the Network Navigator Power BI Custom Visual.  You may find the need to use the Network Navigator when you’re trying to find links between different attributes in a dataset. It does this by visualizing each attribute as a node and the strength of activity between those nodes can be represented in multiple ways.

Click through to get to Devin’s video.  This visual looks interesting for graph analysis, like trying to tease out common connections or discover dependencies.

Improving Read-HostSpecial

Jana Sattainathan makes Jeff Hicks’s Read-HostSpecial better:

Some shortcomings of Read-HostSpecial

While it does a lot already, I did find some very minor shortcomings that I wanted to address:

  1. Bad inputs killed the program – If you are prompting for a series of inputs (like a wizard) and the user mis-keys one input, error recovery is very hard and the user has to start over from the beginning. This called for a RepromptOnError switch, which issues a gentle warning and then lets the user enter a valid value upon encountering validation errors.

  2. I needed a couple more canned validations, like ValidateFolder and ValidateFile.

  3. Too bad, there is no Write-HostSpecial – I wanted Read-HostSpecial to display some pretty text and not wait for input (like Write-Host) using the same nomenclature for fonts/look/feel/usability as Read-HostSpecial. So, I needed a NoWait switch.

Read on for Jana’s script and usage examples.

Finding Identity Columns Close To Overflow

Rob Sewell demos another dbatools feature:

It would be very useful to be able to quickly see the current values of the identity columns and how close they are to being full, so that we can plan and take action before we end up with shouty smart suits at our desk. If we could do it with just one line of code, that would be even easier.

Step forward dbatools.  This PowerShell module is a community-based project written by excellent, brilliant people in their own time and available to you for free. To find out more and how to use and install it, visit https://dbatools.io

There is a command called Test-DbaIdentityUsage. This command was created by Brandon Abshire. You can find Brandon blogging at netnerds.net. Thank you, Brandon.

That’s quite useful for taking a quick look at identity columns across a database or instance.
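If you would rather see the underlying idea in T-SQL, here is a rough approximation of the check (not the dbatools implementation; it assumes the default positive seed and skips decimal identities):

-- Rough approximation: compare the last identity value issued against the
-- maximum for the column's type.
SELECT SCHEMA_NAME(t.schema_id) + '.' + t.name AS TableName,
       ic.name                                 AS ColumnName,
       ic.last_value,
       CAST(ic.last_value AS float) * 100.0 /
           CASE ty.name
               WHEN 'tinyint'  THEN 255
               WHEN 'smallint' THEN 32767
               WHEN 'int'      THEN 2147483647
               WHEN 'bigint'   THEN 9223372036854775807
           END                                 AS PercentUsed
FROM sys.identity_columns AS ic
JOIN sys.tables AS t
    ON t.object_id = ic.object_id
JOIN sys.types AS ty
    ON ty.user_type_id = ic.user_type_id
WHERE ic.last_value IS NOT NULL
  AND ty.name IN ('tinyint', 'smallint', 'int', 'bigint')
ORDER BY PercentUsed DESC;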

DataRows Everywhere

Shane O’Neill shows how to deal with .NET DataRows in Powershell:

Now I don’t know about you, but when I query stuff in a SQL database, it’s to do something to/with the results. They could be a list of servers that I monitor, they could be a list of databases whose recovery model I want to check, or they could be a list of tables whose space usage I want to see. The main point is that I want to do something with the results.

But for this simple case, I just want to list out the customer name from this table. Simple? Yes, but this is just a test case to prove a point.

So let’s PowerShell this!

The moral of the story is to Get-Member early and Get-Member often.
