Curated SQL Posts

Linchpins

Bert Wagner on the ongoing “what happens to my tech job?” question:

Seth Godin discusses the concept of a Linchpin in his same-titled book. A Linchpin is someone who is so good at what they do that they become indispensable to their organization. Linchpins are the kind of people who are self-motivated and are able to consistently deliver quality work. They are integral to the operation of a business, even if they don’t get all of the glamour of having VP or Director in their title.

And why are Linchpins always guaranteed jobs? In one scenario, Linchpins will outgrow their role and be promoted or find a better job. They are always learning and growing in addition to delivering, and so this is the natural progression. In the alternate scenario, if the Linchpin loses his or her current job (i.e., think company buyouts where entire departments close), they will either 1) be promoted elsewhere in the company because management recognizes their great skills, or 2) have no problem finding work elsewhere, especially with great recommendations from their former employer.

It’s an interesting read.

Exporting SQL Server User Permissions

Rob Sewell shows how to use the Export-SqlUser cmdlet in dbatools:

There are times when DBAs are required to export database user permissions to a file. This may be for a number of reasons: for DR purposes, for auditing, or for transfer to another database or instance. Sometimes we need to create a new user with the same permissions as another user, or perhaps nearly the same permissions. I was having a conversation with my good friend and MVP Cláudio Silva, and we were talking about how Export-SqlUser from dbatools could help in these situations; he suggested that I blog about it, so here it is.

The dbatools module (for those that don’t know) is a PowerShell module written by amazing folks in the community, designed to make administering your SQL Server significantly easier using PowerShell. The instructions for installing it are available here. It comprises 182 separate commands at present.

Cláudio wrote Export-SqlUser to solve a problem.

Read on to see what this does and how to use it.
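
If you’re curious what the end product looks like, Export-SqlUser writes out an ordinary T-SQL script that you can replay elsewhere. Here is a hand-rolled sketch of the kind of statements such a script contains; the user, role, and object names are made up, and the real output will differ:

    -- Recreate a database user and a sample of its permissions (illustrative only).
    CREATE USER [ReportReader] FOR LOGIN [ReportReader];

    -- Role membership.
    ALTER ROLE [db_datareader] ADD MEMBER [ReportReader];

    -- Explicit object-level grants.
    GRANT EXECUTE ON OBJECT::[dbo].[GetSalesSummary] TO [ReportReader];
    GRANT SELECT  ON OBJECT::[dbo].[SalesStaging]    TO [ReportReader];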

Probabilistic Record Linking In Spark

Tom Lous builds a solution to link similar companies together by address:

Recently a colleague asked me to help her with a data problem that seemed very straightforward at a glance.
She had purchased a small set of data from the chamber of commerce (Kamer van Koophandel: KvK) that contained roughly 50k small-sized companies (5–20 FTE), which can be hard to find online.
She noticed that many of those companies share the same address, which makes sense, because a lot of those companies tend to cluster in business complexes.

Read on for the solution.  Like many data problems, it turns out to be a lot more complicated than you’d think at first glance.
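
The jumping-off observation, many distinct records sharing one address, is easy to sanity-check before attempting any probabilistic matching. This is not Tom’s solution, just a rough sketch of that starting check in SQL against a hypothetical table and columns:

    -- Find addresses shared by more than one registered company (illustrative schema).
    SELECT postal_code, street, house_number,
           COUNT(*) AS companies_at_address
    FROM   kvk_companies
    GROUP BY postal_code, street, house_number
    HAVING COUNT(*) > 1
    ORDER BY companies_at_address DESC;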

HDInsight 3.6 Available

Ashish Thapliyal points out some Hive improvements in HDInsight 3.6:

Create a new Hive table from scratch or alter an existing table

Create a new table by clicking on the ‘+’ icon, which opens the create table wizard. Enter the table name and column names, and choose a data type from the dropdown. You can pick the following advanced Hive settings directly from the UI (a rough HiveQL equivalent is sketched after this list):

  • Transactional: Turn on transaction support in Hive by checking this flag. Note that the table must be bucketed and stored using an ACID-compliant format (such as ORC).

  • Location: Hive stores the table data for managed tables in the Hive warehouse directory in HDFS, which is configured in hive-site.xml with the property hive.metastore.warehouse.dir. The default location is /apps/hive/warehouse. The location can be changed using the Location text field.

  • File Format: The default file format for the CREATE TABLE statement is ORC. Choose a different format from the file format dropdown.

  • Row Format: Select row format options such as the field terminator, lines terminator, and stored file type.

  • Tables can be altered to add new columns or to change a column’s name or data type.

  • Tables can also be renamed.
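
For reference, those wizard options map onto ordinary HiveQL DDL. A rough sketch of the kind of statement the wizard builds, with made-up table and column names:

    -- Roughly the statement the wizard builds behind the scenes (illustrative names).
    CREATE TABLE sales_orders (
        order_id  INT,
        customer  STRING,
        amount    DECIMAL(10,2)
    )
    CLUSTERED BY (order_id) INTO 8 BUCKETS       -- transactional tables must be bucketed
    STORED AS ORC                                -- an ACID-compliant format
    LOCATION '/data/sales/sales_orders'          -- overrides the default warehouse directory
    TBLPROPERTIES ('transactional' = 'true');    -- the Transactional checkbox

    -- The table can be altered later, e.g. adding a column or renaming it.
    ALTER TABLE sales_orders ADD COLUMNS (order_date DATE);
    ALTER TABLE sales_orders RENAME TO sales_orders_v2;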

Read on for more improvements, including a graphical plan viewer and improved autocomplete.

MERGE In Hive

Carter Shanklin notes that Hive now has the ability to run MERGE statements:

As scalable as Apache Hadoop is, many workloads don’t work well in the Hadoop environment because they need frequent or unpredictable updates. Updates using hand-written Apache Hive or Apache Spark jobs are extremely complex.  Not only are developers responsible for the update logic, they must also implement all rollback logic, detect and resolve write conflicts and find some way to isolate downstream consumers from in-progress updates. Hadoop has limited facilities for solving these problems and people who attempted it usually ended up limiting updates to a single writer and disabling all readers while updates are in progress.

This approach is too complicated and can’t meet reasonable SLAs for most applications. For many, Hadoop became just a place for analytics offload — a place to copy data and run complex analytics where they can’t interfere with the “real” work happening in the EDW.

This post mostly describes the gains rather than showing code, but it does show that Hive developers are looking at expanding beyond common Hadoop warehousing scenarios.
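
For anyone who hasn’t seen it yet, Hive’s MERGE follows the standard SQL syntax: one statement that updates, deletes, and inserts against an ACID table in a single pass. A minimal sketch against hypothetical tables:

    -- Upsert incoming changes into a transactional (ACID) target table.
    MERGE INTO customers t
    USING customer_updates s
      ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET email = s.email, city = s.city
    WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.email, s.city);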

More Isn’t Better With Data Collection

Andy Leonard argues that more data is not better data:

The Problem I am Trying To Solve

Is more data better? In his 2012 book, Antifragile, Nassim Nicholas Taleb (fooledbyrandomness.com | @nntaleb) – the first data philosopher I encountered – states:

“The fooled-by-data effect is accelerating. There is a nasty phenomenon called ‘Big Data’ in which researchers have brought cherry-picking to an industrial level. Modernity provides too many variables (but too little data per variable), and the spurious relationships grow much, much faster than real information, as noise is convex and information is concave.” – Nassim Nicholas Taleb, Antifragile, p. 416

According to Taleb, there’s a bias for error embedded in big data; more is not better, it’s worse. I’ve experienced this with business intelligence solutions and spoken about data quality in data warehouse solutions, saying:

“The ratio of good:bad data in a useless / inaccurate data warehouse is surprisingly high; almost always north of 95% and often higher than 99%.”

Taleb states more data includes a disproportionate amount of bad data, and that bigger data results in more spurious correlations. In other words, more is not better – it’s worse.

It’s an idea worth grappling with.  The other side of the argument is that for some problems, you won’t know what you need until you need it.

Guessing At Foreign Key Relationships

Daniel Hutmacher has put together a script to try to find hidden foreign key relationships in a database:

Now, before you go crazy with this stuff, remember, it’s not a magic bullet, but rather some automation help to save you some coding and to help you review your data model. The script doesn’t change the database, it only prints out its suggestions, and this is totally by design.

  • For this to work, you’ll obviously need proper primary keys or unique indexes on your referenced tables.

  • We’re working on the assumption that the referencing and referenced column names are the same. Go ahead and change the script to suit your naming standards (look for the comment in the CTE).

  • The script has no domain knowledge of your database, so some of the suggestions are probably going to be downright silly.

This is a good first pass approach, especially if you have a larger database completely lacking in relational integrity.
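
This is not Daniel’s script, but the core idea, matching column names against primary key columns on other tables and filtering out existing foreign keys, can be sketched in a few lines of T-SQL:

    -- Bare-bones sketch of the name-matching idea: suggest candidate foreign keys
    -- where a column shares its name with a primary key column on another table
    -- and is not already part of a foreign key.
    SELECT  OBJECT_NAME(c.object_id)   AS referencing_table,
            c.name                     AS referencing_column,
            OBJECT_NAME(pkc.object_id) AS referenced_table
    FROM    sys.columns        AS c
    JOIN    sys.tables         AS ct  ON ct.object_id  = c.object_id
    JOIN    sys.columns        AS pkc ON pkc.name       = c.name        -- assumes matching names
                                     AND pkc.object_id <> c.object_id
    JOIN    sys.index_columns  AS ic  ON ic.object_id  = pkc.object_id
                                     AND ic.column_id  = pkc.column_id
    JOIN    sys.indexes        AS i   ON i.object_id   = ic.object_id
                                     AND i.index_id    = ic.index_id
                                     AND i.is_primary_key = 1
    WHERE   NOT EXISTS (SELECT 1
                        FROM sys.foreign_key_columns AS fkc
                        WHERE fkc.parent_object_id = c.object_id
                          AND fkc.parent_column_id = c.column_id);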

Recompiling Memory-Optimized Procedures On AGs

Ned Otter takes note of how natively compiled stored procedures differ from traditional stored procedures, especially on Availability Group secondary nodes:

As of SQL 2016, the database engine automatically updates statistics for memory-optimized tables (documentation here), but recompilation of native modules must still be performed manually. But hey, that’s way better than SQL 2014, when you couldn’t recompile at all; you had to drop/recreate the native module. And natively compiled stored procedures don’t reside in the plan cache, because they are executed directly by the database engine.

This post attempts to determine if the requirement to manually recompile native modules is any different for AG secondary replicas.

The results are interesting.
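
For context, the manual recompilation Ned mentions is typically done with sp_recompile against the native module; the procedure name below is hypothetical:

    -- Force a natively compiled procedure to recompile on its next execution,
    -- picking up the current statistics.
    EXEC sys.sp_recompile N'dbo.InsertSensorReading_Native';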

Thinking About NULL

Duncan Greaves on NULL:

NULL exists because the following general conditions apply:

Existence – The attribute does not exist in the domain, or our understanding of the domain is wrong. This means there is a missing entity in our domain model, or entities are mixed in a table. E.g., a table contains hair colour for a car entity, or number of pregnancies for male patients.

Missing – The information had not been given at the time the row was created. E.g., a customer may decline to give their age.

Not Yet – Data is contingent upon an unknown event in the future. E.g., termination date or date of death.

Does not apply – Not applicable for this instance of a record. E.g., hair colour for bald people.

Placeholders – Indicates that we know that a bit of data exists, but we don’t know what it is; in this case, keeping a NULL is useful for CUBE or ROLLUP queries.

In real-world applications of data structures, NULLs are often unavoidable. However, they confuse users and designers, and DBAs (generally) hate them. They complicate reporting, ETL, business intelligence, and data science initiatives. As such, users need to be aware of the design and query compromises they need to make.
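
One concrete example of those query compromises: NULL never compares equal (or unequal) to anything, so ordinary predicates silently drop the rows with NULLs. A quick illustration against a hypothetical table:

    -- hair_colour is NULL for bald customers.
    -- This query does NOT return the bald customers, because NULL <> 'Brown' evaluates to unknown.
    SELECT customer_id, hair_colour
    FROM   dbo.Customers
    WHERE  hair_colour <> 'Brown';

    -- To include them, the NULL case has to be handled explicitly.
    SELECT customer_id, hair_colour
    FROM   dbo.Customers
    WHERE  hair_colour <> 'Brown' OR hair_colour IS NULL;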

I think there’s significance in what NULL represents, but it’s a concept with its fair share of complexity.  Read the whole thing.
