Author: Kevin Feasel

MERGE In Hive

Carter Shanklin introduces the MERGE operator in Hive:

USE CASE 2: UPDATE HIVE PARTITIONS.

A common strategy in Hive is to partition data by date. This simplifies data loads and improves performance. Regardless of your partitioning strategy you will occasionally have data in the wrong partition. For example, suppose customer data is supplied by a 3rd-party and includes a customer signup date. If the provider had a software bug and needed to change customer signup dates, suddenly records are in the wrong partition and need to be cleaned up.

It has been interesting to see Hive morph over the past few years from a batch warehousing system to something approaching a relational warehouse.  This is one additional step in that direction.
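
For readers who have not seen the syntax yet, here is a minimal sketch of a Hive MERGE statement (hypothetical table and column names, not the exact statement from Carter's post; it assumes the target is an ACID transactional table):

-- Update existing customers and insert new ones in a single statement.
MERGE INTO customers t
USING customer_updates s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET email = s.email
WHEN NOT MATCHED THEN
  INSERT VALUES (s.customer_id, s.email, s.signup_date);

Hive's MERGE also supports a WHEN MATCHED ... THEN DELETE clause; the full partition-repair example is in Carter's post.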

Performance Problems Due To Readable Secondaries

Paul Randal describes a problem when you create a readable secondary on an Availability Group:

Yesterday I blogged about log shipping performance issues and mentioned a performance problem that can be caused by using availability group readable secondaries, and then realized I hadn’t blogged about the problem, only described it in our Insider newsletter. So here’s a post about it!

Availability groups (AGs) are pretty cool, and one of the most useful features of them is the ability to read directly from one of the secondary replicas. Before, with database mirroring, the only way to access the mirror database was through the creation of a database snapshot, which only gave a single, static view of the data. Readable secondaries are constantly updated from the primary so are far more versatile as a reporting or non-production querying platform.

But I bet you didn’t know that using this feature can cause performance problems on your primary replica?

Definitely read the whole thing.

Dealing With The Registry From SQL Server

Wayne Sheffield shows how to read and modify registry entries using SQL Server:

xp_instance_regread

In this example, I used xp_regread to read the direct registry path. If you remember from earlier, there are SQL Server instance-aware versions of each registry procedure. A comparable statement using the instance-aware procedure would be:

This statement returns the exact same information. Let’s look at the difference between these – in the first query, the registry path is the exact registry path needed, and it includes “\Microsoft SQL Server\MSSQL12.SQL2014\”. In the latter query, this string is replaced with “\MSSQLSERVER\”. Since the latter function is instance aware, it replaces the “MSSQLSERVER” with the exact registry path necessary for this instance of SQL Server. Pretty neat, isn’t it? This allows you to have a script that will run properly regardless of the instance that it is being run on. The rest of the examples in this post will utilize the instance-aware procedures to make it easier for you to follow along and run these yourself.

Sometimes you just have to change something in the registry from SQL Server.  Hopefully that “sometimes” is rare.
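
Wayne's statements aren't reproduced in the excerpt above, so here is a rough illustration only (the registry key and value names are assumptions, not taken from his post) of an instance-aware registry read:

-- Read this instance's default backup directory. xp_instance_regread expands
-- the generic \MSSQLServer\ portion of the path into the instance-specific
-- key (e.g. \Microsoft SQL Server\MSSQL12.SQL2014\) at run time.
DECLARE @BackupPath nvarchar(512);

EXEC master.dbo.xp_instance_regread
    N'HKEY_LOCAL_MACHINE',
    N'SOFTWARE\Microsoft\MSSQLServer\MSSQLServer',
    N'BackupDirectory',
    @BackupPath OUTPUT;

SELECT @BackupPath AS BackupDirectory;

These are undocumented extended stored procedures, so treat them as subject to change.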

Attaching Databases To Docker

Andrew Pruski shows one scenario where Docker on Windows is better than Docker on Linux:

One of the (if not the) main benefits of working with SQL in a container is that you can create a custom image to build containers from that has all of your development databases available as soon as the container comes online.

This is really simple to do with Windows containers. Say I want to attach DatabaseA that has one data file (DatabaseA.mdf) and a log file (DatabaseA_log.ldf): –

ENV attach_dbs="[{'dbName':'DatabaseA','dbFiles':['C:\\SQLServer\\DatabaseA.mdf','C:\\SQLServer\\DatabaseA_log.ldf']}]"

Nice and simple! One line of code and any containers spun up from the image this dockerfile creates will have DatabaseA ready to go.

However this functionality is not available when working with Linux containers. Currently you cannot use an environment variable to attach a database to a SQL instance running in a Linux container.

Read on to see what you can do if you’re using a Linux container.
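
Once a container built from such an image is running, a quick sanity check (using the hypothetical DatabaseA from the example above) is to connect and confirm the database is online:

-- Should return DatabaseA with a state of ONLINE once the container is ready.
SELECT name, state_desc
FROM sys.databases
WHERE name = N'DatabaseA';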

NULL Replacement In SQL Server And Oracle

Daniel Janik shows a pair of non-standard functions you can use to replace NULL values:

It’s Wednesday and that means another SQL/Oracle post. Today we’ll be discussing NULL values, which can sometimes be a real pain. Don’t worry, though; there’s a simple solution: simply replace the NULL value with another.

Comparing a column with NULL and replacing with another value is really simple. There are built in functions for replacing NULL values. I’m not going to discuss the ANSI standard COALESCE here. If you want to know more about it you can find it on Bing.

I provide no comment on Daniel’s claim regarding being able to find something on Bing…  Click through to see the custom NULL replacement functions in SQL Server versus Oracle.
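
For context, the usual non-ANSI replacements are ISNULL in SQL Server and NVL in Oracle. A minimal sketch against a hypothetical table:

-- SQL Server: substitute 'N/A' when MiddleName is NULL.
SELECT FirstName, ISNULL(MiddleName, 'N/A') AS MiddleName
FROM dbo.Customers;

-- Oracle: NVL performs the same substitution.
SELECT FirstName, NVL(MiddleName, 'N/A') AS MiddleName
FROM Customers;

Both act like a two-argument COALESCE, though ISNULL takes its return type from the first argument rather than following data type precedence.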

Biml Enrichment With Annotations

Bill Fellows shows why it’s useful to include annotations in your Biml scripts:

In many of the walkthroughs on creating relational objects via Biml, it seems like people skim over the Databases collection. There’s nothing built into the language to really support the creation of database nodes. The import database operations are focused on tables and schemas and assume the database node(s) have been created. I hate assumptions.

Read on for more about dealing with databases, and not just tables and other database objects, in Biml.

Dealing With Limited Rights In Biml

Shannon Lowder walks through a scenario where he wants limited rights to process metadata changes, separate from any data transfer:

My development environment has a local instance of SQL Server with AdventureWorks2014 on it.  I’m going to use that as my source.  I also created a database on this instance called BimlExtract to serve as my destination database.

To create a user that can only read the schema on the source system, I created a login and user named ‘Biml’.  I granted this user VIEW DEFINITION in AdventureWorks2014. I also added this user to the db_owner group in BimlExtract.  Now, this user can read the schema of the source, and create tables in the destination. I’ve included the T-SQL to set the permissions in Database Setup.sql.

Now, we’re ready to walk through the solution.

Click through for the solution and also a GitHub repo with all of Shannon’s code.
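
The permission setup Shannon describes amounts to something like the following sketch (not the exact contents of his Database Setup.sql; the password is a placeholder):

USE [master];
CREATE LOGIN [Biml] WITH PASSWORD = N'ChangeMe_123!';
GO

-- Source: allow reading schema metadata only.
USE [AdventureWorks2014];
CREATE USER [Biml] FOR LOGIN [Biml];
GRANT VIEW DEFINITION TO [Biml];
GO

-- Destination: full rights so generated packages can create and load tables.
USE [BimlExtract];
CREATE USER [Biml] FOR LOGIN [Biml];
ALTER ROLE [db_owner] ADD MEMBER [Biml];
GO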

Understanding Decision Trees

Ramandeep Kaur explains how decision trees work:

Simply put, a decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.

It is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems and works for both categorical and continuous input and output variables. It is one of the most widely used and practical methods for Inductive Inference. (Inductive inference is the process of reaching a general conclusion from specific examples.)

Decision trees learn and train themselves from given examples and make predictions for unseen examples.

Click through for an example of implementing the ID3 algorithm and generating a decision tree from a data set.

Bad Parameter Sniffing Flowchart

Grant Fritchey is asking for input on a new flowchart he has created:

Lots of people are confused by how to deal with bad parameter sniffing when it occurs. In an effort to help with this, I’m going to try to make a decision flow chart to walk you through the process. This is a rough, quite rough, first draft.

I would love to hear any input. For this draft, I won’t address the things I think I’ve left out. I want to see what you think of the decision flow and what you think might need to be included. Click on it to embiggen.

I think it’s a great first step.  I think a decision point around using local variables in place of parameters would be useful, particularly in contrast to using RECOMPILE and OPTIMIZE FOR UNKNOWN.
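
For anyone unfamiliar with the techniques being weighed, here is a quick sketch of all three against a hypothetical procedure (none of this comes from Grant's flowchart):

-- 1) Local variable: the optimizer cannot sniff the value, so it uses average statistics.
CREATE PROCEDURE dbo.GetOrders_LocalVariable @CustomerID int
AS
BEGIN
    DECLARE @LocalCustomerID int = @CustomerID;
    SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = @LocalCustomerID;
END;
GO

-- 2) OPTION (RECOMPILE): a fresh plan on every execution, at some CPU cost.
CREATE PROCEDURE dbo.GetOrders_Recompile @CustomerID int
AS
    SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = @CustomerID
    OPTION (RECOMPILE);
GO

-- 3) OPTIMIZE FOR UNKNOWN: one plan built from average statistics, much like the local variable.
CREATE PROCEDURE dbo.GetOrders_OptimizeForUnknown @CustomerID int
AS
    SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = @CustomerID
    OPTION (OPTIMIZE FOR (@CustomerID UNKNOWN));
GO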

Checkpointing Code For Reproduction

David Smith tells an interesting story about a reproducibility problem with data analysis:

Timo Grossenbacher, data journalist with Swiss Radio and TV in Zurich, had a bit of a surprise when he attempted to recreate the results of one of the R Markdown scripts published by SRF Data to accompany their data journalism story about vested interests of Swiss members of parliament. Upon re-running the analysis in R last week, Timo was surprised when the results differed from those published in August 2015. There was no change to the R scripts or data in the intervening two-year period, so what caused the results to be different?

The version of R Timo was using had been updated, but that wasn’t the root cause of the problem. What had also changed was the version of the dplyr package used by the script: version 0.5.0 now, versus version 0.4.2 then. For some unknown reason, a change in the dplyr package in the intervening period caused some data rows (shown in red above) to be deleted during the data preparation process, and so the results changed.

Click through for the solution, which is pretty easy in R.
