Press "Enter" to skip to content

Curated SQL Posts

Data Frames

Saravanan Subramanian has an introduction to data frames in R:

The R data frame is a high level data structure which is equivalent to a table in database systems.  It is highly useful to work with machine learning algorithms, and it’s very flexible and easy to use.

The standard definition of data frames are a “tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R‘s modeling software.”

Data frames are a powerful abstraction and make R a lot easier for database professionals than application developers who are used to thinking iteratively and with one object at a time.

Comments closed

Warehouses Will Live On

Jesse Seymour argues that in-memory analysis solutions will not entirely replace data warehouses:

The big reason that dimensional modeling increases clarity is that the dimensional model seeks to flatten data as much as possible.  Let’s compare two examples.  Both of these examples are for a fictional health clinic.

The first example is that we want a report on how many male patients were  treated with electric shock therapy by provider, grouped monthly and spanning year to date range.

Those big Kimball-style warehouses do a great job of making it easier for people who are not database specialists to query data and get meaningful, consistent results to known business questions.  The trick to understanding data platforms is that they tend to be complements rather than substitutes:  introducing Spark-R in your environment does not replace your Kimball-style warehouse; it complements it by letting analysts find trends more easily.  Similarly, a Hadoop cluster potentially lets you complement an existing data warehouse in a few ways:  acting as a data aggregator (which allows you to push some ETL work off onto the cluster), a data collector (especially for information which is useful but doesn’t really fit in your conformed warehouse), and a data processor (particularly for those gigantic queries which are not time-sensitive).

Comments closed

Migrating SQL Server To Azure IaaS

James Serra has links and notes on migrating SQL Server from on-premesis to Azure virtual machines:

  • Convert on-premises physical machine to Hyper-V VHD, upload to Azure Blob storage, and then deploy as new VM using uploaded VHD.  Use when bringing your own SQL Server license, when migrating a database that you will run on an older version of SQL Server, or when migrating system and user databases together as part of the migration of database dependent on other user databases and/or system databases.  Use on SQL Server 2005 or greater to SQL Server 2005 or greater

  • Ship hard drive using Windows Import/Export Service.  Use when manual copy method is too slow, such as with very large databases.  Use on SQL Server 2005 or greater to SQL Server 2005 or greater

If you’re looking for notes on where to get started, this is a good link.

Comments closed

Using Python 3.4 With EMR And Spark

Bruno Faria shows how to use Python 3.4 with Spark on Amazon’s ElasticMapReduce:

An EMR 4.6 cluster running Spark 1.6.1 will still use Python 2.7 as the default interpreter. If you want to change this, you will need to set the environment variable: PYSPARK_PYTHON=python34. You can do this when you launch a cluster by using the configurations API and supplying the configuration shown in the snippet below:

I’m more of a SQL and Scala guy, but if you like Python and are on the Python 3 side of the divide, here’s a solution for you.

Comments closed

More On Database Corruption

SQLWayne has a follow-up on database corruption:

You see the CHECKSUM on the backup along with the RESTORE VERIFYONLY.  The code was generated by right clicking on the database, selecting Tasks, then Backup, plug in the parameters, and select Script.  I put it in a new query window as I may back up several databases in the same job.  Sometimes I’ll just do a find/replace for the other databases since my backup.  The Restore Verifyonly gives you some confidence that your backup is recoverable: NEVER assume that just because your backup ran that the database is restorable!  The ONLY way to know is to actually restore it to another file!  You don’t want to accidentally clobber your production that probably has newer data in it.

Corruption is a serious event when your entire job revolves around protecting data.  Be prepared.

Comments closed

SET STATISTICS XML

Daniel Hutmacher explains how SET STATISTICS XML will generate execution plans for certain segments of code:

But sometimes you want to run a series of statements or procedures where you only want the execution plan for some of the statements. Here’s how:

The actual execution plan is enabled by turning on SET STATISTICS XML., not unlike enabling STATISTICS IO or TIME. And just like SET NOCOUNT, the SET statements apply to the current context, which could be a stored procedure, a session, etc. When this context ends, the setting reverts to that of the parent context.

I see code snippets with STATISTICS IO and TIME fairly regularly, but almost never see STATISTICS XML; instead, I see people (including myself) hit Ctrl-M or select the “Include Actual Execution Plan” button when generating execution plans is desirable.

1 Comment

What To Do When Corruption Hits You

Brent Ozar has a step-by-step guide explaining what to do when you CHECKDB reports corruption:

This sounds paranoid, but as an example, here’s a corruption case I had recently: shortly after detecting corruption, the team realized they’d have to revert to a backup of the database from a few days ago. Rather than telling users about that possibility, they let the users keep adding data into the already-corrupt database while the DBAs did troubleshooting. Several days later, as the corruption got worse, even Microsoft couldn’t repair the corruption – and the affected tables went completely offline, permanently. If the users would have been alerted earlier, they could have avoided even more data loss.

Good advice.  If you have Pluralsight, I recommend Paul Randal’s course on database corruption.  Watch that ideally before you have corruption…

Comments closed

Updates With Joins

James Anderson shows a case in which an update with a join clause can lead to undesirable results:

That’s right; SQL Server will just pick any child. It will not update the parent row for each child. When the MERGE statement runs into this problem, it raises an error and rolls back the UPDATE but our query will silently pick any value and move on. We don’t want to update data like this, we want our query to dictate the logic explicitly. This will require some changes to our query.

I prefer to write these types of updates with the use of a CTE. This allows us to easily highlight the SELECT in the CTE and execute it to see which rows will be updated (@variables can cause problems here though). Adding a COUNT can help to identify the problem in the previous query.

Highlight and execute the code inside the CTE below to check for duplicates. If this returns 0 rows then you are good to remove the COUNT, GROUP BY and HAVING then add the name columns before executing the whole statement.

When writing T-SQL updates with joins, it’s important to consider whether the grain changes, and if that change can make a difference in your update set.

Comments closed

How WITH ENCRYPTION Works

Paul White digs into the WITH ENCRYPTION clause:

The core RC4 algorithm is well-known, and relatively simple. It would be better implemented in a .Net language for efficiency and performance reasons, but there is a T-SQL implementation below.

These two T-SQL functions implement the RC4 key-scheduling algorithm and pseudorandom number generator, and were originally written by SQL Server MVP Peter Larsson. I have a made some minor modifications to improve performance a little, and allow LOB-length binaries to be encoded and decoded. This part of the process could be replaced by any standard RC4 implementation.

Using WITH ENCRYPTION is a gentleman’s agreement that you won’t look at the underlying code.  In practice, it’s trivial to get around, and Paul shows exactly why.

Comments closed

Changing Computed Columns

Steve Jones shows a how to modify a computed column:

I was working with a computed column the other day, and realized I had the wrong definition. In this case, I was performing some large calculation, and the result was larger than an int. However the first part of the formula was an int, which resulted in an implicit conversion to an int.

I needed to change the formula, and then realized that plenty of people might not work with computed columns much, and not realize how you alter a computed column.

You don’t.

In fact, you need to drop the column and add it back. In my case, this was what I did.

Fortunately, this tends to be a pretty quick operation, especially if the computed column is non-persisted.

Comments closed