Press "Enter" to skip to content

Category: Self-Promotion

Embrace the Power of 5th Normal Form

I have a new video up:

In this video, we drill into the other most important normal form, learning what Fifth Normal Form (5NF) is, why Boyce-Codd Normal Form is not enough, and examples of why 5NF can be such a challenge to implement.

Until I read CJ Date’s Database Design and Relational Theory (2nd edition), my level of appreciation for 5th Normal Form was somewhat limited, but that’s mostly because I didn’t understand it well at all. I liked the connection trap example in this article, but Date’s book was the first really good explanation of 5NF and just how powerful it is. My hope is that I was successfully able to convey that power to audiences.

Comments closed

A Primer on Boyce-Codd Normal Form

I have a new video:

In this video, we drill into one of the two most important normal forms, learning what Boyce-Codd Normal Form (BCNF) is, how you can get to BCNF, and a practical example of it. We also learn why I cast so much shade on 2nd and 3rd Normal Forms.

Boyce-Codd Normal Form is one of the two most important normal forms, and I’m pretty happy with the way this video came together to explain how you can get from 1NF into BCNF, as well as the specific benefits this provides.

Comments closed

PolyBase and Excel

I have a post on setting up PolyBase to work with Microsoft Excel:

If you tried to use Microsoft’s Excel driver prior to 2019 CU2, you’d get the following error:

Msg 105082, Level 16, State 1, Line LineNumber
105082;Generic ODBC error: [Microsoft][ODBC Excel Driver]Optional feature not implemented

To this point, I recommended in PolyBase Revealed that you use a different driver, like CData’s, which did work. CData’s driver still works (I assume…PolyBase ODBC support is a fluid situation, it seems), but now I can officially say that PolyBase supports the Microsoft Access Database Engine Redistributable driver for Microsoft Excel. Let’s go to the tape.

Click through for the instructions.

Comments closed

ggplot2 Scales And Coordinates

I continue my series on ggplot2:

The other thing I want to cover today is coordinate systems.  The ggplot2 documentation shows seven coordinate functions.  There are good reasons to use each, but I’m only going to demonstrate one.  By default, we use the Cartesian coordinate system and ggplot2 sets the viewing space.  This viewing space covers the fullness of your data set and generally is reasonable, though you can change the viewing area using the xlim and ylim parameters.

The special coordinate system I want to point out is coord_flip, which flips the X and Y axes.  This allows us, for example, to turn a column chart into a bar chart.  Taking our life expectancy by continent, data I can create a bar chart whereas before, we’ve been looking at column charts.

There are a lot of pictures and more step-by-step work.  Most of these are still 3-4 lines of code, so again, pretty simple.

Comments closed

Polybase And HDInsight

I have a post up on trying to integrate Polybase with HDInsight:

But now we run into a problem:  there are certain ports which need to be open for Polybase to work.  This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs.  This goes back to the issue we see with spinning up data nodes in Docker:  ports are not available.  If you’ve put your HDInsight cluster into an Azure VNet and monkey around with ports, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.

As I mention in the post, I’d much rather build my own Hadoop cluster; I don’t think you save much maintenance time in the long run going with HDInsight.

Comments closed

Using Temporal Tables For SCD2

I have a post on pain that I experienced with temporal tables:

This query succeeds but returns results we don’t really want:

ProductModelTemporalSameDate

This brings back all 9 records tied to products 1 and 2 (because product 3 didn’t exist on July 2nd at 8 AM UTC). But it gives us the same start and end date, so that’s not right. What I really want to do is replace @InterestingTime with qsp‘s DatePredictionMade, so let’s try that:

ProductModelTemporalInvalid

This returns a syntax error. It would appear that at the time FOR SYSTEM_TIME is resolved, QuantitySoldPrediction does not yet exist. This stops us dead in our tracks.

This is one of the two things I’d really like to change about temporal tables; the other thing (now that auto-retention is slated for release) is the ability to backfill data without turning off system versioning.

Comments closed

Partitioned Views With Polybase

I look at using SQL 2000-style partitioning with a set of external tables:

Using a view, we were able to create a “partitioned” Polybase experience, similar to what we had in SQL Server 2000.  This form of poor man’s partitioning allows us to segment out data sets and query them independently, something which can be helpful when storing very large amounts of data off-site and only occasionally needing to query it.  The thing to remember, though, is that if you store this in Azure Blob Storage, you will need to pull down the entire table’s worth of data to do any processing.

This leads to a concept I first heard from Ginger Grant:  pseudo-StretchDB.  Instead of paying for what Stretch offers, you get an important subset of the functionality at a much, much lower price.  If you do store the data in Azure Blob Storage, you’re paying pennies per gigabyte per month.  For cold storage, like a scenario in which you need to keep data around to keep the auditors happy but your main application doesn’t use that information, it can work fine.  But if you need to query this data frequently, performance might be a killer.

For Polybase tables without the ability to perform external pushdown, coming up with a good partitioning strategy is probably one of the two best ways to improve performance, with creating a Polybase scale-out cluster the other method.

Comments closed

Forcing Polybase External Pushdown

I have a post showing how to control predicate pushdown in Polybase:

As a reminder, in order to allow predicate pushdown to occur, we need to hit a Hadoop cluster; we can’t use predicate pushdown on other systems like Azure Blob Storage.  Second, we need to have a resource manager link set up in our external data source.  Third, we need to make sure that everything is configured correctly on the Polybase side.  But once you have those items in place, it’s possible to use the FORCE EXTERNALPUSHDOWN command like so:

There’s also discussion of preventing MapReduce job creation as well as a pushdown-related error I had received in the past.

Comments closed

Building A Hadoop Cluster

I have a post on building a five-node Hadoop cluster using Docker containers:

Notice how 3bd shows up for pretty much all of these services.  This is not what you’d want to do in a real production environment, but because we want to use Docker and easily pass ports through, it’s the simplest way for me to set this up.  If you knew beforehand which node would host which service, you could modify the run.sh batch script that we discussed earlier and open those specific ports.

After assigning masters, we next have to define which nodes are clients in which clusters.

Click through for a screenshot-laden walkthrough.

Comments closed