ETL With Spark

Eric Maynard demonstrates that moving data across Hadoop clusters can be sped up by using Spark:

By leveraging Spark for distribution, we can achieve the same results much more quickly and with the same amount of code. By keeping data in HDFS throughout the process, we were able to ingest the same data as before in about 36 seconds. Let’s take a look at Spark code which produced results equivalent to the bash script shown above. Note that a more parameterized version of this code, and of all code referenced in this article, can be found below in the Resources section.

Read the whole thing.

SSAS Tabular RAM Requirements

Bill Anton looks at ensuring your Tabular server has enough RAM:

In addition to being an “in-memory” technology, Analysis Services Tabular is also a “column-store” technology which means all the values in a table for a single column are stored together. As a result – and this is especially true for dimensional models – we are able to achieve very high compression ratios. On average, you can expect to see compression ratios anywhere from 10x-100x depending on the model, data types, and value distribution.

What this ultimately means is that your 2 TB data mart will likely only require between 20 GB (low end) and 200 GB (high end) of memory. That’s pretty amazing – but still leaves us with a fairly wide margin of uncertainty. In order to further reduce the level of uncertainty, you will want to take a representative sample from your source database, load it into a model on a server in your DEV environment, and calculate the compression factor.
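
The arithmetic in that passage is easy to sketch in Python. This is only an illustration of the quoted reasoning: the function names and the sample figures (a 50 GB sample that compresses to 2 GB) are made up for the example and do not come from Bill's post, and the result covers the compressed model only, not the other sizing factors he discusses.

```python
def memory_range_gb(source_gb, low_ratio=10, high_ratio=100):
    """Return (best_case_gb, worst_case_gb) for the quoted 10x-100x range."""
    return source_gb / high_ratio, source_gb / low_ratio


def compression_factor(sample_source_gb, sample_model_gb):
    """Compression factor measured from a representative sample."""
    return sample_source_gb / sample_model_gb


def estimate_model_gb(full_source_gb, factor):
    """Scale the full source size by the measured compression factor."""
    return full_source_gb / factor


# 2 TB treated as 2,000 GB, matching the 20 GB / 200 GB figures in the quote.
best_case, worst_case = memory_range_gb(2000)

# Hypothetical sample: 50 GB of source data loads into a 2 GB model.
factor = compression_factor(50, 2)
estimate = estimate_model_gb(2000, factor)

print(f"Range: {best_case:.0f} GB to {worst_case:.0f} GB")
print(f"Measured factor: {factor:.0f}x, estimated model size: {estimate:.0f} GB")
```

Measuring the factor on a sample replaces the wide 10x-100x guess with a single number for your own data, which is the point of the exercise.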

Read the whole thing; Bill has several factors he considers when sizing a machine.

SQL Server Port Changes

Steve Jones shows how to change the port of your SQL Server instance:

Notice that I have multiple instances here, so I need to choose one. Once I do, I see the protocols on the right. In this case, I want to look at the properties of TCP/IP, which is where I’ll get the port.

If I look at properties, I’ll start with the Protocol tab, but I want to switch to the IP Addresses tab. In here, I’ll see an entry for each of the IPs my instance is listening on. I can see which ones are Active as well as the port. In my case, I have these set to dynamic ports.

My rules of thumb, which might differ from your rules of thumb:  disable the Browser, don’t change off of 1433 for a single instance, and hard-code ports if you happen to be using named instances.  There’s a small argument in favor of “hiding” your instance by putting it onto a higher port (e.g., 50000+), but that’s not a great way of protecting a system, as an attacker can run nmap (or any other port scanner) and find your instance.  The major exception to this is if you also have something like honeyports set up.  In that case, changing the port number can increase security, and will almost definitely increase the number of developers who accidentally get blackholed from the server.
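
To show why a high port number hides very little, here is a minimal sketch of a TCP connect check in Python. It is not how nmap works internally, just the simplest version of the same idea; the function names are my own, and the demo only probes a throwaway listener on localhost. Only scan machines you are allowed to scan.

```python
import socket


def is_port_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def find_open_ports(host, ports, timeout=0.5):
    """Return the subset of ports that accept a TCP connection."""
    return [port for port in ports if is_port_open(host, port, timeout)]


if __name__ == "__main__":
    # Stand-in for a service "hidden" on a high port: a local listener
    # on whatever free port the operating system hands out.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen()
        hidden_port = listener.getsockname()[1]

        # Probe a small window of ports around the listener.
        candidates = range(hidden_port - 5, hidden_port + 6)
        print(find_open_ports("127.0.0.1", candidates))
```

A loop like this over the full port range takes very little effort, which is why moving the port is obscurity rather than protection.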

Data Migration Assistant Timeouts

Kenneth Fisher shows that the Data Migration Assistant developers thought ahead:

I’ve been really excited about the new Data Migration Assistant (DMA) since I first heard about it. One of the things I like best about it is that, unlike the old Upgrade Advisor, it doesn’t have to be run on the server being upgraded. You can run it against any number of instances from a single workstation. The other day I was working from home and tried running the DMA against a couple of moderate-size databases (about 1.25 TB total) and I consistently got timeout errors.

Click through for the solution.  I’d prefer “have the queries be quick enough not to require this change” be the solution, but I don’t know exactly which queries they’re using, and some DMVs/DMFs can be quite slow.

Replicating To Azure SQL DB

Jeffrey Verheul shows how to enable replication from your on-prem SQL Server up to Azure SQL DB:

Replication to another on-premise instance is easy. You just follow the steps in the wizard, it works out-of-the-box, and the chances of this process failing are small. With replicating data to an Azure SQL database it’s a bit more of a struggle. Just one single word took me a few HOURS of investigation and a lot of swearing…

The magic word is “secure.”  Read the whole thing if you’re thinking of migrating an app to use Azure SQL DB and want to minimize downtime, or if you just want that extra level of protection that having a copy of your database out of the data center can give you.

Truncate Table And Stats

Kendra Little shows that TRUNCATE TABLE does not always reset stats:

You might expect to see that the statistic on Quantity had updated. I expected it, before I ran through this demo.

But SQL Server never actually had to load up the statistic on Quantity for the query above. So it didn’t bother to update the statistic. It didn’t need to, because it knows that the table is empty, and this doesn’t show up in our column or index specific statistics.

Check it out.

Wiring A Raspberry Pi 3

Drew Furgiuele begins his project to build an easy button for backups:

I should also pause for a second and talk about wiring hobby boards like this. Good news first: you won’t electrocute yourself on it. I mean, if you do something really dumb like try to wire it underwater or eat it or something then maybe you could, but you shouldn’t ever receive a shock while working with a board like this, even plugged in. The bad news is that even though you won’t damage yourself, you could very well damage the board if you just randomly plug things in. Here’s a hard and fast rule: if you’re not an electronics expert or an electrical engineer, leave it to experts to tell you where and how to wire. I’m not calling myself an expert here, but I have sort of a basic understanding of how to wire these things up. The point I’m trying to make is: if you want to really learn and understand circuit design, there are lots of great resources for getting started. And it’s quite a rabbit hole to go down, but it’s well worth your time if you want to learn more.

Read the whole thing.  Over a weekend, with your Pi 3.

Recurring Server-Side Traces

Kevin Hill shows how to set up a server-side trace which runs periodically:

How to set up a recurring Server-side SQL trace that runs every hour for 10 minutes.


  • 6 people in the room are staring at me waiting for the last-second request to be done at the end of an 11-hour day (3 of them from the VBV – Very Big Vendor)

  • Trace file names must be different, or you get errors

  • Trace files cannot end with a number

  • I can’t tell time when I am hungry and tired
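
The two file-name lessons above (names must differ between runs and must not end with a number) can be handled by building the name from the run time. This is a hedged Python sketch, not Kevin's script: the function name, the path, and the "_trc" suffix are all hypothetical choices for illustration.

```python
from datetime import datetime


def trace_file_name(base_path, run_time, suffix="_trc"):
    """Build a per-run trace file name that does not end with a digit.

    The timestamp makes each hourly run distinct; the (hypothetical)
    suffix keeps the name from ending with a number.
    """
    stamp = run_time.strftime("%Y%m%d_%H%M")
    name = f"{base_path}_{stamp}{suffix}"
    if name[-1].isdigit():
        raise ValueError(f"trace file name must not end with a number: {name}")
    return name


# Hypothetical path; two consecutive hourly runs get two different names.
first = trace_file_name(r"D:\Traces\vendor", datetime(2018, 10, 1, 13, 0))
second = trace_file_name(r"D:\Traces\vendor", datetime(2018, 10, 1, 14, 0))
print(first)
print(second)
```

Whatever schedules the trace (a SQL Agent job, for instance) would generate a name like this on each run and pass it to the trace definition.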

Extended Events are still the preferred method over server-side traces for getting information, but when a vendor demands traces, the scope for saying “There’s a better way” diminishes quickly, and it’s good to know how to create a server-side trace so you aren’t opening Profiler regularly.

30K Non-Indexed Column Stats

Lonny Niederstadt tests the limits of statistics on non-indexed columns in SQL Server:

A friend pointed out that the same reference indicates a maximum of 30,000 columns in a wide table.  That got me thinking – maybe 30,000 stats is a per-table maximum?

Not too hard to test.  Yep – limit per table.

Filed under Swart’s Ten Percent Rule.

Querying Genomic Data With Athena

Aaron Friedman explains how to use Amazon Athena to query S3 files:

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to set up or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g. Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR.

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena and demonstrate how Athena is well-adapted to address common genomics query paradigms.  I use the Thousand Genomes dataset, a seminal genomics study hosted on Amazon S3, to demonstrate these approaches. All code that is used as part of this post is available in our GitHub repository.

This feels a lot like a data lake PaaS process where they’re spinning up a Hadoop cluster in the background, but one which you won’t need to manage. Cf. Azure Data Lake Analytics.

