The Data Lake From 10,000 Feet

Pradeep Menon has a high-level explanation of what a data lake is and how it differs from traditional data warehouses:

With the changes in the data paradigm, a new architectural pattern has emerged. It’s called as the Data Lake Architecture. Like the water in the lake, data in a data lake is in the purest possible form. Like the lake, it caters to need to different people, those who want to fish or those who want to take a boat ride or those who want to get drinking water from it, a data lake architecture caters to multiple personas. It provides data scientists an avenue to explore data and create a hypothesis. It provides an avenue for business users to explore data. It provides an avenue for data analysts to analyze data and find patterns. It provides an avenue for reporting analysts to create reports and present to stakeholders.

The way I compare a data lake to a data warehouse or a mart is like this:

Data Lake stores data in the purest form caters to multiple stakeholders and can also be used to package data in a form that can be consumed by end-users. On the other hand, Data Warehouse is already distilled and packaged for defined purposes.

One way of thinking about this is that data warehouses are great for solving known business questions:  generating 10K reports or other regulatory compliance reporting, building the end-of-month data, and viewing standard KPIs.  By contrast, the data lake is (among other things) for spelunking, trying to answer those one-off questions people seem to have but which the warehouse never seems to have quite the right set of information.

A New ODBC Package For R

David Smith looks at the odbc package in R:

The odbc package is a from-the-ground-up implementation of an ODBC interface for R that provides native support for additional data types (including dates, timestamps, raw binary, and 64-bit integers) and parameterized queries. The odbc package provides connections with any ODBC-compliant database, and has been comprehensively tested on SQL Server, PostgreSQL and MySQL. Benchmarks show that it’s also somewhat faster than RODBC: 3.2 times faster for reads, and 1.9 times faster for writes.

Sounds like odbc lets you run ad hoc queries and also lets you use dplyr as an ORM, similar to using Linq in C#.

Multi-Migration To Azure

Kevin Hill shows how to migrate multiple databases to Azure SQL Database:

Since I might be doing a similar thing in the near future, I decided to see if I could push 4-5 small test databases to Azure SQLDB at the same time (SQL on Azure VM test may come later).  If this happens with my client it will be hundreds of databases, with the same schema but wildly varying sizes.

The “Plan”

  • Create 5 sample databases

  • Create .bacpacs on local drive

  • Create Azure Logical Server if not already done (see step 4)

  • Create a Notepad of SQLPackage.exe commands,  one for each .bacpac

  • Run with “start” in CMD to fire off separate windows

  • Wait.

  • Enjoy.

Worth reading the whole thing.

Resizing A Linux Partition

Steve Jones shows how to add disk space to a Linux partition:

While working with some SQL Server 2017 tests, I ran out of disk space. I tend to size my VMs around 40GB, and that works for some things, but I’ll run out of space.

I needed to expand the VMWare disk. That doesn’t mean Linux sees the space directly, and I had to figure out how to make the partition bigger. I could have added another disk, but I wanted to work through this process. I learned I needed to have an inactive partition, so I download gparted on a live cd and booted to that.

Steve uses the GUI approach; in the comments, David Klee links to his CLI approach.

Using The COMPRESS Function In SQL Server

Kendra Little explains the COMPRESS() function in SQL Server 2016:

One cool little feature in SQL Server 2016 is COMPRESS(). It’s a TSQL function available in all editions that shrinks down data using the GZIP algorithm (documentation).

Things to know about COMPRESS():

  • Compressed data is in the VARBINARY(max) data type

  • You get the data “back to normal” by  using the DECOMPRESS function – which also outputs VARBINARY(max)

  • You can’t use columns of the VARBINARY(max) type in an index key column– but it may be useful to use the column as a filter in a filtered index, in some cases

COMPRESS() uses standard GZip compression, so you could use methods other than DECOMPRESS() to inflate the data—for example, bring the compressed data out to your application and use language-specific GZip libraries to decompress the data.  Read the whole thing.

Stream Analytics Into Power BI

Rolf Tesmer shows off how to use Azure Stream Analytics to push data in real time via the Power BI API into your Power BI dashboard:

You can push data to the Power BI streaming dataset API in a few ways… but they generally boil down to these 3 options

  1. Directly call the API from code
  2. Directly call the API from an Azure Logic App
  3. Use Azure Stream Analytics to push data into the API

This blog post extends on my previous post – and thus I will be leveraging Option #3 above.

Definitely worth checking out if you are interested in real-time Power BI dashboards.

A Game Written In SQL Server

Daniel Janik has built a game that runs entirely in SQL Server:

As it turns out SQL Server Management Studio (SSMS) can display many types of emoji and this is the key.

Keeping in mind that the whole thing was not written with performance or best practice in mind, I’d like to introduce the world to the very first action adventure game 100% written and played in SQL Server!

The goal here is to have a game which helps teach some basics of development practices.  Interesting concept.

Using Startup Stored Procedures

John Morehouse explains how to set up a startup stored procedure, as well as some of the risks involved:

Startup procedures automatically execute whenever SQL Server is started.  Where would you use this?  One example is if you had an application that required the use of a global temporary table, you could use a startup procedure to create it.  This would allow the table to be immediately accessible to any application that requires it.

Another possibility is that you could “warm up” the cache so that data is already cached when an application needs it.  This would help prevent that initial I/O hit the first time the query is called, thus potentially improving initial performance.

Click through for details.  I’ve created a couple of these, but they rarely come to mind and that’s one of the big problems:  if there is an issue, there’s probably quite a bit of investigative work that would take place before somebody remembers that hey, these things exist.


August 2017
« Jul