Press "Enter" to skip to content

Author: Kevin Feasel

Evolution of the Data Lake

Jim Wankowski takes us through the history of data lakes:

It is important to understand the difference between data lakes and data warehouses. A data warehouse is highly structured. Much effort is done upfront in developing schemas and hierarchies prior to the data being loaded into a warehouse. There is no hierarchy or structure to the way data is stored in a data lake. The structure is applied afterward. There can be multiple schemas applied to the same data in a data lake.

Read on to learn how the data lake concept has evolved over the past few years.

Comments closed

Use SQL for XML and JSON Creation

Lukas Eder argues that if you’re storing the data in SQL and you need to get data from a database into JSON or XML format, just use SQL for that:

In English: We need a list of actors, and the film categories they played in, and grouped in each category, the individual films they played in.

Let me show you how easy this is with SQL Server SQL (all other database dialects can do it these days, I just happen to have a SQL Server example ready:

Lukas makes a great point and has a FAQ to follow up on it. If there’s a reason for mapping at a higher layer—if you’re actually adding value rather than building out a set of converters—that’s one thing, but if you’re just accepting a data set and returning a JSON blob…well, your database product can do that too.

Comments closed

Problems with SQL Server Index Recommendations

Brent Ozar has some grievances to air:

And if you don’t have time to review one query at a time, SQL Server makes wide-ranging analysis easy too, letting you query dynamic management views like sys.dm_db_missing_index_details to get index recommendations for the entire database. You can even use tools like sp_BlitzIndex to analyze and report on ’em.

Except…

Both of these – the index recommendations in the query plan and the ones in the DMVs – suffer from some pretty serious drawbacks.

Click through for the list. There are some doozies in there.

Comments closed

Creating Azure Data Studio Notebooks Using Powershell

Rob Sewell inverts the “Use Azure Data Studio to create Powershell notebooks” mantra:

This module contains only 3 commands at present

* Convert-ADSPowerShellForMarkdown

This will create the markdown link for embedding PowerShell code in a Text Cell for a SQL Notebook as described in this blog post

* New-ADSWorkBookCell

This command will create a workbook text cell or a code cell for adding to the New-ADSWorkBook command

* New-ADSWorkBook

This will create a new SQL Notebook using the cell objects created by New-ADSWorkBookCell

Click through for an example.

Comments closed

Optimizing for Sequential Keys

Milos Radivojevic is excited about OPTIMIZE_FOR_SEQUENTIAL_KEY:

The results show that the feature should be used for tables, where latch convoys already happen or are about to happen. In a not-so-heavy workloads, it brings a small overhead. You can see that inserts with 100 parallel sessions are 16% slower for a table with the optimized sequential key. However, when you hit the convoy, the feature definitely helps.

This graf is critical: if you don’t have high enough concurrency on the table, insertion can be a little slower than otherwise, so don’t go setting this for every table.

Comments closed

Deploying a Big Data Cluster with Azure Data Studio

Mohammad Darab shows how you can deploy a Big Data Cluster to Azure Kubernetes Service using Azure Data Studio:

A few months ago I posted a blog on deploying a BDC using the built-in ADS notebook. This blog post will go a bit deeper into deploying a Big Data Cluster on AKS (Azure Kubernetes Service) using Azure Data Studio (version 1.13.0). In addition, I’ll go over the pros and cons and dive deeper into the reasons why I recommend going with AKS for your Big Data Cluster deployments.

AKS does make it pretty easy. The toughest part for me was figuring out which instance types were supported—I tried a few which would save me money and they weren’t available. I do like that they added a check to view availability before completing the notebook; that wasn’t in the preview version.

Comments closed

Get the Stack Overflow Columnstore Edition Database

Erik Darling has more for you on Stack Overflow + Columnstore:

If you want to download the database, here’s the magnet link for the torrent. I don’t have another means of distributing this; it’s too big of a file.

If you want the GitHub scripts to create and load data, head over here.

In addition, Erik has some quick queries showing table relationships in a world without foreign key constraints:

To get you started exploring the Stack Overflow column store database, here are some queries that show how tables are related.

The two main relationships are User Id, and Post Id.

Quick side note: joining together large columnstore indexed tables? Generally not the best idea.

Comments closed

Explaining Duplicate Indexes

Kevin Hill will be shocked and amazed that I finally linked to him again:

Duplicate indexes are those that exactly match the Key and Included columns.  That’s easy.

Possible duplicate indexes are those that very closely match Key/Included columns.

Why do you care?
Indexes have to be maintained. When I say that, most people immediately think of Reorganizing, rebuilding and updating statistics, and they are not wrong.

Click through for a great explanation of what “duplicate” indexes are, as well as ways to find them. If you’re searching for dupes, I’d recommend a couple blog posts from Kim Tripp as well on whether an index is really a duplicate and how to remove duplicate indexes.

Comments closed

Using pdqr for Statistical Uncertainty

Evgeni Chasnovski has a new CRAN package:

I am glad to announce that my latest, long written R package ‘pdqr’ is accepted to CRAN. It provides tools for creating, transforming and summarizing custom random variables with distribution functions (as base R ‘p*()’, ‘d*()’, ‘q*()’, and ‘r*()’ functions). You can read a brief overview in one of my previous posts.

Click through for a description of the package.

Comments closed

Using Azure DevOps for Power BI CI/CD

Marc Lelijveld and Ton Swart look at today’s CI/CD options for Power BI:

As a developer we might be used to working with Git repositories, especially in order to have release management in place. Git is well known as a modern version control system. By using Git, you will have a local copy of the code on your machine as well. Based on these local copies, you can continue developing. After you’re finished with your work, you can easily push your local repository to merge with the online (shared) repository. By doing this, only the changes will be pushed and saved in the online repository. In fact, only for the new code there will be a new version created. 

Versioning of Power BI files is a whole different story. Since pbix files are binary files, there is no way of checking-in only the code changes. The process of pushing changes identifies the pbix file as one object which has a new version.

Read on for the state of the art. To be honest, I don’t like the state of the art that much, but that has nothing to do with Marc and Ton’s great article.

Comments closed