Press "Enter" to skip to content

August 8, 2018

Databricks Delta: Data Skipping And ZORDER Clustering

Adrian Ionescu explains a couple of concepts which can help make selective queries with Databricks much faster:

The general use-case for these features is to improve the performance of needle-in-the-haystack kind of queries against huge data sets. The typical RDBMS solution, namely secondary indexes, is not practical in a big data context for scalability reasons.

If you’re familiar with big data systems (be it Apache Spark, Hive, Impala, Vertica, etc.), you might already be thinking: (horizontal) partitioning.

Quick reminder: In Spark, just like Hive, partitioning works by having one subdirectory for every distinct value of the partition column(s). Queries with filters on the partition column(s) can then benefit from partition pruning, i.e., avoid scanning any partition that doesn’t satisfy those filters.

The main question is: What columns do you partition by?
And the typical answer is: The ones you’re most likely to filter by in time-sensitive queries.
But… What if there are multiple (say 4+), equally relevant columns?
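To make that concrete, here is a rough sketch of the Delta syntax involved (the table and column names are invented for illustration): partition on a single low-cardinality column, then Z-Order the other frequently filtered columns so data skipping can prune files on any of them.

    -- Hypothetical events table: partition on one low-cardinality column.
    CREATE TABLE events (
      eventId   BIGINT,
      eventType STRING,
      city      STRING,
      eventDate DATE
    )
    USING DELTA
    PARTITIONED BY (eventDate);

    -- Z-Order by the remaining filter columns; file-level statistics can
    -- then skip files for predicates on eventType and/or city as well.
    OPTIMIZE events
    ZORDER BY (eventType, city);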

Read the whole thing.

Bayesian Approaches To The Cold Start Problem

John Cook explains what you can do with data-driven applications when you don’t yet have the data:

How do you operate a data-driven application before you have any data? This is known as the cold start problem.

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to design adaptive clinical trials, such as trials for determining chemotherapy dose levels. Each patient’s treatment assignment would be informed by data from all patients treated previously.

But what about the first patient in a trial? You’ve got to treat a first patient, and treat them as well as you know how. They’re struggling with cancer, so it matters a great deal what treatment they are assigned. So you treat them according to expert opinion. What else could you do?
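The machinery is ordinary Bayesian updating; as a toy illustration (the numbers here are invented), expert opinion becomes an informative prior on the response rate $\theta$, and each observed outcome sharpens it:

$$\theta \sim \mathrm{Beta}(a,\, b) \quad\Longrightarrow\quad \theta \mid s \text{ successes},\, f \text{ failures} \sim \mathrm{Beta}(a + s,\; b + f)$$

So if the experts believe in roughly a 30% response rate with the weight of about ten patients' worth of evidence, you start from $\mathrm{Beta}(3, 7)$; after two responses in five patients, the posterior is $\mathrm{Beta}(5, 10)$. The first patient is treated on the prior alone, which is exactly the expert opinion John describes.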

Read on for John’s solution.

Propagating The Last Non-Null Value With T-SQL

Tomaz Kastrun shows a great use of window functions in T-SQL:

So you have NULL values in your SQL Server table and you want to populate those NULL values with the last non-NULL value, based on a particular order. Once you have only one NULL value encapsulated between two populated values, there are quick and easy solutions. But what if you find a larger gap of NULL values and you want to populate these values as well?

Click through for a partial solution, followed by the real solution.
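For reference, a common window-function approach to the general problem looks like the following (the table and data are invented, and this is not necessarily the solution Tomaz lands on). A running COUNT(val) ignores NULLs, so it only ticks up on populated rows; every row in a gap therefore shares a group number with the preceding non-NULL row, and MAX over that group carries the value forward.

    -- Sample data: NULL gaps of arbitrary length between populated values.
    CREATE TABLE #demo (id INT PRIMARY KEY, val INT NULL);
    INSERT INTO #demo (id, val)
    VALUES (1, 10), (2, NULL), (3, NULL), (4, 40), (5, NULL), (6, 60);

    -- Requires SQL Server 2012+ for the ordered window frame.
    SELECT id,
           val,
           MAX(val) OVER (PARTITION BY grp) AS val_filled
    FROM (
        SELECT id,
               val,
               COUNT(val) OVER (ORDER BY id
                                ROWS UNBOUNDED PRECEDING) AS grp
        FROM #demo
    ) AS s
    ORDER BY id;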

Building Resilient Microservices

Samir Behara has some tips for designing resilient microservices:

While architecting distributed cloud applications, you should assume that failures will happen and design your applications for resiliency. A microservice ecosystem is going to fail at some point or another, so you need to learn to embrace failure. Don’t design systems with the assumption that it’s going to be sunny throughout the year. Be realistic and account for the chances of rain, snow, thunderstorms, and other adverse conditions. In short, design your microservices with failure in mind. Things don’t always go according to plan, and you need to be prepared for the worst-case scenario.

If Service A calls Service B which in turn calls Service C, what happens when Service B is down? What is your fallback plan in such a scenario?

  • Can you return a pre-decided error message to the user?

  • Can you call another service to fetch the information?

  • Can you return values from cache instead?

  • Can you return a default value?

Microservices have their drawbacks, but one big advantage is that they tend to be concise enough that you can reason about them more clearly than kitchen-sink applications. Planning ahead for potential failure modes differentiates flaky services from robust ones.

Joins When No Join Types Are Valid

Hugo Kornelis has a brain-teaser for us:

The query below can be executed in any version of the AdventureWorks sample database. Don’t bother understanding the logic; there is none. It is merely constructed to show how SQL Server handles what appears to be an impossible situation.

If you look at the descriptions of the various join operators in the Execution Plan Reference, you will see that this query poses what appears to be an unsolvable problem for the optimizer: none of the join operators can be used for this query!

But it’s possible, and Hugo explains exactly what happens, as well as places where the optimizer could be better at solving the impossible (or at least marginally difficult).
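For a taste of the kind of query that creates the bind (this is not Hugo’s actual example), consider a full outer join whose only predicate is an inequality: hash match and merge join each require at least one equality predicate, while nested loops does not support full outer join.

    -- A minimal illustration of the same impossible-looking situation:
    -- no equijoin predicate (ruling out hash and merge joins), yet a
    -- FULL OUTER JOIN (ruling out nested loops).
    SELECT p.ProductID,
           th.TransactionID
    FROM Production.Product AS p
    FULL OUTER JOIN Production.TransactionHistory AS th
        ON p.ProductID < th.ProductID;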

Writing Long Strings To Output In SSMS

Bert Wagner shows us a few techniques for printing long strings in SSMS:

Erik Darling posts one solution to this problem in his T-SQL Tuesday #104 entry (as well as some other problems/solutions for lengthy SQL variables). Specifically he links to a SQL string printing script that will loop through the lengthy variable and print everything while maintaining formatting:

And while I like using that stored procedure on my primary server, I’m too lazy to install it everywhere I need it.

Instead, I have a couple of go-to solutions that work on all SQL Server instances 2008 forward.

The approach Bert outlines isn’t perfect, but it is definitely interesting and easier to write than the ones which work a bit better.
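As a sketch of the simplest portable trick (not Bert’s exact script): PRINT truncates an NVARCHAR at 4,000 characters, so you can loop over the variable in 4,000-character slices. The catch, and the reason the fancier scripts split on line breaks instead, is that a slice boundary can land in the middle of a line:

    DECLARE @s NVARCHAR(MAX) =
        REPLICATE(CAST(N'All work and no play makes a dull DBA. ' AS NVARCHAR(MAX)), 1000);

    DECLARE @pos INT = 1,
            @len INT = LEN(@s);

    -- PRINT truncates nvarchar past 4,000 characters, so emit the string
    -- in 4,000-character slices.  A slice can split a line mid-way.
    WHILE @pos <= @len
    BEGIN
        PRINT SUBSTRING(@s, @pos, 4000);
        SET @pos += 4000;
    END;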

So You Locked Out Your Sysadmin User…What Next?

Sreekanth Bandarla shows how you can recover from having your sysadmin user account locked out or removed:

In this blog post, let’s see how to regain admin access on a SQL Server instance in case you lose it by mistake or for whatever reason. It’s not a very common scenario, but hey, you never know. I ran into this some time last week (fortunately, it was in our POC environment). Okay, here’s the deal: we have a POC SQL instance which was installed by an individual who is no longer working with us, and apparently he forgot to make our DBA group sysadmins. Basically, we didn’t have admin rights to our own SQL instance, and the SA account was disabled (and no one has a clue what that password was to begin with). So, how did we recover from this disastrous event?

This is the “fake rock with a key in it” workaround.  Also, a good reason why there should be as few local administrators on your Windows machines as you can get away with.
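For the curious, the core of the trick goes roughly like this (the group name is hypothetical, and details vary by instance name and version): restart the instance in single-user mode, in which members of the machine’s local Administrators group can connect with sysadmin rights, then re-grant access and restart normally.

    -- From an elevated command prompt:
    --   net stop MSSQLSERVER
    --   net start MSSQLSERVER /m"SQLCMD"
    -- Then connect with sqlcmd and run:
    CREATE LOGIN [CONTOSO\DBAGroup] FROM WINDOWS;   -- hypothetical group
    ALTER SERVER ROLE sysadmin ADD MEMBER [CONTOSO\DBAGroup];
    -- (On SQL Server 2008/2008 R2, use sp_addsrvrolemember instead.)
    -- Finally, restart the service normally.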

Dataflows In Power BI

James Serra gives us a preview of Power BI Dataflows:

In short, Dataflows integrates data lake and ETL technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of the Power BI service, not just Power BI Desktop, and is called Power Query Online) can create, customize, and manage data within their Power BI experience (think of it as self-service data prep).  Dataflows include a standard schema, called the Common Data Model (CDM), that contains the most common business entities across major functions such as marketing, sales, service, and finance, along with connectors that ingest data from the most common sources into these schemas.  This greatly simplifies modeling and integration challenges (it prevents multiple metadata definitions for the same data).  You can also extend the CDM by creating custom entities.  Lastly, Microsoft and their partners will be shipping out-of-the-box applications that run on Power BI that populate data in the Common Data Model and deliver insights through Power BI.

A dataflow is not just the data itself, but also logic on how the data is manipulated.  Dataflows belong to the data warehouse/mart/lake family.  Their main job is to aggregate, cleanse, transform, integrate, and harmonize data from a large and growing set of supported on-premises and cloud-based data sources, including Dynamics 365, Salesforce, Azure SQL Database, Excel, and SharePoint.  Dataflows hold a collection of data-lake-stored entities (i.e., tables) which are stored in internal, Common Data Model-compliant Power BI folders in Azure Data Lake Storage Gen2.

Also check out the comments for some clarification on why you’d want to use Dataflows rather than doing the work directly in the data lake.

SSIS TargetServerVersion Versus ProjectVersion

Andy Leonard explains the differences between the TargetServerVersion and ProjectVersion properties in SQL Server Integration Services:

Since I am still adjusting to the 9.5 hour time zone shift, I decided to poke around social media. My friend and brother from another mother, TJay Belt, had tagged me in a tweet. TJay was looking for answers about updating an SSIS 2014 project in SQL Server Data Tools (SSDT) for Visual Studio 2017.

TJay and I started this conversation a couple days ago when he mentioned a team member installed Visual Studio 2017 and experienced difficulties getting SSIS 2012 packages to execute in the debugger.

I don’t think I blogged about it, but I had some interesting experiences upgrading to Visual Studio 2017 and SQL Server Data Tools for VS 2017. The sum of my experience was: uninstall everything, then install Visual Studio, then install SSDT.

Read on for the answer.

Power BI Administration With PowerShell

Adam Saxton shares a video covering what is available in the MicrosoftPowerBIMgmt PowerShell module:

The Power BI PowerShell module and the admin API allow you to quickly inventory your organization and manage workspaces and access. Forget the hassle of setting up the app registration within Azure Active Directory for API access; the Power BI PowerShell cmdlets take care of it for you. Just install and start using today!

Click through for that video.
