Press "Enter" to skip to content

Day: June 29, 2020

The Basics of Spark Streaming

Muskan Gupta gives us an introduction to Spark Streaming:

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It was added to Apache Spark in 2013. We can get data from many sources such as Kafka, Flume etc. and process it using functions such as map, reduce etc. After processing we can push data to filesystem, databases and even to live dashboards.

In Spark Streaming we work on near real time data. It divides the received input stream into batches. The Spark Engine processes the batches and generate final output in batches.

Read on to understand the key mechanisms behind Spark Streaming.

Comments closed

Tips for Optimizing Columnstore Indexes

Ed Pollack continues a series on columnstore indexes:

This is worth a second mention: Avoid updates at all costs! Columnstore indexes do not treat updates efficiently. Sometimes they will perform well, especially against smaller tables, but against a large columnstore index, updates can be extremely expensive.

If data must be updated, structure it as a single delete operation followed by a single insert operation. This will take far less time to execute, cause less contention, and consume far fewer system resources.

Read on for several more tips along these lines.

Comments closed

Role-Playing Dimensions in Power BI

Martin Schoombee explains the concept of role-playing dimensions and then explains how that works in the Power Bi world:

In technical terms a role-playing dimension is when a dimension table has multiple (foreign key) relationships to the same fact table. In more non-technical terms, it is when you have the same attribute (“Date” for instance) that can relate to the same metrics in different ways.

If you look at the data model below as an example, you can see the Date entity could be used to reference either the Invoice Date or Delivery Date from the Sales entity, changing the perspective of the metrics we’re looking at.

It’s a little surprising to me that there isn’t an easier way to handle this concept. Role-playing dimensions are a core part of the Kimball model, and they’re common enough that you’d expect support to be a bit simpler.

Comments closed

Formatting TimeSpans in Powershell

Jeffrey Hicks shows how we can format TimeStamp objects:

This is pretty straight forward. Subtract the LastBootUpTime property from the current datetime to get a timespan object that shows how long this computer has been up and running. But…I want to get rid of the milliseconds value.  It’s irrelevant as far as I’m concerned and takes up space that I might want to use for another property.  I need to format that  value.

Click through for several methods.

Comments closed

Availability Group Bug when Removing and Adding the Same Database

Josh Darnell takes us through a tricky problem:

I came across a bug in SQL Server 2016 where the Availability Group (AG) health check can get stuck in an infinite loop after removing and re-adding a database from an AG.

Unfortunately, I don’t know exactly what version this bug was introduced. I first noticed the problem on SQL Server 2016 SP2 CU7 GDR (13.0.5366.0). It may have existed before then, but I never encountered it.

Read on for a workaround. And hopefully there will be a proper fix soon. Also, it’d be interesting to see if it can be reproduced in 2017 or 2019.

Comments closed

Delayed Prefetch and Hidden Reads

Hugo Kornelis looks at when worlds collide:

So let’s check. The picture above shows, side by side, the properties of the Index Seek and the Key Lookup operator. They show that the Index Seek did 3 logical reads only, while Key Lookup did 650 logical reads. A clear indication where the majority of the work is done.

But wait. Aren’t we missing something?

The SET STATISTICS IO ON output indicates a total of 722 logical reads. The two screenshots above add up to 653 logical reads. Where are the other 69 logical reads?

Read on for the answer.

Comments closed

Pattern-Matching and Text Extraction in Power Query

Imke Feldmann shows how we can match specific patterns in Power Query, which lacks regular expresssions:

I plan to approach this by

1. stepping through the string and check each character if it is valid.
2. If so, store it
3. and if not, forget it. And in that case, also forget the stored values so far so the collection starts from scratch.
4. Then if a series of matches builds up, I will have to check the count of the stored values to not exceed the length of my target pattern.
5. Once the length is reached, no further checks shall be performed and the found values be returned.

My aim is to find a generic way so that this solution can be adapted to many other similar use cases as well. Therefore the pattern must be described in a scalable way and the identification of the pattern elements should be easy to compute.

It’s good to be able to adapt, improvise, and overcome, though this is exactly what regular expressions are intended to do. It’s odd that there appears to be no built-in capability, where instead you’d have to do something like bring in external languages like JavaScript.

Comments closed