
Author: Kevin Feasel

Right to Be Forgotten in Delta Lake

Milos Colic, et al, tackle a tricky problem:

With Delta, we have one more tool at our disposal to address GDPR compliance and, in particular, “the right to be forgotten” – VACUUM. The VACUUM operation removes the files that are no longer needed and that are older than a predefined retention period. The default retention period is 30 days to align with the GDPR definition of undue delay. Our earlier blog on a similar topic explains in detail how you can find and delete personal information related to a consumer by running two commands:

The part I’m finding tricky here is, how does this handle “time travel” scenarios in which you’re looking at prior iterations of data? I haven’t run through all of the scenarios, so this is just speculation, but it seems that even with all of these changes, you’d still have to worry about historical data containing that sensitive information.
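For reference, the two commands from the linked post presumably look something like the following. This is a minimal sketch assuming a Databricks notebook (where spark is already defined) and a hypothetical customers Delta table; the real table name and predicate will differ.

# Step 1: remove the consumer's rows from the current version of the Delta table.
spark.sql("DELETE FROM customers WHERE customer_id = 42")

# Step 2: physically delete data files that are no longer referenced by the table
# and are older than the retention period. This is also what eventually ages out
# the older versions that time travel reads, once their files fall outside the
# retention window.
spark.sql("VACUUM customers")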


Grouping Sets and Groupings

Kevin Wilkie has fun with grouping sets:

Let’s look at our dbo.Person1 table that we worked with earlier. Today, I want to find a count of all of the persons in each of the following categories: ZipCode, Gender, and Email Domains. And just for fun, let’s add in there where each of those categories cross – for example, Zipcode and Gender, ZipCode and Email Domain, etc…

Most people would think all kinds of awful thoughts at this point about all of the GROUP BY statements you’ll have to write. For anyone wondering – this is one way to do it. Notice all kinds of UNION statements and I’m sure someone is wondering if that’s truly all of the combinations. And we don’t want to go into the maintenance on this if things do happen to change…

And don’t forget about the GROUPING() function:

Let’s say our business partner asks us to determine which fields are aggregated together. Since we only have 2 fields and a grand total of 15 rows, we could determine this by eye. But, like all good developers, we want to do this programmatically.

Here’s where our friend – the GROUPING() function – comes into play.

GROUPING SETS is an extremely useful operator in the ANSI SQL standard. Definitely worth learning how to use.
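Since GROUPING SETS and GROUPING() are part of the ANSI standard, the same pattern works well outside SQL Server too. Here is a small, purely illustrative PySpark (Spark 3.x) sketch with hypothetical stand-ins for the ZipCode, Gender, and Email Domain columns from the post; the T-SQL version reads almost identically.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("grouping-sets-demo").getOrCreate()

# Hypothetical stand-in for the dbo.Person1 table from the post.
people = spark.createDataFrame(
    [("30301", "F", "gmail.com"),
     ("30301", "M", "outlook.com"),
     ("99501", "F", "gmail.com")],
    ["zip_code", "gender", "email_domain"],
)
people.createOrReplaceTempView("person")

# One query instead of a pile of UNIONed GROUP BYs: each requested combination
# is its own grouping set, and GROUPING() flags whether a column was rolled up
# (1) or grouped on (0) in a given output row.
spark.sql("""
    SELECT zip_code,
           gender,
           email_domain,
           GROUPING(zip_code)     AS zip_rolled_up,
           GROUPING(gender)       AS gender_rolled_up,
           GROUPING(email_domain) AS domain_rolled_up,
           COUNT(*)               AS person_count
    FROM person
    GROUP BY GROUPING SETS (
        (zip_code), (gender), (email_domain),
        (zip_code, gender), (zip_code, email_domain), (gender, email_domain)
    )
""").show()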


Making Daily Standups Worthwhile

Amit Nair has some advice:

 A Big No to technical discussions

This is the first reason why daily scrum meetings take more than 15 minutes. We can understand that the Scrum Team needs to discuss a technical issue when they face one during the execution of a task. These technical issues need to be reported to the Scrum Master as an obstacle or impediment, to be resolved later on, not while the meeting is in progress.

This is generally sound advice, especially because the idea of a daily standup is to get together, quickly discuss plans and activities relating to the sprint, and make the team aware of blockers. Going beyond that is typically unnecessary. As teams get smaller, you can be a bit more lax with the rules; as you get closer to that 8-10 person team, you have to be pretty ruthless about keeping on time and on topic. Save the more detailed discussions for relevant meetings afterward.


Filling Values to the Right (or Left) in Power Query

Imke Feldmann has a new function:

The first function argument takes the table you want to apply the function on. The second argument is the list of column names that shall be filled up into empty values to the right. In the example in the function documentation, this is:  {“H1”, “H2”, “H3”}. The curly brackets define a list object in Power Query and its list elements must be put in quotes if they shall represent strings. So here the columns H1, H2 and H3 are included.
An optional 3rd argument can be used to fill to the left instead. You can fill in any value there, so once it is used, the fill will work to the left instead.

This might not be something you use on a daily basis, but I will say I’ve run into situations in which having a function like this at hand would have been quite valuable.
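Imke's function is written in M, but purely as a conceptual analogue (not her implementation), the same fill-to-the-right behavior on a pandas DataFrame is a forward fill along the row axis, shown here with the H1/H2/H3 column names from her example.

import pandas as pd

# Within each row, propagate the last non-empty value across the listed columns.
df = pd.DataFrame(
    {
        "H1": ["a", "b", None],
        "H2": [None, "c", None],
        "H3": [None, None, "d"],
    }
)

fill_columns = ["H1", "H2", "H3"]

# Fill to the right (forward along each row); use bfill(axis=1) for the
# fill-to-the-left behavior of the optional third argument.
df[fill_columns] = df[fill_columns].ffill(axis=1)

print(df)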


Migrating SSIS On-Prem Workloads into Azure

Jitendra Yadeo has put together a how-to guide:

– There can be a scenario where an organization wants to migrate its existing SSIS ETL process to the cloud, so instead of rewriting SSIS packages using cloud-specific ETL tools like Azure Data Factory, we can directly migrate the SSIS packages and call them through Azure Data Factory.

– The goal of this blog is to show how SSIS packages hosted on-premises can be migrated to Azure Data Factory (ADF) using the Azure-SSIS Integration Runtime (IR).

Read on for a step-by-step guide.


Simple Parameterization and Data Types

Paul White continues a series on simple parameterization:

The details of how each data type is guessed are complex and incompletely documented. As a starting point, SQL Server infers a basic type from the textual representation of the value, then uses the smallest compatible subtype.

For a string of numbers without quotation marks or a decimal point, SQL Server chooses from tinyint, smallint, and integer. For such numbers beyond the range of an integer, SQL Server uses numeric with the smallest possible precision. For example, the number 2,147,483,648 is typed as numeric(10,0). The bigint type isn’t used for server-side parameterization. This paragraph explains the data types selected in the prior examples.

Read the whole thing to see what’s in and what’s out, as well as what this all means.
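To make those guessing rules concrete, here is a rough Python paraphrase of the logic as Paul describes it. This is purely illustrative and is not SQL Server's actual implementation.

def guess_literal_type(literal: str) -> str:
    """Smallest compatible subtype for a whole-number literal, per the rules
    described in the excerpt: tinyint, smallint, or integer, then numeric with
    the smallest possible precision; bigint is never chosen."""
    value = int(literal)
    if 0 <= value <= 255:
        return "tinyint"
    if -32768 <= value <= 32767:
        return "smallint"
    if -2147483648 <= value <= 2147483647:
        return "int"
    return f"numeric({len(literal.lstrip('-'))},0)"

print(guess_literal_type("100"))         # tinyint
print(guess_literal_type("40000"))       # int
print(guess_literal_type("2147483648"))  # numeric(10,0), matching the example above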


Exponential Smoothing in Excel

Chris Webb starts an interesting series:

[This function] calculates or predicts a future value based on existing (historical) values by using the AAA version of the Exponential Smoothing (ETS) algorithm. The predicted value is a continuation of the historical values in the specified target date, which should be a continuation of the timeline. You can use this function to predict future sales, inventory requirements, or consumer trends.

Recently I started playing around with this function to see how it could be used with cube functions and since I learned so many interesting things I thought it would make a good series of blog posts.

Read on for an example of the normal way to use this function.
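For a sense of what is happening under the hood: FORECAST.ETS implements the full AAA (additive error, trend, and seasonality) flavor of exponential smoothing, but the core recurrence is easiest to see in its simplest form. A purely illustrative Python sketch with made-up numbers:

def simple_exponential_smoothing(values, alpha=0.3):
    # Each smoothed point blends the newest observation with the previous
    # smoothed value; the AAA model layers additive trend and seasonality
    # components on top of this same idea.
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

history = [112, 118, 132, 129, 121, 135, 148, 148]  # made-up historical values
smoothed = simple_exponential_smoothing(history)

# With no trend or seasonality component, the one-step-ahead forecast is simply
# the last smoothed value.
print(smoothed[-1])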


Writing a Single JSON File in Databricks

Falek Miah performs a surprisingly difficult task:

When writing to a JSON destination using the DataFrameWriter the dataset is split into multiple files to reflect the number of RDD partitions in the dataframe when in memory – this is the most efficient way for Spark to write data out.

However, this creates a directory containing the data files, as well as Spark metadata files…but what if you just wanted a single JSON file? It’s a scenario that comes up a lot with our clients and, despite it not being the most efficient way to use Spark, we need to implement it all the same.

Click through to see how to do this, including the removal of all metadata files (committed, started, and success files).
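The general shape of the trick, sketched here in PySpark for a Databricks notebook (where spark and dbutils are already available) with hypothetical paths, is to force a single partition, write to a staging folder, and then copy out the lone part file; see Falek's post for his full implementation.

# Example data; in practice this is whatever DataFrame you want as a single JSON file.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

staging_dir = "/tmp/single_json_staging"
target_path = "/output/result.json"

# Force one partition so Spark emits exactly one part-*.json file
# (fine for small outputs, but it funnels everything through a single task).
df.coalesce(1).write.mode("overwrite").json(staging_dir)

# Find the single data file among the _SUCCESS/_committed/_started metadata files.
part_file = [f.path for f in dbutils.fs.ls(staging_dir) if f.name.endswith(".json")][0]

# Copy it to its final name and clean up the staging directory.
dbutils.fs.cp(part_file, target_path)
dbutils.fs.rm(staging_dir, recurse=True)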


Visualizing Air Pressure Spikes from the Hunga Tonga Eruption in R

Neil Saunders reviews some personal weather station data:

Wow. Now, pause for a moment and try to recall the last time you read any news about Tonga since the event.
The eruption sent an atmospheric pressure wave, clearly visible in this imagery, around the world. Friends online reported that this was detected by their personal weather stations (PWS) which made me wonder: was the wave apparent in online weather station data and can it be visualized using R?

The answers are yes and yes again.

Read on to see how.


Intelligent Cache for Spark in Synapse

Avinanda Chattapadday makes an announcement:

Traditionally, when querying a file or table from your data lake, the Apache Spark engine in Synapse makes a call to your remote ADLS Gen2 storage for each read of the data. For workloads with frequent repeat queries, this process can be redundant and add latency to the overall processing time. Although Apache Spark provides a great caching feature, it must be manually set and released to minimize the latency and improve overall performance. It can also result in queries of stale data if the underlying data changes. This is where the intelligent cache in Azure Synapse can simplify the process; by automatically detecting changes to the underlying files and automatically refreshing them in the cache, you ensure you have access to the most recent data. When the cache reaches its size limit, it will automatically release the least-read data to make space for more recent data.

Click through to see how you can enable this, as well as a few more details on the process.
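For contrast, the manual caching the excerpt mentions is the standard Spark DataFrame cache, which you set and release yourself and which is not refreshed when the underlying files change. A generic PySpark sketch with a hypothetical path (see the post for enabling the intelligent cache itself):

# Hypothetical ADLS Gen2 path.
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/some/table")

df.cache()    # mark the DataFrame for caching; materialized on the next action
df.count()    # an action that populates the cache

# ... repeated queries against df now hit the in-memory copy ...

df.unpersist()    # release the cache manually when finished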
