Press "Enter" to skip to content

Day: March 30, 2022

Right to Be Forgotten in Delta Lake

Milos Colic, et al., tackle a tricky problem:

With Delta, we have one more tool at our disposal to address GDPR compliance and, in particular, “the right to be forgotten” – VACUUM. Vacuum operation removes the files that are no longer needed and that are older than a predefined retention period. The default retention period is 30 days to align with GDPR definition of undue delay. Our earlier blog on a similar topic explains in detail how you can find and delete personal information related to a consumer by running two commands:
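
The two commands themselves aren't reproduced in the quote, but the general pattern looks something like this (a sketch in Delta Lake SQL, with a hypothetical table named customers keyed on customer_id):

-- Remove the consumer's rows from the current version of the table.
DELETE FROM customers
WHERE customer_id = 'C-12345';

-- Physically remove data files that are no longer referenced by the current
-- table version and are older than the retention threshold.
VACUUM customers RETAIN 720 HOURS;  -- 720 hours = the 30-day period mentioned above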

The part I find tricky here is how this handles “time travel” scenarios, in which you’re looking at prior iterations of the data. I haven’t run through all of the scenarios, so this is just speculation, but it seems that even with all of these changes, you’d still have to worry about historical data containing that sensitive information.


Zero-Rename Writes in ElasticMapReduce Hive

Suthan Phillips, et al., show off some updates to the way Hive commits writes in AWS’s ElasticMapReduce:

Our customers use Apache Hive on Amazon EMR for large-scale data analytics and extract, transform, and load (ETL) jobs. Amazon EMR Hive uses Apache Tez as the default job execution engine, which creates Directed Acyclic Graphs (DAGs) to process data. Each DAG can contain multiple vertices from which tasks are created to run the application in parallel. Their final output is written to Amazon Simple Storage Service (Amazon S3).

Hive initially writes data to staging directories and then moves it to the final location after a series of rename operations. This design of Hive renames supports task failure recovery, such as rescheduling the failed task with another attempt, running speculative execution, and recovering from a failed job attempt. These move and rename operations don’t have a significant performance impact in HDFS, where they are only metadata operations, but on Amazon S3 performance can degrade significantly based on the number of files written.

This post discusses the new optimized committer for Hive in Amazon EMR and also highlights its impressive performance by running a TPCx-BB performance benchmark and comparing it with the Hive default commit logic.

Read on for a description of how commit operations work in general and how the updated Hive committer can help with certain types of queries.
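
To make the rename problem concrete, here is a minimal HiveQL sketch of the kind of S3-backed write the committer change targets (the table and bucket names are invented for illustration):

-- An external table whose data lives in S3; by default, writes here land in a
-- staging directory first and are then renamed into their final location.
CREATE EXTERNAL TABLE sales_summary (
  region       STRING,
  total_sales  DOUBLE
)
STORED AS ORC
LOCATION 's3://example-bucket/warehouse/sales_summary/';

-- The final output of an INSERT OVERWRITE like this is what the commit logic
-- (staging-and-rename versus a direct write) has to get right.
INSERT OVERWRITE TABLE sales_summary
SELECT region, SUM(amount)
FROM sales
GROUP BY region;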


Making Daily Standups Worthwhile

Amit Nair has some advice:

 A Big No to technical discussions

This is the first reason why daily scrum meetings take more than 15 minutes. We can understand that the Scrum Team needs to discuss a technical issue when they face one during the execution of a task. These technical issues need to be reported to the Scrum Master as obstacles or impediments, to be resolved later on, not while the meeting is in progress.

This is generally sound advice, especially because the idea of a daily standup is to get together, quickly discuss plans and activities relating to the sprint, and make the team aware of blockers. Going beyond that is typically unnecessary. As teams get smaller, you can be a bit more lax with the rules; as you get closer to that 8-10 person team, you have to be pretty ruthless about keeping on time and on topic. Save the more detailed discussions for relevant meetings afterward.


Grouping Sets and Groupings

Kevin Wilkie has fun with grouping sets:

Let’s look at our dbo.Person1 table that we worked with earlier. Today, I want to find a count of all of the persons in each of the following categories: ZipCode, Gender, and Email Domains. And just for fun, let’s add in there where each of those categories cross – for example, Zipcode and Gender, ZipCode and Email Domain, etc…

Most people would think all kinds of awful thoughts at this point about all of the GROUP BY statements you’ll have to write. For anyone wondering – this is one way to do it. Notice all kinds of UNION statements and I’m sure someone is wondering if that’s truly all of the combinations. And we don’t want to go into the maintenance on this if things do happen to change…
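
For contrast, GROUPING SETS collapses all of those UNIONed GROUP BY queries into a single statement. A rough sketch against the dbo.Person1 table from the post (the column names are my guesses):

SELECT ZipCode, Gender, EmailDomain, COUNT(*) AS PersonCount
FROM dbo.Person1
GROUP BY GROUPING SETS
(
    (ZipCode), (Gender), (EmailDomain),           -- each category on its own
    (ZipCode, Gender), (ZipCode, EmailDomain),    -- the pairwise combinations
    (Gender, EmailDomain)
);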

And don’t forget about the GROUPING() function:

Let’s say our business partner asks us to determine which fields are aggregated together. Since we only have 2 fields and a grand total of 15 rows, we could determine this by eye. But, like all good developers, we want to do this programmatically.

Here’s where our friend – the GROUPING() function – comes into play.
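
As a quick illustration of the idea: GROUPING() returns 1 when a column has been rolled up (aggregated away) in a given result row and 0 when it is part of that row’s grouping, so you can tag each row programmatically. Something along these lines, again with assumed column names:

SELECT
    ZipCode,
    Gender,
    GROUPING(ZipCode) AS ZipCodeRolledUp,   -- 1 = ZipCode was aggregated away in this row
    GROUPING(Gender)  AS GenderRolledUp,    -- 1 = Gender was aggregated away in this row
    COUNT(*)          AS PersonCount
FROM dbo.Person1
GROUP BY GROUPING SETS ((ZipCode), (Gender), (ZipCode, Gender), ());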

GROUPING SETS is an extremely useful operator in the ANSI SQL standard. Definitely worth learning how to use.


Filling Values to the Right (or Left) in Power Query

Imke Feldmann has a new function:

The first function argument takes the table you want to apply the function on. The second argument is the list of column names that shall be filled up into empty values to the right. In the example in the function documentation, this is:  {“H1”, “H2”, “H3”}. The curly brackets define a list object in Power Query and its list elements must be put in quotes if they shall represent strings. So here the columns H1, H2 and H3 are included.
An optional 3rd argument can be used to fill to the left instead. You can fill in any value there, so once it is used, the fill will work to the left instead.

This might not be something you use on a daily basis, but I will say I’ve run into situations in which having a function like this at hand would have been quite valuable.


Migrating SSIS On-Prem Workloads into Azure

Jitendra Yadeo has put together a how-to guide:

– There can be scenarios where an organization wants to migrate its existing SSIS ETL processes to the cloud, so instead of rewriting SSIS packages using cloud-specific ETL tools like Azure Data Factory, we can directly migrate the SSIS packages and call them through Azure Data Factory.

– The goal of this blog is to show how SSIS packages hosted on-premises can be migrated to Azure Data Factory (ADF) using the Azure-SSIS Integration Runtime (IR).

Read on for a step-by-step guide.


Simple Parameterization and Data Types

Paul White continues a series on simple parameterization:

The details of how each data type is guessed are complex and incompletely documented. As a starting point, SQL Server infers a basic type from the textual representation of the value, then uses the smallest compatible subtype.

For a string of numbers without quotation marks or a decimal point, SQL Server chooses from tinyint, smallint, and integer. For such numbers beyond the range of an integer, SQL Server uses numeric with the smallest possible precision. For example, the number 2,147,483,648 is typed as numeric(10,0). The bigint type isn’t used for server-side parameterization. This paragraph explains the data types selected in the prior examples.
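
As a rough illustration of the pattern described above (table and column names are placeholders), the guessed parameter type tracks the size of the literal:

-- Assuming each statement qualifies for simple parameterization:
SELECT * FROM dbo.Example WHERE SomeColumn = 200;           -- fits 0-255         -> tinyint
SELECT * FROM dbo.Example WHERE SomeColumn = 300;           -- exceeds tinyint    -> smallint
SELECT * FROM dbo.Example WHERE SomeColumn = 50000;         -- exceeds smallint   -> integer
SELECT * FROM dbo.Example WHERE SomeColumn = 2147483648;    -- exceeds integer    -> numeric(10,0), not bigint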

Read the whole thing to see what’s in and what’s out, as well as what this all means.


Exponential Smoothing in Excel

Chris Webb starts an interesting series:

[This function] calculates or predicts a future value based on existing (historical) values by using the AAA version of the Exponential Smoothing (ETS) algorithm. The predicted value is a continuation of the historical values in the specified target date, which should be a continuation of the timeline. You can use this function to predict future sales, inventory requirements, or consumer trends.

Recently I started playing around with this function to see how it could be used with cube functions and since I learned so many interesting things I thought it would make a good series of blog posts.
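
For intuition, the heart of exponential smoothing (in its simplest, non-seasonal form; the AAA/ETS variant Excel uses layers additive trend and seasonality on top of this) is a recurrence in which each smoothed value is a weighted blend of the newest observation and the previous smoothed value:

s_t = α · x_t + (1 − α) · s_{t−1},  where 0 < α < 1

Larger values of α weight recent observations more heavily; smaller values smooth more aggressively.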

Read on for an example of the normal way to use this function.
