L-Diversity versus K-Anonymity

Duncan Greaves explains the concepts behind l-diversity:

There are problems with K-anonymous datasets, namely the homogeneous pattern attack and the background knowledge attack, details of which are in my original post. A slightly different approach to anonymising public datasets comes in the form of ℓ-diversity, a way of introducing further entropy/diversity into a dataset.

A sensitive data record is made up of the following microdata types: the ID, any key attributes, and the confidential outcome attribute(s). ℓ-diversity seeks to extend the equivalence classes that we created using K-anonymity by generalisation and masking of the quasi-identifiers (the QI groups) to the confidential attributes in the record as well. The ℓ-diversity principle demands that, in each QI-group, at most 1/ℓ of its tuples can have an identical sensitive attribute value.

L-diversity is not perfect either, but Duncan gives a good explanation of the topic.
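
To make the 1/ℓ rule concrete, here is a minimal T-SQL sketch for spotting violations, assuming a hypothetical dbo.Patients table with quasi-identifier columns ZipCode and AgeBand and a sensitive Diagnosis column:

DECLARE @l int = 3;

-- Count each sensitive value within its QI group and compare it
-- to the size of the whole group.
WITH SensitiveCounts AS (
    SELECT ZipCode, AgeBand, Diagnosis,
           COUNT(*) AS SensitiveCount,
           SUM(COUNT(*)) OVER (PARTITION BY ZipCode, AgeBand) AS GroupSize
    FROM dbo.Patients
    GROUP BY ZipCode, AgeBand, Diagnosis
)
SELECT ZipCode, AgeBand, Diagnosis, SensitiveCount, GroupSize
FROM SensitiveCounts
WHERE SensitiveCount * @l > GroupSize;  -- more than 1/l of the group shares one value

Any row returned marks a QI group that fails ℓ-diversity for ℓ = 3 and would need further generalisation or suppression.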

Fixing High VLF Counts

Ajay Dwivedi shares a technique for optimizing VLF counts on log files:

DBAs! I guess everyone knows that a huge number of Virtual Log Files (VLFs) in SQL Server can slow down the backup/restore and database recovery processes. In rare cases, it can even introduce slowness at the transaction level.
https://sqlperformance.com/2013/02/system-configuration/transaction-log-configuration

Even though we are all aware of it, it has still been a challenge to remove high VLF counts from SQL Server log files, since doing so involves a lot of manual effort in shrinking and re-growing the log files.

This is where my ‘Space-Capacity-Automation’ open source project comes to your rescue. It has a parameter option @optimizeLogFiles that can help you optimize your log files by performing the tasks below:

Click through to see Ajay’s technique.
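
To gauge how bad things are before fixing them, a quick way to count VLFs per database is sys.dm_db_log_info (SQL Server 2016 SP2 and later; on older versions, DBCC LOGINFO serves the same purpose):

-- sys.dm_db_log_info returns one row per VLF; aggregate per database.
SELECT d.name AS DatabaseName,
       COUNT(*) AS VLFCount
FROM sys.databases AS d
CROSS APPLY sys.dm_db_log_info(d.database_id) AS li
WHERE d.state_desc = 'ONLINE'  -- the DMF needs the database to be accessible
GROUP BY d.name
ORDER BY VLFCount DESC;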

Bugs and Pitfalls with Non-Determinism

Itzik Ben-Gan has started a new series on T-SQL bugs, pitfalls, and best practices. The first post in this series is on non-deterministic behavior:

A nondeterministic function is a function that, given the same inputs, can return different results in different executions of the function. Classic examples are SYSDATETIME, NEWID, and RAND (when invoked without an input seed). The behavior of nondeterministic functions in T-SQL can be surprising to some, and could result in bugs and pitfalls in some cases.

Many people assume that when you invoke a nondeterministic function as part of a query, the function gets evaluated separately per row. In practice, most nondeterministic functions get evaluated once per reference in the query. 

This is the start to a great series.
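
To see the once-per-reference behavior in action, here is a quick demo query: RAND() without a seed returns one value for the whole query, while NEWID() is an exception that runs once per row.

-- RAND() is evaluated once per reference: every row shows the same value.
-- NEWID() is evaluated per row: every row gets a distinct GUID.
SELECT TOP (5)
       RAND()  AS SameForAllRows,
       NEWID() AS DifferentPerRow
FROM sys.objects;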

Diving Into SQL Notebooks

Rob Sewell tries out Azure Data Studio’s SQL notebooks, currently in preview:

OK, so now that we have the dependencies installed, we can create a notebook. I decided to use the ValidationResults database that I use for my dbachecks demos and describe here. I need to restore it from my local folder that I have mapped as a volume to my container. Of course, I use dbatools for this.

Click through to see how to install and use SQL notebooks.

Building Temporal Tables From Existing Tables

Mala Mahadevan shows how you can convert an existing SQL Server table into a history table using temporal tables in SQL Server 2016:

SQL Server has had many different methods to track changes to data. There has been old-fashioned trigger-based logging, Change Data Capture, and Change Tracking. All of these features allow you to see how data has changed with varying degrees of how far you can go back historically to see the changes. However, there has never been the ability to see how the entire table looked at any given point in time. That is what Temporal Tables do – they log every change that happens to the table. When you want to query it for any specific period of time, SQL Server will do the hard work of getting the snapshot of how the data in the entire table looked at that time.

A great introduction to setting up temporal tables in different ways, along with their various limitations, can be found here. In this article, you will learn how to set up versioning when creating a new table and how to convert a table with an existing history to a system-versioned temporal table.

Read on to learn how to use temporal tables.
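
As a taste of what the article covers, converting an existing table is mostly two ALTER TABLE statements. A minimal sketch (SQL Server 2016 and later), where dbo.Customers and all column and constraint names are hypothetical:

-- Add the period columns with defaults so existing rows get values,
-- then declare the SYSTEM_TIME period over them.
ALTER TABLE dbo.Customers ADD
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START HIDDEN NOT NULL
        CONSTRAINT DF_Customers_ValidFrom DEFAULT SYSUTCDATETIME(),
    ValidTo datetime2 GENERATED ALWAYS AS ROW END HIDDEN NOT NULL
        CONSTRAINT DF_Customers_ValidTo
        DEFAULT CONVERT(datetime2, '9999-12-31 23:59:59.9999999'),
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo);

-- Turn on versioning; SQL Server creates and maintains the history table.
ALTER TABLE dbo.Customers
    SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomersHistory));

-- Point-in-time queries then come for free:
SELECT *
FROM dbo.Customers
FOR SYSTEM_TIME AS OF '2019-03-01T00:00:00';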

Batch Consumption from Kafka with Spark

Swapnil Chougule shares a few tips on performing batch processing of a Kafka topic using Apache Spark:

Spark as a compute engine is very widely accepted by most industries. Most of the old data platforms based on MapReduce jobs have been migrated to Spark-based jobs, and some are in the process of migrating. In short, batch computation is being done using Spark. As a result, organizations’ infrastructure and expertise have been developed around Spark.

So, now the question is: can Spark solve the problem of batch consumption of data inherited from Kafka? The answer is yes.

The advantages of doing this are a unified batch computation platform and the reuse of existing infrastructure, expertise, monitoring, and alerting.

Click through to get to the starting point on this as well as a few tips to avoid stumbling blocks.

Building a VPC with AWS

Priyaj Kumar takes us through the process of building a Virtual Private Cloud in AWS:

AWS provides a lot of services, and these services are sufficient to run your architecture. The backbone for the security of this architecture is the VPC (Virtual Private Cloud). A VPC is basically a private cloud in the AWS environment that helps you to use all of the services AWS offers within your own defined private space. You have control over the virtual network, and you can also restrict incoming traffic using security groups.

Overall, a VPC helps you to secure your environment and gives you complete authority over incoming traffic. There are two types of VPCs: the default VPC, which Amazon creates for you, and non-default VPCs, which you create to satisfy your security needs.

Now that you have an idea of how VPC works, I will take you through the different services offered by Amazon VPC.

Read on to see how to set one up.

Counting Working Days with DAX

Alberto Ferrari shows how we can ignore weekends in date calculations with DAX:

How is it possible to compute the difference between the two dates, only computing working days and skipping weekends and holidays? Simple math is no longer useful here, and DAX does not offer a predefined function.

A solution to this scenario requires a date table – more details here – with a specific column, IsWorkingDay, which indicates whether that particular day is a working day or not.

Another good use of date tables (AKA calendar tables), which are also quite useful in T-SQL queries.
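
For instance, given a hypothetical dbo.DateDim calendar table with one row per date (DateKey) and an IsWorkingDay flag, the T-SQL version of the calculation is a simple filtered count:

DECLARE @Start date = '2019-03-01',
        @End   date = '2019-03-31';

-- Count only the dates flagged as working days in the calendar table.
SELECT COUNT(*) AS WorkingDays
FROM dbo.DateDim
WHERE DateKey >= @Start
  AND DateKey <= @End
  AND IsWorkingDay = 1;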

PolyBase and Hive Shim Errors

I ran into a problem with Hive 3 and PolyBase:

My initial plan was to google things. The specific error: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number. That popped up HIVE-15326 and HIVE-15016 but gave me no immediate joy.

After reaching out to James Rowland-Jones (t), we (by which I mean he) eventually figured out the issue.

Click through for the solution.

Formatting Lists of Values with DAX

Alberto Ferrari and Patrick LeBlanc have a great video on formatting lists of filter values in DAX like 2003, 2005-2007, 2009:

Alberto Ferrari joins Patrick to walk through how you can use DAX to format a list of values within Power BI Desktop. This takes the concatenate values quick measure to the next level.

Transmuting Adam into Alberto shows Patrick’s ultimate power.
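
If you want a similar output from T-SQL rather than DAX, a rough analogue (SQL Server 2017 and later for STRING_AGG) is a gaps-and-islands pass to collapse consecutive years; the list of years here is made up for the example:

WITH Years AS (
    SELECT CalendarYear
    FROM (VALUES (2003), (2005), (2006), (2007), (2009)) AS y(CalendarYear)
),
Islands AS (
    -- Consecutive years share the same (year minus row number) value.
    SELECT CalendarYear,
           CalendarYear - ROW_NUMBER() OVER (ORDER BY CalendarYear) AS Grp
    FROM Years
),
Ranges AS (
    SELECT MIN(CalendarYear) AS FirstYear, MAX(CalendarYear) AS LastYear
    FROM Islands
    GROUP BY Grp
)
SELECT STRING_AGG(
           CASE WHEN FirstYear = LastYear THEN CONVERT(varchar(4), FirstYear)
                ELSE CONCAT(FirstYear, '-', LastYear) END, ', ')
       WITHIN GROUP (ORDER BY FirstYear) AS YearList
FROM Ranges;  -- returns 2003, 2005-2007, 2009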
