There are problems with K-anonymous datasets, namely the homogeneous pattern attack and the background knowledge attack, details of which are in my original post. A slightly different approach to anonymising public datasets comes in the form of ℓ-diversity, a way of introducing further entropy/diversity into a dataset.
A sensitive data record is made of the following microdata types: the ID; any Key Attributes; and the confidential outcome attribute(s). ℓ-diversity seeks to extend the equivalence classes that we created using K-anonymity, by generalisation and masking of the quasi-identifiers (the QI groups), to the confidential attributes in the record as well. The ℓ-diversity principle demands that, in each QI-group, at most 1/ℓ of its tuples can have an identical sensitive attribute value.
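The 1/ℓ rule quoted above can be checked mechanically. The following is a minimal Python sketch, not code from the linked post: the function name `is_l_diverse`, the record layout, and the sample column names (`age`, `zip`, `condition`) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def is_l_diverse(records, qi_keys, sensitive_key, l):
    """Return True if no sensitive value covers more than 1/l of any QI-group.

    records: list of dicts (hypothetical layout, already generalised/masked).
    qi_keys: names of the quasi-identifier columns that define a QI-group.
    """
    groups = defaultdict(list)
    for rec in records:
        # Records sharing the same quasi-identifier values form one QI-group.
        groups[tuple(rec[k] for k in qi_keys)].append(rec[sensitive_key])

    for values in groups.values():
        most_common_count = Counter(values).most_common(1)[0][1]
        # "at most 1/l of the tuples" means count <= n / l, i.e. count * l <= n.
        if most_common_count * l > len(values):
            return False
    return True


# Illustrative data only: the second group is homogeneous ("flu" twice).
records = [
    {"age": "20-29", "zip": "130**", "condition": "flu"},
    {"age": "20-29", "zip": "130**", "condition": "cancer"},
    {"age": "30-39", "zip": "148**", "condition": "flu"},
    {"age": "30-39", "zip": "148**", "condition": "flu"},
]
print(is_l_diverse(records, ["age", "zip"], "condition", 2))
```

With this sample the check fails for ℓ = 2, because every record in the second QI-group has the same sensitive value; that is the kind of group the homogeneous pattern attack exploits.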
L-diversity is not perfect either, but Duncan gives a good explanation of the topic.
DBAs! I guess everyone knows that a huge number of Virtual Log Files (VLFs) in SQL Server can slow down the backup/restore and database recovery processes. In rare cases, it can even introduce slowness at the transaction level.
Even though we are all aware of this, removing high VLF counts from SQL Server log files has remained a challenge, since it involves a lot of manual effort shrinking and re-growing the log files.
This is where my ‘Space-Capacity-Automation’ open source project comes to your rescue. It has a parameter option, @optimizeLogFiles, that can help you optimize your log files by performing the tasks below:
Click through to see Ajay’s technique.
Itzik Ben-Gan has started a new series on T-SQL bugs, pitfalls, and best practices. The first post in this series is on non-deterministic behavior:
A nondeterministic function is a function that, given the same inputs, can return different results in different executions of the function. Classic examples are SYSDATETIME, NEWID, and RAND (when invoked without an input seed). The behavior of nondeterministic functions in T-SQL can be surprising to some, and could result in bugs and pitfalls in some cases.
Many people assume that when you invoke a nondeterministic function as part of a query, the function gets evaluated separately per row. In practice, most nondeterministic functions get evaluated once per reference in the query.
This is the start to a great series.
OK, so now that we have the dependencies installed we can create a notebook. I decided to use the ValidationResults database that I use for my dbachecks demos and describe here. I need to restore it from my local folder that I have mapped as a volume to my container. Of course, I use dbatools for this
Click through to see how to install and use SQL notebooks.
SQL Server has had many different methods to track changes to data. There has been old-fashioned trigger-based logging, Change Data Capture, and Change Tracking. All of these features allow you to see how data has changed with varying degrees of how far you can go back historically to see the changes. However, there has never been the ability to see how the entire table looked at any given point in time. That is what Temporal Tables do – they log every change that happens to the table. When you want to query it for any specific period of time, SQL Server will do the hard work of getting the snapshot of how the data in the entire table looked at that time.
A great introduction on how to set up temporal tables in different ways with various limitations can be found here. In this article, you will learn how to set up versioning when creating a new table and how to convert a table with an existing history to a system-versioned temporal table.
Read on to learn how to use temporal tables.
Spark as a compute engine is very widely accepted by most industries. Most of the old data platforms based on MapReduce jobs have been migrated to Spark-based jobs, and some are in the phase of migration. In short, batch computation is being done using Spark. As a result, organizations’ infrastructure and expertise have been developed around Spark.
So now the question is: can Spark solve the problem of batch consumption of data inherited from Kafka? The answer is yes.
The advantages of doing this are: having a unified batch computation platform, reusing existing infrastructure, expertise, monitoring, and alerting.
Click through to get to the starting point on this as well as a few tips to avoid stumbling blocks.
AWS provides a lot of services, and these services are sufficient to run your architecture. The backbone of this architecture's security is the VPC (Virtual Private Cloud). A VPC is basically a private cloud in the AWS environment that lets you use AWS services in your own defined private space. You have control over the virtual network, and you can also restrict incoming traffic using security groups.
Overall, a VPC helps you secure your environment and gives you complete authority over incoming traffic. There are two types of VPC: the default VPC, which Amazon creates for you, and a non-default VPC, which you create to meet your own security needs.
Now that you have an idea of how VPC works, I will take you through the different services offered by Amazon VPC.
Read on to see how to set one up.
How is it possible to compute the difference between the two dates, only computing working days and skipping weekends and holidays? Simple math is no longer useful here, and DAX does not offer a predefined function.
A solution to this scenario requires a date table – more details here – with a specific column, IsWorkingDay, which indicates whether that particular day is a working day or not. The following figure shows an example:
Another good use of date tables (AKA calendar tables), which are also quite useful in T-SQL queries.
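The date-table idea is not specific to DAX. As a rough illustration, here is a minimal Python sketch of the same approach; it is not the DAX from the linked article. The helper names `build_date_table` and `working_days_between`, the inclusive treatment of both endpoints, and the sample holiday are assumptions; only the IsWorkingDay column comes from the excerpt.

```python
from datetime import date, timedelta


def build_date_table(start, end, holidays=()):
    """Build one row per calendar day with an IsWorkingDay flag.

    Assumption: Saturday and Sunday are the weekend; holidays are passed in.
    """
    holidays = set(holidays)
    table = []
    day = start
    while day <= end:
        is_working = day.weekday() < 5 and day not in holidays
        table.append({"Date": day, "IsWorkingDay": is_working})
        day += timedelta(days=1)
    return table


def working_days_between(date_table, start, end):
    """Count working days from start to end, both endpoints included."""
    return sum(
        1
        for row in date_table
        if start <= row["Date"] <= end and row["IsWorkingDay"]
    )


# Two full weeks starting Monday 2024-01-01, with that Monday as a holiday.
calendar = build_date_table(
    date(2024, 1, 1), date(2024, 1, 14), holidays=[date(2024, 1, 1)]
)
print(working_days_between(calendar, date(2024, 1, 1), date(2024, 1, 14)))
```

Putting weekends and holidays into the table as a flag keeps the counting logic to a simple filter and sum, which is the same reason a calendar table works well in T-SQL.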
My initial plan was to google things. The specific error:
java.lang.IllegalArgumentException: Unrecognized Hadoop major version number. That pops up HIVE-15326 and HIVE-15016, but they gave me no immediate joy.
After reaching out to James Rowland-Jones (t), we (by which I mean he) eventually figured out the issue.
Click through for the solution.
Alberto Ferrari joins Patrick to walk through how you can use DAX to format a list of values within Power BI Desktop. This takes the concatenate values quick measure to the next level.
Transmuting Adam into Alberto shows Patrick’s ultimate power.