Press "Enter" to skip to content

Month: June 2020

Obfuscating Data in SQL Server

Dave Mason has a data obfuscator:

In a previous post, I explored an option for generating fake data in SQL Server using Machine Learning Services and the R language. I've expanded on that by creating some stored procedures that can be used both for generating data sets of fake data and for obfuscating existing SQL Server data with fake data.

The code is available in a GitHub repository. For now, it consists of ten stored procedures.

Unlike Dynamic Data Masking, this is a permanent update to the table. That makes it quite helpful for getting production-style data distributions and use cases into non-production environments.
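
Dave's procedures aren't shown in the excerpt, but the underlying mechanic is SQL Server's sp_execute_external_script calling out to R. As a minimal sketch of that mechanic (assuming Machine Learning Services with R is installed and enabled; this is base R for illustration, not Dave's code):

```sql
-- Generate ten fake "names" by calling R from T-SQL.
-- Dave's stored procedures wrap much richer generators than this.
EXEC sp_execute_external_script
    @language = N'R',
    @script   = N'
        set.seed(42)
        fake <- replicate(10, paste0(sample(letters, 8, replace = TRUE), collapse = ""))
        OutputDataSet <- data.frame(FakeName = fake, stringsAsFactors = FALSE)'
WITH RESULT SETS ((FakeName NVARCHAR(50)));
```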

Open-sourcing Kube2Hadoop

Cong Gu, et al., announce the open-sourcing of a project:

By default, there is a gap between the security model of Kubernetes and Hadoop. Specifically, Hadoop uses Kerberos, a three-party protocol built on symmetric key cryptography, to ensure any clients accessing the cluster are who they claim to be. In order to avoid frequent authentication checks against a Kerberos server, Delegation Tokens, a lightweight two-party authentication method, were introduced to complement Kerberos authentication. The Hadoop delegation token by default has a lifespan of one day and can be renewed for up to seven days. Kubernetes, on the other hand, uses a certificate-based approach for authentication, and does not expose the owner of a job in any of its public-facing APIs. Therefore, it is not possible to securely determine the authorized user from within the pod using the native Kubernetes API and then use that username to fetch the Hadoop delegation token for HDFS access.

To allow for Kubernetes workloads to securely access HDFS, we built Kube2Hadoop, a scalable and secure integration with HDFS Kerberos. This enables AI modelers at LinkedIn to use HDFS data in Kubernetes pods with access control through a user account or a headless account. Headless accounts are oftentimes used to denote a virtual team that is working on projects that would share the same data within the team. The data acquired can then be used in their model exploration and training with KubeFlow components such as the tf-operator and mpi-operator. In this blog, we will describe the design and authentication model of Kube2Hadoop. 

Read on to see how it works and to find a link to the GitHub repo.

Incremental Refresh of Any Power BI Data Source

Gilbert Quevauvilliers wants to refresh all the sources incrementally:

The pattern that I am talking about is the following, which I will use as my example.

– Connect to a data source which supports query folding.
> In my example, I have installed and configured an Azure SQL serverless database.
> In this database, I have a date table (a sketch of one follows below).
– Configure the date table to use incremental refresh as per the blog post
> Incremental refresh in Power BI
– Create a function which will then use the date value as part of the parameter.
> In my example, I have CSV files in Azure Blob Storage which contain exchange rate information.
> The file name of each CSV file is the date.
– Invoke the function within the date table to extract the required information.
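
The post doesn't include the date table's definition, so here is a minimal sketch of one in Azure SQL that the incremental refresh range filters can fold against (table and column names are assumed):

```sql
-- Hypothetical date table; a filter on DateKey generated from Power BI's
-- RangeStart/RangeEnd parameters can fold to a WHERE clause against it.
CREATE TABLE dbo.DateTable
(
    DateKey DATE NOT NULL PRIMARY KEY
);

-- Populate four years of dates starting 2018-01-01.
INSERT INTO dbo.DateTable (DateKey)
SELECT DATEADD(DAY, n.n, '2018-01-01')
FROM (
    SELECT TOP (1461) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_objects
) AS n;
```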

I know what you might be thinking: that as soon as I add in the column with the function, it breaks query folding. That is what I thought too.

The great news is that Incremental refreshing DOES STILL WORK!

Read on for the demonstration.

Analysis versus Reporting and Power BI

Rob Collie thinks about industry movement between analysis and reporting. Part one gives us some backstory:

Excel was about to make a large investment in BI-related capabilities, and the powers that be had selected me to lead our part in it. I was excited, but now I needed a crash course in “what the hell is BI?” I was given multiple tutors, and they all were quick to introduce the concept of Analysis versus Reporting. The “versus” seemed to be pretty important. It wasn’t an “and” – no, the “versus” was chosen deliberately in these sermons. You see, these were Two Very Different Things.

I struggled mightily to grasp this difference. I was told that interactive things like PivotTables were Analysis tools – NOT Reporting tools! Reports were something completely different. “But,” I pointed out, “they’re called ‘Insert PivotTable Report’ on the Excel menu today!” (This was Excel 2003). “Yeah,” said the mentors, “…we might want to fix that.”

Part two explains why analysis and reporting are both important:

Another “meta characteristic” of paginated reports is that they TEND to display details rather than aggregations. EX: specific transactions rather than emergent trends. In paginated reports, you’re MORE likely (but not guaranteed!) to be looking at “raw” rows of data from the original database, whereas in a Power BI report, you’re more likely (but again, not guaranteed!) to NOT be seeing raw individual rows, but rather intelligent aggregations of MANY rows. But either way, more detail means you’re more likely to need multiple pages.

Rob’s right on the money. And I’m looking forward to part three of the series.

Altering the Database without Rolling Back Users

Kenneth Fisher wants to change a database:

If this strikes a bit too close to home for you, then you need to look at the ROLLBACK clause. It's great for killing and rolling back all of the current connections before making my change.

But this is a pretty sensitive app and if there’s something running I have to let it finish. No ROLLBACK allowed. But I’m also not going to wait forever to see if my alter is going to happen. Turns out there is a nice easy option for this too.
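
As a sketch of the two behaviors in play (the database name and the option being set are placeholders; the exact option Kenneth uses is in his post):

```sql
-- The heavy-handed approach: kick everyone out and roll back their open
-- transactions so the change happens right away.
ALTER DATABASE Sales SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- The gentler approach: if the change cannot happen immediately because
-- something is still running, fail at once rather than waiting around
-- or rolling anyone back.
ALTER DATABASE Sales SET READ_ONLY WITH NO_WAIT;
```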

Click through to see the option, as well as the message you get if it can’t work immediately.

One-Column Fusion with DAX

Phil Seamark has a performance tuning tip for DAX:

I wrote an article about an optimisation called DAX Fusion that attempts to fuse similar storage engine (SE) calls when it can. This article highlights an elegant DAX-based trick that doesn't rely on DAX fusion, but instead works for a specific scenario by reducing the number of SE calls. The difference between 1-Column fusion and the other is:

– DAX fusion in the engine works across multiple columns that have the same effective WHERE clause
– 1-Column fusion works by fusing multiple measures that all reference a single column, but with different WHERE clauses
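
Not Phil's code, but a sketch of the shape of the trick in DAX (table and column names are assumed): materialize the single column's groups once, then derive each differently-filtered value from that small table in the formula engine.

```dax
Red Plus Green Sales :=
VAR ByColor =
    ADDCOLUMNS (
        VALUES ( Sales[Color] ),    -- one grouped scan of the single column
        "@Qty", CALCULATE ( SUM ( Sales[Qty] ) )
    )
VAR RedQty   = SUMX ( FILTER ( ByColor, Sales[Color] = "Red" ), [@Qty] )
VAR GreenQty = SUMX ( FILTER ( ByColor, Sales[Color] = "Green" ), [@Qty] )
RETURN
    RedQty + GreenQty
```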

Read on to learn how to switch around a bit of DAX to reduce the number of storage engine calls, and to see an example of one scenario in which this trick can come in handy.

How SQL Server Stores Floating Point Types

Randolph West continues a series on SQL Server data type storage:

If an integer or decimal amount is a precise representation of a value, a floating point is the closest approximation of that value in binary. Programming languages and databases use floating point numbers to trade storage (and memory) costs against precision. A floating point value is imprecise, but even that is underselling the problem.

Randolph also breaks all of the rules and writes out the largest FLOAT value you can have.
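
To see that imprecision in action without cracking open any storage internals, here's a quick demonstration (plain T-SQL, nothing assumed beyond a recent version of SQL Server):

```sql
-- Add 0.1 to a FLOAT ten times. Because 0.1 has no exact binary
-- representation, the accumulated value lands just shy of 1.0.
DECLARE @f FLOAT = 0, @i INT = 0;
WHILE @i < 10
BEGIN
    SET @f += 0.1;
    SET @i += 1;
END;
SELECT @f AS accumulated,
       IIF(@f = 1.0, 'equal to 1.0', 'not equal to 1.0') AS verdict;
-- Returns 'not equal to 1.0'.
```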

Using Random Cut Forests for Anomaly Detection

Chris Swierczewski and Lai Jiang have an example of using Random Cut Forests to perform anomaly detection against a dataset stored in Amazon Elasticsearch Service:

Based on these constraints and performance results from internal and publicly available benchmarks across many data domains, we chose the RCF algorithm for computing anomaly scores in data streams.

But this begs the question: How large of an anomaly score is large enough to declare the corresponding data point as an anomaly? The anomaly detector uses a thresholding model to answer this question. This thresholding model combines information from the anomaly scores observed thus far and certain mathematical properties of RCFs. This hybrid information approach allows the model to make anomaly predictions with a low false positive rate when relatively little data has been observed, and effectively adapts to the data in the long run. The model constructs an efficient sketch of the anomaly score distribution using the KLL Quantile Sketch algorithm. For more information, see Optimal Quantile Approximation in Streams.

The linked post is more of an explanation of process than a tutorial, but it's interesting to see how different approaches can find anomalies at different rates.
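
The KLL sketch itself is more than a quick example can cover, but the core thresholding idea (flag a data point when its anomaly score exceeds a high quantile of the scores observed so far) is easy to illustrate. A sketch in T-SQL, purely for illustration; the real detector maintains this quantile incrementally with a KLL sketch rather than sorting all scores:

```sql
-- Toy anomaly scores; the last one is the outlier.
DECLARE @scores TABLE (score FLOAT NOT NULL);
INSERT INTO @scores (score)
VALUES (0.21), (0.30), (0.25), (0.28), (0.31), (0.27), (0.90);

-- Threshold = a high percentile of observed scores; anything above it
-- gets flagged as an anomaly.
SELECT DISTINCT
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY score) OVER () AS threshold
FROM @scores;
```

With seven toy scores a 95th percentile is crude, but the idea scales: as more scores arrive, the quantile estimate adapts, which is how the model keeps its false positive rate low.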
