
Day: June 11, 2020

Using Random Cut Forests for Anomaly Detection

Chris Swierczewski and Lai Jiang have an example of using Random Cut Forests to perform anomaly detection against a dataset stored in Amazon Elasticsearch Service:

Based on these constraints and performance results from internal and publicly available benchmarks across many data domains, we chose the RCF algorithm for computing anomaly scores in data streams.

But this begs the question: How large of an anomaly score is large enough to declare the corresponding data point as an anomaly? The anomaly detector uses a thresholding model to answer this question. This thresholding model combines information from the anomaly scores observed thus far and certain mathematical properties of RCFs. This hybrid information approach allows the model to make anomaly predictions with a low false positive rate when relatively little data has been observed, and effectively adapts to the data in the long run. The model constructs an efficient sketch of the anomaly score distribution using the KLL Quantile Sketch algorithm. For more information, see Optimal Quantile Approximation in Streams.

The linked post is more of an explanation of process than a tutorial, but it’s interesting to see how different approaches can find anomalies at different rates.
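To make the thresholding idea concrete, here is a minimal sketch in Python. It is not the model from the linked post: it keeps an exact sorted list of past scores instead of a KLL sketch, and it ignores the mathematical properties of RCFs entirely. The class name `QuantileThreshold` and the parameters `quantile` and `min_samples` are hypothetical, chosen only for illustration.

```python
import bisect


class QuantileThreshold:
    """Flags a score as anomalous when it exceeds a running empirical
    quantile of the scores seen so far.

    Assumption: an exact sorted list stands in for the KLL quantile
    sketch mentioned in the post; a real sketch would use far less memory.
    """

    def __init__(self, quantile=0.99, min_samples=30):
        self.quantile = quantile        # hypothetical parameter
        self.min_samples = min_samples  # hypothetical warm-up length
        self._sorted_scores = []

    def threshold(self):
        """Return the current empirical quantile of the observed scores."""
        n = len(self._sorted_scores)
        index = min(int(self.quantile * n), n - 1)
        return self._sorted_scores[index]

    def update(self, score):
        """Judge the score against past scores, then record it."""
        is_anomaly = False
        # Make no predictions until enough scores have been observed.
        if len(self._sorted_scores) >= self.min_samples:
            is_anomaly = score > self.threshold()
        bisect.insort(self._sorted_scores, score)
        return is_anomaly


detector = QuantileThreshold(quantile=0.95, min_samples=10)
for i in range(100):
    detector.update(0.1 * (i % 10))  # ordinary scores between 0.0 and 0.9

print(detector.update(5.0))  # a score far above anything seen so far
print(detector.update(0.2))  # a score well inside the usual range
```

The warm-up check is a crude stand-in for the post's point about keeping false positives low when little data has been observed; the real model handles that case with information this sketch does not have.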


Higher-Order Functions in Scala

Rahul Agarwal explains how higher-order functions make your life easier:

As a part of the functional programming paradigm, whatever logic we need to write is to be implemented in terms of pure and immutable functions. Here, functions take arguments from other functions as input and return values/functions which are used by other functions for further processing. Here, pure means that the function does not produce any side-effects like printing to the console, and immutable means that the function takes in and produces immutable data (val) only.

Higher-order functions comply with the above idea. As compared to for loops, we can iterate a data structure using higher-order functions with much less code.

The term “higher-order function” can sound a bit overwhelming if you’re completely unfamiliar, but it’s a pretty simple concept: a function which takes another function as (at least) one of its inputs, or which returns a function as its result. As Rahul points out, this is quite the useful concept.
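As a rough illustration of the concept, here is a short sketch. It is written in Python rather than Scala, and it is not taken from Rahul's post; the helper names `apply_to_each` and `make_multiplier` are made up for this example. It contrasts a plain loop with the higher-order equivalent.

```python
from typing import Callable, Iterable, List


def apply_to_each(transform: Callable[[int], int], values: Iterable[int]) -> List[int]:
    """Higher-order: takes another function (transform) as an input."""
    return [transform(value) for value in values]


def make_multiplier(factor: int) -> Callable[[int], int]:
    """Higher-order in the other direction: returns a new function."""
    def multiply(value: int) -> int:
        return value * factor
    return multiply


numbers = [1, 2, 3, 4]

# Loop version: the iteration and the logic are tangled together.
doubled_loop = []
for n in numbers:
    doubled_loop.append(n * 2)

# Higher-order version: the iteration is reused, only the logic is passed in.
doubled_hof = apply_to_each(make_multiplier(2), numbers)

# The built-in filter is itself a higher-order function.
evens = list(filter(lambda n: n % 2 == 0, numbers))

print(doubled_loop, doubled_hof, evens)
```

Neither helper modifies its inputs or has side effects, which matches the "pure" requirement described in the quoted passage.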


Creating Data-Driven Power BI Report Subscriptions

John White shows how to create a data-driven subscription for a Power BI report:

One of the features that has never made the leap from SQL Server Reporting Services (SSRS) on-premises to the cloud is data-driven subscriptions. Users can subscribe to reports, but a data-driven subscription allows individual subscriptions to be stored in a central location and parameterized, while delivering the reports to multiple locations. This article will describe a pattern for accomplishing this using SharePoint lists as the subscription store, and Power Automate as the automation tool, for a no-code solution to this requirement.

The alternative would be to use Power BI Report Server, but if you’re not using that, this is an interesting approach.


Optimizing Derived Table Expressions

Itzik Ben-Gan continues a series on table expressions:

As mentioned, next month I’ll get to the details of unnesting of derived tables. For now, suffice to say that SQL Server normally does apply an unnesting/inlining process to derived tables, where it substitutes the nested queries with a query against the underlying base tables. Well, I’m oversimplifying a bit. It’s not like SQL Server literally converts the original T-SQL query string with the derived tables to a new query string without those; rather SQL Server applies transformations to an internal logical tree of operators, and the outcome is that effectively the derived tables typically get unnested. When you look at an execution plan for a query involving derived tables, you don’t see any mention of those because for most optimization purposes they don’t exist. You see access to the physical structures that hold the data for the underlying base tables (heap, B-tree rowstore indexes and columnstore indexes for disk-based tables and tree and hash indexes for memory optimized tables).

This article deserves a careful reading.


The Function of Service Broker Queues

Chris Johnson continues a series on Service Broker:

A queue is a full database object, like a table or a stored procedure. As such, it is part of a schema, and appears in the sys.objects view. A queue holds messages that have been sent to it, in the same way that a table does, and these messages can even be queried in the same way that you would query a table.

You can’t change the columns that are available, and there are quite a few of them. To see what there is, just run SELECT * against any queue, but a few of the key ones are service_name, service_contract_name, message_type_name, message_body, message_enqueue_time, conversation_handle.

Read on to see how to create a new queue.


A Power BI FAQ

James Serra answers questions about Power BI:

Should we have dev, test, and prod workspaces?

Yes! You should use change management to move reports through the dev/test/prod workspace tiers via the new deployment pipelines in Power BI. Use the workspaces to collaborate on Power BI content with your colleagues, while distributing the report to a larger audience by publishing an app. You should also promote and certify your datasets. The reports and datasets should have repeatable test criteria.

Read on for the full set of questions and answers.


Including Headers in Zero-Row ADF Data Flows

Mark Kromer meets a challenge:

Today, we don’t have an option in data flows in ADF to include headers when your transformations result in zero rows. But you can build the logic to handle this scenario. So, until we add a checkbox feature to include headers, you can use this technique below to achieve this.

Click through for the explanation, as well as a completed version you can take for your own.


Improving Power BI Performance

Dan Szepesi continues a series on Power BI performance tuning:

As an example, I am going to go through in detail how to use the results from the Performance Analyzer to understand the performance of your visuals. I downloaded the sample PBIX from the Power BI Documentation at Microsoft.com – https://docs.microsoft.com/en-us/power-bi/create-reports/sample-datasets and I will use the visuals from the Net Sales report in the screenshots that follow.

I am going to walk through how I would approach looking at the performance of the visuals on this report and show what we can learn from the data that the Performance Analyzer gives me.

Click through for that example as well as several helpful tips.
