Press "Enter" to skip to content

Month: August 2020

Explaining the ROC Plot

Nina Zumel takes us through what each element of a ROC curve means:

In our data science teaching, we present the ROC plot (and the area under the curve of the plot, or AUC) as a useful tool for evaluating score-based classifier models, as well as for comparing multiple such models. The ROC is informative and useful, but it’s also perhaps overly concise for a beginner. This leads to a lot of questions from the students: what does the ROC tell us about a model? Why is a bigger AUC better? What does it all mean?

Read on for the answer.

Comments closed

Creating a Power BI Report in Portrait Mode

David Eldersveld controls the horizontal; David Eldersveld controls the vertical:

The default Power BI report layout uses a landscape orientation. While it’s not common to see report pages that use portrait mode, Prathy Kamasani (blog/twitter) has a notable example in Power BI here. Others have tried it as well. It’s used occasionally in other data communities or data-driven journalism. Overall, if you want to guide someone through your analysis by having them scroll like a web page, it may be an option to try.

So what happens if I create a report with a vertical layout in Power BI?

Read on for an example as well as some pros and cons of the idea.

Comments closed

Managing Azure IoT Digital Twins with Python

Paul Hernandez continues a series on digital twins:

As mentioned in my previous post, Azure Digital Twins (ADT) Service is a new way to build next generation IoT solutions. In the first post I show you in a video how to manage ADT instances with the ADT Explorer. In the second post I show how to do mostly the same but using Postman and the ADT Rest API.

ADT has control plane APIs and data plane APIs. The latest is used to manage elements in the ADT instance. In order to use these APIs Microsoft published a .Net (C#) SDK. And SDK is a convenient way to manage instances, since you can easily create applications for your digital twins. If for any reason you prefer to use another language like java, javascript or python, you need to generate your own SDK.

In this post I describe how to autogenerate a Python SDK using the tool Autorest and a swagger file.

Read on to see how you can manage digital twins by generating an SDK.

Comments closed

Skipping SSIS Package Validation

Andy Leonard takes us through a new feature in Visual Studio:

A new feature in version 3.9 of the Visual Studio 2019 SSIS extension is “skip package validation.” I’ve tested the functionality and it works well. Here’s how you set it up:

Click through to see how easy it is to enable. I’d imagine that this would work really well when dealing with dynamic packages where connection details aren’t available until runtime. Waiting for validation to fail to connect to a remote data source (on the UI thread!) was always annoying.

Comments closed

Estimated I/O Costs and Spills

Hugo Kornelis goes deep into the rabbit hole:

Typically, when I look at an operator that does I/O, I expected to see values in both the Estimated CPU Cost and the Estimated I/O Cost. An Index Seek locates specific data in an index. The CPU has to do some work, but most of the cost of this operator is the actual I/O. So when I look at an execution plan and I see an Index Seek that has an Estimated CPU Cost of  0.0001787 and an Estimated I/O Cost of 0.003125 (almost 20 times as much), I am not surprise. That’s an expected ratio.

Conversely, operators that don’t do any I/O should of course be estimated to have a zero I/O cost. And that is indeed the case. Operators such as Nested LoopsAssert, or Row Count Spool will always have an Estimated I/O Cost of zero.

And then there are some operators where the Estimated I/O Cost may or may not be zero.

Hugo then walks us through a case where the optimizer is promising you a bad time if you run the query.

Comments closed

T-SQL Tuesday 129 Round-Up

Tamera Clark wraps up T-SQL Tuesday #129:

Well, I don’t even know what to say, such great ideas and creativeness. The last couple of days I have internally struggled with, as Glenn wrote this ” human malware situation”. Reading all of these posts has made me smile.

Read on for twelve people’s thoughts on this month’s topic, a time capsule for SQL Server professionals.

Comments closed

Querying Power BI from Visual Studio Code

Phil Seamark shows us how to write queries against Power BI using Visual Studio Code:

It’s helpful to understand there are two main client libraries for Analysis Services. A client library is what you can add to any new Visual Studio Code Project to provide objects, methods and functions relevant for the tool you are building.

Make sure you download the NetCore (.Net Core) versions of these libraries when working with Visual Studio Code. There are .Net Framework versions of these libraries that are more suited to use with the full Visual Studio product.

Read on for links to those libraries and a thorough demonstration.

Comments closed

Using Flink Stateful Functions

Igal Shilman gives us an example of where Apache Flink’s Stateful Functions come in handy:

In this blog post, we’ll take a look at a class of use cases that is a natural fit for Flink Stateful Functions: monitoring and controlling networks of connected devices (often called the “Internet of Things” (IoT)).

IoT networks are composed of many individual, but interconnected components, which makes getting some kind of high-level insight into the status, problems, or optimization opportunities in these networks not trivial. Each individual device “sees” only its own state, which means that the status of groups of devices, or even the network as a whole, is often a complex aggregation of the individual devices’ state. Diagnosing, controlling, or optimizing these groups of devices thus requires distributed logic that analyzes the “bigger picture” and then acts upon it.

A powerful approach to implement this is using digital twins: each device has a corresponding virtual entity (i.e. the digital twin), which also captures their relationships and interactions. The digital twins track the status of their corresponding devices and send updates to other twins, representing groups (such as geographical regions) of devices. Those, in turn, handle the logic to obtain the network’s aggregated view, or this “bigger picture” we mentioned before.

Read on to see where stateful functions come into play.

Comments closed

Late-Arriving Data with Spark Streaming

Sarfaraz Hussain continues a series on Spark streaming:

The size of the State (discussed in the previous blog) will continue to increase indefinitely as we really don’t know when a bucket can be closed.

But practically a query is not going to receive data 1 week late or in that matter such late-arriving data is of no use to us.

So, to specify the information when to stop considering older buckets for the streaming query we use Watermark.

Read on to see how you can design and implement a watermark.

Comments closed

A Review of Azure Synapse Analytics

Teo Lachev looks at Azure Synapse Analytics:

There is plenty to like in Azure Synapse which is the evaluation of Azure SQL DW. If you’re tasked to implement a cloud-based data warehouse, you have a choice among three Azure SQL Server-based PaaS offerings, including Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse. In a nutshell, Azure SQL Database and Azure SQL MI are optimized for OLTP workloads. For example, they have full logging enabled and replicate each transaction across replicas. Full logging is usually a no-no for decent size DW workloads because of the massive ETL changes involved.

In addition, to achieve good performance, you’ll find yourself moving up the performance tiers and toward the price point of the lower Azure Synapse SKUs. Not to mention that unlike Azure SQL Database, Azure Synapse can be paused, such as when reports hit a semantic layer instead of DW, and this may offer additional cost cutting options.

Teo focuses primarily on SQL Pools and the more SQL-friendly side of things (ELT and Power BI) rather than Spark pools.

Comments closed