Press "Enter" to skip to content

Author: Kevin Feasel

When Totals in Power BI Look Inaccurate

Marco Russo and Alberto Ferrari ask who you believe, them or your lying eyes:

When looking at a report, it is natural to double-check the numbers produced. The simplest and most intuitive way is to verify whether the total equals the sum of individual rows. This behavior is extremely natural and mostly effective. Nonetheless, the total is the sum of rows only for additive measures, which are measures that are naturally computed as a sum.

When working with business intelligence solutions, sooner or later a developer will author a calculation that is non-additive. At that point, the total can no longer be computed by summing the rows for a very good reason: it would be inaccurate. When users complain about the fact that the rows do not sum up, seasoned BI developers offer a rational explanation of the reasons why the numbers are not summed: this process often provides a better understanding of how values are computed. Choosing the easy way out of introducing additivity in a naturally non-additive calculation means losing the opportunity to generate accurate calculations, and relying on inaccurate values.

Read on for examples and to understand how to deal with non-additive or semi-additive measures.
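
To make the distinction concrete, here is a small illustration (mine, not from the article) in Python: a distinct count is a classic non-additive measure, so the per-category values genuinely should not sum to the grand total.

```python
import pandas as pd

# Hypothetical sales data: one customer can buy from several categories.
sales = pd.DataFrame({
    "category": ["Bikes", "Bikes", "Helmets", "Helmets", "Gloves"],
    "customer": ["Ann",   "Bob",   "Ann",     "Cid",     "Ann"],
    "amount":   [100,     200,     30,        25,        10],
})

# Additive measure: the per-category sums (300, 10, 55) add up to the grand total.
print(sales.groupby("category")["amount"].sum())
print(sales["amount"].sum())            # 365

# Non-additive measure: distinct customers per category sum to 5...
print(sales.groupby("category")["customer"].nunique())
# ...but the grand-total distinct count is 3, because Ann appears in three categories.
print(sales["customer"].nunique())      # 3
```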


Bitemporal Modeling and Running Totals

John Mount solves a running total problem in Python:

An example of this is wanting to know how many reservations for a San Francisco Symphony concert scheduled for December 4th 2022 are known to have been made by October 22nd 2022. This could be used as part of an attendance demand model that is evaluated on October 22nd 2022. The “fifty-cent word” for this is “bitemporal” modeling or data.

As I read through the solution, my initial thought is that, if the data is in a relational database, a running total operation SUM(reservation_count) OVER (PARTITION BY target_date ORDER BY action_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) would form the basis of a solution. Still, this is an interesting exercise in translating a SQL operation into equivalent Python and seeing just how much we get to take for granted.
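
As a rough pandas sketch of that idea (the column names are mine, not John's), the groupby-plus-cumsum equivalent of that window function looks like this:

```python
import pandas as pd

# Hypothetical bitemporal reservations: action_date is when we learned about the
# reservation, target_date is the concert it applies to.
reservations = pd.DataFrame({
    "target_date": pd.to_datetime(["2022-12-04"] * 4 + ["2022-12-11"] * 2),
    "action_date": pd.to_datetime(["2022-10-01", "2022-10-15", "2022-10-22",
                                   "2022-11-01", "2022-10-10", "2022-10-20"]),
    "reservation_count": [10, 5, 7, 3, 4, 6],
})

# Equivalent of SUM(reservation_count) OVER (PARTITION BY target_date
#                                            ORDER BY action_date):
reservations = reservations.sort_values(["target_date", "action_date"])
reservations["known_so_far"] = (
    reservations.groupby("target_date")["reservation_count"].cumsum()
)

# Reservations for the 2022-12-04 concert known as of 2022-10-22: 10 + 5 + 7 = 22.
as_of = (reservations["target_date"] == "2022-12-04") & \
        (reservations["action_date"] <= "2022-10-22")
print(reservations.loc[as_of, "known_so_far"].iloc[-1])   # 22
```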


Designing Event Streams for Kafka

Dave Shook announces a new course:

Properly designing your events and event streams is essential for any event-driven architecture. Precisely how you design and implement them will significantly affect not only what you can do today, but what you can do tomorrow. For such a critical part of any data infrastructure, most event streaming tutorials gloss over event design.

In the new course on Confluent Developer, events and event streams are put front and center. We’re going to look at the dimensions of event and event stream design and how to apply them to real-world problems. But dimensions and theory are nothing without best practices, so we are also going to take a look at these to help keep you clear of pitfalls and set you up for success. This course also includes hands-on exercises, during which you will work through use cases related to the different dimensions of event design and event streaming.

Click through to learn more about what’s in the course and to check it out–it is free, after all.


Transferring Data between Dedicated SQL and Spark Pools in Synapse

Sidney Cirqueria shows off a connector available to us in Azure Synapse Analytics:

Usually, customers do this kind of operation using Synapse Apache Spark to load data to a dedicated SQL pool within an Azure Synapse workspace, but today I would like to reproduce a different scenario that I was working on in one of my support cases. Consider a scenario where you are trying to load data from Synapse Spark to a dedicated SQL pool (formerly SQL DW) using Synapse Pipelines, and additionally you are using a Synapse workspace deployed with Managed Virtual Network.

The intention of this guide is to help you understand which configuration is required if you need to load data from Azure Synapse Apache Spark to a dedicated SQL pool (formerly SQL DW). If you prefer to take advantage of the new feature-rich capabilities now available via the Synapse workspace and Studio, and to load data directly from Azure Synapse Apache Spark to a dedicated SQL pool in an Azure Synapse workspace, it is recommended that you enable Synapse workspace features on an existing dedicated SQL pool (formerly SQL DW).

Read on for a few tips and a step-by-step walkthrough of the process.
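
For orientation, the connector code itself is compact; the bulk of the post is about the networking configuration around it. Here is a rough PySpark sketch based on the connector's documented synapsesql read/write pattern (database and table names are placeholders, and your runtime version may require additional server or authentication options):

```python
# Runs inside a Synapse notebook on a Spark 3 pool; this import registers the
# dedicated SQL pool connector per its documentation (verify against your runtime).
import com.microsoft.spark.sqlanalytics

# Read from a table in the dedicated SQL pool (formerly SQL DW).
df = spark.read.synapsesql("MyDedicatedPool.dbo.FactSales")

# ... transform df with ordinary Spark operations ...

# Write the result back to a new table in the dedicated pool.
(df.write
   .mode("overwrite")
   .synapsesql("MyDedicatedPool.dbo.FactSales_Curated"))
```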


Fine-Tuning Hugging Face for Named Entity Recognition in Japanese

Tsuyoshi Matsuzaki tries out a named entity recognition project with the Hugging Face library:

Now a lot of AI companies (such as OpenAI, NLP Cloud, Google, NVIDIA, etc.) provide pre-trained large language models, along with methods for fine-tuning them. Among such tools and frameworks, Hugging Face is widely used and provides over 20,000 transformer-based models.

In this post, I’ll show you a brief fine-tuning example of transformer models in Hugging Face to get you started.
In the last part of this post, I’ll also optimize training with DeepSpeed, which is well integrated with Hugging Face transformers.

Click through for the results of this analysis.
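
If you have not fine-tuned a token-classification model before, the general shape of the workflow is roughly as follows (a generic sketch, not the post's Japanese-specific code; the checkpoint and dataset below are stand-ins):

```python
# Generic token-classification (NER) fine-tuning sketch with Hugging Face.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer,
                          DataCollatorForTokenClassification)
from datasets import load_dataset

dataset = load_dataset("conll2003")                      # stand-in NER dataset
labels = dataset["train"].features["ner_tags"].feature.names

checkpoint = "bert-base-cased"                           # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels))

def tokenize_and_align(batch):
    # Tokenize pre-split words and align word-level NER tags to sub-word tokens,
    # masking special tokens and continuation pieces with -100.
    tokenized = tokenizer(batch["tokens"], truncation=True,
                          is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous, label_ids = None, []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                label_ids.append(-100)
            else:
                label_ids.append(tags[word_id])
            previous = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_ds = dataset.map(tokenize_and_align, batched=True)

args = TrainingArguments("ner-finetune", per_device_train_batch_size=16,
                         num_train_epochs=3, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized_ds["train"],
                  eval_dataset=tokenized_ds["validation"],
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```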


Diagnosing Performance Problems with EvaluateAndLog

Chris Webb wants to sort out some performance issues with calculation groups:

A few weeks ago I wrote a post showing how you can use the new EvaluateAndLog DAX function to diagnose performance problems relating to the use of the Switch function. Did you know that calculation groups can experience similar performance problems though? In some scenarios limited evaluation takes place for all calculation items, not just the one you expect; luckily you can use EvaluateAndLog to diagnose this too. In this post I’ll show you a simple example.

Read on for the example.


Azure SQL DB GP Compute Optimized Performance Comparison

Reitse Eskens continues a series on comparing the performance of different Azure SQL DB tiers:

The compute optimized tier starts at 8 cores as a minimum and goes up to 72.
The maximum storage starts at 1 TB and goes up to 4 TB. The number of TempDB files is undisclosed, but the size starts at 37 GB and goes up to 333 GB. A huge difference from the regular provisioned tier! The disk limitations are disclosed as well. Log rate starts at 36 MBps and maxes out at 50. The same goes for the data: starting at 2,560 IOPS and maxing out at 12,800. One IOP (or Input Output oPeration) is connected to the disk cluster size. As these are 4 KB, reading or writing one data page (8 KB) equals two IOPS. 2,560 IOPS equals something like 10 MB per second. The top end goes to about 50 MB per second. Keep this in mind when you’re working out which tier you need, because disk performance is often essential.

Read on to see how it performs on Reitse’s standardized test workload.
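
The IOPS-to-throughput arithmetic in that excerpt is easy to sanity-check:

```python
# Back-of-the-envelope check of the quoted figures: IOPS x block size = throughput.
def iops_to_mb_per_sec(iops, block_size_kb=4):
    return iops * block_size_kb / 1024   # KB/s -> MB/s

print(iops_to_mb_per_sec(2_560))    # 10.0 MB/s at the minimum
print(iops_to_mb_per_sec(12_800))   # 50.0 MB/s at the maximum
```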


Shiny App Dockerfile Automation

Jamie Owen and Colin Gillespie don’t have time to write dockerfiles:

For creating a production deployment of a {shiny} application it is often useful to be able to provide a Docker image that contains all the dependencies for that application. Here we explore how one might go about automating the creation of a Dockerfile that will allow us to build such an image for a {shiny} application.

There are some neat tricks in here.
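
The authors do this with R tooling; purely to illustrate the shape of the idea (not their implementation), automating a Dockerfile for a {shiny} app mostly comes down to templating a base image, the detected package dependencies, and the app copy step:

```python
# Illustration only -- the post uses R tooling rather than Python.
def shiny_dockerfile(r_packages, app_dir="app"):
    pkgs = "', '".join(r_packages)
    return "\n".join([
        "FROM rocker/shiny:latest",                     # rocker's Shiny base image
        f"RUN R -e \"install.packages(c('{pkgs}'))\"",  # detected dependencies
        f"COPY {app_dir} /srv/shiny-server/{app_dir}",  # deploy the app
        "EXPOSE 3838",
    ])

print(shiny_dockerfile(["shiny", "dplyr", "ggplot2"]))
```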


Working with Transformer Models for Machine Translation

Stefania Cristina continues a series on transformer models. First up is plotting loss curves:

We have previously seen how to train the Transformer model for neural machine translation. Before moving on to inferencing the trained model, let us first explore how to modify the training code slightly, in order to be able to plot the training and validation loss curves that can be generated during the learning process. 

The training and validation loss values provide important pieces of information, because they give us better insight into how the learning performance changes over the epochs, and they help us diagnose any problems with learning that can lead to an underfit or an overfit model. They will also inform us about the epoch at which to use the trained model weights at the inferencing stage.
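
The tutorial is Keras-based; as a generic sketch of the plotting side (not Stefania's exact training loop), collecting per-epoch losses and charting them looks something like this:

```python
# Generic sketch (not the article's code): train a tiny Keras model and plot
# training vs. validation loss per epoch from the History object.
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

# Toy data standing in for the translation batches used in the tutorial.
x = np.random.rand(1000, 8)
y = np.random.rand(1000, 1)

model = keras.Sequential([keras.layers.Dense(16, activation="relu"),
                          keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

history = model.fit(x, y, validation_split=0.2, epochs=20, verbose=0)

# The gap between the two curves is what hints at under- or over-fitting and
# at which epoch to stop.
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```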

Then we get to try it out:

We have seen how to train the Transformer model on a dataset of English and German sentence pairs, as well as how to plot the training and validation loss curves in order to diagnose the model’s learning performance and decide at which epoch to inference the trained model. We are now ready to inference the trained Transformer model for the purpose of translating an input sentence.

In this tutorial, you will discover how to inference the trained Transformer model for neural machine translation. 

Click through for the results and to see exactly why there’s so much computational effort dumped into high-end trained models.


Lock Escalation outside of Repeatable Read

Paul White continues a series on lock escalation:

When SQL Server reads data under locking read committed isolation (the default for non-cloud offerings) and chooses row-level locking granularity, it normally only locks one row at a time. The engine processes the row through parent operators before releasing the shared lock prior to locking and reading the next row.

When the engine chooses page-level locking granularity, it takes and holds a shared page lock while processing rows on the current page and then releases the lock right before acquiring a lock on the next page.

In general, SQL Server is careful to take and hold locks only when needed and only long enough to guarantee correct results, taking account of the current isolation level. When locking at row granularity, the engine can skip row locks entirely when it’s safe to do so.

Read on to understand when SQL Server gets it wrong or when exigent factors alter the story.
