
Author: Kevin Feasel

Top 5 and All Others in Power BI

Marco Russo and Alberto Ferrari show how you can display the top N rows and append an “Others” aggregate at the end:

Power BI offers the ability to apply a Top N constraint in a visual level filter, so that only a certain number of items are visible based on the evaluation of a measure. A common requirement is to show an additional row that accumulates the “other” items, which are those that are not visible in the report, as in the following figure.

In order to solve this scenario, you cannot use the Top N filter of Power BI. Instead, you apply the filter in a special measure (TopN Sales) and use a calculated table to accommodate the additional row named Others. Moreover, you need an additional column to make the Others row appear at the bottom of the table.

Read on to see how you can solve the problem.
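
The post’s solution is written in DAX, but the shape of the calculation translates elsewhere. As a rough T-SQL analogue (not the authors’ code), assuming a hypothetical dbo.Sales table with Product and Amount columns:

```sql
-- Rough T-SQL analogue of the top-N-plus-Others pattern; the table and
-- column names here are hypothetical, not from the original post.
WITH Ranked AS
(
    SELECT
        Product,
        SUM(Amount) AS TotalSales,
        ROW_NUMBER() OVER (ORDER BY SUM(Amount) DESC) AS rn
    FROM dbo.Sales
    GROUP BY Product
)
SELECT
    CASE WHEN rn <= 5 THEN Product ELSE 'Others' END AS Product,
    SUM(TotalSales) AS TotalSales
FROM Ranked
GROUP BY CASE WHEN rn <= 5 THEN Product ELSE 'Others' END
ORDER BY MIN(CASE WHEN rn <= 5 THEN rn ELSE 6 END);
```

The ORDER BY expression plays the same role as the extra column in the DAX solution: it forces the Others row to sort below the top five.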


Understanding PERCENTILE_CONT

Kathi Kellenberger takes us through the PERCENTILE_CONT window function:

I was recently playing with the analytical group of windowing functions, and I wanted to understand how they worked “under the covers.” I ran into a little logic puzzle with PERCENTILE_CONT by trying to write a query that returned the same results using pre-2012 functionality.

Given a list of ranked values, you can use the PERCENTILE_CONT function to find the value at a specific percentile. For example, if you have the grades of 100 students, you can use PERCENTILE_CONT to locate the score in the middle of the list, the median, or at some other percent such as the grade at 90%. This doesn’t mean that the score was 90%; it means that the position of the score was at the 90th percentile. If there is not a value at the exact location, PERCENTILE_CONT interpolates the answer.

I’m a bit disappointed with how poorly PERCENTILE_CONT performs against large data sets, especially if you need multiple percentiles. It’s bad enough that going into ML Services and getting percentiles with R is usually faster for me. But for datasets of less than 100K or so rows, it’s the easiest non-CLR method to get the median (with the easiest CLR method being SQL#).
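
For reference, here is a minimal example of the function itself; the Grades table is hypothetical:

```sql
-- Median and 90th percentile of a hypothetical dbo.Grades (Score) table.
-- PERCENTILE_CONT interpolates between the two nearest values when no
-- row falls exactly at the requested position.
SELECT DISTINCT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Score) OVER () AS MedianScore,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY Score) OVER () AS P90Score
FROM dbo.Grades;
```

Because it is a window function rather than an aggregate, it repeats the same answer on every row (hence the DISTINCT), which is likely part of what makes it expensive on large sets.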


Running Big Data Clusters on VS Subscriptions

Kevin Chant has a few tips for people wanting to try out Big Data Clusters using the Azure credits that come with their Visual Studio subscriptions:

In order to present the right results for various outcomes I attempted to deploy Big Data Clusters multiple times.

When I say multiple times, I mean the number of deployments easily went into double figures, because I was testing deployments of various virtual machine sizes in multiple regions.

Hence, I spent many hours testing and verifying the results in order to present them properly.

Read on to see Kevin’s notes and recommendations.


Refreshing Power BI Dataflows with PowerShell

Craig Porteous shows how to use the Power BI Dataflows REST API with PowerShell:

I like to use my favourite scripting language to do this: PowerShell. Although we have the Power BI Management PowerShell module (MicrosoftPowerBIMgmt) to interact with Power BI, the cmdlets aren’t yet there to refresh a dataflow or retrieve its history (or even a dataset’s). Still, the module can help us get what we need without jumping through too many hoops, as long as we aren’t automating the authentication (that’s another post).

Click through to see how it’s done.


When Power Query Hits Data Sources Repeatedly

Chris Webb answers an age-old question:

If you’re developing in Power BI Desktop and you think that refresh is taking a long time, you should definitely check whether the Power Query engine is hitting your data source more than once. There are lots of ways to do this. Some data sources have tools that show when they are queried, such as the Run History screen in Microsoft Flow that I show in the video or SQL Server Profiler. Other ways include using Fiddler for web services or Process Monitor for files.

Read the whole thing.
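
If the data source in question is SQL Server, one lightweight alternative to the Profiler approach Chris mentions is an Extended Events session; the session name here is my own invention:

```sql
-- Capture completed batches along with the calling application name, so
-- repeated Power Query hits during a refresh are easy to spot and count.
CREATE EVENT SESSION [WatchPowerQuery] ON SERVER
ADD EVENT sqlserver.sql_batch_completed
    (ACTION (sqlserver.client_app_name, sqlserver.sql_text))
ADD TARGET package0.ring_buffer;

ALTER EVENT SESSION [WatchPowerQuery] ON SERVER STATE = START;
```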


Backing Up SQL Server on Azure VMs

Arun Sirpal looks at three techniques for backing up SQL Server running on Azure virtual machines:

In the previous blog post I did a quick overview of building a SQL VM (from an image) in Azure. It is now time to clarify some backup techniques, because it can get confusing.

At a high level, there are three techniques:
– Automated backup.
– Azure Backup for SQL VM (that’s what Microsoft calls it).
– Manual backup, for example backup to URL.

Read on to learn more about each.
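
As a quick illustration of the third option, this is the general shape of a backup to URL; the storage account, container, and database names are placeholders:

```sql
-- A credential whose name matches the container URL lets BACKUP
-- authenticate with a Shared Access Signature.
CREATE CREDENTIAL [https://mystorageacct.blob.core.windows.net/backups]
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
    SECRET = '<SAS token here>';

BACKUP DATABASE [MyDatabase]
TO URL = 'https://mystorageacct.blob.core.windows.net/backups/MyDatabase.bak'
WITH COMPRESSION, CHECKSUM;
```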


MLflow on Databricks Community Edition

Jules Damji and Siddharth Murching have an interesting announcement:

Today, we are excited to extend Databricks Community Edition with hosted MLflow for free, as part of our ongoing commitment to help developers learn about the machine learning lifecycle. With the Community Edition, you can try tutorials that demonstrate how to track results and experiments as you build machine learning models—a crucial stage in the machine learning model’s development lifecycle.

MLflow is an open-source platform for the machine learning lifecycle with four components: MLflow Tracking, MLflow Projects, MLflow Models, and MLflow Registry. MLflow is now included in Databricks Community Edition, meaning that you can utilize its Tracking and Model APIs within a notebook or from your laptop just as easily as you would with managed MLflow in Databricks Enterprise Edition.

I like showing off Databricks Community Edition, and I’m glad to see them extend it a bit.


Streaming ETL of Rail Data with Kafka

Robin Moffatt has an interesting architecture and implementation for Kafka:

Trains are an excellent source of streaming data—their movements around the network are an unbounded series of events. Using this data, Apache Kafka® and Confluent Platform can provide the foundations for both event-driven applications as well as an analytical platform. With tools like KSQL and Kafka Connect, the concept of streaming ETL is made accessible to a much wider audience of developers and data engineers. The platform shown in this article is built using just SQL and JSON configuration files—not a scrap of Java code in sight.

The code is also available in a GitHub repo.
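
To give a flavor of the SQL involved, here is a hedged KSQL sketch of the pattern, deriving a filtered stream from a topic of raw events; the stream and column names are illustrative, not taken from Robin’s repo:

```sql
-- Register a stream over an existing topic of JSON train-movement events,
-- then derive a new stream containing only cancellations.
CREATE STREAM TRAIN_MOVEMENTS
    (TRAIN_ID VARCHAR, EVENT_TYPE VARCHAR, EVENT_TS BIGINT)
    WITH (KAFKA_TOPIC='train-movements', VALUE_FORMAT='JSON');

CREATE STREAM CANCELLATIONS AS
    SELECT TRAIN_ID, EVENT_TS
    FROM TRAIN_MOVEMENTS
    WHERE EVENT_TYPE = 'CANCELLATION';
```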


Determining Your Isolation Level

Bob Pusateri explains different transaction isolation levels and where you can figure out your current settings:

Isolation level settings for the entire database exist only as a default. All connections created will inherit the database’s isolation level. Furthermore, a database only has two possible isolation levels:
– read committed
– read committed snapshot (also known as “RCSI”)

There are four additional isolation levels (read uncommitted, repeatable read, serializable, snapshot) which can only be selected at the connection or statement level. In SQL Server and Azure SQL Database Managed Instance, the default database isolation level is read committed. In Azure SQL Database, the default is read committed snapshot.

Read the whole thing.
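
If you want to check these settings yourself, the standard catalog views and session options cover it; nothing here goes beyond built-in functionality:

```sql
-- Database-level default: read committed vs. read committed snapshot.
SELECT name, is_read_committed_snapshot_on
FROM sys.databases;

-- The isolation level of the current connection.
DBCC USEROPTIONS;  -- look for the "isolation level" row

-- Overriding the default at the connection level.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
```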


Upgrading Servers in an Availability Group

Thomas Rushton has a checklist for upgrading servers connected by an Availability Group:

This is a checklist put together and followed for an upgrade of a pair of physical SQL Server 2012 servers which hosted a single availability group of several terabytes of data with minimal downtime.

The availability group was configured with synchronous commit and automatic failover.

Minimizing downtime here is great, but it’s not automatic: you still need to do work on your end to get this right.
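
One example of that work: before any planned failover on a synchronous-commit group, it’s worth confirming the secondaries really are synchronized. A small sketch, with the availability group name as a placeholder:

```sql
-- Check synchronization state per database, per replica.
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       drs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar ON drs.replica_id = ar.replica_id
JOIN sys.availability_groups AS ag ON drs.group_id = ag.group_id;

-- Then, run on the secondary being promoted:
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;
```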
