
Month: November 2022

Optimizing Async Sinks in Flink

Hong Liang Teoh speeds things up:

When designing a Flink data processing job, one of the key concerns is maximising job throughput. Sink throughput is a crucial factor because it can determine the entire job’s throughput. We generally want the highest possible write rate in the sink without overloading the destination. However, since the factors impacting a destination’s performance are variable over the job’s lifetime, the sink needs to adjust its write rate dynamically. Depending on the sink’s destination, it helps to tune the write rate using a different RateLimitingStrategy.

This post explains how you can optimise sink throughput by configuring a custom RateLimitingStrategy on a connector that builds on the AsyncSinkBase (FLIP-171). In the sections below, we cover the design logic behind the AsyncSinkBase and the RateLimitingStrategy, then we take you through two example implementations of rate limiting strategies, specifically the CongestionControlRateLimitingStrategy and TokenBucketRateLimitingStrategy.

Read on for some tips on creating a rate limiting strategy for a sink.
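To make the two ideas named in the excerpt concrete, here is a minimal, library-free sketch of a congestion-control (additive increase, multiplicative decrease) limiter and a token-bucket limiter. These are not Flink's CongestionControlRateLimitingStrategy or TokenBucketRateLimitingStrategy classes; every class, method, and parameter name below is hypothetical and only illustrates the general techniques.

```python
class AimdRateLimiter:
    """Hypothetical AIMD limiter: grow the in-flight limit on success, shrink it on failure."""

    def __init__(self, initial_limit=10, max_limit=100, increase=5, decrease_factor=0.5):
        self.limit = initial_limit            # max in-flight messages currently allowed
        self.max_limit = max_limit            # upper bound for the limit
        self.increase = increase              # additive step after a successful request
        self.decrease_factor = decrease_factor  # multiplicative cut after a failed request
        self.in_flight = 0

    def should_block(self, batch_size):
        # Block when sending this batch would exceed the current limit.
        return self.in_flight + batch_size > self.limit

    def on_send(self, batch_size):
        self.in_flight += batch_size

    def on_complete(self, batch_size, failed):
        self.in_flight -= batch_size
        if failed:
            # The destination pushed back: cut the limit, but never below 1.
            self.limit = max(1, int(self.limit * self.decrease_factor))
        else:
            # The destination kept up: probe for more capacity.
            self.limit = min(self.max_limit, self.limit + self.increase)


class TokenBucket:
    """Hypothetical token bucket: a fixed refill rate with a bounded burst size."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start with a full bucket
        self.last = 0.0                # timestamp of the last refill

    def try_acquire(self, n, now):
        # Refill according to elapsed time (the caller supplies the clock value).
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if n <= self.tokens:
            self.tokens -= n
            return True
        return False
```

The AIMD variant suits destinations whose capacity is unknown and shifts over time, because it keeps probing upward and backs off quickly on failure. The token bucket suits destinations with a known, fixed quota.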

Comments closed

The Importance of Proper Data Modeling in Power BI

Paul Turley avoids “big, wide tables”:

Power BI is architected to consume data in a dimensional model, with narrow fact tables and related dimensions. Introducing a big, wide table in a tabular model is extremely inefficient. It takes up space and memory resources, impacts performance, and complicates measure coding. Flattening records into a flat table is one of the worst things you can do in Power BI and a common mistake made by novice Power BI users.

This is a conversation I’ve had with many customers. We want our cake, and we want to eat it too. We want to have all the analytic capabilities, interactivity and high performance but we also want the ability to drill-down to a lot of details. What if we have a legitimate need to report on transaction details and/or a large table with many columns? It is well-known that the ideal shape is a star schema but what if we need to shape data for detail reporting? The answer is that you can have it both ways, but just not in one table.

Read on for a better model design (hint: the Kimball style) as well as several tips and tricks.


Automatic Partition Maintenance in Power BI

Shabnam Watson answers an attendee question:

During one of my presentations on Incremental Refresh (IR) in Power BI, someone asked what happens during a Power BI automatic partition maintenance window when Power BI has an opportunity to merge smaller partitions into larger ones. Does Power BI use the data that is already imported into Power BI for the smaller partitions and combine it into a bigger one, or does it re-read the data for those smaller partitions? For example, if a dataset has an IR policy to refresh the last 1 day, and it has read data for all the days in a previous month, one day for each, when the new month arrives, it has an opportunity to merge the smaller day partitions into a month partition for the previous month. Does it re-read the previous month’s data from the source again or does it use what it already has in its memory?

Read on for the answer.


Installing the SSIS 2022 Preview for Visual Studio

Koen Verbeeck does a bit of installation:

For those of you that have been working on an older version of SSIS/SQL Server (2014-2016, something like that), the BI components (SSIS/SSAS/SSRS) are now extensions in Visual Studio. SQL Server Data Tools (SSDT) is no longer available as a separate download.

So you’ll need a full-blown version of Visual Studio (make sure you only install the workloads you actually need). The good news is that you can use the community edition of VS if you’re just using it for BI development. Anyway, install VS 2022 on your machine and download the SSIS extension here.

Read on for the full installation process and a couple of warnings.


Extended Event Performance Metrics in SQL Server 2022

Mitchell Sternke looks at some new extended events:

Running Extended Event (XEvent) sessions on SQL Server has a cost. Since XEvents was designed for high performance, this is usually unnoticeable. However, it can become an issue depending on which events, actions, and other XEvent features are in use. New metrics available in SQL Server 2022, in Azure SQL Database, and in Azure SQL Managed Instance can help you understand the performance impact of using XEvents in your database.

The sys.dm_xe_session_events DMV (sys.dm_xe_database_session_events on Azure SQL Database) provides one row for each event found in an active session definition, allowing you to see all events that are currently publishing on your SQL Server instance. Four new columns have been added to help with troubleshooting performance:

Read on to learn more about these columns and what they can do for you.


Content Endorsement in Power BI

Soheil Bakhshi helps us find the best data:

One of the key aspects of users’ experience in Power BI is their ability to collaborate in creating and sharing content, making it an easy-to-use and convenient platform. But the convenience comes at a cost: large organisations end up with a lot of shared content, which raises concerns about the content’s quality and trustworthiness. It would be hard, if not impossible, to judge the quality of the content without a mechanism for doing so. Content endorsement is the answer to this.

“I’m Commander Shepard and this is my favorite dataset on the Power BI Citadel.”


MLflow 2.0 Now Available

Mike Cornell announces MLflow 2.0:

Today, we are thrilled to announce the availability of MLflow 2.0. Building upon MLflow’s strong platform foundation, MLflow 2.0 incorporates extensive user feedback to simplify data science workflows and deliver innovative, first-class tools for MLOps. Features and improvements include extensions to MLflow Recipes (formerly MLflow Pipelines) such as AutoML, hyperparameter tuning, and classification support, as well as modernized integrations with the ML ecosystem, a streamlined MLflow Tracking UI, a refresh of core APIs across MLflow’s platform components, and much more.

I like a lot of what MLflow does; it’ll be interesting to see how quickly different products adopt 2.0.


Fun with Decision Trees

Holger von Jouanne-Diedrich explains the value of decision trees, using predictive maintenance as an example:

Predictive Maintenance is one of the big revolutions happening across all major industries right now. Instead of changing parts regularly or even only after they have failed, it uses Machine Learning methods to predict when a part is going to fail.

If you want to get an introduction to this fascinating developing area, read on!

Click through for an example of how it works.


Reading Serverless SQL Pool Data with Data Factory

Koen Verbeeck wants to read from the serverless SQL pool in Azure Synapse Analytics:

We have some data we can query using the serverless SQL pools in Azure Synapse Analytics. For this blog post, I’m querying data that is stored in Azure Cosmos DB. Read the blog post How to Store Normalized SQL Server Data into Azure Cosmos DB to learn more about how that data got there.

Suppose I now want to read the data using Azure Data Factory. You can read data from Cosmos DB directly, but let’s pretend I want to do some transformations first using my favorite language: SQL. How can we do this?

Read on to learn how.


Hyperconverged Storage and Trace Flags

David Klee has a tip for us:

We all (should) know that running SQL Server in hyperconverged virtual environments, both on-premises and in the cloud, has some interesting trade-offs. The biggest is write latency from the hyperconverged storage platform underneath the database. We find that write latency suffers compared to traditional all-flash storage, even if the hyperconverged layer is all-flash as well, due to how the hyperconverged layer handles the larger block write that the SQL Server engine drops on it.

Read on for a trace flag which could help here.
