2022-11-29 – Curated SQL

Extracting JSON from a Spark DataFrame

Published 2022-11-29 by Kevin Feasel

Unmesha Sreeveni digs into some JSON:

Let’s see how we can extract a JSON object from a Spark DataFrame column

This is an example data frame

Unmesha takes it one step at a time, breaking down each element of the DataFrame and showing how it all works.
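
As a rough illustration of the idea, and not the code from the linked post: Spark ships functions such as `from_json` and `get_json_object` for pulling fields out of a JSON string column. The plain-Python sketch below mimics that on a list of rows so it runs without a cluster. The column name `payload`, the helper names, and the sample records are all made up for the example, and returning `None` for malformed input is a choice made for this sketch.

```python
import json

# Hypothetical rows standing in for a DataFrame with a JSON string column.
rows = [
    {"id": 1, "payload": '{"name": "alice", "address": {"city": "Oslo"}}'},
    {"id": 2, "payload": '{"name": "bob", "address": {"city": "Lima"}}'},
    {"id": 3, "payload": "not valid json"},
]


def extract_path(json_text, path):
    """Return the value at a dotted path, or None if parsing or lookup fails."""
    try:
        value = json.loads(json_text)
    except json.JSONDecodeError:
        return None
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def with_extracted(rows, column, fields):
    """Add one new key per requested path, leaving the original rows untouched."""
    result = []
    for row in rows:
        new_row = dict(row)
        for alias, path in fields.items():
            new_row[alias] = extract_path(row[column], path)
        result.append(new_row)
    return result


# Pull a top-level field and a nested field out of the JSON column.
flattened = with_extracted(rows, "payload", {"name": "name", "city": "address.city"})
for row in flattened:
    print(row["id"], row["name"], row["city"])
```

The dotted path plays the role that a JSON path expression or a declared schema would play in Spark; here it is only a convenience for the sketch.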

Unity Catalog in Azure Databricks

Published 2022-11-29 by Kevin Feasel

Meagan Longoria makes a recommendation:

Unity Catalog in Databricks provides a single place to create and manage data access policies that apply across all workspaces and users in an organization. It also provides a simple data catalog for users to explore. So when a client wanted to create a place for statisticians and data scientists to explore the data in their data lake using a web interface, I suggested we use Databricks with Unity Catalog.

Read on to learn more about what the Unity Catalog does.

Optimizing Async Sinks in Flink

Published 2022-11-29 by Kevin Feasel

Hong Liang Teoh speeds things up:

When designing a Flink data processing job, one of the key concerns is maximising job throughput. Sink throughput is a crucial factor because it can determine the entire job’s throughput. We generally want the highest possible write rate in the sink without overloading the destination. However, since the factors impacting a destination’s performance are variable over the job’s lifetime, the sink needs to adjust its write rate dynamically. Depending on the sink’s destination, it helps to tune the write rate using a different RateLimitingStrategy.

This post explains how you can optimise sink throughput by configuring a custom RateLimitingStrategy on a connector that builds on the AsyncSinkBase (FLIP-171). In the sections below, we cover the design logic behind the AsyncSinkBase and the RateLimitingStrategy, then we take you through two example implementations of rate limiting strategies, specifically the CongestionControlRateLimitingStrategy and TokenBucketRateLimitingStrategy.

Read on for some tips on creating a rate limiting strategy for a sink.
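
To make the idea of a dynamically adjusted write rate concrete, here is a minimal sketch of additive-increase/multiplicative-decrease (AIMD), a common congestion-control scheme. It is plain Python rather than Flink's Java API, and it is not taken from the linked post: the class name, its parameters, and the simulated destination capacity are assumptions made purely for illustration.

```python
class AimdRateLimiter:
    """Toy AIMD controller: grow the in-flight limit slowly, shrink it sharply."""

    def __init__(self, initial_limit, max_limit, increase=10, decrease_factor=0.5):
        self.limit = initial_limit
        self.max_limit = max_limit
        self.increase = increase
        self.decrease_factor = decrease_factor

    def on_success(self):
        # Additive increase: probe for more throughput a little at a time.
        self.limit = min(self.max_limit, self.limit + self.increase)

    def on_failure(self):
        # Multiplicative decrease: back off quickly, but never below 1.
        self.limit = max(1, int(self.limit * self.decrease_factor))


def simulate(limiter, capacity, rounds):
    """Pretend the destination rejects any batch larger than `capacity`."""
    history = []
    for _ in range(rounds):
        history.append(limiter.limit)
        if limiter.limit <= capacity:
            limiter.on_success()
        else:
            limiter.on_failure()
    return history


limiter = AimdRateLimiter(initial_limit=10, max_limit=500)
history = simulate(limiter, capacity=100, rounds=30)
print(history)
```

The limit climbs steadily until the simulated destination pushes back, then drops and climbs again, which is the sawtooth pattern typical of AIMD. A token-bucket strategy would instead cap the rate at a fixed refill speed regardless of feedback.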

The Importance of Proper Data Modeling in Power BI

Published 2022-11-29 by Kevin Feasel

Paul Turley avoids “big, wide tables”:

Power BI is architected to consume data in a dimensional model, with narrow fact tables and related dimensions. Introducing a big, wide table in a tabular model is extremely inefficient. It takes up space and memory resources, impacts performance, and complicates measure coding. Flattening records into a flat table is one of the worst things you can do in Power BI and a common mistake made by novice Power BI users.

This is a conversation I’ve had with many customers. We want our cake, and we want to eat it too. We want to have all the analytic capabilities, interactivity and high performance but we also want the ability to drill-down to a lot of details. What if we have a legitimate need to report on transaction details and/or a large table with many columns? It is well-known that the ideal shape is a star schema but what if we need to shape data for detail reporting? The answer is that you can have it both ways, but just not in one table.

Read on for a better model design (hint: the Kimball style) as well as several tips and tricks.
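
As a small illustration of the "narrow fact table plus related dimensions" shape, the sketch below splits a wide, flat record set into one dimension table and one fact table using plain Python. It is not from the linked post; the function name, the column names, and the sample data are invented for the example.

```python
# Hypothetical wide table: every sale repeats the full product description.
wide_rows = [
    {"order_id": 1, "product": "Widget", "category": "Tools", "qty": 2, "amount": 20.0},
    {"order_id": 2, "product": "Gadget", "category": "Toys", "qty": 1, "amount": 15.0},
    {"order_id": 3, "product": "Widget", "category": "Tools", "qty": 5, "amount": 50.0},
]


def split_star(rows, dim_columns, key_name):
    """Split rows into a dimension table (unique attribute combinations)
    and a fact table that refers to it through a surrogate key."""
    dim_keys = {}  # attribute tuple -> surrogate key
    dimension = []
    facts = []
    for row in rows:
        attrs = tuple(row[c] for c in dim_columns)
        if attrs not in dim_keys:
            dim_keys[attrs] = len(dim_keys) + 1
            dimension.append({key_name: dim_keys[attrs], **dict(zip(dim_columns, attrs))})
        # Keep only the non-dimension columns, plus the surrogate key.
        fact = {k: v for k, v in row.items() if k not in dim_columns}
        fact[key_name] = dim_keys[attrs]
        facts.append(fact)
    return dimension, facts


dim_product, fact_sales = split_star(wide_rows, ["product", "category"], "product_key")
print(dim_product)
print(fact_sales)
```

The descriptive text now lives once per product in the dimension, while the fact rows carry only keys and numbers, which is the general shape the excerpt recommends.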

Automatic Partition Maintenance in Power BI

Published 2022-11-29 by Kevin Feasel

Shabnam Watson answers an attendee question:

During one of my presentations on Incremental Refresh (IR) in Power BI, someone asked what happens during a Power BI automatic partition maintenance window when Power BI has an opportunity to merge smaller partitions into larger ones. Does Power BI use the data that is already imported into Power BI for the smaller partitions and combine it into a bigger one, or does it re-read the data for those smaller partitions? For example, if a dataset has an IR policy to refresh the last 1 day, and it has read data for all the days in a previous month, one day for each, when the new month arrives, it has an opportunity to merge the smaller day partitions into a month partition for the previous month. Does it re-read the previous month’s data from the source again or does it use what it already has in its memory?

Read on for the answer.

Installing the SSIS 2022 Preview for Visual Studio

Published 2022-11-29 by Kevin Feasel

Koen Verbeeck does a bit of installation:

For those of you that have been working on an older version of SSIS/SQL Server (2014-2016, something like that), the BI components (SSIS/SSAS/SSRS) are now extensions in Visual Studio. SQL Server Data Tools (SSDT) is no longer available as a separate download.

So you’ll need a full-blown version of Visual Studio (make sure you only install the workloads you actually need). The good news is that you can use the Community edition of VS if you’re just using it for BI development. Anyway, install VS 2022 on your machine and download the SSIS extension here.

Read on for the full installation process and a couple of warnings.

Extended Event Performance Metrics in SQL Server 2022

Published 2022-11-29 by Kevin Feasel

Mitchell Sternke looks at some new extended events:

Running Extended Event (XEvent) sessions on SQL Server has a cost. Since XEvents was designed for high performance, this is usually unnoticeable. However, it can become an issue depending on which events, actions, and other XEvent features are in use. New metrics available in SQL Server 2022, in Azure SQL Database, and in Azure SQL Managed Instance can help you understand the performance impact of using XEvents in your database.

The sys.dm_xe_session_events DMV (sys.dm_xe_database_session_events on Azure SQL Database) provides one row for each event found in an active session definition, allowing you to see all events that are currently publishing on your SQL Server instance. Four new columns have been added to help with troubleshooting performance:

Read on to learn more about these columns and what they can do for you.

Content Endorsement in Power BI

Published 2022-11-29 by Kevin Feasel

Soheil Bakhshi helps us find the best data:

One of the key aspects of users’ experience in Power BI is their ability to collaborate in creating and sharing content, making it an easy-to-use and convenient platform. But that convenience comes at a cost: large organisations end up with a lot of shared content, which raises concerns about its quality and trustworthiness. Without a dedicated mechanism, it would be hard, if not impossible, to judge the quality of that content. Content endorsement is the answer to this.

“I’m Commander Shepard and this is my favorite dataset on the Power BI Citadel.”

Day: November 29, 2022