Curated SQL – Page 408 – A Fine Slice Of SQL Server

WITHIN GROUP in STRING_AGG()

Published 2023-09-26 by Kevin Feasel

When was the last time you wrote a SQL query and knew something was possible but just couldn’t remember how? I had one of those moments this week with STRING_AGG and ordering data, and although it was frustrating, I knew it would make a worthwhile blog post. Let’s look at some examples using STRING_AGG and WITHIN GROUP (aka the clause that slipped my mind).

There’s a perfectly good reason why WITHIN GROUP might slip your mind: STRING_AGG() is known as an ordered set function (versus a window function which uses an OVER() clause). It’s also the only ordered set function SQL Server supports, so you don’t get too many opportunities to use the key phrase.

Comments closed

Setting a Spark Compute Pool Size in Microsoft Fabric

Published 2023-09-26 by Kevin Feasel

Reitse Eskens manages compute pools:

This next blog won’t be a long one and will probably serve most as a reminder for myself where to find the settings for the Spark compute pool.

When you create a workspace, you get the default starter pool and it has taken me way longer than I care to admit to find where to find the setting and, more importantly, how to change it.

Read on to learn more about how to create a Spark pool of the size you desire. The sizing method is essentially the same as with Azure Synapse Analytics.

Comments closed

Generating Reproducible Reports with Jupyter and Quarto

Published 2023-09-25 by Kevin Feasel

Parisa Gregg and Myles Mitchell don’t need to copy and paste for their TPS reports:

Quarto is a free-to-use, open-source software based on Pandoc that enables users to convert plain text files into a range of formats, including PDF, HTML and powerpoint presentations. These documents can contain a mixture of narrative text, Python code, and figures that are dynamically generated by the embedded code.

This has many use-cases:

Your company may have a weekly board meeting to go over the latest sales figures. By having a Quarto presentation that pulls in the latest company sales data, you can regenerate the presentation slides each week at the click of a button.

As a researcher you may be preparing a report for publication. By having the code that generates data tables and figures embedded within the report, regenerating the draft as the experimental data floods in is a breeze!

Read on for a fun example of how you could automated a research-driven report.

Comments closed

Creating Confidence Intervals on a Linear Model in R

Published 2023-09-25 by Kevin Feasel

Steven Sanderson goes frequentist on us:

Linear regression is a fundamental statistical technique used to model the relationship between a dependent variable and one or more independent variables. While fitting a linear model is relatively straightforward in R, it’s also essential to understand the uncertainty associated with our model’s predictions. One way to visualize this uncertainty is by creating confidence intervals around the regression line. In this blog post, we’ll walk through how to perform linear regression and plot confidence intervals using base R with the popular Iris dataset.

Click through to see how, even if you’re a Bayesian who considers confidence intervals to overstate precision in reality.

Comments closed

Maintaining Existing Power BI Data while Loading More with Fabric

Published 2023-09-25 by Kevin Feasel

Chris Webb looks back on an older post:

To be honest I’m slightly ashamed of this fact because, as I say in the post, the solution I describe is a bit of a hack – but at the same time, the post is popular because a lot of people have the problem of needing to add new data to the data that’s already there in their Power BI dataset and there’s no obvious way of doing that. As I also say in that post, the best solution is to stage the data in a relational database or some other store outside Power BI so you have a copy of all the data if you ever need to do a full refresh of your Power BI dataset.

Why revisit this subject? Well, with Fabric it’s now much easier for you as a Power BI developer to build that place to store a full copy of your data outside your Power BI dataset and solve this problem properly.

Read on for an example of the new solution.

Comments closed

Dynamic Highlighting of Data Points Based on Slicer Selection

Published 2023-09-25 by Kevin Feasel

Nikola Ilic shines a light on the data:

To quickly explain: when a user selects, for example, Contoso in the slicer, the Contoso bar should be highlighted by using a different color. As much as this sounds like a very basic and common business request, there is no straightforward solution in Power BI (or, at least, I’m not aware of it:))

However, the client’s wish is (almost) always our command – so, let’s see how this feature can be implemented with a little bit of data model tweaking and leveraging some DAX code.

Spoilers: there is a solution, though it does involve quite a few steps.

Comments closed

Azure SQL Managed Instance Business Critical and Port Ranges

Published 2023-09-25 by Kevin Feasel

Jose Manuel Jurado Diaz opens a few ports:

Our customer began to encounter Error 10060 after scaling up to the Business Critical tier. Upon investigation, we realized that the “Proxy” connection type was working, but the “Redirect” type was causing issues.

Read on to see why this happened, as well as what the fix was.

Comments closed

Attacks on Row-Level Security

Published 2023-09-25 by Kevin Feasel

Ben Johnston continues a series on row-level security in SQL Server:

As mentioned in previous sections, RLS is an addition to security and should not be used as the primary method to limit access to data. It is a supplementary layer, useful in specific scenarios. There are also instances where RLS can be defeated by an unauthorized user. The attacks listed below are broken down into direct attacks, indirect attacks, and side-channel attacks. The categorizations could be changed, but the important part of each is the vulnerability discussed.

The one scenario I’m a bit surprised about is the divide by zero attack, as I had figured the filter predicate would apply before the computation leading to a divide by zero scenario.

Comments closed

Using DAX to Find Products Missing Sales

Published 2023-09-25 by Kevin Feasel

Marco Russo and Alberto Ferrari observe the dog that didn’t bark:

What products did not sell in a specific area, store, or time period? This may be an important analysis for several businesses. There are multiple ways to obtain the desired result. Some specific implementations might be needed because of the user or model requirements, whereas developers can choose any formula in several cases. Or you might just find a solution on the web and blindly implement it without questioning whether there is a better way to achieve what you want.

It turns out that different formulas perform very differently. Choosing the right one in your scenario can make a slow report fast. This article analyzes the performance of different formulations of one same algorithm.

It’s interesting to see the performance profile here: most are reasonably close together, although you can still get a 2x gain from using the fastest approach versus the second-slowest. And then there’s the slowpoke.

Comments closed

Kafka Message Compression

Published 2023-09-22 by Kevin Feasel

Lucia Cerchie and Dave Troiano give us the rundown on compression of individual messages in Apache Kafka:

Topic partitions are the main “unit of parallelism” in Kafka. What’s a unit of parallelism? It’s like having multiple cashiers in the same store instead of one. Multiple purchases can be made at once, which increases the overall amount of purchases made in the same amount of time (this is like throughput). In this case, the cashier is the unit of parallelism.

In Kafka, each partition leader can live on a different broker in a cluster, and a producer can send multiple messages, each with a different destination topic partition; that is, a producer can send them in parallel. While this is the main reason Kafka enables high throughput, compression can also be a tool to help improve throughput and efficiency by reducing network traffic due to smaller messages. A well-executed compression strategy also means better disk utilization in Kafka, since stored messages on disk are smaller.

Click through for the various options and some guidance on using each.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Curated SQL Posts