Press "Enter" to skip to content

Author: Kevin Feasel

Installing Third-Party WHL Packages in Synapse with DEP

Sabyasachi Samaddar walks through what I consider a real difficulty:

It is really challenging when you need to install third-party .whl packages into a DEP-enabled Azure Synapse Spark Instance.

According to the documentation, https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-librar… Installing packages from PyPI is not supported within DEP-enabled workspaces. Hence we cannot just upload the .whl packages into the workspace. We need to upload all the dependencies along with the .whl package and it will be an offline installation. Now Synapse spark clusters come with in-built packages and hence we may find some conflicts when we try to install some third-party packages.

Read on to see what you need to do.

Comments closed

Using the T-SQL OUTPUT Clause

Chad Callihan doesn’t make two calls:

Are you familiar with the OUTPUT clause in SQL Server? It can be used with commands like INSERT, UPDATE, DELETE, and MERGE. It can also be used with setting variables in stored procedures. Using the tried and true StackOverflow2013, we’ll narrow it down today to focus on how INSERT/DELETE are typically used for logging table changes as well as an example of how to use OUTPUT with stored procedures.

For really busy transactional systems, this provides a nice boost over making an update and then selecting the new values.

Comments closed

Snapshot Fact Tables in a Data Warehouse

Alex Crampton explains how snapshot fact tables work in data warehousing:

The typical fact table measures activities and is known as a transaction fact table. They support a wide variety of analytic possibilities and can be used to capture detailed information about a particular process. Certain facts cannot be studied easily using this kind of design, if at all.

This blog will outline the characteristics of a transaction fact table vs those of a snapshot fact table, and when the need for a snapshot fact table arises.

Snapshot-based fact tables aren’t ideal for data load times (especially as the table gets large) but they are useful in specific circumstances, as Alex points out.

Comments closed

Difficulties around A/B Testing

John Cook asks which is clearer, 1 or 2? 3 or 4? 4 or 6?

One problem with A/B testing is that your results may depend on the order of your tests.

Suppose you’re testing three options: XY, and Z. Let’s say you have three market segments, equal in size, each with the following preferences.

This is known as the Condorcet paradox of voting.

John also introduces the problem of interaction effects:

Suppose you’re debating between putting a photo of a car or a truck on your web site, and you’re debating between whether the vehicle should be red or blue. You decide to use A/B testing, so you test whether customers prefer a red truck or a blue truck. They prefer the blue truck. Then you test whether customers prefer a blue truck or a blue car. They prefer the blue truck.

Maybe customers would prefer a red car best of all, but you didn’t test that option. By testing vehicle type and color separately, you didn’t learn about the interaction of vehicle type and color. 

Click through for both posts as well as some good insights.

Comments closed

Working with strcat in KQL

Robert Cain has a post dedicated to the strcat() function in KQL:

The strcat function has been shown in previous articles, but it’s so useful it deserves a post all of its own.

As usual, the samples in this post will be run inside the LogAnalytics demo site found at https://aka.ms/LADemo. This demo site has been provided by Microsoft and can be used to learn the Kusto Query Language at no cost to you.

Read on to (re-)learn the power of string concatenation, in Kusto form.

Comments closed

Parallel Loading of Tables in Power BI Dataset Refresh

Chris Webb hits the turbo button:

Do you have a a large dataset in Power BI Premium or Premium Per User? Do you have more than six tables that take a significant amount of time to refresh? If so, you may be able to speed up the performance of your dataset’s refresh by increasing the number of tables that are refreshed in parallel, using a feature that was released in August 2022 but which you may have missed.

Click through for that tip.

Comments closed

Azure Synapse Analytics R Language Support

Ryan Majidimehr has a short list of updates for Azure Synapse Analytics but it includes a big one:

Azure Synapse Analytics provides built-in R support for Apache Spark. As part of this, data scientists can leverage Azure Synapse Analytics notebooks to write and run their R code. This also includes support for SparkR and SparklyR, which allows users to interact with Spark using familiar Spark or R interfaces. To learn more read the official how-to Use R for Apache Spark with Azure Synapse Analytics (Preview).

That it took this long for R support was a bit weird, but I’m glad it’s there now.

Comments closed