Press "Enter" to skip to content

Author: Kevin Feasel

Sharing Results between Notebooks with MSSparkUtils

Liliam Leme provides an answer to a common Synapse Spark pool question:

I’ve been reviewing customer questions centered around “Have I tried using MSSparkUtils to solve the problem?”

One of the questions asked was how to share results between notebooks. Every time you hit “run” in a notebook, it starts a new Spark cluster which means that each notebook would be using different sessions. Making it impossible to share results between executions of notebooks. MSSparkUtils offers a solution to handle this exact scenario. 

Read on to see what MSSparkUtils is and how it helps in this case.

Comments closed

File Seeding with dbt

Ust Oldfield sneaks in a file:

File seeding is a useful way of maintaining and deploying static, or very slowly changing, reference data for a data model to use repeatably and reliably across environments, whilst benefitting from source control.

If you aren’t in the Databricks world, this also feels like a job for DVC.

Comments closed

Becoming Familiar with MLflow

Tomaz Kastrun continues an advent series on Azure ML:

MLFlow is an open-source framework for registering, managing and tracking machine learning models. It is multiplatform, bringing consistent model training and model consumption across different platforms. This means, that training a model locally and uploading it to Azure or training a model on remote compute instances and downloading it, is a great feature for MLflow.

You can use MLflow with Azure CLI, Azure Python SDK or in the studio and it will deliver a consistent experience (note, some functionalities are limited to the language).

Click through for a quick overview of MLflow.

Comments closed

Running Power BI Report Server

Reza Rad stays on-premises:

Power BI is not only a cloud-based reporting technology. Due to the demand for some businesses to have their data and reporting solutions on-premises, Power BI also has the option to be deployed fully on-premises. Power BI on-premises hosting is called Power BI Report Server. This post concerns using Power BI in a fully on-premises solution with Power BI Report Server.

This post will teach you everything you need about the on-premises world of Power BI. You will learn how to install Power BI Report Server, learn all requirements and configurations for the Power BI Report Server to work correctly, and see all the pros and cons of this solution. At the end of this post, you will be able to decide if Power BI on-premises is the right choice for you, and if it is, then you will be able to set a Power BI on-premises solution up and running easily.

I used Power BI Report Server for a few years. My short version is that it’s really useful if you aren’t allowed to use Power BI Online (as was my case) but if you know what’s in the Online version, you’ll see just how much you’re missing out on.

Comments closed

Checking if Cross-Database Ownership Chaining is On

Tom Collins performs a check:

Cross-database ownership chaining is a SQL Server  security feature allowing database users  access to other database objects hosted on the same SQL server instance, in the case where database  users don’t have access granted explicitly

Tom shows us whether it is on as well as how to enable it. I’d recommend not enabling it at all and using module signing instead.

Comments closed

Fixing VertiPaq Analyzer Dictionary Size Errors

Marco Russo troubleshoots an issue:

There are cases where the dictionary size reported by VertiPaq Analyzer (used by DAX Studio, Bravo for Power BI, and Tabular Editor 3) does not correspond to the actual memory required by the dictionary. However, the number reported is technically correct because it represents the memory currently allocated for the dictionary. The issue is that – after a refresh – this memory amount is larger than the actual memory required for hash-encoded columns.

Read on to learn what the consequences are and how you can resolve this in Power BI Desktop as well as in Analysis Services.

Comments closed

Synapse Runtime for Spark 3.3 Now in Public Preview

Estera Kot has an announcement:

We are excited to announce the preview availability of Apache Spark™ 3.3 on Synapse Analytics. The essential changes include features which come from upgrading Apache Spark to version 3.3.1 and upgrading Delta Lake to version 2.1.0.

Check out the official release notes for Apache Spark 3.3.0 and Apache Spark 3.3.1 for the complete list of fixes and features. In addition, review the migration guidelines between Spark 3.2 and 3.3 to assess potential changes to your applications, jobs and notebooks.

There’s a lot in there, though I did snicker a bit at log4j 2 being more secure than log4j v1 given what we saw last year, though that gaping hole was fixed.

Comments closed

Bringing Order to a Columnstore Index

Tibor Karaszi puts columnstore ducks in a row:

Data for a columnstore index is divded in groups of approximate 1 million rows, rowgroups. Each rowgroup has a set of pages for each column. The set of pages for a column in a rowgroup is called a segment. SQL Server has meta-data for the lowest and highest value for a segment. There are no SEEKs in a columnstore index. But, SQL Server can use this meta-data to skip reading segments, with the knowledge that “this segment cannot contain any data that I need based on my predicates in my WHERE clause”.

Also, you might want to do these operations using MAXDOP 1, so we don’t have several threads muddling our neat segment alignment.

I’m not sure I actually set the ORDER BY clause on columnstore indexes all that often—a quick mental survey says maybe once, though that could be my own failing rather than a statement on the utility of ordered columnstore indexes.

Comments closed

Encryption Recommendations for SQL Server

Matthew McGiffen shares some thoughts:

We looked earlier at the various places our data lives – on disk, in memory, in-transit across the network. A good strategy will protect all of these locations – sometimes with multiple layers. Exactly how you go about doing that for your applications may vary – but as long as you are on SQL Server 2016 or higher there is a default strategy that you should consider. This combines a number of the available SQL Server features to provide the best protection.

This story might look a bit different for Azure SQL Database (though you can still use Always Encrypted there) and will look very different on Azure Synapse Analytics dedicated SQL pools, which don’t have Always Encrypted at all.

Comments closed

Window Functions in DAX

Jeffrey Wang is speaking my language:

The December 2022 release of Power BI Desktop includes three new DAX functions: OFFSETINDEX, and WINDOW. They are collectively called window functions because they are closely related to SQL window functions, a powerful feature of the SQL language that allows users to perform calculations on a set of rows that are related to the current row. Because these functions are often used for data analysis, they are sometimes called analytical functions. In contrast, DAX, a language invented specifically for data analysis, had been missing similar functionalities. As a result, users found it hard to write cross-row calculations, such as calculating the difference of the values of a column between two rows or the moving average of the values of a column over a set of rows.

Read on to learn more about how these functions work and how they differ from their SQL Server counterparts.

Comments closed