Press "Enter" to skip to content

Curated SQL Posts

Synapse Runtime for Spark 3.3 Now in Public Preview

Estera Kot has an announcement:

We are excited to announce the preview availability of Apache Spark™ 3.3 on Synapse Analytics. The essential changes include features which come from upgrading Apache Spark to version 3.3.1 and upgrading Delta Lake to version 2.1.0.

Check out the official release notes for Apache Spark 3.3.0 and Apache Spark 3.3.1 for the complete list of fixes and features. In addition, review the migration guidelines between Spark 3.2 and 3.3 to assess potential changes to your applications, jobs and notebooks.

There’s a lot in there, though I did snicker a bit at log4j 2 being more secure than log4j v1 given what we saw last year, though that gaping hole was fixed.

Comments closed

Bringing Order to a Columnstore Index

Tibor Karaszi puts columnstore ducks in a row:

Data for a columnstore index is divded in groups of approximate 1 million rows, rowgroups. Each rowgroup has a set of pages for each column. The set of pages for a column in a rowgroup is called a segment. SQL Server has meta-data for the lowest and highest value for a segment. There are no SEEKs in a columnstore index. But, SQL Server can use this meta-data to skip reading segments, with the knowledge that “this segment cannot contain any data that I need based on my predicates in my WHERE clause”.

Also, you might want to do these operations using MAXDOP 1, so we don’t have several threads muddling our neat segment alignment.

I’m not sure I actually set the ORDER BY clause on columnstore indexes all that often—a quick mental survey says maybe once, though that could be my own failing rather than a statement on the utility of ordered columnstore indexes.

Comments closed

Window Functions in DAX

Jeffrey Wang is speaking my language:

The December 2022 release of Power BI Desktop includes three new DAX functions: OFFSETINDEX, and WINDOW. They are collectively called window functions because they are closely related to SQL window functions, a powerful feature of the SQL language that allows users to perform calculations on a set of rows that are related to the current row. Because these functions are often used for data analysis, they are sometimes called analytical functions. In contrast, DAX, a language invented specifically for data analysis, had been missing similar functionalities. As a result, users found it hard to write cross-row calculations, such as calculating the difference of the values of a column between two rows or the moving average of the values of a column over a set of rows.

Read on to learn more about how these functions work and how they differ from their SQL Server counterparts.

Comments closed

Encryption Recommendations for SQL Server

Matthew McGiffen shares some thoughts:

We looked earlier at the various places our data lives – on disk, in memory, in-transit across the network. A good strategy will protect all of these locations – sometimes with multiple layers. Exactly how you go about doing that for your applications may vary – but as long as you are on SQL Server 2016 or higher there is a default strategy that you should consider. This combines a number of the available SQL Server features to provide the best protection.

This story might look a bit different for Azure SQL Database (though you can still use Always Encrypted there) and will look very different on Azure Synapse Analytics dedicated SQL pools, which don’t have Always Encrypted at all.

Comments closed

Paginated Reports in Power BI Pro

Andy Jones has some good news for us:

Microsoft recently announced that Power BI Paginated Reports is now available to users with a Pro licence. Previously, you required a Premium licence to take advantage of Paginated Reports in Power BI.

Read on to learn how paginated reports work in Power BI (hint: almost like SSRS). Even with Power BI Premium Per User being a fairly reasonable $20 per user per month, it’s nice to see Pro get a bit more value.

Comments closed

Isolated Spark Testing with lakeFS

Adi Polak demonstrates lakeFS:

This tutorial demonstrates how to build a development and testing environment for validating your logic on a full-blown production data volume and variety, working with lakeFS and Spark. You will walk through the journey of creating a repository and building a Spark application while using lakeFS capabilities. You will learn how to data changes, revert them in cases of mistakes or other hiccups, and lately merge separate branches to reflect data changes from the isolated environments.

Not too long ago, I had a couple conversations with developers and data engineers about decentralized development and devs having their own environments and data. This seems like it would be a good approach to that common problem, and it works for Azure Synapse Analytics as well.

Comments closed

AutoML and Model Registration in AML

Tomaz Kastrun continues an advent of Azure Machine Learning. Day 13 covers the topic of Automated ML:

Automated ML is a no-code automated machine learning task. It iterates over many combinations of algorithms and hyperparameters in order to find the best model for your dataset and your prediction variable(s). The final solution is a model, that can be downloaded and later reused. So Automated ML is not just giving you the best model out of a family of algorithms, but lets you use the model, generate the scripts and create the artefacts.

Day 14 concerns model registration:

Important asset is the “Models” in navigation bar. This feature allows you to work with different model types -> custom, MLflow, and Triton. What you do here is, you register a model from different locations (e.g.: local file, AzureML Datastore, AzureML Job, MLflow Job, Model asset in AzureML workspace, and Model asset in AzureML Registry).

Once you open the Models asset, you will see, that you can do many things here. I have already model register from the running the notebook on day4.

Comments closed

Column Exclusion and Rename in Snowflake

Kevin Wilkie plays duck-duck-goose with columns:

With Snowflake, we could do many different things that we’re not used to seeing with a SELECT statement. We’re all used to seeing this – SELECT * and it shows all kinds of columns.

With Snowflake, we can tell Snowflake NOT to show certain columns by using the EXCLUDE operator.

Read on to see how it works and specific requirements around operation. In addition, Kevin shows a way to perform aliasing.

Comments closed

Identifying Rows in sp_wait_for_database_copy)_sync

Jose Manuel Jurado Diaz troubleshoots a problem:

As you could see in this public documentation Auto-failover groups overview & best practices – Azure SQL Database | Microsoft Learn about sp_wait_for_database_copy_sync “sp_wait_for_database_copy_sync prevents data loss after geo-failover for specific transactions, but does not guarantee full synchronization for read access. The delay caused by a sp_wait_for_database_copy_sync procedure call can be significant and depends on the size of the not yet transmitted transaction log on the primary at the time of the call.”

Our customer asked about several scenarios to understand this behaviour and also, verify if there is possible to identify the rows that have not been synced. For this, I developed a POC to test it:

Read on to see what you’d need to do.

Comments closed