
July 2022

Python UDFs in Databricks SQL

Martin Grund, et al., announce a new preview feature in Databricks:

To define the Python UDF, all you have to do is issue a CREATE FUNCTION SQL statement. This statement defines a function name, input parameters and types, specifies the language as PYTHON, and provides the function body between $$.

The function body of a Python UDF in Databricks SQL is equivalent to a regular Python function, with the UDF itself returning the computation’s final value. Dependencies from the Python standard library and Databricks Runtime 10.4, such as the json package in the above example, can be imported and used in your code. You can also define nested functions inside your UDF to encapsulate code to build or reuse complex logic.

I think my biggest concern here would be performance, though I say that without having used the feature.
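For a sense of the shape, here is a minimal sketch of such a UDF, loosely modeled on the json-based example the excerpt alludes to; the function name, keys, and redaction logic are illustrative rather than the post's exact code:

```sql
-- A sketch of a Python UDF in Databricks SQL (illustrative names).
CREATE FUNCTION redact(a STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
import json

# Hypothetical list of keys to mask in the incoming JSON document
keys = ["email", "phone"]
obj = json.loads(a)
for k in obj:
    if k in keys:
        obj[k] = "REDACTED"
return json.dumps(obj)
$$;

-- Once created, it is called like any built-in function:
SELECT redact('{"name": "Ada", "email": "ada@example.com"}');
```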


The Cure to Scalar Functions

Tom Zika has a cure:

In the first two parts, we have seen why scalar functions (UDFs) are a problem for performance. So how do we deal with it now that we know it’s a problem?

There is only one solution:

I say we take off and nuke the entire site from orbit. It’s the only way to be sure.
— Ellen Ripley

Tom’s ideas are intriguing to me and I wish to subscribe to his newsletter.
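If the orbital-strike option is off the table, the standard rewrite (not necessarily Tom's; click through for his take) is to convert the scalar UDF into an inline table-valued function, which the optimizer can expand into the calling query instead of executing it row by row. A sketch with illustrative table and column names:

```sql
-- Before: a scalar UDF the optimizer treats as an opaque, per-row call.
CREATE OR ALTER FUNCTION dbo.CustomerFullName (@CustomerId int)
RETURNS nvarchar(201)
AS
BEGIN
    RETURN (SELECT FirstName + N' ' + LastName
            FROM dbo.Customer
            WHERE CustomerId = @CustomerId);
END;
GO

-- After: an inline table-valued function, expanded into the calling query.
CREATE OR ALTER FUNCTION dbo.CustomerFullNameInline (@CustomerId int)
RETURNS TABLE
AS RETURN
    SELECT FirstName + N' ' + LastName AS FullName
    FROM dbo.Customer
    WHERE CustomerId = @CustomerId;
GO

-- The call site changes from a scalar expression to an APPLY:
SELECT o.OrderId, c.FullName
FROM dbo.Orders AS o
OUTER APPLY dbo.CustomerFullNameInline(o.CustomerId) AS c;
```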


Row Mode vs Batch Mode Plan Timings

Paul White notes a change in SQL Server 2022:

Batch mode plans report CPU and elapsed time spent within each individual operator.

Row mode plans report cumulative CPU and elapsed time for each operator including all its child operators.

This is confusing enough, but modern versions of SQL Server can produce a mixture of batch mode and row mode operators in the same execution plan.

But there are some nice changes in SQL Server 2022 to improve the situation a bit.
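If you want to poke at those counters yourself, they live in the actual plan XML. A rough sketch, assuming SQL Server 2019 or later with the LAST_QUERY_PLAN_STATS database-scoped configuration enabled so that sys.dm_exec_query_plan_stats can return the last actual plan:

```sql
-- Shred per-operator runtime counters out of the last actual plan.
-- Remember the wrinkle above: for row mode operators these numbers are
-- cumulative (operator plus children); batch mode operators report only
-- their own time.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    n.value('(@NodeId)[1]', 'int')              AS node_id,
    n.value('(@PhysicalOp)[1]', 'varchar(60)')  AS physical_op,
    rt.value('(@ActualElapsedms)[1]', 'bigint') AS elapsed_ms,
    rt.value('(@ActualCPUms)[1]', 'bigint')     AS cpu_ms
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan_stats(qs.plan_handle) AS ps
CROSS APPLY ps.query_plan.nodes('//RelOp') AS q(n)
OUTER APPLY n.nodes('./RunTimeInformation/RunTimeCountersPerThread') AS r(rt);
```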


Against Next-Next-Next Installs

Lee Markup explains why you want to pay attention during SQL Server installations:

There’s one sure-fire way to set yourself and your applications up for failure when using SQL Server. I’ve seen all kinds of people do this.

It will cost you performance and create maintenance headaches down the road.

Let me explain what it is and why it causes trouble.

Click through for some of the specific problems you might run into.


Reviewing Oracle Database Service on Azure

Kellyn Pot’vin-Gorman has a tough talk:

If we were to ask any DBA to separate the database in one cloud and the application tier in another without the context of a marketing announcement, they would look at us like we’d grown a third head. I’m incredibly surprised that anyone even considers the OCI Interconnect for this use, let alone the 150 that are currently using it. Oracle applications like E-Business Suite, PeopleSoft, JD Edwards, and Hyperion are incredibly sensitive to network latency, and recommending separating their tiers into two separate clouds is just alien to me. When we deploy these in Azure, we place all tiers in a proximity placement group to let Azure know that they are connected; this ensures that when a resource comes online after changes, redeployments, etc., the resources stay close to each other.

Definitely worth a read.


sp_PressureDetector Updates

Erik Darling continues to be busy:

I haven’t run into a bug with this thing in just about… Well, ever. This proc has been pretty solid.

As I use it though, especially in front of clients, I find myself wanting some additional data points all in one place. With this release, I’ve added a few small things.

Click through for the list of updates.
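If you haven't used it, the barrier to entry is low; assuming you've installed the procedure (it lives in Erik's DarlingData repository) into the current database under dbo, a default run is just:

```sql
-- Run sp_PressureDetector with defaults to get CPU and memory pressure
-- signals in one pass. Schema and name assumed to be dbo.sp_PressureDetector.
EXEC dbo.sp_PressureDetector;
```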


Power BI and Synapse Book Roundup

Chris Webb checks out some books:

I like free stuff and I like books, so of course I like free books – and it seems that the more I provide free publicity for relevant books here the more free books I get sent. I’ve now got enough to merit writing another post covering those I’ve received recently from various publishers and authors. As always these are not reviews, just short summaries of books you might want to check out.

Read on for the not-reviews.


Pre-Processing Data Explorer Data with Spark

Hauke Mallow does some data engineering:

We often see customer scenarios where historical data has to be migrated to Azure Data Explorer (ADX). Although ADX has very powerful data-transformation capabilities via update policies, sometimes more or less complex data engineering tasks must be done upfront. This happens if the original data structure is too complex or if single data elements are too big, hitting Data Explorer’s limits of 1 MB for dynamic columns or a maximum ingest file size of 1 GB for uncompressed data (see also Comparing ingestion methods and tools).

Let’s think about an Industrial Internet-of-Things (IIoT) use case where you get data from several production lines. On each production line, several devices read humidity, pressure, etc. The following example shows a scenario where a one-to-many relationship is implemented within an array. With this, you might get very large columns (with millions of device readings per production line) that can exceed the 1 MB limit in Azure Data Explorer for dynamic columns. In this case, you need to do some pre-processing.

Click through to see how you can do this with an Azure Synapse Analytics Spark pool prior to ingesting it with a Data Explorer pool.
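As a rough illustration of the flattening step (all table, view, and column names here are hypothetical), exploding the array in Spark SQL turns one enormous dynamic column into one row per reading, each comfortably under the 1 MB limit:

```sql
-- Flatten a one-to-many readings array before ingesting into ADX.
CREATE OR REPLACE TEMPORARY VIEW telemetry_flat AS
SELECT
    t.productionLineId,
    r.deviceId,
    r.humidity,
    r.pressure,
    r.eventTime
FROM raw_telemetry AS t
LATERAL VIEW explode(t.readings) exploded AS r;
```

From there, the flattened view can be written out (e.g., as Parquet) and ingested into Data Explorer without tripping the dynamic-column limit.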


Replacing Error Values on All Columns in Power BI

Kristyna Hughes needs to clear out some errors:

End users that are not trained in data governance but are actively involved in maintaining a data set can easily make data entry mistakes or create inconsistent data types within columns. For example, you may have a column in the dataset called “Sales” and instead of 0, someone may type “None” or “NA”. When this gets loaded into Power BI, Power BI will not know how to convert the text value “None” to a number, and it will throw an error on the refresh of the report.

One way to mitigate the impact of user-entered data is to replace errors with null values. This is not ideal since it doesn’t fix the data entry issues, but it does enable reports to still be refreshed and used while the data issues are addressed. In Power Query, you can manually replace the errors with null by going to the “Transform” tab then selecting the drop down for “Replace Values” and choosing “Replace Errors”.

Click through for a script which does this for all columns in Power Query.


CETAS and the Serverless SQL Pool

Liliam Leme takes us through the Create External Table as Select operation in the Azure Synapse Analytics serverless SQL pool:

Serverless SQL pool has a very cool way to handle CREATE EXTERNAL TABLE AS SELECT (CETAS). I mean, you can use a serverless SQL pool to create a CETAS, which will materialize the query results. Think of a heavy query whose results will, for example, feed a future large join with other queries, or aggregations that can be consolidated for reporting purposes. Those scenarios are good candidates for a CETAS. The result will be stored as Parquet files in your storage, accessed via an external table, and the performance is awesome. Later, I suggest how to recreate the CETAS using a pipeline against the serverless SQL pool.

Click through to see it in action.
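The shape of the statement is worth seeing. A minimal sketch, assuming an external data source and a Parquet file format have already been created (all object names and paths below are placeholders):

```sql
-- Materialize a heavy aggregation as Parquet files behind an external table.
CREATE EXTERNAL TABLE dbo.SalesByCustomer
WITH (
    LOCATION    = 'curated/sales_by_customer/',
    DATA_SOURCE = MyDataLake,   -- an existing EXTERNAL DATA SOURCE
    FILE_FORMAT = ParquetFF     -- an existing EXTERNAL FILE FORMAT (PARQUET)
)
AS
SELECT CustomerId, SUM(Amount) AS TotalAmount
FROM OPENROWSET(
        BULK 'raw/sales/*.parquet',
        DATA_SOURCE = 'MyDataLake',
        FORMAT = 'PARQUET'
     ) AS src
GROUP BY CustomerId;
```

Subsequent queries against dbo.SalesByCustomer read the materialized Parquet files instead of re-running the aggregation.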
