Author: Kevin Feasel

Generating a Multi-Aggregate Pivot in Spark

Richard Swinbank troubleshoots an issue:

I’m using a stream watermark to handle late arriving data – basically, my watermark enables the stream to accept data arriving up to 10 seconds late … and that’s where the problem shows up.

When I run this streaming query – in Azure Databricks I can do this simply with display(df_pivot) – I receive the error:

AnalysisException: Detected pattern of possible ‘correctness’ issue due to global watermark. The query contains stateful operation which can emit rows older than the current watermark plus allowed late record delay, which are “late rows” in downstream stateful operations and these rows can be discarded. Please refer the programming guide doc for more details. If you understand the possible risk of correctness issue and still need to run the query, you can disable this check by setting the config `spark.sql.streaming.statefulOperator.checkCorrectness.enabled` to false.

Read on to learn more about the scenario, the issue, and the solution.
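
For readers new to watermarks, here is a minimal pure-Python sketch of the idea Richard describes: accept rows that arrive up to 10 seconds behind the latest event time seen, and drop anything older. The `WatermarkFilter` class and its field names are hypothetical, and this is a toy model rather than Spark's implementation. Spark advances its watermark per micro-batch, not per row, and the sketch does not reproduce the chained stateful operations that trigger the error above.

```python
from dataclasses import dataclass


@dataclass
class WatermarkFilter:
    """Toy model of an event-time watermark (hypothetical, not Spark's code)."""

    delay_seconds: float
    max_event_time: float = float("-inf")

    @property
    def watermark(self) -> float:
        # The watermark trails the latest event time seen so far by the delay.
        return self.max_event_time - self.delay_seconds

    def accept(self, event_time: float) -> bool:
        # Rows older than the current watermark are treated as late and dropped.
        if event_time < self.watermark:
            return False
        self.max_event_time = max(self.max_event_time, event_time)
        return True


# Event times in seconds, listed in arrival order (assumed sample data).
wm = WatermarkFilter(delay_seconds=10)
results = [wm.accept(t) for t in (100, 105, 96, 120, 109, 111)]
print(results)
```

In this run, the row with event time 96 is kept because it arrives within 10 seconds of the latest event time (105), while 109 is dropped because the watermark has already moved to 110.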

A Primer on Database Sharding

Adrien Payong covers one technique to scale out databases:

Companies of all sizes and across industries are struggling to cope with an explosion of data never before seen in the short history of computing. As applications reach new levels of sophistication and become deeply interconnected, these companies find themselves increasingly overworked, overheated, and at their wits’ end, desperately trying to squeeze just a bit more performance and availability out of their aging database architectures.

Enter sharding, a powerful database architecture pattern that offers a solution to these challenges. Sharding scales out databases as data volume and user load grow, providing performance and high availability by spreading a database’s data across multiple servers.

Read on to learn more about it. Adrien mentions MongoDB, Cassandra, MySQL, and Postgres, though the real trick of sharding is in the client, so it works for other data platform technologies as well, including SQL Server.
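
To make the "trick is in the client" point concrete, here is a minimal sketch of hash-based shard routing, assuming a fixed list of four shards. The shard names and the `shard_for` function are hypothetical, and this is only one of several routing strategies (range-based and directory-based routing are others), not necessarily the approach Adrien describes.

```python
import hashlib

# Hypothetical shard names; in practice these would map to connection strings.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Pick a shard for a key using a stable hash of the key."""
    # Use a cryptographic digest instead of Python's built-in hash(), which is
    # salted per process and so would not route consistently across clients.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# The same key always routes to the same shard, whichever client asks.
print(shard_for("customer-42"))
```

One design caveat: with simple modulo routing, changing the number of shards remaps most keys, which is why larger systems often move to consistent hashing or a lookup directory.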

Frequently Asked Microsoft Purview Questions

James Serra has answers:

Microsoft Purview is now the combination of multiple Microsoft products.  Can you explain the differences?

Let’s break Microsoft Purview down into three sections of features that were formerly other products to clarify things:

  • Data governance:  This deals with data catalog, data quality (preview), data lineage, data management, and data estate insights (preview).  The product that had these features was formerly called Azure Purview
  • Data security: Covers data loss prevention, insider risk management, information protection, and adaptive protection.  The product that had these features was formerly called Microsoft Information Protection (MIP)
  • Data compliance: This covers compliance manager, eDiscovery and audit, communication compliance, data lifecycle management, and records management.  The product that had these features was formerly called Microsoft Information Governance

My question is, why is it so incomprehensibly expensive? It’s a really neat tool that a lot of organizations could make great use of, but it has at least one and maybe two too many zeroes on the bill, causing limited adoption.

Azure SQL DB Hyperscale Elastic Pools now GA

Arvind Shyamsundar has an announcement:

Azure SQL Database is the preferred database technology for hundreds of thousands of customers. Built on top of the rock-solid SQL Server engine and leveraging leading cloud-native architecture and technologies, Azure SQL Database Hyperscale offers leading performance, scalability and elasticity with one of the lowest TCO in the industry.

While you may start with a standalone Hyperscale database, chances are that as your fleet of databases grows, you want to optimize price and performance across a set of Hyperscale databases. Elastic pools offer the convenience of pooling resources like CPU, memory, IO, while ensuring strong security isolation between those databases.

Read on to learn more about what it offers and what it costs.

Working with lapply() in R

Steven Sanderson applies a function:

R is a powerful programming language primarily used for statistical computing and data analysis. Among its many features, the lapply() function stands out as a versatile tool for simplifying code and reducing redundancy. Whether you’re working with lists, vectors, or data frames, understanding how to use lapply() effectively can greatly enhance your programming efficiency. For beginners, mastering lapply() is a crucial step in becoming proficient in R.

Read on to see how lapply() works.

The Importance of Versioning Data

John Mount demonstrates an important concept:

Our business goal is to build a model relating attendance to popcorn sales, which we will apply to future data in order to predict future popcorn sales. This allows us to plan staffing and purchasing, and also to predict snack bar revenue.

In the above example data, all dates in August of 2024 are “in the past” (available as training and test/validation data) and all dates in September of 2024 are “in the future” (dates we want to make predictions for). The movie attendance service we are subscribing to supplies

  • past schedules
  • past (recorded) attendance
  • future schedules, and
  • (estimated) future attendance.

John’s example scenario covers the problem of future estimations interfering with model quality. Another important scenario is when the past changes. As one example, digital marketing providers (think Google, Bing, Amazon, etc.) will provide you impression and click data pretty quickly, and each day they close the books on a prior day’s data at some normal time. For some of these providers, that prior day’s data is yesterday’s data—on Tuesday, provider X closes the books on Monday’s data and promises that it won’t change after that. But for other providers, they might change data over the course of the next 10 days. This means that the data you’re using for model training might change from under you, and you might never know if you don’t keep track of the actual data you used for training at the time of training.
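
One lightweight way to keep track of the data you actually trained on is to record a fingerprint of each pull alongside the time you pulled it. The sketch below is an illustration under assumptions: the `snapshot_record` function, its field names, and the sample rows are all hypothetical and do not come from John's post.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot_record(rows, source_name, as_of=None):
    """Describe a data pull: where it came from, when, and a content hash."""
    # Serialize deterministically so identical data always hashes the same.
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "source": source_name,
        "as_of": (as_of or datetime.now(timezone.utc)).isoformat(),
        "row_count": len(rows),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


# Hypothetical example: the provider restates Monday's clicks a few days later.
monday_pull = [{"date": "2024-08-05", "clicks": 120}]
restated_pull = [{"date": "2024-08-05", "clicks": 131}]

first = snapshot_record(monday_pull, "provider_x")
later = snapshot_record(restated_pull, "provider_x")
data_changed = first["sha256"] != later["sha256"]
print(data_changed)
```

A hash only tells you that something changed. To reproduce a model you would also need to store the snapshot itself, or use a storage layer that supports time travel.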

A Reminder for Server Consistency

Chad Callihan resolves an issue:

I connected to the latest SQL Server, opened SSMS, and tried to restore from there. Sure enough, I was presented with the error:

Cannot access the specified path or file on the server. Verify that you have the necessary security privileges and that the path or file exists.

If you know that the service account can access a specific file, type in the full path for the file in the File Name control in the Locate dialog box.

Read on for the solution, which was easy enough, but serves as a reminder that having (and occasionally running!) idempotent configuration scripts can be quite useful.

Finding Missing Indexes in SQL Server

Jared Westover goes searching for where those missing indexes got off to:

In the past, while using the missing index Dynamic Management Views (DMVs), something always seemed to be missing from the results. It was hard to put my finger on it then, but looking back, it now seems obvious. Why can’t we see the queries prompting SQL Server to give suggestions? Did you know Microsoft added a DMV with the query text? Since discovering this gem, we no longer need to search through Plan Cache or Query Store.

Click through for the article, but do especially read the list of limitations Jared links to in the summary section before going off and creating a bunch of indexes.

Migrating Power BI Dataflows from Gen1 to Gen2

Reza Rad talks migration:

Unfortunately, there isn’t a migration tool to convert your Power BI dataflow (Gen1) to Microsoft Fabric dataflow (Gen2). If you have Fabric capacity licenses, it just makes sense to do that migration because Dataflow Gen2 gives you a choice of four data destinations, which we don’t have in Dataflow Gen1. However, converting Gen1 to Gen2 isn’t that complicated. The process is explained in this blog and video.

Click through for the blog post and the video.

Implicit Conversions in SQL Server

Vlad Drumea explains what it means implicitly to convert:

If you’re here, you most likely know what a data type conversion is, but, in short, it’s the operation of converting a value from one data type to another.

There are two types of conversions in SQL Server:

  • explicit – which are done by explicitly applying the CAST and CONVERT functions on a column, variable, or value.
  • implicit – when CAST and CONVERT are not used explicitly, but SQL Server ends up doing the conversion behind the scenes due to two distinct data types being compared.

Read on to learn more about which types of implicit conversion are relevant for performance and what you can do instead.
