

Working with Excel Files in Databricks

Chen Hirsh deals with truly big data:

Excel is one of the most common data file formats, and, as data engineers, we are required to read data from it on almost every project. Excel is easy to use, and you can customize it quickly, like adding a column or changing values. But the same things that make it the go-to format for users make it hard for data platforms to read. Adding a column might break a pipeline, and changing data types (for example, adding text to a column that previously held only numeric data) might cause a nasty error downstream.

Working in Databricks, you can read and write Excel files, but you need to pay attention to some pitfalls. So let’s get started, working with Excel files on Databricks!

Click through for a way to do this using PySpark. H/T Madeira Data Solutions blog.
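
As a rough sketch of the kind of approach involved (not necessarily the exact method from the post), you can read an Excel file into a Spark DataFrame by way of pandas. The path and sheet name below are hypothetical, and this assumes the openpyxl package is installed on the cluster:

    import pandas as pd

    # Hypothetical path and sheet name; adjust to your workspace.
    # pandas parses the .xlsx file (via openpyxl), and Spark then
    # converts the result into a distributed DataFrame.
    pdf = pd.read_excel("/dbfs/FileStore/sales.xlsx", sheet_name="Sheet1")

    df = spark.createDataFrame(pdf)  # `spark` is predefined in Databricks notebooks
    df.printSchema()
    df.show(5)

One caveat with this route: pandas loads the entire workbook into driver memory, so it suits small-to-medium files rather than genuinely large ones.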


Finding the SQL Power BI DirectQuery Mode Generates

Chris Webb finds a way:

If you’re performance tuning a DirectQuery mode semantic model in Power BI, one of the first things you’ll want to do is look at the SQL that Power BI is generating. That’s easy if you have permissions to monitor your source database, but if you don’t, it can be quite difficult to do so from Power BI. I explained the options for getting the SQL generated in DirectQuery mode and why it’s so complicated in a presentation here, but I’ve recently found a new way of doing this in Power BI Desktop (but not the Service) that works for some M-based connectors, for example Snowflake.

Click through for the solution.


External References in Data-Tier Applications

Andy Brownsword needs to make a call out:

One method for transferring a database to a different environment is using a Data-Tier Application – in the form of a DACPAC (for schema) or BACPAC (for schema and data).

Trying to use this approach with multi-database solutions is a challenge, though, as Data-Tier Applications don’t play nicely with cross-database objects.

Let’s look at how we can ease that pain.

Read on for the solution.


Printing a Table in R via table()

Steven Sanderson builds a table:

Tables are an essential part of data analysis, serving as a powerful tool to summarize and interpret data. In R, the table() function is a versatile tool for creating frequency and contingency tables. This guide will walk you through the basics and some advanced applications of the table() function, helping you understand its usage with clear examples.

Click through for more information and several examples.
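
The post itself covers R, but the underlying idea translates directly; for comparison, here is the same pair of concepts in Python with pandas (sample data invented for illustration):

    import pandas as pd

    # Invented sample data.
    df = pd.DataFrame({
        "color": ["red", "blue", "red", "green", "blue", "red"],
        "size":  ["S",   "M",    "S",   "L",     "S",    "M"],
    })

    # Frequency table: counts of each distinct value, like R's table(x).
    print(df["color"].value_counts())

    # Contingency table: cross-tabulation of two variables,
    # like R's table(x, y).
    print(pd.crosstab(df["color"], df["size"]))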


ISNULL vs COALESCE in SQL Server

Erik Darling has a video for us:

A Difference Between ISNULL And COALESCE You Might Care About In SQL Server

There’s nothing for me to snip as the graf. I don’t often link to videos without any sort of text accompaniment, but it’s been too long since I’ve linked to Erik and this was an interesting topic.

Bonus points for using “case expression” instead of the more common but technically incorrect “case statement.”


A Primer on Database Sharding

Adrien Payong covers one technique to scale out databases:

Companies of all sizes and across industries are struggling to cope with an explosion of data never before seen in the short history of computing. As applications reach new levels of sophistication and become deeply interconnected, these companies find themselves increasingly overworked, overheated, and at their wits’ end, desperately trying to squeeze just a bit more performance and availability out of their aging database architectures.

Enter sharding, a powerful database architecture pattern that offers a solution to these challenges. Sharding scales out databases as data volume and user load grow, providing performance and high availability by spreading a database’s data across multiple servers.

Read on to learn more about it. Adrien mentions MongoDB, Cassandra, MySQL, and Postgres, though the real trick of sharding is in the client, so it works for other data platform technologies as well, including SQL Server.
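
To make that point about the client concrete, here is a minimal sketch (not from the article) of hash-based routing, where the application picks a shard from a key; the shard addresses and key format are invented:

    import hashlib

    # Hypothetical connection targets, one per shard.
    SHARDS = [
        "db-shard-0.example.com",
        "db-shard-1.example.com",
        "db-shard-2.example.com",
    ]

    def shard_for(key: str) -> str:
        """Route a key to a shard by hashing it deterministically.
        md5 is used for its stable, uniform distribution, not for security."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # Every read and write for customer 42 lands on the same shard.
    print(shard_for("customer:42"))

Note that simple modulo routing makes adding a shard painful, since most keys remap to a different server; that is one reason real systems often prefer consistent hashing or a lookup directory.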


Generating a Multi-Aggregate Pivot in Spark

Richard Swinbank troubleshoots an issue:

I’m using a stream watermark to handle late arriving data – basically, my watermark enables the stream to accept data arriving up to 10 seconds late … and that’s where the problem shows up.

When I run this streaming query – in Azure Databricks I can do this simply with display(df_pivot) – I receive the error:

AnalysisException: Detected pattern of possible ‘correctness’ issue due to global watermark. The query contains stateful operation which can emit rows older than the current watermark plus allowed late record delay, which are “late rows” in downstream stateful operations and these rows can be discarded. Please refer the programming guide doc for more details. If you understand the possible risk of correctness issue and still need to run the query, you can disable this check by setting the config `spark.sql.streaming.statefulOperator.checkCorrectness.enabled` to false.

Read on to learn more about the scenario, the issue, and the solution.
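
For context, the watermark Richard mentions is declared on the stream’s event-time column. Here is a minimal sketch of the pattern (the source and column names are hypothetical, and this is not Richard’s actual query):

    from pyspark.sql import functions as F

    # Hypothetical streaming source; the built-in "rate" source emits
    # a timestamp column we can treat as event time.
    df = (spark.readStream
          .format("rate")
          .load()
          .withColumnRenamed("timestamp", "event_time"))

    # Accept records arriving up to 10 seconds late, then aggregate
    # per one-minute event-time window.
    agg = (df.withWatermark("event_time", "10 seconds")
             .groupBy(F.window("event_time", "1 minute"))
             .count())

    # Per the error message above, the correctness check can be
    # disabled, at your own risk:
    # spark.conf.set(
    #     "spark.sql.streaming.statefulOperator.checkCorrectness.enabled",
    #     "false")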


Azure SQL DB Hyperscale Elastic Pools now GA

Arvind Shyamsundar has an announcement:

Azure SQL Database is the preferred database technology for hundreds of thousands of customers. Built on top of the rock-solid SQL Server engine and leveraging leading cloud-native architecture and technologies, Azure SQL Database Hyperscale offers leading performance, scalability, and elasticity with one of the lowest TCOs in the industry.

While you may start with a standalone Hyperscale database, chances are that as your fleet of databases grows, you will want to optimize price and performance across a set of Hyperscale databases. Elastic pools offer the convenience of pooling resources like CPU, memory, and IO, while ensuring strong security isolation between those databases.

Read on to learn more about what it offers and what it costs.


Frequently Asked Microsoft Purview Questions

James Serra has answers:

Microsoft Purview is now the combination of multiple Microsoft products. Can you explain the differences?

Let’s break Microsoft Purview down into three sections of features that were formerly other products to clarify things:

  • Data governance: This deals with data catalog, data quality (preview), data lineage, data management, and data estate insights (preview). The product that had these features was formerly called Azure Purview.
  • Data security: Covers data loss prevention, insider risk management, information protection, and adaptive protection. The product that had these features was formerly called Microsoft Information Protection (MIP).
  • Data compliance: This covers compliance manager, eDiscovery and audit, communication compliance, data lifecycle management, and records management. The product that had these features was formerly called Microsoft Information Governance.

My question is, why is it so incomprehensibly expensive? It’s a really neat tool that a lot of organizations could make great use of, but it has at least one and maybe two too many zeroes on the bill, which limits adoption.
