Press "Enter" to skip to content

Day: September 13, 2024

Printing a Table in R via table()

Steven Sanderson builds a table:

Tables are an essential part of data analysis, serving as a powerful tool to summarize and interpret data. In R, the table() function is a versatile tool for creating frequency and contingency tables. This guide will walk you through the basics and some advanced applications of the table() function, helping you understand its usage with clear examples.

Click through for more information and several examples.

Comments closed

ISNULL vs COALESCE in SQL Server

Erik Darling has a video for us:

A Difference Between ISNULL And COALESCE You Might Care About In SQL Server

There’s nothing for me to snip as the graf. I don’t often link to videos without any sort of text accompaniment, but it’s been too long since I’ve linked to Erik and this was an interesting topic.

Bonus points for using “case expression” instead of the more common but technically incorrect “case statement.”

Comments closed

A Primer on Database Sharding

Adrien Payong covers one technique to scale out databases:

Companies of all sizes and across industries are struggling to cope with an explosion of data never before seen in the short history of computing. As applications reach new levels of sophistication and become deeply interconnected, these companies find themselves increasingly overworked, overheated, and at their wits’ end, desperately trying to squeeze just a bit more performance and availability out of their aging database architectures.

Enter sharding, a powerful database architecture pattern that offers a solution to these challenges. Sharding scales out databases as data volume and user load grow, providing performance and high availability by spreading a database’s data across multiple servers.

Read on to learn more about it. Adrien mentions MongoDB, Cassandra, MySQL, and Postgres, though the real trick of sharding is in the client, so it also works for other data platform technologies as well, including SQL Server.

Comments closed

Generating a Multi-Aggregate Pivot in Spark

Richard Swinbank troubleshoots an issue:

I’m using a stream watermark to handle late arriving data – basically1) my watermark enables the stream to accept data arriving up to 10 seconds late …and that’s where the problem shows up.

When I run this streaming query – in Azure Databricks I can do this simply with display(df_pivot) – I receive the error:

AnalysisException: Detected pattern of possible ‘correctness’ issue due to global watermark. The query contains stateful operation which can emit rows older than the current watermark plus allowed late record delay, which are “late rows” in downstream stateful operations and these rows can be discarded. Please refer the programming guide doc for more details. If you understand the possible risk of correctness issue and still need to run the query, you can disable this check by setting the config `spark.sql.streaming.statefulOperator.checkCorrectness.enabled` to false.

Read on to learn more about the scenario, the issue, and the solution.

Comments closed

Azure SQL DB Hyperscale Elastic Pools now GA

Arvind Shyamsundar has an announcement:

Azure SQL Database is the preferred database technology for hundreds of thousands of customers. Built on top of the rock-solid SQL Server engine and leveraging leading cloud-native architecture and technologies, Azure SQL Database Hyperscale offers leading performance, scalability and elasticity with one of the lowest TCO in the industry .

While you may start with a standalone Hyperscale database, chances are that as your fleet of databases grows, you want to optimize price and performance across a set of Hyperscale databases. Elastic pools offer the convenience of pooling resources like CPU, memory, IO, while ensuring strong security isolation between those databases.

Read on to learn more about what it offers and what it costs.

Comments closed

Frequently Asked Microsoft Purview Questions

James Serra has answers:

Microsoft Purview is now the combination of multiple Microsoft products.  Can you explain the differences?

Let’s break Microsoft Purview down into three sections of features that were formerly other products to clarify things:

  • Data governance:  This deals with data catalog, data quality (preview), data lineage, data management, and data estate insights (preview).  The product that had these features was formerly called Azure Purview
  • Data security: Covers data loss prevention, insider risk management, information protection, and adaptive protection.  The product that had these features was formerly called Microsoft Information Protection (MIP)
  • Data compliance: This covers compliance manager, eDiscovery and audit, communication compliance, data lifecycle management, and records management.  The product that had these features was formerly called Microsoft Information Governance

My question is, why is it so incomprehensibly expensive? It’s a really neat tool that a lot of organizations could make great use of, but it has at least one and maybe two too many zeroes on the bill, causing limited adoption.

Comments closed