Press "Enter" to skip to content

Day: May 3, 2024

Checking for Duplicate Rows with TidyDensity

Steven Sanderson looks for dupes:

Today, we’re diving into a useful new function from the TidyDensity R package: check_duplicate_rows(). This function is designed to efficiently identify duplicate rows within a data frame, providing a logical vector that flags each row as either a duplicate or unique. Let’s explore how this function works and see it in action with some illustrative examples.

Read on to see how it works. Though I am curious about whether there’s an option to ignore certain columns, such as row IDs or other “non-essential” columns you don’t want to include for comparison. Also, checking how it handles NA or NULL would be interesting.

Comments closed

Comparing SQL Server to Databricks

Paul Andrew makes a comparison:

Microsoft SQL Server and Azure Databricks over the many years I’ve been working in the data/IT industry have easily become my two favourite data processing tools. When Databricks became a first-class resource in Microsoft Azure, it was a big moment for the evolution of the data platform architectures I’ve designed and built (but architecture isn’t the focus for this blog). That said, rather than considering the tooling and technology as an evolution, I find a lot of people drawing comparisons between the products. This often leads to confusion and friction, as they are ultimately offering a lot of different capabilities, with only some common areas where comparisons could be made.

Read on for Paul’s thoughts. Spoilers: I agree with pretty much all of it.

Comments closed

Getting Row Counts of All Tables in a Microsoft Fabric Warehouse

Koen Verbeeck busts out the tally counter:

It says the data is 352MB in size, but after loading the data I was curious about how many rows were actually in that sample data set. Unfortunately, it’s not as straight forward as with a “normal” SQL Server database to get the row counts. First of all, when you connect with SSMS to the database there’s sadly no option to get the row counts report:

The post is a little depressing, really. But still worth the read.

Comments closed

Building an Elastic Job with Bicep

Josephine Bush flexes some muscles:

Bicep is an open-source Domain-Specific Language (DSL) that simplifies the process of deploying Azure resources. It is an abstraction layer on top of Azure Resource Manager (ARM) templates, making it easier to write and understand infrastructure code. Bicep lets you describe your Azure infrastructure using a cleaner and more concise syntax than traditional ARM templates.

It’s definitely easier to read and work with Bicep than directly with ARM template JSON. Larger Bicep scripts can still be pretty confusing, but it’s definitely easier to write and maintain.

Comments closed

Third-Party Applications and Poor Database Design

Kevin Hill talks about a bugbear of mine:

As a SQL DBA, what do you do when a vendor application has performance problems that are code related?

Server settings don’t generally seem to be an issue.

Queries and vendor code…total hands off. I just point at code and say “There’s a great choice for optimizing in your next update!”

Indexes are the “Sticky Bits” in between client data and vendor code.

To an extent, I do feel for software vendors, who have to write software that works in a variety of environments with a variety of workloads. That can be a real challenge, especially because the people developing those databases don’t get to monitor them in real time and observe what’s going on until someone reports an issue.

That’s the empathy part. The other side of the coin is, there are a bunch of vendors who have garbage-tier database designs and awful queries. And as a user of this third-party application or a consultant trying to help users of the application, that can be frustrating. The reason is, like Kevin mentions, you really don’t want to go mucking with the queries or database design, because with my luck, the change I make to improve a query’s performance will affect a trigger nested three levels deep in some totally unrelated process that somehow manages the data integrity of the entire application.

Comments closed

Fabric Workload Items in the Scanner API

Gilbert Quevauvilliers checks out the latest changes to the Scanner API:

All Fabric Workload Items are now available from the Scanner API

I was working with the customer and was looking for some information in the Scanner API.

For a change I went into Power Query and expanded the workspaces item.

Read on for more information. And if you’d like to learn more about the Scanner API, here’s a sample application showing how to use it.

Comments closed

Central Management Servers and SSMS 20

Greg Low works around an issue:

I’ve recently been doing work with a site that makes extensive use of Central Management Servers. And that’s an issue if you upgrade past v19.3 of SSMS.

Here’s my counter-argument: how frequent is it to find organizations that have enough SQL Server instances to make a Central Management Server worthwhile and also do not have any sort of certificate management process?

And more importantly, why don’t they have certificate management processes in place for SQL Server? This isn’t 2008 anymore—everybody (for some slight exaggeration of the term “everybody”) has certificate management in place for websites. It’s incredibly rare to find websites without TLS certificates, so somebody in your organization is managing certificates somehow. Why are these people not also managing certificates for SQL Server? Because once you have proper certificates in place rather than self-signed certs, there is no SSMS problem.

And if money is the issue, money is not the issue. Note that Daniel’s post is over 6 years old (and here’s me self-linking for street cred), meaning any company without the budget for proper certificates could have put this into place anytime over the past 6 years.

Self-signed certificates are okay for debugging purposes on personal machines. But they should not be acceptable for connecting to SQL Server in any environment. Certificate-driven encryption is a critical part of securing data movement over the wire, and a trusted certificate chain is critical for ensuring attackers cannot sit in the middle of that connection and read the data.

Comments closed