
Month: May 2024

Security and the Weakest Link

Brian Kelley explains:

When I look at a system and think about its security model, the first thing I start poking around at is where I think security is weakest. For instance, if my target is a Microsoft SQL Server box, I don’t generally look for a weakness in SQL Server itself. I start looking at the operating system, I look at accounts that may have access, and since I’m really worried about the data being taken, I look to see how backups are handled and where they are written to.

Read on for more information, as well as a link to Brian’s 2019 video on attacking SQL Server.


SQL Server Drivers and Access Tokens for Connection Pooling

David Engel gives us a deep dive:

We often get support escalations related to failures around expired access tokens when using Microsoft Entra ID (formerly Azure Active Directory) authentication. There is a lot of nuance in the various drivers about how and when access tokens are used and when they get renewed after expiring. I’m going to do a deep-dive and try to explain some of the details here.

This post is mainly targeted at people who have to troubleshoot issues in this area or those who are building large services and need to understand the impact of their authentication choices.

Read on to learn more about how the ODBC, JDBC, and .NET drivers differ.


MCMC Sampling with TidyDensity

Steven Sanderson performs some sampling:

In the area of statistical modeling and Bayesian inference, Markov Chain Monte Carlo (MCMC) methods are indispensable tools for tackling complex problems. The new tidy_mcmc_sampling() function in the TidyDensity R package simplifies MCMC sampling and visualization, making it accessible to a broader audience of data enthusiasts and analysts.

Read on for a brief primer on MCMC and an example of how the tidy_mcmc_sampling() function works.


Digging into Cursors

Hugo Kornelis gives us a primer on cursors:

And yes, I know the mantra. Do not use cursors. They are slow. There is (almost) always a faster set-based alternative available. So why would I even waste blog space on cursors, when the only smart thing to do is to rip them out and replace them with a set-based alternative?

Well, there are, in fact, many reasons. The “almost” above suggests that there are still cases where row-by-row processing is in fact the most efficient method. There are also cases where we use a cursor even though a set-based approach would be faster, because we need to call a stored procedure for each row returned, and that stored procedure is too complex to be changed to process an entire set at once. Or, perhaps, you just inherited existing code that uses a cursor, and you need to fix the immediate performance issues now, so you cannot afford the time investment to rewrite the row-by-row logic as set-based logic.

Another example of this was calculating running totals prior to SQL Server 2012’s support for running aggregate window functions. The cursor approach was considerably faster than the self-join approach because the self-join required joining on an inequality condition. Granted, there was the “quirky update” technique, but that depended on an accident of SQL Server internals that Microsoft never officially supported and that could have broken at any time.
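
To make the running total comparison concrete, here is a quick sketch of my own (table and column names are invented for illustration): the windowed aggregate that SQL Server 2012 and later allow, followed by the sort of cursor loop it replaced.

-- Since SQL Server 2012: a running total as a windowed aggregate, no cursor needed.
-- dbo.Sales, SaleDate, and Amount are illustrative names.
SELECT
    SaleDate,
    Amount,
    SUM(Amount) OVER (ORDER BY SaleDate
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal
FROM dbo.Sales
ORDER BY SaleDate;

-- The pre-2012 cursor version: accumulate in a variable, one row at a time.
DECLARE @RunningTotal decimal(18, 2) = 0;
DECLARE @SaleDate date, @Amount decimal(18, 2);
DECLARE @Results TABLE (SaleDate date, Amount decimal(18, 2), RunningTotal decimal(18, 2));

DECLARE sales_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT SaleDate, Amount
    FROM dbo.Sales
    ORDER BY SaleDate;

OPEN sales_cur;
FETCH NEXT FROM sales_cur INTO @SaleDate, @Amount;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @RunningTotal += @Amount;

    INSERT INTO @Results (SaleDate, Amount, RunningTotal)
    VALUES (@SaleDate, @Amount, @RunningTotal);

    FETCH NEXT FROM sales_cur INTO @SaleDate, @Amount;
END;

CLOSE sales_cur;
DEALLOCATE sales_cur;

SELECT SaleDate, Amount, RunningTotal
FROM @Results
ORDER BY SaleDate;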


Migrating DATETIME Data to DATETIMEOFFSET

William Assaf adds some time zones:

I recently reviewed, worked on, and added a similar example to the DATETIMEOFFSET Microsoft Learn Docs article at the recommendation of my colleague Randolph West, who guessed (accurately) I would enjoy such a task. It was a nice pre-Build diversion. 

This topic is one that I have co-presented on in the past and hounded project capstone review presentations about. If you’re not storing time zone offset in your date/time data, you’re setting yourself up for future pain. That future pain is not what this blog post is about.

My preference is not to store time zone offset but instead store everything in UTC and perform any time zone switcharoos in the UI. But if you are storing local dates and times, I completely agree that you should keep track of the time zone. I worked for an east coast US company that bought a west coast US company, and both stored local dates and times in their SQL Server databases, making data consolidation a real challenge.
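
For a taste of what the migration itself can look like, here is a sketch of my own rather than William’s example from the Learn article. The table name and the assumption that the old values were recorded in US Eastern time are both invented, and AT TIME ZONE requires SQL Server 2016 or later.

-- Add a DATETIMEOFFSET column alongside the existing DATETIME column.
ALTER TABLE dbo.Orders ADD OrderDateOffset datetimeoffset(7) NULL;
GO

-- Interpret the old, offset-less values as US Eastern local time (DST-aware),
-- which attaches the appropriate offset to each value.
UPDATE dbo.Orders
SET OrderDateOffset = OrderDate AT TIME ZONE 'Eastern Standard Time';

-- Or, if you share my store-everything-in-UTC preference, convert as well:
UPDATE dbo.Orders
SET OrderDateOffset = (OrderDate AT TIME ZONE 'Eastern Standard Time') AT TIME ZONE 'UTC';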


Listen and Notify in Postgres

Brandur Leach shows how to use PostgreSQL’s listen/notify capabilities:

Listen/notify in Postgres is an incredible feature that makes itself useful in all kinds of situations. I’ve been using it a long time, started taking it for granted long ago, and was somewhat shocked recently looking into MySQL and SQLite to learn that even in 2024, no equivalent exists.

In a basic sense, listen/notify is such a simple concept that it needs little explanation. Clients subscribe on topics and other clients can send on topics, passing a message to each subscribed client. The idea takes only three seconds to demonstrate using nothing more than a psql shell:
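
Brandur’s actual three-second demo is in the post itself; a minimal sketch of the same idea (channel name and payload are my own inventions) looks something like this across two psql sessions:

-- Session 1: subscribe to a channel.
LISTEN orders;

-- Session 2: publish on that channel, with an optional payload string.
NOTIFY orders, 'order 42 created';
-- or, equivalently, via the built-in function:
SELECT pg_notify('orders', 'order 42 created');

-- Session 1 sees the message the next time psql talks to the server,
-- printed as an "Asynchronous notification" line with the channel name and payload.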

Read on to learn more about the notifier pattern. What’s interesting is that the notifier pattern, which adds a fair bit of structure to this very simple process, makes it work a good bit like SQL Server’s Service Broker.


Number of Fabric Workspaces and the Medallion Architecture

Kevin Chant opens a can of worms:

Since I got asked about it this week during the Learn Together session I did alongside Shabnam Watson, and since it is a highly debated topic in our community, I wanted to share my thoughts about it.

My personal opinion is that it depends. However, the number of workspaces you choose depends on a variety of factors, which I intend to cover in this post.

By the end of this post, you will know my personal opinions as to why, plus plenty of things to consider when deciding on the number of workspaces to implement.

Read on for Kevin’s thoughts. My quick opinion is: one workspace per layer. Just from a logistical standpoint, keeping several layers separated within a single workspace is an immense challenge, and it typically requires exposing data engineering details (like what “gold”/”silver” or “curated”/”refined” actually means) to end users.


Max Offline Semantic Model Size Error

Chris Webb continues a series on Power BI semantic model memory errors:

In the Power BI Service, Import mode models are stored offline and paged into memory only when they are needed – for example, because someone runs a report that uses the model as its source. As discussed in my last post, though, there’s a limit on the amount of memory that a model can use, which varies by the size and type of capacity you’re using. There’s also an extra complication: Import mode models are stored offline in a compressed format, which means the Power BI Service doesn’t know exactly how much memory will be needed if the entire model has to be held in memory. As a result, there is an additional limit enforced on the size of the offline copy of Import mode models to ensure they don’t use too much memory when they are eventually paged in.

Read on to learn more about this setting and what it looks like if you trigger the error.


SQL Server FAST N Query Hint

Chad Callihan hits the Turbo button:

How familiar are you with the OPTION (FAST N) query hint? It’s not one that I had ever used, so I decided to experiment with it a bit. Let’s look at what it does and how to apply it to a query.

My recollection is that this query hint highly prioritizes “streaming” operators and de-prioritizes operators that work on the whole dataset as a block, such as sorts. This means, for example, that you’ll get more nested loops joins and natural merge joins, but fewer sort + merge or hash match joins. It may also make some queries take considerably longer as a result.
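
For reference, the hint simply gets appended to the end of the query. This sketch of mine (with illustrative table and column names) asks the optimizer to favor a plan that returns the first ten rows as quickly as possible:

-- Optimize for getting the first 10 rows back quickly, even if the
-- complete result set ends up costing more overall.
SELECT o.OrderID, o.OrderDate, o.CustomerID
FROM dbo.Orders AS o
ORDER BY o.OrderDate
OPTION (FAST 10);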


Checking for Duplicate Rows with TidyDensity

Steven Sanderson looks for dupes:

Today, we’re diving into a useful new function from the TidyDensity R package: check_duplicate_rows(). This function is designed to efficiently identify duplicate rows within a data frame, providing a logical vector that flags each row as either a duplicate or unique. Let’s explore how this function works and see it in action with some illustrative examples.

Read on to see how it works, though I am curious whether there’s an option to ignore certain columns, such as row IDs or other “non-essential” columns you don’t want to include in the comparison. It would also be interesting to check how it handles NA or NULL values.
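
The function itself is R, but the underlying idea translates readily to SQL, including the ignore-some-columns wrinkle I mention above: group only on the columns that matter and keep the groups that appear more than once. Table and column names in this sketch are invented.

-- Flag duplicates on a chosen subset of columns, ignoring a surrogate key such as CustomerID.
SELECT FirstName, LastName, Email, COUNT(*) AS DuplicateCount
FROM dbo.Customers
GROUP BY FirstName, LastName, Email
HAVING COUNT(*) > 1;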
