Press "Enter" to skip to content

Month: October 2023

DISTINCT Papers up Problems

Aaron Bertrand wants to solve the actual problem:

I’ve quietly resolved performance issues by re-writing slow queries to avoid DISTINCT. Often, the DISTINCT is there only to serve as a “join-fixer,” and I can explain what that means using an example.

I’ve seen this a lot as well, and it usually comes from people not understanding the data model or not understanding how to use subqueries (or common table expressions, the APPLY operator, etc.) to define subsets of data.

Comments closed

A Critique (and Defense) of Generic Programming Languages for ETL/ELT

Teo Lachev doesn’t like general programming languages for ETL and ELT operations:

Someone asked the other day for my opinion about the open-source dbt tool for ETL. I hadn’t heard about it. Next thing I’ve noticed was that Fabric Warehouse added support for it, so I got inspired to take a first look. Seems like an ELT-oriented tool. Good, I’m a big fan of the ELT pattern whose virtues I extolled I discussed many times here. But a Python-based tool that requires writing custom code for orchestration in a dev environment, such as Visual Studio Code? Yuck!

My reasoning is simple: complexity. Bespoke ETL/ELT tools like SQL Server Information Services, Informatica, Azure Data Factory, Airflow, and the like are good when you fit into their primary use cases: moving data from a few data sources into a destination, perhaps with some level of transformation in between.

But here are areas off the top of my head where I’ve seen these tools not work well:

  • Wide scale. In one environment, we had to move contents from a couple thousand databases (with identical schemas) across 50-60 instances of SQL Server into a warehouse, including some facts and dimensions we needed within a minute or two. Even assuming those packages don’t change frequently—not a reasonable assumption—the pains of orchestrating that would be enormous. I don’t think we could have used metadata-driven approach and foreach loops in your ADF workflows, either, as that would not satisfy the time requirements. There are also resource limitation requirements on the other side—you don’t want to overwhelm the warehouse by trying to process a couple thousand clients’ worth of data all at once, so you’ve got to stagger this work using an orchestration engine with enough smarts to limit concurrent processes.
  • Limiting copy-paste efforts and drudgery. Going back to SSIS, it sucks having to maintain dozens of packages, especially common components you need to update in each one. I got to be pretty good at Biml, but a) that has its limits, and b) that’s C# development with SSIS packages as an output, so I’m claiming that for the generic programming languages side of the argument.
Comments closed

Oracle OCI Labeling with Bounding Boxes

Brendan Tierney continues a series on image classification:

In a previous post, I gave examples of how to label data using OCI Data Labeling. It was a simple approach to data labeling images for input to AI Vision. In that post, we just gave a label for the image to indicate if the image contained a Cat or a Dog. Yes, that’s a very simple approach, and we can build image classification models, and use the resulting model to predict a label for new images. These would be labeled as a Cat or a Dog with a degree of certainty. Although this simple approach can give OK-ish results, we typically want a more detailed model and predictions. For a more detailed approach, we can use Object Detection. For this, we need to prepare our data set in a slightly different way and Yes it does take a bit more time to prepare. Or perhaps it takes a lot more time to prepare the data. But this extra time in preparing the data should (in theory) give us a more accurate model.

This post will focus on creating a new labeled dataset using bounding boxes, and in a later post, we’ll examine the resulting model to see if it gives better or more accurate results.

Read on for the process.

Comments closed

Finding the ACTIVE_TRANSACTION Culprits

Thamires Lemes digs into high transaction log utilization:

The transaction log in SQL Server records all changes made to a database, allowing for data recovery and consistency. When a transaction is initiated, it acquires space in the transaction log to record its activities. Long running transactions have the potential to hold the transaction log, and, depending on database write activity, cause errors and disruptions in the SQL Server environment.

It is important to point out that the transaction that is holding the transaction log might not be performing any write activities to consume additional log space, but subsequent transactions that writes to the transaction log will cause its utilization to increase, even if they are fast. The log space won’t be released until the oldest transaction concludes its execution.

Click through for a few queries on the topic. I’d also highly recommend sp_whoisactive for this kind of work.

Comments closed

Using Tableau with Power BI and Fabric

Kurt Buhler crosses the streams:

If you use Power BI, Fabric, or Excel, connecting to Power BI datasets is straightforward. However if you use other BI tools like Tableau, it’s not obvious how you can leverage a Power BI semantic model in your workflow. In this article, I’ll explain how to connect to and use a Power BI dataset from Tableau Desktop.

Read on to see how. Also check out the notes in drill-down sections, as there’s a lot of content in there.

Comments closed

The Rise of Single-Purpose ML Frameworks

Pete Warden describes a phenomenon:

The GGML framework is just over a year old, but it has already changed the whole landscape of machine learning. Before GGML, an engineer wanting to run an existing ML model would start with a general purpose framework like PyTorch, find a data file containing the model architecture and weights, and then figure out the right sequence of calls to load and execute it. Today it’s much more likely that they will pick a model-specific code library like whisper.cpp or llama.cpp, based on GGML.

This isn’t the whole story though, because there are also popular model-specific libraries like llama2.cpp or llama.c that don’t use GGML, so this movement clearly isn’t based on the qualities of just one framework. The best term I’ve been able to come up with to describe these libraries is “disposable”. I know that might sound derogatory, but I don’t mean it like that, I actually think it’s the key to all their virtues! They’ve limited their scope to just a few models, focus on inference or fine-tuning rather than training from scratch, and overall try to do a few things very well. They’re not designed to last forever, as models change they’re likely to be replaced by newer versions, but they’re very good at what they do.

Pete calls them disposable ML frameworks, though I’d call them single-purpose frameworks to contrast with general-purpose ML frameworks like PyTorch and TensorFlow.

Comments closed

An Overview of Event-Driven Architecture

Yaniv Ben Hemo explains what event-driven architecture is:

First things first, Event-driven architecture. EDA and serverless functions are two powerful software patterns and concepts that have become popular in recent years with the rise of cloud-native computing. While one is more of an architecture pattern and the other a deployment or implementation detail, when combined, they provide a scalable and efficient solution for modern applications.

Click through for a primer on event-driven architecture. This is a pattern that I find quite useful for optimizing cloud pricing, assuming your normal business processes can run asynchronously—that is, people are not expecting near-real-time performance and you can start and stop processes periodically in order to “re-use” the same compute for multiple services. The alternative use of EDA is that your services need to be running all the time, but you also have multiple teams working together on the solution and you want to decouple team efforts. In that case, you define queues or Kafka-style topics and let those act as the mechanism for service integration.

This is definitely an architecture that works better for cloud-based systems than on-premises systems.

Comments closed

Capacity Options in Microsoft Fabric

Reitse Eskens is at capacity:

Now I’m going to do something scary and try to explain some things. I tried to pay attention during the precon and had the pleasure of talking with Ljubica Vujovic Boskovic on the capacity usage. She, very patiently, helped me out where my mind completely lost all track. Her explanations were great, any errors are all mine and I will correct this blogpost if there are mistakes. If you want to know more, you can also read this blog by Chris Novak who digs a bit deeper into smoothing and bursting.

So let me give you a very quick and simple introduction into the capacity challenges we’re going to face.

Read on for an overview of how Microsoft Fabric capacity planning works and one concern with this style of “one capacity to rule them all.”

Comments closed

Substrings in MySQL

Rahul Mehta describes how substrings work in MySQL:

MySQL is an open-source relational database management system. It is a widely used relational database management system in the entire world. MySQL like any other database can store different types of data. One of the most used data types is “String”. Developers widely use it in storing data as well as in different formatting operations. One of the key requirements we will always come across is to derive a part of the string. MySQL provides a “SUBSTRING” function to extract a substring from a string. My SQL has below options for extracting the string:

  1. SUBSTRING
  2. SUBSTR (A SYNONYM FOR SUBSTRING)
  3. SUBSTRING_INDEX

Read on to see how these two functions work. They differ a bit from SQL Server in terms of functionality, though there’s a lot of overlap between the two.

Comments closed

A Primer on SQL Server Security

Ben Johnston gives us a high-level overview of SQL Server security options:

SQL Server security structure, mechanisms and methods are very thoroughly documented in the Microsoft documentation, but it is quite daunting if you don’t already know about the functionality. I recently had a request to explain some security features of SQL Server so that internal audits could be completed. While thinking about the request and preparing for the meeting, I realized how many security features are available in SQL Server. The purpose of this post is not to thoroughly explain how all of these items work but to give an introduction to these features and a few recommendations. Given how many security-centered features are available, I’m sure I missed a few, and new features are added all the time, but these are the main features at the time of this writing.

This is solid as a view into what options are available. I do have at least one moderate-to-large qualm with the article: cross-database ownership chaining is something you should never enable; use module signing instead.

Comments closed