Press "Enter" to skip to content

Month: February 2024

Metadata-Driven Pipelines in Microsoft Fabric

John Miner returns to the old ways:

What is a metadata-driven pipeline? Wikipedia defines metadata as “data that provides information about other data”. As developers, we can create a non-parameterized pipeline and/or notebook to solve a business problem. However, if we have to solve the same problem a hundred times, the amount of code can get unwieldy. A better way to solve this problem is to store metadata in the delta lake. This data will drive how Azure Data Factory pipelines and Spark notebooks execute.

Read on to see how you can accomplish this task.
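
To make the idea concrete, here’s a minimal sketch of the pattern in a Spark notebook: one generic piece of code loops over rows in a control table rather than being copied a hundred times. The table name and columns (pipeline_metadata, source_path, target_table) are hypothetical stand-ins, not John’s actual schema.

```python
# A minimal sketch of a metadata-driven notebook. The control table and its
# columns (pipeline_metadata, source_path, target_table) are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The control table describes every source the generic notebook should process.
for row in spark.read.table("pipeline_metadata").collect():
    source_path = row["source_path"]
    target_table = row["target_table"]

    # One generic load-and-land step, driven entirely by the metadata row.
    df = spark.read.format("parquet").load(source_path)
    df.write.format("delta").mode("overwrite").saveAsTable(target_table)
```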

Executing via Proxy in SSIS

Andy Brownsword submits a series of requests through a proxy:

When executing packages in SSIS, the default option would be to use the SQL Server Agent service account. We might not want to share an account between our services and Integration Services packages due to security risks.

Let’s take a common example: Suppose we have a package which reads from a file share. If permissions aren’t set up correctly, this can fail. Even if access is corrected, the service account can cache that failure, and the only way to resolve it may be to restart the service. This isn’t something we’d want to be doing on production systems.

Read on to see how you can use a proxy to bypass this problem.
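
For reference, the usual recipe is a credential holding the account that has the file-share rights, a SQL Server Agent proxy over that credential, and a grant to the SSIS subsystem. Here’s a rough sketch of those steps driven from Python via pyodbc; the server, account, secret, and object names are all placeholders, and Andy’s post covers the details and caveats.

```python
# A rough sketch of the standard credential-plus-proxy setup for SSIS job
# steps, run from Python via pyodbc. Server, account, secret, and object
# names are all placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=msdb;"
    "Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()

# 1. A credential holding the Windows account that actually has file-share access.
cursor.execute("""
    CREATE CREDENTIAL SsisFileShareCredential
    WITH IDENTITY = N'DOMAIN\\ssis_file_user', SECRET = N'StrongPasswordHere';
""")

# 2. A SQL Server Agent proxy that wraps the credential.
cursor.execute("""
    EXEC msdb.dbo.sp_add_proxy
        @proxy_name = N'SsisFileShareProxy',
        @credential_name = N'SsisFileShareCredential',
        @enabled = 1;
""")

# 3. Let the proxy run SSIS package job steps (subsystem_id 11 = SSIS package execution).
cursor.execute("""
    EXEC msdb.dbo.sp_grant_proxy_to_subsystem
        @proxy_name = N'SsisFileShareProxy',
        @subsystem_id = 11;
""")

# 4. Allow a login to choose this proxy when creating job steps.
cursor.execute("""
    EXEC msdb.dbo.sp_grant_login_to_proxy
        @login_name = N'DOMAIN\\etl_developer',
        @proxy_name = N'SsisFileShareProxy';
""")
```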

Indexes in pgvector

Semab Tariq explains index options for pgvector, the vector extension for Postgres:

This blog is part of our pgvector blog series. If you haven’t checked out the first blog, I recommend going through it first, where I dive into important concepts of pgvector and AI applications in detail. I provided a real-world example illustrating how you can perform searches based on the meaning of words rather than the words themselves. You can find it at the link here.

In this blog, we will explore additional details about the indexes supported in pgvector. We will discuss how indexes are built in the backend and the various parameters associated with these indexes, and guide you on selecting the most suitable index based on your requirements. Finally, we will assess which index offers the best recall rate for our search query across our dataset of one million records sourced from Wikipedia. Let’s dive into that.

Click through to learn more about the two index types available.
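
Those two index types are IVFFlat and HNSW. As a rough sketch of what creating each looks like from Python (the items table and embedding column are hypothetical, and the parameters are only starting points):

```python
# A quick sketch of creating the two pgvector index types via psycopg2.
# The table "items" and column "embedding" are hypothetical; tune the
# parameters for your own data.
import psycopg2

conn = psycopg2.connect("dbname=vectordb user=postgres")
cur = conn.cursor()

# IVFFlat: clusters vectors into lists at build time; fast to build, and
# recall depends on how many lists are probed at query time.
cur.execute("""
    CREATE INDEX items_embedding_ivfflat
    ON items USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
""")

# HNSW: graph-based index; slower to build and larger, but typically a
# better recall/speed trade-off at query time.
cur.execute("""
    CREATE INDEX items_embedding_hnsw
    ON items USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
""")

conn.commit()
```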

An Overview of Data Types in R

Steven Sanderson talks data types:

Imagine your data as a diverse collection of individuals. Some might be numbers (like age or weight), while others might be text (like names or addresses). These different categories are called data types, and R recognizes several key ones:

Click through for that list. It’s a bit different from what you’d expect if you come at this from a SQL or C-based programming language background. But they all make good sense when you remember that R is a domain-specific language for statistics, so it’s going to emphasize the things that make the most sense for statisticians and data scientists.

Installing or Upgrading SQL Server on a Workstation

Aaron Bertrand shares some advice:

Lately, I’ve seen many people struggling to upgrade their workstation to the latest version of SQL Server. The main source of the problem is usually the web installer/wrapper, which can fail for a variety of reasons that aren’t always made clear by the error messages. Today, I’m going to walk through one way you can approach a workstation upgrade and avoid these web installer issues altogether. I’ll focus on SQL Server 2022, but most of the information applies to any modern major version. Most of the information applies if you’re installing a brand-new instance, too.

Click through for several good tips and practices.

dataCoverageDefinition and DirectQuery Partitions

Chris Webb talks about hybrid tables:

Hybrid tables – tables which contain both Import mode and DirectQuery mode partitions to hold data from different time periods – have been around for a while. They are useful in cases where your historic data doesn’t change but your most recent data changes very frequently and you need to reflect those changes in your reports; you can also have “reverse hybrid tables” where the latest data is in Import mode but your historic data (which may not be queried often but still needs to be available) is in DirectQuery mode. Up to now they had a problem though: even when you were querying data that was in the Import mode partition, Power BI still sent a SQL query to the DirectQuery partition and that could hurt performance. That problem is now solved with the new dataCoverageDefinition property on the DirectQuery partition.

Read on to see what dataCoverageDefinition does.

Improving Aurora Postgres Performance by Lowering random_page_cost

Shayon Mukherjee does some performance tuning:

Recently I have been working with some queries in Postgres where I noticed that it either decided not to use an index and performed a sequential scan, or used an alternative index over a composite partial index. This was quite puzzling, especially when you know there are indexes in the system that can perform these queries faster.

So what gives?

After some research, I stumbled upon random_page_cost (ref).

Read on to learn what this setting is and how you can change it, plus a warning before you go wild changing it.
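
One of the safer ways to experiment is to change the setting for a single session and compare plans before and after, rather than touching any server-wide configuration. A small sketch of that, with placeholder connection details and a placeholder query:

```python
# A small sketch of testing the effect of random_page_cost on one session
# before changing anything server-wide. Connection details and the query
# are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()

query = "EXPLAIN SELECT * FROM orders WHERE customer_id = 42;"

# Plan with the default cost (4.0 on most installations).
cur.execute(query)
print("\n".join(row[0] for row in cur.fetchall()))

# Lower the estimated cost of random I/O for this session only -- a common
# value on SSD-backed storage -- and compare the plan.
cur.execute("SET random_page_cost = 1.1;")
cur.execute(query)
print("\n".join(row[0] for row in cur.fetchall()))
```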

Accessibility in UI Design

Benedict Ampea-Badu shares some tips:

In this last segment of the series, we’ll delve into the crucial elements of navigation and flow, exploring further the part of web accessibility that ensures users can effortlessly navigate online content. Additionally, we’ll shine a spotlight on the indispensable practice of regular testing, an ongoing commitment that goes beyond compliance, providing invaluable insights into real user experiences and fostering continuous improvement.

Ensuring a seamless online experience for everyone and facilitating smooth navigation boils down to adhering to the fundamentals of web accessibility. It boils down to carefully planning how websites are laid out, making it easy for users to explore different things effortlessly. When developers grasp how users interact and tackle possible issues, they go beyond just ticking boxes on a checklist. Instead, they create websites that truly put users first, considering how people actually use and prefer to browse.

This is particularly about accessibility in web applications, but many of the same principles apply to things like Power BI reports.

Using Databricks System Tables

Dustin Vannoy has a primer on system tables in Databricks:

Monitoring is important, so I’ve covered the topic a few times in the past. I’ve talked about collecting your Spark application logs and Spark metrics. These are a good way to track what is happening and what is going wrong as your code runs. In the video related to this post I focus on a different side of monitoring: the evolving capabilities offered by Databricks system tables. I have some sample queries and links to help you get started and begin to get value from system tables. This will need to be updated (I’ll try) as new tables go into public preview status. So let’s discuss the questions I had when I first started researching this feature:
1) What do the Databricks system tables offer me for monitoring?
2) How much does this overlap with the application logs and metrics?

Click through for a video and a walkthrough.
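
As a starting point, here’s a hedged sketch of querying the billing system table from a Databricks notebook. The column names follow the public docs at the time of writing; check your own workspace, since these tables are still evolving.

```python
# A starter query against the billing system table from a Databricks
# notebook. Column names follow the public docs at the time of writing and
# may change as system tables evolve.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a notebook

daily_usage = (
    spark.table("system.billing.usage")
    .where(F.col("usage_date") >= "2024-02-01")
    .groupBy("usage_date", "sku_name")
    .agg(F.sum("usage_quantity").alias("usage"))
    .orderBy("usage_date")
)

daily_usage.show(truncate=False)
```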

Looping in Python

Jack Wallen has us spinning in circles:

What is a loop? Other than a fruity breakfast cereal, a loop is a portion of code that repeats a set number of times until a desired process is complete.

Here’s an example of a loop that many can relate to. You go to the ATM, insert your card, and start with your transaction. First, you check your balance, then you remove money from your account. The loop keeps running with each transaction until you tell the ATM you are finished.

Read on for loop examples, though for is definitely much more common and popular than while. I’d also recommend learning a bit about list comprehensions, which are a way of turning a for loop into a one-line operation. [f(x) for x in my_list] is a simple example: for each element in my_list, perform function f() with that element’s value as the input, then take all of the results of f(x) and stick them together in a new list.
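
To put the options side by side, here’s a small sketch showing the same work done as a for loop, a while loop, and the list comprehension above (f and my_list are just illustrative names):

```python
# The same work three ways: a for loop, a while loop, and the list
# comprehension mentioned above.
def f(x):
    return x * x

my_list = [1, 2, 3, 4, 5]

# for loop: build the result one element at a time.
squares_for = []
for x in my_list:
    squares_for.append(f(x))

# while loop: same result, but we manage the index ourselves.
squares_while = []
i = 0
while i < len(my_list):
    squares_while.append(f(my_list[i]))
    i += 1

# List comprehension: the one-line version.
squares_comp = [f(x) for x in my_list]

print(squares_for == squares_while == squares_comp)  # True
```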
