Press "Enter" to skip to content

Curated SQL Posts

Minimizing Latency in Kafka Streaming Applications using APIs

Abhishek Goswami doesn’t want to slow down the stream:

Kafka is widely adopted for building real-time streaming applications due to its fault tolerance, scalability, and ability to process large volumes of data. However, in general, Kafka streaming consumers work best only in an environment where they do not have to call external APIs or databases. In a situation when a Kafka consumer must make a synchronous database or API call, the latency introduced by network hops or I/O operations adds up and accumulates easily (especially when the streaming pipeline is performing an initial load of a large volume of data before starting CDC). This can significantly slow down the streaming pipeline and result in the blowing of system resources impacting the throughput of the pipeline. In extreme situations, this may even become unsustainable as Kafka consumers may not be able to commit offsets due to increased latency before the next polling call and get continuously rebalanced by the broker, practically not processing anything yet incrementally consuming more system resources as time passes.

This is a real problem faced by many streaming applications. In this article, we’ll explore some effective strategies to minimize latency in Kafka streaming applications where external API or database calls are inevitable. We’ll also compare these strategies with the alternative approach of separating out the parts of the pipeline that require these external interactions into a separate publish/subscribe-based consumer.

Read on to understand the causes of this latency and several patterns you can use to limit it.

Comments closed

Updating Microsoft Fabric Warehouses via Power Apps

Shabnam Watson troubleshoots an issue:

One of my recent explorations with Microsoft Fabric was integrating Power Apps with a Fabric Warehouse—both in a standalone Power Apps app and as an embedded visual within a Power BI report to enable writeback. My goal was simple: to enable Power Apps to display and update records from a table in the Fabric Warehouse. Initially, I turned to the three-screen template apps to get started, however; while it displayed the records, it failed to update them. This led me to dive deeper into how Fabric Warehouse differs from other SQL data sources when it comes to Power Apps and to find a workaround.

Read on to see how it all works.

Comments closed

Don’t Trust TRUSTWORTHY

Chad Callihan talks about an untrustworthy setting:

TRUSTWORTHY is a database property change that can have far-reaching security consequences when turned ON. Let’s take a brief look at what the TRUSTWORTHY property is and if it’s worth turning on, even when it is a potential fix to your problems.

Chad links to a DBA Stack Exchange post from Solomon Rutzky concerning module signing, which is a good opportunity for me to plug Solomon’s modulesigning.info. This is the correct answer, not TRUSTWORTHY or any of its ilk (EXECUTE AS, cross-database ownership chaining, etc.).

Comments closed

Glyphs and Character Encodings in PostgreSQL

Cristophe Pettis has a two-parter (so far). First up is an explanation of several linguistic terms:

This is part one of a series on PostgreSQL and collations, and how to use them without tears. This is an introduction to the general concepts of glyphs, character encodings, collations, and locales.

Part two covers character encodings in PostgreSQL:

From the point of view of the computer, a character string is just a sequence of bytes (maybe terminated by a zero byte, maybe with a length). If the only thing PostgreSQL had to do with character strings was store them and return them to the client, it could just ignore that character encodings even exist.

However, databases don’t just store character strings: they also compare them, build indexes on them, change them to upper case, do regex searches on them, and other things that mean they need know what the characters are and how to manipulate them. So, PostgreSQL needs to know what character encoding applies to the text it is storing.

Read on for a detailed explanation for PostgreSQL. A lot of this also applies to SQL Server, though there are some inevitable differences that pop up.

Comments closed

Finding a Good Cost Threshold for Parallelism

Jared Westover goes on a quest:

Given modern hardware, you might hear that the default setting of 5 for the Cost Threshold for Parallelism (CTFP) is far too low. However, people are left with a decision: Should they change it or leave it alone? If I change it and the performance gets worse, I’ll be left with egg on my face. What exactly is the benefit of increasing it, especially for smaller-cost queries?

Read on to learn more about what Cost Threshold for Parallelism is, how you can set it, and a simple example of how the setting can affect you. Jared also has some links to great resources that I highly recommend you check out.

Comments closed

Moving items from “My Workspace” in Microsoft Fabric

Matt Collins is on the move:

A common issue I’ve seen recently when working with Microsoft Fabric is managing items in the “My Workspace” Workspace. This is often the playground for many users who sign up for a free trial but can result in some administrative overhead when resources developed here are now ready for wider use and need to be moved to a shared location.

In this article we will discuss how to move workspace items in Microsoft Fabric from “My Workspace” to other workspaces, using our understanding of item dependencies and some metadata to speed up the process.

Read on to learn how, as well as some of the issues you can run into along the way.

Comments closed

Map and FlatMap in PySpark

Vipul Kumar does a bit of work with resilient distributed datasets:

PySpark, the Python API for Apache Spark, is widely used for big data processing and distributed computing. It enables data engineers and data scientists to efficiently process large datasets using resilient distributed datasets (RDDs) and DataFrames. Two commonly used transformations in PySpark are map() and flatMap(). These functions allow users to perform operations on RDDs and are pivotal in distributed data processing.

In this blog, we’ll explore the key differences between map() and flatMap(), their use cases, and how they can be applied in PySpark.

The DataFrame approach has all but obviated having developers use the original Hadoop-like map-reduce approach to writing code in Spark. Even so, I do think it’s useful to know how it all works.

Comments closed

A Survey of Predictive Analytics Techniques

Akmal Chaudhri tries a bunch of things:

In this short article, we’ll explore loan approvals using a variety of tools and techniques. We’ll begin by analyzing loan data and applying Logistic Regression to predict loan outcomes. Building on this, we’ll integrate BERT for Natural Language Processing to enhance prediction accuracy. To interpret the predictions, we’ll use SHAP and LIME explanation frameworks, providing insights into feature importance and model behavior. Finally, we’ll explore the potential of Natural Language Processing through LangChain to automate loan predictions, using the power of conversational AI.

Click through for the notebook, as well as an overview of what the notebook includes. I don’t particularly like word clouds as the “solution” in the BERT example, though without real data to perform any sort of NLP, there’s not much you can meaningfully do.

Comments closed

Source Control Tips

Aamir Khan shares some tips on source control:

In software development, version control is an essential practice that helps manage changes to code and collaborates effectively with team members. A well-organized repository not only streamlines the development process but also enhances productivity and minimizes errors. In this blog post, we’ll explore best practices for maintaining a clean and organized repository, including branch naming conventions and crafting effective commit messages.

The primary audience for this is software developers, but if you create and modify SQL assets (tables, stored procedures, views, functions, etc.), source control is a great thing for many reasons, and this still applies to you. And if you write Powershell scripts because you’re “not a coder,” well, I have some shocking news for you.

Comments closed

Fixing Timeout Issues with Azure SQL Database

Reitse Eskens shares some knowledge:

The customer can connect to the Azure Sql database with Sql Server Management Studio (SSMS) but not with a specific client application.
When digging into the logs (all logs were activated for this database), nothing shows up for the specific login used by the client application. The application itself returns a connection error caused by a time-out.

The application resides outside of Azure and can’t use a VPN connection, the Azure Sql Server has a specific firewall rule to allow incoming traffic from this specific IP address. Not a situation I’m really happy with, but it happens.

Read on for the solution. It was not one I had anticipated. But it did land in my “When in doubt, blame the network” policy.

Comments closed