Month: September 2022

Using the ShortCircuitOperator in Airflow

Lior Gavish shows off a useful operator in Apache Airflow:

But what happens when Airflow testing doesn’t catch all of your bad data? What if “unknown unknown” data quality issues fall through the cracks and affect your Airflow jobs? 

One helpful but underutilized solution is to leverage the Airflow ShortCircuitOperator to create data circuit breakers to prevent bad data from flowing across your data pipelines.

Data circuit breakers are powerful, but as with most data quality tactics, the nuances of how they are implemented are critical. Otherwise, you can make a bad problem worse.

Read on to learn more about the operator and how you can use it. The code block images are a bit fuzzy but still readable; they might be a little clearer in the original post.
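
As a rough illustration of the circuit breaker idea (a minimal sketch, not the code from Lior's post), the check below returns a boolean and the ShortCircuitOperator skips downstream tasks when that value is falsy. The function name `row_count_is_healthy`, the threshold, the DAG and task ids, and the placeholder row count are all hypothetical, and the import paths assume Airflow 2.x.

```python
from datetime import datetime


def row_count_is_healthy(row_count: int, expected_min: int = 1_000) -> bool:
    """Return True to let downstream tasks run, False to short-circuit them.

    The threshold is a hypothetical stand-in for a real data quality check.
    """
    return row_count >= expected_min


def build_dag():
    """Wire the check into a DAG (Airflow 2.x import paths assumed)."""
    # Imported lazily so the check above can be tested without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator, ShortCircuitOperator

    with DAG(dag_id="circuit_breaker_sketch", start_date=datetime(2022, 9, 1)) as dag:
        breaker = ShortCircuitOperator(
            task_id="check_row_count",
            # A real check would query the warehouse; 250 is a placeholder value.
            python_callable=lambda: row_count_is_healthy(250),
        )
        publish = PythonOperator(
            task_id="publish_report",
            python_callable=lambda: print("publishing"),
        )
        # If the breaker returns a falsy value, publish_report is skipped.
        breaker >> publish
    return dag
```

Keeping the check as a plain function makes it easy to unit test on its own, which matters given the warning above that a badly implemented breaker can make a bad problem worse.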

Deploying a Streamlit App to RStudio Connect

Parisa Gregg wraps up a series:

RStudio Connect is a platform which is well known for providing the ability to deploy and share R applications such as Shiny apps and Plumber APIs as well as plots, models and R Markdown reports. However, despite the name, it is not just for R developers (hence their recent announcement). RStudio Connect also supports a growing number of Python applications, API services including Flask and FastAPI and interactive web based apps such as Bokeh and Streamlit.

In this post we will look at how to deploy a Streamlit application to RStudio Connect. Streamlit is a framework for creating interactive web apps for data visualisation in Python. Its API makes it very easy and quick to display data and create interactive widgets from just a regular Python script.

Click through for the step-by-step process.

Inserting into Azure Blob Storage from SQL Server 2022

I continue a series on data virtualization in SQL Server 2022:

Several years ago, I wrote a blog post on how to insert data into Azure Blob Storage from SQL Server using PolyBase. That technique used PolyBase V1: the Java connector for Hadoop. With SQL Server 2022 eliminating that connector, we’re going to learn the new method.

This is one of the larger practical differences in data virtualization with SQL Server 2022.

Backup Options for Cosmos DB

Manvendra Singh takes a backup:

This article will explore backup options available in the Azure Cosmos DB service. Backups are very important to safeguard our data in case of data corruption, data deletion, system failure, or any unforeseen circumstances that call for disaster recovery (DR). We have planned, configured, and managed them for our on-prem databases, whether SQL Server, Oracle, DB2, or system files on various machines. DBAs and infrastructure admins have made sure to keep backups of all these systems to safeguard their data. Similarly, we must also secure our data hosted in a cloud environment for any service, whether it is Azure VMs, Azure SQL, Cosmos DB accounts, or anything else. Today we will talk about backup options available to secure Cosmos DB databases and their contents.

Click through for those two options.

SQL Server 2022 Query Store Hints

David Pless takes us through some new query hints:

Query Store hints provide a direct method for developers and DBAs to shape query plans without changing application code.  

Query Store hints are a new feature that extends the power of Query Store, but this means that Query Store hints do require the Query Store feature to be enabled and that your query and query plan are captured in the Query Store.

Just like plan guides, Query Store hints are persisted and will survive restarts, but Query Store hints are much easier to use than plan guides.

Read on to see which options are available and how they work.

Inverted Indexes for Full-Text Search

Maria Zakourdaev twists some text inside-out:

Sometimes there are properties in the document with unstructured text, like newspaper articles, blog posts, or book abstracts. The inverted index is easy to build and is similar to data structures search engines use.

Such document structures can help in various complex search patterns, like common word detection, full-text searches, or document similarity searches, using Hamming distance or L2 distance algorithms. Inverted indexes are useful when the number of keywords is not too large and when the existing data is either totally immutable or rarely changed, but frequently searched.

This post and Maria’s MSSQLTips post both cover the high-level concept, focusing on tradeoffs between different data models. I like this sort of idea a lot and like telling people that sometimes, the right answer in a relational database involves thinking backwards.
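
To make the high-level concept concrete, here is a minimal in-memory sketch in Python. It is not Maria's implementation, and it does not cover the relational data model her posts discuss: the tokenizer, the function names, and the sample documents are all assumptions for illustration. The idea is simply to flip "document contains words" into "word points at documents."

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each word to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return dict(index)


def search_all(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "Backups safeguard data against corruption",
    2: "Query Store hints shape query plans",
    3: "Inverted indexes speed up full-text search over data",
}
index = build_inverted_index(docs)
print(search_all(index, "data"))         # documents 1 and 3
print(search_all(index, "query plans"))  # document 2
```

Rebuilding word sets on every change is the reason the quote stresses immutable or rarely changed data: lookups are cheap, while updates touch one entry per distinct word in the document.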

Oracle on Azure Frequently Asked Questions

Kellyn Pot’vin-Gorman spreads information:

A lot of DBAs aren’t as familiar with Oracle Data Guard as many would think. Even though it’s a phenomenal product, they may have never used it, so knowing the ins and outs of the best Oracle product to use with Oracle on Azure is important.

I highly recommend the following documentation and guidelines from Oracle. The product team in charge of Data Guard at Oracle is fantastic, so why go anywhere else to learn about this?

Oracle Data Guard Concepts and Administration, 19c

If you are in the situation where you’re thinking about moving your Oracle servers to Azure, this is a good starting point.

The Joy of Treemaps

Simon Rowe describes one of my favorite often-inappropriate visuals:

Dr Shneiderman developed the “treemap” in order to visualise this large amount of data—with multiple levels of folders and subfolders—in an efficient way, without taking up too much screen real estate. The treemap uses a series of nested rectangles, sized proportionally to the corresponding data value, to deliver an organised and multi-level view into any hierarchical data set.

Treemaps get misused a lot but are really valuable in specific scenarios. Click through to learn when (and when not) to use a treemap.
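
For a sense of how the nested rectangles come about, here is a minimal Python sketch of a slice-and-dice layout, which splits each rectangle among its children in proportion to their sizes and alternates the split direction at each level. This is an assumed, simplified layout rather than anything from Simon's post, and the `Node` class, the `layout` function, and the sample folder tree are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    size: float = 0.0  # used for leaves only
    children: list["Node"] = field(default_factory=list)

    def total(self) -> float:
        """Size of a leaf, or the sum of all descendants for a folder."""
        if not self.children:
            return self.size
        return sum(child.total() for child in self.children)


def layout(node, x, y, w, h, depth=0, out=None):
    """Return (name, x, y, width, height) tuples for node and its descendants."""
    if out is None:
        out = []
    out.append((node.name, x, y, w, h))
    total = node.total()
    if not node.children or total == 0:
        return out
    offset = 0.0  # fraction of the parent rectangle already handed out
    for child in node.children:
        share = child.total() / total
        if depth % 2 == 0:
            # Even depth: children sit side by side along the x axis.
            layout(child, x + offset * w, y, w * share, h, depth + 1, out)
        else:
            # Odd depth: children are stacked along the y axis.
            layout(child, x, y + offset * h, w, h * share, depth + 1, out)
        offset += share
    return out


# Hypothetical folder tree: sizes are arbitrary units.
root = Node("root", children=[
    Node("docs", children=[Node("a.txt", 30), Node("b.txt", 10)]),
    Node("video.mp4", 60),
])
for rect in layout(root, 0, 0, 100, 100):
    print(rect)
```

Each leaf's area ends up proportional to its size, which is the property that makes treemaps useful for hierarchies. The long thin slivers that this simple layout tends to produce are one reason the chart is easy to misuse.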

Visualizing Delay Times on Subway Stations

Benjamin Smith looks for delays:

Any Torontonian who has commuted regularly on the TTC has probably experienced their fair share of delays on the subway. Having experienced a few recently, I was inspired to visualize the average delay times across all stops on the subway. What are the stations with the longest delays on average this past year? Could we make a nice visual with it?

Click through for the end result as well as the process to get there.

PolyBase and Named Instances

I show how to connect to a named instance using PolyBase in SQL Server 2019 or 2022:

We have two SQL Server instances running on the same machine. Before we get started, I do want to point out one thing: PolyBase can only work on one instance for a given server (physical machine or virtual machine) because the PolyBase engine and data movement services are system-level services. This means you cannot have PolyBase installed on your main instance as well as your named instance.

Click through for two methods.
