Press "Enter" to skip to content

Day: September 9, 2022

Airflow and Akka

Chesnay Schepler responds to an announcement:

On September 7th Lightbend announced a license change for the Akka project, the TL;DR being that you will need a commercial license to use future versions of Akka (2.7+) in production if you exceed a certain revenue threshold.

Within a few hours of the announcement several people reached out to the Flink project, worrying about the impact this has on Flink, as we use Akka internally.

The purpose of this blogpost is to clarify our position on the matter.

Read on for what this means for Apache Flink.

Comments closed

Using the ShortCircuitOperator in Airflow

Lior Gavish shows off a useful operator in Apache Airflow:

But what happens when Airflow testing doesn’t catch all of your bad data? What if “unknown unknown” data quality issues fall through the cracks and affect your Airflow jobs? 

One helpful but underutilized solution is to leverage the Airflow ShortCircuitOperator to create data circuit breakers to prevent bad data from flowing across your data pipelines.

Data circuit breakers are powerful, but as with most data quality tactics, the nuances of how they are implemented are critical. Otherwise, you can make a bad problem worse.

Read on to learn more about the operator and how you can use it. The code block images are a bit fuzzy but still readable enough. It might be a little clearer on the original post.

Comments closed

Deploying a Streamlit App to RStudio Connect

Parisa Gregg wraps up a series:

RStudio Connect is a platform which is well known for providing the ability to deploy and share R applications such as Shiny apps and Plumber APIs as well as plots, models and R Markdown reports. However, despite the name, it is not just for R developers (hence their recent announcement). RStudio Connect also supports a growing number of Python applications, API services including Flask and FastAPI and interactive web based apps such as Bokeh and Streamlit.

In this post we will look at how to deploy a Streamlit application to RStudio Connect. Streamlit is a framework for creating interactive web apps for data visualisation in Python. It’s API makes it very easy and quick to display data and create interactive widgets from just a regular Python script.

Click through for the step-by-step process.

Comments closed

Inserting into Azure Blob Storage from SQL Server 2022

I continue a series on data virtualization in SQL Server 2022:

Several years ago, I wrote a blog post on how to insert data into Azure Blob Storage from SQL Server using PolyBase. That technique used PolyBase V1: the Java connector for Hadoop. With SQL Server 2022 eliminating that connector, we’re going to learn the new method.

This is one of the larger practical differences in data virtualization with SQL Server 2022.

Comments closed

Backup Options for Cosmos DB

Manvendra Singh takes a backup:

This article will explore backup options available in the Azure Cosmos DB service. Backups are very important to safeguard our data in case of data corruption, data deletion, system failure, or any unforeseen circumstances like DR. We have planned, configured, and managed it for our on-prem databases whether it is SQL Server, Oracle, DB2, or system files on various machines. DBAs and Infrastructure admins have ensured to keep a backup of all these systems to safeguard their data. Similarly, we must also secure our data hosted in a cloud environment for any services whether it is Azure VMs, Azure SQL, Cosmos Db accounts, or any other services. Today we will talk about backup options available to secure cosmos DB databases and their contents.

Click through for those two options.

Comments closed

SQL Server 2022 Query Store Hints

David Pless takes us through some new query hints:

Query Store hints provide a direct method for developers and DBAs to shape query plans without changing application code.  

Query Store hints are a new feature that extends the power of Query Store—but this means that Query Store hints does require the Query Store feature to be enabled and that your query and query plan are captured in the Query Store.

Just like plan guides, Query Store hints are persisted and will survive restarts, but Query Store hints are much easier to use than plan guides.

Read on to see which options are available and how they work.

Comments closed

Inverted Indexes for Full-Text Search

Maria Zakourdaev twists some text inside-out:

Sometimes there are properties in the document with unstructured text, like newspaper articles, blog posts, or book abstracts. The inverted index is easy to build and is similar to data structures search engines use.

Such document structures can help in various complex search patterns, like common word detection, full-text searches, or document similarity searches, using humming distance or l2distance algorithms. Inverted indexes are useful when the number of keywords is not too large and when the existing data is either totally immutable or rarely changed, but frequently searched.

This post and Maria’s MSSQLTips post both cover the high-level concept, focusing on tradeoffs between different data models. I like this sort of idea a lot and like telling people that sometimes, the right answer in a relational database involves thinking backwards.

Comments closed

Oracle on Azure Frequently Asked Questions

Kellyn Pot’vin-Gorman spreads information:

A lot of DBAs aren’t as familiar with Oracle DataGuard as many would think.  Even though it’s a phenomenal product, they may have never used it, so knowing the ins and outs of the best Oracle product to use with Oracle on Azure is important.

I highly recommend the following documentation and guidelines from Oracle.  The Product team in charge of DataGuard is fantastic at Oracle, so why go anywhere else to learn about this?

Oracle Data Guard Concepts and Administration, 19c

If you are in the situation where you’re thinking about moving your Oracle servers to Azure, this is a good starting point.

Comments closed