Press "Enter" to skip to content

Month: July 2023

Contrasting Spark and Flink for Streaming Use Cases

Deepthi Mohan and Karthi Thyagarajan contrast two products:

Apache Flink and Apache Spark are both open-source, distributed data processing frameworks used widely for big data processing and analytics. Spark is known for its ease of use, high-level APIs, and the ability to process large amounts of data. Flink shines in its ability to process data streams in real time and to handle low-latency stateful computations. Both support a variety of programming languages, scale to handle large amounts of data, and offer a wide range of connectors. Historically, Spark started out as a batch-first framework and Flink began as a streaming-first framework.

In this post, we share a comparative study of streaming patterns that are commonly used to build stream processing applications, how they can be solved using Spark (primarily Spark Structured Streaming) and Flink, and the minor variations in their approach. Examples cover code snippets in Python and SQL for both frameworks across three major themes: data preparation, data processing, and data enrichment. If you are a Spark user looking to solve your stream processing use cases using Flink, this post is for you. We do not intend to cover the choice of technology between Spark and Flink because it’s important to evaluate both frameworks for your specific workload and how the choice fits in your architecture; rather, this post highlights key differences for use cases that both these technologies are commonly considered for.
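To give a flavor of the comparison, here is a minimal sketch (not from the post) of a tumbling-window count in PySpark Structured Streaming; the built-in rate source and the one-minute window are arbitrary stand-ins. Flink would express the same pattern with a TUMBLE window in Flink SQL or the Table API.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source generates test rows with a timestamp column.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Count rows per one-minute tumbling window on the event timestamp.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

# Emit the running counts to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```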

Read on for an analysis of the two products.


Tips for Limiting Redis Failures

Phil Booth provides the ammo and we provide the feet:

Production outages are great at teaching you how not to cause production outages. I’ve caused plenty and hope that by sharing them publicly, it might help some people bypass part one of the production outage learning syllabus. Previously I discussed ways I’ve broken prod with PostgreSQL and with healthchecks. Now I’ll show you how I’ve done it with Redis too.

For the record, I absolutely love Redis. It works brilliantly if you use it correctly. The gotchas that follow were all occasions when I didn’t use it correctly.

My one addition here is to be really careful if you use Redis as persistent storage rather than as a cache. Redis as a cache is easy: if the server goes down or you have trouble, you simply have more database calls than normal. Redis as persistent storage is a much more complicated beast, one that seems to fall over a lot more often and is significantly more finicky about drivers.
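Here is a minimal sketch of that cache-aside degradation, using the redis-py client with a hypothetical fetch_user_from_db standing in for the real database call: if Redis is unavailable, the code falls through to the database rather than failing.

```python
import json
import redis

# Hypothetical stand-in for the real database call.
def fetch_user_from_db(user_id):
    return {"id": user_id, "name": "example"}

client = redis.Redis(host="localhost", port=6379)

def get_user(user_id):
    key = f"user:{user_id}"
    try:
        cached = client.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        pass  # Cache unavailable: fall through to the database.
    user = fetch_user_from_db(user_id)
    try:
        # Always set a TTL so a stale key cannot live forever.
        client.setex(key, 300, json.dumps(user))
    except redis.RedisError:
        pass  # Failing to cache is not an error for the caller.
    return user
```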


Mitigating Dynamic Data Masking Side-Channel Attacks

Ben Johnston wraps up a series on dynamic data masking:

This is the fifth and final part of this series on SQL Server Dynamic Data Masking. The first part in the series was a brief introduction to dynamic data masking, competing solutions, and use cases. The second part covered setting up masking and some examples. The third and fourth parts explored side-channel attacks against dynamic data masking.

This final part covers mitigations to side channel attacks, additional architectural considerations and an analysis of the overall solution.

Throughout the series, Ben has done a good job of laying out exactly what dynamic data masking is good for and what it isn't good for. I tend to harp a lot on the latter, but Ben keeps a reasonable approach throughout.


Trying Fabric Data Wrangler

Reza Rad looks at a new tool:

There is a tool (or you can consider it an editor) in Fabric for data scientists. As a data scientist, you must work with the data: clean it, group it, aggregate it, and do other data preparation work. This might be needed to understand the data, or it might be part of the process of preparing the data and loading it into a table for further analysis. Data Wrangler is a tool that gives you that ability. You can use it to transform and prepare data, and even generate Python code to make this process part of a bigger data analytics project.

Data Wrangler has a simple-to-use graphical user interface that makes the job of a data scientist easier.
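For a rough idea of what the generated output looks like, here is a hypothetical pandas snippet of the sort such a tool might emit; the column names and cleanup steps are invented for illustration.

```python
import pandas as pd

# Hypothetical input; in Fabric this would come from a lakehouse table.
df = pd.DataFrame({
    "region": ["east", "east", "west", None],
    "sales": [100.0, 250.0, None, 75.0],
})

# The kind of steps a point-and-click wrangling tool captures as code:
df = df.dropna(subset=["region"])       # drop rows missing a key column
df["sales"] = df["sales"].fillna(0.0)   # impute missing measures
summary = df.groupby("region", as_index=False)["sales"].sum()
print(summary)
```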

Read on for a video as well as a demo in written format.


MAXDOP by Username in Azure SQL DB

Jose Manuel Jurado Diaz comes up with a solution:

Azure SQL Database is a powerful platform that provides managed database services with built-in intelligence and robust resource management. While Azure SQL Database doesn’t have a direct implementation of the traditional Resource Governor feature available in SQL Server, we can explore a pseudo-Resource Governor approach using user-defined functions and custom tables. In this article, we’ll discuss the concept, present a sample implementation using a custom function, and highlight the possibilities it opens up for controlling CPU resources in Azure SQL Database.

Click through for the UDF and how to use it. My first inclination was to say that I couldn't see it working well at all under load, though on second thought, the performance hit won't be like having a UDF execute once per row in a table, so it's probably a manageable amount of overhead.


Tracking Power BI Import Throughput Variance

Chris Webb continues a series on using Log Analytics with Power BI:

In the second post in this series I discussed a KQL query that can be used to analyse Power BI refresh throughput at the partition level. However, if you remember back to the first post in this series, it’s actually possible to get much more detailed information on throughput by looking at the ProgressReportCurrent event, which fires once for every 10000 rows read during partition refresh.

Here’s yet another mammoth KQL query that you can use to analyse the ProgressReportCurrent event data:

Click through for the KQL query, an explanation of how it works, and some practical examples.


Cumulative Means in R

Steven Sanderson performs a moving average:

The cumulative mean, also known as the running mean or moving average, provides us with a dynamic view of how the average value of a dataset changes as new observations are added incrementally. It is an invaluable tool in time-series analysis, trend identification, and smoothing noisy data.

Imagine you have a series of numeric values, and you want to find the average of the first observation, then the average of the first two observations, followed by the average of the first three, and so on. This iterative process generates the cumulative mean, painting a picture of how the data behaves over time.

Oftentimes, we care about the moving average over a specific window, such as the last n periods. This particular post covers the running mean over the entire set of data; the sketch below contrasts the two.
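The post's examples are in R, but the idea is language-agnostic. Here is a quick Python sketch, assuming numpy and pandas, comparing the cumulative mean against a fixed three-period moving average:

```python
import numpy as np
import pandas as pd

x = np.array([3.0, 5.0, 4.0, 8.0, 6.0])

# Cumulative (running) mean: average of the first k observations at step k.
cumulative_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
print(cumulative_mean)  # [3.  4.  4.  5.  5.2]

# For contrast: a moving average over a fixed window of the last 3 periods.
rolling_mean = pd.Series(x).rolling(window=3).mean()
print(rolling_mean.values)  # [nan nan 4.  5.6667  6.]
```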


Full-Text Search in Cosmos DB via Cognitive Services

Hasan Savran performs a search:

Incorporating Full-Text Search functionality into your application can enable users to locate what they are searching for effortlessly. Searching for specific words or phrases within a database has always been difficult, particularly in relational databases. Throughout my career, I've had countless discussions/arguments with DBAs about the importance of implementing full-text search in a relational database. We are in totally different times now: users want to search by voice, image, or video.

Full-Text Search functionality is not part of Azure Cosmos DB's database engine. First, we must set up the Azure Cognitive Search service and link the data from Azure Cosmos DB to the Search Service. The process of setting up Azure Cognitive Search is relatively straightforward. Like other Azure services, you will need to answer the usual questions beforehand (subscription, resource group, a name for the service, region, and tier).

By the way, Azure Cognitive Search is very similar to Elasticsearch, for those of you familiar with that technology.
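For a sense of the query side once the Cosmos DB data is indexed, here is a minimal sketch using the azure-search-documents Python SDK; the endpoint, key, index name, and field names are placeholders, not taken from the post.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder endpoint, key, and index name.
client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="cosmos-items-index",
    credential=AzureKeyCredential("<query-key>"),
)

# Full-text query against the fields indexed from Cosmos DB.
results = client.search(search_text="red running shoes")
for doc in results:
    print(doc["id"], doc.get("@search.score"))
```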


A Primer on Postgres Database Security

Murtaza Umair provides guidance:

Keeping your database up to date with the latest PostgreSQL release is vital to maintaining the security of your database. Once a year, PostgreSQL comes out with a new major release, which includes new features, security enhancements, and performance improvements. Each major release is supported for five years, during which PostgreSQL publishes quarterly minor updates to fix bugs and patch security issues. The release schedule and more information are available on PostgreSQL's website at https://www.postgresql.org/developer/roadmap/

Nothing in this is earth-shattering but it is a solid overview.


Azure SQL MI and the WAF: Performance Pillar

Niko Neugebauer looks at one of the pillars of the Well-Architected Framework with respect to Azure SQL Managed Instance:

A baseline is a known value against which later measurements and performance can be compared. A baseline helps us define what normal database performance looks like, and comparing against it gives us insight into any abnormalities. Ideally, one should take performance measurements at regular intervals over time, even when no problems occur, to establish a server performance baseline. Compare each new set of measurements with those taken earlier.

Click through for additional guidance and recommendations.
