Press "Enter" to skip to content

Author: Kevin Feasel

Connecting Kafka to Elasticsearch

Danny Kay and Liz Bennett build an example of writing Kafka topic data to Elasticsearch:

The Elasticsearch sink connector helps you integrate Apache Kafka® and Elasticsearch with minimum effort. You can take data you’ve stored in Kafka and stream it into Elasticsearch to then be used for log analysis or full-text search. Alternatively, you can perform real-time analytics on this data or use it with other applications like Kibana.

For some background on what Elasticsearch is, you can read this blog post by Sarwar Bhuiyan. You can also learn more about Kafka Connect in this blog post by Tiffany Chang and in this presentation from Robin Moffatt.

This is a demo-heavy walkthrough, so check it out.


Metadata-Driven ADF Pipelines

Paul Andrew wraps up a series on metadata-driven processing of Azure Data Factory pipelines. Part 3 covers the ADF wrapper necessary for our custom pipelines:

Firstly, to help guide this post, below is a mock-up of our Data Factory pipelines and activities to show the end goal. Hopefully this view informs how things are going to be connected using what I call a pipeline hierarchy system and how they will work in the overall framework. For our metadata processing framework we can make the following category distinctions about the activities represented:

Grandparent – This is the top level orchestration of our wider data platform solution. Here a scheduled trigger could be connected, or processing in our solution grouped into natural areas. Technically this level isn’t required for our processing framework, but I’ve included it as good practice.
Parent – Our parent pipeline’s primary purpose (try saying that fast 3 times 🙂 ) is to handle the stages of our processing framework. The stages will then be passed off sequentially to our child pipeline using another execute pipeline activity.
Child – At this level in the framework the child is hitting the Azure Function to call the lowest level executors, or the pipelines that we want to actually do the work in our data platform solution. In my previous post I added some example metadata to call pipelines named Stage X-X. These in turn relate to a set of empty placeholder pipelines that contained only Wait activities.

Part 4 puts it all together:

For the end-to-end run of the framework we have a few options to see progress once it’s been triggered. Before that, it’s worth pointing out that in the below I’ve used the sample metadata provided with the database scripts in GitHub. Then for each execution pipeline I’ve added a single Wait activity with a random time delay of a few seconds. The point here is to test the framework execution, not the pipelines being called. To further clarify, the Stage X-X pipeline names should be replaced with your actual pipeline names in your data platform solution.

Definitely worth the read.


DISTINCT is not a Function

Lukas Eder corrects a misconception:

A very common misconception I often encounter with SQL users is the idea that DISTINCT is something like a function, and that it can take parenthesised arguments. Just recently, I’ve seen this Stack Overflow question where the OP was looking for a way to express this in jOOQ:

SELECT DISTINCT (emp.id), emp.fname, emp.name FROM employee emp;

Notice the parentheses around (emp.id), which look as though this is some special kind of DISTINCT usage, which is akin to a DISTINCT function. The idea is often that:

– The behaviour is somewhat different from omitting the parentheses
– The performance is faster, because only the ID needs to be considered for distinctness

Both of these ideas are (mostly) wrong, as Lukas shows.
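To make the point concrete, here is a quick sketch using the table from the quoted query; any mainstream SQL engine should treat the first two statements identically.

-- The parentheses only group the first expression; DISTINCT still
-- applies to the entire select list, so these two are equivalent.
SELECT DISTINCT (emp.id), emp.fname, emp.name FROM employee emp;
SELECT DISTINCT emp.id, emp.fname, emp.name FROM employee emp;

-- If the intent really is "one row per id", say so explicitly with a
-- grouping construct and pick a representative value per column.
SELECT emp.id, MIN(emp.fname) AS fname, MIN(emp.name) AS name
FROM employee emp
GROUP BY emp.id;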


Syncing Mobile Apps via Change Tracking

Davide Mauri shows how we can perform data synchronization using change tracking in Azure:

Sending data from the cloud to the app is way more tricky. You want to do it in the most efficient way, to spare bandwidth and device battery life, so you need a way to know what has changed since the last time that specific user and device synced. As data is surely stored in a database of some sort, you also need some efficient method on the database side to make sure you can quickly get everything that is new or changed and that is in the scope for that specific user/device. If your mobile application is successful, this means that you may literally have millions and millions of rows or documents to scan and check for changes.

Not an easy task: all hope is lost then? Just send back the whole data set and that’s it? Of course not! We don’t want to just be developers, but better developers, right?

Modern databases can help a lot in tackling this challenge. Azure SQL, for example, has a feature called Change Tracking that, guess what?, will take care of keeping track of changes for you.

Davide includes a lot of detail and even a sample application on GitHub.
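Davide’s sample application has its own schema, but as a rough sketch of how the moving parts fit together, the T-SQL below (with invented database, table, and column names) shows how you might enable Change Tracking and then ask for everything that changed since a client’s last sync.

-- Enable change tracking at the database level, then per table.
ALTER DATABASE MyMobileAppDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.TodoItems
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);

-- On each sync, the device sends the last version it saw and gets back
-- only the rows inserted, updated, or deleted since then.
DECLARE @last_sync_version bigint = 42;  -- stored per user/device

SELECT ct.Id, ct.SYS_CHANGE_OPERATION, t.Title, t.IsComplete
FROM CHANGETABLE(CHANGES dbo.TodoItems, @last_sync_version) AS ct
LEFT JOIN dbo.TodoItems AS t ON t.Id = ct.Id;

-- The device stores the current version to use on its next sync.
SELECT CHANGE_TRACKING_CURRENT_VERSION();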


Currency Conversion in Power BI

Marco Russo takes us through some tips when performing currency conversions using Power BI:

The rule of thumb is to apply the currency exchange conversion upfront. Therefore, it is a good idea to solve all the scenarios requiring a single currency in the report by converting all amounts at the time of data import into the data model. When you need to create a report in multiple currencies, computing data in advance could be challenging and expensive. Therefore, a dynamic solution based on DAX measures and a properly designed data model makes more sense.

In this article, we only consider the third scenario, “Data in a single currency, report with multiple currencies”. The second scenario could be implemented by transforming data so that it is imported in the model in a single currency, moving the challenge over to the very scenario we describe in this article. An extended description of the three scenarios with other dynamic solutions is included in a chapter of the Analyzing Data with Microsoft Power BI and Power Pivot for Excel book.

This is quite a useful article if you work with multiple currencies.
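The dynamic, multi-currency scenario Marco covers lives in DAX measures, so read the article for that. The “convert upfront at import time” alternative he mentions is easy to picture in SQL; here is a rough sketch with an invented fact table and daily rate table.

-- Hypothetical schema: Sales(OrderDate, Amount, CurrencyCode) and
-- ExchangeRate(RateDate, FromCurrency, ToCurrency, Rate).
-- Converting to a single reporting currency during import keeps the
-- model simple, at the cost of locking the report to that currency.
SELECT
    s.OrderDate,
    s.Amount * r.Rate AS AmountEUR
FROM dbo.Sales AS s
JOIN dbo.ExchangeRate AS r
    ON  r.RateDate     = s.OrderDate
    AND r.FromCurrency = s.CurrencyCode
    AND r.ToCurrency   = 'EUR';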


Managing Azure SQL Database with Golang

Silvano Coriani shows us that it’s possible to manage Azure SQL Databases with Go:

The goal for this post is to introduce how to start interacting with Azure SQL through Go (https://golang.org/), an open source programming language gaining lots of traction in the developer community thanks to its simplicity and efficiency in scenarios like microservices and server apps (did I mention that Kubernetes itself is written in Go?).

Azure SQL provides full support for Go developers on both control plane (deploy, manage and configure Azure SQL servers and databases) and data plane activities (connect, execute commands and queries against Azure SQL instances) through Azure SDK for Go and Microsoft SQL Server Driver for Go.

A companion code sample for this article, written using VS Code, can be found here.

To me, this is like the articles written back in 2007-2008 about how to manage SQL Server with Python. It’s cool and I hope it helps some people learn the language or find another good use for it, but I’m not sure it moves the needle.


Implicit Type Conversions with Spark SQL

Manoj Pandey walks us through an unexpected result from Spark SQL:

While working on some data analysis I saw one Spark SQL query was not getting me expected results. The table had a good amount of data; I was filtering on a value, but some records were missing. So I checked online and found that Spark SQL works differently compared to SQL Server, in this case when comparing columns or variables of two different data types.

Read on to learn more about the issue. This is the downside of Feasel’s Law: just because both system interfaces are SQL doesn’t mean that they’re equivalent or that the assertions and assumptions you can make for one follow through to the next.
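Manoj’s post has the specifics; as a generic illustration of the class of problem (table and values invented here), compare how Spark SQL and SQL Server treat a string column matched against an integer literal.

-- Assumed table: events(event_id STRING, payload STRING).
-- With Spark's default non-ANSI settings, the mismatched types are
-- resolved with an implicit cast, and any event_id that doesn't cast
-- cleanly to a number becomes NULL, so those rows silently disappear
-- from the result. SQL Server would instead try to convert the strings
-- to int and raise a conversion error on failure.
SELECT *
FROM events
WHERE event_id = 1001;

-- Keeping the comparison on the string side avoids the implicit cast.
SELECT *
FROM events
WHERE event_id = CAST(1001 AS STRING);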


When to Index Temp Tables

Erik Darling thinks about the point in a procedure at which one should index a temp table:

You already know that your temp table needs an index. Let’s say there’s some query plan ouchie from not adding one. You’ve already realized that you should probably use a clustered index rather than a nonclustered index. Adding a nonclustered index leaves you with a heap and an index, and there are a lot of times when nonclustered indexes won’t be used because they don’t cover the query columns enough.

Good. We’ve fixed you.

But, like, when should you create the index?

I try to do as many inline operations as I can with temp tables because doing so means you might be able to take advantage of temp table reuse, and on a frequently-running procedure, that can make a difference.
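As a rough sketch of the two placements (table and column names invented here), this is what the inline version looks like next to the load-then-index version; SQL Server 2014 and later accept the inline index syntax.

-- Option 1: define the clustered index inline, before any data arrives.
CREATE TABLE #orders
(
    order_id   int  NOT NULL,
    order_date date NOT NULL,
    INDEX cx_orders CLUSTERED (order_id)
);

INSERT INTO #orders (order_id, order_date)
SELECT o.order_id, o.order_date
FROM dbo.Orders AS o;

-- Option 2: load the heap first, then build the index afterward.
CREATE TABLE #orders2
(
    order_id   int  NOT NULL,
    order_date date NOT NULL
);

INSERT INTO #orders2 (order_id, order_date)
SELECT o.order_id, o.order_date
FROM dbo.Orders AS o;

CREATE CLUSTERED INDEX cx_orders2 ON #orders2 (order_id);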
