Press "Enter" to skip to content

Month: July 2020

Custom Windows in Apache Flink

Alexander Fedulov walks us through window options with Apache Flink:

In the previous articles of the series, we described how you can achieve flexible stream partitioning based on dynamically-updated configurations (a set of fraud-detection rules) and how you can utilize Flink’s Broadcast mechanism to distribute processing configuration at runtime among the relevant operators. 

Following up directly where we left the discussion of the end-to-end solution last time, in this article we will describe how you can use the “Swiss knife” of Flink – the Process Function to create an implementation that is tailor-made to match your streaming business logic requirements. Our discussion will continue in the context of the Fraud Detection engine. We will also demonstrate how you can implement your own custom replacement for time windows for cases where the out-of-the-box windowing available from the DataStream API does not satisfy your requirements. In particular, we will look at the trade-offs that you can make when designing a solution which requires low-latency reactions to individual events.

This article will describe some high-level concepts that can be applied independently, but it is recommended that you review the material in part one and part two of the series as well as check out the code base in order to make it easier to follow along.

It’s worth giving this a careful read.
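
For readers who have not worked with it before, here is a bare-bones sketch of the building block the article relies on: a KeyedProcessFunction that keeps per-key state and registers event-time timers, which is the foundation for rolling your own window-like behavior when the built-in windows don't fit. The Transaction and Alert types, their getters, and the window length are illustrative assumptions, not the article's actual code.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Transaction and Alert are assumed POJOs standing in for the fraud-detection
// engine's event and output types.
public class CustomWindowFunction extends KeyedProcessFunction<String, Transaction, Alert> {

    private static final long WINDOW_MS = 60 * 60 * 1000; // illustrative one-hour "window"

    private transient ValueState<Double> runningTotal;

    @Override
    public void open(Configuration parameters) {
        runningTotal = getRuntimeContext().getState(
                new ValueStateDescriptor<>("runningTotal", Double.class));
    }

    @Override
    public void processElement(Transaction tx, Context ctx, Collector<Alert> out) throws Exception {
        // accumulate per-key state as each event arrives (the low-latency path)
        Double total = runningTotal.value();
        runningTotal.update((total == null ? 0.0 : total) + tx.getAmount());

        // schedule evaluation/clean-up at the end of our hand-rolled window
        ctx.timerService().registerEventTimeTimer(tx.getEventTime() + WINDOW_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) throws Exception {
        // evaluate the accumulated state, optionally emit an Alert, then clear it
        runningTotal.clear();
    }
}
```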


Finding Distance Between Cities using SQL Server

Hasan Savran wants to find geographic distances with SQL Server:

I wrote about finding the distance between two locations in my older post. I have been getting questions about how to make the search by using data in SQL Server. In this post, I will try to answer all these questions. You do not need to have the latest version of SQL Server to do any of these examples. SQL Server has been supporting Geospatial data since 2008.

First, we need some data. Not just any data, some free spatial data. I want to show you how to find the distance between cities in this post, so I need at least the names of the cities and their latitude and longitude. I downloaded this data from the SimpleMaps website in CSV format.

Read on for the solution.
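
As a rough illustration of the building blocks involved (the city names and coordinates below are sample values, not from the SimpleMaps data set), the geography type's Point and STDistance methods do most of the work:

```sql
-- Build two geography points (latitude, longitude, SRID 4326) and measure
-- the distance between them; STDistance returns meters for this SRID.
DECLARE @Seattle  geography = geography::Point(47.6062, -122.3321, 4326);
DECLARE @Portland geography = geography::Point(45.5152, -122.6784, 4326);

SELECT @Seattle.STDistance(@Portland) / 1000.0 AS DistanceInKm;
```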


Sending Messages from SQL Agent to Microsoft Teams

Rob Sewell is waiting for a message and it comes in two parts. First up, sending SQL Agent results to a Teams channel:

Using dbatools we can create a simple script to gather the results of Agent Jobs from a list of instances. Maybe it would be good to be able to get the job run results every 12 hours, so that at 6am the early-bird DBA can quickly identify if there are any failures that need immediate action, and at 6pm the team can check that everything was OK before they clock off.

But that’s not enough for Rob:

Following on from yesterday's post about creating an overview of SQL Agent Job results and sending it to a Teams channel, I was given another challenge:

Can you write a job step that I can add to SQL Agent jobs that can send the result of that job to a Teams channel?

The use case was for some migration projects with steps scheduled via SQL Agent Jobs. Instead of the DBA having to estimate when they would finish and keep checking so that they could let the next team know it was time for their part to start, they wanted the job to notify a Teams channel. This turned out to be especially useful, as the job finished earlier than expected at 3am and the off-shore team could begin their work immediately.

Read the whole thing, as Rob has some detailed code examples.
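
As a very rough sketch of the overall shape (not Rob's code; the instance names, webhook URL, and the output properties used for filtering are assumptions to verify against what Get-DbaAgentJobHistory actually returns on your instances), the pattern is: collect job history with dbatools, summarize it, and POST it to a Teams incoming webhook.

```powershell
# Gather recent Agent job results from a list of instances with dbatools,
# then post a simple summary to a Teams channel via an incoming webhook.
$results = Get-DbaAgentJobHistory -SqlInstance 'SQL01', 'SQL02' |
    Where-Object { $_.RunDate -gt (Get-Date).AddHours(-12) }

$failed = ($results | Where-Object { $_.Status -eq 'Failed' }).Count
$text   = "Agent job runs in the last 12 hours: $($results.Count) total, $failed failed."

# Teams incoming webhooks accept a simple JSON payload with a "text" property.
$body = @{ text = $text } | ConvertTo-Json

Invoke-RestMethod -Method Post `
    -Uri 'https://example.webhook.office.com/webhookb2/your-webhook-id' `
    -ContentType 'application/json' `
    -Body $body
```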


X Functions in DAX

Matt Allington explains what that “X” means in functions like MEDIANX and also builds out a problem to solve:

An X function is a class of functions in DAX that are also known as “iterators” (note, there are other iterating functions too, like FILTER, but I am only referring to the X functions here). It is a big topic on its own, and this article is not going to be the definitive guide to X functions. But I will give you a couple of insights.

I have learnt a lot about how to teach people DAX over the last 6 years, and my teaching methods have evolved over that time. I remember fondly speaking at the Microsoft Data Insights Summit with Will Thompson on the topic “DAX 50 – DAX for the rest of us”. Will said to me, “don’t mention the word ‘iterator’ as it is too confusing.” I didn’t agree with Will at the time, but his comment stuck with me. Over time I have changed the way I teach DAX. These days I show people how to add a calculated column in a table (everyone can do that – it’s dead easy). Then I explain that an X function does exactly the same thing, it’s just that you can’t see the interim results materialised in front of your eyes.

Click through to see what Matt means and stay tuned for the next episode of the X Functions, where Matt reveals that there are aliens by use of bi-directional relationships.
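
To make the calculated-column analogy concrete, here is a minimal illustration (not Matt's example; it assumes a Sales table with Quantity and Price columns):

```dax
-- Calculated column: the row-by-row result is materialised in the table,
-- so you can see every interim value.
Line Total = Sales[Quantity] * Sales[Price]

-- Measure using an X function: SUMX iterates the same rows and evaluates the
-- same expression per row, but the interim results are never materialised.
Total Sales = SUMX ( Sales, Sales[Quantity] * Sales[Price] )
```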


Reusing a Recordset Stored as an SSIS Object Variable

Tim Mitchell appeases the masses:

A few years back, I wrote a blog post about using an SSIS object variable as a data flow source. In that post, I described how you could load a set of query results into an object-typed variable in SQL Server Integration Services and then use that in-memory data as a source within a data flow. In the comments and the feedback I got on that post, the same question kept coming up: what is the process for reusing a recordset in an SSIS object variable in the same package?

In this post, I’ll show how you can modify the scripts within your SSIS package to allow reprocessing of the same set of results in an object variable.

Tim has a nice workaround for the problem, so check it out.
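
For context, the usual way to consume an object variable inside a Script Task looks roughly like this (a simplified sketch with an illustrative variable name, not Tim's corrected script). Filling a DataTable this way advances the recordset's cursor to the end, which is exactly why a second consumer sees no rows and why the scripts need the adjustments Tim describes.

```csharp
using System.Data;
using System.Data.OleDb;

// Inside an SSIS Script Task: load the ADO recordset stored in the
// object-typed variable into a DataTable for row-by-row processing.
DataTable dt = new DataTable();
OleDbDataAdapter adapter = new OleDbDataAdapter();
adapter.Fill(dt, Dts.Variables["User::vResults"].Value);

foreach (DataRow row in dt.Rows)
{
    // process each row here
}
```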


Converting Hexadecimal to Decimal with Power BI

Soheil Bakhshi has a great function for us:

A while ago I wrote a blog post on how to use Unicode characters in Power BI. In that blog post I used a recursive Power Query function to convert Hex values to Dec values. A few weeks back, one of my site visitors kindly shared his non-recursive version of the Power Query function which beautifully does the job. A big shout out to Rocco Lupoi for sharing his code. So, I decided to share it with everyone so more people can leverage his nice Power Query function. I have touched his code a bit, but it was more of a cosmetic change, so all the credit for the post goes to Rocco.

Click through for the details.
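
The underlying idea can be sketched without recursion in a few lines of Power Query M (a minimal illustration, not Rocco's actual function): fold over the hex digits, multiplying the running total by 16 and adding each digit's value.

```powerquery
let
    // Illustrative non-recursive hex-to-decimal conversion using List.Accumulate.
    HexToDec = (hex as text) as number =>
        List.Accumulate(
            Text.ToList(Text.Upper(hex)),
            0,
            (state, current) => state * 16 + Text.PositionOf("0123456789ABCDEF", current)
        ),
    Result = HexToDec("1E240")  // 123456
in
    Result
```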


Postgres Change Data Capture into Kafka

Abhishek Gupta walks us through an example of change data capture to track events:

Change Data Capture (CDC) is a technique used to track row-level changes in database tables in response to create, update and delete operations. Different databases use different techniques to expose these change data events – for example, logical decoding in PostgreSQL, the binary log (binlog) in MySQL, etc. This is a powerful capability, but useful only if there is a way to tap into these event logs and make them available to other services which depend on that information.

Debezium does just that! It is a distributed platform that builds on top of Change Data Capture features available in different databases. It provides a set of Kafka Connect connectors which tap into row-level changes (using CDC) in database table(s) and convert them into event streams. These event streams are sent to Apache Kafka which is a scalable event streaming platform – a perfect fit! Once the change log events are in Kafka, they will be available to all the downstream applications.

Click through for the demo, using Azure components.
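
For a sense of what the wiring looks like, a Debezium Postgres connector is registered with Kafka Connect through a small JSON configuration along these lines (hostname, credentials, and table names are placeholders, property names can vary between Debezium releases, and Abhishek's demo layers the Azure-specific settings on top):

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "my-postgres.example.com",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "********",
    "database.dbname": "inventory",
    "database.server.name": "inventory-server",
    "plugin.name": "pgoutput",
    "table.include.list": "public.orders"
  }
}
```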


Spark Direct Reader in Hive

Anishek Agarwal, et al, announce a new reader for Hive Warehouse Connector:

Apache Hive supports transactional tables which provide ACID guarantees. There has been a significant amount of work that has gone into Hive to make these transactional tables highly performant. Apache Spark provides some capabilities to access Hive external tables, but it cannot access Hive managed tables. To access Hive managed tables from Spark, the Hive Warehouse Connector needs to be used.

We are happy to announce Spark Direct Reader mode in Hive Warehouse Connector which can read hive transactional tables directly from the filesystem. This feature has been available from CDP-Public-Cloud-2.0 (7.2.0.0) and CDP-DC-7.1 (7.1.1.0) releases onwards.

Hive Warehouse Connector (HWC) was available to provide access to managed tables in Hive from Spark; however, since this involved communication with LLAP, there was an additional hop to get the data and process it in Spark, versus Spark's ability to read the data directly from the filesystem for external tables. This leads to performance degradation when accessing data from managed tables versus external tables. Additionally, a lot of use cases for HWC were associated with ETL jobs where a super user was running these jobs to update data in multiple tables, hence authorization was not a strong business need for this case. HWC Spark Direct Reader is an additional mode available in HWC which tries to address the above concerns. This article describes the usage of the Spark Direct Reader to consume Hive transactional table data in a Spark application. It also introduces the methods and APIs to read Hive transactional tables into Spark dataframes. Finally, it demonstrates the transaction handling and semantics while using this reader.

Click through to learn how it works and see it in action.
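
As a very small sketch of the end state (not the article's example; the configuration properties that switch HWC into Direct Reader mode are release-specific and listed in the article, and the database and table names below are made up), once the mode is enabled a Hive transactional table can be read with plain Spark SQL into an ordinary DataFrame:

```scala
import org.apache.spark.sql.SparkSession

// With Direct Reader mode configured for the session (see the article for the
// exact spark-submit properties), Hive ACID tables are read straight from the
// filesystem rather than through LLAP.
val spark = SparkSession.builder()
  .appName("direct-reader-demo")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("SELECT id, amount FROM sales_db.transactions WHERE amount > 100")
df.show()
```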


Halloween Problem and Inserts

Jared Poche continues a dive into the Halloween Problem:

I would have expected us to scan the temp table, then have a LEFT JOIN to the base table. The Table Spool is the red flag that we have an issue with the plan, and is frequently seen with Halloween protections.

The index scan on the base table seems to be overkill since we’re joining on the primary key columns (the key lookup isn’t much of a concern). But we’re likely doing the scan because of the spool; it’s SQL Server’s way of getting all relevant records in one place at one time, breaking the normal flow of row mode operation, to make sure we don’t look up the same record multiple times.

Read on to see the execution plan as well as Jared’s fix.
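
For readers who want the shape of the statement in mind before clicking through, the pattern under discussion looks roughly like this (table and column names are made up, not Jared's repro):

```sql
-- Insert only the staged rows that don't already exist in the target.
-- Because the INSERT both reads from and writes to dbo.Orders, SQL Server
-- may add an eager spool (Halloween protection) to separate the read phase
-- from the write phase, which is the behavior Jared digs into.
INSERT dbo.Orders (OrderID, CustomerID, Amount)
SELECT s.OrderID, s.CustomerID, s.Amount
FROM #OrderStaging AS s
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.Orders AS o
    WHERE o.OrderID = s.OrderID
);
```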
