Press "Enter" to skip to content

July 30, 2020

Postgres Change Data Capture into Kafka

Abhishek Gupta walks us through an example of change data capture to track events:

Change Data Capture (CDC) is a technique used to track row-level changes in database tables in response to create, update and delete operations. Different databases use different techniques to expose these change data events – for example, logical decoding in PostgreSQL, MySQL binary log (binlog), etc. This is a powerful capability, but useful only if there is a way to tap into these event logs and make it available to other services which depend on that information.

Debezium does just that! It is a distributed platform that builds on top of Change Data Capture features available in different databases. It provides a set of Kafka Connect connectors which tap into row-level changes (using CDC) in database table(s) and convert them into event streams. These event streams are sent to Apache Kafka which is a scalable event streaming platform – a perfect fit! Once the change log events are in Kafka, they will be available to all the downstream applications.
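If you want to see the mechanism Debezium builds on in PostgreSQL, logical decoding is something you can poke at directly with the built-in test_decoding output plugin. This is just a sketch: it assumes wal_level is set to logical, and the customers table is made up for illustration.

-- Create a logical replication slot using the test_decoding plugin.
SELECT pg_create_logical_replication_slot('cdc_demo', 'test_decoding');

-- Any committed change is now captured by the slot.
INSERT INTO customers (id, name) VALUES (42, 'Contoso');

-- Peek at the recorded change events without consuming them.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('cdc_demo', NULL, NULL);

-- Drop the slot when finished so it does not hold back WAL.
SELECT pg_drop_replication_slot('cdc_demo');

Debezium's PostgreSQL connector does effectively this, but continuously, converting the changes into Kafka events along the way.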

Click through for the demo, using Azure components.


Spark Direct Reader in Hive

Anishek Agarwal, et al, announce a new reader for Hive Warehouse Connector:

Apache Hive supports transactional tables which provide ACID guarantees. There has been a significant amount of work that has gone into hive to make these transactional tables highly performant. Apache Spark provides some capabilities to access hive external tables but it cannot access hive managed tables. To access hive managed tables from spark, Hive Warehouse Connector needs to be used.

We are happy to announce Spark Direct Reader mode in Hive Warehouse Connector which can read hive transactional tables directly from the filesystem. This feature has been available from CDP-Public-Cloud-2.0 (7.2.0.0) and CDP-DC-7.1 (7.1.1.0) releases onwards.

Hive Warehouse Connector (HWC) was available to provide access to managed tables in hive from spark; however, since this involved communication with LLAP, there was an additional hop to get the data and process it in spark vs the ability of spark to directly read the data from the FileSystem for external tables. This leads to performance degradation in accessing data from managed tables vs external tables. Additionally, a lot of use cases for HWC were associated with ETL jobs where a super user was running these jobs to update data in multiple tables, hence authorization was not a strong business need for this case.

HWC Spark Direct Reader is an additional mode available in HWC which tries to address the above concerns. This article describes the usage of spark direct reader to consume hive transactional table data in a spark application. It also introduces the methods and APIs to read hive transactional tables into spark dataframes. Finally, it demonstrates the transaction handling and semantics while using this reader.
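For orientation, the tables in question are Hive full ACID (transactional) managed tables. A minimal sketch, with made-up names, looks like this; once Direct Reader mode is configured in the Spark session, an ordinary Spark SQL query against the table reads the ORC base and delta files straight from the filesystem rather than going through LLAP.

-- Hive DDL: a managed, transactional (full ACID) table.
CREATE TABLE sales_acid (
    order_id INT,
    amount   DECIMAL(10, 2)
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- The sort of query Spark Direct Reader can serve without the LLAP hop.
SELECT order_id, SUM(amount) AS total_amount
FROM sales_acid
GROUP BY order_id;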

Click through to learn how it works and see it in action.


Halloween Problem and Inserts

Jared Poche continues a dive into the Halloween Problem:

I would have expected us to scan the temp table, then have a LEFT JOIN to the base table. The Table Spool is the red flag that we have an issue with the plan, and is frequently seen with Halloween protections.

The index scan on the base table seems to be overkill since we’re joining on the primary key columns (the key lookup isn’t much of a concern). But we’re likely doing the scan because of the spool; it’s SQL Server’s way of getting all relevant records in one place at one time, breaking the normal flow of row mode operation, to make sure we don’t look up the same record multiple times.
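To make the shape of the problem concrete, here is a hypothetical statement of the pattern Jared describes: inserting rows from a temp table that do not already exist in the target, so the target table is both read and written in the same statement. Table and column names are made up.

-- dbo.Orders is both a source (in the LEFT JOIN) and the insert target,
-- so the optimizer may add a spool to separate the read from the write
-- (Halloween protection).
INSERT INTO dbo.Orders
    (OrderID, CustomerID, Amount)
SELECT s.OrderID, s.CustomerID, s.Amount
FROM #StagedOrders AS s
LEFT JOIN dbo.Orders AS o
    ON o.OrderID = s.OrderID
WHERE o.OrderID IS NULL;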

Read on to see the execution plan as well as Jared’s fix.


Storing SQL Server Database Files in Blob Storage

Tomaz Kastrun has a wacky idea:

Storing SQL Server database files in Azure blob storage is a great solution for all the databases that are often migrated between instances, servers, or virtual machines, or that would otherwise be divided between instances. This scenario also has a positive aspect to it, since the ability to create snapshot backups to Azure is seamless.

Following the steps, we will create an Azure Blob storage, where MSSQL Server database files will reside with MSSQL Server running on-prem. Assuming that you already have an Azure account (if not, you can get a free Azure account), let’s proceed by opening the Windows Terminal in PowerShell mode.
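For reference, the T-SQL side of this is roughly the following once the container and SAS token exist; the storage account, container, and database names below are placeholders.

-- The credential name must match the container URL; the secret is the
-- SAS token without the leading '?'.
CREATE CREDENTIAL [https://<storageaccount>.blob.core.windows.net/<container>]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token>';

-- Database files live directly in blob storage instead of on local disk.
CREATE DATABASE BlobDemoDb
ON (NAME = BlobDemoDb_data,
    FILENAME = 'https://<storageaccount>.blob.core.windows.net/<container>/BlobDemoDb_data.mdf')
LOG ON (NAME = BlobDemoDb_log,
    FILENAME = 'https://<storageaccount>.blob.core.windows.net/<container>/BlobDemoDb_log.ldf');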

I’m impressed that it worked and could see it being an option for small demo databases, but I can’t imagine performance would be good enough for a production scenario.


Loading Data from S3 into Power BI

Gilbert Quevauvilliers loves a challenge:

I really enjoy a good challenge, and with my customer they have all their data stored in AWS S3. Whilst there is no native connector, I thought there must be a way for me to get the data from AWS S3 into Power BI.

I did a bit of Googling and could not find any suitable solution. I also found and learnt that I could use AWS Athena to query the data living in S3. (I am definitely not an expert, nor do I have a lot of knowledge, in the AWS space. I am fortunate that I have other people who know AWS and were able to set up, configure, and give me the details to connect to S3 via AWS Athena)

Below are the steps on how I got this working.
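For a sense of what the Athena layer involves, it is essentially standard SQL over files already sitting in S3; this sketch uses made-up bucket, table, and column names, and Power BI then connects through the Athena ODBC driver and issues ordinary queries against the table.

-- Athena DDL: an external table over delimited files in an S3 bucket.
CREATE EXTERNAL TABLE web_sales (
    order_id  string,
    amount    double,
    sold_date date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-data-bucket/web_sales/';

-- The sort of query Power BI pushes through the ODBC connection.
SELECT sold_date, SUM(amount) AS total_sales
FROM web_sales
GROUP BY sold_date;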

Why they don’t have a proper connector is a bit of a head-scratcher to me given the sheer amount of data stored in S3 and the sheer number of connectors in Power BI.


Creating Power BI Measures via Visual Studio Code

Phil Seamark goes one step further with TOM:

My last blog introduced the idea of using Microsoft Visual Studio Code to work with Power BI Models. For this article, I build on that idea by showing how you can use a TOM-based script to automatically generate measures in your Power BI (or Azure Analysis Services) model.

For simplicity, the example in this blog will do the following:

– Connect to an instance of Power BI Desktop
– Iterate through every Table in the model
– Iterate through every Column in the “current” table from the outer loop
– If the Column is numeric and not hidden, create a simple [Sum of <column>] measure

Read on for demonstration code and a walkthrough of the process.


Ignoring security_error_ring_buffer_recorded Events

Erin Stellato recommends putting the system health extended event session on a diet:

In Aaron’s post he refers to the security_error_ring_buffer_recorded event as “unactionable noise”, and frankly I can’t think of a better term for it. I’ve never used it to troubleshoot any authentication/security issue, and I’m very confident that Microsoft isn’t using it in its current state either. In terms of the volume of that event, for the new client system that I looked at recently the five system_health files covered about four (4) hours of time. Of the 1,851,741 events captured in that time frame, the security_error_ring_buffer_recorded event showed up 1,838,882 times. That’s 99.3% of the events…absolute noise.

Erin shows you how to turn this off and get rid of a mess of unhelpful messages.
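The change itself boils down to a single statement against the built-in session; this is just a sketch, and Erin’s post covers the caveats around modifying system_health.

-- Remove the noisy event from the system_health extended event session.
ALTER EVENT SESSION [system_health] ON SERVER
DROP EVENT sqlserver.security_error_ring_buffer_recorded;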


Adding a Last Updated Time to Power BI Reports

Ed Hansenberry adds useful information to a Power BI report:

It is often useful to tell your users when the report refreshed so they have some idea of how current, or stale, the data is. Unfortunately, you cannot just add a TODAY() formula to your report as that will change each time they open the report. This method will give you the refresh time stamp and compensate for Daylight Savings Time.

This frustrates me a bit: displaying the time when the report was generated is really easy to do in Reporting Services and is certainly a good practice to follow when building reports, and yet in Power BI there’s a multi-step process involving writing M code to do something which ought to be trivial.
