Press "Enter" to skip to content

Category: Cloud

Streaming Pipelines in AWS with Flink and Kinesis Data Analytics

Steffen Hasumann shows us how to put together a streaming ETL pipeline in AWS using Apache Flink and Amazon Kinesis Data Analytics:

The remainder of this post discusses how to implement streaming ETL architectures with Apache Flink and Kinesis Data Analytics. The architecture persists streaming data from one or multiple sources to different destinations and is extensible to your needs. This post does not cover additional filtering, enrichment, and aggregation transformations, although that is a natural extension for practical applications.

This post shows how to build, deploy, and operate the Flink application with Kinesis Data Analytics, without further focusing on these operational aspects. It is only relevant to know that you can create a Kinesis Data Analytics application by uploading the compiled Flink application jar file to Amazon S3 and specifying some additional configuration options with the service. You can then execute the Kinesis Data Analytics application in a fully managed environment. For more information, see Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics for Java Applications and the Amazon Kinesis Data Analytics developer guide.

Click through for the walkthrough.

Comments closed

Receiving Notifications when Azure Function Apps Fail

Gilbert Quevauvilliers shares how to receive notification e-mails when an Azure Function App fails:

Below are the steps to enable error notifications on Azure Function Apps

Follows on from my previous blog post How you can store All your Power BI Audit Logs easily and indefinitely in Azure, where every day it extracts the Audit logs into Azure Blob storage. One of the key things when working with any job that runs, is that I want to know when the job fails. If I do not have this and I assume that the data is always where, I could fall into a situation where there is missing data that I cannot get back.

Below explains how to create an alert with a notification email if an Azure Function App fails.

Read on for the step-by-step instructions.

Comments closed

Understanding Azure SQL Database Elastic Jobs

Kate Smith takes us through some important concepts around Elastic Jobs in Azure SQL Database:

It is very important that the T-SQL scripts being executed by Elastic Jobs be idempotent.  This means that if they are run multiple times (by accident or intentionally) they won’t fail and won’t produce unintended results. If an elastic job has some side effects, and gets run more than once, it could fail or cause other unintended consequences (like consuming double the resources needed for a large statistics update).  One way to ensure idempotence is to make sure that you check if something already exists before trying to create it.

This takes some getting used to, but once you’re in the habit, you are much better off. Read on for more details on other key concepts.

Comments closed

MR3: Hive on Kubernetes

Alex Woodie reports on a DataMonad production:

MR3 is a software product developed by a team led by Sungwoo Park. The software, which is not open source, is sold by a Delaware-based software company called DataMonad. After prototyping a Java-based execution engine called MR2 in the 2013 timeframe, development of Scala-based MR3 began in 2015. The first release of MR3 was delivered in early 2018, and version 1.0 was released yesterday.

According to DataMonad, MR3 is an execution engine for big data processing, and Hive is the first and main application that’s been configured to run on it (Tez is also supported). The company says MR3 offers comparable performance to the latest release of Hive, dubbed LLAP, but without the technical complexity.

The closed-sourcedness is a bit of a downer, but I like having more competition in the space.

Comments closed

Executing Azure Data Factory Pipelines with Azure Functions

Paul Andrew wants to execute an Azure Data Factory pipeline via an Azure Function call:

For the function itself, hopefully this is fairly intuitive once you’ve created your DataFactoryManagementClient and authenticated.

The only thing to be careful of is not using the CreateOrUpdateWithHttpMessagesAsync method by mistake. Make sure its Create Run. Sounds really obvious, but when you get code drunk names blur together and the very different method overloads will have you confused for hours!…. According to a friend 🙂

Read the whole thing.

Comments closed

Reading Azure DevOps Results in Powershell

Mark Broadbent doesn’t let the lack of an official Powershell module get in the way:

In my post Using Azure CLI to query Azure DevOps I explained how you can use the Azure CLI to query Azure DevOps so you can obtain useful information on builds, releases, and other useful information. The solution required a certain level of skill with JMESPath to manipulate your result sets -which as explained can be a little confusing.

However once you have a bare bones result set, it is likely that you will want to consume these results in a more user-friendly environment such as PowerShell so that you can build upon these data sets. I thought this would be an easy thing to do, but as you will see below it was anything but.

Read on for some thoughts and a sample script.

Comments closed

Storing Power BI Audit Logs in Blob Storage

Gilbert Quevauvilliers works around a built-in constraint with Power BI Audit Logs:

With the new Power BI Get-PowerBIActivityEvent I wanted to find a way where I could automate the entire process where it all runs in the cloud.

One of the current challenges with the Audit logs is that they only store 90 days, so if you want to do analysis for longer than 90 days the log files have to be stored somewhere. Why not use Azure Blob Storage?

Whilst these steps might appear to be rather technical if you follow them and you have access to an Azure Subscription you can do this too.

Gilbert warns us up-front that this will be a lengthy post and that is quite true. But if you need to hold those audit logs more than 90 days, this is a great way of doing so.

Comments closed

Connecting to Snowflake with Power BI

Gilbert Quevauvilliers shows us how we can connect from a Snowflake DB instance to Power BI using DirectQuery:

The first thing I did was to install the ODBC Drivers.

I installed the 64bit drivers where I had my Power BI Desktop installed, and I also installed it on all the Servers where I had the On-Premise Data gateway installed.

Below is the link that I used which should always be the latest version

https://sfc-repo.snowflakecomputing.com/odbc/win64/latest/index.html

One thing to note is all that I did was I installed the ODBC driver I did not actually do any configuration of the ODBC driver, this is because it will be configured in Power BI Desktop.

Read on for the configuration instructions as well as getting past “it works in Power BI Desktop.”

Comments closed

Quick Hits on Azure Databricks Performance

Rayis Imayev has a few thoughts on optimizing delta table-based workloads in Azure Databricks:

2) Enable the Delta cache – spark.databricks.io.cache.enabledtrue
There is a very good resource available on configuring this Spark config setting: https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/delta-cache

And this will be very helpful in your Databricks notebook’s queries when you try to access a similar dataset multiple times. Once you read this dataset for the first time, Spark places it into internal local storage cache and will speed up the process of further referencing it for you.

Click through for several more along these lines.

Comments closed

Fun with Metaphors: Data Lakehouses

Ben Lorica, et al, have a new metaphor to try out:

Over the past few years at Databricks, we’ve seen a new data management paradigm that emerged independently across many customers and use cases: the lakehouse. In this post we describe this new paradigm and its advantages over previous approaches.

The Data Lake’s Aristotelian counterpart is the Data Swamp. I’m working on a similar comp for the Data Lakehouse (Data Swampboat? Data Swamphouse is too easy), but in the meantime, that one person who goes and slaughters your application’s performance by butchering the data in your Data Lakehouse? That’s a Data Jason.

1 Comment