Press "Enter" to skip to content

Category: Cloud

Sort Keys and Join Types in Amazon Redshift

Derik Hammer takes us through query tuning a nasty job on Amazon Redshift:

My team built a process to load from a couple of base tables, in our Amazon Redshift enterprise data warehouse, into another table which would act as a data mart entity. The data was rolled up and it included some derived fields. The SQL query had some complexity to it.

This process ran daily and was being killed by our operations team after running for 22 hours.

I stepped in to assist with performance tuning and discovered that join choices, such as INNER vs. OUTER joins, have a big impact on whether Redshift can use its sort keys or not.
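As a rough illustration of the kind of difference involved (the tables and columns below are hypothetical, not from Derik's post), a query whose join predicates and filters line up with the sort keys can let Redshift use zone maps to skip blocks, while rewriting the same logic as an outer join can push the optimizer toward scanning the joined table in full:

-- Hypothetical tables, each sorted on the column used for joining and filtering.
CREATE TABLE sales (
    sale_date   DATE,
    customer_id INT,
    amount      DECIMAL(10,2)
)
SORTKEY (sale_date);

CREATE TABLE returns (
    return_date DATE,
    customer_id INT,
    amount      DECIMAL(10,2)
)
SORTKEY (return_date);

-- With an INNER JOIN, the date predicate can restrict both sides of the join,
-- so Redshift can use the sort keys (via zone maps) to skip blocks.
SELECT s.sale_date, SUM(s.amount) AS net_sales
FROM sales AS s
INNER JOIN returns AS r
    ON  r.customer_id = s.customer_id
    AND r.return_date = s.sale_date
WHERE s.sale_date >= '2020-01-01'
GROUP BY s.sale_date;

-- Rewritten as an OUTER JOIN, the optimizer may no longer restrict the
-- outer-joined table with that predicate, which can mean scanning it in
-- full instead of taking advantage of its sort key.
SELECT s.sale_date, SUM(s.amount) AS net_sales
FROM sales AS s
LEFT OUTER JOIN returns AS r
    ON  r.customer_id = s.customer_id
    AND r.return_date = s.sale_date
WHERE s.sale_date >= '2020-01-01'
GROUP BY s.sale_date;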

Click through for more details and what Derik ended up doing.


Loading Data into Delta Lake

Prakash Chockalingam takes us through auto-loading Delta Lake from various sources:

Auto Loader is an optimized file source that overcomes all the above limitations and provides a seamless way for data teams to load the raw data at low cost and latency with minimal DevOps effort. You just need to provide a source directory path and start a streaming job. The new structured streaming source, called “cloudFiles”, will automatically set up file notification services that subscribe to file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory.

This does look interesting.


A Metadata-Driven Framework for ADF Pipelines

Paul Andrew has started a series on metadata-driven Azure Data Factory pipelines:

The concept of having a processing framework to manage our Data Platform solutions isn’t a new one. However, over time, changes in the technology we use mean the way we now deliver this orchestration has to change as well, especially in Azure. On that basis, and using my favourite Azure orchestration service, Azure Data Factory (ADF), I’ve created an alpha metadata-driven framework that could be used to execute all our platform processes. Furthermore, at various community events I’ve talked about bootstrapping solutions with Azure Data Factory, so now, as a technical exercise, I’ve rolled my own simple processing framework. Mainly, to understand how easily we can make it with the latest cloud tools and to fully exploit just how dynamic you can get a set of generational pipelines.

This first post lays out some of the architectural decisions and prep work needed for the series.


Creating an Elastic Jobs Agent

Kate Smith continues a series on elastic jobs in Azure SQL Database:

Having laid the conceptual groundwork for Elastic Jobs in two previous postings (1, 2), I am now going to create an elastic job and associated credentials using PowerShell. For this scenario, I have one or more databases with a table ‘T’ and statistics ‘tStats’. I want to enforce an update for these statistics every day. To do this, I need to check that my stats have been updated in the past day, and if not, update them. The T-SQL to update statistics on a table “T” with stats named “tStats” is simple:
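The statement itself, reconstructed here from the names in the excerpt and assuming the table lives in the dbo schema (Kate's post has the exact script), is just the one-liner below, optionally wrapped in the daily staleness check she describes:

UPDATE STATISTICS dbo.T (tStats);

-- A sketch of the "only if not updated in the past day" wrapper.
IF EXISTS (
    SELECT 1
    FROM sys.stats AS s
    WHERE s.object_id = OBJECT_ID('dbo.T')
      AND s.name = 'tStats'
      AND (STATS_DATE(s.object_id, s.stats_id) < DATEADD(DAY, -1, GETDATE())
           OR STATS_DATE(s.object_id, s.stats_id) IS NULL)
)
    UPDATE STATISTICS dbo.T (tStats);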

Click through for the PowerShell script.


Good Ideas for Designing Data Lakes

Prateek Shrivastava and Rangasayee Chandrasekaran share some advice on designing data lakes in the cloud:

Data generation and data collection across semi-structured and unstructured formats is both bursty and continuous. Inspecting, exploring and analyzing these datasets in their raw form is tedious, because the analytical engines scan the entire data set across multiple files. We recommend five ways to reduce data scanned and reduce query overheads –

Click through for the details.


Managing Performance on Azure SQL Managed Instances

Tim Radney has a few considerations for you if you want to start using Azure SQL Managed Instances:

Storage is a bit more difficult to plan for, because there are multiple factors to consider. For storage, you need to account for both overall storage size and I/O needs. How many GBs or TBs are needed for the SQL Server instance, and how fast does the storage need to be? How many IOPS and how much throughput is the on-premises instance using? For that, you must baseline your current workload, using perfmon to capture average and max MB/s and/or taking snapshots of sys.dm_io_virtual_file_stats to capture throughput utilization. This will give you an idea of what type of I/O and throughput you need in the new environment. Several customers I’ve worked with have missed this vital part of migration planning and have encountered performance issues due to selecting an instance level that didn’t support their workload.
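For reference, here is a minimal sketch of the sys.dm_io_virtual_file_stats snapshot Tim mentions. The DMV is cumulative since instance startup, so capture it twice and divide the byte deltas by the elapsed seconds to estimate MB/s:

-- Per-file cumulative I/O; run twice and diff the results to get throughput.
SELECT DB_NAME(vfs.database_id)              AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       vfs.num_of_bytes_read    / 1048576.0  AS mb_read,
       vfs.num_of_bytes_written / 1048576.0  AS mb_written,
       vfs.io_stall_read_ms,
       vfs.io_stall_write_ms,
       SYSDATETIME()                         AS capture_time  -- used to compute the snapshot interval
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON  mf.database_id = vfs.database_id
    AND mf.file_id     = vfs.file_id;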

Tim has a lot of good advice in here, so read the whole thing.


Streaming Pipelines in AWS with Flink and Kinesis Data Analytics

Steffen Hasumann shows us how to put together a streaming ETL pipeline in AWS using Apache Flink and Amazon Kinesis Data Analytics:

The remainder of this post discusses how to implement streaming ETL architectures with Apache Flink and Kinesis Data Analytics. The architecture persists streaming data from one or multiple sources to different destinations and is extensible to your needs. This post does not cover additional filtering, enrichment, and aggregation transformations, although that is a natural extension for practical applications.

This post shows how to build, deploy, and operate the Flink application with Kinesis Data Analytics, without further focusing on these operational aspects. It is only relevant to know that you can create a Kinesis Data Analytics application by uploading the compiled Flink application jar file to Amazon S3 and specifying some additional configuration options with the service. You can then execute the Kinesis Data Analytics application in a fully managed environment. For more information, see Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics for Java Applications and the Amazon Kinesis Data Analytics developer guide.

Click through for the walkthrough.


Receiving Notifications when Azure Function Apps Fail

Gilbert Quevauvilliers shares how to receive notification e-mails when an Azure Function App fails:

Below are the steps to enable error notifications on Azure Function Apps

This follows on from my previous blog post, How you can store All your Power BI Audit Logs easily and indefinitely in Azure, where every day it extracts the Audit logs into Azure Blob storage. One of the key things when working with any job that runs is that I want to know when the job fails. If I do not have this and I assume that the data is always there, I could fall into a situation where there is missing data that I cannot get back.

Below explains how to create an alert with a notification email if an Azure Function App fails.

Read on for the step-by-step instructions.


Understanding Azure SQL Database Elastic Jobs

Kate Smith takes us through some important concepts around Elastic Jobs in Azure SQL Database:

It is very important that the T-SQL scripts being executed by Elastic Jobs be idempotent.  This means that if they are run multiple times (by accident or intentionally) they won’t fail and won’t produce unintended results. If an elastic job has some side effects, and gets run more than once, it could fail or cause other unintended consequences (like consuming double the resources needed for a large statistics update).  One way to ensure idempotence is to make sure that you check if something already exists before trying to create it.
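A generic sketch of that existence-check pattern (the object names here are made up for illustration, not taken from Kate's post):

IF NOT EXISTS (
    SELECT 1
    FROM sys.tables AS t
    WHERE t.name = 'JobAudit'
      AND t.schema_id = SCHEMA_ID('dbo')
)
BEGIN
    -- Safe to re-run: the table is only created the first time through.
    CREATE TABLE dbo.JobAudit (
        RunId   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
        RunDate DATETIME2        NOT NULL DEFAULT SYSUTCDATETIME()
    );
END;

The same guard-first approach applies to index creation, statistics maintenance, and any other DDL or DML the job issues.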

This takes some getting used to, but once you’re in the habit, you are much better off. Read on for more details on other key concepts.


MR3: Hive on Kubernetes

Alex Woodie reports on a DataMonad production:

MR3 is a software product developed by a team led by Sungwoo Park. The software, which is not open source, is sold by a Delaware-based software company called DataMonad. After prototyping a Java-based execution engine called MR2 in the 2013 timeframe, development of Scala-based MR3 began in 2015. The first release of MR3 was delivered in early 2018, and version 1.0 was released yesterday.

According to DataMonad, MR3 is an execution engine for big data processing, and Hive is the first and main application that’s been configured to run on it (Tez is also supported). The company says MR3 offers comparable performance to the latest release of Hive, dubbed LLAP, but without the technical complexity.

The fact that it is closed-source is a bit of a downer, but I like having more competition in the space.
