Day: December 2, 2022

Defining an Analytics Engineer

Ust Oldfield defines a term:

Analytics Engineering, along with Data Engineering and Report Engineering, is a specialised subset of skills that would previously be the preserve of a Business Intelligence (BI) Developer. The BI Developer was once a generalist data developer, whose overall responsibilities have been split out and shared among specialist developers as the prevalence of data across organisations has increased and the tools and technologies used to ingest, transform, and serve data have become more specialised and loosely integrated.

In the same way that Data Engineering borrowed and took inspiration from Software Engineering for applying repeatable and scalable patterns and techniques to the pipelines that ingest and cleanse data, as well as the rigorous testing of those pipelines, Analytics Engineering has borrowed and taken inspiration from Software Engineering too.

Click through for the specifics of what an Analytics Engineer does.

An Intro to Azure Machine Learning

Tomaz Kastrun has a new Advent challenge:

Azure Machine Learning (or Azure Machine Learning Service, abbreviated AML) is Azure’s cloud service for creating, managing, and productionalising machine learning projects. It is a collaborative tool for Data Scientists, Machine Learning Engineers, and Data Engineers, covering their daily and operational tasks, from creating and training to deploying and managing predictive models and machine learning solutions.

Click through for the introduction.
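
If you want a taste of what working against the service looks like in code, here is a minimal sketch, assuming the v2 Python SDK (azure-ai-ml) and placeholder subscription, resource group, and workspace names, that connects to a workspace and lists its compute targets.

```python
# A minimal sketch using the Azure ML Python SDK (v2); the subscription,
# resource group, and workspace names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# Authenticate with whatever credential is available locally
# (Azure CLI login, managed identity, environment variables, and so on).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# List the compute targets registered in the workspace.
for compute in ml_client.compute.list():
    print(compute.name, compute.type)
```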

Data Lake Exploration in AWS with Athena for Spark

Pathik Shah and Raj Devnath jetski the data lake:

Amazon Athena now enables data analysts and data engineers to enjoy the easy-to-use, interactive, serverless experience of Athena with Apache Spark in addition to SQL. You can now use the expressive power of Python and build interactive Apache Spark applications using a simplified notebook experience on the Athena console or through Athena APIs. For interactive Spark applications, you can spend less time waiting and be more productive because Athena instantly starts running applications in less than a second. And because Athena is serverless and fully managed, analysts can run their workloads without worrying about the underlying infrastructure.

Data lakes are a common mechanism to store and analyze data because they allow companies to manage multiple data types from a wide variety of sources, and store this data, structured and unstructured, in a centralized repository. Apache Spark is a popular open-source, distributed processing system optimized for fast analytics workloads against data of any size. It’s often used to explore data lakes to derive insights. For performing interactive data explorations on the data lake, you can now use the instant-on, interactive, and fully managed Apache Spark engine in Athena. It enables you to be more productive and get started quickly, spending almost no time setting up infrastructure and Spark configurations.

In this post, we show how you can use Athena for Apache Spark to explore and derive insights from your data lake hosted on Amazon Simple Storage Service (Amazon S3).

This feels a lot like the Spark pool in Azure Synapse Analytics, as well as some of what Databricks does.
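
As a rough illustration of that notebook experience, here is a minimal PySpark sketch of the sort of exploration the post describes; inside an Athena for Apache Spark notebook a spark session is already provided, and the bucket, path, and column names below are placeholders.

```python
# A minimal PySpark sketch of exploring data in S3. In an Athena for Apache
# Spark notebook a `spark` session is pre-created, so the builder below is
# only needed when running elsewhere. Bucket, path, and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-lake-exploration").getOrCreate()

# Read a Parquet dataset straight from the data lake.
sales = spark.read.parquet("s3://<your-bucket>/sales/")

# A quick aggregate to get a feel for the data.
(sales
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
    .orderBy(F.desc("total_amount"))
    .show(10))
```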

Apache Ranger on ElasticMapReduce

Laurence Geng looks at Ranger:

Whether you’ve successfully made it before or not, installing and integrating Windows AD/OpenLDAP + Ranger + EMR is a very hard job, it is complicated, error-prone, and time-consuming for the following reasons:

Read on for the list of reasons, some background on Ranger, and an automated installer intended to make life a bit easier.

Pre-Attentive Attributes and Visualization

Alex Velez hits on an important topic:

Have you ever wondered whether the graph or the slide you created is any good? Was the time you spent choosing colors, deleting gridlines, and wordsmithing slide titles worth it, or for naught? While the answer is certainly more nuanced than a simple yes or no, there is a quick way to gain some insight into this: the “where are your eyes drawn?” test, also known as WAYED. It’s a simple question, but it can help to refine your own creations and provides a construct for giving feedback to others.

Click through for the test and a bit more information about pre-attentive attributes.

Using the Kusto Time Pivot Chart

Chango Valtchev reminds us of Gantt charts:

This is the scenario: We have a job scheduler and a related job deployment manager, both implemented based on a state machines framework. One of the scheduler features is preemptable jobs: Jobs of that class can be suspended when a high-priority job needs to be scheduled and there is no available capacity. Effecting preemption requires some involved orchestration between the scheduler and the deployment manager, and we’ve had reliability issues in some cases – both due to incorrectly handled races and latency spikes in the cleanup of the suspended jobs from the cluster.

Debugging such issues based on the raw logs has been very tedious – a typical log is 10-30K lines. This gets much worse with the number of dependencies. Given the concurrent processing of the suspensions, tracking the interactions with the new job’s deployment can be mentally taxing. The timeline visualization brought a breakthrough to our debugging ability and productivity.

The following sample is a purposefully simplified case. In this scenario, things worked well. It shows the ‘Main’ job, at high priority, waiting on its dependencies to be suspended (while waiting, “Skipped schedule processing” is logged). Shortly after all the suspensions complete, the main job gets to Running state.

Read on to see the scenario in action.
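
The time pivot chart itself is a Kusto rendering feature, but as a rough stand-in, here is a minimal matplotlib sketch (not Kusto) that draws the same sort of Gantt-style timeline of job states from made-up intervals, just to illustrate the kind of view being described.

```python
# A stand-in illustration (not Kusto itself): rendering job state intervals as
# a Gantt-style timeline with matplotlib, similar in spirit to the time pivot
# chart described above. The jobs and timestamps are made up.
import matplotlib.pyplot as plt

# (start second, duration in seconds) spans for a few hypothetical job states.
intervals = {
    "Main (Pending)":     [(0, 40)],
    "Dep-1 (Suspending)": [(5, 20)],
    "Dep-2 (Suspending)": [(5, 30)],
    "Main (Running)":     [(40, 60)],
}

fig, ax = plt.subplots(figsize=(8, 3))
for row, (label, spans) in enumerate(intervals.items()):
    ax.broken_barh(spans, (row - 0.3, 0.6))

ax.set_yticks(range(len(intervals)))
ax.set_yticklabels(list(intervals.keys()))
ax.set_xlabel("Seconds since scenario start")
ax.set_title("Job state timeline (illustrative)")
plt.tight_layout()
plt.show()
```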

Data Exfiltration Protection and Synapse Pipelines

Luke Moloney shuts it down:

Before we discuss how DEP applies to Synapse Pipelines, it is important to level-set on some Synapse Pipelines-specific concepts – if you are familiar with Synapse Pipelines or Azure Data Factory, you can skip over this section and jump to Synapse Pipeline connectivity without DEP enabled.

For a more generalized introduction to Synapse Pipelines check out this doc article.

Synapse Pipelines enables users to connect to a range of different data services through what is called a Linked Service.

The big trick, using self-hosted integration runtimes, is something Luke spends a fair amount of time on.

Physical and Logical Backups in MySQL

Lukas Vileikis continues a series on MySQL backups:

Everyone who has ever backed up data using any kind of RDBMS knows a thing or two about backups. Backups are a central part of data integrity – especially nowadays, when data breaches are happening left and right. Properly tested backups are crucial to any company: once something happens to our data, they help us quickly get back on track.

However, some of you may have heard about the differences between backups in database management systems – backups themselves are also classified into a couple of distinct forms. We’re talking about the physical and logical forms – these have their own advantages and downsides: let’s explore those and the differences between the two. This tutorial is geared towards MySQL, but we will also provide some advice that is not exclusive to MySQL.

Click through to learn those differences.
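
To make the distinction concrete, here is a hypothetical Python sketch of a logical backup that shells out to mysqldump (the connection details and database name are placeholders); a physical backup, by contrast, copies the server's underlying data files, typically with a dedicated tool such as Percona XtraBackup rather than hand-rolled code.

```python
# A hypothetical sketch of a logical backup: mysqldump writes out the SQL
# statements needed to recreate the database. Connection details and the
# database name are placeholders; mysqldump must be on the PATH.
import subprocess

def logical_backup(database: str, user: str, password: str, out_file: str) -> None:
    """Dump one database to a SQL file using mysqldump."""
    with open(out_file, "w") as f:
        subprocess.run(
            [
                "mysqldump",
                f"--user={user}",
                f"--password={password}",
                "--single-transaction",  # consistent snapshot for InnoDB tables
                database,
            ],
            stdout=f,
            check=True,
        )

if __name__ == "__main__":
    logical_backup("mydb", "backup_user", "secret", "mydb_dump.sql")
```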
