Press "Enter" to skip to content

Month: December 2022

An Intro to Azure Machine Learning

Tomaz Kastrun has a new Advent challenge:

Azure Machine Learning (or Azure Machine Learning Service and abbreviation AML) is Azure’s cloud service for creating, managing and productionalising machine learning projects. It is a collaborative tool for Data Scientists, Machine Learning Engineers, and data engineers, covering their daily and operational tasks. From creating and training to deploying and managing predictive models and machine learning solutions.

Click through for the introduction.

Comments closed

Data Lake Exploration in AWS with Athena for Spark

Pathik Shah and Raj Devnath jetski the data lake:

Amazon Athena now enables data analysts and data engineers to enjoy the easy-to-use, interactive, serverless experience of Athena with Apache Spark in addition to SQL. You can now use the expressive power of Python and build interactive Apache Spark applications using a simplified notebook experience on the Athena console or through Athena APIs. For interactive Spark applications, you can spend less time waiting and be more productive because Athena instantly starts running applications in less than a second. And because Athena is serverless and fully managed, analysts can run their workloads without worrying about the underlying infrastructure.

Data lakes are a common mechanism to store and analyze data because they allow companies to manage multiple data types from a wide variety of sources, and store this data, structured and unstructured, in a centralized repository. Apache Spark is a popular open-source, distributed processing system optimized for fast analytics workloads against data of any size. It’s often used to explore data lakes to derive insights. For performing interactive data explorations on the data lake, you can now use the instant-on, interactive, and fully managed Apache Spark engine in Athena. It enables you to be more productive and get started quickly, spending almost no time setting up infrastructure and Spark configurations.

In this post, we show how you can use Athena for Apache Spark to explore and derive insights from your data lake hosted on Amazon Simple Storage Service (Amazon S3).

This feels a lot like the Spark pool in Azure Synapse Analytics, as well as some of what Databricks does

Comments closed

Apache Ranger on ElasticMapReduce

Laurence Geng looks at Ranger:

Whether you’ve successfully made it before or not, installing and integrating Windows AD/OpenLDAP + Ranger + EMR is a very hard job, it is complicated, error-prone, and time-consuming for the following reasons:

Read on for the list of reasons, some background on Ranger, and an automated installer intended to make life a bit easier.

Comments closed

Pre-Attentive Attributes and Visualization

Alex Velez hits on an important topic:

Have you ever wondered whether the graph or the slide you created is any good? Was the time you spent choosing colors, deleting gridlines, and wordsmithing slide titles, worth it, or for naught? While the answer is certainly more nuanced than a simple yes or no, there is a quick way to gain some insight into this: the where are your eyes drawn? test, also known as WAYED. It’s a simple question, but it can help to refine your own creations, and provides a construct for giving feedback to others.

Click through for the test and a bit more information about pre-attentive attributes.

Comments closed

Using the Kusto Time Pivot Chart

Chango Valtchev reminds us of Gantt charts:

This is the scenario: We have a job scheduler and a related job deployment manager, both implemented based on a state machines framework. One of the scheduler features is preemptable jobs: Jobs of that class can be suspended when a high-priority job needs to be scheduled and there is no available capacity. Effecting preemption requires some involved orchestration between the scheduler and the deployment manager, and we’ve had reliability issues in some cases – both due to incorrectly handled races and latency spikes in the cleanup of the suspended jobs from the cluster. Debugging such issues based on the raw logs has been very tedious – a typical log is 10-30K lines. This gets much worse with the number of dependencies. Given the concurrent processing of the suspensions, tracking the interactions with the new job’s deployment can be mentally taxing. The timeline visualization brought a breakthrough to our debugging ability and productivity. The following sample is a purposefully simplified case. In this scenario, things worked well. It shows the ‘Main’ job, at high priority, waiting on its dependencies to be suspended (while waiting, “Skipped schedule processing” is logged). Shortly after all the suspensions complete, the main job gets to Running state.

Read on to see the scenario in action.

Comments closed

Data Exfiltration Protection and Synapse Pipelines

Luke Moloney shuts it down:

Before we discuss how DEP applies to Synapse Pipelines, it is important to level-set on some Synapse Pipelines specific concepts – if you are familiar with Synapse Pipelines or Azure Data Factory you can skip over this section and jump to Synapse Pipeline connectivity without DEP enabled.

For a more generalized introduction to Synapse Pipelines check out this doc article.

Synapse Pipelines enables users to connect to a range of different data services, through what is called a Linked Service. 

The big trick, using self-hosted integration runtimes, is something Luke spends a fair amount of time on.

Comments closed

Physical and Logical Backups in MySQL

Lukas Vileikis continues a series on MySQL backups:

Everyone who has ever backed up data using any kind of RDBMS knows a thing or two about backups. Backups are a central part of data integrity – especially nowadays, when data breaches are happening left and right. Properly tested backups are crucial to any company: once something happens to our data, they help us quickly get back on track. However, some of you may have heard about the differences between backups in database management systems – backups are also classified into a couple of forms unique to themselves. We’re talking about the physical and logical forms – these have their own advantages and downsides: let’s explore those and the differences between the two. This tutorial is geared towards MySQL, but we will also provide some advice that is not exclusive to MySQL.

Click through to learn those differences.

Comments closed

The Problem of Reproducability in Neural Networks

Pete Warden explains a problem:

Last week I had a question from a colleague about reproducibility in TensorFlow, specifically in the 1.14 era. He wanted to be able to run the same training code multiple times and get exactly the same results, which on the surface doesn’t seem like an unreasonable expectation. Machine learning training is fundamentally a series of arithmetic operations applied repeatedly, so what makes getting the same results every time so hard? I had the same question when we first started TensorFlow, and I was lucky enough to learn some of the answers from the numerical programming experts on the team, so I want to share a bit of what I discovered.

Read on for that answer.

Comments closed

SQL Server 2022 Query Store Hints

Matthew McGiffen takes a hint:

Another neat little feature in SQL Server 2022 is Query Store Hints. This is the ability to apply a query hint through Query Store rather than having to modify existing code or fiddle around with plan guides.


Query hints are a way to influence optimizer behaviour towards generating desired execution plans for a given query. The word “hint” is a bit of a misnomer as usually they mandate what you wish to happen.

Right. They’re ‘hints’ in the way that my wife ‘hints’ that I should take out the garbage.

Comments closed

Where Extended Events Go by Default

Tom Zika is curious:

Have you ever wondered where the .xel file is saved when you create a new Extended Event session and don’t specify the full path (just the file name)?

Like so: [image removed because you should go to Tom’s site and see it, ed.]

Well, so did I and here’s what I’ve found out.

Click through to learn where these files end up if you don’t specify anything.

Comments closed