
Author: Kevin Feasel

Creating an AML Workspace and Trying the Studio

Tomaz Kastrun continues an advent of Azure ML. First up, Tomaz creates a workspace:

You will select “New workspace”. For now, we will work on a workspace. But just to mention, the “New registry” will enable you to share assets among different workspaces, support multi-region replication and help you provision all resources to facilitate region replications.

From there, the focus shifts to using Azure Machine Learning Studio:

In this overview page, you can click the button “Launch studio” in the middle of the workspace or you can copy and paste the Studio web URL provided under the “Essentials” to start the Studio.

But before we launch the Studio, let’s explore some additional settings worth mentioning.


Bit Twiddling in T-SQL

Louis Davidson explains how bit operations work in T-SQL:

I expect that 99% of the people reading this would expect there to be a status table that contained the values of status. Seeing that this is a base 2 number, you may be in that 1% that thinks this might be a bitmask. But unless you have an eidetic memory, you probably don’t know what all of the bits mean.

A bitmask is a type of denormalization of values where instead of having a set of columns that have on or off values (no Null values), you encode it like:

Bitmasks make me break out the angry nun ruler. You can almost guarantee you’re doing something wrong if you design a bitmask as a column in a table.
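
For readers who have not met the encoding before, here is a minimal sketch of the idea in Python rather than T-SQL. It is not taken from Louis’s post, and the flag names and bit positions are hypothetical. Each on/off value gets its own power of two, and the stored column is the bitwise OR of whichever flags are set.

```python
# Hypothetical flags: each one owns a single bit (a distinct power of two).
FLAGS = {"active": 1, "verified": 2, "locked": 4, "archived": 8}


def encode(names):
    """Pack a collection of flag names into one integer using bitwise OR."""
    value = 0
    for name in names:
        value |= FLAGS[name]
    return value


def decode(value):
    """Return the names of the flags whose bits are set in value."""
    return [name for name, bit in FLAGS.items() if value & bit]


def is_set(value, name):
    """Test a single flag with bitwise AND."""
    return (value & FLAGS[name]) != 0


status = encode(["active", "locked"])
print(status, format(status, "04b"))  # 5 0101
print(decode(status))                 # ['active', 'locked']
print(is_set(status, "verified"))     # False
```

The sketch also shows the maintenance problem: the stored value 5 means nothing without the `FLAGS` lookup, which is exactly the knowledge a reader of the table has to carry around.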


Enterprise-Level Backups in MySQL

Lukas Vileikis continues a series on backups with MySQL:

MySQL Enterprise Backup is a tool known to everyone in the world of MySQL, whether or not the people dealing with it are enterprise (business) minded. The tool can be considered the flagship of MySQL’s enterprise-level offerings: it comes as a part of its Enterprise tier and costs thousands to attain. Is it worth your money, and perhaps more importantly, your time? We will figure that out in this blog.

Read on for Lukas’s thoughts.


Azure SQL Managed Instance Performance

Reitse Eskens wraps up a series on Azure SQL performance comparisons:

So far, the blogs were about the really SaaS databases; the database is deployed and you don’t have to think about it anymore. This ease of use comes at a ‘price’. You’ve got no control whatsoever over files, and you’ve lost the SQL Agent and a number of other features. The managed instance is a bit different. When I was testing, you could see the TempDB files but not change them. Since then, a few changes have been made to this tier so that you’re able to change settings and, as Niko Neugebauer told the data community on Twitter, there are more changes coming. With the managed instance, you get the agent back and you can run cross-database queries again. So you can safely say the managed instance is a hybrid between your trusty on-premises server and the fully managed Azure SQL database.

Click through for Reitse’s thoughts.


Defining an Analytics Engineer

Ust Oldfield defines a term:

Analytics Engineering, along with Data Engineering and Report Engineering, is a specialised subset of skills that would previously be the preserve of a Business Intelligence (BI) Developer. The BI Developer was once a generalist data developer, whose overall responsibilities have been split out and shared among specialist developers as the prevalence of data across organisations has increased and the tools and technologies used to ingest, transform, and serve data have become more specialised and loosely integrated.

In the same way that Data Engineering borrowed and took inspiration from Software Engineering for applying repeatable and scalable patterns and techniques to the pipelines that ingest and cleanse data, as well as the rigorous testing of those pipelines, Analytics Engineering has borrowed and taken inspiration from Software Engineering too.

Click through for the specifics of what an Analytics Engineer does.


An Intro to Azure Machine Learning

Tomaz Kastrun has a new Advent challenge:

Azure Machine Learning (or Azure Machine Learning Service, abbreviated AML) is Azure’s cloud service for creating, managing and productionalising machine learning projects. It is a collaborative tool for Data Scientists, Machine Learning Engineers, and data engineers, covering their daily and operational tasks, from creating and training to deploying and managing predictive models and machine learning solutions.

Click through for the introduction.


Apache Ranger on ElasticMapReduce

Laurence Geng looks at Ranger:

Whether you’ve successfully made it before or not, installing and integrating Windows AD/OpenLDAP + Ranger + EMR is a very hard job. It is complicated, error-prone, and time-consuming for the following reasons:

Read on for the list of reasons, some background on Ranger, and an automated installer intended to make life a bit easier.


Data Lake Exploration in AWS with Athena for Spark

Pathik Shah and Raj Devnath jetski the data lake:

Amazon Athena now enables data analysts and data engineers to enjoy the easy-to-use, interactive, serverless experience of Athena with Apache Spark in addition to SQL. You can now use the expressive power of Python and build interactive Apache Spark applications using a simplified notebook experience on the Athena console or through Athena APIs. For interactive Spark applications, you can spend less time waiting and be more productive because Athena instantly starts running applications in less than a second. And because Athena is serverless and fully managed, analysts can run their workloads without worrying about the underlying infrastructure.

Data lakes are a common mechanism to store and analyze data because they allow companies to manage multiple data types from a wide variety of sources, and store this data, structured and unstructured, in a centralized repository. Apache Spark is a popular open-source, distributed processing system optimized for fast analytics workloads against data of any size. It’s often used to explore data lakes to derive insights. For performing interactive data explorations on the data lake, you can now use the instant-on, interactive, and fully managed Apache Spark engine in Athena. It enables you to be more productive and get started quickly, spending almost no time setting up infrastructure and Spark configurations.

In this post, we show how you can use Athena for Apache Spark to explore and derive insights from your data lake hosted on Amazon Simple Storage Service (Amazon S3).

This feels a lot like the Spark pool in Azure Synapse Analytics, as well as some of what Databricks does.


Using the Kusto Time Pivot Chart

Chango Valtchev reminds us of Gantt charts:

This is the scenario: We have a job scheduler and a related job deployment manager, both implemented based on a state machines framework. One of the scheduler features is preemptable jobs: Jobs of that class can be suspended when a high-priority job needs to be scheduled and there is no available capacity. Effecting preemption requires some involved orchestration between the scheduler and the deployment manager, and we’ve had reliability issues in some cases – both due to incorrectly handled races and latency spikes in the cleanup of the suspended jobs from the cluster. Debugging such issues based on the raw logs has been very tedious – a typical log is 10-30K lines. This gets much worse with the number of dependencies. Given the concurrent processing of the suspensions, tracking the interactions with the new job’s deployment can be mentally taxing. The timeline visualization brought a breakthrough to our debugging ability and productivity. The following sample is a purposefully simplified case. In this scenario, things worked well. It shows the ‘Main’ job, at high priority, waiting on its dependencies to be suspended (while waiting, “Skipped schedule processing” is logged). Shortly after all the suspensions complete, the main job gets to Running state.

Read on to see the scenario in action.
