Press "Enter" to skip to content

Category: Cloud

Quantum: Serverless Database

Ashish Thusoo announces a new serverless SQL engine:

Today, we are launching Quantum, a high-performance serverless SQL engine, available on Qubole Data Platform, that simplifies SQL access by offering a true serverless deployment option to enable data analysts to query petabyte-scale volumes of data using ANSI-SQL.

Quantum allows teams to realize value from their data much more quickly, and because of its serverless nature, users pay only for queries they run. Data analysts can query object stores on AWS, Azure, Google Cloud, and Oracle Cloud in seconds to achieve faster time to market with far less IT management overhead.

Ankit Dixit, et al, provide a technical overview as well:

Existing serverless SQL service offerings do not provide users with the ability to use a metastore of their choice. With Quantum, data teams can use their own custom metastore and start using Quantum without recreating schemas or table metadata.

Most existing Qubole customers already use a custom metastore in the cloud. So there’s virtually no ramp up time to reap the benefits of Quantum.

The technical overview is a bit too much marketing for my tastes, but this is a move worth watching.

Comments closed

Arrays in Azure Data Factory

Rayis Imayev takes us through arrays in Azure Data Factory:

Currently, there are 3 data types supported in ADF variables: String, Boolean, and Array. The first two are pretty easy to use: Boolean for logical binary results and String for everything else, including the numbers (no wonder there are so many conversion functions in Azure Data Factory that we can use).

I’ve also blogged about using Variables in Azure Data Factory:
– Setting Variables in Azure Data Factory Pipelines
– Append Variable activity in Azure Data Factory: Story of combining things together  
– System Variables in Azure Data Factory: Your Everyday Toolbox 
– Azure Data Factory: Extracting array first element

Click through for arrays and follow up with those other posts from there.
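
As a quick illustration, here is a minimal sketch of what an Array variable and an Append Variable activity look like in the underlying pipeline JSON, expressed as a Python dict. The pipeline, variable, and activity names are hypothetical, and the shape follows the ADF pipeline schema as I understand it, so treat it as a sketch rather than a reference.

```python
import json

# Hypothetical pipeline fragment: declares an Array and a String variable,
# then appends the String variable's value onto the Array.
pipeline = {
    "name": "ArrayDemoPipeline",
    "properties": {
        "variables": {
            "myArray": {"type": "Array", "defaultValue": []},
            "fileName": {"type": "String", "defaultValue": ""},
        },
        "activities": [
            {
                "name": "AppendFileName",
                "type": "AppendVariable",
                "typeProperties": {
                    "variableName": "myArray",
                    # ADF keeps numbers in String variables, which is why the
                    # expression language needs conversion functions like int().
                    "value": "@variables('fileName')",
                },
            }
        ],
    },
}

print(json.dumps(pipeline, indent=2))
```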

Comments closed

Cosmos DB Role-Based Access Control

Hasan Savran takes us through role-based access control in Cosmos DB:

Role-based access control (RBAC) is available for Azure CosmosDB. By using RBAC, you can manage who has access to CosmosDB resources. You need to have a profile in Azure Active Directory to assign RBAC roles to users, groups or other targets. There are 4 built-in roles you can use. Microsoft announced the CosmosDB Operator role recently.

Click through for that list.

Comments closed

Building an ARM Template for Azure Data Factory

Andy Leonard takes the first steps to building an Azure Data Factory pipeline using Azure Resource Manager Templates:

Azure Resource Manager, or ARM, “allows you to provision your applications using a declarative template.” So says the Azure Quickstart Templates page. ARM templates are JSON and allow administrators to import and export Azure resources using varying management patterns. I really like ARM templates for implementing infrastructure as code in Azure. In this post I show a very simple example of how to use ARM templates to export and then import a basic ADF (Azure Data Factory) pipeline.

The sample code doesn’t do that much by itself, but it does open up a new world of automation.
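
To give a rough idea of what such a template looks like, here is a stripped-down ARM template skeleton for a Data Factory plus one pipeline, built as a Python dict and written to disk. The factory and pipeline names are hypothetical, and an actual export will include more properties and possibly a different apiVersion, so treat this as a sketch.

```python
import json

factory_name = "adf-demo"  # hypothetical factory name

template = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {},
    "resources": [
        {
            "type": "Microsoft.DataFactory/factories",
            "apiVersion": "2018-06-01",
            "name": factory_name,
            "location": "eastus",
            "properties": {},
        },
        {
            # Pipelines are child resources of the factory.
            "type": "Microsoft.DataFactory/factories/pipelines",
            "apiVersion": "2018-06-01",
            "name": f"{factory_name}/WaitDemoPipeline",
            "dependsOn": [
                f"[resourceId('Microsoft.DataFactory/factories', '{factory_name}')]"
            ],
            "properties": {
                "activities": [
                    {"name": "Wait1", "type": "Wait",
                     "typeProperties": {"waitTimeInSeconds": 10}}
                ]
            },
        },
    ],
}

# Write the template out; from here it could be deployed with the Azure CLI
# (e.g., az deployment group create --template-file adf-template.json) or PowerShell.
with open("adf-template.json", "w") as f:
    json.dump(template, f, indent=2)
```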

Comments closed

Time Series Modeling with Gluon

Jan Gasthaus, et al, announce a new open source product release:

We are excited to announce the open source release of Gluon Time Series (GluonTS), a Python toolkit developed by Amazon scientists for building, evaluating, and comparing deep learning–based time series models. GluonTS is based on the Gluon interface to Apache MXNet and provides components that make building time series models simple and efficient.

In this post, I describe the key functionality of the toolkit and demonstrate how to apply GluonTS to a time series forecasting problem.

It looks interesting.
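
If you want a feel for the API, here is a minimal sketch along the lines of the GluonTS quick-start examples. The dataset choice and hyperparameters are arbitrary placeholders, and import paths have shifted in later releases (Trainer moved under gluonts.mx), so adjust to your installed version.

```python
# GluonTS sketch, roughly matching the early 0.x API.
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer

# One of the built-in sample datasets: hourly series from the M4 competition.
dataset = get_dataset("m4_hourly")

estimator = DeepAREstimator(
    freq="H",                # hourly data
    prediction_length=48,    # forecast two days ahead
    trainer=Trainer(epochs=5),
)

predictor = estimator.train(dataset.train)

# Generate a forecast for the first test series and look at the mean path.
forecast = next(iter(predictor.predict(dataset.test)))
print(forecast.mean)
```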

Comments closed

An Introduction to Azure Databricks

Brad Llewellyn has an introduction to Azure Databricks:

So, what is Azure Databricks?  To answer this question, let’s start all the way at the bottom of the hole and climb up.  So, what is Hadoop?  Apache Hadoop is an open-source, distributed storage and computing ecosystem designed to handle incredibly large volumes of data and complex transformations.  It is becoming more common as organizations are starting to integrate massive data sources, such as social media, financial transactions and the Internet of Things.  However, Hadoop solutions are extremely complex to manage and develop.  So, many people have worked together to create platforms that layer on top of Hadoop to provide a simpler way to solve certain types of problems.  Apache Spark is one of these platforms.  You can read more about Apache Hadoop here and here.

It’s Hadoop turtles all the way down.
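
For a sense of what the Spark layer on top of all those turtles looks like, here is a minimal PySpark sketch of the sort of code you would run in a Databricks notebook. The file path and column name are hypothetical, and inside a Databricks notebook the `spark` session already exists, so you would skip the builder step.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Outside Databricks you build the session yourself; a Databricks notebook
# provides one named `spark` automatically.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Hypothetical CSV of events with a `country` column.
df = spark.read.csv("/data/events.csv", header=True, inferSchema=True)

# A simple distributed aggregation: event counts per country.
counts = df.groupBy("country").agg(F.count("*").alias("events"))
counts.orderBy(F.desc("events")).show(10)
```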

Comments closed

Using Notebooks with ElasticMapReduce

Vignesh Rajamani and Nikki Rouda show off ElasticMapReduce Notebooks:

One of the useful features of EMR Notebooks is the separation of the notebook environment from your underlying cluster infrastructure. The separation makes it easy for you to execute notebook code against transient clusters without worrying about deploying or configuring your notebook infrastructure every time you bring up a new cluster. You can create multiple serverless notebooks from the AWS Management Console for EMR and access the notebook UI without spending time setting up SSH access or configuring your browser for port-forwarding. Each notebook you create is launched instantly with its own Spark context. This capability enables you to attach multiple notebooks to a single shared cluster and submit parallel jobs without fear of job conflicts in a multi-tenant environment. This way you make efficient use of your clusters.

You can also connect EMR Notebooks to an EMR cluster as small as a single node. This gives you a budget-friendly sandbox environment to develop your Spark application.

Notebooks are everywhere. And for good reason.
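
For the single-node sandbox scenario, here is a hedged boto3 sketch that spins up a minimal Spark cluster an EMR Notebook could attach to. The cluster name, region, release label, instance type, and IAM roles are placeholders to swap for your own.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Minimal one-node cluster; Livy is what EMR Notebooks use to talk to Spark.
response = emr.run_job_flow(
    Name="notebook-sandbox",
    ReleaseLabel="emr-5.23.0",
    Applications=[{"Name": "Spark"}, {"Name": "Livy"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Primary",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        # Keep the cluster alive so the notebook can attach to it.
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print(response["JobFlowId"])
```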

Comments closed

Unique Key Constraints in Cosmos DB

Hasan Savran shows how you can set unique key constraints on Cosmos DB containers:

Unique key names are case-sensitive; I have first-hand experience with this. If your unique key is defined with lowercase letters but your data has the field in uppercase, CosmosDB will insert a null value into the unique key the first time, and you will get an error the second time when it tries to insert null again. CosmosDB does not support sparse unique keys. If your unique key is /SSN, you can have only one null value in that field.

If you want to use unique keys in Azure CosmosDB, you have to define them when you create your containers. You cannot add a unique key to an existing container. The only way to add a unique key to an existing container is to create a new container and move your data from the old container to the new one. Also, just as with partition keys, you cannot update unique keys. Picking the wrong unique key can be an expensive error.

Looks like you’ll need to have a bit of foresight when choosing keys (or choosing not to use keys).
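
Since unique keys can only be set when the container is created, here is a minimal sketch using the azure-cosmos Python SDK (v4-style API). The account URL, key, database, and container names are placeholders.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholders: substitute your own account endpoint and key.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.create_database_if_not_exists("demo")

# The unique key policy has to be supplied here, at creation time; it cannot
# be added to an existing container. Note the case-sensitive path (/SSN).
container = database.create_container_if_not_exists(
    id="people",
    partition_key=PartitionKey(path="/state"),
    unique_key_policy={"uniqueKeys": [{"paths": ["/SSN"]}]},
)
```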

Comments closed

Data Classifications on Azure SQL DW

Meagan Longoria takes us through data classifications on Azure SQL Data Warehouse:

Data classifications in Azure SQL DW entered public preview in March 2019. They allow you to label columns in your data warehouse with their information type and sensitivity level. There are built-in classifications, but you can also add custom classifications. This could be an important feature for auditing your storage and use of sensitive data as well as compliance with data regulations such as GDPR. You can export a report of all labeled columns, and you can see who is querying sensitive columns in your audit logs. The Azure Portal will even recommend classifications based upon your column names and data types. You can add the recommended classifications with a simple click of a button.

But read the whole thing, as Meagan sees a problem with it when you use a popular loading technique.
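
If you would rather script the labels than click through the portal, here is a hedged sketch using pyodbc and the T-SQL sensitivity classification DDL. The table, column, label, and connection details are hypothetical, and you should confirm that your SQL DW version supports this syntax rather than the older extended-properties approach before relying on it.

```python
import pyodbc

# Placeholder connection string; substitute your own server, database, and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<dw>;UID=<user>;PWD=<password>"
)

# Hypothetical column; LABEL and INFORMATION_TYPE mirror what the portal sets.
classify = """
ADD SENSITIVITY CLASSIFICATION TO dbo.Customer.EmailAddress
WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Contact Info');
"""

cursor = conn.cursor()
cursor.execute(classify)
conn.commit()
```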

Comments closed

Populating a Data Vault Model with Azure Data Factory

Rayis Imayev gives us an example of ELT into a Data Vault model using Azure Data Factory:

To make a full transition from the existing DW model to an alternative Data Vault I removed all Surrogate Keys and other attributes that are only necessary to support Kimball data warehouse methodology. Also, I needed to add necessary Hash keys to all my Hub, Link and Satellite tables. The target environment for my Data Vault would be SQL Azure database and I decided to use a built-in crc32 function of the Mapping Data Flow to calculate hash keys (HK) of my business data sourcing keys and composite hash keys of satellite tables attributes (HDIFF).

Data Vault is somewhere on my list of things to learn. It’s not at the top of the list, but that’s not a slight against it.
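
To make the hash-key idea concrete, here is a small Python sketch of the same pattern outside of Mapping Data Flows: a CRC32 hash over the concatenated business key for the Hub (HK), and over the descriptive attributes for the Satellite (HDIFF). The column names and delimiter are hypothetical, and CRC32 is simply what the post uses; many Data Vault implementations prefer MD5 or SHA-1 to reduce collision risk.

```python
import zlib

def crc32_key(*values, delimiter="|"):
    """CRC32 over normalized, concatenated values, returned as a hex string."""
    payload = delimiter.join(str(v).strip().upper() for v in values)
    return format(zlib.crc32(payload.encode("utf-8")), "08x")

# Hypothetical customer record.
business_key = "CUST-00042"
attributes = {"name": "Contoso", "city": "Seattle", "segment": "Retail"}

hub_hk = crc32_key(business_key)             # hash key (HK) for the Hub
sat_hdiff = crc32_key(*attributes.values())  # change-detection hash (HDIFF) for the Satellite

print(hub_hk, sat_hdiff)
```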

Comments closed