Press "Enter" to skip to content


Data Cleansing Options with Azure

James Serra tries to answer the question of when you should use different Azure services for data cleansing:

Clean the data and optionally aggregate it as it sits in the source system. The tool used for this would depend on the source system that stores the data (i.e., if SQL Server, you would use stored procedures). The only benefit of this option is that if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don't need the row-level details. The disadvantages are: the raw source data is not available in the data lake, so you would always need to go back to the source system if you needed to get it again, and it may not even still exist in the source system; you would put extra stress on the source system when doing the cleaning, which could affect end users using the system; it could take a long time to clean the data, as the source system may not have fast performance; and you would not be able to use other tools (i.e., Hadoop, Databricks) to clean it. Strongly advise against this option.

Read on for additional options and James’s recommendations.


The CosmosDB Emulator

Hasan Savran has a way to let you play with CosmosDB without dropping any cash on it:

The CosmosDB Emulator is a must-have tool if you develop applications for Azure CosmosDB. It's also a great tool to have if you'd like to learn about Azure CosmosDB but have limited access to Azure for any reason. The Azure CosmosDB team constantly works to make all of the available tools better, including the emulator. Currently, the emulator supports the SQL, Cassandra, MongoDB, Gremlin, and Table APIs. The Data Explorer feature supports only the SQL API for now. The emulator's implementation is different than the service, so you should not use it for stress testing, and you cannot test global replication or latency for reads and writes.

It’s also useful for testing scenarios where you want some level of integration testing but don’t want to rely on an external service.


AzureGraph: Microsoft Graph in R

Hong Ooi takes us through AzureGraph:

Microsoft Graph is a comprehensive framework for accessing data in various online Microsoft services, including Azure Active Directory (AAD), Office 365, OneDrive, Teams, and more. AzureGraph is an R package that provides a simple R6-based interface to the Graph REST API, and is the companion package to AzureRMR and AzureAuth.

Currently, AzureGraph aims to provide an R interface only to the AAD part, with a view to supporting R interoperability with Azure: registered apps and service principals, users and groups. Like AzureRMR, it could potentially be extended to support other services.

Just to clarify, this is like the Facebook Graph API but for Azure components, not a graph database in which you can store your own data.
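
AzureGraph handles authentication and wraps these REST calls for you in R. Purely to illustrate what the underlying Graph API looks like, here's a rough Python sketch using requests; the access token is assumed to have been acquired already (e.g. via MSAL, with a directory-read scope), which is the part AzureGraph automates.

import requests

GRAPH_BASE = "https://graph.microsoft.com/v1.0"
access_token = "<AAD access token acquired elsewhere>"  # assumption: token already obtained

# List users in the directory, asking only for a couple of fields.
resp = requests.get(
    f"{GRAPH_BASE}/users",
    headers={"Authorization": f"Bearer {access_token}"},
    params={"$select": "displayName,userPrincipalName", "$top": 10},
)
resp.raise_for_status()

for user in resp.json()["value"]:
    print(user["displayName"], "-", user["userPrincipalName"])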


Extracting the First Element from an Array in ADF

Rayis Imayev shows how you can find the first element in an array using Azure Data Factory:

A user recently asked me a question on my previous blog post (Setting Variables in Azure Data Factory Pipelines) about the possibility of extracting the first element of a variable if this variable is a set of elements (an array).

So as a spoiler alert, before writing a blog post and adding a bit more clarity to the existing Microsoft ADF documentation, here is a quick answer to this question.

You’ll have to click through even for the quick answer.


Azure Cloud Shell

Mark Broadbent gives us an introduction to Azure Cloud Shell:

There are two ways to access Azure Cloud Shell, the first being directly through the Azure Portal itself. Once authenticated, look to the top right of the Portal and you should see a grouping of icons and in particular, one that looks very much like a DOS prompt (have no fear, DOS is nowhere to be seen).

The second method to access Azure Cloud Shell is by jumping directly to it via shell.azure.com which will require you to authenticate to your subscription before launching. There is an ever so slight difference between each method. Accessing the Shell via the Azure Portal will not require you to specify your Azure directory context (assuming you have several) since your Portal will have already defaulted to one, whereas with the direct URL method that obviously doesn’t happen.

Read the whole thing.


Azure SQL Linux VM Configuration with dbatools

Rob Sewell walks us through configuring SQL Server on an Azure VM running Linux, installing PowerShell, and using dbatools:

I had set the Network security rules to accept connections only from my static IP using variables in the Build Pipeline. I use MobaXterm as my SSH client. It's a free download. I click on sessions

There wasn’t much I could excerpt here, but this is a heavily screenshot-driven tutorial.


Defining Data Egress Charges with Azure Query Editor

Dave Bland looks into what constitutes data egress with Azure and looks at a specific example:

If you have attempted to calculate your price for your Azure environment, you know that the pricing can be complex, taking into account a number of factors. These factors include data egress, compute, and storage. The intent of this blog post is not to outline all of the billing factors of Azure. The purpose of this blog post is to answer one question:

When I use the Azure SQL Query Editor, does that count as part of the data egress charges?

I completed a blog post on the Azure Query Editor a few weeks back.  Here is a link, if you would like to check it out.

Read on to see what Dave learned.


CosmosDB Continuation Tokens

Hasan Savran walks us through the idea of a continuation token in CosmosDB:

In CosmosDB, the TOP option is required and its default value is 100. You can change the default value by sending a different value using the request header "x-ms-max-item-count". If you have 40,000 rows in your Orders table and run the same query in CosmosDB, you will get 100 rows (documents) rather than 40,000 rows (documents). CosmosDB returns all kinds of metadata with the data. You can find this metadata in the response headers. One of those headers is "x-ms-continuation", and it is responsible for retrieving the rest of the rows of your query. If you'd like to get the next set of results, you can take the "x-ms-continuation" value from the response headers and attach it to your next request to get the next set of rows. The CosmosDB SDK does this automatically for you. The SDK checks for the x-ms-continuation value when you check the HasMoreResults property. If this property is true, that means CosmosDB returned a continuation token.

I have fanciful notions of SQL Server offering something similar—think of a grid built from a query. Get the first 50 rows from the result set and store that off in tempdb somewhere, using the “continuation token” (which might just be the full name in tempdb) and auto-trashing after a certain amount of time.
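
To make the continuation-token flow concrete outside of the .NET SDK's HasMoreResults pattern, here is a hedged sketch using the Python azure-cosmos SDK, where paging is exposed through by_page(); the endpoint, key, database, and container names are placeholders.

from azure.cosmos import CosmosClient

client = CosmosClient("<account endpoint>", credential="<account key>")
container = client.get_database_client("Sales").get_container_client("Orders")

query = "SELECT * FROM c"

# max_item_count maps to the x-ms-max-item-count request header.
pager = container.query_items(
    query, enable_cross_partition_query=True, max_item_count=100
).by_page()

first_page = list(next(pager))
token = pager.continuation_token  # the x-ms-continuation value for the next page

# Later (even in a separate request or process), resume where the first page left off.
resumed = container.query_items(
    query, enable_cross_partition_query=True, max_item_count=100
).by_page(token)
second_page = list(next(resumed))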


Running Spark MLlib to Feed Power BI

Brad Llewellyn shows how you can take Spark MLlib results and feed them into Power BI:

MLlib is one of the primary extensions of Spark, along with Spark SQL, Spark Streaming and GraphX.  It is a machine learning framework built from the ground up to be massively scalable and operate within Spark.  This makes it an excellent choice for machine learning applications that need to crunch extremely large amounts of data.  You can read more about Spark MLlib here.

In order to leverage Spark MLlib, we obviously need a way to execute Spark code.  In our minds, there’s no better tool for this than Azure Databricks.  In the previous post, we covered the creation of an Azure Databricks environment.  We’re going to reuse that environment for this post as well.  We’ll also use the same dataset that we’ve been using, which contains information about individual customers.  This dataset was originally designed to predict Income based on a number of factors.  However, we left the income out of this dataset a few posts back for reasons that were important then.  So, we’re actually going to use this dataset to predict “Hours Per Week” instead.

Check it out. And Brad’s not joking when he says the resulting model is terrible. But that’s okay, because it was never about the model.
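
As a rough illustration of the shape of that MLlib code (not Brad's exact notebook), here's a minimal PySpark regression sketch you could run in a Databricks notebook, where spark is already defined; the file path and feature column names are assumptions.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Load the customer dataset (path is a placeholder).
df = spark.read.csv("/mnt/data/customers.csv", header=True, inferSchema=True)

# Assemble a handful of assumed numeric columns into a feature vector.
assembler = VectorAssembler(
    inputCols=["age", "education_num", "capital_gain"],
    outputCol="features",
)
prepared = assembler.transform(df)

train, test = prepared.randomSplit([0.8, 0.2], seed=42)

# Predict "Hours Per Week", matching the target described in the post.
lr = LinearRegression(featuresCol="features", labelCol="hours_per_week")
model = lr.fit(train)

# These predictions are what would ultimately be surfaced to Power BI.
predictions = model.transform(test).select("hours_per_week", "prediction")
predictions.show(10)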
