Press "Enter" to skip to content

Category: Cloud

Storing Streaming Data in Azure Data Lake

Jesse Gorter takes us through writing streaming data from Event Hubs into Azure Data Lake Storage:

In my previous blog I showed how you can stream Twitter data to an Event Hub and on to a Power BI live dashboard. In this post, I am going to show you how to store this data for the long term. An Event Hub stores your events only temporarily; it does not keep them for later analysis. Say you want to analyze whether negative or positive tweets have an impact on your sales: you would need to store the tweets to get a historical view.

The question is where to store this data: directly in the data warehouse, or in a data lake? This really depends on the architecture you want to have. A data lake is often used to store the raw data historically. It is especially interesting because it allows you to store any kind of data, structured or unstructured, and it is quite cheap compared to Azure SQL Database or Azure SQL Data Warehouse. So for that reason, we are going to store it in a data lake.

Jesse walks us through data lake creation and data migration from Event Hubs into a Data Lake Storage container.
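As a rough companion sketch of the same idea in code (not necessarily how Jesse wires it up in the post), here is a minimal Python example that reads events from an Event Hub and lands them as files in ADLS Gen2 using the azure-eventhub and azure-storage-file-datalake SDKs. The connection strings, hub name, filesystem, and file paths are all placeholders.

# pip install azure-eventhub azure-storage-file-datalake
from azure.eventhub import EventHubConsumerClient
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder connection strings and names.
EVENTHUB_CONN = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."
ADLS_CONN = "DefaultEndpointsProtocol=https;AccountName=<account>;..."

lake = DataLakeServiceClient.from_connection_string(ADLS_CONN)
filesystem = lake.get_file_system_client("raw")

buffer = []  # simplified: one shared buffer rather than one per partition

def on_event(partition_context, event):
    # Collect raw tweet payloads and flush them to the lake in small batches.
    buffer.append(event.body_as_str())
    if len(buffer) >= 100:
        data = ("\n".join(buffer) + "\n").encode("utf-8")
        file_client = filesystem.get_file_client(
            f"tweets/partition={partition_context.partition_id}/"
            f"batch-{event.sequence_number}.json")
        file_client.create_file()
        file_client.append_data(data, offset=0, length=len(data))
        file_client.flush_data(len(data))
        buffer.clear()

consumer = EventHubConsumerClient.from_connection_string(
    EVENTHUB_CONN, consumer_group="$Default", eventhub_name="tweets")

with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")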

Options for SQL Server in Azure

Josh Darnell looks at the options for running SQL Server in Azure:

Things are always changing in the cloud, but as it stands in early 2020, these are the main options for SQL Server in Azure:

– SQL Virtual Machines
– Managed Instances
– Databases (this one really has a number of sub-options, like serverless, hyperscale, and elastic pools)

I’m going to touch on each of these options, why you might choose them, and some of my personal takeaways from the talk as well.

Read on for the summary of each. Each one fills a different set of needs and has a different set of available options.

Cloudera Data Platform in Azure Marketplace

Ram Venkatesh announces availability for Cloudera Data Platform in the Azure Marketplace:

Cloudera Data Platform (CDP) is now available on Microsoft Azure Marketplace – so joint customers can easily deploy the world’s first enterprise data cloud on Microsoft Azure.

Last week we announced the availability of Cloudera Data Platform (CDP) on Azure Marketplace. CDP is an integrated data platform that is easy to secure, manage, and deploy. With its availability on the Azure Marketplace, joint customers of Cloudera and Microsoft will be able to easily discover and provision CDP Public Cloud across all Azure regions. Additionally, by procuring CDP through the Azure Marketplace, these customers can leverage integrated billing; i.e., the cost of CDP will be part of a single Azure bill, making procurement simple and friction-free.

The new Cloudera has taken a cloud-first approach, to the point of being cloud-only. It's an interesting shift, coming from the merger of two on-prem companies.

Characterizing and Optimizing a Serverless Workload

Adrian Colyer reviews an interesting paper:

Today’s paper analyses serverless workloads on Azure (the characterisation of those workloads is interesting in its own right), where users want fast function start times (avoiding cold starts) and the cloud provider wants to minimise resources consumed (costs). With fine-grained, usage-based billing, resources used to keep function execution environments warm eat directly into margins. The authors demonstrate a policy combining keep-alive times for function runtimes with pre-warming that dominates the currently popular strategy of simply keeping a function execution environment around for a period of time after it has been used, in the hope that it will be re-used. Using this policy, users see far fewer cold starts, while the cloud provider uses fewer resources. It’s the difference between the red (state-of-the-practice) and green (this paper) policies in the chart in the post. Win-win!

Very interesting.
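The central idea lends itself to a quick illustration. The toy Python sketch below (class and parameter names are mine, and the actual policy in the paper is considerably more sophisticated, with histogram handling and forecasting for hard-to-predict functions) shows the gist: learn each function's idle-time distribution, pre-warm shortly before the next invocation is likely, and keep the environment alive only long enough to cover most observed gaps.

import numpy as np

class HybridWarmingPolicy:
    # Toy sketch: per-function idle-gap history drives two windows instead of
    # one fixed keep-alive timeout applied to every function.
    def __init__(self, head_pct=5, tail_pct=99, fallback_keepalive=600):
        self.idle_gaps = {}                # function id -> observed idle gaps (seconds)
        self.head_pct = head_pct           # pre-warm once this much idle time has passed
        self.tail_pct = tail_pct           # keep warm long enough to cover most gaps
        self.fallback_keepalive = fallback_keepalive

    def record_idle_gap(self, fn_id, gap_seconds):
        self.idle_gaps.setdefault(fn_id, []).append(gap_seconds)

    def windows(self, fn_id):
        gaps = self.idle_gaps.get(fn_id)
        if not gaps:
            # No history yet: behave like the state-of-the-practice fixed keep-alive.
            return 0.0, float(self.fallback_keepalive)
        prewarm_after = float(np.percentile(gaps, self.head_pct))
        keep_alive_until = float(np.percentile(gaps, self.tail_pct))
        return prewarm_after, keep_alive_until

policy = HybridWarmingPolicy()
for gap in [55, 60, 58, 62, 300]:          # mostly minute-ish gaps, one outlier
    policy.record_idle_gap("thumbnailer", gap)
print(policy.windows("thumbnailer"))       # roughly (55.6, 290.5)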

Connecting to Cosmos DB from the Gremlin Console

Hasan Savran shows how to connect to Cosmos DB’s Gremlin API via the Gremlin Console:

Graph databases have been popular lately. You can use Azure Cosmos DB as your graph database by selecting the Gremlin API. The Gremlin programming language is developed by Apache TinkerPop of the Apache Software Foundation. I will show you how to connect to the Cosmos DB Gremlin API from the TinkerPop Gremlin console.

You can download the latest version of the Gremlin console from here. The latest version was 3.4.6 when I wrote this post; I was able to connect to Cosmos DB using versions 3.4.3 and 3.4.6. You can run the console from Linux or Windows. I will focus on the Windows version here, but the Linux version should work the same way. You must have the Java SDK 8 to run this console; the latest version of the Java SDK does not work with it.

There are a couple of configuration steps, but nothing crazy.
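Hasan's post covers configuring the console itself; if you would rather hit the same endpoint from code, a minimal gremlinpython sketch looks roughly like the following. The account, database, graph, and key values are placeholders, and note that Cosmos DB's Gremlin API expects query strings rather than bytecode traversals.

# pip install gremlinpython
from gremlin_python.driver import client, serializer

# Placeholder endpoint, database/graph names, and key.
gremlin_client = client.Client(
    "wss://<your-account>.gremlin.cosmos.azure.com:443/", "g",
    username="/dbs/<your-database>/colls/<your-graph>",
    password="<your-primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0())

# Submit plain Gremlin query strings; Cosmos DB does not accept bytecode traversals.
count = gremlin_client.submit("g.V().count()").all().result()
print(count)
gremlin_client.close()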

Evaluating Regression Models in Azure ML

Dan Fitton continues a series on model evaluation with Azure Machine Learning:

The initial go-to metric for understanding a regression model is the R squared (or R2) value, also known as the coefficient of determination. R squared measures how well the model is fitted to the data – the goodness of fit. It indicates how much of the variation of y (the target) is explained by the variation in x (the features).

The measures are bog standard if you’ve worked with regressions before, and Dan does a good job explaining them.
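For readers who want the definition concretely, here is a small Python sketch of the coefficient of determination, computed directly from its definition (1 minus the ratio of residual to total sum of squares). The sample numbers are made up for illustration.

import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot: the share of the target's variation explained by the model.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual (unexplained) variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# A good fit scores close to 1; predicting the mean everywhere scores 0.
print(r_squared([3.0, 5.0, 7.0, 9.0], [2.8, 5.3, 6.9, 9.2]))   # ~0.991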

Issue with PolyBase and Cosmos DB

I found an issue with connecting to Cosmos DB from PolyBase after installing SQL Server 2019 CU2:

After upgrading to SQL Server 2019 CU2, I noticed some issues when trying to connect to a Cosmos DB collection via PolyBase. Specifically, I started getting the following error message:

Msg 105082, Level 16, State 1, Line 35
105082;Generic ODBC error: [Microsoft][MongoDBODBC] (110) Error from MongoDB Client: Server at <<my Cosmos account name>>.documents.azure.com:10255 reports wire version 2, but this version of libmongoc requires at least 3 (MongoDB 3.0) (Error Code: 15) Additional error <2>: ErrorMsg: [Microsoft][MongoDBODBC] (110) Error from MongoDB Client: Server at <<my Cosmos account name>> .documents.azure.com:10255 reports wire version 2, but this version of libmongoc requires at least 3 (MongoDB 3.0) (Error Code: 15), SqlState: HY000, NativeError: 110 .

Read on for a couple of attempts at a solution and some more detail.

Cosmos DB Changes

Hasan Savran hits on a few changes to Cosmos DB:

Free Tier of Cosmos DB
A couple of big changes came from the Azure Cosmos DB team this week. The first one is the free tier of Cosmos DB. In the free tier, you get the first 400 Request Units per second and 5 GB of storage for free for the lifetime of the account. This is a great deal if you would like to use Azure Cosmos DB in your new project. Give it a try; it's free! All you need to do is select Apply for the Apply Free Tier Discount option.

This is a big one, but there are a few other interesting changes in there as well.

Transforming JSON to CSV with Azure Data Factory

Rayis Imayev shows how you can use the Flatten task in Azure Data Factory to convert JSON text to CSV:

What this new task does is help transform/transpose/flatten your JSON structure into a denormalized, flattened dataset that you can upload into a new or existing flat database table.

I like the analogy of the Transpose function in Excel, which rotates your vertical set of data pairs (name : value) into a table with column names and values for the corresponding objects. And when this vertical JSON structure contains several similar sets (an array), the Flatten transformation in ADF Mapping Data Flows does a really good job of transforming it into a table with several rows (records).

Click through for a demonstration.
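The Flatten transformation lives in ADF Mapping Data Flows, but the underlying idea (explode an array into rows while carrying parent attributes along) is easy to see in a few lines of Python. This illustrative sketch uses pandas.json_normalize on a made-up document rather than ADF itself.

import json
import pandas as pd

# Made-up nested JSON: one parent object whose "orders" array should become one row per element.
raw = json.loads("""
{
  "customer": {"id": 1, "name": "Contoso"},
  "orders": [
    {"orderId": 100, "amount": 25.0},
    {"orderId": 101, "amount": 40.0}
  ]
}
""")

# record_path names the array to explode; meta carries parent fields onto every output row.
flat = pd.json_normalize(raw, record_path="orders",
                         meta=[["customer", "id"], ["customer", "name"]])
flat.to_csv("orders.csv", index=False)
print(flat)      # two rows: orderId, amount, customer.id, customer.name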

ADF.PROCFWK 1.1 Released

Paul Andrew has a new name for the metadata-driven framework in Azure Data Factory:

Hi data friends! Version 1.1 of my ADF.procfwk is ready!

Following the great response I got to the version 1.0 blog series, I decided to push ahead and get version 1.1 out there as soon as possible. The main thing I’ve addressed with this release is the big problem of having hard-coded service principal credentials in the body of the Azure Function request that called our execution pipelines. Hopefully we can all agree this was/is not great for a number of reasons.

Read on for more details including a detailed changelog, and check out the GitHub repo.
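Paul's framework handles this in its own way (see the changelog), but as a general illustration of the pattern being fixed, here is a hedged Python sketch of pulling service principal credentials from Azure Key Vault at runtime instead of embedding them in a request body. The vault URL and secret names are hypothetical.

# pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Hypothetical vault and secret names.
vault = SecretClient(
    vault_url="https://my-procfwk-vault.vault.azure.net/",
    credential=DefaultAzureCredential())

client_id = vault.get_secret("pipeline-spn-client-id").value
client_secret = vault.get_secret("pipeline-spn-client-secret").value

# Pass (or better, have the Azure Function itself look up) these values at runtime
# rather than hard-coding them in the request body that triggers execution pipelines.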
