Press "Enter" to skip to content

Category: Cloud

Loading Azure Synapse Analytics using PolyBase

Gauri Mahajan needs to load some data:

Azure Synapse Analytics is Microsoft’s data warehousing offering on Azure Cloud. It supports three types of runtimes – SQL Serverless Pool, SQL Dedicated Pool, and Spark Pools. As there is a wide variety of data sources on Azure, data of varying types and volumes will need to be loaded into Azure Synapse pools. There are three major data ingestion approaches for loading data into Synapse. The COPY command is the most flexible and elaborate mechanism: you execute it from a SQL pool to load data from supported data repositories, and it is convenient for ad-hoc and small to medium-sized loads. The second method is Bulk Insert, whose name describes exactly what it does. For ingesting data from supported repositories into dedicated SQL pools, PolyBase is as efficient as the COPY command, and at times even more efficient. This article will help you understand the process of ingesting data into Azure Synapse Analytics using PolyBase.

Click through for the process.
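
If you want a feel for what the PolyBase route looks like end to end, here is a minimal sketch driven from Python with pyodbc. The workspace, storage account, schema, and column names are all hypothetical, and it assumes a database scoped credential named StorageCred already exists in the dedicated pool.

```python
# Minimal sketch of a PolyBase load into a Synapse dedicated SQL pool.
# Assumes: pyodbc + ODBC Driver 17, an existing database scoped credential
# named StorageCred, and hypothetical object/column names throughout.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;"   # hypothetical workspace
    "DATABASE=SalesDW;UID=loader;PWD=...",
    autocommit=True,
)
cur = conn.cursor()

# 1. Point PolyBase at the storage account holding the source files.
cur.execute("""
CREATE EXTERNAL DATA SOURCE SalesLake
WITH (TYPE = HADOOP,
      LOCATION = 'abfss://raw@mystorageacct.dfs.core.windows.net',
      CREDENTIAL = StorageCred);
""")

# 2. Describe the file format (Parquet needs no delimiter settings).
cur.execute("""
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
""")

# 3. Expose the files in storage as an external table.
cur.execute("""
CREATE EXTERNAL TABLE ext.Sales
(SaleId BIGINT, SaleDate DATE, Amount DECIMAL(18,2))
WITH (LOCATION = '/sales/2021/',
      DATA_SOURCE = SalesLake,
      FILE_FORMAT = ParquetFormat);
""")

# 4. Load into a distributed internal table with CTAS.
cur.execute("""
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(SaleId))
AS SELECT * FROM ext.Sales;
""")
```

The CTAS at the end is where the parallel load actually happens; everything before it is just metadata.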


Continuous Backup with Cosmos DB

Hasan Savran reviews a new bit of functionality in Cosmos DB:

Azure Cosmos DB announced Continuous Backup in March 2021. This feature is currently in public preview and is not recommended for use in production. It gives you more options for your backup requirements. You might be using Azure Data Factory to handle your custom backup needs. Azure Data Factory is the SSIS of the cloud, and ETL jobs can be problematic. Backing up a database is half of the problem; the other half is restoring it. Until now, we had to call Microsoft to restore Cosmos DB databases/accounts.
By using continuous backup, you can easily back up and restore your database. For now, this option is available only for the SQL API and Mongo API. There are many limitations in this public preview version. I am sure many of these limitations will go away when it becomes generally available to everybody.

Click through for more details about the offering, as well as how to enable it. We’ll have to wait until it’s out of public preview to see how much it will cost, but it does look interesting.
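
For those who would rather enable it from code than from the portal, something along these lines should work with the Azure management SDK for Python. Treat it as a sketch: the ContinuousModeBackupPolicy model assumes a recent azure-mgmt-cosmosdb release, and the subscription, resource group, and account names are placeholders.

```python
# Sketch: create a Cosmos DB account with continuous backup enabled.
# Assumes a recent azure-mgmt-cosmosdb that exposes ContinuousModeBackupPolicy;
# subscription, resource group, account name, and region are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient
from azure.mgmt.cosmosdb.models import (
    ContinuousModeBackupPolicy,
    DatabaseAccountCreateUpdateParameters,
    Location,
)

client = CosmosDBManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.database_accounts.begin_create_or_update(
    resource_group_name="rg-data",
    account_name="cosmos-demo-acct",
    create_update_parameters=DatabaseAccountCreateUpdateParameters(
        location="eastus",
        locations=[Location(location_name="eastus")],
        # Swap the default periodic policy for continuous backup.
        backup_policy=ContinuousModeBackupPolicy(),
    ),
)
print(poller.result().id)
```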


Tracking Azure Resources with Tags

Jess Pomfret explains the value of tags:

One of the vital parts of this learning and experimenting needs to be cleaning up after myself. We all know the risks of leaving things running in Azure – it’s likely to drain your training budget pretty quickly. To be fair, this is also a good lesson for real-world scenarios. Getting used to turning off or scaling down resources based on need is a good way to reduce your Azure spend.

This brings me to one morning last week. I logged in to the portal and got a pop-up that my credit was down to under $5, which is not what I was expecting. I started looking around and wondering what I’d left running – it isn’t always easy to spot though.

Read on to see how tags can help with this, as well as other forms of cloud governance. If you remember to set them, that is.
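
If remembering is the hard part, baking tags into your deployment scripts helps. Here is a minimal sketch with the Azure SDK for Python, with placeholder names throughout: it stamps owner, purpose, and expiry tags on a resource group at creation time, then filters on those tags to find candidates for cleanup.

```python
# Sketch: tag a resource group at creation so it can be found and cleaned up later.
# Group name, tag values, and subscription id are placeholders.
from datetime import date, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.resource_groups.create_or_update(
    "rg-sql-lab",
    {
        "location": "eastus",
        "tags": {
            "owner": "jess",
            "purpose": "training",
            "expires": str(date.today() + timedelta(days=7)),
        },
    },
)

# Later: list groups carrying the training tag and check their expiry dates.
for rg in client.resource_groups.list(
    filter="tagName eq 'purpose' and tagValue eq 'training'"
):
    print(rg.name, rg.tags.get("expires"))
```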


Early Thoughts on Dremio

Meagan Longoria gives us a review of Dremio:

I’ve been working on a project for the last few months with a client who has chosen to implement Dremio in Azure. Dremio is a data lake engine that creates a semantic layer and supports interactive queries.

It uses Apache Arrow, Gandiva, and Parquet files under the hood. It runs on either Linux VMs or Kubernetes containers. Like most big data systems, there is at least one coordinator node and one or more executor nodes. These nodes communicate and are managed using Apache Zookeeper. Client applications connect to Dremio via ODBC, JDBC, REST APIs, or Arrow Flight. Dremio can read from storage accounts, external databases, and a few other sources.

Read on for good and bad aspects of the product.
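
For a sense of what “client applications connect via ODBC” looks like in practice, here is a hedged sketch using pyodbc against a Dremio coordinator. The driver name, host, credentials, and dataset names are assumptions that depend on how the Dremio ODBC driver is installed and how your semantic layer is organized.

```python
# Sketch: query a Dremio semantic layer over ODBC.
# Driver name, host, port, credentials, and dataset names are assumptions;
# Dremio's ODBC driver must already be installed on the client machine.
import pyodbc

conn = pyodbc.connect(
    "Driver={Dremio Connector};"
    "ConnectionType=Direct;"
    "HOST=dremio-coordinator.example.com;"
    "PORT=31010;"
    "AuthenticationType=Plain;"
    "UID=analyst;PWD=...",
    autocommit=True,
)

cur = conn.cursor()
# Virtual datasets in the semantic layer appear as ordinary tables/views.
cur.execute("SELECT region, SUM(amount) AS total FROM marts.sales GROUP BY region")
for row in cur.fetchall():
    print(row.region, row.total)
```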


Creating Parquet Files from SQL Server Data

Andy Leonard answers a challenge:

I searched and found some promising Parquet SSIS components available from CData Software and passed that information along. I shared my inexperience in exporting to parquet format and asked a few friends how they’d done it.

I thought: How many times have I demonstrated Azure Data Factory and clicked right past file format selection without giving Parquet a second thought? Too many times. It was time to change that.

Another route is to use PolyBase. If you’re okay with writing the results to Azure Blob Storage, you can insert the results of a SQL query directly into Parquet files. If that sounds interesting, here are posts on connecting to Azure Blob Storage via PolyBase and inserting into Azure Blob Storage. I insert in CSV format to make it easier for people to follow, but swap the file format to Parquet and it works all the same.
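
Yet another route, not covered in the post and offered here only as a hedged sketch: pull the query results into pandas and let pyarrow write the Parquet file. This works fine for small to medium extracts; connection details and table names below are placeholders.

```python
# Sketch: export a SQL Server query result to a Parquet file with pandas + pyarrow.
# Server, database, and table names are placeholders; requires pandas, pyarrow, pyodbc.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorks;Trusted_Connection=yes;"
)

df = pd.read_sql("SELECT * FROM Sales.SalesOrderHeader", conn)

# pandas delegates the Parquet encoding to pyarrow.
df.to_parquet("sales_order_header.parquet", engine="pyarrow", index=False)
```

For larger tables you would want to chunk the read, or push the work to something like PolyBase or Azure Data Factory as described above.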


Deploying Azure Data Services via Terraform

Chris Adkin has started a series on deploying Azure Arc enabled Data Services. Part 1 serves as an introduction:

One of the most significant things to change the landscape for Azure data professionals will be the general release of Azure Arc enabled Data Services. To provide an expedient means of experiencing all that Azure Arc has to offer, Microsoft has come up with Jumpstart – a collection of GitHub repos for deploying Arc in different scenarios. Last Christmas I had a few vacation days and took the opportunity to try out Jumpstart for Azure Arc enabled data services on AWS. AWS was my choice because it made a certain amount of sense to try out Azure Managed SQL Server instances and Postgres Hyperscale on a cloud where they are not natively available. After all, the whole point of Azure Arc enabled Data Services is to bring Azure to you on your terms if for any reason you cannot use the Azure cloud.

Part 2 gives us an introduction to Terraform:

Before diving into what the various Terraform modules that make up the Arc-PX-VMware-Faststart repo do, I’m going to provide an introduction to Terraform in this blog post. Terraform comes from HashiCorp; it is a tool that works on the principle of infrastructure-as-code. Resources are specified in configuration files using the HashiCorp Configuration Language (HCL) in a declarative manner, i.e. you state what you want and, to the best of its ability, Terraform attempts to create those resources for you. ‘Providers’ are used to create resources for particular types of entity – for example, you might use local file, helm (the Kubernetes package manager), Azure, or VMware providers, and so on. Using providers requires plugins, most of which are provided by HashiCorp, but third parties can write their own plugins as well.

Check out the first two posts in what promises to be an interesting series.
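
As a quick illustration of the declarative plan/apply cycle Chris describes, here is a hedged sketch that wraps the standard Terraform CLI from Python. The module directory is hypothetical, and in practice you would normally just run the CLI directly.

```python
# Sketch: drive the Terraform init/plan/apply cycle from Python.
# The module directory is hypothetical; this simply wraps the standard CLI commands.
import subprocess

MODULE_DIR = "modules/arc-data-services"   # hypothetical Terraform configuration


def tf(*args: str) -> None:
    """Run a terraform command in the module directory, failing loudly on errors."""
    subprocess.run(["terraform", *args], cwd=MODULE_DIR, check=True)


tf("init")                  # download the provider plugins declared in the config
tf("plan", "-out=tfplan")   # compute the delta between desired and actual state
tf("apply", "tfplan")       # apply the saved plan; no confirmation prompt needed
```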


Bad Request when Debugging an Azure Data Factory Pipeline

Ed Elliott ran into a problem:

Now, whenever I am troubleshooting something in Azure and I come to the activity logs, I am always hopeful but also always disappointed that they don’t show more details. The bit that really annoys me is that I know Microsoft sees more detailed error information, as I have been screen sharing with a support tech who used a log explorer to see more detailed error messages than I see – grrrr, just show us the data! Anyway, I digress – so in the activity log, does it give a clue as to what is wrong?

No. In a word, no it doesn’t.

Read on for the conclusion, which rates as “Should have been an easy fix but the error message was completely unhelpful.”


Delivering Data Insights using the Microsoft Data Platform

Paul Andrew has a talk:

Let’s start with a story, not a ‘once upon a time story‘, a story for your backlog 

As a solution architect
I need to design and build an Azure data analytics platform end to end
to deliver data insights for my customer.

In February 2021 I delivered a talk as part of the Scottish Summit conference on how you could/should build an end-to-end data platform solution in Azure to deliver data insights and analytics. This is one of my favourite sessions, so I thought it worth re-sharing the recording here.

Click through for the abstract as well as the video.


Ignite 2021 Data and AI Announcements

James Serra has a roundup of the Data & AI announcements at Microsoft Ignite 2021:

Azure Arc enabled machine learning (preview): Build models on-premises, in multi-cloud, and at the edge with Azure Arc. It does this by deploying Azure Machine Learning to a Kubernetes cluster that resides on-prem or in another cloud. More info

Azure Migrate new features: Discover and assess your SQL servers and their databases for migration to Azure from within the Azure Migrate portal, get target SKU recommendations, and estimate monthly costs. More info

Azure SQL enhancements:

Maintenance window – Azure SQL Database and Azure Managed Instance fixed maintenance windows (preview) – More info

Advance notifications – Enable you to prepare for planned maintenance events on your Azure SQL Database resources and minimize the impact of database failover on your sensitive workloads – More info

Click through for details on each of the announcements.


Analyzing XGBoost Training Reports

Simon Zamarin, et al., walk us through using XGBoost reports in Amazon’s SageMaker Debugger:

In 2019, AWS unveiled Amazon SageMaker Debugger, a SageMaker capability that enables you to automatically detect a variety of issues that may arise while a model is being trained. SageMaker Debugger captures model state data at specified intervals during a training job. With this data, SageMaker Debugger can detect training issues or anomalies by leveraging built-in or user-defined rules. In addition to detecting issues during the training job, you can analyze the captured state data afterwards to evaluate model performance and identify areas for improvement. This task is made easier with the newly launched XGBoost training report feature. With a minimal amount of code changes, SageMaker Debugger generates a comprehensive report outlining key information that you can use to evaluate and improve the model.

This post shows you an end-to-end example of training an XGBoost model on SageMaker and how to enable the automatic XGBoost report functionality in SageMaker Debugger to quickly and easily evaluate model performance and identify areas of improvement for your model. Even if you don’t have a lot of data science experience, you can still gauge how well the model performs and identify areas of improvement based on information provided by the report. The code from this post is available in the GitHub repo.

Click through for an example of this in action.
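
The “minimal amount of code changes” essentially boils down to attaching one built-in Debugger rule to the estimator. Here is a hedged sketch: the framework version, instance type, role ARN, and S3 paths are placeholders, and the rule name assumes a reasonably current SageMaker Python SDK.

```python
# Sketch: enable the XGBoost training report via a built-in Debugger rule.
# Role ARN, S3 paths, framework version, and instance type are placeholders.
import sagemaker
from sagemaker.debugger import Rule, rule_configs
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name

estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", region, version="1.2-1"),
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgboost-output/",
    hyperparameters={"objective": "binary:logistic", "num_round": "100"},
    # This single rule is what triggers the generated XGBoost training report.
    rules=[Rule.sagemaker(rule_configs.create_xgboost_report())],
)

estimator.fit(
    {"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"}
)
# Once the job finishes, the HTML report lands under the job's rule-output prefix
# beneath output_path.
```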
