Press "Enter" to skip to content

Category: Cloud

Monitoring Azure Data Factory, Integration Runtimes, and Pipelines

Sandeep Arora monitors all the things:

For effective monitoring of ADF pipelines, we are going to use Log Analytics, Azure Monitor and Azure Data Factory Analytics. The above illustration shows the architectural representation of the monitoring setup.

The details of setting up log analytics, alerts and Azure Data Factory Analytics are further discussed in this section.

If you manage Azure Data Factory in your environment, give this a read.

Comments closed

Role-Based Access Control in Snowflake

Warner Chaves explains how role-based access controls work in Snowflake:

The data access privilege granularity is the lowest level of securable that you will use to provide data access. This can theoretically go all the way down to rows and all the way up to full databases. 

I usually recommend that people start out with using Schema as their data access securable granularity. Database is usually too broad and you will inevitably have to re-do your roles and table level. Below is too specific to turn it into a general methodology—you would end up with way too many roles. See the FAQ later in this post on how to mix and match granularities if needed.

Once you have the granularity defined, you then create back-end roles at that level.

Read on to see what those roles look like. It’s a pretty standard RBAC setup.

Comments closed

Azure DevOps Templates for Data Platform Deployments

Kevin Chant has some toys for us:

For my T-SQL Tuesday contribution this month is I want to introduce my Azure DevOps templates for Data Platform deployments.

This months T-SQL Tuesday is hosted by Frank Geisler. Frank has invited us to write about deploying SQL components through descriptive methods and build some new cool templates for them.

Which is good timing for me, because I co-presented a session on the day this post is published. I showed how to use YAML in Azure DevOps for Data Platform deployments at Data Platform Virtual Summit. .

Click through to learn more and see Kevin’s repos, as well as more information on the topic.

Comments closed

Environments in Azure ML

Luis Valencia explains what environments are in Azure ML:

An Environment defines Python packages, environment variables, and Docker settings that are used in machine learning experiments, including in data preparation, training, and deployment to a web service. An Environment is managed and versioned in an Azure Machine Learning Workspace. You can update an existing environment and retrieve a version to reuse. Environments are exclusive to the workspace they are created in and can’t be used across different workspaces.

In basic terms for a developer, it’s basically a Docker Image with all the needed dependencies (conda/pip packages) to run your experiment.

A friendly word of advice from some bad experiences: stick with the curated environments as much as you can. Those are easy and rarely fail. Building your own environments from Conda files is a possibility, but it’s an, err, probabilistic exercise as to whether your compute target will actually work or not.

Comments closed

AWS EC2 I3 Instance Types and Storage Persistence

Steve REzhener has a warning for us:

Amazon Web Services Elastic Cloud Computing (a.k.a. EC2)  is a service that lets anyone with a credit card rent a virtualized server from Amazon. To cater to different clients’ needs, AWS provides various instance types that are either general instance or specific-purpose instances (focused on CPU, RAM, IO). You can see the different types in Fig 1. This blog post is going to talk about a storage optimized instance. the I3 instance type family, its little-known problem, and the solution in the form of  Elastic Block Storage (a.k.a. EBS).

Click through for the warning, more explanation, and what you can do about it. H/T the SQLServerCentral newsletter.

Comments closed

Patched Security Flaw in Azure Container Instances

Ionut Ilascu reports on a vulnerability:

Microsoft has fixed a vulnerability in Azure Container Instances called Azurescape that allowed a malicious container to take over containers belonging to other customers on the platform.

An adversary exploiting Azurescape could execute commands in the other users’ containers and gain access to all their data deployed to the platform, the researchers say.

This is fixed now, but it’s a good reminder that platform-as-a-service offerings can still have security problems (as we’ve also seen recently with Power Apps and Cosmos DB).

Comments closed

Databricks Serverless SQL

Nikhil Jethava and Kevin Clugage announce serverless SQL on Databricks:

Databricks SQL already provides a first-class user experience for BI and SQL directly on the data lake, and today, we are excited to announce another step in making data and AI simple with Databricks Serverless SQL. This new capability for Databricks SQL provides instant compute to users for their BI and SQL workloads, with minimal management required and capacity optimizations that can lower overall cost by an average of 40%. This makes it even easier for organizations to expand adoption of the lakehouse for business analysts who are looking to access the rich, real-time datasets of the lakehouse with a simple and performant solution.

Under the hood of this capability is an active server fleet, fully managed by Databricks, that can transfer compute capacity to user queries, typically in about 15 seconds. The best part? You only pay for Serverless SQL when users start running reports or queries.

Things are getting interesting between Databricks and Azure Synapse Analytics, as both now have serverless SQL and Spark offerings. Synapse Analytics has the better implementation for serverless SQL and Databricks the superior Spark implementation, so it becomes a question of which weakness you take in order to gain the strength.

1 Comment

Interchangability between ADF and Synapse Integration Pipelines

Paul Andrew makes a discovery:

Inspired by an earlier blog where we looked at ‘How Interchangeable Delta Tables Are Between Databricks and Synapse‘ I decided to do a similar exercise, but this time with the integration pipeline components taking centre stage.

As I said in my previous blog post, the question in the heading of this blog should be incredibly pertinent to all solution/technical leads delivering an Azure based data platform solution so to answer it directly:

Read on to learn the answer.

Comments closed

Attributing Redshift Costs to Users

Jason Pedreza, et al, show how you can break down query utilization by user in an Amazon Redshift database:

At its simplest form, cost attribution can be determined using the amount of the storage assigned to the individual objects using the ownership of the objects to the groups. But the downside of this approach is it doesn’t provide a true translation of the resource usage. For example, let’s say Team 1 has total object size of 1 TB, whereas Team 2 has 100 GB in total size. Team 1 member runs 10 queries daily, and Team 2 runs 1,000 queries per day. Of course, Team 2 uses more resources than Team 1.

The Amazon Redshift RA3 architecture allows you to pay for the compute and data warehouse storage capacity separately, therefore storage doesn’t reflect the resources used by the teams for the cost attribution.

Click through to see how.

Comments closed

An Overview of Function-as-a-Service

Grace Ol’Halloran lays out the basics of serverless computing in cloud platforms:

The term serverless computing can be misleading; how can you compute things without a server? Well, the answer is that you don’t. The term “serverless” comes from the idea that the server is abstracted from the developer, and is totally maintained by the cloud provider. In other words, the developer doesn’t really care what environment their code is run in; they just need it hosted somewhere where it can be executed. This removes the responsibility of infrastructure configuration and maintenance from the developer, but naturally gives them less flexibility and control over the environment.

It took me watching several presentations before I really understood the value behind serverless compute.

Comments closed