Press "Enter" to skip to content

Category: Cloud

Using Cosmos DB For Graph Data

Jose Mendes has an introduction to the Cosmos DB graph engine:

Gremlin is the graph traversal language of Apache TinkerPop, an open source Graph Computing Framework. Gremlin allows the users to write complex queries to traverse their graphs by using a composed sequence of steps, with each step performing an operation on the data stream (further details here). There are 4 fundamental steps:

· transform: transform the objects in the stream

· filter: remove objects from the stream

· sideEffect: pass the object, but yield some side effect

· branch: decide which step to take

Click through for a quick example showing how to create and populate a graph.

Comments closed

A Frugal Stretch Database Alternative

Chris Bell shares a version of Stretch databases for people with budgets:

Stretch databases were going to provide “Cost-effective” availability for cold data, and unlike typical cold data storage,  our data would always be online and available to query. Applications would not need to be modified to work with the seamless design of the stretch database. Run a query, and the data was there being pulled from the cloud when needed. Streamlining on-premises data maintenance by reducing the local footprint of the data files as well as the size of backups! It was even going to be possible to keep data secure via encrypted connections to the cloud and in theory, make a migration to the cloud even easier.

It was destined to be a major win!

Then the price was mentioned.

Do you know anyone using stretch databases today?

Yeah, me neither.

It’s an interesting workaround with several moving parts.

Comments closed

Resuming Azure SQL Data Warehouse With Powershell

Arun Sirpal shows how to unpause an Azure SQL Data Warehouse instance using Powershell:

I totally forgot that with Azure SQL DWH you can pause and resume compute, to save money because it is expensive. Question is how do you go about resuming compute? TSQL is not possible and sure you can do the change via Azure portal but what about PowerShell?

This makes it easy to script out an overnight data load and then pausing the Azure Data Warehouse until the morning when those analysts come in, so that you can save a bit of cash (or a lot, depending upon your DWU utilization).

Comments closed

Comparing Hadoop On EC2 To EMR

Mark Litwintschik looks at EC2 versus EMR in terms of performance, specifically targeting a solution at less than $3 per hour:

The $3.00 price point was driven by the first method: running a single-node Hadoop installation. I wanted to make sure the dataset used in this benchmark could easily fit into memory.

The price set the limit for the second method: AWS EMR. This is Amazon’s Hadoop Platform offering. It has a huge feature set but the key one is that it lets you setup Hadoop clusters with very little instruction. The $3.00 price limit includes the service fee for EMR.

Note I’ll be running everything in AWS’ Irish region and the prices mentioned are region- and in the case of spot prices, time-specific.

I was a bit surprised at which service won.

Comments closed

Backing Up Azure SQL Databases

Arun Sirpal enumerates the options we have for backups of Azure SQL Databases:

If you have a business requirement which has a need to retain database backups for longer than 35 days, then you have an option to use long-term backup retention. This feature utilises the Azure Recovery Services Vault where you can store up to 10 years’ worth of backups for up to 1000 databases per vault and 25 vaults per subscription.

There are some guidelines that you need to follow to successful set this up:

  • Your vault MUST be in the same region, subscription and resource group as your logical SQL Server, if not then you will not be able to set this up.

  • Register the vault to the server.

  • Create a protection policy.

  • Apply the above policy to the databases that require long-term backup retention.

Arun also looks at restoration options.

Comments closed

Uploading Files To Azure Blob Storage With Data Factory V2

Ben Jarvis shows how to use Azure Data Factory V2 to upload files from an on-prem server to Azure Blob Storage:

In ADF V2 the integration runtime is responsible for providing the compute infrastructure that carries out data movement between data stores. A self-hosted integration runtime is an on-premise version of the integration runtime that is able to perform copy activities to and from on-premise data stores.

When we configure a self-hosted integration runtime the data factory service, that sits in Azure, will orchestrate the nodes that make up the integration runtime through the use of Azure Service Bus meaning our nodes that are hosted on-prem are performing all of our data movement and connecting to our on-premises data sources while being triggered by our data factory pipelines that are hosted in the cloud. A self-hosted integration runtime can have multiple nodes associated with it, which not only caters for high availability but also gives an additional performance benefit as ADF will use all of the available nodes to perform processing.

Read on for the scripts and full process.

Comments closed

Corrupting Managed Instances

Brent Ozar has found a bug with Azure SQL Database Managed Instances:

Corruption happens. It’s just a fact of life – storage is gonna fail. Microsoft’s SLAs for storage only give you 3-4 9’s, and there’s nothing in there about never losing your data. Nothing against Azure, either – I’ve lost entire VMs in AWS due to storage corruption.

So let’s demo it. Normally, this kind of thing might be hard to do, but at the moment, DBCC WRITEPAGE is enabled (although I expect that to change before MIs hit General Availability.) I used Erik’s notorious sp_GoAheadAndFireMe to purposely corrupt the master database (not TempDB. I modified it to work with a user database instead, ran it, and in less than ten seconds, the entire instance went unresponsive.

It’s a good post, so check it out.

Comments closed

Data Lake Permissions

Melissa Coates has started a multi-part series on Azure Data Lake permissions.  She’s put up the first three parts already.  Part 1 covers the types of permissions available as well as some official documentation:

(1) RBAC permissions to the ADLS account itself, for the purpose of managing the resource.
RBAC = Role-based access control. RBAC are the familiar Azure roles such as reader, contributor, or owner. Granting a role on the service allows someone to view or manage the configuration and settings for that particular Azure service (ADLS in this case). See Part 2 for info about setting up RBAC.

Part 2 looks at permissions for the Azure Data Lake Store service itself:

Setting permissions for the service + the data stored in ADLS is always two separate processes, with one exception: when you define an owner for the ADLS service in Azure, that owner is automatically granted ‘superuser’ (full) access to manage the ADLS resource in Azure *AND* full access to the data. Any other RBAC role other than owner needs the data access specifically assigned via ACLs. This is a good thing because not all system administrators need to see the data, and not all data access users/groups/service principals need access to the service itself. This type of separation is true for certain other services too, such as Azure SQL Database.

Try to use groups whenever you can to grant access, rather than individual accounts. This is a consistent best practice for managing security across many types of systems.

Part 3 covers using ACLs to grant rights to specific files or folders in Azure Data Lake Storage:

There are two types of ACLs: Access ACLs and Default ACLs.

An Access ACL is the read/write/execute permissions specified for a folder or file. Every single folder or file has its security explicitly defined — so that means the ADLS security model is not an ‘inheritance’ model. That is an important concept to remember.

Default ACL is like a ‘template’ setting at a folder level (the concept of a default doesn’t apply at the file level). Any new child item placed in that folder will automatically obtain that default security setting. The default ACLs are absolutely critical, given that data permissions aren’t an inheritance model. You want to avoid a situation where a user has permission to read a folder, but is unable to see any of the files within the folder — that situation will happen if a new file gets added to a folder which has an access ACL set at the folder level, but not a default ACL to apply to new child objects.

There’s a lot of good information here and I’m looking forward to parts 4 and 5.

Comments closed

Working With Azure SQL Managed Instances

Jovan Popovic has a couple of posts covering configuration for Azure SQL Managed Instances.  First, he looks at how to configure tempdb:

One limitation in the current public preview is that tempdb don’t preserves custom settings after fail-over happens. If you add new files to tempdb or change file size, these settings will not be preserved after fail-over, and original tempdb will be re-created on the new instance. This is a temporary limitation and it will be fixed during public preview.

However, since Managed Instance supports SQL Agent, and SQL Agent can be configured to execute some script when SQL Agent start, you can workaround this issue and create a SQL Agent job that will pre-configure your tempdb.

SQL Agent will start whenever Managed Instance fail-over and the job that contains script above can increase tempdb size before you start running your workload on the new instance.

Then, he covers network configuration:

Managed Instance is your dedicated resource that is placed in Azure Virtual network with assigned private IP address. Before you create Managed Instance, you need to create Azure Virtual network using Azure portalPowerShell, or Azure CLI.

If you are using Azure portal, make sure that you use Resource Manager ake sure that Service Endpoints option is Disabled in Creating Virtual Network Blade (this is default option so don’t change it).

If you want to have only one subnet in your Virtual Network (Virtual Network blade will enable you to define first subnet called default), you need to know that Managed Instance subnet can have between 16 and 256 addresses. Therefore, use subnet masks /28 to /24 when defining your subnet IP ranges for default subnet. If you know how many instances you will have make sure that you have at least 2 addresses per instance + 5 system addresses in the default subnet.

Both posts are useful if you’re interested in getting started with a managed instance.

Comments closed

Flushing The Authentication Cache

Arun Sirpal describes an Azure SQL DB-only DBCC command:

This command only applies to Azure SQL Database, at a high level it empties the database authentication cache for logins and firewall rules for the current USER database.

In Azure SQL Database the authentication cache makes a copy of logins and server firewall rules which are in the master database and puts them into memory within the user database. The Database Engine attempts re-authorisation using the originally submitted password and no user input is required.

If this still doesn’t make sense, then an example will really help.

Click through for the helpful example.

Comments closed