Category: Cloud

Azure Databricks And Active Directory

Published 2019-01-22 by Kevin Feasel

Tristan Robinson wraps up a two-parter on Azure Databricks security:

With the addition of Databricks runtime 5.1 which was released December 2018, comes the ability to use Azure AD credential pass-through. This is a huge step forward since there is no longer a need to control user permissions through Databricks Groups / Bash and then assigning these groups access to secrets to access Data Lake at runtime. As mentioned previously – with the lack of support for AAD within Databricks currently, ACL activities were done on an individual basis which was not ideal. By using this feature, you can now pass the authentication onto Data Lake, and as we know one of the advantages of Data Lake is the tight integration into Active Directory so this simplifies things. Its worth noting that this feature is currently in public preview but having tested it thoroughly, am happy with the implementation/limitations. The feature also requires a premium workspace and only works with high concurrency clusters – both of which you’d expect to use in this scenario.

It looks like this is the way to go forward with securing Azure Databricks. Read the whole thing.

Comments closed

Recreating Dropped Azure SQL Managed Instance DBs

Published 2019-01-22 by Kevin Feasel

Jovan Popovic has a script to re-create an Azure SQL Managed Instance database which you might accidentally have dropped:

Azure SQL Database – Managed Instance is fully-managed PaaS service that provides advanced disaster-recovery capabilities. Even if you accidentally drop the database or someone drops your database as part of security attack, Managed Instance will enable you to easily recover the dropped database.
Azure SQL Managed Instance performs automatic backups of you database every 5-10 minutes. If anything happens with your database and even if someone drops it, your data is not lost. Managed Instance enables you to easily re-create the dropped database from the automatic backups.

Click through for the Powershell script.

Comments closed

Azure Databricks Security

Published 2019-01-21 by Kevin Feasel

Tristan Robinson looks at what’s currently available in terms of security on Azure Databricks:

You’ll notice that as part of this I’m retrieving the secrets/GUIDS I need for the connection from somewhere else – namely the Databricks-backed secrets store. This avoids exposing those secrets in plain text in your notebook – again this would not be ideal. The secret access is then based on an ACL (access control list) so I can only connect to Data Lake if I’m granted access into the secrets. While it is also possible to connect Databricks up to the Azure Key Vault and use this for secrets store instead, when I tried to configure this I was denied based on permissions. After research I was unable to overcome the issue. This would be more ideal to use but unfortunately there is limited support currently and the fact the error message contained spelling mistakes suggests to me the functionality is not yet mature.

To be charitable, there appears to be room for implementation improvement.

Comments closed

Data Lake Organization Tips

Published 2019-01-21 by Kevin Feasel

Melissa Coates has some great advice for people working with data lakes:

Q: Partitioning by date is common. Where should the dates go in the folder hierarchy?
Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder levels (such as by subject area), but we rarely set up security based on time elements.
Optimal for folder security: \SubjectArea\DataSource\YYYY\MM\DD\FileData_YYYY_MM_DD.csv
Tedious for folder security: \YYYY\MM\DD\SubjectArea\DataSource\FileData_YYYY_MM_DD.csv

Click through for all of Melissa’s advice in FAQ form.

Comments closed

CosmosDB Time To Live Support

Published 2019-01-18 by Kevin Feasel

Hasan Savran explains the Time To Live (TTL) counter in CosmosDB:

Another great feature of Cosmos DB is, TTL (Time To Live) support. This is a great option to have if you need a database system with Caching option, or you need to purge your data and you don’t want to develop a function to remove data from your dataset. CosmosDB’s TTL feature is pretty simple, all you need to do is, turn this feature on and declare when data should be removed from your dataset. Best part about TTL in CosmosDB is, CosmosDB does not charge you when it removes the data from your containers so you can use your Request Units for other transactions!

There are two ways to set Time to Live value. You can set the TTL value on a container or you can set it on a specific item by using CosmosDB SDK. TTL value must be in seconds.
TTL timer resets if data gets modified for any reason.

Click through for an example of it in action.

Comments closed

The Forgotten Infrastructure Below Azure BI Architecture Diagrams

Published 2019-01-14 by Kevin Feasel

Meagan Longoria reminds us that there are several products which Azure BI projects need but which we tend to forget when building architectural diagrams:

Let’s start with Azure Active Directory (AAD). In order to provision the resources in the diagram, your Azure subscription must already be associated with an Active Directory. AAD is Microsoft’s cloud-based identity and access management service. Members of an organization have a user account that can sign in to various services. AAD is used to access Office 365, Power BI, and Dynamics 365, as well as the Azure portal. It can also be used to grant access and permissions to specific Azure resources.

Meagan has several of these, so check it out.

Comments closed

AzureR Packages In Cran

Published 2019-01-11 by Kevin Feasel

David Smith points out that the Azure packages for R are now in CRAN:

The suite of AzureR packages for interfacing with Azure services from R is now available on CRAN. If you missed the earlier announcements, this means you can now use the install.packages function in R to install these packages, rather than having to install from the Github repositories. Updated versions of these packages will also be posted to CRAN, so you can get the latest versions simply by running update.packages.

Read on for a summary of those packages.

Comments closed

Data Transformation Tools In The Azure Space

Published 2019-01-11 by Kevin Feasel

James Serra gives us an overview of the major tools you would use for ETL and ELT in Azure:

If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake. And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL). Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data. There is good news and bad news when it comes to which product to use. The good news is there are a lot of products to choose from. The bad news is there are a lot of products to choose from :-). I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part

The only surprise is the non-mention of Azure Data Lake Analytics, and there is a good conversation in the comments section explaining why.

Comments closed

Creating Cosmos DB Indexes

Published 2019-01-10 by Kevin Feasel

Hasan Savran explains indexing in Cosmos DB:

In SQL Server you need to pick which columns you like to index, In CosmosDB you need to pick which columns not to index. It’s kind of same thing at the end. You might ask “If everything is indexed and working fine, why do you want me to poke the well running system?” When we compare SQL Server indexes to CosmosDB Indexes, one thing works exactly same. That is the index file size. CosmosDB holds the indexes in a separate file like SQL Server and if we want to index everything, index file size is going to get large. Since we need to pay for the file space in CosmosDB, you might need to pay extra for indexes that you might never use. Also, your updates, inserts and deletes might cost you more Request Units since CosmosDB needs to maintain all the indexes in the background.

There’s just enough difference to make you pay the price if you assume Cosmos DB works just like SQL Server.

Comments closed

Query Tuning In CosmosDB

Published 2019-01-03 by Kevin Feasel

Hasan Savran explains how we can tune queries in CosmosDB:

This is most common question in my talks about Cosmos DB from DBAs. Cosmos DB is a managed database, this does not mean that you cannot tune up your queries. But the way you tune up the queries is nothing like SQL Server.

First you need to be sure that you configured your Cosmos DB containers right. What do I mean with that? You should pick the right partition key before you start to tune up any of your queries. Tuning up your queries is not going to help you in long run if you selected a wrong partition key when you created Cosmos DB containers. Throughput value is another value you need to worry about, the good news about the throughput is, you can change it if you need to. You cannot change your partition key!

It’s a whole different world over there.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31