Press "Enter" to skip to content

Category: Cloud

Data Lake Permissions

Melissa Coates has started a multi-part series on Azure Data Lake permissions.  She’s put up the first three parts already.  Part 1 covers the types of permissions available as well as some official documentation:

(1) RBAC permissions to the ADLS account itself, for the purpose of managing the resource.
RBAC = Role-based access control. RBAC roles are the familiar Azure roles such as reader, contributor, or owner. Granting a role on the service allows someone to view or manage the configuration and settings for that particular Azure service (ADLS in this case). See Part 2 for info about setting up RBAC.

Part 2 looks at permissions for the Azure Data Lake Store service itself:

Setting permissions for the service + the data stored in ADLS is always two separate processes, with one exception: when you define an owner for the ADLS service in Azure, that owner is automatically granted ‘superuser’ (full) access to manage the ADLS resource in Azure *AND* full access to the data. Any RBAC role other than owner needs the data access specifically assigned via ACLs. This is a good thing because not all system administrators need to see the data, and not all data access users/groups/service principals need access to the service itself. This type of separation is true for certain other services too, such as Azure SQL Database.

Try to use groups whenever you can to grant access, rather than individual accounts. This is a consistent best practice for managing security across many types of systems.

Part 3 covers using ACLs to grant rights to specific files or folders in Azure Data Lake Storage:

There are two types of ACLs: Access ACLs and Default ACLs.

An Access ACL is the read/write/execute permissions specified for a folder or file. Every single folder or file has its security explicitly defined — so that means the ADLS security model is not an ‘inheritance’ model. That is an important concept to remember.

Default ACL is like a ‘template’ setting at a folder level (the concept of a default doesn’t apply at the file level). Any new child item placed in that folder will automatically obtain that default security setting. The default ACLs are absolutely critical, given that data permissions aren’t an inheritance model. You want to avoid a situation where a user has permission to read a folder, but is unable to see any of the files within the folder — that situation will happen if a new file gets added to a folder which has an access ACL set at the folder level, but not a default ACL to apply to new child objects.
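
To make the access-versus-default distinction concrete, here is a rough sketch using the azure-datalake-store Python package: it grants a group read/execute on a folder and adds a matching default ACL so that new child items pick up the same permission. The store name, group object ID, and exact ACL spec strings are illustrative assumptions, not taken from Melissa's series.

```python
from azure.datalake.store import core, lib

# Authenticate with a service principal (tenant, client ID, and secret are placeholders).
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<client-id>",
                 client_secret="<client-secret>")
adls = core.AzureDLFileSystem(token, store_name="<adls-account-name>")

group_oid = "<aad-group-object-id>"

# Access ACL: lets the group read and traverse the folder itself.
adls.modify_acl_entries("/raw/sales", acl_spec=f"group:{group_oid}:r-x")

# Default ACL: new files and folders created under /raw/sales automatically
# receive this entry, which matters because ADLS ACLs are not inherited.
adls.modify_acl_entries("/raw/sales", acl_spec=f"default:group:{group_oid}:r-x")
```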

There’s a lot of good information here and I’m looking forward to parts 4 and 5.


Working With Azure SQL Managed Instances

Jovan Popovic has a couple of posts covering configuration for Azure SQL Managed Instances.  First, he looks at how to configure tempdb:

One limitation in the current public preview is that tempdb doesn’t preserve custom settings after a failover happens. If you add new files to tempdb or change file sizes, these settings will not be preserved after failover, and the original tempdb will be re-created on the new instance. This is a temporary limitation and it will be fixed during the public preview.

However, since Managed Instance supports SQL Agent, and SQL Agent can be configured to execute a script when it starts, you can work around this issue by creating a SQL Agent job that will pre-configure your tempdb.

SQL Agent will start whenever the Managed Instance fails over, and the job containing the script above can increase the tempdb size before you start running your workload on the new instance.
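
As a minimal sketch of that workaround, the following Python script re-applies a custom tempdb file size after a failover; in practice you would have a SQL Agent job run this (or the equivalent T-SQL job step) on Agent startup. The instance name, credentials, file name, and size are placeholders rather than values from Jovan's post.

```python
import pyodbc

# Connect to the Managed Instance (server name and credentials are placeholders).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<managed-instance-host>;DATABASE=master;"
    "UID=<admin-user>;PWD=<password>",
    autocommit=True,
)

# After a failover, tempdb comes back with its default settings,
# so re-apply the desired file size each time the Agent starts.
conn.execute("ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 8GB);")
```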

Then, he covers network configuration:

Managed Instance is a dedicated resource placed in an Azure Virtual Network with an assigned private IP address. Before you create a Managed Instance, you need to create the Azure Virtual Network using the Azure portal, PowerShell, or the Azure CLI.

If you are using the Azure portal, make sure that you use the Resource Manager deployment model and that the Service Endpoints option is Disabled in the Create Virtual Network blade (this is the default option, so don’t change it).

If you want to have only one subnet in your Virtual Network (the Virtual Network blade will enable you to define a first subnet called default), you need to know that the Managed Instance subnet can have between 16 and 256 addresses. Therefore, use subnet masks /28 to /24 when defining the IP range for the default subnet. If you know how many instances you will have, make sure that you have at least 2 addresses per instance plus 5 system addresses in the default subnet.
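
The arithmetic is easy to sanity-check with Python’s standard ipaddress module; the subnet range and instance count below are made up for illustration.

```python
import ipaddress

# A /27 gives 32 addresses, comfortably inside the 16-256 range (/28 to /24).
subnet = ipaddress.ip_network("10.0.0.0/27")
instances_planned = 4

required = instances_planned * 2 + 5   # 2 addresses per instance + 5 system addresses
print(f"Subnet holds {subnet.num_addresses} addresses; estimated need is {required}")
```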

Both posts are useful if you’re interested in getting started with a managed instance.


Flushing The Authentication Cache

Arun Sirpal describes an Azure SQL DB-only DBCC command:

This command only applies to Azure SQL Database; at a high level, it empties the database authentication cache of logins and firewall rules for the current USER database.

In Azure SQL Database the authentication cache makes a copy of logins and server firewall rules which are in the master database and puts them into memory within the user database. The Database Engine attempts re-authorisation using the originally submitted password and no user input is required.

If this still doesn’t make sense, then an example will really help.

Click through for the helpful example.
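
If you want to try the command yourself, a minimal sketch from Python looks like this; the server, database, and credentials are placeholders, and note that the connection must target the user database rather than master.

```python
import pyodbc

# Connect to the user database on Azure SQL Database (details are placeholders).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<user-database>;"
    "UID=<user>;PWD=<password>",
    autocommit=True,
)

# Empty the database authentication cache of logins and firewall rules
# for the current user database.
conn.execute("DBCC FLUSHAUTHCACHE;")
```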


Using Biml With Azure Data Factory v2

Ben Weissman shows how you can use BimlStudio to build ADF v2 flows:

As you may have seen at PASS Summit 2017 or another event, with the announcement of Azure Data Factory v2 (adf), Biml will natively support adf objects.

Please note that the native support is currently only available in BimlStudio 2018. If you’re using BimlExpress, you can still generate the JSON for your pipelines, datasets, etc. using Biml, but you cannot use the newly introduced tags.

The really good parts are only available in the paid product; if you do a lot of Azure Data Factory work, that might tip the scales in favor of getting BimlStudio.


Calling Azure Cognitive Services From SSIS

Rolf Tesmer shows off how easy it is to call Azure Cognitive Services from SQL Server Integration Services:

My SQL SSIS package leverages the Translator Text API service.  For those who want to learn the secret sauce, I suggest checking here – https://azure.microsoft.com/en-us/services/cognitive-services/translator-text-api/

Essentially, this API is pretty simple:

  1. It accepts source text, source language, and target language.  (The API can translate to/from over 60 different languages.)

  2. You call the API with your request parameters + API Key

  3. The API will respond with the language translation of the source text you sent in

  4. So Simple, so fast, so effective!

Click through for the full post.  It really is simple.
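
For reference, a bare-bones call to the Translator Text API v3 from Python might look like the sketch below; the endpoint, parameters, and response shape follow the public API documentation rather than Rolf’s SSIS package, the key is a placeholder, and depending on your key type you may also need a region header.

```python
import requests

# Translator Text API v3: translate English source text to German.
endpoint = "https://api.cognitive.microsofttranslator.com/translate"
params = {"api-version": "3.0", "from": "en", "to": "de"}
headers = {
    "Ocp-Apim-Subscription-Key": "<your-api-key>",
    "Content-Type": "application/json",
}
body = [{"Text": "The quick brown fox jumps over the lazy dog."}]

response = requests.post(endpoint, params=params, headers=headers, json=body)
response.raise_for_status()

# The response is a list with one entry per input text.
print(response.json()[0]["translations"][0]["text"])
```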


Running Cassandra On EC2

Prasad Alle and Provanshu Dey share some tips if you’re running Cassandra on Amazon’s EC2:

Apache Cassandra is a commonly used, high performance NoSQL database. AWS customers that currently maintain Cassandra on-premises may want to take advantage of the scalability, reliability, security, and economic benefits of running Cassandra on Amazon EC2.

Amazon EC2 and Amazon Elastic Block Store (Amazon EBS) provide secure, resizable compute capacity and storage in the AWS Cloud. When combined, you can deploy Cassandra, allowing you to scale capacity according to your requirements. Given the number of possible deployment topologies, it’s not always trivial to select the most appropriate strategy suitable for your use case.

In this post, we outline three Cassandra deployment options, as well as provide guidance about determining the best practices for your use case in the following areas:

  • Cassandra resource overview

  • Deployment considerations

  • Storage options

  • Networking

  • High availability and resiliency

  • Maintenance

  • Security

Click through to see these tips.


Launching A Sparklyr Cluster

David Smith shows how to launch a sparklyr cluster in Azure:

When you’re finished, shut down your cluster using the aztk spark cluster delete command. (While you can delete the nodes from the Pools view in the Azure portal, the command does some additional cleanup for you.) You’ll be charged for each node in the cluster at the usual VM rates for as long as the cluster is provisioned. (One cost-saving option is to use low-priority VMs for the nodes, for savings of up to 90% compared to the usual rates.)

That’s it! Once you get used to it, it’s all quick and easy — the longest part is waiting for the cluster to spin up in Step 5. This is just a summary; for the full details, see the guide SparklyR on Azure with AZTK.

It’ll take a bit more than five minutes to get started, but it is a good sight easier than building the servers yourself.


Flotilla: Containerized Task Execution

Akshay Wadia, et al, introduce a new open source tool:

Data scientists are not always equipped with the requisite engineering skills to deploy robust code to a production job execution and scheduling system. Yet, forcing reliance on data platform engineers will impede the scientists’ autonomy. If only there was another way.

Today we’re excited to introduce Flotilla, our latest open source project. Flotilla is a human friendly service for task execution. It allows you to focus on the work you’re doing rather than how to do it. In other words, Flotilla takes the struggle out of defining and running containerized jobs.

It looks like an interesting service.


Temp Tables In Redshift

Derik Hammer has some notes on temporary tables in Amazon Redshift:

One difference between regular tables and temporary tables is how they are typically used. Temporary tables are session-scoped, which means that adding them into a process or report will probably cause them to be created multiple times. Temporary tables might be very similar to regular tables, but most regular tables are not re-written every time they are queried.

The disk writes involved in populating the temporary table might be more expensive than the reads would be if you were to modify your query to include the logic in one larger query. The frequency of the report or process will be a factor in how much of a performance hit you get by using temporary tables. If you are using temporary tables to make debugging a procedure easier or to enhance readability, make sure you understand the IO cost of performing writes and then reading that data back in a subsequent query.

Read on for more.
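
As a small illustration of the trade-off, here is a hedged sketch using psycopg2 against Redshift: the first version materializes a session-scoped temp table (an extra disk write) and then reads it back, while the second folds the same logic into a single query. Connection details, table names, and columns are hypothetical.

```python
import psycopg2

# Connect to Redshift (host, database, and credentials are placeholders).
conn = psycopg2.connect(
    host="<cluster>.redshift.amazonaws.com", port=5439,
    dbname="<database>", user="<user>", password="<password>",
)
cur = conn.cursor()

# Version 1: populate a session-scoped temp table, then query it.
cur.execute("""
    CREATE TEMP TABLE recent_orders AS
    SELECT order_id, customer_id, order_total
    FROM orders
    WHERE order_date >= DATEADD(day, -7, CURRENT_DATE);
""")
cur.execute("SELECT customer_id, SUM(order_total) FROM recent_orders GROUP BY customer_id;")

# Version 2: one larger query with the same logic, skipping the intermediate write.
cur.execute("""
    SELECT customer_id, SUM(order_total)
    FROM orders
    WHERE order_date >= DATEADD(day, -7, CURRENT_DATE)
    GROUP BY customer_id;
""")
```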
