Press "Enter" to skip to content

Category: Cloud

Azure Data Factory Switch Activity

Rayis Imayev explains what the Switch activity does in Azure Data Factory:

Developing conditional logic of your Azure Data Factory control flow has been simplified with introducing of the Switch activity – https://docs.microsoft.com/en-us/azure/data-factory/control-flow-switch-activity. Official documentation resource states, this new data factory activity “provides the same functionality that a switch statement provides in programming languages“. I would also add a more simplified definition of the Switch activity in Azure Data Factory: it is a container (or wrapper) for multiple IF conditions.

Click through for an example.

Comments closed

Running Big Data Clusters on VS Subscriptions

Kevin Chant has a few tips for people wanting to try out Big Data Clusters with their Visual Studio subscriptions to Azure:

In order to present the right results for various outcomes I attempted to deploy Big Data Clusters multiple times.

When I say multiple times, I mean the number of deployments easily went into double figures. Because I was testing deploying various virtual machine sizes in multiple regions.

Hence, I spent many hours testing and verifying the results in order to present them properly.

Read on to see Kevin’s notes and recommendations.

Comments closed

Backing Up SQL Server on Azure VMs

Arun Sirpal looks at three techniques for backing up SQL Server running on Azure virtual machines:

In the previous blog post I did a quick overview building a SQL VM (imaged) in Azure. It is now time to clarify some backup techniques because it can get confusing.

At a high level there are 3 techniques.
– Automated backup.
– Azure backup for SQL VM (that’s what MS call it).
– Manual backup, for example backup to URL.

Read on to learn more about each.

Comments closed

Running Oracle on Azure

Kellyn Pot’vin-Gorman takes us through various options on running Oracle in Azure:

Running Oracle on Azure VM environments aren’t that different from running Oracle on VMs in your on-premises for a DBA.  The DBAs and developers that I work with still have their jobs, still work with their favorite tools and also get the chance to learn new skills such as cloud administration.

Click through for more, including a setup script.

Comments closed

Comparing Azure SQL DB and RDS on Price and Performance

Joey D’Antoni summarizes a recent publication:

Yesterday, GigaOm published a benchmark of Azure SQL Database as compared to Amazon’s RDS service. It’s an interesting test case that tries to compare the performance of these platform as a service database offerings. One of the many challenges of this kind of a study is that the product offerings are not exactly analogous. However, GigaOm was able to build a test case where they used similar sized offerings as shown in the image below.

Read on for the summary and Joey’s thoughts.

Comments closed

Limiting Index Sizes in Cosmos DB

Hasan Savran explains why you might want to exclude columns from Cosmos DB indexes:

If everything is indexed already; Why do we want to exclude some of indexes? Indexes are saved on disk, you pay for the storage in Azure. If you keep indexing everything, your index file gets larger and you pay more for storage.

     Also; write operations to index file takes longer if index file is larger. By keeping only what you need in index file will improve the latency of write operations. If you will need to change your indexing policies, Rebuilding indexes will take less time.

This behavior is quite different from the way SQL Server behaves, where indexing is more of an opt-in philosophy.

Comments closed

Using Azure Kubernetes Services for Big Data Clusters

Mohammad Darab explains why it’s a good idea to use Azure Kubernetes Service when building out a Big Data Cluster:

According to the Microsoft documentation, there are three ways to deploy a Big Data Cluster:

1. Minikube
2. Kubeadm
3. AKS

I’ll go into each and list the pros and cons.

Of course, if you have a great Kubernetes admin, on-prem is certainly a viable option, but AKS is definitely easier to get started with.

Comments closed

Architecting a Data Lake in AWS

Gaurav Mishra takes us through data lake architecture on AWS:

Landing zone: This is the area where all the raw data comes in, from all the different sources within the enterprise. The zone is strictly meant for data ingestion, and no modelling or extraction should be done at this stage.

Curation zone: Here’s where you get to play with the data. The entire extract-transform-load (ETL) process takes place at this stage, where the data is crawled to understand what it is and how it might be useful. The creation of metadata, or applying different modelling techniques to it to find potential uses, is all done here.

Production zone: This is where your data is ready to be consumed into different application, or to be accessed by different personas. 

This is a nice overview of data lake concepts and worth the read if you’re using AWS. Even if not, the same principles (if not the same technologies) apply for Azure, other clouds, and on-prem.

Comments closed

Installing Python Libraries on EMR Clusters with Notebooks

Parag Chaudhari shows how we can install Python libraries on existing ElasticMapReduce clusters using EMR Notebooks:

The notebook-scoped libraries discussed previously require your EMR cluster to have access to a PyPI repository. If you cannot connect your EMR cluster to a repository, use the Python libraries pre-packaged with EMR Notebooks to analyze and visualize your results locally within the notebook. Unlike the notebook-scoped libraries, these local libraries are only available to the Python kernel and are not available to the Spark environment on the cluster. To use these local libraries, export your results from your Spark driver on the cluster to your notebook and use the notebook magic to plot your results locally. Because you are using the notebook and not the cluster to analyze and render your plots, the dataset that you export to the notebook has to be small (recommend less than 100 MB).

Read the whole thing.

Comments closed

Ordered Clustered Columnstore Indexes in Azure SQL DW

Niko Neugebauer takes us through a new feature in preview for Azure SQL Data Warehouse:

After creating (or dropping and recreating a Clustered Columnstore Index we can specify the reserved word ORDER and then one or !!!MULTIPLE!!! columns. This looks like an extremely promising feature!

On Azure SQL Data Warehouse one can of course define table as a Columnstore and with that specification it is also possible to define an ORDER option with one or multiple columns.

For the syntax and basic functionality testing purposes on Azure SQL Data Warehouse, let us then create a table with a Clustered Columnstore Index, load some data and see if by recreating an Ordered Clustered Columnstore Index we can achieve some improvements.

Niko has a few hard-earned lessons from this post.

Comments closed