
Category: Cloud

Getting Started with Azure Databricks

Brad Llewellyn has a tutorial for Azure Databricks:

Databricks is a managed Spark framework, similar to what we saw with HDInsight in the previous post.  The major difference between the two technologies is that HDInsight is more of a managed provisioning service for Hadoop, while Databricks is more like a managed Spark platform.  In other words, HDInsight is a good choice if we need the ability to manage the cluster ourselves, but don’t want to deal with provisioning, while Databricks is a good choice when we simply want to have a Spark environment for running our code with little need for maintenance or management.

Azure Databricks is not a Microsoft product.  It is owned and managed by the company Databricks and available in Azure and AWS.  However, Databricks is a “first party offering” in Azure.  This means that Microsoft offers the same level of support, functionality and integration as it would with any of its own products.  You can read more about Azure Databricks herehereand here.

Click through for a demonstration of the product.
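
If you want a feel for what that first bit of code looks like once a cluster is up, here is a minimal PySpark sketch (not taken from Brad's post); the file path and column names are made up, and in a Databricks notebook the spark session already exists.

```python
# Minimal first job against a Databricks cluster. Inside a Databricks
# notebook, `spark` is already provided; building it here keeps the
# sketch self-contained. The path and columns below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("getting-started").getOrCreate()

df = spark.read.csv("/mnt/demo/sales.csv", header=True, inferSchema=True)

df.printSchema()  # confirm the inferred column types

(df.groupBy("region")
   .agg(F.sum("amount").alias("total_amount"))
   .show())
```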


Using the Cosmos DB Change Feed

Hasan Savran (who just became a Microsoft MVP, so congrats to him) takes us through the Cosmos DB Change Feed:

Azure Cosmos DB Change Feed exposes Cosmos DB logs outside of Cosmos DB. Cosmos DB notifies you immediately when there is any change in your database. It supports all inserts and updates; deletes will be available soon. You can always use soft deletes to catch delete events if you need to.

By knowing what has changed in your database, you can trigger all kinds of events and make your application very smart. SQL Server has similar functionality, but like many other features, log shipping is usually blocked by DBAs or company policies. In Cosmos DB, you don’t need to do anything to enable the Change Feed feature! It’s already enabled; all you need to do is configure it. The easiest way to catch Change Feed events is with Azure Functions.

When I hear someone describe the change feed, I immediately imagine it as a Kafka topic.
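
To make that concrete, the Azure Functions route Hasan mentions boils down to a Cosmos DB trigger handing your function batches of changed documents. Here is a hedged sketch using the Python programming model; the binding itself (database, container, connection setting) lives in function.json, and the field names are assumptions.

```python
import logging

import azure.functions as func


def main(documents: func.DocumentList) -> None:
    # Invoked by a cosmosDBTrigger binding (declared in function.json) each
    # time the monitored container's change feed has new inserts or updates.
    if not documents:
        return
    logging.info("Change feed delivered %d document(s)", len(documents))
    for doc in documents:
        # 'id' is always present on a Cosmos DB document; any other fields
        # depend entirely on your own schema.
        logging.info("Changed document id: %s", doc.get("id"))
```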


Developing Big Data Cluster Spark Jobs with IntelliJ

Jenny Jiang shows how we can use IntelliJ IDEA to develop Spark jobs against SQL Server Big Data Clusters:

We’re delighted to release the Azure Toolkit for IntelliJ support for SQL Server Big Data Cluster Spark job development and submission. For first-time Spark developers, it can often be hard to get started and build their first application, with long and tedious development cycles in the integrated development environment (IDE). This toolkit empowers new users to get started with Spark in just a few minutes. Experienced Spark developers also find it faster and easier to iterate their development cycle.

The toolkit extends IntelliJ support for the Spark job life cycle starting from creation, authoring, and debugging, through submission of jobs to SQL Server Big Data Clusters. It enables you to enjoy a native Scala and Java Spark application development experience and quickly start a project using built-in templates and sample code. The integration with SQL Server Big Data Cluster empowers you to quickly submit a job to the big data cluster as well as monitor its progress. The Spark console allows you to check schemas, preview data, and validate your code logic in a shell-like environment while you can develop Spark batch jobs within the same toolkit.

It looks pretty good from my vantage point.


Building the Right Architecture for the Job

Gogula Aryalingam takes us through an example where the newest and most expensive tools aren’t the best for the job at hand:

When Azure SQL Data Warehouse was chosen to implement a multi-dimensional data warehouse, it may have seemed like the ideal choice. Why? Because it was plain to see: keywords: “SQL”, “Warehouse”. However, no: SQL Data Warehouse is ideal only when you have data loads that are quite high, not when it is only several hundred gigabytes. Armed with a few more reasons as to why not (A good reference for choosing Azure SQL Data Warehouse), I confronted them. But the rebuke then was that they did get good enough performance, and that cost wasn’t a problem. Until, of course, a few months later, when complex queries started hitting the system and, despite being able to afford that cost, the value of paying that amount did not seem worth it.

Having a good architectural understanding of the Azure or AWS platform—even if you aren’t deeply familiar with all of the tools—can help avoid these types of problems.


Deploying Azure Databricks in a Custom VNET

Abhinav Garg and Anna Shrestinian explain how you can use VNET injection with Azure Databricks:

To make the above possible, we provide a Bring Your Own VNET (also called VNET Injection) feature, which allows customers to deploy the Azure Databricks clusters (data plane) in their own managed VNETs. Such workspaces could be deployed using the Azure Portal, or in an automated fashion using ARM Templates, which could be run using the Azure CLI, Azure PowerShell, the Azure Python SDK, etc.

With this capability, the Databricks workspace NSG is also managed by the customer. We manage a set of inbound and outbound NSG rules using a Network Intent Policy, as those are required for secure, bidirectional communication with the control/management plane. 

This is a good article if the defaults won’t get past corporate security.
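
As one illustration of the ARM template route, here is a hedged sketch of the “Azure Python SDK” option the authors mention, using azure-identity and azure-mgmt-resource. The subscription ID, resource group, and file names are placeholders, and older versions of the SDK call the method create_or_update rather than begin_create_or_update.

```python
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "<subscription-id>"          # placeholder
RESOURCE_GROUP = "databricks-vnet-rg"          # hypothetical resource group
DEPLOYMENT_NAME = "databricks-vnet-injection"  # hypothetical deployment name

client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# The template and parameters files would be the VNET injection ARM template
# for the workspace; the file names here are made up.
with open("workspace-vnet-injection.json") as f:
    template = json.load(f)
with open("workspace-vnet-injection.parameters.json") as f:
    parameters = json.load(f)["parameters"]

poller = client.deployments.begin_create_or_update(
    RESOURCE_GROUP,
    DEPLOYMENT_NAME,
    {
        "properties": {
            "template": template,
            "parameters": parameters,
            "mode": "Incremental",
        }
    },
)
print(poller.result().properties.provisioning_state)
```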


Generating TPC-DS Data Sets with HDInsight

Chris Koester shows how you can generate artificial data sets in the TPC-DS format using HDInsight:

This post describes how to generate big datasets with Hive in HDInsight, specifically TPC-DS benchmarking datasets. There are many tools for generating sample data, and this one is particularly nice due to its familiarity and ability to generate massive datasets up to 100 terabytes in size. The intended purpose of TPC data is benchmarking, but big sample datasets are also very useful for learning big data tools, proofs of concept, testing, etc.

The TPC (Transaction Processing Performance Council) provides tools for generating the benchmarking data, but using them to generate big data is not trivial, and would take a very long time on modest hardware. Thankfully, someone has written a nice utility that uses Hive and Python to run the generator on a Hadoop cluster. While Hadoop clusters are not easy to set up, using a Hadoop cloud service like Azure HDInsight is remarkably easy. With HDInsight, you can use a powerful cluster of machines to generate the data quickly, and when you’re done you can delete the cluster, leaving the data in place.

Most of the instructions should carry over to on-prem or non-HDInsight Hadoop clusters, though there will be some changes to accommodate differences in HDInsight.
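
Assuming the utility registers the generated data as Hive tables (as the Hive-based generators typically do), a quick sanity check from a Spark session on the cluster might look like the sketch below; the database and table names are placeholders that depend on how you ran the generator.

```python
# Row-count sanity check against the generated TPC-DS tables. The database
# name below is hypothetical; adjust it to match your generator settings.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tpcds-sanity-check")
    .enableHiveSupport()   # read from the Hive metastore on the cluster
    .getOrCreate()
)

spark.sql("USE tpcds")
spark.sql("SHOW TABLES").show(truncate=False)
spark.sql("SELECT COUNT(*) AS row_count FROM store_sales").show()
```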


Databricks Dashboards

Megan Quinn takes us through building dashboards from Databricks notebooks:

The first step in any type of analysis is to understand the dataset itself. A Databricks dashboard can provide a concise format in which to present relevant information about the data to clients, as well as a quick reference for analysts when returning to a project.

To create this dashboard, a user can simply switch to Dashboard view instead of Code view under the View tab. The user can either click on an existing dashboard or create a new one. Creating a new dashboard will automatically display any of the visualizations present in the notebook. Customization of the dashboard is easily achieved by clicking on the chart icon in the top right corner of the desired command cells to add new elements.

This isn’t quite a step-by-step guide but does spur on ideas.
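
As one concrete example, any notebook cell whose output comes from display() can be pinned to a dashboard this way. Here is a minimal sketch; the sales table is made up, and display() is specific to Databricks notebooks.

```python
# Runs inside a Databricks notebook, where `spark` and `display()` are
# provided by the environment. The `sales` table is hypothetical.
daily_totals = spark.sql("""
    SELECT order_date,
           SUM(amount) AS total_sales
    FROM sales
    GROUP BY order_date
    ORDER BY order_date
""")

# Render the result, switch the cell output to a chart, then use the chart
# icon to add it to a dashboard as described above.
display(daily_totals)
```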


Putting TempDB Files On Azure IaaS D Drive

John McCormack tries out using the temporary drive on Azure VMs for tempdb:

Azure warns you not to store data on the D drive in Azure VMs, but following this advice could mean you are missing out on some very fast local storage. It’s good general advice because this local storage is not permanently attached to your instance, meaning you could lose data or log files if your VM is stopped and restarted. But what if you could afford to lose certain files? Say, files that are recreated during startup anyway.

TempDB is the ideal candidate for this. No other database is suitable! Putting the tempdb data and log files onto the D drive can be achieved quite easily with a little bit of effort. And you will most likely see a big improvement in tempdb read/write latency.

John ended up seeing much bigger gains than I did when I tried this, but with a difference that big, it’s definitely worth using the temporary drive for tempdb.


Sizing Azure SQL Database

Arun Sirpal takes us through finding the right size for Azure SQL Database:

Do you want to identify the correct Service Tier and Compute Size (once known as performance level) for your Azure SQL Database? How would you go about it? Would you use the DTU (Database Transaction Unit) calculator? What about the new pricing model, vCore? How would you translate your current on-premises workload to the cloud?

It can be a form of trial and error, especially if you are new to this, but I really do recommend trying out the PowerShell script that you can access once you have installed DMA (Database Migration Assistant).

Read on to see how to run this tool and potentially save some money.


Cleaning Up After Yourself in Azure Data Factory

Rayis Imayev shows how you can automatically delete old files in Azure Data Factory:

File management may not be at the top of my list of priorities during data integration projects. I assume that once I learn enough about the sourcing data systems and the target destination platform, I’m ready to design and build a data integration solution between two or more connecting points. Then a historical file management process becomes a necessity, or a need arises to log and remove some of the incorrectly loaded data files. Basically, a step in my data integration process to remove (or clean) such files would be helpful.

Click through to see how to do this.
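
Rayis does the housekeeping with Data Factory itself. Purely to illustrate the same retention idea outside of ADF, here is a hedged Python sketch that removes old blobs with the azure-storage-blob SDK; the connection string, container, prefix, and retention window are all placeholders.

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "<storage-account-connection-string>"  # placeholder
CONTAINER = "staging"                                      # hypothetical container
RETENTION_DAYS = 30                                        # keep a month of files

cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

container = BlobServiceClient.from_connection_string(
    CONNECTION_STRING
).get_container_client(CONTAINER)

# Walk the landing folder and delete anything older than the retention window.
for blob in container.list_blobs(name_starts_with="incoming/"):
    if blob.last_modified < cutoff:
        print(f"Deleting {blob.name} (last modified {blob.last_modified:%Y-%m-%d})")
        container.delete_blob(blob.name)
```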
