Press "Enter" to skip to content

Category: Cloud

Thoughts On The Evolution Of Big Data

Praveen Sripati shares an opinion on where the various Hadoop and Big Data platforms are headed:

The different Cloud vendors have been offering Big Data as a service for quite some time. Athena, EMR, Redshift, and Kinesis are a few of the services from AWS. There are similar offerings from Google Cloud, Microsoft Azure, and other Cloud vendors as well. All these services are native to the Cloud (built for the Cloud) and provide tight integration with the other services from the Cloud vendor.

In the case of Cloudera, MapR, and Hortonworks, the Big Data platforms were not designed with the Cloud in mind from the beginning; the platforms were later plugged or force-fitted into the Cloud. The Open Hybrid Architecture Initiative is an initiative by Hortonworks to make their Big Data platform more and more Cloud native.

It’ll be interesting to see where this goes.


Databricks Cluster-Scoped Init Scripts

Aayush Bhasin shares some background on a Databricks intern project, adding cluster-scoped initialization scripts to Databricks clusters:

One of the biggest pain points for customers used to be that init scripts for a cluster were not part of the cluster configuration and did not show up in the User Interface. Because of this, applying init scripts to a cluster was unintuitive, and editing or cloning a cluster would not preserve the init script configuration. Cluster-scoped init scripts addressed this issue by including an ‘Init Scripts’ panel in the UI of the cluster configuration page, and adding an ‘init_scripts’ field to the public API. This also allows init scripts to take advantage of cluster access control.
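As a rough illustration of that API surface, here is a hedged sketch of creating a cluster with an init script attached through the public Clusters API; the workspace URL, token, runtime label, node type, and DBFS path are all placeholders, not values from the post.

```python
# Hedged sketch: attach a cluster-scoped init script via the "init_scripts"
# field of the Clusters API.  All names, paths, and settings are placeholders.
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "init-script-demo",
    "spark_version": "4.3.x-scala2.11",   # illustrative runtime label
    "node_type_id": "Standard_DS3_v2",    # illustrative node type
    "num_workers": 2,
    # Because the init script is part of the cluster spec, it survives edits
    # and clones and is visible in the cluster configuration UI.
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/init-scripts/install-libs.sh"}}
    ],
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new cluster_id
```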

Read on to see how Aayush & co. solved this issue.


Databricks UDF Performance Testing

Tristan Robinson shares some performance comps for different Azure Databricks scenarios:

I’ve recently been spending quite a bit of time on the Azure Databricks platform, and while learning decided it was worth using it to experiment with some common data warehousing tasks in the form of data cleansing. As Databricks provides us with a platform to run a Spark environment on, it offers options to use cross-platform APIs that allow us to write code in Scala, Python, R, and SQL within the same notebook. As with most things in life, not everything is equal and there are potential differences in performance between them. In this blog, I will explain the tests I produced with the aim of outlining best practice for Databricks implementations for UDFs of this nature.

Scala is the native language for Spark – and without going into too much detail here, it will compile down faster to the JVM for processing. Under the hood, Python, on the other hand, provides a wrapper around the code, but in reality it is a Scala program telling the cluster what to do, being transformed by Scala code. Converting these objects into a form Python can read is called serialisation/deserialisation, and it's expensive, especially over time and across a distributed dataset. The most expensive scenario occurs with UDFs (functions) – the runtime process for which can be seen below. The overhead here is in steps (4) and (5): reading the data and writing it into JVM memory.
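To make that serialisation overhead concrete, here is a minimal PySpark sketch (mine, not from the post) contrasting a Python UDF, which ships every value across the JVM/Python boundary, with the equivalent built-in function that stays inside the JVM; the session, data, and column names are illustrative.

```python
# Minimal sketch: Python UDF vs. built-in Spark SQL function.
# The DataFrame and column names are placeholders for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, upper
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-overhead-demo").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: each value is serialised out to a Python worker and back (steps 4 and 5 above).
to_upper_udf = udf(lambda s: s.upper() if s else None, StringType())
df.withColumn("name_upper", to_upper_udf(col("name"))).show()

# Built-in function: the same transformation runs entirely inside the JVM.
df.withColumn("name_upper", upper(col("name"))).show()
```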

Click through for the results.  Looks like Python barely beat out Scala for the #1 position, but Scala was a little faster than Python in-class (e.g., the Scala program with a Scala SQL UDF was a little bit faster than the Python equivalent).


Migrating Azure SQL Databases Between Resource Groups

Arun Sirpal shows us a method for moving Azure SQL Databases between resource groups:

In my mind there are a couple of ways to move a database across resource groups. They vary from scripting to just using the Azure portal. I am going to use the Azure portal and do the following.

  1. Export a database in resource group X to a storage account Z.
  2. Import the file from the storage account Z into a database that is in resource group Y.

It’s just like a “backup and restore” strategy, all with the assumption that you are working within the same subscription ID.
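Arun uses the portal, but the same export/import flow can be scripted. Here is a hedged sketch that shells out to the Azure CLI from Python; the server, database, credential, and storage values are placeholders, and it assumes an empty target database already exists to import into.

```python
# Hedged sketch: export a database to a BACPAC in storage account Z, then
# import it into a database in resource group Y, via the Azure CLI.
# Every name, password, and key below is a placeholder.
import subprocess

def az(*args):
    """Run an az CLI command and fail loudly on error."""
    cmd = ["az", *args]
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

bacpac_uri = "https://storagez.blob.core.windows.net/backups/mydb.bacpac"

# 1. Export the database in resource group X to storage account Z.
az("sql", "db", "export",
   "--resource-group", "resource-group-x",
   "--server", "source-server",
   "--name", "mydb",
   "--admin-user", "sqladmin",
   "--admin-password", "<password>",
   "--storage-key-type", "StorageAccessKey",
   "--storage-key", "<storage-account-key>",
   "--storage-uri", bacpac_uri)

# 2. Import the BACPAC into an empty database in resource group Y.
az("sql", "db", "import",
   "--resource-group", "resource-group-y",
   "--server", "target-server",
   "--name", "mydb",
   "--admin-user", "sqladmin",
   "--admin-password", "<password>",
   "--storage-key-type", "StorageAccessKey",
   "--storage-key", "<storage-account-key>",
   "--storage-uri", bacpac_uri)
```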

Read on for a step-by-step demonstration on how to do this.


Using JSON In Azure Data Lake Analytics

Jeffrey Verheul shows how to register .NET assemblies in Azure Data Lake Analytics:

The power of Azure Data Lake is that you can use a variety of different file types to process data (from Azure Data Lake Analytics). But in order to use JSON, you need to register some assemblies first.

Downloading assemblies
The assemblies are available on GitHub for download. Unfortunately, you need to download the solution and compile it on your machine, so I've also made the two DLLs you need available via direct download:

Click through for links to the assemblies and instructions on how to register them. And to continue my long-running joke: every .NET project has Newtonsoft.Json as a core requirement.


Databricks Runtime 4.3 Released

Todd Greenstein announces Databricks Runtime 4.3:

In addition to the performance improvements, we’ve also added new functionality to Databricks Delta:

  • Truncate Table: with Delta you can delete all rows in a table using truncate. It's important to note we do not support deleting specific partitions. Refer to the documentation for more information: Truncate Table

  • Alter Table Replace Columns: replace columns in a Databricks Delta table, including changing the comment of a column; we also support reordering of multiple columns. Refer to the documentation for more information: Alter Table

  • FSCK Repair Table: this command allows you to remove the file entries from the transaction log of a Databricks Delta table that can no longer be found in the underlying file system. This can happen when these files have been manually deleted. Refer to the documentation for more information: Repair Table

  • Scaling "Merge" Operations: this release comes with experimental support for larger source tables with "Merge" operations. Please contact support if you would like to try out this feature.

Looks like a nice set of reasons to upgrade.
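For reference, here is a hedged sketch of what these commands look like when issued from a notebook through spark.sql(); the table name and column definitions are placeholders, and it assumes an existing Delta table on Runtime 4.3 or later.

```python
# Hedged sketch of the Delta commands called out in the release notes,
# run via spark.sql().  "events" and its columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Delete every row in a Delta table (specific partitions cannot be targeted).
spark.sql("TRUNCATE TABLE events")

# Replace the column list, e.g. to change a column comment or reorder columns.
spark.sql("""
    ALTER TABLE events
    REPLACE COLUMNS (event_id STRING COMMENT 'unique identifier', event_time TIMESTAMP)
""")

# Remove transaction-log entries for files no longer present in storage.
spark.sql("FSCK REPAIR TABLE events")
```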


Reserved Capacity With Azure SQL Database

Chris Seferlis explains the concept of Azure SQL Database Reserved Capacity:

Last week I posted about the Azure Reserved VM Instance, where you could save some money in Azure. Another, similar way to save is with Azure SQL Database Reserved Capacity. With this you can save 33% compared to license-included pricing by pre-buying SQL Database vCores for a one- or three-year term.

This can be applied to a single subscription or shared across your enrollments, so you can control how many subscriptions can use the benefit, as well as how the reservation is applied to the specific subscriptions you choose.

Scoping the reservation to a single subscription allows you to apply it to the SQL Database resource(s) within the selected subscription. A reservation with a shared scope can be shared across subscriptions in the enrollment, and there's some flexibility involved, such as with Managed Instances, where you can scale up/down.

Read on for more. AWS has been offering discounts for reserved capacity for a while, but now we're seeing Microsoft play the game too.


Azure SQL Database Elastic Job False Successes

Arun Sirpal walks us through a case where an Azure SQL Database elastic job appears to be running correctly but isn’t:

I use elastic pools. They are a fabulous way of saving money when running many Azure SQL Databases, assuming, that is, that you understand the resource utilization patterns of the databases involved.

I recently created a couple of elastic jobs for my pool and went through a successful setup (or so I thought), then proceeded to execute the query. The query, being elastic in nature, basically runs across all my databases from a single point. This is where it got puzzling: if you look at the below screenshot, you can clearly see that the status is RUNNING, but drilling into it, it says RETRYING. This obviously is not right.

Click through to see the root cause of this problem.


Moving Data Between Data Lakes

Jeffrey Verheul shows us how to use AdlCopy to migrate data from one Azure Data Lake to another:

Migrating data from one Data Lake to the other
We started out with a test version of a Data Lake, and this week I needed to migrate data to the production version of our Data Lake. After a lot of trial and error I couldn’t find a good way to migrate data. In the end I found a tool called AdlCopy. This is a command-line tool that copies files for you. Let me show you how easy it is.

Download & Install
AdlCopy needs to be installed on your machine. You can find the download here. By default the tool will install the files in "C:\Users\<username>\Documents\AdlCopy\", but this can be changed in the setup wizard.

Click through to see how to use this tool.
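For a rough idea of the shape of an account-to-account copy, here is a hedged sketch (not from Jeffrey's post) that invokes AdlCopy from Python; the executable path, store names, and folders are placeholders, and AdlCopy prompts for Azure AD sign-in when run this way.

```python
# Hedged sketch: shell out to AdlCopy.exe to copy a folder from a test
# Data Lake Store to a production one.  All paths and account names are placeholders.
import subprocess

ADLCOPY = r"C:\Users\<username>\Documents\AdlCopy\AdlCopy.exe"

subprocess.run(
    [
        ADLCOPY,
        "/Source", "adl://test-datalake.azuredatalakestore.net/data/",
        "/Dest", "adl://prod-datalake.azuredatalakestore.net/data/",
    ],
    check=True,
)
```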


Azure SQL Database SLAs

Arun Sirpal ponders the Azure SQL Database service level agreement:

Let's just get straight to the point: Azure SQL Database, across all service tiers, gives you (the customer) an SLA of 99.99% uptime. This means the potential unavailability periods shown below.

Good or bad, you decide. The point is that even in the cloud we could potentially encounter downtime. Can you improve on 99.99%? That was the question I asked Microsoft, and I was given a wishy-washy answer: yes, you can, by using failover groups (I'm guessing the read/write endpoint is key here) to improve the uptime. I then pressed on what sort of figure, in terms of nines, this provides, to no avail.

So what happens if uptime is less than 99.99%, or even worse, 99% (ouch)? Service credits are available, as shown below.
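For a sense of scale, here is a quick back-of-the-envelope calculation (mine, not Arun's) of how much downtime each SLA level actually permits, using a 30-day month and a 365-day year:

```python
# How much unavailability a given SLA percentage allows over a period.
def allowed_downtime_minutes(sla_percent: float, period_hours: float) -> float:
    return period_hours * 60 * (1 - sla_percent / 100)

for sla in (99.99, 99.9, 99.0):
    per_month = allowed_downtime_minutes(sla, 30 * 24)    # ~4.3 min at 99.99%
    per_year = allowed_downtime_minutes(sla, 365 * 24)    # ~52.6 min at 99.99%
    print(f"{sla}%: ~{per_month:.1f} min/month, ~{per_year / 60:.1f} h/year")
```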

Arun also includes some of the exceptions Microsoft has.  Most of these are “you messed up” types of exceptions, but not all of them.
