Press "Enter" to skip to content

Category: Cloud

Azure Data Factory V2 Or SSIS?

Merrill Aldrich explains the differences between Azure Data Factory V2 and SQL Server Integration Services:

The next consideration is a bit more involved if you are new to data integration. Both of these tools excel at transporting data from place to place, but they have important differences in terms of what you can do to modify the data in transit. As a matter of emphasis, ADF has more features geared toward moving the data than performing any complex transformation along the way. SSIS, on the other hand, was built with a large library of transformations that you can chain together to make elaborate data flows including lookups, matching, splitting data, and more.

The tools also overlap quite a lot. In projects this seems to lead to the question of whether you’ll transform the data “in flight” using Extract Transform Load (ETL), or instead move the data to a destination where it’ll be transformed using Extract Load Transform (ELT).

These are not “pretty much the same thing” and Merrill does a good job of explaining what those differences in design mean for the products.

Comments closed

Thoughts On The Evolution Of Big Data

Praveen Sripati shares an opinion on where the various Hadoop and Big Data platforms are headed:

The different Cloud Vendors had been offering Big Data as a service for quite some time. Athena, EMR, RedShift, Kinesis are a few of the services from AWS. There are similar offerings from Google CloudMicrosoft Azure and other Cloud vendors also. All these services are native to the Cloud (built for the Cloud) and provide tight integration with the other services from the Cloud vendor.

In the case of Cloudera, MapR and HortonWorks the Big Data platforms were not designed with the Cloud into considerations from the beginning and later the platforms were plugged or force fitted into the Cloud. The Open Hybrid Architecture Initiative is an initiative by HortonWorks to make their Big Data platform more and more Cloud native.

It’ll be interesting to see where this goes.

Comments closed

Databricks Cluster-Scoped Init Scripts

Aayush Bhasin shares some background on a Databricks intern project, adding cluster-scoped initialization scripts to Databricks clusters:

One of the biggest pain points for customers used to be that init scripts for a cluster were not part of the cluster configuration and did not show up in the User Interface. Because of this, applying init scripts to a cluster was unintuitive, and editing or cloning a cluster would not preserve the init script configuration. Cluster-scoped init scripts addressed this issue by including an ‘Init Scripts’ panel in the UI of the cluster configuration page, and adding an ‘init_scripts’ field to the public API. This also allows init scripts to take advantage of cluster access control.

Read on to see how Aayush & co. solved this issue.

Comments closed

Databricks UDF Performance Testing

Tristan Robinson shares some performance comps for different Azure Databricks scenarios:

I’ve recently been spending quite a bit of time on the Azure Databricks platform, and while learning decided it was worth using it to experiment with some common data warehousing tasks in the form of data cleansing. As Databricks provides us with a platform to run a Spark environment on, it offers options to use cross-platform APIs that allow us to write code in Scala, Python, R, and SQL within the same notebook. As with most things in life, not everything is equal and there are potential differences in performance between them. In this blog, I will explain the tests I produced with the aim of outlining best practice for Databricks implementations for UDFs of this nature.

Scala is the native language for Spark – and without going into too much detail here, it will compile down faster to the JVM for processing. Under the hood, Python on the other hand provides a wrapper around the code but in reality is a Scala program telling the cluster what to do, and being transformed by Scala code. Converting these objects into a form Python can read is called serialisation / deserialisation, and its expensive, especially over time and across a distributed dataset. This most expensive scenario occurs through UDFs (functions) – the runtime process for which can be seen below. The overhead here is in (4) and (5) to read the data and write into JVM memory.

Click through for the results.  Looks like Python barely beat out Scala for the #1 position, but Scala was a little faster than Python in-class (e.g., the Scala program with a Scala SQL UDF was a little bit faster than the Python equivalent).

Comments closed

Migrating Azure SQL Databases Between Resource Groups

Arun Sirpal shows us a method for moving Azure SQL Databases between resource groups:

In my mind there are a couple of ways to move a database across resource groups. They vary from scripting to just using the Azure portal. I am going to use the Azure portal and do the following.

  1. Export a database in resource group X to a storage account Z.
  2. Import the file from the storage account Z into a database that is in resource group Y.

It’s just like a “backup and restore” strategy, all with the assumption that you are working within the same subscription ID.

Read on for a step-by-step demonstration on how to do this.

Comments closed

Using JSON In Azure Data Lake Analytics

Jeffrey Verheul shows how to register .NET assemblies in Azure Data Lake Analytics:

The power of Azure Data Lake is that you can use a variety of different file types to process data (from Azure Data Lake Analytics). But in order to use JSON, you need to register some assemblies first.

Downloading assemblies
The assemblies are available on Github for download. Unfortunately you need to download the solution, and compile it on your machine. So I’ve also made the 2 DLL’s you need available via direct download:

Click through for links to the assemblies and instructions on how to register them.  And to continue my long-running joke that every .NET project has as a core requirement Newtonsoft.Json.

Comments closed

Databricks Runtime 4.3 Released

Todd Greenstein announces Databricks Runtime 4.3:

In addition to the performance improvements, we’ve also added new functionality to Databricks Delta:

  • Truncate Table: with Delta you can delete all rows in a table using truncate.  It’s important to note we do not support deleting specific partitions.  Refer to the documentation for more information: Truncate Table

  • Alter Table Replace columns: Replace columns in a Databricks Delta table, including changing the comment of a column, and we support reordering of multiple columns.   Refer to the documentation for more information: Alter Table

  • FSCK Repair Table: This command allows you to Remove the file entries from the transaction log of a Databricks Delta table that can no longer be found in the underlying file system. This can happen when these files have been manually deleted.  Refer to the documentation for more information: Repair Table

  • Scaling “Merge” Operations: This release comes with experimental support for larger source tables with “Merge” operations. Please contact support if you would like to try out this feature.

Looks like a nice set of reasons to upgrade.

Comments closed

Reserved Capacity With Azure SQL Database

Chris Seferlis explains the concept of Azure SQL Database Reserved Capacity:

Last week I posted about the Azure Reserve VM Instance where you could save some money in Azure. Another similar way to save is with Azure SQL Database Reserved Capacity. With this you can save 33% compared to license included pricing by pre-buying SQL Database pre-cores for a 1- or 3-year term.

This can be applied to a single subscription or shared across your enrollments, so you can control how many subscriptions can use the benefit, as well as how the reservation is applied to the specific subscriptions you choose.

The reservation scope to a single subscription allows you to apply it to that SQL Database resource(s) within the selected subscription. A reservation with a shared scope can be shared across subscriptions in the enrollment and there’s some flexibility involved like Managed Instances where you can scale up/down.

Read on for more.  AWS had been offering discounts for reserved capacity for a while, but now we’re seeing Microsoft play the game too.

Comments closed

Azure SQL Database Elastic Job False Successes

Arun Sirpal walks us through a case where an Azure SQL Database elastic job appears to be running correctly but isn’t:

I use elastic pools. They are a fabulous way of saving money when running many Azure SQL Databases, that is assuming you understand the resource utilization patterns of the databases involved.

I recently created couple of elastic jobs for my pool and went through a successful setup (so I thought) and proceeded to execute the query. The query being elastic in nature basically runs across all my databases from a single point. This is where it got puzzling, if you look at the below screen shot you can clearly see that the status is RUNNING, drilling into it, it says RETRYING. This obviously is not right.

Click through to see the root cause of this problem.

Comments closed

Moving Data Between Data Lakes

Jeffrey Verheul shows us how to use AdlCopy to migrate data from one Azure Data Lake to another:

Migrating data from one Data Lake to the other
We started out with a test version of a Data Lake, and this week I needed to migrate data to the production version of our Data Lake. After a lot of trial and error I couldn’t find a good way to migrate data. In the end I found a tool called AdlCopy. This is a command-line tool that copies files for you. Let me show you how easy it is.

Download & Install
AdlCopy needs to be installed on your machine. You can find the download here. By default the tool will install the files in “C:\Users\\Documents\AdlCopy\”, but this can be changed in the setup wizard.

Click through to see how to use this tool.

Comments closed