Press "Enter" to skip to content

Category: Cloud

Using Azure Data Lake Store With Hadoop

Amit Kulkarni shows how to make Azure Data Lake Store the default file system for a Hadoop cluster:

So to give a concrete example, if the default file system was hdfs://123.23.12.4344:9000 then the /user/filename.txt would resolve to hdfs://123.23.12.4344:9000/user/filename.txt.

Why does the default file system matter? The first answer to this is purely convenience. It is a heck lot easier to simply say /events/sensor1/ than adl://amitadls.azuredatalakestore.net/ in code and configurations. Secondly, many components in Hadoop use relative paths by default. For instance there are a fixed set of places, specified by relative paths, where various applications generate their log files. Finally, many ISV applications running on Hadoop specify important locations by relative paths.

Read on to see how.

Comments closed

Hadoop In The Cloud

Peter Coates talks about pros and cons to Hadoop in the cloud:

Hadoop was developed for deployment over Linux running on bare metal. Cloud deployment implies virtual machines, and for Hadoop it’s a huge difference.

As detailed in other articles (for instance, Your Cluster Is an Appliance or Understanding Hadoop Hardware Requirements), bare-metal deployments have an inherent advantage over virtual machine deployments. The biggest of these is that they can use direct attached storage, i.e., local disks.

Not every Hadoop workload is storage I/O bound, but most are, and even when Hadoop seems to be CPU bound, much of the CPU activity is often either directly in service of I/O, i.e., marshaling, unmarshaling, compression, etc., or in service of avoiding I/O, i.e., building in-memory tables for map-side joins.

Read the whole thing.

Comments closed

Managing Azure SQL Database Firewall Rules

Cedric Charlier shows how to manage Azure SQL Database firewall rules from within Management Studio:

When you create a new Azure database, you usually need to open the firewall to remotely administrate or query this database with SSMS. An option is to create rules from the Azure Portal. It’s surely a convenient way to do it when you create a database but I prefer to keep a minimum of tools and when the Azure portal is not open, I prefer to not have to open it just to define a few firewall rules.

Opening the firewall with SSMS is a kind of chicken and eggs problem: to connect to your database/server, you need to open the firewall. Hopefully, SSMS has a great suite of screens to call the underlying API of Azure Portal and open the firewall for the computer running SSMS.

Cedric shows off sp_delete_firewall_rule but there’s also a corresponding sp_set_firewall_rule.

Comments closed

Bulk Loading HDInsight Using Phoenix

Anunay Tiwari uses Phoenix to bulk load data into HBase on HDInsight:

Apache HBase is an open Source No SQL Hadoop database, a distributed, scalable, big data store. It provides real-time read/write access to large datasets. HDInsight HBase is offered as a managed cluster that is integrated into the Azure environment. HBase provides many features as a big data store. But in order to use HBase, the customers have to first load their data into HBase.

There are multiple ways to get data into HBase such as – using client API’s, Map Reduce job with TableOutputFormat or inputting the data manually on HBase shell. Many customers are interested in using Apache Phoenix – a SQL layer over HBase for its ease of use. The current post describes about how to use phoenix bulk load with HDinsight clusters.

Phoenix provides two methods for loading CSV data into Phoenix tables – a single-threaded client loading tool via the psql command, and a MapReduce-based bulk load tool.

Anunay explains both methods, allowing you to choose based on your data needs.

Comments closed

Azure Disk Encryption

Melissa Coates configures Azure Disk Encryption for an already-existing Azure VM:

As I discussed in my previous blog post, I opted to use Azure Disk Encryption for my virtual machines in Azure, rather than Storage Service Encryption. Azure Disk Encryption utilizes Bitlocker inside of the VM. Enabling Azure Disk Encryption involves these Azure services:

  • Azure Active Directory for a service principal
  • Azure Key Vault for a KEK (key encryption key) which wraps around the BEK (bitlocker encryption key)
  • Azure Virtual Machine (IaaS)

Following are 4 scripts which configures encryption for an existing VM. I initially had it all as one single script, but I purposely separated them. Now that they are modular, if you already have a Service Principal and/or a Key Vault, you can skip those steps. I have my ‘real’ version of these scripts stored in an ARM Visual Studio project (same logic, just with actual names for the Azure services). These PowerShell templates go along with other ARM templates to serve as source control for our Azure infrastructure.

The Powershell scripts are straightforward and clear, so check them out.

Comments closed

IoT Versus Event Hub

James Serra clarifies the differences between Azure’s IoT Hub and its Event Hub:

The majority of the time, if the data is coming directly from the devices, either directly or via a field-based gateway, IoT Hub will be the more appropriate choice.  Event Hub will generally be the more appropriate choice if either the data will not be coming to Azure directly from the devices, but rather either cloud-to-cloud through another provider, intra-cloud, or if the data is already landing on-premise and needs to be streamed to the cloud from a small number of endpoints internally.  There are exceptions to both conditions, of course.

Both solutions offer very high throughput data ingestion and can handle tremendous streaming data volumes.  In fact, today, IoT Hub is primarily a set of additional services that wrap an underlying Event Hub.

Read on for more scenarios and limitations in each.  They definitely serve different use cases.

Comments closed

Using Sparklyr To Analyze Flight Data

Aki Ariga uses sparklyr on Apache Spark 2.0 to analyze flight data living in S3:

Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark as a backend database of dplyr. You can create tidy data from huge messy data, plot complex maps from this big data the same way as small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and easier on large-scale data. Let’s try!

You can see the Rmarkdown of this analysis on RPubs. With RStudio, you can share Rmarkdown easily on RPubs.

Sparklyr is an exciting technology for distributed data analysis.

Comments closed

Azure Price Cuts

Brad Sams reports that Azure VM and Azure Blob Storage prices are going down:

Microsoft, Amazon and now Google are in a heated cloud race to grab as much market share as they can as they know that once a company starts using their service, the likelihood of switching platforms is low. With more services being offered via cloud vendors and more companies diving into these platforms, Microsoft and Amazon are frequently cutting prices to create a competitive advantage.

On this edition of ‘cloud cuts’, Microsoft is slashing prices on some of its Azure Virtual Machines and its Blob storage. The company is dropping the prices on compute-optimized instances – F Series and general purpose instances – A1; the company says pricing cuts on its D-series general purpose instances will happen in the near future.

Blob storage is down to 2 cents per GB per month for hot storage.  That’s slightly below S3’s 2.3 cents per GB per month.

Comments closed

Integrating Data Lake Storage With SQL Data Warehouse

Sachin Sheth alerts us to a new integration point between Azure Data Lake Storage and Azure SQL Data Warehouse via Polybase:

Most common patterns using Azure Data Lake Store (ADLS) involve customers ingesting and storing raw data into ADLS. This data is then cooked and prepared by analytic workloads like Azure Data Lake Analytics and HDInsight. Once cooked this data is then explored using engines like Azure SQL Data Warehouse. One key pain point for customers is having to wait for a substantial time after the data was cooked to be able to explore it and gather insights. This was because the data stored in ADLS would have to be loaded into SQL Data Warehouse using tools row-by-row insertion. But now, you don’t have to wait that long anymore. With the new SQL Data Warehouse PolyBase support for ADLS, you will now be able to load and access the cooked data rapidly and lessen your time to start performing interactive analytics. PolyBase support will allow to you access unstructured/semi-structured files in ADLS faster because of a highly scalable loading design. You can load the files stored in ADLS into SQL Data Warehouse to perform analytics with fast response times or you use can the files in ADLS as external tables. So get ready to unlock the value stored in your petabytes of data stored in ADLS.

I’ve been waiting for this support, and I’m happy that they were able to integrate the two products.

Comments closed

Azure SQL Database Extended Events

Arun Sirpal compares on-prem extended events to what’s available in Azure SQL Database:

There are 22 actions and 261 events. Naturally less than your local based SQL Servers, for example on my local 2014 machine running the above query returned 50 actions and 284 events.

There are a few subtle differences and a couple not-so-subtle differences, so it’s worth digging into if you plan to spin up an Azure SQL Database database.

Comments closed