Press "Enter" to skip to content

Category: Cloud

Polybase: Inserting Into Azure Blob Storage

I have a post which uses Polybase to insert into Azure Blob Storage:

One additional question I have involves whether the process for loading data is round-robin on a row-by-row basis.  My conjecture is that it is not (particularly given that our first example had 4 files with zero records in them!), but I figured I’d create a new table and test.  In this case, I’m using three fixed-width data types and loading 10 million identical records.  I chose to use identical record values to make sure that the text length of the columns in this line were exactly the same; the reason is that we’re taking data out of SQL Server (where an int is stored in a 4-byte block) and converting that int to a string (where each numeric value in the int is stored as a one-byte character).  I chose 10 million because I now that’s well above the cutoff point for data to go into each of the eight files, so if there’s special logic to handle tiny row counts, I’d get past it.

Read on for the exciting(?) conclusion.

Comments closed

Replication Subscriber To Azure SQL DB

Jes Borland continues her series on setting up transactional replication between an on-prem SQL Server Availability Group and Azure SQL Database:

This subscription is going to use an Azure SQL Database.

Go to the AG primary replica. (In this demo, this is SQL2014AG2.)

Expand Replication. Expand Local Publications. Right-click the publication and select New Subscription.

It turns out that this is a basic push subscription.  Jes’s post is full of screenshots, making it even easier to follow.

Comments closed

HDInsight With Hive LLAP

Rashin Gupta explains some performance benefits of using Hive 2.0 (LLAP) on HDInsight:

With LLAP, we allow data scientists to query data interactively in the same storage location where data is prepared. This means that customers do not have to move their data from a Hadoop cluster to another analytic engine for data warehousing scenarios. Using ORC file format, queries can use advanced joins, aggregations and other advanced Hive optimizations against the same data that was created in the data preparation phase.

In addition, LLAP can also cache this data in its containers so that future queries can be queried from in-memory rather than from on-disk. Using caching brings Hadoop closer to other in-memory analytic engines and opens Hadoop up to many new scenarios where interactive is a must like BI reporting and data analysis.

Even with this, Hive is still more of a “warehousing” technology, but this moves it closer to real-time (or at least “not slow”) warehousing.

Comments closed

Polybase Execution Plan With Blob Storage

I look at an execution plan and packet capture of a Polybase query which reads from Azure Blob Storage:

In this case, all of those packets were 1514 bytes, so it’s an easy multiplication problem to see that we downloaded approximately 113 MB.  The 2008.csv.bz2 file itself is 108 MB, so factoring in TCP packet overhead and that there were additional, smaller packets in the stream, I think that’s enough to show that we did in fact download the entire file.  Just like in the Hadoop scenario without MapReduce, the Polybase engine needs to take all of the data and load it into a temp table (or set of temp tables if you’re using a Polybase scale-out cluster) before it can pull out the relevant rows based on our query.

The upshot is that Polybase behaves very similarly on Azure Blob Storage as it does with on-prem Hadoop for non-MapReduce queries.

Comments closed

Replication Publisher To Azure SQL DB

Jes Borland continues her series on transactional replication from on-prem SQL Server + Availability Groups into Azure SQL Database:

After initializing, check the Snapshot Agent and Log Reader Agent for success. (To do so, go to Replication, right-click the publication name, and select Snapshot Agent Status and Log Reader Agent Status.) I ran into problems with the Snapshot account not having high enough permissions in the databases (it needs db_owner), and then not having enough permissions on the snapshot folder (it needs Full). (This forum post, answered by Hilary Cotter, helped: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/899857db-e38e-4026-a34c-2a8c2628c6fc/access-denied-to-sql-replication-snapshot-folder?forum=sqlreplication.)

Except for the final section, it’s pretty much the same as dealing with on-prem SQL Server sans Availability Groups.

Comments closed

Polybase With Azure Blob Storage

I look at using Polybase to read data from Azure Blob Storage:

To this point, I have focused my Polybase series on interactions with on-premises Hadoop, as it’s the use case most apropos to me.  I want to start expanding that out to include other interaction mechanisms, and I’m going to start with one of the easiest:  Azure Blob Storage.

Ayman El-Ghazali has a great blog post the topic, which he turned into a full-length talk.  As such, this post will fill in the gaps rather than start from scratch.  In today’s post, my intention is to retrieve data from Azure Blob Storage and get an idea of what’s happening.  From there, we’ll spend a couple more posts on Azure Blob Storage, looking a bit deeper into the process.  That said, my expectation going into this series is that much of what we do with Azure Blob Storage will mimic what we did with Hadoop, as there are no Polybase core concepts unique to Azure Blob Storage, at least any of which I am aware.

Spoilers:  I’m still not aware of any core concepts unique to Azure Blob Storage.

Comments closed

Transactional Replication To Azure SQL Database

Jes Borland has a five-part series on replicating a series of databases in an Availability Group to Azure SQL Database.  Part 1 involves planning:

There are tasks you’ll need to take care of in SQL Server, the AG, and the SQL DB before you can begin.

This blog series assumes you already have an AG set up – it won’t go through the setup of that. It also assumes you have an Azure SQL server and a SQL Database created – it won’t go through that setup either.

Ideally, the publishers, distributor, and subscribers will all be the same version and edition of SQL Server. If not, you have to configure from the highest-version server, or you will get errors.

Part 2 prepares the replication distributor:

The first step in this process is to set up the remote distributor. As I mentioned in the first blog, you do not want your distribution database on one of the AG replicas. You need to set this up on a server that is not part of the AG.

Start by logging on to the distributor server – in this demo, SQL2014demo.

Stay tuned for the remainder of the series.

Comments closed

Jupyter On ElasticMapReduce

Tom Zeng shows howt o install Jupyter Notebooks on Amazon’s ElasticMapReduce:

By default (with no --password and --port arguments), Jupyter will run on port 8888 with no password protection; JupyterHub will run on port 8000.  The --port and --jupyterhub-port arguments can be used to override the default ports to avoid conflicts with other applications.

The --r option installs the IRKernel for R. It also installs SparkR and sparklyr for R, so make sure Spark is one of the selected EMR applications to be installed. You’ll need the Spark application if you use the --toree argument.

If you used --jupyterhub, use Linux users to sign in to JupyterHub. (Be sure to create passwords for the Linux users first.)  hadoop, the default admin user for JupyterHub, can be used to set up other users. The –password option sets the password for Jupyter and for the hadoop user for JupyterHub.

Installation is fairly straightforward, and they include a series of samples you can get to try out Jupyter.

Comments closed

Linux Data Science Virtual Machine

David Smith mentions the Linux data science virtual machine on Azure:

The Linux Data Science Virtual Machine includes all of the tools a modern data scientist needs, in one easy-to-launch package. With it, you can try exploring data with Apache Drill, train deep neural networks for computer vision with MXNet, develop AI applications with the Cognitive Toolkit, or create statistical models with big data in R with Microsoft R Server 9.0.

They also offer a free trial, so check it out.

Comments closed

Azure Management Using R

Alan Weaver introduces AzureSMR:

The AzureSMR functions currently addresses the following Azure Services:

  • Azure Blob: List, Read and Write to Blob Services

  • Azure Resources: List, Create and Delete Azure Resource. Deploy ARM templates.

  • Azure VM: List, Start and Stop Azure VMs

  • Azure HDI: List and Scale Azure HDInsight Clusters

  • Azure Hive: Run Hive queries against a HDInsight Cluster

  • Azure Spark: List and create Spark jobs/Sessions against a HDInsight Cluster(Livy)

This can be useful for cases like when you need to ramp up the Spark cluster before running a particularly compute-intensive process.

Comments closed