Press "Enter" to skip to content

Category: Cloud

Loading Azure SQL Data Warehouse

Grant Fritchey enumerates various pains he went through to load data into Azure SQL Data Warehouse:

Let’s bottom line this. Data is dirty. Any ETL process is likely to expose the trashy data, so what are the real issues here?

First up, Powershell is my bestest buddy. Scripting everything through Posh right out of the gate made a huge difference in my ability to constantly reset.

The fact that our most common processes spit out UTF16, while not a problem, is something you need to get on top of (see Posh above).

AZCopy works really well and the command line is easy to implement, but you’ll need to go through the process of setting up all the appropriate connections.

Polybase is wicked cool. Yes, I had to hop through hoops to get the data clean, but, what you should note in the above descriptions, at no point did I need to go and recreate my EXTERNAL TABLE from Azure SQL Data Warehouse. I just kept updating the files through my scripts and as soon as they were loaded up to blob storage, I ran a query against them. No additional work required. That’s nice.

You’re going to want to go through your data types on your data before you start the migration.

You’re going to have to plan for a lot of time cleaning your data if you haven’t already.

Grant’s method is the right way, especially for early tests.  In practice, for the type of data you’d put into Polybase, you might want to create the external table so that it allows rejecting a certain number of rows.  Grant didn’t specify the REJECT_TYPE and REJECT_VALUE attributes when creating his external table, so the default of zero rejected rows applied.  In a warehouse with billions of rows, hand-fixing all of that data is a nasty proposition, and if you’re writing queries whose results likely won’t change if a few (dozen?) records get dropped, rejecting bad data might be a good way to keep some of your sanity.
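If you’re curious what that looks like, here’s a minimal T-SQL sketch of a PolyBase external table with the reject options spelled out.  The file format, data source, credential, and table names are placeholders I made up, not anything from Grant’s setup:

-- External file format for comma-delimited text files.
CREATE EXTERNAL FILE FORMAT csv_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);

-- External data source pointing at a blob storage container.
-- Assumes a database-scoped credential named blob_credential already exists.
CREATE EXTERNAL DATA SOURCE blob_staging
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://staging@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = blob_credential
);

-- The external table itself.  REJECT_VALUE defaults to 0, so without these two
-- options a single bad row fails the whole query.
CREATE EXTERNAL TABLE dbo.SalesStaging
(
    SaleID   BIGINT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = blob_staging,
    FILE_FORMAT = csv_format,
    REJECT_TYPE = VALUE,   -- count bad rows (PERCENTAGE is the other option)
    REJECT_VALUE = 100     -- tolerate up to 100 bad rows before failing
);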


SQL Server Express Images In Azure

Microsoft now offers an Azure image for SQL Server Express:

We just announced that we added images for SQL Server Express with Tools 2014, 2012, and 2008R2 in the Azure Gallery. SQL Server Express is a free version of SQL Server that you can use for dev/test and for web and mobile apps with lightweight relational database needs.

I’m not the world’s biggest fan of Express edition, but if you’re cost-conscious enough, this might be a nice move for you.


Scaling Azure SQL Database

James Serra has a post on scaling Azure SQL Database:

Horizontal scaling refers to adding or removing databases in order to adjust capacity or overall performance. This is also called “scaling out”.  Sharding, in which data is partitioned across a collection of identically structured databases, is a common way to implement horizontal scaling.

Vertical scaling refers to increasing or decreasing the performance level of an individual database—this is also known as “scaling up.”

It’s not free and application changes might be required (especially for horizontal scaling), but scaling with Azure SQL Database is pretty straightforward.
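For the vertical case, the change is a single statement.  Here’s a minimal sketch, run against the logical server’s master database, with a placeholder database name and target tier:

-- Scale the database up to Standard S3; the operation runs asynchronously.
ALTER DATABASE MyDatabase
MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S3');

-- Check on progress from master while the scale operation runs.
SELECT operation, state_desc, percent_complete, start_time
FROM sys.dm_operation_status
ORDER BY start_time DESC;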


HDInsight + Power BI + Spark

Reza Rad has a nice walkthrough on integrating several powerful technologies:

Power BI can connect to many data sources, as you know, and Spark on Azure HDInsight is one of them. In the area of working with Big Data applications you will probably hear names such as Hadoop, HDInsight, Spark, Storm, Data Lake, and many others. Spark and Hadoop are both frameworks for working with big data, though they have some differences. In this post I’ll show you how you can use Power BI (either Power BI Desktop or the Power BI website) to connect to a sample Spark cluster that we built on an Azure HDInsight service. By completing this section you will be able to create a simple Spark cluster on Azure HDInsight, run a few Python scripts from Jupyter to load a sample table into Spark, and finally use Power BI to connect to the Spark server and load and visualize the data.

If you’re totally unfamiliar with Spark but interested in data processing, now’s a good time to start digging into the topic.


Migrating To Azure SQL Database

James Serra has a good post on moving your on-premises SQL Server instance up to Azure SQL Database:

In this migration process you migrate both your schema and your data from the SQL Server database in your current environment into SQL Database, provided the existing database passes compatibility tests.  Fortunately, with SQL Database Version 12 (V12), there are very few remaining compatibility issues other than server-level and cross-database operations (see Azure SQL Database Transact-SQL differences).  Databases and applications that rely on partially supported or unsupported functions will need some re-engineering to fix these incompatibilities before the SQL Server database can be migrated.

Two years ago, I would have laughed at the idea.  Right now, I’m skeptical.  My expectation is that, two years from now, this will be my default answer for non-sensitive data.


Azure Blob Storage Sync Updated

Randolph West reports on some bug fixes to Azure Blob Storage Sync:

During a SQL Server migration this month, I found some inconsistencies in my Azure Blob Storage Sync tool, so I made several improvements, and fixed an outstanding bug.

As you know, it relies on the naming convention provided in Ola Hallengren’s Maintenance Solution and comes in two parts: the AzureBlobStorageSync command-line application, and the AzureBlobStorageRestore command-line application.

Using Azure blob storage (or S3 if you go the Amazon way) as a long-term storage mechanism for database backups is a pretty smart idea.
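If you’d rather skip the local file step entirely, SQL Server 2014 can also write backups straight to blob storage with BACKUP TO URL.  This is separate from Randolph’s tool; it’s just a minimal sketch with placeholder storage account, container, and database names:

-- Credential holding the storage account name and access key.
CREATE CREDENTIAL AzureBackupCredential
WITH IDENTITY = 'mystorageaccount',
     SECRET = '<storage account access key>';

-- Back up directly to a blob in the backups container.
BACKUP DATABASE MyDatabase
TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/MyDatabase.bak'
WITH CREDENTIAL = 'AzureBackupCredential',
     COMPRESSION, CHECKSUM;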


HA/DR With Azure SQL Database

James Serra looks into high availability and disaster recovery options for Azure SQL Database:

When you use the Azure portal to create a SQL Database, the pricing tier options fall into three service tiers: Basic, Standard, and Premium.  Here are those three plans with their high-availability (HA) and disaster recovery (DR) options:

Considering the price point, Microsoft offers some pretty good HA & DR capabilities for Azure SQL Databases.
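As one example on the DR side, active geo-replication is set up with plain T-SQL.  A minimal sketch, run in master on the server hosting the primary, with placeholder database and partner server names:

-- Create a readable secondary on another logical server.
ALTER DATABASE MyDatabase
ADD SECONDARY ON SERVER [partner-server]
WITH (ALLOW_CONNECTIONS = ALL);

-- To fail over later, run this in master on the secondary's server:
-- ALTER DATABASE MyDatabase FAILOVER;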


Restoring An Azure SQL Database

Grant Fritchey shows us how to restore a database hosted in Azure SQL Database:

The first, and most important thing to notice here is that it’s supplying me with a new name. I can change that to anything I want as long as it’s not the name of a database already in existence on my Azure SQL Database Server. You read that correctly, you can restore a database to Azure SQL Database, but there is no WITH REPLACE option. The restore creates a new database. This is important. In a recovery scenario, you need to be sure that you’re prepared to deal with this fact. How could you replace the existing database? Immediately run a pair of ALTER DATABASE commands to change the name of the existing database to something else and then change the name of your newly created database to the old name. That’s your choice.

There are a couple of gotchas, so if you are administering Azure SQL Database instances, be aware of these.
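The rename shuffle Grant mentions comes down to two statements run against the master database.  A quick sketch with placeholder database names:

-- Move the existing database out of the way, then promote the restored copy.
ALTER DATABASE MyDatabase MODIFY NAME = MyDatabase_Old;
ALTER DATABASE MyDatabase_Restored MODIFY NAME = MyDatabase;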


Change Azure SQL Database Compatibility Level

Tom LaRock shows us how to change the compatibility level of an Azure SQL Database:

You can change the compatibility level of an Azure SQL Database.

It’s true! I know!

OK, so I’m a little excited about this one. See, I’ve been giving this talk on cardinality for the past couple of years now, so this is a hidden gem to me. When I found out this was possible I took out my demo scripts to see if changing the compatibility level would have any effect.

This is interesting, especially given that Management Studio doesn’t give you that option.  Know your T-SQL, folks.
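If you want to try it yourself, here’s a minimal sketch with a placeholder database name:

-- See where the database sits today.
SELECT name, compatibility_level
FROM sys.databases
WHERE name = 'MyDatabase';

-- 120 turns on the new cardinality estimator; 110 keeps the old behavior.
ALTER DATABASE MyDatabase
SET COMPATIBILITY_LEVEL = 120;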


Why Data Lakes?

James Serra explains why you might want to use a data lake:

To refresh, a data lake is a landing zone, usually in Hadoop, for disparate sources of data in their native format.  Data is not structured or governed on its way into the data lake.  This eliminates the upfront costs of data ingestion, especially transformation.  Once data is in the lake, the data is available to everyone.  You don’t need an a priori understanding of how data is related when it is ingested; rather, it relies on the end-user to define those relationships as they consume it.  Data governance happens on the way out instead of on the way in.  This makes a data lake very efficient in processing huge volumes of data.  Another benefit is the data lake allows for data exploration and discovery, to find out if data is useful or to create a one-time report.

I’m still working on a “data swamp” metaphor, in which people toss their used mattresses and we expect to get something valuable if only we dredge a little more.  Nevertheless, read James’s article; data lakes are going to move from novel to normal over the next few years.
