Category: Cloud

Regular Expressions In Lucene

Kendra Little looks at searching in Azure Search:

I wanted to be able to find all architect jobs using something like ‘%rchit%’ as well, because there’s not a lot of great ways to do this in SQL Server.

In SQL Server, you can use a traditional B-Tree index to seek, but only based on the letters at the beginning of a character column.  If I want to know every business title that contains ‘%rchit%’, I’m going to have to scan an entire index.

SQL Server fulltext indexes don’t solve the double-wildcard problem, either. Fulltext indexes support word prefix searches, so a fulltext index would be great at finding all job titles that contain a word that starts with ‘Arch%’.

Sometimes that’s enough. But a lot of times, you do need to find a substring anywhere in a word. And sometimes you do want to offload that from your database.

This is the kind of problem Lucene (and its follow-up implementations, like Elasticsearch) was designed to solve.  Read on for more details as Kendra solves the problem in Azure Search.
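
As a rough illustration of what Kendra is doing, here is a minimal Python sketch of a regular-expression query against the Azure Search REST API (regex terms wrapped in forward slashes require the full Lucene syntax, i.e. queryType=full). The service name, index name, API key, and api-version below are placeholders and assumptions, not values from her post:

```python
import requests

# Hypothetical names -- substitute your own service, index, and key.
SERVICE = "my-search-service"
INDEX = "jobs"
API_KEY = "<query-key>"

# With queryType=full, Azure Search accepts full Lucene syntax, including
# regular expressions wrapped in forward slashes. /.*rchit.*/ matches any
# term containing "rchit" (Architect, architecture, ...), which is exactly
# the double-wildcard case a B-tree seek can't satisfy.
url = f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs"
params = {
    "api-version": "2016-09-01",
    "queryType": "full",
    "search": "/.*rchit.*/",
}
resp = requests.get(url, params=params, headers={"api-key": API_KEY})
resp.raise_for_status()

for doc in resp.json()["value"]:
    print(doc)
```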

Analysis Services In Azure

Chris Webb looks at SSAS in Azure:

Support for multidimensional models will be considered for a future release, based on customer demand.

I’m pretty sure there will be plenty of demand for Multidimensional support, given the installed base that’s out there.

I hope so.  Lack of Multidimensional support isn’t a deal-killer, but it’s a deal-harmer.

Stretch Database Authentication Failures

Jack Li walks through a bug in Stretch database:

The message provided enough direction.  It says either you have a bad login, or the firewall setting on the Azure DB server side is not configured correctly.  The very first thing was to ensure the firewall was configured correctly.  We even tried 0.0.0.0 to 255.255.255.255, but it didn’t resolve the issue.

Next, we created a brand new database on the same server and tried that one.  It worked.  But the customer just couldn’t get the old database to work, even though she made sure she could use the login/password to log in to the Azure DB server using SSMS on the same server.

On the same server, the brand new database worked but the old database didn’t.  So that made me wonder what happens if I manually cause a failure and later retry.

Read on for the repro and solution.

Azure Data Lake Updates

Michael Rys has the October updates for Azure Data Lake:

We seem to be just cranking out new stuff :). Here are the October 2016 Updates for Azure Data Lake U-SQL!

The main takeaway is that the October refresh has now removed the old, deprecated syntax for the items we have announced over the last couple of release notes!

Thanks to those who volunteered to test the new version of the more scalable file set. Please contact us if you want to try it and help us validate it.

Click through for the release notes.

Azure Data Lake Analytics Units

Yan Li explains the Azure Data Lake Analytics Unit:

An Azure Data Lake Analytics Unit, or AU, is a unit of computation resources made available to your U-SQL job. Each AU gives your job access to a set of underlying resources like CPU and memory. Currently, an AU is the equivalent of 2 CPU cores and 6 GB of RAM. As we see how people want to use the service, we may change the definition of an AU or add more options for controlling CPU and memory usage.

How AUs are used during U-SQL Query Execution

When you submit a U-SQL script for execution, the U-SQL compiler parallelizes the U-SQL script into hundreds or even thousands of tasks called vertices. Each vertex is allocated to one AU. The AU is dynamically allocated to the task and released once that particular task is completed.

I appreciate the ADL team’s transparency in how they define a unit.  It’s much nicer to be able to tell someone that an AU is 2 CPU cores + 6 GB of RAM, rather than saying it’s some fuzzy measure of CPU + memory + I/O which has no direct bearing on your operations.
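
Because the definition is that concrete, the arithmetic is easy to sanity-check yourself. A quick sketch using the figures from the quote (2 cores and 6 GB per AU, one AU per running vertex); as Yan Li notes, the numbers are subject to change:

```python
CORES_PER_AU = 2      # per the quoted definition; subject to change
RAM_GB_PER_AU = 6

def job_resources(allocated_aus, max_parallel_vertices):
    """Rough view of what an AU allocation buys for a U-SQL job."""
    cores = allocated_aus * CORES_PER_AU
    ram_gb = allocated_aus * RAM_GB_PER_AU
    # Each vertex runs on exactly one AU, so AUs beyond the number of
    # vertices that can run in parallel just sit idle (but still bill).
    usable = min(allocated_aus, max_parallel_vertices)
    return cores, ram_gb, usable

cores, ram_gb, usable = job_resources(allocated_aus=10, max_parallel_vertices=8)
print(f"10 AUs = {cores} cores, {ram_gb} GB RAM; at most {usable} AUs busy at once")
# 10 AUs = 20 cores, 60 GB RAM; at most 8 AUs busy at once
```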

Database Throughput Units

Randolph West looks at the Azure Database Throughput Unit Calculator:

The DTU Calculator, a third-party service created by Justin Henriksen (a Microsoft employee), will calculate the DTU requirements for our on-premises database that we want to migrate to Azure, by first capturing a few performance monitor counters and then performing a calculation on those results to provide the recommended service tier for our database.

Justin provides a command-line application or PowerShell script to capture these performance counters:

  • Processor – % Processor Time

  • Logical Disk – Disk Reads/sec

  • Logical Disk – Disk Writes/sec

  • Database – Log Bytes Flushed/sec

For more details on DTUs, John Sterrett looks at the math.
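
If you want to see what the collection step amounts to, here is a hedged sketch that gathers the same four counters with the built-in typeperf utility from Python. The SQL Server counter path assumes a default instance, and the sample interval, duration, and output file name are arbitrary choices rather than anything the calculator mandates:

```python
import subprocess

# Perfmon counters corresponding to the list above. The SQLServer:Databases
# path assumes a default instance; a named instance would use
# \MSSQL$InstanceName:Databases(_Total)\Log Bytes Flushed/sec instead.
counters = [
    r"\Processor(_Total)\% Processor Time",
    r"\LogicalDisk(_Total)\Disk Reads/sec",
    r"\LogicalDisk(_Total)\Disk Writes/sec",
    r"\SQLServer:Databases(_Total)\Log Bytes Flushed/sec",
]

# Sample once a second for an hour and write the results to a CSV.
subprocess.run(
    ["typeperf", "-si", "1", "-sc", "3600", "-f", "CSV", "-o", "dtu_counters.csv"]
    + counters,
    check=True,
)
```

Before uploading, check the resulting CSV against the format Justin’s own PowerShell script produces, since that is what the calculator expects.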

Identity As A Service

Cristian Satnic argues that we should look at Identity as a Service solutions for our applications:

What exactly is Azure Active Directory B2C?

  • Cloud identity service with support for social accounts and app-specific (local) accounts

  • For enterprises and ISVs building consumer-facing web, mobile & native apps

  • Builds on Azure Active Directory – a global identity service serving hundreds of millions of users and billions of sign-ins per day (same directory system used by Microsoft online properties – Office 365, XBox Live and so on)

  • Worldwide, highly-available, geo-redundant service – globally distributed directory across all of Microsoft Azure’s datacenters

I am a big fan of OAuth and making it easy for line-of-business developers to deal with authentication (lest they get harebrained ideas like rolling their own encryption algorithms).

Azure Data Lake Updates

Saveen Reddy points out a few updates to Azure Data Lake Store & the Azure Data Lake Analytics portal:

Use Custom Delimiters when Previewing Files

Previously, we had supported comma, colon, space, tab, ampersand, and bar delimiters. With the many different kinds of files used in Azure Data Lake Store and Azure Storage, we’ve added a “Custom” delimiter option for you to define your own delimiter.

To change the delimiter on the Azure Portal:

  1. Open the file you want to preview using Data Explorer.

  2. Click on Format

  3. Under Delimiter, click the dropdown and change it to Custom

  4. A new Custom Delimiter field will appear; type your delimiter there

  5. Click OK

Read on for more updates.
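
For what it’s worth, the portal preview is doing roughly what any delimited-file reader does locally. A trivial Python analogue with a hypothetical pipe-delimited file, just to make “custom delimiter” concrete:

```python
import csv

# Hypothetical file name; the point is only that the delimiter can be
# whatever single character your data actually uses, not one from a
# fixed list.
with open("clickstream.dat", newline="") as f:
    reader = csv.reader(f, delimiter="|")
    for i, row in enumerate(reader):
        print(row)
        if i == 4:          # preview just the first five rows
            break
```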

Starting Azure Stream Analytics Jobs From Code

Hylke Peek wants to kick off an Azure Stream Analytics job from a Universal Windows Platform application:

I had one of those feelings while working with Azure Stream Analytics (ASA). My solution worked, but there was one ‘elementary and simple’ thing I wanted: start the ASA jobs within my C# code. That shouldn’t be hard, and there’s some documentation. But no, I needed to combine several separate solutions into a new one to make it possible.

In this post I briefly explain how you can start ASA jobs within your C# UWP application:

  • I explain which components are involved in the authentication process and which parameters you need.

  • Example code is provided. You only need to enter your parameter values.

Click through for the code.
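
Hylke’s code is C# in a UWP app; purely as a sketch of the same two-step flow (acquire an Azure AD token via client credentials, then call the Stream Analytics start operation on the Azure Resource Manager API), here is a Python version. Every identifier below is a placeholder, and the request body options are worth verifying against the REST API reference:

```python
import requests

# All of these are placeholders for your own environment.
TENANT = "<tenant-id>"
CLIENT_ID = "<app-registration-client-id>"
CLIENT_SECRET = "<client-secret>"
SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
JOB_NAME = "<asa-job-name>"

# Step 1: get an Azure AD token for the Azure Resource Manager endpoint
# using the client-credentials grant.
token_resp = requests.post(
    f"https://login.microsoftonline.com/{TENANT}/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "resource": "https://management.azure.com/",
    },
)
token_resp.raise_for_status()
token = token_resp.json()["access_token"]

# Step 2: call the Stream Analytics "start" operation on the job.
start_url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.StreamAnalytics/streamingjobs/{JOB_NAME}/start"
)
resp = requests.post(
    start_url,
    params={"api-version": "2015-10-01"},
    headers={"Authorization": f"Bearer {token}"},
    json={"outputStartMode": "JobStartTime"},  # begin output from job start time
)
resp.raise_for_status()
print("Start request accepted:", resp.status_code)  # typically 202
```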
