
Category: Cloud

Sparklyr On HDInsight

Ali Zaidi has a walkthrough on using sparklyr on HDInsight:

The majority of Spark is written in Scala (~80% of Spark core), which is a functional programming language. Functional programming languages emphasize functional purity (the output only depends on the inputs) and strive to avoid side-effects. One important component of most functional programming languages is their lazy evaluation. While it might seem odd that we would appreciate laziness from our computing tools, lazy evaluation is an effective way of ensuring computations are evaluated in the most efficient manner possible.

Lazy evaluation allows Spark SQL to highly optimize queries. When a user submits a query to Spark SQL, Spark composes the components of the SQL query into a logical plan. The logical plan is basically a recipe Spark SQL creates in order to evaluate the desired query. Spark SQL then submits the logical plan to its highly optimized engine called Catalyst, which optimizes this plan into a physical plan of action that is executed inside the Spark computation engine (a series of coordinating JVMs).

Read on for more description and code.
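
As a rough illustration of the plan-then-execute idea in the excerpt, here is a toy sketch in plain Python. It is not Spark or Catalyst: the LazyQuery class, its methods, and the single "filters first" rewrite rule are all invented for this example. The point is only that calls record steps (a logical plan) and nothing touches the data until collect() runs an optimized version of that plan.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, object]  # ("filter", (column, predicate)) or ("project", columns)


class LazyQuery:
    """Hypothetical toy: records steps instead of running them."""

    def __init__(self, rows: Sequence[dict], steps: Tuple[Step, ...] = ()):
        self._rows = rows
        self._steps = steps  # the "logical plan": an ordered recipe

    def select(self, *columns: str) -> "LazyQuery":
        # No data is read here; we only extend the plan.
        return LazyQuery(self._rows, self._steps + (("project", columns),))

    def where(self, column: str, predicate: Callable[[object], bool]) -> "LazyQuery":
        return LazyQuery(self._rows, self._steps + (("filter", (column, predicate)),))

    def optimized_plan(self) -> List[Step]:
        # Toy rewrite rule (an assumption, far simpler than a real optimizer):
        # run filters before projections so fewer rows get copied.
        # sorted() is stable, so steps of the same kind keep their order.
        return sorted(self._steps, key=lambda step: step[0] != "filter")

    def collect(self) -> List[dict]:
        # Only now is the plan executed against the data.
        rows = list(self._rows)
        for kind, arg in self.optimized_plan():
            if kind == "filter":
                column, predicate = arg
                rows = [row for row in rows if predicate(row[column])]
            else:
                rows = [{c: row[c] for c in arg} for row in rows]
        return rows


people = [
    {"name": "Ann", "age": 34, "city": "Oslo"},
    {"name": "Bob", "age": 19, "city": "Rome"},
]
query = LazyQuery(people).select("name", "age").where("age", lambda age: age > 21)
result = query.collect()
```

Because the user wrote select before where, the recorded plan has the projection first; the toy optimizer reorders it so the filter runs first, which is the kind of freedom lazy evaluation buys.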


Elastic Database Jobs

Mark Vaillancourt looks at Elastic Database Jobs in Azure:

The new Elastic Database Jobs are designed to closely echo the functionality that folks working with SQL Server are accustomed to on-prem with SQL Agent. But it’s even better than that. There are many features that are just baked in that you no longer have to worry about. I’ve presented on the new Elastic Jobs as part of a larger presentation on the overall Elastic tools associated with Azure SQL Database a handful of times. That presentation is called Azure SQL Database Elastic Boogie and references Marcia Griffiths’ hit song Electric Boogie (The Electric Slide). Yeah. I know. That will explain the use of the word boogie all over the place.

Even with it just being a very new private preview, my experience has been a great one. Huge kudos to Debra and her team on that.

This sounds pretty good.  I really like the dynamic resolution portion and wish that on-prem SQL Agent jobs could do the same out of the box.


TMSCHEMA DMVs

Meagan Longoria wants Azure Analysis Services documentation:

It would be great to get the DMVs documented similar to the MDSCHEMA DMVs as they are quite useful for tasks like documenting your tabular model.  Since the TMSCHEMA DMVs work in Azure Analysis Services as well, I have logged this request on the Azure AS User Voice for that. Please lend me a vote so we can make this information more easily available.

Please vote on this.


Free Trial Of Azure SQL Data Warehouse

James Serra notes that there is a free one-month trial of Azure SQL Data Warehouse:

You can use this one month free trial to do POCs and try out SQL DW up to 200 DWU and 2TB of data.  You must sign up by December 31st 2016.  Please note that once the one month free trial is over, you will start getting billed at general availability pricing rates.  For more information on the free trial, and to sign up, go here.

This is great because you can quickly run out of credits otherwise.


Test Connection With HDInsight

I have a post trying to test a connection using HDInsight:

WebHCat is a web-based REST API for HCatalog, a management layer for dealing with files in HDFS.  If you’re looking for configuration settings for WebHCat, you’ll generally want to look for “templeton” in config files, as Templeton was the project name before WebHCat.  In Ambari, you can go to the Hive configs and look at webhcat-site.xml for configuration settings.  For WebHCat, the default port in HDInsight is 30111, which you should find in the templeton.port configuration setting.

I don’t like the fact that WebHDFS is blocked, but at least WebHCat is functional.
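
For a quick connectivity check, here is a minimal Python sketch that builds the WebHCat status URL and, optionally, calls it. The port 30111 comes from the excerpt and /templeton/v1/status is WebHCat's status resource; the host name "headnode0" and both function names are placeholders I made up, and the sketch assumes you are running it from somewhere that can reach that port directly with no authentication in the way.

```python
import json
from urllib.parse import urlunsplit
from urllib.request import urlopen


def webhcat_status_url(host: str, port: int = 30111, scheme: str = "http") -> str:
    """Build the URL of WebHCat's status resource for a given host and port."""
    return urlunsplit((scheme, f"{host}:{port}", "/templeton/v1/status", "", ""))


def fetch_webhcat_status(url: str, timeout: float = 10.0) -> dict:
    """Call the status resource and return the parsed JSON body.

    Raises an exception from urllib on connection failures or HTTP errors,
    which is itself a useful signal when testing a connection.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# "headnode0" is a hypothetical host name; substitute your own.
status_url = webhcat_status_url("headnode0")
```

Calling fetch_webhcat_status(status_url) from a machine inside the cluster's network should return a small JSON document if WebHCat is up; if your templeton.port setting differs, pass it as the port argument.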


Spark Clusters On Spot Pricing

Sameer Farooqui explains spot pricing with respect to AWS servers:

The idea behind Spot instances is to allow you to bid on spare Amazon EC2 compute capacity. You choose the max price you’re willing to pay per EC2 instance hour. If your bid meets or exceeds the Spot market price, you win the Spot instances. However, unlike traditional bidding, when your Spot instances start running, you pay the live Spot market price (not your bid amount). Spot prices fluctuate based on the supply and demand of available EC2 compute capacity and are specific to different regions and availability zones.

So, although you may have bid 0.55 cents per hour for an r3.2xlarge instance, you’ll end up paying only 0.10 cents an hour if that’s what the going rate is for the region and availability zone.

Databricks uses spot pricing for Community Edition clusters to control costs.  Click through for a very interesting discussion of spot pricing and how they take advantage of it.
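
The bid-versus-market rule from the excerpt can be sketched in a few lines of Python. The function names and the price series are made up for illustration, prices are in whatever unit the bid is quoted in, and the "stop at the first hour the market price exceeds the bid" behavior is my simplifying assumption rather than something the excerpt states.

```python
from decimal import Decimal
from typing import Iterable, Optional


def spot_hourly_charge(bid: Decimal, market_price: Decimal) -> Optional[Decimal]:
    """Return what you pay for one hour, or None if the bid does not win.

    A winning bid (bid >= market price) is charged the market price,
    not the bid amount.
    """
    if bid < market_price:
        return None
    return market_price


def total_charge(bid: Decimal, hourly_market_prices: Iterable[Decimal]) -> Decimal:
    """Sum the charges hour by hour until the bid is no longer high enough."""
    total = Decimal("0")
    for price in hourly_market_prices:
        charge = spot_hourly_charge(bid, price)
        if charge is None:
            break  # assumption: once outbid, the instance stops running
        total += charge
    return total


bid = Decimal("0.55")
one_hour = spot_hourly_charge(bid, Decimal("0.10"))  # market price, not the bid
# Hypothetical price series: the third hour exceeds the bid.
prices = [Decimal("0.10"), Decimal("0.12"), Decimal("0.60"), Decimal("0.10")]
run_cost = total_charge(bid, prices)
```

With the numbers from the quote, a 0.55 bid against a 0.10 market price is charged 0.10 for that hour; the made-up series shows the other side of the trade, where a price spike ends the run early.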


Capturing SSAS Query Activity

Bill Anton explains why and how he captures query activity by user in SSAS:

In most environments, it is trivial to obtain the name of the user who ran each query… all you have to do is capture the [QueryEnd] event in a profiler/xevent trace and pull the information from the [NTUserName] field. However, in environments involving Power BI and the Enterprise On-Premise Data Gateway, there’s a bit more to it.

The main issue is how authentication is handled in this type of architecture. When working with Power BI reports connected to an on-premise data source via the On-Premise Data Gateway, the account of the user running the report is passed as the “EffectiveUsername”. The implication here is that the value shown in the [NTUserName] field of the xevent/profiler trace is going to be the Data Gateway account – NOT the account of the user who actually generated the activity.

Read on for the full answer.


Developing In The Cloud

Richie Rump has some nice pointers about developing for Azure or AWS:

Since we’re a bunch of data freaks, we wanted to make sure that our data and files are properly backed up. I set out to create a script that will backup DynamoDB to a file and copy the data in S3 to Azure. The reasoning for saving our backups into a different cloud provider is pretty straightforward. First, we wanted to keep the data in a separate cloud account from the application. We didn’t make the same mistakes that Code Spaces did. Secondly, I wanted to kick the tires of Azure a bit. Heck, why not?

I figured this script would take me a day to write and a morning to deploy. In the end it took four days to write and deploy. So here are some lessons that I learned the hard way from trying to bang out this backup code.

This is a must-read if you’re starting to look at using cloud providers for services.


Azure SQL DW Statistics

Emma Stewart looks at how statistics are created in Azure SQL Data Warehouse:

In Azure SQL Data Warehouse, statistics have to be created manually. On previous SQL Server projects, creating and maintaining statistics wasn’t something that we had to incorporate into our design (and really think about!). With SQL DW, however, we need to think about how to include it in our process in order to take advantage of the benefits of working with Azure DW.

The major selling point of Azure SQL Data Warehouse is that it is capable of processing huge volumes of data. One of the specific performance optimisations that has been made is the distributed query optimiser. Using the information obtained from the statistics (information on data size and distribution), the service is able to optimise queries by assessing the cost of specific distributed query operations. Therefore, since the query optimiser is cost-based, SQL DW will always choose the plan with the lowest cost.

Azure SQL Data Warehouse is a bit of a strange animal, with differences in statistics being one of the smaller changes versus “classic” SQL Server.


Details On Azure SSAS

James Serra breaks down what Azure Analysis Services has to offer:

  • Developers can use SQL Server Data Tools (SSDT) in Visual Studio for creating models and deploying them to the service.  Administrators can manage the models using SQL Server Management Studio (SSMS) and investigate issues using SQL Server Profiler

  • Business users can consume the models in any major BI tool.  Supported Microsoft tools include Power BI, Excel, and SQL Server Reporting Services.  Other MDX compliant BI tools can also be used, after downloading and installing the latest drivers

  • The service currently supports tabular models (compatibility level 1200 only).  Support for multidimensional models will be considered for a future release, based on customer demand

Between tabular-only support and the max size being 100 GB (if I’m reading this correctly), they’re not yet ready to push the product hard.  Given that it just came out, that makes sense, and hopefully the training wheels come off.
