Spark Usage Scenarios

Rimma Nehme has several usage scenarios for Spark on Azure:

For data scientists, we provide out-of-the-box integration with Jupyter (iPython), the most popular open source notebook in the world. Unlike other managed Spark offerings that might require you to install your own notebooks, we worked with the Jupyter OSS community to enhance the kernel to allow Spark execution through a REST endpoint.

We co-led “Project Livy” with Cloudera and other organizations to create an open source Apache licensed REST web service that makes Spark a more robust back-end for running interactive notebooks.  As a result, Jupyter notebooks are now accessible within HDInsight out-of-the-box. In this scenario, we can use all of the services in Azure mentioned above with Spark with a full notebook experience to author compelling narratives and create data science collaborative spaces. Jupyter is a multi-lingual REPL on steroids. Jupyter notebook provides a collection of tools for scientific computing using powerful interactive shells that combine code execution with the creation of a live computational document. These notebook files can contain arbitrary text, mathematical formulas, input code, results, graphics, videos and any other kind of media that a modern web browser is capable of displaying. So, whether you’re absolutely new to R or Python or SQL or do some serious parallel/technical computing, the Jupyter Notebook in Azure is a great choice.

If you could only learn one new thing in 2016, Spark probably should be that thing.  Also, I probably should agitate a bit more about wanting Spark support within Polybase…

U-SQL

Ginger Grant has a quick intro on U-SQL:

In my previous series on Stream Analytics, I wrote some U-SQL. That U-SQL didn’t look much different than Ansi-SQL, which is sort of the point of porting the functionality to a different yet familiar language. Another application which heavily uses U-SQL is Azure Data Lake. Data Lake stores its data in HDInsight, but you don’t need to write hive to query the data, as U-SQL will do it. Like Hive, U-SQL can be used to create a schema on top of some data, and then query it.

For example, to write a query on this csv file stored in a Data Lake, I would need to create the data definition for the data, then I could easily write a statement to query it.

I’m interested in seeing how much adoption we see in this language.

Pivoting Data

Jeffrey Verheul shows how to use the PIVOT operator in T-SQL:

The idea of using key-value pairs to store data isn’t new, but with the rapid development of cloud solutions like Azure and the hype around NoSQL databases, using key-value pairs to store data got a big boost. Especially developers (in my experience) love using key-value pair to store their data, because it’s easy for them to consume the data in an application. But it gives the database professional an extra challenge because we’re used to retrieve columns with values instead of a record per value. So how can we turn those key-value pairs into rows?

This is a good example of using PIVOT.  I’m not a big fan of storing data in key-value pairs and using pivoting operators because you’re burning CPU on that very expensive SQL Server instance (and you’re not taking advantage of what relational databases do well); if you really need to store data as key-value, I’d recommend doing the pivot in cheaper application servers.

The Power Of DBCC CLONEDATABASE

Erin Stellato hacks DBCC CLONEDATABASE and makes it that much more powerful:

Very often when I mention testing before an upgrade, I’m told that there is no environment in which to do the testing.  I know some of you have a Test environment. Some of you have Test, Dev, QA, UAT and who knows what else. You’re lucky.

For those of you that state you have no test environment at all in which to test, I give you DBCC CLONEDATABASE. With this command, you have no excuse to not run the most frequently-executed queries and the heavy-hitters against a clone of your database. Even if you don’t have a test environment, you have your own machine.  Backup the clone database from production, drop the clone, restore the backup to your local instance, and then test.  The clone database takes up very little space on disk and you won’t incur memory or I/O contention as there’s no data.  You will be able to validate query plans from the clone against those from your production database. Further, if you restore on SQL Server 2016 you can incorporate Query Store into your testing! Enable Query Store, run through your testing in the original compatibility mode, then upgrade the compatibility mode and test again. You can use Query Store to compare queries side by side! (Can you tell I’m dancing in my chair right now?)

Erin’s discovery makes CLONEDATABASE go from being an interesting tool to being outright powerful for handling upgrades.

Surrogate Versus Natural Keys

Kenneth Fisher digs into the debate on surrogate keys versus natural keys:

A natural key is one constructed of data that already exists in the table. For example using latitude and longitude in a table of addresses. Or the social security number in a table of employees. (Before you say anything, yes, the social security number is a horrible primary key. Be patient.)

My personal preference is to use surrogate keys most of the time and put unique constraints (or unique indexes) on the natural key.  There are some occasions in which I’d deviate, but ceteris paribus I’d pick this strategy..

Azure SQL Data Warehouse Plans

Grant Fritchey shows how to build an execution plan for an Azure SQL Data Warehouse query:

So now we just save this as a .sqlplan file and open it in SSMS, right?

Nope!

See, that’s not a regular execution plan, at all. Instead, it’s a D-SQL plan. It’s not the same as our old execution plans. You can’t open it as a graphical plan (and no, not even in that very popular 3rd party tool, I tried). You will have to learn how to read these plans differently because, well, they are different.

That’s an unfortunate outcome.  Reading is hard…

Start-Demo With Multi-Line Commands

Rob Sewell has an update to the Start-Demo cmdlet, making it no longer require backticks when running multi-line commands:

Start-Demo was written in 2007 by a fella who knows PowerShell pretty well 🙂 https://blogs.msdn.microsoft.com/powershell/2007/03/03/start-demo-help-doing-demos-using-powershell/

It was then updated in 2012 by Max Trinidad http://www.maxtblog.com/2012/02/powershell-start-demo-now-allows-multi-lines-onliners/

This enabled support for multi-line code using backticks at the end of each line. This works well but I dislike having to use the backticks in foreach loops, it confuses people who think that they need to be included and to my mind looks a bit messy

I’m going to need to look into Start-Demo; I’m not sure I’ve seen it before.

Registering U-SQL Assemblies

Kevin Feasel

2016-08-29

U-SQL

Michael Rys shows how to register an assembly in the U-SQL catalog:

While the ADL Tools in VisualStudio make it easy to register an assembly, you can also do it with a script (in the same way that the tools do it for you) if you are for example developing on a different platform, have already compiled assemblies that you just want to upload and register. You basically follow the following steps:

You upload your assembly dll and all additionally required non-system dlls and resource files into a location of your choosing in your Azure Data Lake Storage account or even a Windows Azure Blob Store account that is linked to your Azure Data Lake account. You can use any of the many upload tools available to you (e.g., Powershell commands, VisualStudio’s ADL Tool Data Lake Explorer upload, your favorite SDK’s upload command or through the Azure Portal).

 

  1. Once you have uploaded the dlls, you use the CREATE ASSEMBLY statements to register them.

We will use this approach in the spatial example below.

There’s quite a bit going on in this post, making it an interesting read.

Power BI Line Charts

Reza Rad digs into line charts:

Value Line

Line chart can show one or more measures as measures, such as Sales Amount, Total Costs, Quantity of the goods sold and etc.

Trend Line

Depends on the variety of values across a time period, Line chart can illustrate a straight line as a trend. This trend is good to understand how in overall products are selling, is revenue going up or down for example.

Reference Line

You might want to define minimum, maximum, average, or median values for your line chart as separate lines, and compare values lines with these lines. These are reference lines which can be dynamically created based on value of measures in the chart.

I think this was a rather helpful post in figuring out what Power BI is capable of doing, although if you build a lot of charts that look like Reza’s last example, you probably want to scale that back a bit.

Ambari Metrics Collector Error

Kevin Feasel

2016-08-29

Hadoop

Jon Morisi had to troubleshoot an issue with the Ambari metrics collector not starting:

Last week I had a bit of a trial by fire:
“Here’s a 7 node, Hortonworks Hadoop cluster, metrics is broken, fix it, go!”

The initial indication that metrics was broken was apparent in the Services tab for Ambari Metrics.  Here it showed that there was an error and that Metrics Collector was Stopped.  The error however wasn’t very informative:

Connection failed: [Errno 111] Connection refused…

That didn’t tell me much at all, and neither did googling.
(I hope the title of this blog helps someone else find this solution quicker.)

Jon includes the answer and some additional helpful details.  Check it out.

Categories

September 2018
MTWTFSS
« Aug  
 12
3456789
10111213141516
17181920212223
24252627282930