Secure Enterprise Data Hub On Azure

James Morantus has a two-parter on Azure, Active Directory, and Cloudera’s enterprise data hub solution.  Part one hits on DNS and Samba:

As you can see, the hostname -f command displays a very long FQDN for my VM and hostname -i gives us the IP address associated with the VM. Next, I did a forward DNS lookup using the host FQDN command, which resolved to the IP address. Then, I did a reverse DNS lookup using host IPaddress as shown in the red box above, it did not locate a reverse entry for that IP address. A reverse lookup is a requirement for a CDH deployment. We’ll revisit this later.

Part two looks at tying everything together in the Azure portal as well as within AD:

The remaining steps must be executed as the Cloudera Director admin user you created earlier. In my case, that’s the “azuredirectoradmin” account. All resources created by Cloudera Director in the Azure Portal will be owned by this account. The “root” user is not allowed to create resources on the Azure Portal.

First, we’ll need to create a SSH key as the “azuredirectoradmin” user on the VM where Cloudera Director is installed. This key will be added to our deployment configuration file, which will be added on all the VMs provisioned by Cloudera Director. This will allow us to use passwordless SSH to the cluster nodes with this key.

This isn’t trivial, but considering all that’s going on, it’s rather straightforward.

Grabbing Spark With sbt

Kevin Feasel

2017-01-16

Spark

Ian Hellström shows how to create an sbt script to get the a particular version of Spark:

If you have already installed sbt on your machine, read on. If not, have a look here on how to set up your machine.

With sbt available, create a folder in which you can play around, your ‘sandbox’. I’ll assume you have created the folder under /path/to/sandbox. On Windows, also create a sub-folder inside it for Spark’s so-called warehouse directory. Let’s call that sub-folder ‘warehouse’.

Click through for more details.

Querying The Histogram XE Target

Kendra Little shows how to query the histogram target for Extended Events:

I like using the histogram target because it’s relatively lightweight — you can “bucket” results by what you’re interested in. In my case, I was interested seeing the cumulative number of file_read events by file name.

But there’s one problem: the histogram target is stored in memory, not in a data file. If you want to query that data and store it off in a table, it’s not obvious how to do that.

Click through to figure out how to do that.

DILM Reports

Andy Leonard walks through executing an Integration Services package and then seeing the results in an SSRS report:

One of the first Data Integration Lifecycle Management (DILM) Suite solutions I built was Catalog Reports. Catalog Reports is a relatively simple and straightforward version of some of the SSIS Catalog Reports embedded in SSMS. The main difference is Catalog Reports is a SQL Server Reporting Services (SSRS) solution.

It’s free.

And it’s open source. Here’s a screenshot of the Overview Report for the same execution above

Check it out.

Word Count In Spark 2.0

Kevin Feasel

2017-01-16

Spark

Anubhav Tarar has a word count app for Spark 2.0:

Now you have to perform the given steps:

  • Create a spark session from org.apache.spark.sql.sparksession api and specify your master and app name

  • Using the sparksession.read.txt method, read from the file wordcount.txt the return value of this method in a dataset. In case you don’t know what a data set looks like you can learn from this link.

  • Split this dataset of type string with white space and create a map which contains the occurence of each word in that data set.

  • Create a class prettyPrintMap for printing the result to console.

This Hello World app is a bit longer than the sheer minimum code necessary, as it includes a class for formatting results and some error handling.

R Tools For Visual Studio

Matt Willis has a two-parter on R Tools for Visual Studio.  First, an introduction:

Once all the prerequisites have been installed it is time to move onto the fun stuff! Open up Visual Studio 2015 and add an R Project: File > Add > New Project and select R. You will be presented with the screen below, name the project AutomobileRegression and select OK.

Microsoft have done a fantastic job realising that the settings and toolbar required in R is very different to those required when using Visual Studio, so they have split them out and made it very easy to switch between the two. To switch to the settings designed for using R go to R Tools > Data Science Settings you’ll be presented with two pop ups select Yes on both to proceed. This will now allow you to use all those nifty shortcuts you have learnt to use in RStudio. Anytime you want to go back to the original settings you can do so by going to Tools > Import/Export Settings.

Next is executing an Azure Machine Learning web service within RTVS:

Whilst in R you can implement very complex Machine Learning algorithms, for anyone new to Machine Learning I personally believe Azure Machine Learning is a more suitable tool for being introduced to the concepts.

Please refer to this blog where I have described how to create the Azure Machine Learning web service I will be using in the next section of this blog. You can either use your own web service or follow my other blog, which has been especially written to allow you to follow along with this blog.

Coming back to RTVS we want to execute the web service we have created.

RTVS has grown on me.  It’s still not R Studio and may never be, but they’ve come a long way in a few months.

Databases In Source Control

Robert Sheldon walks through core concepts of source control:

The source control system can’t merge the two file versions until Barb resolves the conflict between black bear and brown bear (the additions of wolf and fox still cause no problem).

When conflicts of this nature arise, someone must examine the comparison and determine which version of bear should win out. In this case, Barb decides to go with black bear.

It’s worth considering the risk associated with this merge process. Barb’s commit fails, so she can’t save her changes to the repository until she can successfully perform a merge. If something goes wrong with the merge operation, she risks losing her changes entirely. This might be a minor problem for small textual changes like these, but a big problem if she’s trying to merge in substantial and complex changes to application logic. This is why the source control mantra is: commit small changes often.

The article is more of an intro to source control, but if you aren’t familiar with how source control works, it’s a great read.  Regardless, the best thing you can do for yourself is to get your database code in source control.  That opens up the possibility for safer refactoring of code.

CRISP-DM

Steph Locke explains the CRISP-DM model for data mining projects and applies it to data science projects:

Within a given project, we know that at the beginning of our first ever project we may not have a lot of domain knowledge, or there might be problems with the data or the model might not be valuable enough to put into production. These things happen, and the really nice thing about the CRISP-DM model is it allows for us to do that. It’s not a single linear path from project kick-off to deployment. It helps you remember not to beat yourself up over having to go back a step. It also equips you with something upfront to explain to managers that sometimes you will need to bounce between some phases, and that’s ok.

This is another place in which “iterate, iterate, iterate” ends up being the best answer available.

More On Columnstore Batch Mode

Sunil Agarwal talks about batch mode processing with columnstore indexes:

While these results may not appear as dramatic on my laptop, the picture below shows the performance gains with Window Aggregates on a Server class machine with large DW database. The orange bar represents the query speed up we got with Window Aggregate operator in BatchMode. The highest speed up we saw was 289x!!

Batch mode is generally a huge benefit for data warehousing environments.

Categories

January 2017
MTWTFSS
« Dec Feb »
 1
2345678
9101112131415
16171819202122
23242526272829
3031