HDInsight Tool For Eclipse

Xiaoyong Zhu reports that the HDInsight tool for Eclipse is now generally available:

The HDInsight Tool for Eclipse extends Eclipse to allow you to create and develop HDInsight Spark applications and easily submit Spark jobs to Microsoft Azure HDInsight Spark clusters using the Eclipse development environment.  It integrates seamlessly with Azure, enabling you to easily navigate HDInsight Spark clusters and to view associated Azure storage accounts. To further boost productivity, the HDInsight tool for Eclipse also offers the capability to view Spark job history and display detailed job logs.

Check out the link for videos and additional resources.

Running Compiled Code In Azure ML

Max Kaznady shows how to use R or Python scripts to call compiled code within Azure ML:

In this post, we focus on sourcing R and Python’s external dependencies, such as R libraries and Python modules, which are not already installed on Azure ML and require code compilation. Commonly the compiled code comes from a variety of other languages such as C, C++ and Fortran. One could also use this approach to wrap their compiled code with R or Python wrappers and run it on Azure ML.

To illustrate the process, we will build two MurmurHash modules from C++ for R and Python using the following two implementations on GitHub, and link them to Azure ML from a zipped folder

Link via David Smith.  I knew it was possible to call compiled C code from Python and R, but didn’t expect to be able to do it within Azure ML, so that’s good to know.

Using Azure Data Catalog

Kevin Feasel

2016-07-04

Cloud

Melissa Coates has some good advice if you start using Azure Data Catalog:

Register only data sources that users interact with. Usually the first priority is to register data sources that the users see, for instance, the reporting database or DW that you want users to go to rather than the original source data. Depending on how you want to use the data catalog, you might also want to register the original source. In that case you probably want to hide it from business users so it’s not confusing. Which leads me to the next tip…

Use security capabilities to hide unnecessary sources. The Standard (paid) version will allow you to have some sources registered but only discoverable by certain users & hidden from other users (i.e., asset level authorization). This is great for sensitive data like HR. It’s also useful for situations when, say, IT wants to document certain data sources that business users don’t access directly.

This is a good set of advice.

Overlapping Ranges Using U-SQL

Michael Rys explains how to merge overlapping ranges of data using U-SQL:

If you look at the problem, you will at first notice that you want to define something like a user-defined aggregation to combine the overlapping time intervals. However, if you look at the input data, you will notice that since the data is not ordered, you will either have to maintain the state for all possible intervals and then merge disjoint intervals as bridging intervals appear, or you need to preorder the intervals for each user name to make the merging of the intervals easier.

The ordered aggregation is simpler to scale out, but U-SQL does not provide ordered user-defined aggregators (UDAGGs) yet. In addition, UDAGGs normally produce one row per group, while in this case, I may have multiple rows per group if the ranges are disjoint.

Luckily, U-SQL provides a scalable user-defined operator called a reducer which gives us the ability to aggregate a set of rows based on a grouping key set using custom code.

There are some good insights here, so read the whole thing.
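To make the "preorder, then merge" idea concrete, here is a minimal Python sketch of the same logic. It is not Michael's U-SQL reducer; the function name `merge_ranges` and the `(user, start, end)` row shape are assumptions for illustration, and it treats ranges that merely touch (one ends exactly where the next begins) as overlapping.

```python
from collections import defaultdict


def merge_ranges(rows):
    """Merge overlapping (user, start, end) ranges.

    Hypothetical helper: groups rows by user, sorts each group by start,
    then folds overlapping ranges together. A group can yield several
    rows when its ranges are disjoint.
    """
    by_user = defaultdict(list)
    for user, start, end in rows:
        by_user[user].append((start, end))

    merged = []
    for user in sorted(by_user):
        current_start, current_end = None, None
        # Sorting by start is the "preorder" step described in the quote.
        for start, end in sorted(by_user[user]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps (or touches) the running range: extend it.
                current_end = max(current_end, end)
            else:
                # Gap found: emit the running range and start a new one.
                merged.append((user, current_start, current_end))
                current_start, current_end = start, end
        merged.append((user, current_start, current_end))
    return merged


rows = [("a", 1, 5), ("a", 4, 8), ("a", 10, 12), ("b", 2, 3)]
print(merge_ranges(rows))
```

The grouping key plays the role of the reducer's key set, and the sort inside each group is what lets the merge run in a single pass.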

Azure Cortana Intelligence Suite Walkthrough

Kevin Feasel

2016-06-27

Cloud

Rolf Tesmer gives us a high-level walkthrough of the Azure Cortana Intelligence Suite, using management of a wind turbine farm as an example problem:

Event Hub

What is it

https://azure.microsoft.com/en-us/services/event-hubs/

Fully Managed Service (PaaS) for ingesting events/messages at a massive scale (think telemetry processing from websites, IoT etc).

What does it do in our wind farm

Provides a “front door” to our wind farm application to accept all of the streaming telemetry being generated from the turbines.  Event Hubs won’t process any of this data per se – it’s just ensuring that it’s being accepted and queued (short term) while other components can come in to consume it.

Before you dig deeply into particular services, it’s nice to see how they fit together at a higher level.

Multiple Connection Attempts Required

Kevin Feasel

2016-06-27

Cloud

Ron Dameron has a situation in which he needs to try to connect multiple times to hit his Azure SQL Database instance:

I’ve noticed on several occasions that my first attempt to connect to an Azure SQL Server using SQL Server Management Studio 2016 doesn’t always succeed.

The fix? Press OK and try again.

I’ve not noticed this issue myself, so it does seem weird.
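If you hit the same thing from application code rather than Management Studio, "try again" can be automated. Here is a minimal sketch of a retry wrapper; `connect_with_retry` and its parameters are hypothetical names, and `connect` stands in for whatever function opens your connection.

```python
import time


def connect_with_retry(connect, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    Hypothetical helper: `connect` is any zero-argument callable that
    returns a connection or raises on failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as error:  # narrow this to your driver's error type
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                sleep(delay_seconds * attempt)
    raise last_error
```

Catching every exception is only for illustration; in real code you would retry only on errors you know to be transient.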

Polybase Setup Errors

Murshed Zaman on the Azure CAT team covers a number of Polybase configuration errors:

SSMS Error:

Any Select query fails with the following error.
Msg 106000, Level 16, State 1, Line 1
Java heap space

Possible Reason:

Illegal input may cause the java out of memory error.  In this particular case the file was not in UTF8 format. DMS tries to read the whole file as one row since it cannot decode the row delimiter and runs into Java heap space error.

Possible Solution:

Convert the file to UTF8 format since PolyBase currently requires UTF8 format for text delimited files.

I imagine that this page will get quite a few hits over the years, as there currently exists limited information on how to solve these issues if you run into them, and some of the error messages (especially the one quoted above) have nothing to do with root causes.
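For the UTF8 fix quoted above, re-encoding the file is simple enough to script. This is a minimal sketch, assuming the source file is UTF-16; the post does not say what encoding the offending file used, so treat `source_encoding` as something you need to find out, and `convert_to_utf8` as a hypothetical helper name.

```python
def convert_to_utf8(source_path, target_path, source_encoding="utf-16"):
    """Re-encode a delimited text file as UTF-8 without a byte order mark.

    Assumption: the caller knows the source encoding. newline="" keeps
    the row delimiters exactly as they appear in the source file.
    """
    with open(source_path, "r", encoding=source_encoding, newline="") as src, \
         open(target_path, "w", encoding="utf-8", newline="") as dst:
        # Copy in 1 MB chunks so large files are never held in memory whole.
        for chunk in iter(lambda: src.read(1024 * 1024), ""):
            dst.write(chunk)
```

After converting, point the external table at the new file rather than the original.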

Netflix Billing Architecture

The Netflix tech blog discusses changing their billing infrastructure to be entirely in the cloud (AWS in this case):

Cleaning up Code: We started chipping away existing code into smaller, efficient modules and first moved some critical dependencies to run from the Cloud. We moved our tax solution to the Cloud first.

Next, we retired serving member billing history from giant tables that were part of many different code paths. We built a new application to capture billing events, migrated only necessary data into our new Cassandra data store and started serving billing history, globally, from the Cloud.

We spent a good amount of time writing a data migration tool that would transform member billing attributes spread across many tables in Oracle into a much simpler Cassandra data structure.

We worked with our DVD engineering counterparts to further simplify our integration and got rid of obsolete code.

Purging Data: We took a hard look at every single table to ensure that we were migrating only what we needed and leaving everything else behind. Historical billing data is valuable to legal and customer service teams. Our goal was to migrate only necessary data into the Cloud. So, we worked with impacted teams to find out what parts of historical data they really needed. We identified alternative data stores that could serve old data for these teams. After that, we started purging data that was obsolete and was not needed for any function.

All in all, a very interesting read on how to migrate large databases.  Even if you’re moving from one version of a product to another, some of these steps might prove very helpful in your environment.

Minimizing Cloud Costs

Kevin Feasel

2016-06-22

Cloud

Kenneth Fisher looks at reducing the bottom line for cloud operations:

This got me thinking about ways to reduce/minimize costs. These are some general ideas since from what I can tell cloud billing is as complex as the tax codes and at that I have limited experience.

  • If you aren’t using your VM, shut it down. You can do this manually, or with a PowerShell script or even at the push of a button.

  • Start small. Only create the machines you need and keep them to a minimum.

  • Starting small will lead to some bottlenecks. Feel free to bounce up and down as you need. There are some restrictions (size etc) when you move downwards, so be careful. Again this can be done manually or with PowerShell. Let’s say you need to do a high volume load. Bump your service tier, then once you are done, bump it back down again.

  • And my personal favorite: Don’t install enterprise when you only need standard.

Doing business on Azure or AWS does require a bit of a shift in mindset.  Cloud costs are entirely variable: you control when services run; how much compute, storage, and bandwidth you want to use; and your SLA.  Choosing different spots on the continuum results in different pricing.  This has also helped the growth of technologies like Hadoop, in which you can separate compute from storage.  If I know that my cluster gets heavy usage during core business hours, light usage overnight, and no usage on the weekend, I can spin nodes up and down as necessary, and can even shut off clusters which don’t need to operate.  Because I’m storing the data off of the cluster nodes (on S3 or in Azure Data Lake Storage), data doesn’t become unavailable just because the primary compute process is unavailable; I could spin up another cluster or write a quick one-off data reader.
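The usage pattern in that last example boils down to a schedule you could hand to whatever automation resizes the cluster. Here is a minimal sketch; the function name, the 08:00 to 18:00 business hours, and the node counts are all made-up values for illustration, not recommendations.

```python
from datetime import datetime


def target_node_count(moment, peak_nodes=10, off_peak_nodes=2):
    """Return how many compute nodes the cluster should have at `moment`.

    All numbers are hypothetical: weekends get no nodes, weekdays from
    08:00 to 18:00 get peak_nodes, and other weekday hours get
    off_peak_nodes.
    """
    if moment.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return 0
    if 8 <= moment.hour < 18:
        return peak_nodes
    return off_peak_nodes


print(target_node_count(datetime(2016, 6, 22, 10)))  # a Wednesday morning
```

Scaling to zero on weekends only works because the data lives off the cluster nodes, which is the point of separating compute from storage.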

Tools For Cortana Intelligence Suite Development

Melissa Coates has a list of tools she uses when working with Cortana Intelligence Suite:

4. Azure SDK

The Azure SDK sets up lots of libraries; the main features we are looking for from the Azure SDK right away are (a) the ability to use the Cloud Explorer within Visual Studio, and (b) the ability to create ARM template projects for automated deployment purposes. In addition to the Server Explorer we get from Visual Studio, the Cloud Explorer from the SDK gives us another way to interact with our resources in Azure.

This is a nice tools checklist to compare against what you’re using.
