Press "Enter" to skip to content

Month: October 2016

Sentiment Analysis With NiFi

Timothy Spann ties together a bunch of interesting things with Apache NiFi, including integrations with Twitter, Slack, TensorFlow, and Zeppelin:

First up, I used GetTwitter to read tweets and filtered on these terms:

strata, stratahadoop, strataconf, NIFI, FutureOfData, ApacheNiFi, Hortonworks, Hadoop, ApacheHive, HBase, ApacheSpark, ApacheTez, MachineLearning, ApachePhoenix, ApacheCalcite, ApacheStorm, ApacheAtlas, ApacheKnox, Apache Ranger, HDFS, Apache Pig, Accumulo, Apache Flume, Sqoop, Apache Falcon

Input:

InvokeHttp: I used this to download the first image URL from tweets.
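If you want a rough feel for what that step does outside of NiFi, here's a minimal Python sketch. It assumes the Twitter v1.1 payload shape (entities.media[].media_url_https) that GetTwitter emitted at the time; the function names are mine, not part of the flow:

```python
import json
import urllib.request
from typing import Optional

def first_image_url(tweet_json: str) -> Optional[str]:
    # Pull the first media URL out of a tweet's JSON payload, if one exists.
    tweet = json.loads(tweet_json)
    media = tweet.get("entities", {}).get("media", [])
    return media[0].get("media_url_https") if media else None

def download_image(url: str, dest: str) -> None:
    # Rough stand-in for the InvokeHttp step: fetch the URL, save the bytes.
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        out.write(resp.read())

sample = '{"entities": {"media": [{"media_url_https": "https://example.com/pic.jpg"}]}}'
print(first_image_url(sample))   # https://example.com/pic.jpg
```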

It’s interesting to see this all tie together relatively easily.


Documenting Tabular Models

Meagan Longoria shows how to document a tabular model using Excel and Power BI:

The Database Info tab uses cube functions to display the values shown. The rest of the tabs use pivot tables and slicers.

You can download and use this model if you have Excel 2016 or 2013 with Power Query. Just download the file, change the values in the TabularInstanceName and TabularDBName queries, and refresh. I was able to open the file and update the queries with no issues and no changes in Excel 2013.

This is great if you just need documentation or a quick way to look up specific information. But we could really analyze this using visualizations other than tables in Power BI Desktop.

That’s pretty fancy.  Check it out.


Running U-SQL Jobs

Ginger Grant shows how to run a U-SQL job locally from Visual Studio:

There are a few steps required before any code is run. If the Data Lake Analytics Tools are not installed within Visual Studio, download them here and install them. When the tools are installed, the menu item Data Lake appears in Visual Studio. The second step is to set up your PC with the same file structure as your data lake. The default location in which the Data Lake tools will look for your data structure is C:\Users\<<insertyourname>>\AppData\Local\USQLDataRoot. What this means is that if you have folders and subfolders created in your data lake, your PC needs to have the same structure, including the data.
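As a hypothetical illustration of that mirroring step, something like this copies a local snapshot of your lake's folders into place (the source path is made up; the USQLDataRoot location comes from the quote):

```python
import os
import shutil

# Assumption: you keep a local copy of your data lake's files somewhere.
source_root = r"D:\datalake-sample"

# The default local data root the Data Lake tools read from, per the quote:
# C:\Users\<you>\AppData\Local\USQLDataRoot
usql_data_root = os.path.join(os.environ["LOCALAPPDATA"], "USQLDataRoot")

# Recreate the lake's folder structure, including the data, under USQLDataRoot.
# dirs_exist_ok requires Python 3.8+.
shutil.copytree(source_root, usql_data_root, dirs_exist_ok=True)
```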

There is also a way to test these jobs locally before you spend that Azure money spinning up Data Lake jobs.


Rounding

Kenneth Fisher looks at decimal truncation methods:

ROUND

This is the most complicated of the three. It does a standard rounding. If the value is .5 or over, you get back 1; if it's less than .5, you get back 0. On top of that, you get to pass the place you want to round to. So, for example, 0 rounds to the nearest ones place, -1 rounds to the tens place, and 2 rounds to the hundredths place.

There’s a bonus here:  if you use ROUND in T-SQL, the results differ from the default Round method in .NET; that method uses banker’s rounding by default, whereas T-SQL’s rounding does not.
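Coincidentally, Python’s built-in round() uses the same banker’s rounding as the .NET default, so the contrast is easy to sketch in a few lines (the tsql_round helper is mine, emulating T-SQL’s round-half-away-from-zero behavior):

```python
from decimal import Decimal, ROUND_HALF_UP

# Banker's rounding (round half to even): the .NET Math.Round default,
# and also what Python's built-in round() does.
print(round(0.5), round(1.5), round(2.5))          # 0 2 2

def tsql_round(value: str, places: int) -> Decimal:
    # Emulate T-SQL ROUND: round half away from zero at the given place.
    # places=0 -> ones, places=2 -> hundredths, places=-1 -> tens.
    return Decimal(value).quantize(Decimal(1).scaleb(-places),
                                   rounding=ROUND_HALF_UP)

print(tsql_round("0.5", 0), tsql_round("2.5", 0))  # 1 3
print(tsql_round("123.456", 2))                    # 123.46
```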


MOSL

Niko Neugebauer has a new script library, this time for memory-optimized objects:

Today, I am announcing the initial availability of a brand new library for diving into the internals of SQL Server: MOSL, the Memory Optimised Scripts Library, which will help you to discover, maintain, and troubleshoot Hekaton in SQL Server 2014, 2016, and Azure SQL Database. At the moment, I am releasing the scripts for SQL Server 2014, but at the PASS Summit I am hoping to have all three versions online.

I am inviting everyone working with Memory-Optimised Technology to join me in this project, which is aimed at helping everyone who works with, or aspires to work with, this amazing and ground-breaking technology.

Check them out if you use memory-optimized tables in your environment.


Graph Analytics With Spark

Mirko Kämpf looks at using GraphFrames on Spark:

Next, we’ll define a DataFrame by loading data from a CSV file, which is stored in HDFS.

Our data file facebook_combined.txt contains two columns to represent links between network nodes. The first column is called source (src), and the second is the destination (dst) of the link. (Some other systems, such as Gephi, use “source” and “target” instead.)

First we define a custom schema, and then we load the DataFrame using SQLContext.
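For a rough idea of what that looks like, here’s a PySpark sketch under a few assumptions: Spark 2.x syntax, a space-separated edge file at a made-up HDFS path, and the graphframes package available on the classpath.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, LongType
from graphframes import GraphFrame   # needs the graphframes Spark package

sc = SparkContext(appName="facebook-graph")
sqlContext = SQLContext(sc)

# Custom schema: two columns of node ids, source and destination of each link.
edge_schema = StructType([
    StructField("src", LongType(), False),
    StructField("dst", LongType(), False),
])

# Load the edge list from HDFS; the path is an assumption.
edges = sqlContext.read.csv("hdfs:///data/facebook_combined.txt",
                            schema=edge_schema, sep=" ")

# Vertices are the distinct node ids that appear on either end of an edge.
vertices = (edges.select("src").union(edges.select("dst"))
                 .distinct().withColumnRenamed("src", "id"))

g = GraphFrame(vertices, edges)
print(g.vertices.count(), g.edges.count())
```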

It sounds like Spark graph database engines are early in their lifecycle, but they might already be useful for simple analysis.


Hadoop’s S3 Support

Steve Loughran and Sanjay Radia give us a history lesson on Hadoop’s support for Amazon S3:

Hadoop’s ability to work with Amazon S3 storage goes back to 2006 and the issue HADOOP-574, “FileSystem implementation for Amazon S3”. This filesystem client, “s3://”, implemented an inode-style filesystem atop S3: it could support bigger files than S3 could then support, and some of its operations (directory rename and delete) were fast. The s3 filesystem allowed Hadoop to be run in Amazon’s EMR infrastructure, using S3 as the persistent store of work. This piece of open source code predated Amazon’s release of EMR, “Elastic MapReduce”, by over two years. It’s also notable as the piece of work which gained Tom White, author of “Hadoop: The Definitive Guide”, committer status.

It’s interesting to see how this project has matured over the past decade.


PolyBase With HDP 2.5

I run into some issues with PolyBase and Hortonworks Data Platform 2.5:

First, it’s interesting to note that the PolyBase engine uses “pdw_user” as its user account.  That’s not a blocker here because I have an open door policy on my Hadoop cluster:  no security lockdown because it’s a sandbox with no important information.  Second, my IP address on the main machine is 192.168.58.1 and the name node for my Hadoop sandbox is at 192.168.58.129.  These logs show that my main machine runs a getfileinfo command against /tmp/ootp/secondbasemen.csv.  Then, the PolyBase engine asks permission to open /tmp/ootp/secondbasemen.csv and is granted permission.  Then…nothing.  It waits for 20-30 seconds and tries again.  After four failures, it gives up.  This is why it’s taking about 90 seconds to return an error message:  it tries four times.

Aside from this audit log, there was nothing interesting on the Hadoop side.  The YARN logs had nothing in them, indicating that whatever request happened never made it that far.

Here’s hoping there’s a solution in the future.


ReaderWriterSpinlock

Ewald Cress looks at the new ReaderWriterSpinlock in SQL Server 2016 CU2:

As a quick refresher, a traditional SQLOS spinlock is a 32-bit integer, or of course 64-bit as of 2016, with a value of either zero (lock not acquired) or the 32-bit Windows thread ID of the thread that owns it. All very simple and clean in terms of atomic acquire semantics; the only fun part is the exponential backoff tango that results from a collision.

We have also observed how the 2016 flavour of the SOS_RWLock packs a lot of state into 64 bits, allowing more complicated semantics to be implemented in an atomic compare-and-swap. What seems to be politically incorrect to acknowledge is that these semantics boil down to a simplified version of a storage engine latch, who is the unloved and uncool grandpa nowadays.

Clearly a lot can happen in the middle of 64 bits.
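For a feel of how much state fits in one word, here’s a toy Python model of a reader-writer spinlock packed into 64 bits. This is emphatically not SQLOS’s actual layout; a plain lock stands in for the hardware compare-and-swap, and the exponential backoff tango is omitted.

```python
import threading

WRITER_BIT  = 1 << 63          # high bit: a writer holds the lock
READER_MASK = (1 << 63) - 1    # low 63 bits: count of active readers

class ToyReaderWriterSpinlock:
    def __init__(self):
        self._word = 0
        self._cas_guard = threading.Lock()

    def _cas(self, expected: int, new: int) -> bool:
        # Simulated atomic compare-and-swap on the 64-bit state word.
        with self._cas_guard:
            if self._word == expected:
                self._word = new
                return True
            return False

    def readers(self) -> int:
        # The low 63 bits of the word carry the active-reader count.
        return self._word & READER_MASK

    def acquire_read(self):
        while True:
            seen = self._word
            # Spin while a writer owns the word; otherwise bump the reader count.
            if not (seen & WRITER_BIT) and self._cas(seen, seen + 1):
                return

    def release_read(self):
        while True:
            seen = self._word
            if self._cas(seen, seen - 1):
                return

    def acquire_write(self):
        while True:
            # Only an entirely-zero word (no readers, no writer) can be taken.
            if self._cas(0, WRITER_BIT):
                return

    def release_write(self):
        self._cas(WRITER_BIT, 0)
```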

Definitely worth a read, as it seems that this is going to get more play in the years to come.
