Press "Enter" to skip to content

Month: May 2016

Integrating Custom Data Sources Into Spark

Nicolas A Perez builds a custom Spark streaming data source:

We first receive the order ID and the total amount of the order, and then we receive the line items of the order. The first value is the item ID, the second is the order ID (which matches the order ID value), and the third is the cost of the item. In this example, we have two orders. The first one has four items and the second one has only one item.

The idea is to hide all of this from our Spark application, so what it receives on the DStream is a complete order defined on a stream as follows:

Check out this practical application of Spark Streaming.
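
Nicolas’s own code isn’t reproduced above, so here is a rough sketch of the idea with my own names and wire format, not his: assume a comma-separated feed where an order header has two fields (order ID, total) and each line item has three (item ID, order ID, cost). A custom receiver can then assemble complete orders before Spark ever sees them:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

case class Item(itemId: Int, orderId: Int, cost: BigDecimal)
case class Order(id: Int, total: BigDecimal, items: List[Item])

// Assembles raw lines into whole orders before handing them to Spark,
// so the DStream only ever sees complete orders.
class OrderReceiver(host: String, port: Int)
    extends Receiver[Order](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit =
    new Thread("order-receiver") { override def run(): Unit = receive() }.start()

  override def onStop(): Unit = () // the reader thread exits once the socket closes

  private def receive(): Unit = {
    val socket = new Socket(host, port)
    val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))

    var header: Option[(Int, BigDecimal)] = None // the open order: (id, total)
    var items: List[Item] = Nil

    def emit(): Unit = header.foreach { case (id, total) =>
      store(Order(id, total, items.reverse))
    }

    var line = reader.readLine()
    while (!isStopped() && line != null) {
      line.split(",").map(_.trim) match {
        case Array(id, total) =>         // two fields: a new order header
          emit()                         // flush the previous order, if any
          header = Some((id.toInt, BigDecimal(total)))
          items = Nil
        case Array(item, order, cost) => // three fields: a line item
          items = Item(item.toInt, order.toInt, BigDecimal(cost)) :: items
        case _ =>                        // ignore anything malformed
      }
      line = reader.readLine()
    }
    emit()                               // flush the final order
    socket.close()
  }
}
```

You would wire it up with ssc.receiverStream(new OrderReceiver(host, port)), giving a DStream[Order] that only ever contains whole orders.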


In-Memory OLTP Using Ignite

Babu Elumalai explains how to use Apache Ignite to build an in-memory OLTP system on top of Amazon’s DynamoDB:

Business users have been content to perform analytics on data collected in Amazon Redshift to spot trends. But recently, they have been asking AWS whether the latency can be reduced for real-time analysis. At the same time, they want to continue using the analytical tools they’re familiar with.

In this situation, we need a system that lets you capture the data stream in real time and use SQL to analyze it in real time.

In the earlier section, you learned how to build the pipeline to Amazon Redshift with Firehose and Lambda functions. The following illustration shows how to use Apache Spark Streaming on EMR to compute time window statistics from DynamoDB Streams. The computed data can be persisted to Amazon S3 and accessed with SparkSQL using Apache Zeppelin.

There are a lot of technologies at play here and it’s worth a perusal, even though I’m going to keep recommending that you use a relational database like SQL Server for OLTP work in all but the most extreme of circumstances.
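
To make the Spark Streaming step above a bit more concrete, here is a minimal sketch of a sliding-window count persisted to S3. This is my illustration, not code from the AWS post; the socket source stands in for the DynamoDB Streams adapter, and the bucket path is a placeholder:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowStats {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ddb-window-stats"), Seconds(60))

    // Assumed: an upstream stage has already turned DynamoDB Streams records
    // into keyed events; the real adapter wiring is omitted here.
    val events = ssc.socketTextStream("localhost", 9999).map(key => (key, 1L))

    // Count per key over a ten-minute window, sliding forward every minute.
    val windowed = events.reduceByKeyAndWindow(_ + _, Seconds(600), Seconds(60))

    windowed.foreachRDD { (rdd, time) =>
      // Hypothetical bucket; SparkSQL (e.g., from Zeppelin) can query the output.
      rdd.saveAsTextFile(s"s3://example-bucket/stats/${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```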


Deep Learning

Pete Warden argues that deep learning is not just a fad:

This kind of attribution of an adjective to a subject is something an accurate parser can do automatically. Rather than laboriously going through just a hundred examples, it’s easy to set up Parsey McParseface and run through millions of sentences. The parser isn’t perfect, but at 94% accuracy on one metric, it’s pretty close to humans, who get 96%.

Even better, having the computer do the heavy lifting means that it’s possible to explore many other relationships in the data, to uncover all sorts of unknown statistical relationships in the language we use. There are bound to be other words that are skewed in similar or opposite ways to ‘bossy’, and I’d love to know what they are!

Looks like one more time sink for me…  Check this out if you’re at all interested in parsers.
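
To make the “skewed words” idea concrete, here is a toy sketch (mine, not Pete’s analysis code) that takes (adjective, subject) pairs of the sort a parser might emit and ranks adjectives by how often they attach to “she”:

```scala
// Toy input standing in for parser output: (adjective, subject pronoun) pairs.
val pairs = Seq(
  ("bossy", "she"), ("bossy", "she"), ("bossy", "he"),
  ("brilliant", "he"), ("brilliant", "he"), ("brilliant", "she")
)

// For each adjective, what share of its attributions go to "she"?
val skew = pairs
  .groupBy { case (adj, _) => adj }
  .map { case (adj, xs) =>
    adj -> xs.count { case (_, subj) => subj == "she" }.toDouble / xs.size
  }
  .toSeq
  .sortBy { case (_, share) => -share }

skew.foreach { case (adj, share) => println(f"$adj%-10s ${share * 100}%.0f%% she") }
```

The real exercise is the same computation at the scale of millions of parsed sentences rather than six hand-typed pairs.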


Grant View Definition

Matt Smith ran into an issue when trying to compare two databases using SQL Server Data Tools:

This appeared to work but didn’t display any results. When you look at the bottom status bar, it reads

Comparison complete. No differences detected. Restricted comparison. See Error List for details.

I knew there were at least some differences. I then clicked on the Error List tab below it, which revealed

The reverse engineering operation cannot continue because you do not have View Definition permission on the ‘Warehouse’

The answer is pretty simple, so read on.


Deploying To Azure SQL Database

Julie Smith shows how to deploy a database (AdventureWorksDW) out to Azure:

This is telling us that four of the tables in the sample do not have clustered indexes. Azure SQL Database insists on a clustered index for every table. So without warranty, here is a script that I used to refactor my on-prem AdventureWorksDW2014 database. After making these fixes, I was able to deploy to Azure SQL DB from SSMS with no errors.

Julie includes the script she used to fix up AdventureWorksDW.


Building A Prediction Engine

Richard Williamson explains how to build a prediction engine using technologies such as Spark, Kudu, Impala, and Kafka:

We’ll aim to predict the volume of events for the next 10 minutes using a streaming regression model, and compare those results to a traditional batch prediction method. This prediction could then be used to dynamically scale compute resources, or for other business optimization. I will start out by describing how you would do the prediction through traditional batch processing methods using both Apache Impala (incubating) and Apache Spark, and then finish by showing how to more dynamically predict usage by using Spark Streaming.

Of course, the starting point for any prediction is a freshly updated data feed for the historic volume for which I want to forecast future volume. In this case, I discovered that Meetup.com has a very nice data feed that can be used for demonstration purposes. You can read more about the API here, but all you need to know at this point is that it provides a steady stream of RSVP volume that we can use to predict future RSVP volume.

This is pretty dense, but it is a great look at one potential architecture leveraging Spark and several tools in the Hadoop ecosystem.
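
If you want a taste of the streaming half before diving in, a minimal streaming regression in Spark might look like the sketch below. The socket source and the lagged-volume feature layout are my assumptions, not Richard’s actual pipeline:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RsvpForecast {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("rsvp-forecast"), Seconds(60))

    // Assumed input: lines of "label,f1,f2,f3", where the label is the next
    // window's RSVP volume and the features are lagged volumes.
    val training = ssc.socketTextStream("localhost", 9999).map { line =>
      val xs = line.split(',').map(_.toDouble)
      LabeledPoint(xs.head, Vectors.dense(xs.tail))
    }

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(3)) // one weight per lagged feature

    model.trainOn(training) // the model updates as each batch arrives
    model
      .predictOnValues(training.map(lp => (lp.label, lp.features)))
      .print() // (actual, predicted) per batch; reusing the training stream just shows the call shape

    ssc.start()
    ssc.awaitTermination()
  }
}
```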


Dashboard Design

Melissa Yu explains how people look at dashboards:

Dashboards can be used to communicate a dense collection of information efficiently on a single canvas. Your audience has a limited amount of time to monitor key metrics to get a quick status and identify anything that needs attention. The attention span of the average human has gone from about 12 seconds in 2000 (when mobile phones became mainstream) to about 8 seconds today – a second less than a goldfish – according to a 2015 study.

Following data visualization design principles is key to making your dashboard easily consumable. A poorly designed dashboard can make your eyes jump all over the screen. While it won’t give you much insight, it may cause a headache. In the Western world, we read from top left to right, then zig-zag down left and scroll right again (in a Z-pattern). Understanding where the audience’s eyes will start and travel next allows you to guide them through your dashboard.

Check the link for more details.


Working With Windows Server Core

Sander Stad has a couple of blog posts on working with Windows Server Core edition.  First, what happens if you lose your command prompt?

In my enthusiasm I clicked the “X” on the top right corner. So this happened:

Mayhem! How do I get my command screen back? Reboot?! NO WAY!

After figuring that out, Sander also explains how to perform updates:

The Windows Server Core Edition is a really good option because there aren’t as many binaries as there would be in a full installation. Due to the smaller number of objects, you get more stability, simplified management, reduced maintenance, and a reduced risk of being attacked.

So you have a Windows Server Core Edition installed but want to update the server manually. Maybe this is a virtual machine on your local PC that needs updating and you don’t have WSUS running.

Just about everything you previously managed using the GUI in Windows is now done with sconfig. You can edit your server’s name, the domain, network settings, and date/time; shut down the server; and also manage updates.

Admittedly, most Core installations will probably be in environments with a lot of automation around them, but sometimes you’re just doing a one-off thing.


Calculating Weekdays Between Dates

Daniel Hutmacher shows how to calculate the number of weekdays between two dates:

It has to be said, if performance is important (and it should be), you’ll know that user-defined scalar T-SQL functions can kill your query. So you might want to implement the above in the form of a CROSS APPLY instead.

Here’s where I advocate for two tables on every instance:  a tally table and a date table.  The date table should look like a date dimension, including every potentially-interesting piece of information about a date—including whether it’s a weekday.  It might not perform quite as fast as Daniel’s solution (I’d have to test to know for sure), but it’s definitely easier.  If you can’t get a viable date table on your instances for whatever reason, Daniel’s solution does work and does not require any additional objects.
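
Daniel works in T-SQL, but the arithmetic itself is language-agnostic: whole weeks always contribute five weekdays, so only the leftover days need inspecting. Here’s a standalone sketch of that logic (mine, in Scala rather than T-SQL):

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.ChronoUnit

// Weekdays (Mon-Fri) in the half-open range [from, to): every full week
// contributes exactly five weekdays; only the remainder needs a scan.
def weekdaysBetween(from: LocalDate, to: LocalDate): Long = {
  val totalDays = ChronoUnit.DAYS.between(from, to)
  val fullWeeks = totalDays / 7
  val leftover = (fullWeeks * 7 until totalDays).count { offset =>
    val dow = from.plusDays(offset).getDayOfWeek
    dow != DayOfWeek.SATURDAY && dow != DayOfWeek.SUNDAY
  }
  fullWeeks * 5 + leftover
}

// Example: May 2016 has 22 weekdays.
// weekdaysBetween(LocalDate.of(2016, 5, 1), LocalDate.of(2016, 6, 1)) == 22
```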


Removing Snippet Line Numbers

Jens Vestergaard grabs some code and is left with line numbers he wants to remove:

Only too often I find myself searching the web for pieces of code that I need for some odd solution. And almost every time, I come across a web page where the code sample is displayed with line numbers. It’s always annoying to have to sit and delete those manually. No longer, I say! Yes, you heard me right. Actually, Visual Studio has a neat trick to help you get this done in a jiff.

This is useful for a lot more than just removing line numbers.
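
If you’d rather script the cleanup than do it in an editor (my alternative, not the Visual Studio trick Jens demonstrates), a per-line regex pass does the job:

```scala
// A snippet as it might be pasted from a web page, line numbers included.
val snippet =
  """1  SELECT *
    |2  FROM dbo.Orders
    |3  WHERE OrderDate >= '20160501';""".stripMargin

// Strip a leading run of digits (and surrounding whitespace) from each line.
val cleaned = snippet.linesIterator
  .map(_.replaceFirst("""^\s*\d+\s*""", ""))
  .mkString("\n")

println(cleaned)
```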
