Choosing A Hadoop Data Format

Silvia Oliveros has a set of considerations to help you choose a file format for your data in Hadoop:

What does your pipeline look like, and what steps are involved?

Some of the file formats were optimized to work in certain situations. For example, Sequence files were designed to easily share data between Map Reduce (MR) jobs, so if your pipeline involves MR jobs then Sequence files make an excellent option. In the same vein, columnar data formats such as Parquet and ORC were designed to optimize query times; if the final stage of your pipeline needs to be optimized, using a columnar file format will increase speed while querying data.

At first, I’d suggest just using delimited files, as it’s easiest that way.  Once you have developed a bit of Hadoop maturity, then it makes sense to think about whether rowstore formats (like Parquet and Avro) or columnstore formats (like ORC) make sense for a particular data set.

OCR With Tesseract

Amuda Adelou shows how to use Tesseract’s Java API to perform character recognition in images:

Extracting text from an image means that you are considering the flowchart imagery that’s processed to extract the text components and then extracting the geometrical shapes components. The text components are extracted with geometrical components, as well. The internal relationship between the components is set up by tracing the flow lines that connect different components. The extracted components are output to metadata (in XML format), which is machine-readable. This metadata can be archived, stored in a knowledge base, or shared with others.

Click through for a demo app and code.

Spark Deep Learning On AWS

Joseph Spisak, et al, show how to configure and use BigDL in Amazon Web Services’s ElasticMapReduce:

Classify text using BigDL

In this tutorial, we demonstrate how to solve a text classification problem based on the example found here. This example uses a convolutional neural network to classify posts in the 20 Newsgroup dataset into 20 categories.

We’ve provided a companion Jupyter notebook example on GitHub that you can open in the Jupyter dashboard to execute the code sections.

There’s a lot to this tutorial.

Troubleshooting AG Creation Failure

Anthony Nocentino digs into logs to troubleshoot a failure when trying to create an Availability Group:

Now we have some data to look through!

When we look at the contents of the cluster logs, we’re totally on the other side of the spectrum when it comes to information verbosity. The logs so far have been pretty terse and haven’t really told us about what’s causing the failure…well dig through this log and you’ll likely find your reason and a lot more information. Good stuff to look at to get an understanding of the internals of WSFCs. Now for the the reason my Availability Group creation failed was permissions. Check out the log entries.

It’s a good troubleshooting guide.

Building A Recommendation System With Graph Data

Arvind Shyamsundar and Shreya Verma show how to implement a recommender system using SQL Server 2017’s new graph database functionality:

As you can see from the animation, the algorithm is quite simple:

  • First, we identify the user and ‘current’ song to start with (red line)
  • Next, we identify the other users who have also listened to this song (green line)
  • Then we find the other songs which those other users have also listened to (blue, dotted line)
  • Finally, we direct the current user to the top songs from those other songs, prioritized by the number of times they were listened to (this is represented by the thick violet line.)

The algorithm above is quite simple, but as you will see it is quite effective in meeting our requirement. Now, let’s see how to actually implement this in SQL Server 2017.

Click through for animated images as well as an actual execution plan and recommendations for graph query optimization (spoilers:  columnstore all the things).  They also link to the GitHub project where you can try it out yourself.

Finding Query Plan Regressions

Jovan Popovic shows how to find query plan regressions in SQL Server 2017:

In CTP2.0 version is added new system view sys.dm_db_tuning_recommendations that returns recommendations that you can apply to fix potential problems in your database. This view contains all identified potential performance issues in SQL queries that are caused by the SQL plan changes, and the correction scripts that you can apply. Every row in this view contains one recommendation that you can apply to fix the issue. Some of the information that are shown in this view are:

  • Id of the query, plan that caused regression, and the plan that that might be used instead of this plan.

  • Reason that describes what kind of regression is detected (e.g. CPU time for the query is changed from 17ms to 189ms)

  • T-SQL script that can be used to force the plan.

  • Information about the current plan, and previous plan that had better performance.

In the “surgical scalpel to chainsaw” range of query tuning options, this rates approximately guillotine.  I think it’ll be a very useful tool for finding issues, but it wouldn’t be wise to start lopping off all the heads just because the optimizer tells you to.  In this context, I imagine this DMV to be about as useful as the missing indexes DMV and for the same reasons.

Team Maturity Levels

Ed Elliott has the best lists:

Maturity Levels

OK so this is pretty simple, we have these levels:

  • Low
  • Medium
  • High

Wow. Just WOW

That is an amazing list, how did you come up with it? Did it come from some phd study on the effectiveness of lists in the internet age? No.

So a little more detail…

Read on for Ed’s take on database development maturity levels.  I might quibble with some of the specifics, but I agree with the principle.

Back Up Those System Databases

Arun Sirpal had gremlins infest his server, causing a service pack installation to go sideways:

I try to start the service but it fails. (I was desperate ok!)

Script level upgrade for database ‘master’ failed because upgrade step ‘msdb110_upgrade.sql’ encountered error 5846, state 1, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the ‘master’ database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.

Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.

To be honest, rebuilding master would be my last option.

Read the whole thing, and then double-check that you have good copies of master & msdb somewhere.

Instantiating Cortana Intelligence Gallery Solutions

Melissa Coates has a step-by-step guide showing how to install a solution from the Cortana Intelligence Gallery:

We had no options along the way for selecting names for resources, so we have a lot of auto-generated suffixes for our resource names. This is ok for purely learning scenarios, but not my preference if we’re starting a true project with a pre-configured solution. Following an existing naming convention is impossible with solutions (at this point anyway). A wish list item I have is for the solution deployment UI to display the proposed names for each resource and let us alter if desired before the provisioning begins.

The deployment also doesn’t prompt for which subscription to deploy to (if you have multiple subscriptions like I do). The deployment did go to the subscription I wanted, however, it would be really nice to have that as a selection to make sure it’s not just luck.

It sounds like there are some undesirable defaults, but at least it does appear to be very easy to do.


April 2017
« Mar May »