Reporting On Unstructured Data

Jen Underwood discusses a tool which helps provide structure to text documents:

Although there are a variety of ways to extract unstructured data from files, one tried-and-true, fast and simple approach is to use Datawatch Monarch. Years ago I used this tool when building Department of Defense digital contract reporting projects. At that time, the process to define data regions and extract unstructured data required a bit of field mapping experimentation. With the latest version of Monarch Auto Define, that process is intelligently automated today.

Looks like an interesting tool.

One-Sample T Tests

Mala Mahadevan shows how to perform one-sample T Tests:

For this post I decided to go with a simple example of how many steps I walked per day for the month of August. My goal is 10,000 steps per day – that has been my average over the year, but is this true of the data I gathered in August? I have a simple table with two columns – day and steps. Each record has how many steps I took per day in August, for 30 days. So – SELECT AVG(steps) FROM [dbo].[mala-steps] gives me 8262 as my average number of steps per day in August. I want to know if I am consistently underperforming my goal, or if this is a result of my being less active in August alone. Let me state my problem first – or state what is called the ‘null hypothesis’:

I walk an average of 10,000 steps per day over the year.

Read on for T test operations in T-SQL (although not all operations are available) and R.
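
To get a sense of the arithmetic involved, here is a minimal T-SQL sketch of the one-sample t statistic for this scenario. It is not Mala's code; the table and column names come from the excerpt above, and you would still compare the result against a critical value from a t table with N-1 degrees of freedom.

-- One-sample t statistic for H0: mean daily steps = 10,000
-- (a sketch; table and column names follow the excerpt)
DECLARE @mu FLOAT = 10000;

SELECT
    COUNT(*)                              AS N,
    AVG(CAST(steps AS FLOAT))             AS SampleMean,
    STDEV(steps)                          AS SampleStdDev,
    (AVG(CAST(steps AS FLOAT)) - @mu)
        / (STDEV(steps) / SQRT(COUNT(*))) AS TStatistic
FROM [dbo].[mala-steps];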

Clickstream Anomaly Detection

Chris Marshall shows how to perform anomaly detection using AWS Kinesis Analytics:

The RANDOM_CUT_FOREST function greatly simplifies the programming required for anomaly detection.  However, understanding your data domain is paramount when performing data analytics.  The RANDOM_CUT_FOREST function is a tool for data scientists, not a replacement for them.  Knowing whether your data is logarithmic, circadian rhythmic, linear, etc. will provide the insights necessary to select the right parameters for RANDOM_CUT_FOREST.  For more information about parameters, see the RANDOM_CUT_FOREST Function.

Fortunately, the default values work in a wide variety of cases. In this case, use the default values for all but the subSampleSize parameter.  Typically, you would use a larger sample size to increase the pool of random samples used to calculate the anomaly score; for this post, use 12 samples so as to start evaluating the anomaly scores sooner.

Your SQL query outputs one record every ten seconds from the tumbling window, so you’ll have enough evaluation values after two minutes to start calculating the anomaly score.  You are also using a cutoff value, applied in the WHERE clause, so that records are only output to “DESTINATION_SQL_STREAM” if the anomaly score from the function is greater than 2. To help visualize the cutoff point, here are the data points from a few runs through the pipeline using the sample Python script:

This kind of scenario is pretty cool—you could also do things like detecting service outages in streams (fewer than X events in a window, where X is some very small number relative to your overall data) or changes in advertising campaigns.
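
For context, the plumbing in Kinesis Analytics SQL looks roughly like the sketch below. This is not the post's exact application code: the stream and column names are placeholders, and the RANDOM_CUT_FOREST arguments follow its documented order (input cursor, number of trees, subsample size, time decay, shingle size), with the subsample size lowered to 12 as in the excerpt.

-- Placeholder in-application streams; the actual post carries aggregated clickstream columns.
CREATE OR REPLACE STREAM "TEMP_STREAM" (METRIC_VALUE DOUBLE, ANOMALY_SCORE DOUBLE);
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (METRIC_VALUE DOUBLE, ANOMALY_SCORE DOUBLE);

-- Score each windowed record with RANDOM_CUT_FOREST
-- (arguments: input cursor, number of trees, subsample size, time decay, shingle size).
CREATE OR REPLACE PUMP "SCORE_PUMP" AS
  INSERT INTO "TEMP_STREAM"
    SELECT STREAM METRIC_VALUE, ANOMALY_SCORE
    FROM TABLE(RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001"),
        100, 12, 100000, 1));

-- Apply the cutoff: only records scoring above 2 reach the destination stream.
CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM METRIC_VALUE, ANOMALY_SCORE
    FROM "TEMP_STREAM"
    WHERE ANOMALY_SCORE > 2;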

Deadlocks In Apache Ignite

Prachi Garg discusses Deadlock-Free Transactions in Apache Ignite:

When transactions in Ignite are performed with concurrency mode OPTIMISTIC and isolation level SERIALIZABLE, locks are acquired during transaction commit with an additional check allowing Ignite to avoid deadlocks. This also prevents cache entries from being locked for extended periods and avoids “freezing” of the whole cluster, thus providing high throughput. Furthermore, during commit, if Ignite detects a read/write conflict or a lock conflict between multiple transactions, only one transaction is allowed to commit. All other conflicting transactions are rolled back and an exception is thrown, as explained in the section below.

This sounds pretty similar to how SQL Server’s In-Memory OLTP works.

Asynchronous Mirroring And Asynchronous AGs

Tara Kizer contrasts asynchronous database mirroring with asynchronous Availability Groups:


For asynchronous Database Mirroring, all we need is two servers: the principal at the primary site and the secondary at the DR site. Set up async mirroring between the two, and you’re done. If the secondary server goes down, production still continues. The transaction log isn’t clearing when the log backups occur because the principal still needs to send those log records to the secondary. As long as you have enough disk space where the transaction log resides to support this until the secondary server comes back online, production still continues. Of course, if you run out of disk space, users will start receiving errors. But that can take some time, and oftentimes that is enough time to get the secondary back online.

Availability Groups are more difficult to implement than database mirroring, but the flipside is that they’re much more powerful.
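
For reference, once the mirroring endpoints exist and the database has been restored WITH NORECOVERY on the secondary, the asynchronous setup Tara describes boils down to a few ALTER DATABASE statements. This is a rough sketch; the server, port, and database names are placeholders.

-- On the mirror (DR) server: point the restored database at the principal.
ALTER DATABASE SalesDB SET PARTNER = 'TCP://principal-server.domain.com:5022';

-- On the principal server: point the database at the mirror...
ALTER DATABASE SalesDB SET PARTNER = 'TCP://mirror-server.domain.com:5022';

-- ...and switch from the default safety level to high-performance (asynchronous) mode.
ALTER DATABASE SalesDB SET PARTNER SAFETY OFF;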

Apache Hadoop 3 Alpha

Andrew Wang looks at the new alpha for Hadoop:

Support for Multiple Standby NameNodes

HDFS NameNode high availability with QuorumJournalManager uses a Paxos quorum to store the NameNode edit log. With a three-node quorum, we can tolerate the loss of any one node and still continue operation.

However, business-critical deployments may wish to run with higher levels of fault-tolerance, e.g. a five-node quorum to be able to tolerate the loss of any two nodes.

QuorumJournalManager already supports an arbitrary number of nodes, but fault tolerance was limited since HDFS was only able to run a single active and single standby NameNode. Hadoop 3 eliminates this restriction by supporting running multiple standby NameNodes. This improves the fault tolerance of HDFS.

This one is huge to me.  It was a sad day when I learned that the “secondary” NameNode was no such thing.

Replication Raiserror Procedure Change

Matt Slocum documents a change to a replication stored procedure in SQL Server 2014 SP2:

I ran into this replication error the other day and I was completely stumped.
Procedure or function sp_MSreplraiserror has too many arguments specified.
We started getting that error message shortly after we had applied SP2 for SQL Server 2014 to a server that is a replication Publisher (source of replicated data).

We dug into the commands that were being replicated and found that there were missing rows in the table on the replication Subscriber (destination for replicated data).  Once the rows were populated the errors stopped.  However, after digging in a bit more, we found that this error has an explainable source.

Read on for Matt’s solution to the issue.

Renaming Schemas

Shane O’Neill shows how to rename a schema (and how to get yourself in trouble trying):

I repeat the above: eh…what???

Where did my table go???

Please tell me I didn’t delete the table? It’s a test system and I took a backup before starting but it’s a whole lot of hassle to recreate the table.

Click through for the full story.
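
If you have not run into this before, SQL Server has no direct "rename schema" command; the usual pattern (not necessarily Shane's exact steps) is to create the new schema and transfer objects into it. A minimal sketch with hypothetical names:

-- Hypothetical schema and table names; repeat the TRANSFER for each object in the old schema.
CREATE SCHEMA NewSchema;
GO
ALTER SCHEMA NewSchema TRANSFER OldSchema.MyTable;
GO
DROP SCHEMA OldSchema;  -- only succeeds once the old schema is empty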

Analysis Services Locking And Blocking

Bill Anton shows how to figure out if your Analysis Services performance problem is due to locks:

This past weekend, after one of my presentations on Analysis Services at SQL Saturday 520, an attendee described a performance issue in his environment that he thought could be related to locking and blocking and asked if there was a way to confirm if this was the cause. The short answer is yes, absolutely!

While I wouldn’t say this is a common issue across *all* analysis services environments, it’s definitely common amongst the subset of analysis services environments that do intra-day processing – which is when the SSAS database is being processed during hours of the day when users might be simultaneously querying it.

This is a very interesting article.  I’d never thought about SSAS taking locks, but it makes perfect sense.
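
As a quick first check (a sketch, not necessarily Bill's approach), you can query the Analysis Services schema rowsets from an MDX query window in Management Studio to see current locks and tie them back to sessions and commands:

-- Current locks on the SSAS instance, including ones being waited on.
SELECT * FROM $SYSTEM.DISCOVER_LOCKS;

-- Run separately: active sessions and their current commands, to identify the lock holders.
SELECT * FROM $SYSTEM.DISCOVER_SESSIONS;
SELECT * FROM $SYSTEM.DISCOVER_COMMANDS;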

Formula Bar And Measures Grid Font Sizes

Shabnam Watson shows how to change the font size for a couple of tiny parts of SSAS:

The small font size may be difficult to read for some people. The easiest and fastest way to change the font size is to click somewhere in the formula bar and use your mouse middle wheel while holding down the Control button on your keyboard. However, the font size will go back to the default size next time you open SSDT.

If you wish to change the font size permanently, you can use the Tools menu. Click on Tools –> Options –> Fonts and Colors –> Environment and choose “Business Intelligence Designers” from the drop down menu. Here you can change the font size. Your changes will not take effect until you restart SSDT.

Unfortunately, the surrounding chrome doesn’t change size, so there are hard limits on how far you can scale up.

