Day: August 3, 2016

I use stacks for writing expression analysers. Generally I like at least two stacks, probably up to four. They tend to be different sizes and may have other differences. If written as objects, the code becomes much cleaner and easier to read. Why do I write expression analysers? You might imagine that you would never need such a thing, but once you have one, reasons for using it just keep appearing. They are handy for parsing document-based hierarchical data structures, for parsing grammars and for creating Domain-Specific Languages (DSLs). A DSL is handy when you’re writing a complex application because it cuts down dependencies, and allows you to develop scripts for complex workflows without recompiling .

What I describe here is a cut-down version of what I use, just to illustrate the way of creating and using stacks. I extend this basic algorithm, originally from Dijstra’s shunting algorithm, into a complete language interpreter. All it needs is a third stack to implement block iterations and ‘while’ loops. Why not use PowerShell itself as your DSL? I’ve tried that, but my experience is that it is pretty-well impossible to eliminate script-injection attacks, and I also find that there isn’t quite enough control.

For a more prosaic usage of the stack in Powershell (as well as bash), you can push your current location, move to a new directory, perform some action in that new directory, and pop your old location off the stack to go back to where you were before. This is particularly useful for those Powershell modules and cmdlets which leave you in a different directory from where you started.

Comments closed

Ingesting E-Mail Into Hadoop

Published 2016-08-03 by Kevin Feasel

Jordan Volz and Stefan Salandy show how to feed e-mails into Hadoop for almost-immediate analysis:

In particular, compliance-related use cases centered on electronic forms of communication, such as archiving, supervision, and e-discovery, are extremely important in financial services and related industries where being “out of compliance” can result in hefty fines. For example, financial institutions are under regulatory pressure to archive all forms of e-communication (email, IM, social media, proprietary communication tools, and so on) for a set period of time. Once data has grown past its retention period, it can then be permanently removed; in the meantime, such data is subject to e-discovery requests and legal holds. Even outside of compliance use cases, most large organizations that are subject to litigation have some form of archive in place for purposes of e-discovery.

Traditional solutions in this area comprise various moving parts and can be quite costly and complex to implement, maintain, and upgrade. By using the Hadoop stack to take advantage of cost-efficient distributed computing, companies can expect significant cost savings and performance benefits.

In this post, as a simple example of this use case, I’ll describe how to set up an open source, real-time ingestion pipeline from the leading source of electronic communication, Microsoft Exchange.

Most of this post is about setting up the interconnections between Exchange and Apache James, and feeding data in. It looks like this will be part 1 of a multi-part series.

Comments closed

Understanding ROC Curves

Published 2016-08-03 by Kevin Feasel

Bob Horton explains ROC curves and shows how to create them in R:

ROC curves are commonly used to characterize the sensitivity/specificity tradeoffs for a binary classifier. Most machine learning classifiers produce real-valued scores that correspond with the strength of the prediction that a given case is positive. Turning these real-valued scores into yes or no predictions requires setting a threshold; cases with scores above the threshold are classified as positive, and cases with scores below the threshold are predicted to be negative. Different threshold values give different levels of sensitivity and specificity. A high threshold is more conservative about labelling a case as positive; this makes it less likely to produce false positive results but more likely to miss cases that are in fact positive (lower rate of true positives). A low threshold produces positive labels more liberally, so it is less specific (more false positives) but also more sensitive (more true positives). The ROC curve plots true positive rate against false positive rate, giving a picture of the whole spectrum of such tradeoffs.

ROC curves are one of the primary techniques for figuring out if a binary classifier “works.”

Comments closed

Structured Streaming

Published 2016-08-03 by Kevin Feasel

Matei Zaharia, et al discuss how to use structured streaming in Apache Spark 2.0:

In Structured Streaming, we tackle the issue of semantics head-on by making a strong guarantee about the system: at any time, the output of the application is equivalent to executing a batch job on a prefix of the data. For example, in our monitoring application, the result table in MySQL will always be equivalent to taking a prefix of each phone’s update stream (whatever data made it to the system so far) and running the SQL query we showed above. There will never be “open” events counted faster than “close” events, duplicate updates on failure, etc. Structured Streaming automatically handles consistency and reliability both within the engine and in interactions with external systems (e.g. updating MySQL transactionally).

If you want to learn more about streaming data using Spark, check this out.

Comments closed

Solving The Cube Processing Mystery

Published 2016-08-03 by Kevin Feasel

SQL Sasquatch has a follow-up on his last post and answers the question of why there was a gap:

Last week I posted some of my perfmon graphs from an SSAS server. I want to model the work happening on an SSAS server during multidimensional cubes processing – both dimension and fact processing.

That’s here:
SSAS Multidimensional Cube Processing – What Kind of Work is Happening?
http://sql-sasquatch.blogspot.com/2016/07/ssas-multidimensional-cube-processing.html

There was a window at the end of the observation period where data was still flowing across the network, CPU utilization was still high enough to indicate activity, but the counters for rows read/converted/written/created per second were all zero. The index rows/sec counter was also zero.

Check it out.

Comments closed

Unicode

Published 2016-08-03 by Kevin Feasel

Aaron Bertrand discusses two Unicode schools of thought:

One of the more common dilemmas schema designers face, and I’m realizing it’s something that I should probably spend more time on in my presentations, is whether to use varchar or nvarchar for columns that will store string data. As part of my #EntryLevel challenge, I thought I’d start by writing a bit about this here.

In general, I come across two schools of thought on this:

Use varchar unless you know you need to support Unicode.
Use nvarchar unless you know you don’t.

My preference is to start with Unicode. 15 years ago, you could easily get away with using ASCII for most US-developed systems, but the likelihood that you will need to store data in multiple, varied languages is significantly higher today. And having already refactored one application to support Unicode after it became fairly large, I’d rather not do that again…

Comments closed

Human-Readable Ranges

Published 2016-08-03 by Kevin Feasel

Daniel Hutmacher shows us how to build human-readable ranges of integers and dates:

This is a real-world problem that I came across the other day. In a reporting scenario, I wanted to output a number of values in an easy, human-readable way for a report. But just making a long, comma-separated string of numbers doesn’t really make it very readable. This is particularly true when there are hundreds of values.

So here’s a powerful pattern to solve that task.

I really like this. It takes the gaps & islands problem and goes one step further.

Comments closed

Azure ML Updates

Published 2016-08-03 by Kevin Feasel

David Smith walks us through new language engines supported in Azure ML:

ML studio now gives you even more flexibility, with new language engines supported in the language modules. Within the Execute Python Script module, you can now choose to use Python 2.7.11 or Python 3.5, both of which run within the Acaconda 4.0 distribution. And within the Execute R Script module, you can now choose Microsoft R Open 3.2.2 as your R engine, in addition to the existing CRAN R 3.1.0 engine. Microsoft R Open 3.2.2 not only gives you a newer R language engine, it also gives you access to a wealth of new R packages for use within ML Studio. Over 400 packages are pre-installed for use with the R Script module, and you can install and use any other R package (including CRAN packages and your own R packages) via the Script Bundle input port.

I’m interested in the Microsoft R Open language support, as Azure ML’s still using a relatively older version of R (3.1.0).

Comments closed

vNet Peering Within An Azure Region

Published 2016-08-03 by Kevin Feasel

Denny Cherry reports that there is a public preview of a feature to allow vNet peering without setting up a site-to-site VPN connection:

Up until August 1st if you had 2 vNets in the same Azure region (USWest for example) you needed to create a site to site VPN between them in order for the VMs within each vNet to be able to see each other. I’m happy to report that this is no longer the case (it is still the default configuration). On August 1st, 2016 Microsoft released a new version of the Azure portal which allows you to enable vNet peering between vNets within an account.

Now this feature is in public preview (aka. Beta) so you have to turn it on, which is done through Azure PowerShell. Thankfully it uses the Register-AzureRmProviderFeature cmdlet so you don’t need to have the newest Azure PowerShell installed, just something fairly recent (I have 1.0.7 installed). To enable the feature just request to be included in the beta like so (don’t forget to login with add-AzureRmAccount and then select-AzureRmSubscription).

Read the whole thing for details on how to enroll in this feature and how to set it up.

Comments closed

Thinking About Index Design

Published 2016-08-03 by Kevin Feasel

Jeremiah Peschka looks at a scenario in which a heap might be superior to a clustered index:

In this case, we have to assume that Event IDs may be coming from anywhere and, as such, may not arrive in order. Even though we’re largely appending to the table, we may not be appending in a strict order. Using a clustered index to support the table isn’t the best option in this case – data will be inserted somewhat randomly. We’ll spend maintenance cycles defragmenting this data.

Another downside to this approach is that data is largely queried by Owner ID. These aren’t unique, and one Owner IDcould have many events or only a few events. To support our querying pattern we need to create a multi-column clustering key or create an index to support querying patterns.

This result is not intuitive to me, and I recommend reading the whole thing.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31