Month: June 2016

Incorporating NiFi Into Brownfield Code

Paul Boal discusses how he incorporated Apache NiFi in an existing process:

Typically, data warehousing and ETL tool vendors have recommended against writing your own custom components. After all, the target market for ETL tools is a space where the tools are specifically marketed as reducing the need for “error prone and time consuming” manual coding. When I ran across this tutorial on writing your own NiFi processor, it occurred to me that NiFi is the exact opposite: it’s both open source and designed for extensibility from the ground up. I found it quite reasonable to write a custom NiFi processor that leverages our existing code base.

The existing code is a Java program with separate classes for each device vendor, all with the same interface to abstract the nuances of each vendor from the main data export program. This interface follows a traditional paradigm: login, query, query, query, logout. Given that my input to NiFi above takes in simple username, password, and query criteria arguments, it seems trivial to create a NiFi processor class that adapts the existing code into the NiFi API. Here’s a slightly abbreviated version of the actual code. (In reality, it’s all of 70 lines of code.)
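Boal’s abbreviated code is in the post itself; to give a feel for the shape of such a processor, here is my own hypothetical sketch against the standard NiFi AbstractProcessor API. VendorClient and all of the property names are invented stand-ins for the existing classes, not his actual code:

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.Set;

    import org.apache.nifi.components.PropertyDescriptor;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;
    import org.apache.nifi.processor.util.StandardValidators;

    // Hypothetical stand-in for the existing vendor-specific class (login, query, logout)
    class VendorClient {
        void login(String user, String pass) { /* existing vendor login code */ }
        String query(String criteria)        { return "device data"; }
        void logout()                        { /* existing vendor logout code */ }
    }

    public class VendorExportProcessor extends AbstractProcessor {

        static final PropertyDescriptor USERNAME = new PropertyDescriptor.Builder()
                .name("Username").required(true)
                .addValidator(StandardValidators.NON_EMPTY_VALIDATOR).build();
        static final PropertyDescriptor PASSWORD = new PropertyDescriptor.Builder()
                .name("Password").required(true).sensitive(true)
                .addValidator(StandardValidators.NON_EMPTY_VALIDATOR).build();
        static final PropertyDescriptor CRITERIA = new PropertyDescriptor.Builder()
                .name("Query Criteria").required(true)
                .addValidator(StandardValidators.NON_EMPTY_VALIDATOR).build();

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success").description("Exported device data").build();

        @Override
        protected List<PropertyDescriptor> getSupportedPropertyDescriptors() {
            return List.of(USERNAME, PASSWORD, CRITERIA);
        }

        @Override
        public Set<Relationship> getRelationships() {
            return Set.of(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            // The login/query/logout sequence stays in the existing class;
            // the processor just adapts it to NiFi's property and FlowFile model.
            VendorClient client = new VendorClient();
            client.login(context.getProperty(USERNAME).getValue(),
                         context.getProperty(PASSWORD).getValue());
            String result = client.query(context.getProperty(CRITERIA).getValue());
            client.logout();

            FlowFile flowFile = session.create();
            flowFile = session.write(flowFile,
                    out -> out.write(result.getBytes(StandardCharsets.UTF_8)));
            session.transfer(flowFile, REL_SUCCESS);
        }
    }

The appeal of this adapter pattern is that the vendor classes don’t change at all; NiFi properties map onto the arguments the existing interface already expects.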

In almost any realistic scenario, you’re not going to have the opportunity to start from scratch.  You will always have legacy components, external dependencies, and existing user bases to satisfy.  I like this article because it moves forward from that starting point.

Starting Extended Events Is Just As Fast

Erin Stellato shows that she can create an Extended Events session as quickly as a Profiler trace:

I haven’t gotten a ton of comments, but I did get a few (thank you to those who have responded!), and I decided to take one of them: create a Trace, create an Extended Events session, and see how long each took.  Jonathan has mentioned before that he can create an XE session as fast as a Trace, and I’ve been thinking that I can as well, so I thought I’d test it.  It’s a straightforward Trace versus Extended Events test.  Want to see which is faster?  Watch the video here.

I love the “I would pop up the timer on the screen but I don’t know how to do that” bit; very Friday afternoonish.
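For reference, standing up an Extended Events session really is only a couple of statements. Here’s a minimal sketch, with the event, predicate, and file target chosen arbitrarily by me rather than taken from Erin’s session:

    -- Capture completed batches running longer than one second, to a file target
    CREATE EVENT SESSION [SlowBatches] ON SERVER
    ADD EVENT sqlserver.sql_batch_completed (
        ACTION (sqlserver.sql_text, sqlserver.client_app_name)
        WHERE duration > 1000000)  -- duration is in microseconds
    ADD TARGET package0.event_file (SET filename = N'C:\Temp\SlowBatches.xel');
    GO
    ALTER EVENT SESSION [SlowBatches] ON SERVER STATE = START;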

Qlik Sold For $3 Billion

Alex Woodie reports that Qlik Technologies has been acquired by a private equity firm:

After loading data into a server-based associative, in-memory database, Qlik customers could explore the data in a variety of ways from an AJAX Web GUI, enabling them to create and publish all sorts of reports and dashboards. The approach is not entirely dissimilar to the one taken by its rival, Tableau Software, which has also benefited from the big data boom and the democratization of BI.

The combination of market forces and a keen eye for product development were propellant for growth at Qlik. In 2009, the Radnor, Pennsylvania-based company had 11,400 customers and $157 million in revenues. By 2010, it had grown to 13,000 customers and had an IPO. By 2015, the company boasted 37,000 customers, $612 million in revenue, and a market cap north of $2.8 billion.

Qlik is definitely one of the big players in the visualization market, which includes Tableau and Power BI/SSRS in Gartner’s Leaders quadrant and a bunch of competitors nipping at their heels.

Getting Started With Security Analytics

Michael Schiebel has an introduction to the thought process behind security analytics:

Now, we’re getting somewhere.  Looking at this graph, we see we have four high-level problems we are trying to solve.

  1. (Unknown/Unknown) The first step in realizing that we have a problem is accepting that we may not have the answer.  We may not have the right mental or computational models, or even the right data, to find bad things.

  2. (Known/Unknown) We’ve invested time and energy brainstorming what could happen, sought out and collected the data we believe will help, and created mental and conceptual models that SHOULD detect/visualize these bad things.  Now, we need to hunt and seek to see if we’re right.

  3. (Unknown/Known) We’ve been hunting and seeking for some time, tuning and training our analytical models until they can automatically detect this new bad thing. Now we need to spend some time formalizing our response process for this new use case.

  4. (Known/Known) Great, we’ve matured this use case to a point that we can trust our ability to detect; maybe even to the point of efficient rules/signatures.  We have mature response playbooks written for our SOC analysts to follow.  Now we can feel comfortable enough to design and implement an automated response for this use case.

I think his breakdown is correct, and I would also reiterate that within any organization, all four zones come into play, meaning you have different teams of people working concurrently; you’ll never automate away all the problems.

Remove Chart Clutter

Melissa Yu provides advice on improving your data visualization skills:

Common chart clutter items include:

  • 3-dimensional effects

  • Dark gridlines (use soft gray gridlines or eliminate gridlines when possible)

  • Overuse of bright, bold colors

  • Unnecessary use of all uppercase text (uppercase text is only necessary when calling attention to an element)

Basically, remove every visualization “feature” that Excel 97 gave you…
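To make that concrete, here’s a minimal sketch of a de-cluttered chart in ggplot2 (my own example, not Melissa’s): soft gray major gridlines, no minor gridlines, no 3-D effects, and a single restrained color.

    library(ggplot2)

    # Soft gray major gridlines, no minor gridlines, one restrained fill color
    ggplot(mtcars, aes(x = factor(cyl))) +
      geom_bar(fill = "steelblue") +
      labs(x = "Cylinders", y = "Count") +
      theme_minimal() +
      theme(panel.grid.major = element_line(colour = "grey90"),
            panel.grid.minor = element_blank())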

Power BI Tables Without Data Sources

Chris Webb shows how to create a table in Power BI’s M language without a backing data source:

No data source is needed – this is a way of defining a table value in pure M code. The first parameter of the function takes a list of column names as text values; the second parameter is a list of lists, where each list in the list contains the values on each row in the table.

In the last example the columns in the table were of the data type Any (the ABC123 icon in each column header tells you this), which means that they can contain values of any data type, including numbers, text, dates or even other tables. Here’s an example of this:
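Chris’s actual example is in the post; as my own minimal illustration of the same idea (table and values invented), an untyped #table call looks like this:

    let
        Source = #table(
            {"Fruit", "Quantity"},
            {
                {"Apples", 10},
                {"Oranges", 4}
            }
        )
    in
        Source

Passing a table type as the first parameter instead, e.g. #table(type table [Fruit = text, Quantity = number], …), gives the columns real data types rather than Any.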

This is a helpful trick.

San Francisco Crime Analysis

Vimal Natarajan shows off some R charts using crime incident data:

By analyzing the plot above, we can arrive at the following insights:

  • The number of crimes steadily declines from midnight and is at its lowest during the early morning hours; it then starts increasing and peaks around 6 PM. This is the same insight we arrived at in my previous analysis, but here we have categorized by police district and still see the same pattern.

  • As seen in the previous plot, Park and Richmond districts have the lowest number of crimes throughout the day.

  • As highlighted in red in the plot above, the maximum number of crimes happens in the Southern district around 6 PM.

I would prefer to see code here, but it does serve to give you an idea of what R can do.
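In that spirit, here’s a rough sketch of how such a plot might be produced in R. The Time and PdDistrict column names come from the public SFPD incident extract, but treat the whole thing as my assumption rather than Natarajan’s actual code:

    library(dplyr)
    library(ggplot2)

    # incidents: one row per reported incident, with Time ("HH:MM") and PdDistrict columns
    incidents %>%
      mutate(Hour = as.integer(substr(Time, 1, 2))) %>%
      count(PdDistrict, Hour) %>%
      ggplot(aes(x = Hour, y = n)) +
      geom_line() +
      facet_wrap(~ PdDistrict) +
      labs(x = "Hour of day", y = "Incident count")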

Downloading SQL Express 2016

Dave Mason tries out SQL Server Express 2016:

I’m not a fan of the filename “SQLEXPRADV_x64_ENU.exe”. It’s not very descriptive IMO. But if you hover your mouse over the file, there’s a helpful file description tool tip. I’ll probably rename the file anyway.

The download process has changed significantly and I have to admit I’m surprised that I like it so much. I can be set in my ways and averse to change. But once I launched that initial “SQLServer2016-SSEI-Expr.exe” download, everything made sense.

Think back to SQL Server 2012 Express. Remember the “Choose the download you want” dialog? Those file names aren’t very intuitive. I had to Google them every time to make sure I picked the right one. It was slightly better for SQL Server 2014 Express. But still. Yuck!

Sounds like they’ve improved the download experience for Express edition.

Lipwig

Peter Coates shows how to make Hive EXPLAIN plans a lot prettier:

As you probably know, if you prepend the word EXPLAIN to your SQL query and then run it, Hive prints out a text description of the query plan. This lets you explore the effects of such variations as code changes, the use of ANALYZE, turning the cost-based optimizer (CBO) on and off, and so on. It’s an essential tool for optimizing Hive.

The output of EXPLAIN is far from pretty, but fortunately, a simple pipeline of Linux commands can give you a slick graphical rendition like the one below.
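The general shape of that pipeline is something like the following; the file names and flags here are my assumptions, so check the post for the real invocation:

    # Capture the plan as JSON, convert it to Graphviz dot, render it to SVG
    hive -e "EXPLAIN FORMATTED <your query>" > plan.json
    python lipwig.py plan.json > plan.dot
    dot -Tsvg plan.dot -o plan.svg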

I’m going to have to keep this in mind.

LEN Is For Strings

Kenneth Fisher notes that the LEN function can behave oddly on non-string data types:

Which shows you that the FLOAT had to be converted to VARCHAR. You can see the same thing if you try it with various versions of INT or DATE datatypes as well. Like I said earlier, it’s no big deal with INT or even DATE; those come back in a fairly expected format (INTs look exactly the same, and DATEs come back as ‘YYYY-MM-DD’). FLOAT and REAL, however, are floating-point types, so they don’t always convert the same way. If you do the conversion deliberately, you get this:
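Kenneth’s actual demo is in the post; the behavior is easy to reproduce with a minimal sketch of my own (values invented, and the exact output depends on the value):

    DECLARE @f FLOAT = 12345.6789;

    -- LEN implicitly converts the FLOAT to VARCHAR before counting characters
    SELECT LEN(@f) AS FloatLen;                        -- 7: the implicit conversion yields '12345.7'

    -- Doing the conversion deliberately shows exactly what LEN measured
    SELECT CAST(@f AS VARCHAR(30)) AS FloatAsVarchar;  -- '12345.7' (six significant digits by default)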

Understand your data types; otherwise, it might come back to hurt you later.
