
Day: May 5, 2017

More Advice For Data Scientists

Charles Parker provides more Dijkstra-style wisdom for budding data scientists:

Raise your standards as high as you can live with, avoid wasting your time on routine problems, and always try to work as closely as possible at the boundary of your abilities. Do this because it is the only way of discovering how that boundary should be moved forward.

Readers of this blog post are just as likely as anyone to fall victim to the classic maxim, “When all you have is a hammer, everything is a nail.” I remember a job interview where my interrogator appeared disinterested in talking further after I wasn’t able to solve a certain optimization using Lagrange multipliers. The mindset isn’t uncommon: “I have my toolbox.  It’s worked in the past, so everything else must be irrelevant.”

There’s some good advice in here.


Load Testing Kafka

Satish Bhor shows off Pepper-Box, a load generator which can stress test Apache Kafka:

Pepper-Box is a Kafka load generator application that can be used as a plugin for JMeter or standalone utility. It allows sending plain text Kafka messages (JSON, XML, CSV, or any other custom format), as well as Java serialized objects. Pepper-Box includes a template engine and random data generation function which helps to design message in any format. If we use it with JMeter then we can use all JMeter features. Pepper-Box is very useful in streaming analytics and data pipelines implementation, where input data format is tightly coupled with business problems.

Pepper-Box includes four main components.

I’m going to keep an eye on this tool.
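
To make the "template plus random data" idea concrete, here's a minimal Python sketch. This is not Pepper-Box's template syntax or API; the `render_message` helper, the placeholder names, and the order-shaped JSON are all made up for illustration, and it uses the standard library's `string.Template` in place of a real template engine.

```python
import json
import random
import string

# Hypothetical message template; $id, $name and $amount are placeholders
# invented for this sketch, not Pepper-Box functions.
TEMPLATE = '{"orderId": $id, "customer": "$name", "amount": $amount}'


def render_message(template, rng):
    """Fill the template with randomly generated field values."""
    values = {
        "id": rng.randint(1, 10**6),
        "name": "".join(rng.choice(string.ascii_lowercase) for _ in range(8)),
        "amount": round(rng.uniform(1, 500), 2),
    }
    return string.Template(template).substitute(values)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a test run is repeatable
    for _ in range(3):
        print(render_message(TEMPLATE, rng))
```

Each rendered string is valid JSON, so you could hand it to whatever Kafka producer you already use; the sketch deliberately stops short of sending anything.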


Power BI Premium

James Serra explains the Power BI Premium tier:

For costs, it allows an unlimited number of users since it is priced by aggregate capacity (see Power BI Premium calculator).  Users who need to create content in Power BI will still require a $10/month Power BI Pro seat, but there is no per-seat charge for consumption.

For scale, it runs on dedicated hardware giving capacity exclusively allocated to an organization for increased performance (no noisy neighbors).  Organizations can choose to apply their dedicated capacity broadly, or allocate it to assigned workspaces based on the number of users, workload needs or other factors—and scale up or down as requirements change.

They’re throttling down Power BI Free, making it really just for personal use, but I think the Premium tier will help with pricing for adoption.
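
The pricing trade-off is easy to sketch as arithmetic. The $10/month Pro seat comes from the quote above; the capacity price is not given there, so `capacity_cost` is a parameter you would need to fill in from the Power BI Premium calculator, and the function names are my own.

```python
PRO_SEAT_PER_MONTH = 10  # per-user Pro price from the quoted text, in dollars


def monthly_cost_all_pro(total_users):
    """Everyone, creators and consumers alike, gets a Pro seat."""
    return total_users * PRO_SEAT_PER_MONTH


def monthly_cost_premium(creators, capacity_cost):
    """Only content creators need Pro seats; consumption rides on capacity."""
    return capacity_cost + creators * PRO_SEAT_PER_MONTH


def break_even_consumers(capacity_cost):
    """Number of view-only users above which Premium is the cheaper option."""
    return capacity_cost / PRO_SEAT_PER_MONTH
```

As an example with a made-up capacity price of 5,000 per month, Premium comes out ahead once you have more than 500 view-only users, no matter how many creators you have, since creators pay for Pro seats either way.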


Storm In .Net

Ravi Peri explains how to use Apache Storm in .NET code on HDInsight:

Topology submissions can fail due to many reasons:

  • JDK is not installed or is not in the Path
  • Required java dependencies are not included
  • Incompatible java jar dependencies. Example: Storm-eventhub-spouts-9.jar is incompatible with Storm 1.0.1. If you submit a jar with that dependency, topology submission will fail.
  • Duplicate names for topologies

/var/log/hdinsight-scpwebapi/hdinsight-scpwebapi.out file on active headnode will contain the error details.

At one point, I was big on Storm and really wanted a .NET client for Storm to take off.  Nowadays, I’d rather use Spark Streaming or Kafka Streams for the same kind of streaming data work.
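
If you do submit topologies, the failure causes in Ravi's list lend themselves to a quick pre-submission check. Here's a rough Python sketch under my own assumptions: the `preflight` function and its parameters are hypothetical, the list of running topology names has to come from wherever you track them, and the only known-bad jar it ships with is the one named in the quote. It does not check for missing dependencies.

```python
import shutil


def preflight(topology_name, running_topologies, jar_names,
              incompatible_jars=("storm-eventhub-spouts-9.jar",),
              which=shutil.which):
    """Return a list of likely submission problems (empty if none found).

    `which` is injectable so the PATH lookup can be faked in tests.
    """
    problems = []

    # Cause 1: JDK is not installed or is not on the PATH.
    if which("java") is None:
        problems.append("java not found on PATH")

    # Cause 4: duplicate topology names.
    if topology_name in running_topologies:
        problems.append(f"topology name already in use: {topology_name}")

    # Cause 3: jars known to be incompatible (case-insensitive name match).
    bad = {name.lower() for name in incompatible_jars}
    for jar in jar_names:
        if jar.lower() in bad:
            problems.append(f"incompatible jar: {jar}")

    return problems
```

An empty list only means none of these specific checks tripped; the log file mentioned above is still the place to look when a submission fails.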


Partitioning Nullable Columns

Kenneth Fisher looks at what happens when you use a nullable column as a partition key:

So to start with how does partitioning handle a NULL? If you look in the BOL for the CREATE PARTITION FUNCTION you’ll see the following:

Any rows whose partitioning column has null values are placed in the left-most partition unless NULL is specified as a boundary value and RIGHT is indicated. In this case, the left-most partition is an empty partition, and NULL values are placed in the following partition.

So basically NULLs are going to end up in the left most partition(#1) unless you specifically make a partition for NULL and are using a RIGHT partition. So let’s start with a quick example of where NULL values are going to end up in a partitioned table (a simple version).

Click through to see Kenneth’s proof and the repercussions of making that partitioning column nullable.
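
To get a feel for the rule in the BOL quote, here's a small Python model of how a range partition function assigns a value to a partition number. It's a sketch of the documented behavior, not SQL Server code: `partition_number` is my own name, `None` stands in for NULL, and the model assumes NULL sorts below every other value.

```python
from bisect import bisect_left, bisect_right


def _sort_key(value):
    # NULL (None) is ordered before every non-NULL value.
    return (0, 0) if value is None else (1, value)


def partition_number(value, boundaries, range_right=False):
    """Return the 1-based partition a value lands in.

    RANGE LEFT:  a boundary value belongs to the partition on its left.
    RANGE RIGHT: a boundary value belongs to the partition on its right.
    """
    keys = sorted(_sort_key(b) for b in boundaries)
    key = _sort_key(value)
    if range_right:
        # Count boundaries <= value; the value sits to the right of them.
        return bisect_right(keys, key) + 1
    # Count boundaries strictly < value.
    return bisect_left(keys, key) + 1
```

With boundaries 10 and 20, `partition_number(None, [10, 20])` gives 1 for either range direction. Add `None` as a boundary with `range_right=True` and NULLs move to partition 2, leaving partition 1 empty, which matches the documentation.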


Power BI: Calculated Measures + SSAS Tabular

Shabnam Watson notes that the May updates to Power BI Desktop allow you to create new calculated measures on a report which connects live to a tabular model:

Ideally the SSAS database has all the measures you need but now you have the capability to add new ones if you need to.

You can control the folder (table/measure group) under which the new measure shows up by using the “Home Table” option from the Modeling tab. I really like this feature as you can create copies of the same calculation and send them to different folders for ease of use.

If you’re interested in getting this added to Multidimensional as well, there is a request you can vote on.
