Press "Enter" to skip to content

Author: Kevin Feasel

Kafka And MapR Streams

Ellen Friedman compares and contrasts Apache Kafka with MapR streams:

What’s the difference in MapR Streams and Kafka Streams?

This one’s easy: Different technologies for different purposes. There’s a difference between messagingtechnologies (Apache Kafka, MapR Streams) versus tools for processing streaming data (such as Apache Flink, Apache Spark Streaming, Apache Apex). Kafka Streams is a soon-to-be-released processing tool for simple transformations of streaming data. The more useful comparison is between its processing capabilities and those of more full-service stream processing technologies such as Spark Streaming or Flink.

Despite the similarity in names, Kafka Streams aims at a different purpose than MapR Streams. The latter was released in January 2016. MapR Streams is a stream messaging system that is integrated into the MapR Converged Platform. Using the Apache Kafka 0.9 API, MapR Streams provides a way to deliver messages from a range of data producer types (for instance IoT sensors, machine logs, clickstream data) to consumers that include but are not limited to real-time or near real-time processing applications.

This also includes an interesting discussion of how the same term, “broker,” can be used in two different products in the same general product space and mean two distinct things.

Comments closed

Disable Performance Counters

Dan Taylor shows how to disable SQL Server performance counters using a registry setting:

About 6 or 7 years ago, I had an issue where my SQL Server performance counters were not available when looking within performance monitor. I went thru the steps of unloading and reloading the counters according to Microsoft documentation and by looking at many of the blog posts out there. The performance counters still would not show.

My next step was to dig into the registry and look at the performance counters entry for SQL Server.

I’d have to imagine that the number of good use cases for disabling performance counters is low, but it’s occasionally necessary for troubleshooting.

Comments closed

Finishing The Event Scheduler

Reza Rad has part 3 of his event date and time scheduler up:

This table has three columns: Date, Time, and Duration. I separated the date and time for simplicity of this example. Date to be formatted as YYYYMMDD, and Time as HHMM, and duration as an integer value illustrating hours.

Configuration above means the event starts at 9th of May 2016, at 1:00 pm New Zealand time (this is what my local time is), with duration of 3 hours. I named this table as InputData.

This wraps up his series on Power Query for non-BI developers.

Comments closed

Configure SAP HANA With Impala

Sreedhar Bolneni has a walkthrough on integrating SAP HANA with Impala:

Assuming an existing Cloudera Enterprise cluster with Impala services and HANA instances are running and that the HANA host has access to Impala daemons, configuring the integration is fairly straightforward

  1. Install the Impala ODBC driver on the HANA host.

  2. Configure the Impala data source.

  3. Create remote source and virtual tables using SAP HANA Studio; then test.

There are a lot of screenshots and configuration files to help guide you through.

Comments closed

SSIS + Google Distance Matrix API

Terry McCann shows how to use the Google distance matrix API to calculate a distance from a starting point to an ending point:

The most basic use is to calculate the distance from A to B. This is what we will look at using SSIS. The API takes in an origin and a destination. These can either be a lat & lng combo or an address which we be translated into Lat & lng when executed. To demonstrate the process I have created a database and loaded it with some sample customer and store data. We have 2 customers and 4 stores. These customers are linked to different stores. You can download the example files at the bottom of the page.

I like that this shows just how easy it is to hit an API with an SSIS component.  If I had one wish with this article, I’d use Biml to generate the package rather than talking through the tasks and components.  Regardless, check this out; it’s a great use of the script component.

Comments closed

“RAID” And Backups

Kenneth Fisher explains that you can set up backup strategies similar to different RAID types:

RAID 0

Splitting the backup data between multiple files.

This is actually a fairly common way to speed up a large backup. IO is one of the slowest part of the whole process so by splitting the backup into multiple files we can reduce our backup time by quite a bit. Having multiple files will also increase your restore time but you do have to make sure that all of the files are available. If you lose one file then the whole backup is useless.

This is done by listing multiple locations for the backup.

This is an interesting use of RAID as metaphor.

Comments closed

Data Warehouse Design Tips

Dustin Ryan has part one of a two-part series on data warehouse design best practices:

2. Store additive measures in the data warehouse.

The best type of measures to store in the data warehouse are those measures that can be fully aggregated. A measure that can be fully aggregated is a measure that can be summarized by any dimension or all dimensions and still remain meaningful. For instance, a Sales Amount measure can be summarized by Product, Date, Geography, etc. and still provide valuable insight for the customer.

Measures that cannot be fully aggregated, such as ratios or other percentage type calculations should be handled in the semantic model or the reporting tool. For example, a measure such as Percentage Profit Margin stored in a table cannot be properly aggregated. A better option would be to store the additive measures that are the base for the Percentage Profit Margin, such as Revenue, Cost, Margin, etc. These base measures can be used to calculate the ratio in a query, semantic model, or reporting tool.

The first five tips are non-controverisal and act as a good baseline for understanding warehousing with SQL Server.  Do check it out.

Comments closed

Principal Component Analysis Using R

Nina Zumel delves into principal component regression using R (via R Bloggers):

Data tends to come from databases that must support many different tasks, so it is exactly the case that there may be columns or variables that are correlated to unknown and unwanted additional processes. The reason PCA can’t filter out these noise variables is that without use of y, standard PCA has no way of knowing what portion of the variation in each variable is important to the problem at hand and should be preserved. This can be fixed through domain knowledge (knowing which variables to use), variable pruning and y-aware scaling. Our next article will discuss these procedures; in this article we will orient ourselves with a demonstration of both what a good analysis and what a bad analysis looks like.

All the variables are also deliberately mis-scaled to model some of the difficulties of working with under-curated real world data.

This does read like an academic paper, so it’s pretty heavy reading.  It’s also very good reading from a great writer, so take some time and give it a read if you do data analysis.

Comments closed

Say It With Screenshots

Brent Ozar continues his series on interviewing tactics:

After writing about “For Technical Interviews, Don’t Ask Questions, Show Screenshots”, lots of folks asked what kinds of screenshots I’d show. Here’s this week’s example.

I show each screenshot on a projector (or shared desktop) to the candidate and say:

  1. What’s this screen from?

  2. What does the screen mean?

  3. If it was a server you inherited from someone else, would there be any actions you’d take?

  4. What questions might you want to ask before you take those actions?

  5. Would there be any drawbacks to your actions?

  6. What would be the benefits of your actions?

I have started to use this in interviews and I’m already loving it.  I don’t want people to memorize minutia (“Name all of the policies available in Policy-Based Management”) but if I show a picture of the different policies, that should jog your memory on when you’ve used PBM to solve interesting problems.

Comments closed

Creating M Functions From Parameterized Queries

Chris Webb shows how to take a parameterized query in Power BI and create an M function from it:

All of these examples involve writing M code manually. The big change in the latest version of Power BI Desktop is that you can do the same thing using just the UI.

Let’s take the classic example of combining data from multiple Excel workbooks and update it to show how things work now.

Say you have a folder containing three Excel workbooks containing sales data for January, February and March and you want to load data from all three into a single table into Power BI. The first thing to do is to create a new parameter in Power BI Desktop that returns the filename, including path, of one of the Excel files. Call it ExcelFilePath and configure it as shown here:

This is pretty cool.

Comments closed