Press "Enter" to skip to content

Day: December 15, 2017

Enabling Exactly-Once Kafka Streams

Guozhang Wang wraps up his series on exactly-once semantics in Kafka:

When restarting the application from the point of failure, we would then try to resume processing from the previously remembered position in the input Kafka topic, i.e. the committed offset. However, since the application was not able to commit the offset of the processed message before crashing last time, upon restarting it would fetch A again. The processing logic will then be triggered a second time to update the state, and generate the output messages. As a result, the application state will be updated twice (e.g. from S’ to S’’) and the output messages will be sent and appended to topic TB twice as well. If, for example, your application is calculating a running count from the input data stream stored in topic TA, then this “duplicated processing” error would mean over-counting in your application, resulting in incorrect results.

Today, many stream processing systems that claim to provide “exactly-once” semantics actually depend on users themselves to cooperate with the underlying source and destination streaming data storage layer like Kafka, because they simply treat this layer as a black box and hence do not try to handle these failure cases at all. Application user code then has to either coordinate with these data systems—for example, via a two-phase commit mechanism—to guarantee no data duplicates, or handle duplicated records that could be generated from the clients talking to these systems when the above-mentioned failure happens.

There’s some good information in here, so check it out.
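
For reference, turning this on in Kafka Streams itself is a one-line configuration change. Below is a minimal Scala sketch, assuming the topic names TA and TB from the quoted example; the application id and broker address are placeholders:

import java.util.Properties

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object ExactlyOnceCopy extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "exactly-once-demo")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  // The key setting: offset commits, state store updates, and writes to the
  // output topic become a single atomic transaction instead of separate steps.
  props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE)
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)

  val builder = new StreamsBuilder()
  // Read from topic TA and write to topic TB, as in the quoted example.
  builder.stream[String, String]("TA").to("TB")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}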

Image Counts For Neural Network Training

Pete Warden shares his rule of thumb for how many images you need to train a neural network:

In the early days I would reply with the technically most correct, but also useless answer of “it depends”, but over the last couple of years I’ve realized that just having a very approximate rule of thumb is useful, so here it is for posterity:

You need 1,000 representative images for each class.

Like all models, this rule is wrong but sometimes useful. In the rest of this post I’ll cover where it came from, why it’s wrong, and what it’s still good for.

Read on to learn where the number 1,000 came from and get some good hints, like flipping and rescaling images.

Error Handling In Scala

Manish Mishra gives a few examples of how to handle errors in Scala:

Try[T] is another construct to capture success or failure scenarios. It returns a value in both cases: put any expression in Try and it will return Success[T] if the expression is successfully evaluated, and Failure[T] otherwise, meaning you are allowed to return the exception as a value. There is one restriction, however: in case of failure it will only return Throwable types:

def validateZipCode(zipCode:String): Try[Int] = Try(zipCode.toInt)

But throwing an exception doesn’t make much sense here, since it is not much of a calculation, although we can use this example to understand the use case. If the given string is not a number, it will be a failure. The value from the Try can be extracted in the same way as from an Option; it can be pattern matched.

As you write more complicated Spark operations, handling errors becomes critical.
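
To round out the quoted example, here is a short sketch of how the value comes back out of the Try, either by pattern matching (just as with Option) or with helpers like getOrElse and toOption. The sample inputs are made up:

import scala.util.{Failure, Success, Try}

object ZipCodeDemo extends App {
  def validateZipCode(zipCode: String): Try[Int] = Try(zipCode.toInt)

  // Extract the value by pattern matching, just as you would with an Option:
  validateZipCode("90210") match {
    case Success(zip) => println(s"Valid zip code: $zip")
    case Failure(ex)  => println(s"Invalid zip code: ${ex.getMessage}")
  }

  // Or skip matching and fall back to a default / convert to an Option:
  val zipOrZero: Int = validateZipCode("not-a-number").getOrElse(0) // 0
  val maybeZip: Option[Int] = validateZipCode("12345").toOption     // Some(12345)
}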

An Introduction To seplyr

John Mount guest blogs on the Revolutions blog about seplyr:

seplyr is an R package that supplies improved standard evaluation interfaces for many common data wrangling tasks.

The core of seplyr is a re-skinning of dplyr’s functionality to seplyr conventions (similar to how stringr re-skins the implementing package stringi).

Read on for a couple of examples where seplyr can be easier to program with than dplyr.

Creating Dynamic Pivot Tables

Ben Richardson shows how to use dynamic SQL to create pivot tables with arbitrary numbers of pivot elements:

The headings of the columns are the individual values inside the city column. We specified these values inside the pivot operator in our query.

The most tedious part of creating pivot tables is specifying the values for the column headings manually. This is the part most prone to errors, particularly if the data in your online data source changes. We cannot be sure that the values we specified in the pivot operator will still be in the database the next time we create this pivot table.

For instance, in our script, we specified London, Liverpool, Leeds and Manchester as values for headings of our pivot table. These values existed in the City column of the student table. What if somehow one or more of these values are deleted or updated? In such cases, null will be returned.

A better approach would be to create a dynamic query that will return a full set of values from the column from which you are trying to generate your pivot table.

Click through to see how to build this.
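
As a rough sketch of that approach (written here in Scala over JDBC for illustration; the School database, the Student table’s Name, City, and Marks columns, and the connection string are stand-ins for the article’s schema and assume the SQL Server JDBC driver), you query the distinct City values first and splice them into the PIVOT column list rather than hard-coding them:

import java.sql.DriverManager

object DynamicPivotSketch extends App {
  // Connection details are placeholders.
  val conn = DriverManager.getConnection(
    "jdbc:sqlserver://localhost;databaseName=School;integratedSecurity=true")

  // Step 1: pull the current set of column values instead of hard-coding them.
  val rs = conn.createStatement().executeQuery("SELECT DISTINCT City FROM Student")
  val cities = Iterator.continually(rs).takeWhile(_.next()).map(_.getString(1)).toList

  // Step 2: build the bracketed column list and splice it into the PIVOT query.
  val columnList = cities.map(c => s"[$c]").mkString(", ")
  val pivotSql =
    s"""SELECT Name, $columnList
       |FROM (SELECT Name, City, Marks FROM Student) AS src
       |PIVOT (SUM(Marks) FOR City IN ($columnList)) AS pvt""".stripMargin

  val pivoted = conn.createStatement().executeQuery(pivotSql)
}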

Selecting Specific Characters In M

Chris Webb points out a new function in Power BI:

It’s very easy to use: the first parameter takes a text value, the second parameter takes either a text value containing a single character or a list of single characters, and it returns the text from the first parameter minus all characters that are not in the second parameter. For example, the expression:

Text.Select("Hello", "l")

…returns the text value “ll”

Click through to see an example of how you can use this to filter out punctuation and other unwanted characters.
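
For comparison only (this is a REPL-style Scala analogue, not M), the same keep-only-these-characters behavior amounts to a one-line filter:

// Keep only the characters in `keep`, dropping everything else --
// the behavior Text.Select describes for its second parameter.
def selectChars(text: String, keep: Set[Char]): String = text.filter(keep)

selectChars("Hello", Set('l'))                                          // "ll"
selectChars("Hello, world!", ('a' to 'z').toSet ++ ('A' to 'Z') + ' ')  // "Hello world"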

Reducing Reads In Queries

Bert Wagner has a few tips for improving query performance by reducing the number of reads:

If SQL Server thinks it is only going to read 1 row of data, but instead needs to read far more rows, it might choose a poor execution plan that results in more reads.

You might get a suboptimal execution plan like above for a variety of reasons, but here are the most common ones I see:

If you had a query that previously ran fine but doesn’t anymore, you might be able to utilize Query Store to help identify why SQL Server started generating suboptimal plans.

Click through for a few more ideas as well.
