Press "Enter" to skip to content

Category: Data Science

Kafka Integration with Knime

Swantika Gupta shows off some of Knime’s ability to integrate with Apache Kafka:

Knime Analytics Platform provides its users a way to consume messages from Apache Kafka and publish the transformed results back to Kafka. This allows users to integrate their Knime workflows easily with a distributed streaming pub-sub mechanism.

With Knime 3.6+, users get a Kafka extension with three new nodes:
1. Kafka Connector
2. Kafka Consumer
3. Kafka Producer

Click through to see how to configure each and how to enrich your data with Knime.


Real-World Sentiment Analysis Examples

Ines Roldos shares a few examples of sentiment analysis:

Net Promoter Score (NPS) surveys are one of the most common ways of knowing how customers perceive a product or service. Basically, they consist of two stages: first, you ask a customer to score a business from 0 to 10, then you ask them to give reasons for their score with an open-ended question.

When it comes to processing the results, the first stage is easy: you just have to calculate the average score. But when it comes to analyzing tons of open-ended NPS responses, the analysis becomes more complicated. Imagine if your team had to tag hundreds of responses manually. Not only would it be a tedious and time-consuming task, it could also lead to inconsistent results derived from different criteria during the tagging process.

Fortunately, sentiment analysis enables you to process large volumes of NPS responses and obtain consistent results in a very fast and simple way.

It might just be the industry I’m in, but I don’t really get excited about sentiment analysis. Still, don’t let my biases influence your thought process too much.
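If you do want a feel for what automated tagging of open-ended responses looks like, here is a toy sketch in R using a tiny hand-rolled lexicon. This is purely illustrative and not the article’s approach; production systems use trained models.

```r
# Toy sentiment scoring of open-ended NPS responses against a tiny
# hand-rolled lexicon (illustrative only; real systems use trained models).
positive <- c("great", "love", "easy", "fast", "helpful")
negative <- c("slow", "bug", "confusing", "expensive", "bad")

score_response <- function(text) {
  words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  sum(words %in% positive) - sum(words %in% negative)
}

responses <- c(
  "Love the product and support was helpful",
  "Too expensive and the app is slow"
)
sapply(responses, score_response, USE.NAMES = FALSE)  # > 0 positive, < 0 negative
```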


Demand Forecasting with Knime

Shubham Goyal walks us through using Knime for a product demand forecasting scenario:

In this blog, we are going to see the importance of demand forecasting and how we can easily create these forecasting workflows with Knime.

Market demand forecasting is a basic procedure for any business, but perhaps none more so than for those in consumer packaged goods. Inventory, production, storage, shipping, marketing – every aspect of CPG and retail organizations’ activities is influenced by accurate forecasting. Identifying shoppers’ preferences and their likelihood to buy helps these organizations make better choices about product offerings, entering new markets, and their supply chains; ensure that stores are stocked; and limit the risk of stock shortages or surpluses.

Click through for the process.


Optimizing a Poisson Survival Model

Joshua Entrop shows off optimx() in R to perform a survival analysis:

In this blog post, we will fit a Poisson regression model by maximising its likelihood function using optimx() in R. As an example we will use the lung cancer data set included in the {survival} package. The data set includes information on 228 lung cancer patients from the North Central Cancer Treatment Group (NCCTG). Specifically, we will estimate the survival of lung cancer patients by sex and age using a simple Poisson regression model. You can download the code that I will use throughout this post here.

Read the whole thing. H/T R-bloggers.
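For a rough sense of the approach, here is a sketch along the same lines using base R’s optim() rather than the post’s optimx() – a constant-hazard Poisson model with log follow-up time as the offset. This is my own condensed version, not Joshua’s code.

```r
library(survival)  # provides the NCCTG lung cancer data as `lung`

dat   <- na.omit(lung[, c("time", "status", "age", "sex")])
d     <- as.numeric(dat$status == 2)     # 1 = death, 0 = censored
t_obs <- dat$time                        # follow-up time in days
X     <- cbind(1, dat$sex, dat$age)      # intercept, sex, age

# Negative log-likelihood of a Poisson model for the death indicator,
# with log(time) as the offset (a constant-hazard / exponential model)
negll <- function(beta) {
  eta <- X %*% beta + log(t_obs)
  -sum(d * eta - exp(eta))
}

fit <- optim(c(-5, 0, 0), negll, method = "BFGS", hessian = TRUE)
fit$par                          # coefficients on the log-hazard scale
sqrt(diag(solve(fit$hessian)))   # approximate standard errors
```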


Splitting Data with T-SQL

Chris Hyde shows a few techniques for splitting out data into training, testing, and validation sets:

We see right away that this method failed horribly as all of the data was placed into the same dataset. This holds true no matter how many times we execute the code, and it happens because the RAND() function is only evaluated once for the whole query, and not individually for each row. To correct this we’ll instead use a method that Jeff Moden taught me at a SQL Saturday in Detroit several years ago – generating a NEWID() for each row, using the CHECKSUM() function to turn it into a random number, and then the % (modulo) operator to turn it into a number between 0 and 99 inclusive.

I’d have to test it out, but I’d think you could modify method 3 to include a CROSS APPLY to perform one ABS(CHECKSUM(NEWID())) and get exact counts that way without a temp table.
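To make the per-row bucketing idea concrete, here is the same logic sketched in R rather than T-SQL (my own illustration, not Chris’s code, and the 60/20/20 proportions are just an example): draw one random integer between 0 and 99 per row and split on it.

```r
# One random bucket in [0, 99] per row, then a 60/20/20 split -- the same
# idea as the NEWID()/CHECKSUM() modulus trick, sketched in R for illustration.
set.seed(42)
df <- data.frame(id = 1:1000)
bucket <- sample(0:99, nrow(df), replace = TRUE)

training   <- df[bucket < 60, , drop = FALSE]
testing    <- df[bucket >= 60 & bucket < 80, , drop = FALSE]
validation <- df[bucket >= 80, , drop = FALSE]

c(nrow(training), nrow(testing), nrow(validation))  # roughly 600 / 200 / 200
```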


The Basics of Randomized Response

Holger von Jouanne-Diedrich explains how randomized response can protect any single person’s opinion from a pollster while providing insight on the whole population:

So, is there a method to find the respective proportion of people without putting them on the spot? Actually, there is! If you want to learn about randomized response (and how to create flowcharts in R along the way) read on!

The question is: how can you get a truthful result overall without being able to attribute a certain answer to any single individual? As it turns out, there is a very elegant and ingenious method, called randomized response. The big idea is, as the name suggests, to add noise to every individual answer in such a way that it cancels out in the aggregate, leaving the overall proportion intact.

Click through for the process. It’s definitely a clever idea.
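As a minimal sketch of the mechanics (my own toy version, not the post’s flowchart-based walkthrough): each respondent flips a coin; on heads they answer truthfully, on tails they flip a second coin and report that instead. Because we know exactly how much noise the coins inject, we can subtract it back out at the aggregate level.

```r
# Randomized response with two fair coins: heads on the first coin means
# answer truthfully; tails means report the second coin flip instead.
set.seed(1)
n <- 10000
true_prop <- 0.3                    # the sensitive trait we want to estimate
truth <- rbinom(n, 1, true_prop)

coin1 <- rbinom(n, 1, 0.5)
coin2 <- rbinom(n, 1, 0.5)
answer <- ifelse(coin1 == 1, truth, coin2)   # noisy, deniable answers

# P(yes) = 0.5 * true_prop + 0.25, so invert that to recover the estimate
estimate <- (mean(answer) - 0.25) / 0.5
estimate   # close to 0.3, yet no single answer reveals anything
```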


Contrasting ANOVA against Regression

Stephanie Glen contrasts ANOVA against typical regression techniques using a picture:

If you scour the internet for “ANOVA vs Regression”, you might be confused by the results. Are they the same? Or aren’t they? The answer is that they can be the same procedure, if you set them up to be that way. But there are differences between the two methods. This one picture sums up those differences.

Click through to see that image.
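One way to convince yourself of the overlap is to run both procedures on the same toy data: a one-way ANOVA is a linear regression on group indicators, so the F statistic and p-value match. (Simulated data, my own illustration, not from the article.)

```r
# One-way ANOVA and a linear regression on the same grouped data
set.seed(7)
df <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  y     = c(rnorm(30, 10), rnorm(30, 12), rnorm(30, 11))
)

summary(aov(y ~ group, data = df))   # classic ANOVA table
anova(lm(y ~ group, data = df))      # same F statistic and p-value
```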


Comparing Gradient Descent to the Normal Equation for Small Data Sets

Pushkara Sharma compares two techniques for regression:

In this article, we will see the actual difference between gradient descent and the normal equation in a practical approach. Most newbie machine learning enthusiasts learn about gradient descent while studying linear regression and move on without even knowing about the most underestimated Normal Equation, which is far less complex and provides very good results for small to medium-sized datasets.

If you are new to machine learning, or not familiar with the normal equation or gradient descent, don’t worry; I’ll try my best to explain these in layman’s terms. So, I will start by explaining a little about the regression problem.

I was surprised by the results.
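To see the contrast concretely, here is a small R sketch (my own, not the article’s code) fitting ordinary least squares both ways on simulated data: the closed-form normal equation and a plain batch gradient descent loop.

```r
# Ordinary least squares two ways: normal equation vs gradient descent
set.seed(3)
n <- 200
X <- cbind(1, runif(n, 0, 10))     # intercept column + one feature
y <- X %*% c(2, 0.5) + rnorm(n)    # true coefficients: 2 and 0.5

# Normal equation: beta = (X'X)^-1 X'y, solved without explicit inversion
beta_ne <- solve(t(X) %*% X, t(X) %*% y)

# Gradient descent: repeatedly step against the gradient of the MSE
beta_gd <- c(0, 0)
learning_rate <- 0.01
for (i in 1:5000) {
  gradient <- -2 / n * t(X) %*% (y - X %*% beta_gd)
  beta_gd  <- beta_gd - learning_rate * gradient
}

data.frame(normal_equation  = as.vector(beta_ne),
           gradient_descent = as.vector(beta_gd))
```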


Understanding the Bayesian Nature of Kalman Filters

Holger von Jouanne-Diedrich gives us an interesting interpretation of Kalman filters:

The Kalman filter is a very powerful algorithm to optimally include uncertain information from a dynamically changing system to come up with the best educated guess about the current state of the system. Applications include (car) navigation and stock forecasting. If you want to understand how a Kalman filter works and build a toy example in R, read on!

The following post is based on the post “Das Kalman-Filter einfach erklärt” (“The Kalman Filter Simply Explained”), which is written in German and uses Matlab code (so basically two languages nobody is interested in any more 😉). This post is itself based on the online course “Artificial Intelligence for Robotics” by my colleague Professor Sebastian Thrun of Stanford University.

In fairness, I regret only one thing about learning German: that I’ve forgotten so much over the years.
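As a taste of the predict/update cycle, here is a condensed one-dimensional version in R (my own sketch in the spirit of the post, not its code): the state is a single Gaussian whose uncertainty shrinks with each measurement and grows with each motion step.

```r
# 1-D Kalman filter: alternate measurement updates and motion updates
measure <- function(mu, sig, z, meas_var) {
  new_mu  <- (meas_var * mu + sig * z) / (sig + meas_var)
  new_sig <- 1 / (1 / sig + 1 / meas_var)
  c(new_mu, new_sig)
}
move <- function(mu, sig, u, move_var) {
  c(mu + u, sig + move_var)   # uncertainty grows when we move
}

measurements <- c(5, 6, 7, 9, 10)
motions      <- c(1, 1, 2, 1, 1)
state <- c(0, 10000)           # start with a very uncertain prior
for (i in seq_along(measurements)) {
  state <- measure(state[1], state[2], measurements[i], meas_var = 4)
  state <- move(state[1], state[2], motions[i], move_var = 2)
}
state   # posterior mean and variance after all updates
```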


The Basics of Autoregressive Models

Holger von Jouanne-Diedrich explains some of the principles of autoregressive models through a demonstration:

Well, this seems to be good news for the sales team: rising sales! Yet, how does this model arrive at those numbers? To understand what is going on we will now rebuild the model. Basically, everything is in the name already: auto-regressive, i.e. a (linear) regression on (a delayed copy of) itself (auto is Ancient Greek for “self”)!

So, what we are going to do is create a delayed copy of the time series and run a linear regression on it. We will use the lm() function from base R for that (see also Learning Data Science: Modelling Basics).

Read on for some additional understanding.
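A bare-bones version of that rebuild looks something like this (simulated AR(1) data and my own sketch, not the post’s sales series): shift the series by one step and regress it on itself with lm().

```r
# Rebuild an AR(1) by hand: regress the series on a lagged copy of itself
set.seed(11)
n <- 200
x <- numeric(n)
for (i in 2:n) x[i] <- 5 + 0.7 * x[i - 1] + rnorm(1)

lagged  <- x[-n]      # x_{t-1}
current <- x[-1]      # x_t
fit <- lm(current ~ lagged)
coef(fit)             # intercept ~ 5, slope ~ 0.7

predict(fit, newdata = data.frame(lagged = x[n]))   # one-step-ahead forecast
```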
