Understanding Random Forests

Manish Kumar Barnwal explains how random forest algorithms work:

Say our dataset has 1,000 rows and 30 columns. There are two levels of randomness in this algorithm:

  • At row level: Each of these decision trees gets a random sample of the training data (say 10%) i.e. each of these trees will be trained independently on 100 randomly chosen rows out of 1,000 rows of data. Keep in mind that each of these decision trees is getting trained on 100 randomly chosen rows from the dataset i.e they are different from each other in terms of predictions.
  • At column level: The second level of randomness is introduced at the column level. Not all the columns are passed into training each of the decision trees. Say we want only 10% of columns to be sent to each tree. This means a randomly selected 3 column will be sent to each tree. So for the first decision tree, may be column C1, C2 and C4 were chosen. The next DT will have C4, C5, C10 as chosen columns and so on.

This  is a nice article and includes cases when not to use random forests.

Related Posts

The Microsoft Team Data Science Process Lifecycle Versus CRISP-DM

Melody Zacharias compares Microsoft’s Team Data Science Process lifecycle with the CRISP-DM process: As I pointed out in my previous blog, the TDSP lifecycle is made up of five iterative stages: Business Understanding Data Acquisition and Understanding Modeling Deployment Customer Acceptance This is not very different from the six major phases used by the Cross […]

Read More

Exploratory Analysis With Hockey Data In Power BI

Stacia Varga digs into her hockey data set a bit more: Once I know whether a variable is numerical or categorical, I can compute statistics appropriately. I’ll be delving into additional types of statistics later, but the very first, simplest statistics that I want to review are: Counts for a categorical variable Minimum and maximum […]

Read More


June 2017
« May Jul »