StackLite Dataset

David Robinson reports on a new Stack Exchange data set available to the public:

For each Stack Overflow question asked since the beginning of the site, the dataset includes:

  • Question ID
  • Creation date
  • Closed date, if applicable
  • Deletion date, if applicable
  • Score
  • Owner user ID (except for deleted questions)
  • Number of answers
  • Tags

This is ideal for performing analyses such as:

  • The increase or decrease in questions in each tag over time

  • Correlations among tags on questions

  • Which tags tend to get higher or lower scores

  • Which tags tend to be asked on weekends vs weekdays

  • Rates of question closure or deletion over time

  • The speed at which questions are closed or deleted

This is pretty exciting.  Getting good, high-quality data sets for demonstration and pedagogical purposes is time-consuming, so the fact that the Stack Exchange people are tossing one out our way could be a major time-saver.

Related Posts

Bias Correction In Standard Deviation Estimates

John Mount explains how to perform bias correction and explains why it happens so rarely in practice: The bias in question is falling off at a rate of 1/n (where n is our sample size). So the bias issue loses what little gravity it ever may have ever had when working with big data. Most sources of noise will […]

Read More

Explaining Neural Networks With H2O

Shirin Glander explains some of the concepts behind neural networks using H2O as a guide: Before, when describing the simple perceptron, I said that a result is calculated in a neuron, e.g. by summing up all the incoming data multiplied by weights. However, this has one big disadvantage: such an approach would only enable our neural net […]

Read More

Categories

July 2016
MTWTFSS
« Jun Aug »
 123
45678910
11121314151617
18192021222324
25262728293031