Analyzing The StackLite Dataset

Kevin Feasel



Marco Pasin looks at the StackLite data set:

According to Stack Overflow documentation, these are the categories of questions that may be closed by the community users:

  • duplicated
  • off topic
  • unclear
  • too broad
  • primarily opinion-based
Not everyone in the Stack Overflow community is able to close a question. In fact users need to have certain reputation expressed in points (more details here).

To calculate the overall website closure rate is easy. Just use the original “questions_2016” dataset and count how many questions have the field “Closed Date” populated. Over 10% of questions made in 2016 have been closed so far.

If you’re interested in learning more about data analysis, walk through the exercise as well and play around with the data set too.  Hat tip, R-Bloggers.

Related Posts


John Mount explains the vtreat package that he and Nina Zumel have put together: When attempting predictive modeling with real-world data you quicklyrun into difficulties beyond what is typically emphasized in machine learning coursework: Missing, invalid, or out of range values. Categorical variables with large sets of possible levels. Novel categorical levels discovered during test, cross-validation, or […]

Read More

R 3.4.4 Now Available

David Smith notes that R 3.4.4 is now generally available: R 3.4.4 has been released, and binaries for Windows, Mac, Linux and now available for download on CRAN. This update (codenamed “Someone to Lean On” — likely a Peanuts reference, though I couldn’t find which one with a quick search) is a minor bugfix release, and shouldn’t cause […]

Read More


September 2016
« Aug Oct »