Data Modeling And Neural Networks

2018-03-09

I have two new posts in my launching a data science project series.  The first one covers data modeling theory:

Wait, isn’t self-supervised learning just a subset of supervised learning?  Sure, but it’s pretty useful to look at on its own.  Here, we use heuristics to guesstimate labels and train the model based on those guesstimates.  For example, let’s say that we want to train a neural network or Markov chain generator to read the works of Shakespeare and generate beautiful prose for us.  The way the recursive model would work is to take what words have already been written and then predict the most likely next word or punctuation character.

We don’t have “labeled” data within the works of Shakespeare, though; instead, our training data’s “label” is the next word in the play or sonnet.  So we train our model based on the chains of words, treating the problem as interdependent rather than a bunch of independent words just hanging around.
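To make the "label is the next word" idea concrete, here is a minimal sketch of a Markov chain generator in plain Python. It is not the code from the linked post: the function names `build_chain` and `generate` are made up for illustration, and it assumes the simplest possible setup, where only the single most recent word is used to predict the next one.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    # Each (current word, next word) pair is one self-labeled training example:
    # the "label" for a word is simply the word that comes after it.
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, seed=None):
    """Walk the chain, sampling each next word from the observed followers."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        followers = chain.get(output[-1])
        if not followers:  # dead end: this word never had a successor
            break
        output.append(rng.choice(followers))
    return " ".join(output)


sample = "to be or not to be that is the question"
chain = build_chain(sample)
print(chain["to"])  # ['be', 'be']
print(generate(chain, "to", 8, seed=1))
```

Because repeated followers stay in the list, sampling with `rng.choice` picks more frequent continuations more often. A neural network approach would replace the lookup table with a learned model, but the training pairs are built the same way.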

At this point, I want to build the Keras model. I’m creating a `build_model` function in case I want to run this over and over. In a real-life scenario, I would perform various optimizations, do cross-validation, etc. In this scenario, however, I am just going to run one time against the full training data set, and then evaluate it against the test data set.

Inside the function, we start by declaring a Keras model. Then, I add three layers to the model. The first layer is a dense (fully-connected) layer which accepts the training data as inputs and uses the Rectified Linear Unit (ReLU) activation mechanism. This is a decent first guess for activation mechanisms. We then have a dropout layer, which reduces the risk of overfitting on the training data. Finally, I have a dense layer for my output, which will give me the salary.
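As a rough illustration of what those three layers compute, here is a plain-Python sketch of a forward pass. This is not Keras and not the post's `build_model` function: the helper names, the tiny weights, and the dropout rate of 0.2 are all assumptions chosen to keep the arithmetic easy to follow.

```python
import random


def relu(x):
    """Rectified Linear Unit: pass positive values through, clamp negatives to 0."""
    return max(0.0, x)


def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: one weighted sum (plus bias) per output unit."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(total) if activation else total)
    return outputs


def dropout(values, rate, rng, training=True):
    """Zero each value with probability `rate`, rescaling survivors (inverted dropout)."""
    if not training:
        return list(values)  # dropout is a no-op at prediction time
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def predict(features, hidden_w, hidden_b, out_w, out_b,
            rate=0.2, training=False, rng=None):
    """Dense + ReLU, then dropout, then a single-unit dense output layer."""
    rng = rng or random.Random(0)
    hidden = dense(features, hidden_w, hidden_b, activation=relu)
    hidden = dropout(hidden, rate, rng, training)
    return dense(hidden, out_w, out_b)[0]  # one output unit, no activation


# Hypothetical weights for a 2-input, 2-hidden-unit network.
hidden_w = [[0.5, -1.0], [1.0, 1.0]]
hidden_b = [0.0, 0.5]
out_w = [[2.0, 3.0]]
out_b = [1.0]

print(predict([1.0, 2.0], hidden_w, hidden_b, out_w, out_b))  # 11.5
```

The output layer has no activation because the target is a continuous value (a salary), so the raw weighted sum is the prediction. Dropout only changes anything during training, which is why `predict` defaults to `training=False`.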

I compile the model using the `RMSProp` optimizer. This is a good default optimizer for neural networks, although you might try `Adagrad`, `Adam`, or `AdaMax` as well. Our loss function is Mean Squared Error, a common choice for measuring error in a regression. Finally, I’m interested in the Mean Absolute Error, that is, the average dollar amount difference between our function’s prediction and the actual salary. The closer to \$0 this is, the better.
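To show how the two measures differ, here is a small sketch that computes both by hand. The salary figures are invented for illustration and do not come from the post's data set.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences; large misses are penalized heavily."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of absolute differences; stays in the same units as the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up salaries in dollars, purely for demonstration.
actual = [50000, 60000, 70000]
predicted = [53000, 60000, 67000]

print(mean_squared_error(actual, predicted))   # 6000000.0 (squared dollars)
print(mean_absolute_error(actual, predicted))  # 2000.0 (dollars)
```

The squared error is hard to read directly because its units are dollars squared, which is one reason to report Mean Absolute Error alongside it: an MAE of 2000.0 means the predictions are off by \$2,000 on average.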

Click through for those two posts, including seeing how close I get to a reasonable model with my neural network.

vtreat

2018-03-16

John Mount explains the vtreat package that he and Nina Zumel have put together: When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework: Missing, invalid, or out of range values. Categorical variables with large sets of possible levels. Novel categorical levels discovered during test, cross-validation, or […]

R 3.4.4 Now Available

2018-03-16

David Smith notes that R 3.4.4 is now generally available: R 3.4.4 has been released, and binaries for Windows, Mac, and Linux are now available for download on CRAN. This update (codenamed “Someone to Lean On”, likely a Peanuts reference, though I couldn’t find which one with a quick search) is a minor bugfix release, and shouldn’t cause […]