What Happens In Deep Neural Networks?

Adrian Colyer has a two-part series summarizing an interesting academic paper on deep neural networks. Part one introduces the theory:

Section 2.4 contains a discussion on the crucial role of noise in making the analysis useful (which sounds kind of odd on first reading!). I don’t fully understand this part, but here’s the gist:

The learning complexity is related to the number of relevant bits required from the input patterns X for a good enough prediction of the output label Y, or the minimal I(X; \hat{X}) under a constraint on I(\hat{X}; Y) given by the IB.

Without some noise (introduced for example by the use of sigmoid activation functions) the mutual information is simply the entropy H(Y), independent of the actual function we’re trying to learn, and nothing in the structure of the points p(y|x) gives us any hint as to the learning complexity of the rule. With some noise, the function turns into a stochastic rule, and we can escape this problem. Anyone with a lay-person’s explanation of why this works, please do post in the comments!
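For a concrete feel for the "mutual information is simply the entropy H(Y)" claim, here is a small sketch in Python. It is not from the paper or from Colyer's post; the helper names (entropy, mutual_information, joint_from_rule) and the four-input toy rules are assumptions made purely for illustration. It shows that two different deterministic rules with the same label balance give the same I(X;Y) = H(Y), and that adding label noise pulls the value below H(Y).

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zeros are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table with rows x, columns y."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(joint.ravel())

def joint_from_rule(f, flip_prob=0.0):
    """Joint p(x, y) for uniform x and a binary rule y = f[x].

    flip_prob is a hypothetical noise knob: the label is flipped with
    that probability, which turns the rule into a stochastic one.
    """
    n = len(f)
    joint = np.zeros((n, 2))
    for x, y in enumerate(f):
        joint[x, y] = (1 - flip_prob) / n
        joint[x, 1 - y] += flip_prob / n
    return joint

# Two different deterministic rules over four inputs, both with balanced labels.
rule_a = [0, 0, 1, 1]
rule_b = [0, 1, 0, 1]

mi_a = mutual_information(joint_from_rule(rule_a))
mi_b = mutual_information(joint_from_rule(rule_b))
h_y = entropy([0.5, 0.5])

# The same rule with 10% label noise.
mi_noisy = mutual_information(joint_from_rule(rule_a, flip_prob=0.1))

print(f"deterministic rule a: I(X;Y) = {mi_a:.3f} bits, H(Y) = {h_y:.3f} bits")
print(f"deterministic rule b: I(X;Y) = {mi_b:.3f} bits")
print(f"noisy rule a:         I(X;Y) = {mi_noisy:.3f} bits")
```

In the deterministic cases H(Y|X) is zero, so I(X;Y) collapses to H(Y) whatever the rule is. With noise, H(Y|X) becomes positive and the mutual information starts to depend on the conditional distribution p(y|x), which is the part the quoted passage says the analysis needs.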

Part two digs in deeper:

The different colours in the chart represent the different hidden layers (and there are multiple points of each colour because we’re looking at 50 different runs all plotted together). On the x-axis is I(X;T), so as we move to the right on the x-axis, the amount of mutual information between the hidden layer and the input X increases. On the y-axis is I(T;Y), so as we move up on the y-axis, the amount of mutual information between the hidden layer and the output Y increases.

I’m used to thinking of progressing through the network layers from left to right, so it took a few moments for it to sink in that it’s the lowest layer which appears in the top-right of this plot (maintains the most mutual information), and the top-most layer which appears in the bottom-left (has retained almost no mutual information before any training). So the information path being followed goes from the top-right corner to the bottom-left traveling down the slope.

This is worth a careful reading.
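The information plane described in part two can be approximated with a few lines of Python. The sketch below is only an illustration under stated assumptions: the untrained two-layer tanh network, the 10-bit inputs, the majority-style label, the 30-bin discretization, and every function name here are made up for the example and are not taken from the paper or from Colyer's post. It computes one (I(X;T), I(T;Y)) point per hidden layer by binning activations.

```python
import numpy as np
from collections import Counter

def entropy_of_labels(labels):
    """Empirical Shannon entropy in bits of a sequence of hashable labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_info_discrete(a, b):
    """Empirical I(A;B) = H(A) + H(B) - H(A,B) for paired discrete samples."""
    return (entropy_of_labels(a) + entropy_of_labels(b)
            - entropy_of_labels(list(zip(a, b))))

def bin_activations(t, n_bins=30):
    """Discretize tanh activations in [-1, 1]; one hashable tuple per sample."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    return [tuple(row) for row in np.digitize(t, edges)]

# Toy data: every 10-bit pattern once, encoded as +/-1 (assumed setup).
n_bits = 10
bits = (np.arange(2 ** n_bits)[:, None] >> np.arange(n_bits)) & 1
X = 2.0 * bits - 1.0
x_ids = list(range(len(X)))                     # each input is its own symbol
y = [int(v) for v in (bits.sum(axis=1) > n_bits // 2)]  # hypothetical label

# An untrained network with random weights, hidden widths 8 and 3.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(n_bits, 8))
W2 = rng.normal(size=(8, 3))
T1 = np.tanh(X @ W1)
T2 = np.tanh(T1 @ W2)

h_x = entropy_of_labels(x_ids)
h_y = entropy_of_labels(y)

# One information-plane point per hidden layer: (I(X;T), I(T;Y)).
points = []
for name, T in [("layer 1", T1), ("layer 2", T2)]:
    t_bins = bin_activations(T)
    i_xt = mutual_info_discrete(x_ids, t_bins)
    i_ty = mutual_info_discrete(t_bins, y)
    points.append((i_xt, i_ty))
    print(f"{name}: I(X;T) = {i_xt:.3f} bits, I(T;Y) = {i_ty:.3f} bits")

print(f"upper bounds: H(X) = {h_x:.3f} bits, H(Y) = {h_y:.3f} bits")
```

Plotting these points for each layer at each training epoch, with I(X;T) on the x-axis and I(T;Y) on the y-axis, would give a chart of the kind the quoted passage describes. The binning estimate is crude and sensitive to the number of bins, so treat the numbers as qualitative.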


