What Happens In Deep Neural Networks?

Adrian Colyer has a two-parter summarizing an interesting academic paper regarding deep neural networks.  Part one introduces the theory:

Section 2.4 contains a discussion on the crucial role of noise in making the analysis useful (which sounds kind of odd on first reading!). I don’t fully understand this part, but here’s the gist:

The learning complexity is related to the number of relevant bits required from the input patterns X for a good enough prediction of the output label Y, or the minimal I(X; \hat{X}) under a constraint on I(\hat{X}; Y) given by the IB.

Without some noise (introduced for example by the use of sigmoid activation functions) the mutual information is simply the entropy H(Y) independent of the actual function we’re trying to learn, and nothing in the structure of the points p(y|x) gives us any hint as to the learning complexity of the rule. With some noise, the function turns into a stochastic rule, and we can escape this problem. Anyone with a lay-person’s explanation of why this works, please do post in the comments!
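A toy calculation may help with the noise point above. This is only a sketch under my own assumptions, not anything from the paper or from Colyer's post: inputs are uniform over a handful of patterns, labels are binary, and "noise" is modeled as a symmetric label flip (a stand-in for the sigmoid-induced noise the quote mentions). The function names are made up for illustration. For any deterministic rule, I(X;Y) comes out equal to H(Y) no matter how the labels are arranged; once the rule is stochastic, the mutual information drops below H(Y).

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def mutual_information(joint):
    """I(X;Y) in bits from a joint table p(x, y), via H(X) + H(Y) - H(X,Y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(joint.ravel())


def joint_for_rule(labels, flip_prob):
    """Joint p(x, y) for uniform x and binary y = labels[x], flipped with flip_prob.

    flip_prob is an illustrative noise model (assumption), not the paper's.
    """
    n = len(labels)
    joint = np.zeros((n, 2))
    for x, y in enumerate(labels):
        joint[x, y] += (1.0 - flip_prob) / n
        joint[x, 1 - y] += flip_prob / n
    return joint


# Two different deterministic rules over four input patterns.
rules = {"blocks": [0, 0, 1, 1], "alternating": [0, 1, 0, 1]}

for name, labels in rules.items():
    h_y = entropy(joint_for_rule(labels, 0.0).sum(axis=0))
    deterministic = mutual_information(joint_for_rule(labels, 0.0))
    noisy = mutual_information(joint_for_rule(labels, 0.1))
    print(f"{name}: H(Y)={h_y:.3f}  I det={deterministic:.3f}  I noisy={noisy:.3f}")
```

With the deterministic rules, the mutual information tells you nothing beyond the label entropy; only the noisy version depends on the conditional distribution p(y|x).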

Part two digs in deeper:

The different colours in the chart represent the different hidden layers (and there are multiple points of each colour because we’re looking at 50 different runs all plotted together). On the x-axis is I(X;T), so as we move to the right on the x-axis, the amount of mutual information between the hidden layer and the input X increases. On the y-axis is I(T;Y), so as we move up on the y-axis, the amount of mutual information between the hidden layer and the output Y increases.

I’m used to thinking of progressing through the network layers from left to right, so it took a few moments for it to sink in that it’s the lowest layer which appears in the top-right of this plot (maintains the most mutual information), and the top-most layer which appears in the bottom-left (has retained almost no mutual information before any training). So the information path being followed goes from the top-right corner to the bottom-left traveling down the slope.
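For anyone who wants to see how a point on that kind of plot could be produced, here is a rough sketch. It assumes a simple equal-width binning estimator over activations in [-1, 1], and it uses hand-built "layers" rather than a trained network; the function names, the bin count, and the toy layers are all my own illustrative choices and may not match the paper's procedure.

```python
from collections import Counter

import numpy as np


def entropy_of_samples(symbols):
    """Empirical Shannon entropy (bits) of a list of hashable symbols."""
    counts = np.array(list(Counter(symbols).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def info_plane_point(x_ids, y, activations, n_bins=30):
    """Estimate (I(X;T), I(T;Y)) for one layer by discretizing its activations.

    x_ids: one integer id per sample identifying the input pattern.
    y: one integer label per sample.
    activations: array of shape (n_samples, n_units) with values in [-1, 1].
    """
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    binned = np.digitize(activations, edges[1:-1])  # bin index per unit
    t = [tuple(row) for row in binned.tolist()]     # layer state per sample
    x = list(np.asarray(x_ids).tolist())
    labels = list(np.asarray(y).tolist())

    h_t = entropy_of_samples(t)
    i_xt = entropy_of_samples(x) + h_t - entropy_of_samples(list(zip(x, t)))
    i_ty = h_t + entropy_of_samples(labels) - entropy_of_samples(list(zip(t, labels)))
    return i_xt, i_ty


# Toy data (assumption): 512 distinct scalar inputs, label is the sign of the input.
u = np.linspace(-1.0, 1.0, 512)
x_ids = np.arange(u.size)
y = (u > 0).astype(int)

# A wide "low" layer that keeps a lot of detail about the input, and a
# narrow, mostly saturated "top" layer that keeps little beyond the label.
low_layer = np.stack([u, np.tanh(u), np.tanh(3.0 * u)], axis=1)
top_layer = np.tanh(5.0 * u)[:, None]

low_point = info_plane_point(x_ids, y, low_layer)
top_point = info_plane_point(x_ids, y, top_layer)
print("low layer (I(X;T), I(T;Y)):", low_point)
print("top layer (I(X;T), I(T;Y)):", top_point)
```

In this toy setup the low layer lands further right on the x-axis than the top layer, while both keep the full bit of information about the label. That matches the layer ordering described in the quote, though not the untrained state it mentions, where the top layer has lost information about Y as well.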

This is worth a careful reading.
