Adrian Colyer has a two-parter summarizing an interesting academic paper regarding deep neural networks. Part one introduces the theory:
Section 2.4 contains a discussion of the crucial role of noise in making the analysis useful (which sounds kind of odd on first reading!). I don’t fully understand this part, but here’s the gist:
The learning complexity is related to the number of relevant bits required from the input patterns $X$ for a good enough prediction of the output label $Y$, or the minimal $I(X;\hat{X})$ under a constraint on $I(\hat{X};Y)$ given by the IB.
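For context, the IB trade-off being invoked here is the one introduced by Tishby, Pereira, and Bialek: find a compressed representation $\hat{X}$ of the input that remains predictive of the label. The constrained form quoted above is usually written in Lagrangian form with a multiplier $\beta$:

$$\min_{p(\hat{x} \mid x)} \; I(X;\hat{X}) \;-\; \beta \, I(\hat{X};Y)$$

The first term rewards compressing the input; the second rewards preserving information about the output, with $\beta$ setting the exchange rate between the two.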
Without some noise (introduced for example by the use of sigmoid activation functions) the mutual information is simply the entropy $H(X)$, independent of the actual function we’re trying to learn, and nothing in the structure of the points gives us any hint as to the learning complexity of the rule. With some noise, the function turns into a stochastic rule, and we can escape this problem. Anyone with a lay-person’s explanation of why this works, please do post in the comments!
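Here’s a toy numerical illustration of the noise point (my own sketch, not code from either post). For a deterministic layer $T = f(X)$ we have $H(T|X) = 0$, so $I(X;T) = H(T)$; and when $f$ is injective on the inputs, $H(T) = H(X)$ regardless of which $f$ we chose. Add noise and the mutual information suddenly depends on the geometry of $f$:

```python
import numpy as np

rng = np.random.default_rng(0)

def discrete_mi(x_idx, t_idx):
    """Plug-in estimate of I(X;T) in bits from discrete samples."""
    joint = np.zeros((x_idx.max() + 1, t_idx.max() + 1))
    np.add.at(joint, (x_idx, t_idx), 1)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    pt = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ pt)[nz])))

support = np.linspace(-2.0, 2.0, 16)   # X uniform over 16 values: H(X) = 4 bits
x = rng.choice(support, size=200_000)
x_idx = np.searchsorted(support, x)

for name, f in [("3x+1", lambda v: 3 * v + 1), ("tanh", np.tanh)]:
    # Deterministic layer: f is injective here, so I(X;T) = H(X) (about 4 bits)
    # for *both* functions -- the mutual information tells us nothing about f.
    t = f(x)
    t_idx = np.unique(t, return_inverse=True)[1]
    # Noisy layer: T = f(X) + Gaussian noise, then discretized. How much the
    # noisy outputs overlap now depends on f, so the MI is function-dependent.
    t_noisy = t + rng.normal(0.0, 0.5, size=t.shape)
    edges = np.histogram_bin_edges(t_noisy, bins=64)
    t_noisy_idx = np.clip(np.digitize(t_noisy, edges) - 1, 0, len(edges) - 2)
    print(f"{name:5s}  deterministic: {discrete_mi(x_idx, t_idx):.2f} bits   "
          f"noisy: {discrete_mi(x_idx, t_noisy_idx):.2f} bits")
```

Both deterministic estimates come out at $H(X) \approx 4$ bits, while the noisy ones differ: the affine map keeps the 16 inputs well separated relative to the noise, whereas tanh squashes them together and so destroys more information.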
Part two digs in deeper:
The different colours in the chart represent the different hidden layers (and there are multiple points of each colour because we’re looking at 50 different runs all plotted together). On the x-axis is $I(X;T)$, so as we move to the right, the amount of mutual information between the hidden layer and the input increases. On the y-axis is $I(T;Y)$, so as we move up, the amount of mutual information between the hidden layer and the output increases.
I’m used to thinking of progressing through the network layers from left to right, so it took a few moments to sink in that it’s the lowest layer that appears in the top-right of this plot (it retains the most mutual information with the input), and the top-most layer that appears in the bottom-left (it has retained almost no mutual information before any training). So the information path goes from the top-right corner down the slope to the bottom-left.
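To make those axes concrete, here’s a compact sketch of how a single hidden layer gets placed on the information plane. It’s stand-in code of my own, but it follows the binning estimator used in the paper: discretize each unit’s activation, treat the layer’s binned pattern as one discrete variable $T$, and compute $I(X;T)$ and $I(T;Y)$ from empirical counts. Because $T$ is a deterministic function of the (distinct) inputs, $H(T|X) = 0$ and $I(X;T)$ reduces to $H(T)$:

```python
import numpy as np

def layer_plane_coords(t, y, n_bins=30):
    """Estimate (I(X;T), I(T;Y)) for one layer's activations t and labels y."""
    # Bin each unit's activation over the tanh range [-1, 1]; each row of
    # `binned` is then one discrete "state" of the whole layer.
    binned = np.digitize(t, np.linspace(-1.0, 1.0, n_bins))
    state = np.unique(binned, axis=0, return_inverse=True)[1].ravel()

    def entropy(labels):
        p = np.bincount(labels) / len(labels)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    i_xt = entropy(state)  # H(T); equals I(X;T) since H(T|X) = 0
    h_t_given_y = sum((y == c).mean() * entropy(state[y == c])
                      for c in np.unique(y))
    return i_xt, i_xt - h_t_given_y  # I(T;Y) = H(T) - H(T|Y)

# Stand-in "network": random tanh projections of random inputs.
rng = np.random.default_rng(1)
x = rng.normal(size=(4096, 12))
y = (x.sum(axis=1) > 0).astype(int)           # a simple binary label
t1 = np.tanh(x @ rng.normal(size=(12, 8)))    # lower (wider) hidden layer
t2 = np.tanh(t1 @ rng.normal(size=(8, 2)))    # upper (narrower) hidden layer

for name, t in [("lower layer", t1), ("upper layer", t2)]:
    i_xt, i_ty = layer_plane_coords(t, y)
    print(f"{name}: I(X;T) = {i_xt:.2f} bits, I(T;Y) = {i_ty:.2f} bits")
```

Even on this untrained toy setup, the lower layer lands further to the right (higher $I(X;T)$) than the narrow upper layer, matching the orientation described above.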
This is worth a careful reading.