A quick note: I think all the explanations of RBMs get it backwards, except perhaps for Edwin Chen's post on the subject. So I will explain RBMs by first describing their structure, then the Contrastive Divergence algorithm used to learn the weights, and then the statistical motivation for the algorithm.
A restricted Boltzmann machine has a simple structure: a set of visible nodes takes input, and each visible node is connected to every node in a layer of hidden nodes. No visible node is connected to another visible node, and no hidden node is connected to another hidden node. Connections go both ways, and have the same weight in both directions, unlike in a multilayer perceptron, where connections are strictly unidirectional.
Each node's activation is probabilistic, and the probability of activation is given by the sigmoid activation function applied to the weighted sum of the states of the nodes it is connected to, plus a constant bias term. For hidden unit $j$, this is:

$p(h_j = 1 \mid v) = \sigma(c_j + W_j \cdot v)$

where $v$ is the vector of visible unit states, $W_j$ is the vector of weights connecting hidden unit $j$ to the visible units, and $c_j$ is a constant bias term. The activation is equivalent to the probability of the hidden node being 1, given the state of the visible nodes.
Similarly, visible node $i$'s activation probability is:

$p(v_i = 1 \mid h) = \sigma(b_i + W_{\cdot i} \cdot h)$

where $h$ is the vector of hidden unit states, $W_{\cdot i}$ is the $i$-th column of the weight matrix, and $b_i$ is the visible node's bias term.
Typically $b$ and $c$ are not separated out; instead, they are folded into the weight matrix as weights to a bias unit that always outputs one. The code below keeps them as separate arguments for clarity, but folding them in just means each vector fed to the activation functions gets an extra entry that is set to one.
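To make the bias-folding trick concrete, here is a small sketch (the variable names are my own) showing that appending the bias vector to the weight matrix, and a constant one to the state vector, gives the same pre-activation values as keeping the bias separate:

```python
import numpy as np

H, V = 3, 4  # number of hidden and visible nodes
rng = np.random.default_rng(0)
W = rng.standard_normal((H, V))  # weights, size H x V
b = rng.standard_normal(V)       # visible bias, length V

h = np.array([1.0, 0.0, 1.0])    # a hidden state vector

# Separate bias term:
pre_separate = b + np.dot(h, W)

# Folded: W_aug is (H+1) x V, and h_aug has an extra entry fixed at 1.
W_aug = np.vstack([W, b])
h_aug = np.append(h, 1.0)
pre_folded = np.dot(h_aug, W_aug)

assert np.allclose(pre_separate, pre_folded)
```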
In python, this could be coded as:
```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

def visible_activation_probability(h, W, b):
    # h is a vector of length H, the number of hidden nodes
    # W is size H x V
    # b is length V
    # result is length V
    return sigmoid(b + np.dot(h, W))

def hidden_activation_probability(v, W, c):
    # v is a vector of length V, the number of visible nodes
    # W is size H x V, so W.T is size V x H
    # c is length H
    # result is length H
    return sigmoid(c + np.dot(v, W.T))
```
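Since each node's activation is probabilistic, the functions above only give probabilities; to get actual binary states, each node is turned on independently with its activation probability. A minimal sketch of this sampling step, using placeholder random weights:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sample(probabilities, rng):
    # Each node turns on (state 1.0) independently with its activation probability.
    return (rng.random(probabilities.shape) < probabilities).astype(float)

rng = np.random.default_rng(42)
H, V = 2, 3
W = rng.standard_normal((H, V))  # placeholder weights, not learned values
c = np.zeros(H)                  # hidden biases

v = np.array([1.0, 0.0, 1.0])    # a visible state
p_h = sigmoid(c + np.dot(v, W.T))  # hidden activation probabilities
h = sample(p_h, rng)               # binary hidden states, each 0.0 or 1.0
```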
Of course, this does not explain how to find the values in the weight matrix $W$.
The value of this machine is that it can discover the underlying statistical distribution of its inputs. We could use it to fill in partial datasets by activating all the visible nodes that we know, then activating the hidden nodes based on that, then removing the stimulus from the visible nodes and letting them be filled in by the hidden nodes. We can also use it to cluster data. The hidden activations are in some ways analogous to cluster membership of the items.
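The fill-in procedure described above can be sketched directly with the activation functions already defined. This assumes the weights and biases have already been learned; here they are random placeholders, so the output is illustrative only:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(1)
H, V = 4, 6
W = rng.standard_normal((H, V))  # placeholder for learned weights
b = np.zeros(V)                  # visible biases
c = np.zeros(H)                  # hidden biases

# A partial input: the known visible states, with unknowns left at zero.
v_partial = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 0.0])

# Activate the hidden nodes from the visible nodes we know...
p_h = sigmoid(c + np.dot(v_partial, W.T))
h = (rng.random(H) < p_h).astype(float)

# ...then remove the stimulus and let the hidden nodes fill the visibles in.
v_filled = sigmoid(b + np.dot(h, W))  # activation probability for every visible node
```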
In the next post I will explain how the weights are learned, and the posts after that will explain why we approach the problem in this way.