*A quick note: I think all the explanations of RBMs get it backwards, except perhaps for Edwin Chen's post on the subject. As such, I will explain RBMs first by explaining their structure, then the Contrastive Divergence algorithm used to learn the weights, then the statistical motivation for the algorithm.*

A restricted Boltzmann machine has a simple structure: a set of visible nodes takes input, and each visible node is connected to every node in a layer of hidden nodes. No visible node is connected to another visible node, and no hidden node is connected to another hidden node. Connections go both ways, and have the same weight in both directions, unlike in a multilayer perceptron, where connections are strictly unidirectional.

Each node's activation is probabilistic, and the probability of activation is given by the sigmoid function applied to the weighted sum of the states of the nodes it is connected to, plus a constant bias term. For hidden unit $j$, this is:

$$p(h_j = 1 \mid v) = \sigma(c_j + W_j \cdot v)$$

where $v$ is the vector of visible unit states, $W_j$ is the vector of weights on unit $j$'s connections (the $j$-th row of the weight matrix $W$), $c_j$ is a constant bias term, and $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. The activation is equivalent to *the probability of the hidden node being 1, given the state of the visible nodes*.

Similarly, the visible nodes' activation probabilities are:

$$p(v_i = 1 \mid h) = \sigma(b_i + h \cdot W_{\cdot i})$$

where $h$ is the vector of hidden unit states, $W_{\cdot i}$ is the $i$-th column of $W$, and $b_i$ is a constant bias term. Typically $b$ and $c$ are not separated out; rather, they are included in the weight matrix as weights to a bias unit that always outputs one, so that the vectors fed to the activation functions carry an extra entry fixed at one. In the code below, however, the biases are kept as explicit arguments.
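To make the bias-folding concrete, here is a minimal sketch (the numbers are made up) showing that appending the hidden biases as an extra column of the weight matrix, and a constant one to the visible vector, gives the same hidden activations as keeping the bias separate:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Made-up example: 3 visible units, 2 hidden units
v = np.array([1.0, 0.0, 1.0])
W = np.array([[0.1, -0.2, 0.3],
              [0.4, 0.0, -0.1]])   # shape (2, 3): hidden x visible
c = np.array([0.05, -0.05])        # hidden biases

# Separate-bias version
p_separate = sigmoid(c + np.dot(v, W.T))

# Folded version: bias becomes a weight to a unit that always outputs one
W_aug = np.hstack([W, c[:, None]])  # shape (2, 4)
v_aug = np.append(v, 1.0)           # extra entry fixed at one

p_folded = sigmoid(np.dot(v_aug, W_aug.T))

assert np.allclose(p_separate, p_folded)
```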

In Python, this could be coded as:

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

def visible_activation_probability(h, W, b):
    # h is a vector of length h, the number of hidden nodes
    # W is size h*v
    # b is length v
    # result is length v
    return sigmoid(b + np.dot(h, W))

def hidden_activation_probability(v, W, c):
    # v is a vector of length v, the number of visible nodes
    # W is size h*v, so W.T is size v*h
    # c is length h
    # result is length h
    return sigmoid(c + np.dot(v, W.T))
```

Of course, this does not explain how to find the values in $W$, $b$, and $c$.

The value of this machine is that it can discover the underlying statistical distribution of its inputs. We could use it to fill in partial datasets by activating all the visible nodes that we know, then activating the hidden nodes based on that, *then removing the stimulus from the visible nodes and letting them be filled in by the hidden nodes*. We can also use it to cluster data. The hidden activations are in some ways analogous to cluster membership of the items.
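A rough sketch of that fill-in idea (the `sample` helper and the toy weights here are my own, and a real reconstruction would usually alternate between the layers more than once):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sample(p, rng):
    # Draw binary states from the given activation probabilities
    return (rng.random(p.shape) < p).astype(float)

def fill_in(v_partial, W, b, c, rng):
    # Clamp the known visible units, drive the hidden layer once,
    # then let the hidden layer fill the visible layer back in.
    h = sample(sigmoid(c + np.dot(v_partial, W.T)), rng)
    return sigmoid(b + np.dot(h, W))

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))  # toy weights: 2 hidden, 3 visible
b = np.zeros(3)
c = np.zeros(2)
v = np.array([1.0, 0.0, 1.0])
print(fill_in(v, W, b, c, rng))  # reconstruction probabilities
```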

In the next post I will explain how to learn the weights, and the ones after that will explain why we approach the problem in this way.

Pingback: Restricted Boltzmann Machines in Python – Part 2: learning weights | Richard Weiss

Hi Richard,

In your code snippet, what's k? From your textual description, I'm led to believe that k==1, and there are h hidden units and v visible units. Am I mistaken?

I think you're right, it was probably from an earlier version of the code, thanks for pointing it out.

Why do you think "all RBM explanations" get it backwards?

Explaining a simple particular case of Gibbs sampling with pseudo-code in Python is not even a glimpse of an "RBM explanation".

RBMs are undirected generative models defined by pair-wise coupling potentials within a specific graph (two-layered graph without lateral links). Everything else is a mathematical consequence of this.

I prefer to see the explanation go the other way, from concrete to mathematical, and it took some time to understand the typical approach to explaining RBMs, which starts from energy-based models and ends with Contrastive Divergence. I'm writing up a post on the maths behind it, including Gibbs sampling.

Pingback: Restricted Boltzmann Machines in Python – Part 3: theoretical basis | Richard Weiss


Amazing tutorial thanks!

I like the way you approached restricted Boltzmann machines by showing their simple concepts first before diving into the technical details.

Thanks Richard for the post. Above you said, "We could use it to fill in partial datasets...then removing the stimulus from the visible nodes and letting them be filled in by the hidden nodes."

What do you mean by removing stimulus? Once the machine is trained, will it be just (v1*W)*W transpose?

Yep, you remove stimulus, and then see what the visible nodes are driven to by the hidden nodes. There's also some randomness, so the hidden nodes are set with random values picked based on the distribution of the values they can take.
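Roughly, in code (a sketch with made-up weights, biases left out for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # made-up trained weights, hidden x visible
v1 = np.array([1.0, 0.0, 1.0])

# Not just (v1 @ W.T) @ W -- each step applies the sigmoid,
# and the hidden states are *sampled*, not used as probabilities:
p_h = sigmoid(np.dot(v1, W.T))
h = (rng.random(p_h.shape) < p_h).astype(float)  # random binary hidden states
p_v = sigmoid(np.dot(h, W))   # probabilities for the reconstructed visibles
```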
