Proof of concept: Trainable universal encoder architecture



Howdy everyone. I’ve been thinking about the process of encoding arbitrary high-dimensional data for input to networks such as HTM. For low-dimensional data such as scalars, the traditional hand-crafted encoders probably work pretty well. However, high-dimensional data such as images contain a lot of redundant structure, so encoding them as a concatenation of scalar quantities may be wasteful of computation and memory.

I’ve produced a proof-of-concept encoder architecture that uses the methodology of denoising autoencoders: models that are trained to reconstruct input examples in the presence of noise.
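To make the denoising idea concrete, here is a minimal sketch in plain Python. The masking noise (zeroing a random fraction of inputs), the `corrupt` and `reconstruction_error` helpers, and the 0.3 drop probability are all assumptions for illustration; they are not necessarily what the repository uses.

```python
import random

def corrupt(values, drop_prob, rng):
    """Masking noise (assumed for illustration): zero each value with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in values]

def reconstruction_error(target, output):
    """Mean squared error between the clean target and the network output."""
    return sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)

rng = random.Random(0)                      # seeded so the corruption is repeatable
clean = [0.0, 0.5, 1.0, 0.25, 0.75, 1.0]    # stand-in for a flattened input example
noisy = corrupt(clean, drop_prob=0.3, rng=rng)

# A denoising autoencoder is fed `noisy` but scored against `clean`,
# so it cannot succeed by simply copying its input.
```

The key point is the asymmetry: the corrupted vector goes in, and the error is measured against the uncorrupted one.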

Figure: sparse autoencoder (image source)

Here I investigate training a denoising autoencoder network to form useful intermediate representations that can be used to encode sensory input into SDRs. The network trains in less than 30 seconds on a GTX 1080 Ti GPU.

Although “vanilla” autoencoders have been shown to learn useful low-dimensional representations, converting these real-valued vectors into binary SDRs may lose important structure. So I’ve added two auxiliary losses to training that encourage the forms of sparsity we’re interested in, with the intention of pushing the sigmoid activations of the intermediate representation closer to a usable binary vector:

Per-sample hidden-layer sparsity constraint

layer_sparsity = tf.reduce_sum(hidden, axis=1) / hidden_size
layer_loss     = tf.reduce_mean(tf.squared_difference(layer_sparsity, DESIRED_SPARSITY))

This constraint, based on the sum of the sigmoid units along the hidden layer dimension, encourages representations for individual input examples to approach the desired sparsity.
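The same arithmetic can be checked by hand on a tiny example. This is a plain-Python sketch of the formula above, not the TensorFlow code itself; the 0.25 target and the 2x4 activation matrix are made-up values chosen to keep the numbers simple.

```python
DESIRED_SPARSITY = 0.25  # hypothetical target, picked for easy arithmetic

# Two samples (rows) by four hidden units (columns) of sigmoid outputs.
hidden = [
    [1.0, 0.0, 0.0, 0.0],  # mean activation 0.25: exactly on target
    [1.0, 1.0, 1.0, 1.0],  # mean activation 1.0: far too dense
]

def layer_loss(hidden, target):
    """Mean over samples of (mean activation across the hidden layer - target)^2."""
    hidden_size = len(hidden[0])
    per_sample = [sum(row) / hidden_size for row in hidden]
    return sum((s - target) ** 2 for s in per_sample) / len(per_sample)

loss = layer_loss(hidden, DESIRED_SPARSITY)
# First row contributes 0, second contributes (1.0 - 0.25)^2 = 0.5625,
# so the mean over the two samples is 0.28125.
```

Only the dense second row is penalised, which is the behaviour the constraint is meant to have.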

Per-unit batch sparsity constraint

batch_sparsity = tf.reduce_sum(hidden, axis=0) / batch_size
batch_loss     = tf.reduce_mean(tf.squared_difference(batch_sparsity, DESIRED_SPARSITY))

This constraint, based on the sum of the sigmoid units along the batch dimension, encourages each unit in the hidden layer to be active for a desired fraction of the examples in the training batch.
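A small hand-checkable case shows what this term catches. Again this is a plain-Python sketch of the formula above with made-up numbers (a 0.25 target and two 4x4 batches), not the repository code. In both batches every row has a mean activation of 0.25, so the per-sample constraint is already satisfied; only the per-unit term tells them apart.

```python
DESIRED_SPARSITY = 0.25  # hypothetical target, picked for easy arithmetic

# Every sample activates the same single unit: sparse, but unit 0 does all the work.
collapsed = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

# Each sample activates a different unit: equally sparse, load shared evenly.
balanced = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

def batch_loss(hidden, target):
    """Mean over units of (mean activation across the batch - target)^2."""
    batch_size = len(hidden)
    per_unit = [sum(col) / batch_size for col in zip(*hidden)]
    return sum((u - target) ** 2 for u in per_unit) / len(per_unit)

collapsed_loss = batch_loss(collapsed, DESIRED_SPARSITY)  # 0.1875
balanced_loss = batch_loss(balanced, DESIRED_SPARSITY)    # 0.0
```

The collapsed batch is penalised even though each individual representation looks fine, which is why this term is needed alongside the per-sample one.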

These two constraints together encourage 1) sparsity of representation, and 2) equal division of representational capacity across the hidden layer.
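Once the sigmoid activations are pushed towards 0 and 1, they still have to be turned into an actual binary vector. One simple way to do that, shown here purely as an assumption (the `to_sdr` helper and the top-k rule are hypothetical, not taken from the repository), is to keep the k most active units, with k set by the desired sparsity:

```python
def to_sdr(activations, sparsity):
    """Set the k most active units to 1, where k = round(sparsity * n), minimum 1."""
    n = len(activations)
    k = max(1, round(sparsity * n))
    # Indices of the k largest activations.
    winners = sorted(range(n), key=lambda i: activations[i], reverse=True)[:k]
    bits = [0] * n
    for i in winners:
        bits[i] = 1
    return bits

# Made-up sigmoid outputs for one input: two units clearly on, the rest near zero.
activations = [0.91, 0.03, 0.12, 0.88, 0.05, 0.02, 0.07, 0.10]
sdr = to_sdr(activations, sparsity=0.25)
# sdr == [1, 0, 0, 1, 0, 0, 0, 0]
```

Top-k guarantees a fixed number of active bits per SDR. A plain threshold at 0.5 would give the same result on this example, but the number of active bits would then vary from input to input.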

More details are available in the repository below.

Note that more sophisticated networks (e.g. CNNs) can encode MNIST better than the single hidden layer I’m using here, but the intent is for the network to be general enough to encode, with reasonable quality, any modality of high-dimensional data you’re interested in. Note also that the network makes no use of the class labels on the training data, so it can be trained on batches of unlabelled input data of the kind HTM was intended to handle.

Feel free to use it if you have any desire for this sort of thing. I’m interested in any ideas the community might have for improving the method, and I’d be glad to consider pull requests if you discover anything cool. I’m also interested in ideas for evaluating the quality of these encodings. I’ve got a couple of preliminary (and fairly naive) evaluations on the GitHub page, but these can certainly be improved.

Cheers! Happy hacking.

This looks really interesting, thanks for sharing!