Hi there! I’ve been following HTM and Numenta for a long time, interacting occasionally. I’ve got some thoughts on brain theory, and I’ve lately started developing a classical neural network model that approximates the HTM algorithm.
This is for two reasons. First, classical ANNs are better understood, with a plethora of techniques for tuning every aspect of performance. Second, I personally understand them very well, and I’m comfortable navigating that headspace.
My current model uses a series of classical feedforward ANNs in conjunction with LSTM, softmax, and autoencoding, working together in intelligent "nodes."
The idea goes like this:
Input is encoded as an SDR. (Layer 0)
The SDR is fed to the top layer, a feedforward autoencoder that reduces the input vector by roughly half. (Layer 1)
The autoencoded vector is passed to 1 or more LSTM units. (Layer 2)
The vector of LSTM outputs, LSTM memory cells, and top level autoencoded vector are passed to a three or more layer (deep) feedforward ANN. (Layer 3)
The output of the ANN is passed to an SDR generator (Layer 4a) and a link selector (Layer 4b).
Layer 1 serves to generate a first-pass model of the SDR input. ANN input vectors are dimensionless, and gradient descent can tease apart relationships in the patterns of input vectors. A single-layer autoencoder creates a low-dimensional model of the input space without losing resolution, which becomes important later, when you’re building hierarchies. Layer 1 is trained by storing the 100 previous inputs and batching with simple gradient descent; training is triggered whenever the output doesn’t match the input.
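To make Layer 1 concrete, here’s a rough sketch in Python (my actual code is Lua). The class name, tied weights, sigmoid activations, and every hyperparameter are my assumptions for illustration, not a spec — just a single hidden layer that halves the dimension, trained by batch gradient descent on a buffer of the 100 most recent inputs:

```python
import numpy as np

class TiedAutoencoder:
    """Single-layer autoencoder that roughly halves the input dimension.
    Hypothetical sketch of Layer 1 -- names and hyperparameters are assumptions."""
    def __init__(self, n_in, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (n_in, n_in // 2))  # tied encode/decode weights
        self.b_h = np.zeros(n_in // 2)
        self.b_o = np.zeros(n_in)

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def encode(self, x):
        return self._sigmoid(x @ self.W + self.b_h)

    def decode(self, h):
        return self._sigmoid(h @ self.W.T + self.b_o)

    def train_batch(self, X, lr=0.5, epochs=200):
        """Plain batch gradient descent on reconstruction MSE."""
        for _ in range(epochs):
            H = self.encode(X)
            Y = self.decode(H)
            dY = (Y - X) * Y * (1.0 - Y)          # output delta (MSE x sigmoid')
            dH = (dY @ self.W) * H * (1.0 - H)    # hidden delta
            gW = (X.T @ dH + dY.T @ H) / len(X)   # tied weights: sum both paths
            self.W -= lr * gW
            self.b_o -= lr * dY.mean(axis=0)
            self.b_h -= lr * dH.mean(axis=0)

# buffer of the 100 most recent binary inputs, then one training pass
rng = np.random.default_rng(1)
buffer = (rng.random((100, 16)) < 0.2).astype(float)  # sparse 16-bit inputs
ae = TiedAutoencoder(16)
before = np.mean((ae.decode(ae.encode(buffer)) - buffer) ** 2)
ae.train_batch(buffer)
after = np.mean((ae.decode(ae.encode(buffer)) - buffer) ** 2)
```

The "training is triggered if the output doesn’t match the input" rule would just wrap `train_batch` in a reconstruction-error check.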
Layer 2 takes the autoencoded output and produces a prediction of the next encoded input. It is trained by storing the previous input and learning to predict the current input from it; training is triggered when prediction error exceeds a threshold. LSTM can store data over an almost arbitrarily high number of cycles, so relationships between inputs over time are captured. The output represents what the node thinks it’s going to see next, in the form of an autoencoded/compressed vector.
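A minimal sketch of one Layer 2 unit, again in Python for illustration. This is only the standard LSTM forward pass plus the error-triggered training signal; sizes, initialization, and the threshold value are my assumptions, and the weight update itself is stubbed out because a full BPTT derivation is out of scope here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell, forward pass only. Hypothetical Layer 2 sketch."""
    def __init__(self, n_in, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        z = n_in + n_hid
        self.W = rng.normal(0.0, 0.1, (4, z, n_hid))  # i, f, g, o gate weights
        self.b = np.zeros((4, n_hid))
        self.h = np.zeros(n_hid)   # output: the prediction
        self.c = np.zeros(n_hid)   # memory cell, later exposed to Layer 3 as context

    def step(self, x):
        z = np.concatenate([x, self.h])
        i = sigmoid(z @ self.W[0] + self.b[0])  # input gate
        f = sigmoid(z @ self.W[1] + self.b[1])  # forget gate
        g = np.tanh(z @ self.W[2] + self.b[2])  # candidate memory
        o = sigmoid(z @ self.W[3] + self.b[3])  # output gate
        self.c = f * self.c + i * g
        self.h = o * np.tanh(self.c)
        return self.h

# training trigger: compare the previous prediction against the current input
cell = LSTMCell(n_in=8, n_hid=8)
threshold = 0.1
prev_pred = np.zeros(8)
stream = (np.random.default_rng(2).random((5, 8)) < 0.3).astype(float)
for x in stream:
    if np.mean((prev_pred - x) ** 2) > threshold:
        pass  # prediction failed: (previous input -> x) training would run here
    prev_pred = cell.step(x)
```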
The prediction and current input vectors are combined and fed to a deep neural network to refine the prediction. The deep ANN builds a more nuanced model over time, with the advantage of having the LSTM memory cell contents for context.
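The combining step can be sketched like this — concatenate the three context vectors and run them through a deep tanh stack. The weights below are untrained random placeholders and every dimension is made up; the point is only the wiring:

```python
import numpy as np

def refine(prediction, encoded_input, memory_cells, layers):
    """Sketch of Layer 3: concatenate the LSTM prediction, the encoded
    current input, and the memory cell contents, then push the result
    through a deep feedforward stack. Illustrative, not a spec."""
    x = np.concatenate([prediction, encoded_input, memory_cells])
    for W, b in layers:
        x = np.tanh(x @ W + b)
    return x

rng = np.random.default_rng(0)
n_in = 4 + 4 + 4                  # prediction + encoded input + memory cells
layers = []
for n_out in (12, 10, 8):         # three hidden layers, hence "deep"
    layers.append((rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)))
    n_in = n_out
refined = refine(np.ones(4), np.zeros(4), np.ones(4), layers)
```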
The refined prediction is passed through 4a, which is a copy of Layer 1 and serves to generate an SDR from the prediction. This is the signal from the node. The output SDR is half the size of the input SDR, or smaller, allowing two or more outputs to be combined as input to the next node in the network.
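The binarization at the end of 4a could look like this. I haven’t settled on the exact rule, so top-k thresholding here is purely my assumption — keep the k strongest components active and zero the rest:

```python
import numpy as np

def to_sdr(vec, sparsity=0.1):
    """Binarize a dense encoded vector into an SDR by keeping only the
    top-k strongest components. Hypothetical helper: the binarization
    rule is an assumption, not a settled design."""
    k = max(1, int(round(sparsity * len(vec))))
    sdr = np.zeros(len(vec), dtype=int)
    sdr[np.argsort(vec)[-k:]] = 1
    return sdr

pred = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.15, 0.6, 0.25])
print(to_sdr(pred, sparsity=0.3))  # → [0 1 0 1 0 1 0 0 0 0]
```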
The prediction and the encoded current input are passed to a three-layer link selector network, using softmax. The output vector represents a list of node addresses, or cluster addresses (a cluster is a group of nodes). Layer 4b both stores a list of connected nodes and selects which nodes are signaled. Connections can be recursive, for nodes without direct top-layer input.
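Here’s the gist of the selector, collapsed to a single softmax layer for brevity (mine would be three layers, and all the names here are placeholders I made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class LinkSelector:
    """Hypothetical sketch of 4b: a softmax layer mapping the combined
    (prediction + encoded input) vector to a distribution over the
    node/cluster addresses this node stores."""
    def __init__(self, n_in, node_addresses, seed=0):
        rng = np.random.default_rng(seed)
        self.addresses = list(node_addresses)   # stored list of connected nodes
        self.W = rng.normal(0.0, 0.1, (n_in, len(self.addresses)))

    def select(self, vec, top=1):
        p = softmax(vec @ self.W)
        order = np.argsort(p)[::-1][:top]       # highest-probability links first
        return [self.addresses[i] for i in order]

sel = LinkSelector(6, ["node_a", "node_b", "cluster_c"])
chosen = sel.select(np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0]), top=1)
```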
The top layer of nodes is activated. If the nodes selected in the second layer have more potential SDRs than their input vector size, the SDR with the highest softmax output is preferred; in the case of a tie, one is chosen randomly from the winners. If there aren’t enough potential inputs, a random first-level node output is selected, and the selector layer is updated to reflect the new link. If no activation occurs, a counter is incremented. If a node goes some number of cycles without activation, it is reinforced in the selection layers of preceding nodes.
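The tie-break rule above is small enough to pin down directly (the seeded rng default is just for reproducibility in this sketch):

```python
import random

def pick_winner(candidate_sdrs, selector_scores, rng=random.Random(0)):
    """Prefer the SDR with the highest softmax/selector score;
    on a tie, pick randomly among the winners."""
    best = max(selector_scores)
    winners = [c for c, s in zip(candidate_sdrs, selector_scores) if s == best]
    return rng.choice(winners)

print(pick_winner(["sdr1", "sdr2", "sdr3"], [0.2, 0.7, 0.7]))  # "sdr2" or "sdr3"
```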
The cycle is complete when every selected node has been activated no more than once. A new cycle then begins; if recursive output from the prior cycle has higher precedence than the current input(s), the current inputs are disregarded. If the current inputs have equal activation values, they are preferred. Both inputs are compared to the prior activation during training, and the better match is reinforced, even if it was not the one selected for that cycle.
Training occurs during node activation, individually, and during network activation, globally. All training reinforces prediction of the current inputs using the values at t-1. Because each LSTM memory cell is given first-level context in Layer 3 of each node, global contexts begin to form over time, as memory cell outputs from nodes all over the network begin to influence predictions.
I’m still working on node-to-node links, to reflect different variations of synapses, and methods of clustering so the system builds its own hierarchies.
The I/O is binary, and encoded vectors can be extracted from any node and decoded back to top-level SDRs. Comparing autoencoders lets you remove redundant nodes and triggers a pruning pass. If two nodes are producing equivalent encodings, the one with the deeper Layer 3 adds a new layer to its deep network, and the other is scrambled, or reset. This produces global plasticity and provides a reinforcement metric.
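The equivalence test for the pruning pass is still loose in my head; here’s one way I might define it, purely as an assumption — two nodes count as redundant when their encodings of the same probe inputs agree within a small mean-squared tolerance:

```python
import numpy as np

def equivalent(enc_a, enc_b, tol=0.05):
    """Hypothetical redundancy check for the pruning pass: encodings of
    the same probe inputs within a small MSE tolerance count as equivalent."""
    return bool(np.mean((np.asarray(enc_a) - np.asarray(enc_b)) ** 2) < tol)

print(equivalent([0.20, 0.80], [0.21, 0.79]))  # → True: redundant, trigger pruning
```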
Anomaly detection is done by comparing the predicted encoding against the actual outputs using MSE.
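That part really is just a one-liner:

```python
import numpy as np

def anomaly_score(predicted, actual):
    """MSE between the predicted encoding and the actual output."""
    return float(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2))

print(anomaly_score([0.1, 0.9], [0.1, 0.9]))  # → 0.0 (perfect prediction)
print(anomaly_score([0.0, 1.0], [1.0, 0.0]))  # → 1.0 (maximal surprise)
```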
I’m 30% through my code, and still thinking my way through this. I’ve split the Numenta neuron model apart into functional aspects of each node. I think I’ve captured the essence of an HTM column in my concept of a node. With the selector layer, 4b, I’ve abstracted the cell-to-cell synapses, so a certain level of granularity is obscured. You could still recreate a one-to-one model of an HTM equivalent.
I think the advantages are automatic hierarchy generation, simplicity of feature extraction, and computational efficiency: the entire system should take less than 1,000 lines of code, and ANNs are really fast. It doesn’t require the entire network to be loaded in memory, since the links in 4b can refer to one or more nodes that can be loaded on demand or reside on separate hardware.
Motor control and embodiment can be achieved directly by applying the output SDR of any given node or cluster to a control.
So the idea operates on SDRs as the data storage mechanism; it is real-time, plastic, predictive, and easily interfaced, and it can operate in memory-constrained systems with dynamic loading, or utilize networks of devices.
It’ll probably be another couple months before I have any sort of working code, and I still need to do a lot of reading and theory.
I think it’s just enough past half baked to share, though, and I would hugely appreciate criticism or new ideas.
I mostly code in Lua, but I’m entertaining the idea of going with C from here on out. I’ll package the code I have on GitHub in the next several weeks. Thanks for taking a look!