Thoughts and musings, alternative implementation of the theory

Hi there! I’ve been following HTM and Numenta for a long time, and interacting sparingly. I’ve got some thoughts on brain theory, and have lately started developing a classical neural network model that approximates the HTM algorithm.

This is for two reasons. First, classical ANNs are better understood, and there is a plethora of established techniques for every aspect of their performance. Second, I personally understand them very well, and I’m comfortable navigating that headspace.

My current model uses a series of classical feedforward ANNs in conjunction with LSTM units, softmax layers, and autoencoders, working together in intelligent “nodes.”
The idea goes like this:
Input is encoded as an SDR. (Layer 0)
The SDR is fed to the top layer, a feedforward autoencoder that reduces the input vector by roughly half. (Layer 1)
The autoencoded vector is passed to 1 or more LSTM units. (Layer 2)
The LSTM outputs, the LSTM memory cell contents, and the top-level autoencoded vector are passed to a deep feedforward ANN of three or more layers. (Layer 3)
The output of the deep ANN is passed to an SDR generator (Layer 4a) and a link selector (Layer 4b); a rough Lua sketch of the whole per-node pipeline follows this list.
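
Everything in the sketch below is a placeholder for a trained network; the only thing it pins down is the wiring between the layers described above.

```lua
-- Minimal sketch of one node's forward pass. Every field is a stand-in for a
-- trained network; only the data flow between the layers is meaningful.
local Node = {}
Node.__index = Node

function Node.new(parts)
  return setmetatable({
    encoder  = parts.encoder,   -- Layer 1: feedforward autoencoder (compress the SDR)
    lstm     = parts.lstm,      -- Layer 2: one or more LSTM units
    refiner  = parts.refiner,   -- Layer 3: deep feedforward ANN
    decoder  = parts.decoder,   -- Layer 4a: copy of Layer 1, turns the prediction into an SDR
    selector = parts.selector,  -- Layer 4b: softmax link selector
  }, Node)
end

function Node:step(sdr_in)
  local code            = self.encoder(sdr_in)                 -- Layer 1
  local pred, mem_cells = self.lstm(code)                      -- Layer 2
  local refined         = self.refiner(code, pred, mem_cells)  -- Layer 3
  local sdr_out         = self.decoder(refined)                -- Layer 4a: signal to downstream nodes
  local links           = self.selector(refined, code)         -- Layer 4b: which nodes get signaled
  return sdr_out, links
end
```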

Layer 1 serves to generate a first-pass model of the SDR input. ANN input vectors are dimensionless, and gradient descent can tease apart relationships in the patterns of input vectors. A single-layer autoencoder creates a low-dimensionality model of the input space without losing resolution. This becomes important later, when you’re building hierarchies. Layer 1 is trained by storing the 100 previous inputs and batching them through simple gradient descent; training is triggered whenever the reconstructed output doesn’t match the input.
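
Concretely, the Layer 1 bookkeeping is just a ring buffer plus a reconstruction check. In the sketch below, `encode`, `decode`, and `train_batch` stand in for the actual autoencoder and its gradient-descent step, and the tolerance value is an arbitrary placeholder:

```lua
-- Layer 1 training trigger: remember the last 100 inputs and run a batched
-- gradient-descent pass whenever the reconstruction drifts from the input.
local HISTORY   = 100
local TOLERANCE = 0.01   -- per-element mismatch tolerance (placeholder value)

local function mismatch(sdr, reconstruction)
  local err = 0
  for i = 1, #sdr do
    local d = sdr[i] - reconstruction[i]
    err = err + d * d
  end
  return err / #sdr
end

local function make_layer1(encode, decode, train_batch)
  local buffer, head = {}, 0
  return function(sdr)
    head = head % HISTORY + 1          -- ring buffer of the last HISTORY inputs
    buffer[head] = sdr
    local code = encode(sdr)
    if mismatch(sdr, decode(code)) > TOLERANCE then
      train_batch(buffer)              -- batch the stored inputs, simple gradient descent
    end
    return code
  end
end
```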

Layer 2 takes the autoencoded output and produces a prediction of the next encoded input. It is trained by storing the previous input and learning to predict the current input from it; training is triggered when prediction error exceeds a certain threshold. LSTM units can retain information over an almost arbitrarily large number of cycles, so relationships between inputs over time are captured. The output represents what the node thinks it’s going to see next, in the form of an autoencoded/compressed vector. (A bare-bones LSTM forward step is sketched below.)
The prediction and the current input vector are combined and fed to a deep neural network to refine the prediction. The deep ANN builds a more nuanced model over time, with the advantage of having the LSTM memory cell contents for context.
The refined prediction is passed through 4a, which is a copy of Layer 1 and serves to generate an SDR from the prediction. This is the signal from the node. The output SDR is half the size of the input SDR, or smaller, allowing two or more outputs to be combined as input to the next node in the network.
The prediction and the encoded current input are passed to a three-layer link selector network using softmax. The output vector represents a list of node addresses or cluster addresses (a cluster is a group of nodes). 4b both stores the list of connected nodes and selects which nodes are signaled. Connections can be recursive, for nodes without direct top-level input. (Layer 4b)
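
For reference, here is a single LSTM forward step in plain Lua (no libraries), mostly to show the memory cell state that goes to Layer 3 alongside the unit outputs. The weight layout is just one arbitrary choice for the sketch; the actual weights are whatever training produces.

```lua
-- One forward step of a single LSTM layer. W.i/W.f/W.o/W.g each hold one
-- weight row per unit over the concatenation [h_prev .. x], plus a bias b.
-- (math.tanh is available in Lua 5.1 / LuaJIT.)
local function sigmoid(z) return 1 / (1 + math.exp(-z)) end

local function lstm_step(x, h_prev, c_prev, W)
  -- concatenate previous hidden state and current (autoencoded) input
  local z = {}
  for _, v in ipairs(h_prev) do z[#z + 1] = v end
  for _, v in ipairs(x)      do z[#z + 1] = v end

  local h, c = {}, {}
  for j = 1, #h_prev do
    local function gate(row, squash)
      local s = row.b
      for k = 1, #z do s = s + row[k] * z[k] end
      return squash(s)
    end
    local i = gate(W.i[j], sigmoid)    -- input gate
    local f = gate(W.f[j], sigmoid)    -- forget gate
    local o = gate(W.o[j], sigmoid)    -- output gate
    local g = gate(W.g[j], math.tanh)  -- candidate value
    c[j] = f * c_prev[j] + i * g       -- memory cell: long-range context
    h[j] = o * math.tanh(c[j])         -- unit output / prediction component
  end
  return h, c                          -- both h and c feed Layer 3
end
```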

The top layer of nodes is activated. If the selected second-layer nodes have more potential input SDRs than their input vector can hold, the SDR with the highest softmax output is preferred; in the case of a tie, one is chosen at random from the winners. If there are not enough potential inputs, a random first-level node output is selected, and the selector layer is updated to reflect the new link. If no activation occurs, a counter is incremented; if a node goes some number of cycles without activation, it is reinforced in the selection layers of preceding nodes.
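
The selection rule itself is tiny. In the sketch below, each candidate carries the score its upstream selector assigned (the field names are placeholders, and at least one candidate is assumed):

```lua
-- Pick the input SDR for a downstream node: highest softmax score wins,
-- ties are broken by a random choice among the winners.
local function pick_input(candidates)   -- list of { sdr = ..., score = ... }
  local best, winners = -math.huge, {}
  for _, cand in ipairs(candidates) do
    if cand.score > best then
      best, winners = cand.score, { cand }
    elseif cand.score == best then
      winners[#winners + 1] = cand
    end
  end
  return winners[math.random(#winners)]
end
```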

The cycle is complete when all selected nodes have been activated, each no more than once. A new cycle then begins; if recursive output from the prior cycle has higher precedence than the current input(s), the current inputs are disregarded, but if the current inputs have equal activation values, the current inputs are preferred. During training, both inputs are compared against the prior activation, and the better match is reinforced, even if it was not the one selected for that cycle.
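
Put another way, the precedence rule at the start of a cycle boils down to something like this (the `activation` field is a placeholder for whatever score the selectors assign):

```lua
-- Recursive output from the previous cycle wins only when its activation is
-- strictly higher; on a tie the current input is preferred. Both values are
-- still compared against the prior activation during training.
local function choose_cycle_input(recursive_out, current_in)
  if recursive_out and recursive_out.activation > current_in.activation then
    return recursive_out, current_in   -- selected, runner-up
  end
  return current_in, recursive_out
end
```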

Training occurs during node activation, individually, and during network activation, globally. All training reinforces prediction of the current inputs using the values of t-1. Because each LSTM memory cell is given first level context in Layer 3 of each node, global contexts begin to form over time, as memory cell outputs from nodes all over the network begin to influence predictions.

I’m still working on node-to-node links, to reflect different variations of synapses, and methods of clustering so the system builds its own hierarchies.

The I/O is binary, and encoded vectors can be extracted from any node and decoded back into top-level SDRs. Comparing autoencoders lets you remove redundant nodes and triggers a pruning pass: if two nodes are producing equivalent encodings, the one with the deeper Layer 3 adds a new layer to its deep network, and the other is scrambled, or reset. This produces global plasticity and provides a reinforcement metric.
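
A sketch of that redundancy check - the node fields and methods (`encode`, `depth`, `add_layer`, `reset`) and the threshold are placeholders for however the implementation shapes up:

```lua
-- Pruning pass: if two nodes produce (nearly) the same encodings over a set
-- of probe inputs, grow the deeper node's Layer 3 and reset the other node.
local EQUIV_THRESHOLD = 0.001   -- placeholder value

local function mse(a, b)
  local s = 0
  for i = 1, #a do s = s + (a[i] - b[i]) ^ 2 end
  return s / #a
end

local function maybe_prune(node_a, node_b, probe_inputs)
  local total = 0
  for _, sdr in ipairs(probe_inputs) do
    total = total + mse(node_a:encode(sdr), node_b:encode(sdr))
  end
  if total / #probe_inputs < EQUIV_THRESHOLD then
    local deeper  = (node_a.depth >= node_b.depth) and node_a or node_b
    local shallow = (deeper == node_a) and node_b or node_a
    deeper:add_layer()   -- the deeper node gains a new layer in its deep net
    shallow:reset()      -- the other node is scrambled / reset
  end
end
```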

Anomaly detection is done by comparing the predicted encoding against the actual output with MSE.
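
In code that is just a mean squared error over the two encoded vectors - nothing network-specific:

```lua
-- Anomaly score: MSE between the encoding the node predicted and the encoding
-- it actually produced on the next cycle. Higher means more surprising.
local function anomaly_score(predicted, actual)
  local sum = 0
  for i = 1, #predicted do
    sum = sum + (predicted[i] - actual[i]) ^ 2
  end
  return sum / #predicted
end
```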

I’m about 30% of the way through my code, and still thinking my way through this. I’ve split the Numenta neuron model apart into the functional aspects of each node. I think I’ve captured the essence of an HTM column in my concept of a node. With the selector layer, 4b, I’ve abstracted the cell-to-cell synapses, so a certain level of granularity is obscured, but you can still recreate a one-to-one model of an HTM equivalent.

I think the advantages are in automatic hierarchy generation, simplicity of feature extraction, and computational efficiency - the entire system should take less than 1000 lines of code, and ANNs are really fast. It doesn’t require the entire network to be loaded in memory, since the links in 4b can refer to one or more nodes that can be loaded on demand or reside on separate hardware.
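
The dynamic-loading part is cheap to sketch; `load_node` below is a placeholder for whatever actually deserializes a node from disk or fetches it from another machine:

```lua
-- On-demand node lookup: 4b links only carry addresses, so nodes are pulled
-- into memory the first time they are signaled.
local function make_node_registry(load_node)
  local live = {}
  return function(address)
    if not live[address] then
      live[address] = load_node(address)   -- from disk, or a remote device
    end
    return live[address]
  end
end
```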

Motor control and embodiment can be achieved directly by applying the output SDR of any given node or cluster to a control.

So the idea operates on SDRs as the data storage mechanism, is real-time, plastic, predictive, and easily interfaced, and it can operate in memory-constrained systems with dynamic loading or utilize networks of devices.

It’ll probably be another couple months before I have any sort of working code, and I still need to do a lot of reading and theory.

I think it’s just enough past half baked to share, though, and I would hugely appreciate criticism or new ideas.

I mostly code in Lua, but I’m entertaining the idea of going with C from here on out. I’ll package the code I have on GitHub in the next several weeks. Thanks for taking a look!

2 Likes

You write, “I’m still working on node-to-node links, to reflect different variations of synapses, and methods of clustering so the system builds its own hierarchies.”

I was going to start my comments in reaction to that quote, but instead my point of departure had better be the statement made above that “Motor control and embodiment can be achieved directly by applying the output SDR of any given node or cluster to a control.”

I do not see how applying the output to a control can be part of rational, goal-directed activity without higher mechanisms of conceptual thought. I will keep checking back here for further developments.

1 Like

The idea for embodiment would be to insert an SDR representing motor state as a top-level input, and to unfold the prediction from deeper inside the network as the instruction for t+1.

A trivial example of this would be something like rod balancing - https://goo.gl/images/5Kcn7V - you can impose an arbitrary target, the desired state, by inserting the SDR of that state as the predicted input for a node at least two links below the top level. The feedback from that level would influence the intermediate predictions, forming an interpolation that results in a series of motor instructions tending toward the desired goal.

Unpredictable state changes - think about adding wind, or random jolts to the pole being balanced - would trigger changes in the input SDR patterns that the deeper nodes could use to predict the correct series of motor instructions required to get back to the desired state.

Temporal memory in this system is implicit, rather than explicit, but I think it’s equivalent to the Numenta algorithm.

You set goals for behavior by encoding one or more input states and overriding the “folded up” predictive SDR of nodes deep in the network. The deeper those nodes are, the more of the nodes above them in the hierarchy will be used to achieve the goal. A top-level node could be used directly for motor control, but it would suck at it; a deep hierarchy would perform much better, and the reinforcement mechanism is very simple.
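
Mechanically, the override is nothing more than pinning a prediction on a deep node, something like the sketch below (`network.nodes` and the field names are placeholders for however the implementation ends up storing things):

```lua
-- Goal setting: clamp the prediction of a node at least two links below the
-- top level to the SDR of the desired state, so the intermediate predictions
-- (and ultimately the motor output) get pulled toward it.
local function clamp_goal(network, node_id, goal_sdr)
  network.nodes[node_id].forced_prediction = goal_sdr
end

-- Used wherever the node's prediction is read out.
local function prediction_for(node)
  return node.forced_prediction or node.predicted_sdr
end
```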

You can apply the idea to visual saccades, audio interpretation, touch, temperature, or complex behavior like manipulating a direct text interface in conjunction with a visual feed of a screen. I’m definitely interested in setting up a chatbot, and then eventually mapping semantic knowledge databases like Cyc, WordNet, and so forth onto clusters of nodes.

The fitness metric is universal: successful predictive power. Implicit structures let you direct behaviors globally by inserting arbitrary states as SDRs anywhere in the network, and recursive connections between nodes let the system attain whatever level of complexity is necessary to achieve a particular state over time.

1 Like

Another aspect I didn’t mention is that the entire network is differentiable, so a fixed network can unfold an arbitrary number of sequences, since the contents of the memory cells are encoded in the signal SDRs alongside the output of the selectors in 4b. If you streamed video into a fixed network, then after processing the entire video, the end state could be unfolded back into the entire video stream. I’m not sure what level of compression would be achieved, but it seems significant.

This ties the system as a whole into Marcus Hutter’s theory of intelligence as compression - AIXI, http://www.hutter1.net/ai/aixigentle.htm.

Sequences are not tied to temporal relationships, either - the theory is applicable to spatial prediction, such as Photoshop-style space filling, audio sequence prediction, or even corrupted file repair. I think one of the first experiments I want to try will be visual shape completion: train a system on black-and-white images of a dozen or so shapes, and use it to predict whole shapes when the images are partially obscured, in varying orientations and sizes. This would demonstrate the capacity for generalization, prediction, and abstraction of linear relationships.
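
The test data for that experiment is easy to generate. A sketch of the setup, where the image size, the shape, and the occlusion band are all arbitrary placeholder choices:

```lua
-- Shape-completion test input: render a filled rectangle as a binary image,
-- then blank out a band of rows to make the partially obscured version.
local SIZE = 16

local function rectangle(top, left, bottom, right)
  local img = {}
  for r = 1, SIZE do
    img[r] = {}
    for c = 1, SIZE do
      img[r][c] = (r >= top and r <= bottom and c >= left and c <= right) and 1 or 0
    end
  end
  return img
end

local function occlude_rows(img, first, last)
  local out = {}
  for r = 1, SIZE do
    out[r] = {}
    for c = 1, SIZE do
      out[r][c] = (r >= first and r <= last) and 0 or img[r][c]
    end
  end
  return out
end

local whole   = rectangle(4, 4, 12, 12)
local partial = occlude_rows(whole, 8, 16)   -- the network should predict the hidden half
```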

This should be an easy launching-off point. I’ll start in on converting my code to C tonight, and maybe have something hacky to show off over the weekend.

1 Like

And I’ve forgotten almost everything about using C and just spent a week tangled in the weeds of relearning it - I’m going to start with a LuaJIT implementation. At least then I can move the needle.

1 Like