Eric Laukien, Richard Crowder, Fergal Byrne
(Submitted on 13 Sep 2016)
Efforts at understanding the computational processes in the brain have met with limited success, despite their importance and potential uses in building intelligent machines. We propose a simple new model which draws on recent findings in Neuroscience and the Applied Mathematics of interacting Dynamical Systems. The Feynman Machine is a Universal Computer for Dynamical Systems, analogous to the Turing Machine for symbolic computing, but with several important differences. We demonstrate that networks and hierarchies of simple interacting Dynamical Systems, each adaptively learning to forecast its evolution, are capable of automatically building sensorimotor models of the external and internal world. We identify such networks in mammalian neocortex, and show how existing theories of cortical computation combine with our model to explain the power and flexibility of mammalian intelligence. These findings lead directly to new architectures for machine intelligence. A suite of software implementations has been built based on these principles, and applied to a number of spatiotemporal learning tasks.
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Dynamical Systems (math.DS)

Recalling video sequences stored in Ogma's Feynman machines.
Network size: 4 layers, 128x128. Connection radius of 6.
Training time: Approx. 20 seconds per video (depending on video length).

Rolling MNIST conveyor belt anomaly detector demo on Youtube:

Detecting anomalous MNIST digits as they rotate by.
Network size: 4 layers, 64x64, 48x48, 32x32, 16x16.
Training time: Approx. 30 seconds.

A regular class of various digit 3s rotates by; the system flags non-3 digits when the prediction error passes a certain threshold.
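The thresholding step described above can be sketched roughly as follows. This is only an illustration, not the demo's actual code: the function names, the use of mean squared error as the error measure, the toy frames and the threshold value are all assumptions.

```python
def mean_squared_error(predicted, actual):
    """Mean squared difference between two equal-length frames."""
    if len(predicted) != len(actual):
        raise ValueError("frames must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def flag_anomalies(predicted_frames, actual_frames, threshold):
    """Return one boolean per timestep: True where the error exceeds the threshold."""
    return [
        mean_squared_error(p, a) > threshold
        for p, a in zip(predicted_frames, actual_frames)
    ]


# Toy data (assumed): a hypothetical model predicts the first two frames
# well and the third one badly, as it might for an unfamiliar digit.
predicted = [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
actual = [[0.0, 0.9, 0.1], [1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
flags = flag_anomalies(predicted, actual, threshold=0.25)
```

In this toy run only the last frame is flagged, because the model never learned to predict anything like it.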

First I found the ideas interesting. In particular the 4 layer hierarchy
demonstration suggests it scales. This is particularly impressive given
your local circuits between encoder and decoder look inherently unstable,
because they modify the data fed as input to each other. Obviously they are
stable, which is difficult to achieve in practice.

The rolling MNIST digit example is impressive but I'm anxious about what I
can "take away" from it - how big were the training and testing sets of
digit images? I guess not that big if you only trained for 30 seconds? Also
in the video I didn't see any 3-ambiguous digits such as 8. The set sizes
make a big difference to the quality of the problem posed.

I may have missed it but I think it would be good to see the effect of
adding layers - 1,2,3,4 compared to larger, flat architectures. Ideally
there would be an efficiency benefit to deep rather than wide.

I'd really like to see some more conventional benchmarks even if your
method doesn't do very well. This is because it's fine to have a general
method that does X, but also does new feature Y, even if it's not as good
at X. But to understand the capabilities/limitations we need transferable
benchmark results.

My biggest worry right now would be the name. As someone mentioned on
Reddit, you're doing yourselves a massive disservice with the Feynman name.
Neural Turing machines are a derivation of Turing's concept. Boltzmann
machines use his equation. Using Feynman doesn't seem like a homage, it
will have people looking for the connection to your algorithm... QED?! If
nothing else, this will serve as a distraction from the work itself. It
would be better to name it after yourselves, or better, some aspect of the
uniqueness of the approach (e.g. Takens-something).

Final thought - the training time seems really quick. It would be nice to
put some context in there e.g. measuring it in FLOPs to convergence and
comparing against another method (most neuroscientists use e.g. Matlab so
clock-based times can't be fairly compared to your GPU implementation).

Yes, we can scale arbitrarily deep. Normally, we just add layers until the performance (on the task) starts to saturate versus the slowdown due to the added layers. The best depth depends on the task; many tasks are fine with just two layers, most are in the 4-6 range, and a few need up to 8 levels. We'll include details of these hyperparameters with each demo when we release them.

It's a little buried in the paper, but we're really excited about using the Feynman Machine in distributed environments, where the bottom few layers run on a low-power client (phone, tablet, Raspberry Pi), and we can dynamically augment it with a stack of upper layers running in the cloud. Inter-region comms are just SDRs, so it's really robust and performant.

In terms of width, we normally start with a guess of 128 (ie 128x128 units per region), and then go bigger or smaller as appropriate (on GPUs we can make pretty big regions). The heuristic is that a region is either too small, or big enough, and we can see a big drop in the capability if we drop below that boundary. We should do a paper on depth/width tradeoffs, thanks for the question.

I believe the regions are stable because of the fixed spatial and temporal sparsity - ahem, density - of the encoder hidden outputs, and because the decoder is used only for learning in the encoder, not in the activation phase. The temporal sparsity is important, and there are a few ways to achieve it, like having activity-dependent decaying biases on each unit, or zeroing the ramping activations after firing. Also, the top-down feedback tends to smooth out the effect of the fast-changing encoder on the downward pass. Finally, the use of local receptive fields and local inhibition tends to limit instability. We've tried a number of connection schemes and rejected the ones that cause runaway activity.

I'll ask @ericlaukien to answer this one in depth, but briefly it's not meant to be an exhaustive demonstration of performance on MNIST itself, but a demo of visual odd-man-out detection, using failure to model the spatiotemporal evolution of the stimulus to indicate a visual anomaly.

The training time usually refers to the network running and learning at full-speed (hundreds of fps on a consumer GPU), not the framerate in the videos, which is slowed right down so we humans can see what's going on.

We test each iteration on more than a dozen tasks to identify how it rates relative to previous designs. Some of the tasks are proxies for standard benchmarks, reduced in size to speed the workflow, others are full-size. Most designs we've tried are either really poor at most things, or pretty good at most things and great at some subset. We'll be pushing out updates about performance on standard tasks over the next few weeks and months.

I understand that point, but we really like the name for many reasons. I genuinely believe that if Feynman had lived, he would have discovered this connection years ago, as he spent the last decade of his life working with Hopfield, Wolfram and others, on cellular automata, dynamical systems and neural networks. His sister, Joan, who used Takens' Theorem in one of her seminal papers, would surely have been a part of it too - if so, we might have to call it the Feynman-Feynman Machine. She's still working after retiring a few years ago, so perhaps we can ask her.

Yes, one of @ericlaukien's main quality criteria is whether something takes seconds or minutes to train. Any more than a few minutes usually indicates you've gone wrong somewhere. As I said earlier, on many tasks you need quite a big layer size, because you want to have very sparse SDRs but still have many units on, as temporal sparsity requires more active units per SDR. And the only way to run big layers (512x512 or bigger) is on a GPU; the CPU versions are hundreds of times slower.

@riccro is looking at Matlab integration, so people will be able to plug this into their current workflow.

Thanks again for the great questions and comments. It's a pleasure to be able to talk details now after several months of radio silence!

There are several classes of 3's being used, but more importantly, it never sees what an anomalous digit looks like at training time. It only sees 3's. The demo will be made available soon, so people can experiment with it to test for any criteria they like. Right now it gets 100% accuracy; we can make the task more difficult until it doesn't, though, and report where the problems lie.

For performance tests, we will definitely look into doing some proper FLOPS tests as you suggest. Right now it's not very scientific: we just run it, and if we get bored within a minute then it's too slow.
Overall though, it should be several hundreds of times faster than backpropagation-based systems, if only for the fact that it doesn't need to run several iterations on a replay buffer. However, whether this hurts the power of the algorithm is unknown right now.

Of course it's easy for me to sit back and demand more rigor and more
results and more analysis but I know these things take a lot of time.

One thing though - training on all the '3's would be good to see, because
this changes the quality of the problem posed. When only a fraction of the
3 digit images are used, the algorithm is able to overfit these templates
and anomaly classification becomes easier, because the non-anomalies are
very well defined. If you had a wider range of '3' samples (ideally use the
whole training set, then test on all the test set '3' images) this would be
much harder.

I don't quite understand the paper's Predictive Hierarchies section (p 11-12) - why do you say you need to add extra perceptrons or a feed-forward net in order to get predictions? Aren't these predictions already formed by the decoder (eq. 14)?

My guess: when you say "add several perceptrons" for predictive hierarchy type 1 that is just referring to the decoders already described.

Also, you seem to have real-valued inputs x but given that the signals between layers are binary SDRs, I guess it would be fine also to use binary SDR input, as HTM does. Obviously there is still a need for a stimulus encoding step, whether into real or binary form, so I guess you will want to build stimulus encoders for sound, text etc.

I'm excited to see what can come from this new approach.

Thanks for the comments/questions. I think we'll stick with Feynman.

Yes, the standard hierarchy - the Sparse Predictive Hierarchy - is just the encoder-decoder pairs as described in the previous section (pp 7-11). The alternative described in pp 11-12 is called the Routed Predictive Hierarchy, and is basically a stack of encoders each paired with a layer from another kind of network (the pseudocode in S2 uses an inverted feedforward multi-layer perceptron for the slave network).

The difference between the two is that in the SPH, feedback and lateral inputs are combined additively (either before, or more usually after, being transformed), while in the RPH they are multiplied. Because the encoder outputs are usually SDRs, this causes the encoder to route only a subset of the feedback signals passing down the hierarchy. This is a kind of "spatiotemporal Dropout", which chooses unique sparse subnetworks of the controlled network for each timestep. This acts as a regulariser for the slave network, and also makes it faster to train, because only a few units need be considered for each update, and the gradients are typically much simpler.
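To make the additive/multiplicative distinction concrete, here is a toy sketch. The function names and the example vectors are made up for illustration, and the real systems transform these signals in ways not shown here; the only point is that multiplying by a binary SDR lets through just the feedback values at the active units.

```python
def combine_additive(lateral, feedback):
    """SPH-style combination (simplified): elementwise sum of the two signals."""
    return [l + f for l, f in zip(lateral, feedback)]


def combine_multiplicative(sdr, feedback):
    """RPH-style combination (simplified): the binary SDR gates the feedback."""
    return [s * f for s, f in zip(sdr, feedback)]


# Hypothetical values: a 6-unit encoder SDR with two active units,
# and a dense feedback signal coming down the hierarchy.
encoder_sdr = [0, 1, 0, 0, 1, 0]
feedback = [0.3, -0.7, 0.5, 0.2, 0.9, -0.1]

added = combine_additive(encoder_sdr, feedback)         # every unit still carries signal
routed = combine_multiplicative(encoder_sdr, feedback)  # only units 1 and 4 pass feedback
routed_units = [i for i, v in enumerate(routed) if v != 0]
```

With a different SDR at each timestep, a different subset of units is routed, which is the sense in which the gating picks a sparse subnetwork per step.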

Yes, the external inputs can be either real or binary, whatever your data is, and we typically find k-sparse SDRs are best for internal connections for a number of reasons. We prefer to do sensory encoding (in the HTM sense of X-to-SDR) in the bottom encoder, as we want to get the predictions out in the same or similar form as the data, and this allows both encoding and decoding to be jointly learned.

The bottom encoders described in the paper are just randomly-initialised affine transformations, with the temporal persistence at derived input and stimulus levels, followed by k-sparse inhibition. This works directly for most kinds of data we've thrown at it, and is what's normally used in every level above the bottom. We've also been looking at encoders for the bottom level which do more work in generating SDRs in order to give them better properties for feeding to the upper levels; we'll be talking about these in an upcoming paper.
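As a rough illustration of "random affine transformation followed by k-sparse inhibition", a minimal sketch might look like this. It is not the paper's encoder: it leaves out the temporal persistence entirely, uses a global top-k rather than local inhibition, and the names `make_affine` and `k_sparse_encode`, the weight ranges and the sizes are all assumptions.

```python
import random


def make_affine(n_in, n_out, seed=0):
    """Randomly initialise a weight matrix (n_out rows) and a bias vector."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.1, 0.1) for _ in range(n_out)]
    return weights, biases


def k_sparse_encode(x, weights, biases, k):
    """Apply the affine map, then keep only the k most active units as 1s."""
    activations = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]
    # Indices of the k largest activations (a global winner-take-all).
    winners = sorted(range(len(activations)),
                     key=lambda i: activations[i], reverse=True)[:k]
    sdr = [0] * len(activations)
    for i in winners:
        sdr[i] = 1
    return sdr


weights, biases = make_affine(n_in=4, n_out=16, seed=42)
sdr = k_sparse_encode([0.2, 0.8, 0.1, 0.5], weights, biases, k=3)
```

Whatever the input, the output always has exactly k active units, which is what gives the fixed spatial density mentioned earlier in the thread.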

We have a verification task which trains on text, character-by-character, and then produces text off its own predictions. This consumes just one-hot vectors and produces a softmax output to do the prediction of the next character. We plan to examine what the system does with cortical.io SDRs in coming months.
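For readers unfamiliar with the input and output formats mentioned here, a small sketch of one-hot character encoding and a softmax over next-character scores follows. The three-letter alphabet and the score values are invented for illustration; how the network actually produces the scores is not shown.

```python
import math


def one_hot(char, alphabet):
    """Encode a character as a vector with a single 1.0 at its alphabet index."""
    return [1.0 if c == char else 0.0 for c in alphabet]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


alphabet = "abc"                      # toy alphabet (assumed)
x = one_hot("b", alphabet)            # input for the current character
probs = softmax([1.0, 3.0, 0.5])      # hypothetical scores for the next character
predicted_char = alphabet[probs.index(max(probs))]
```

Feeding `predicted_char` back in as the next input is the usual way such a model produces text off its own predictions.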

Thanks for the reference. As we said on Reddit, these ideas are closely related to Reservoir Computing as well as Echo State networks (which are often described as evolutions of Hopfield Nets). It looks from our experiments that very many kinds of networks can be used inside the encoders (including HTM L4-L2/3 networks) as long as they fulfil a few key principles.

Some updates: George Sugihara (who is one of the leading people working on analysis using Takens' Theorem) sent me this link:

This paper provides empirical evidence of dynamical systems coupling of brain areas at the mesoscale. Ironically, the authors used Takens' Theorem and Sugihara Causality to extract the spatiotemporal causal structure of the measured signals. Most of the causal structure disappeared when the monkeys were unconscious.

@fergalbyrne Hopefully we'll be seeing more Feynman Machine updates soon! As I update my notes I'm running across a few questions:

I'm unfamiliar with diffeomorphism. In your 2015 paper you explain it "means that key analytic properties of the observed system, in particular its divergence characteristics, are replicated in the reconstruction. Predictions made using the replica have the same properties as the near-term future of the observed system." I'd like to know more about these key analytic properties. Do you have a good resource where I can read about diffeomorphism?

Let's use a Feynman Machine vision example. Say the input into a Feynman Machine is a 32x32 grid of pixels. Is this 32x32 grid therefore just one observable local state variable of the entire global state space manifold? If so, what are the other state variables or are they just unknown?

Hi @ddigiorg, thanks for your excellent gloss on the paper.

On Diffeomorphism, as ever, Wikipedia is your friend:

A diffeomorphism is an isomorphism of smooth manifolds. It is an invertible function that maps one differentiable manifold to another such that both the function and its inverse are smooth. [my emphasis]

A diffeomorphism is just a particularly user-friendly function, which allows you to take a problem in one space, solve it in some "easier" space, and translate your solution back to the "difficult" space to apply it. For example, Bengio describes the function of a Deep Learning network as "untangling" the high-curvature manifolds of data in input space by transforming them into a space where they can be linearly separated, say for classification.

Brains, HTMs, and our systems take this idea to the next level, by taking into account that the data not only lives on and near a low-dimensional manifold, but that it moves around on the manifold as time passes, and that the laws of motion are what causes the data to stay in the manifold, or alternatively that the manifold itself is defined by those laws of motion. In essence, the manifold is the union of a dense but zero measure set of trajectories, which locally look like near-parallel bundles of streamlines, and in general form a vector field dictating the future flow of states in the system.

Takens' Theorem states that, for smoothly evolving systems, we can use delay embedding to reconstruct a manifold in "our space" which has essentially the same laws of motion (up to some smooth spatial warping - the diffeomorphism). The preserved properties of most interest for predictive modelling are the Lyapunov exponents, which describe how nearby streamlines diverge (exponentially in a chaotic system) as time goes forward.
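For anyone who has not seen delay embedding in practice, here is a minimal sketch of the construction itself: each reconstructed point is a window of lagged samples from a scalar signal. The function name `delay_embed`, the sine-wave signal and the choice of dimension 3 and lag 5 are illustrative assumptions; choosing good values for a real signal is a separate problem.

```python
import math


def delay_embed(series, dimension, lag=1):
    """Build delay vectors (x[t], x[t - lag], ..., x[t - (dimension - 1) * lag])."""
    if dimension < 1 or lag < 1:
        raise ValueError("dimension and lag must be positive")
    span = (dimension - 1) * lag
    if len(series) <= span:
        raise ValueError("series is too short for this dimension and lag")
    return [
        tuple(series[t - j * lag] for j in range(dimension))
        for t in range(span, len(series))
    ]


# A scalar observation of a simple oscillating system (toy example).
signal = [math.sin(0.1 * t) for t in range(200)]
points = delay_embed(signal, dimension=3, lag=5)
```

Each element of `points` is one location on the reconstructed trajectory; plotting them traces out a closed loop for this periodic signal, even though only one scalar was ever measured.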

Timothy Sauer showed (links from this excellent short review) that this result (for a 1-dimensional signal lagged k times) extends to k-dimensional signals (with or without lags), and Sugihara's work uses mixtures of lagged and unlagged signals.

In HTM, in brains and in the FM, we just absorb whatever the input stream gives us and rely on competitive inhibition to extract the most predictive signals for reconstruction, and on learning to improve predictive power with experience.

To answer your second question, perhaps I should clarify. The input to the FM is a time series of snapshots of data coming from the external system. In your example, each snapshot is a 32x32 or 1024-dimensional vector of pixel values. The whole point of this theory is that the signal need only be derived lawfully (possibly in some unknown, complex way) from the hidden state variables of the external system, and the theorem says the signal carries all the information needed for effective reconstruction of the process generating the signal. This last is important: from the point of view of the agent (brain, HTM or FM), the phenomenon being observed is the signal it generates; anything happening in the "real system" which does not causally affect the signal will simply not be part of the agent's reality.

Outside of the regime in which the theorem must hold (ie most of reality), empirical evidence indicates that the success of reconstruction and thus predictive modelling will gradually degrade as the causal coupling breaks down. The agent can only learn to model some proportion of the reality from the signal alone, the proportion rising with more exposure to the signal (and with improved adaptive learning in the agent), and being capped at some limit by the extent to which useful information is carried by the signal, and/or by capacity limits in the agent. HTM, FM and cortical agents usually try to have surplus capacity to capture the essentials of the phenomenon, and employ coordinated learning algorithms to maximise their ability to quickly adapt to the inputs.

This is perhaps the primary distinction between Deep Learning and HTM/FM/cortical systems: operators of the former use huge datasets, sequential decorrelation of inputs and regularisation to provide far more "moments" than parameters, in an effort to avoid overfitting, while the latter systems exploit the known properties of initially random connections and temporal causal structure, treat the temporal connections between data points following trajectories as carrying as much or more of the information as each spatial snapshot, and thus build models greedily from the minimum of data.

Thanks for the clarification Fergal. I think I now understand the gist of diffeomorphism and how it applies to Takens' theorem. I was getting lost in the formal mathematical definition. Classic Dave...

I'm starting to get a better intuition for these concepts, and now I can really appreciate how powerful dynamical systems principles are for understanding intelligence. It gives a really succinct and formal definition for how our brains, HTM, FM and other machine intelligence systems work.

Basically our universe has "laws" which manifest "order" which our brains "observe". We may think of the universe as a big chaotic dynamical system and picture it as a huge manifold of states.

The "laws" are the hidden states of the manifold. Let's use gravity as an example of a hidden law. We can't see gravity waves.

The "order" is events or behaviors that reoccur, often repeatedly. These are the trajectories that occur close to each other on our manifold. In our gravity example, a ball when dropped always goes to the floor. Planets revolve around the sun. Black holes "suck".

The "observation" is the lens through which the brain, HTM, or FM can see the effects of the universe's laws. By observing something through time we come to understand the hidden principles behind it. By seeing the ball fall to the ground many times we can infer a hidden law is occurring and predict similar events could occur in the future.

Therefore, the FM with a 1024 vector of pixels is just like our eyes: it may see events happening through time and build a model of the hidden "laws" behind the events' occurrence. I'm curious to see if a FM would be able to predict where objects go in a simple 2d environment with some simulated physics.

I know it's an oldish thread but I'd like to chip in if I may. My masters was focused almost entirely on Takens' theorem. My supervisor, the late Jaroslav Stark (BTW his Stark's Embedding theorem incorporates stochastic elements) described diffeomorphisms as the "shadow space" of the generating dynamical system. If we think of measurements on the generating space as mapping from Rn to R1 (can't see how to do superscripts; it's my first post) they are a low dimensional projection of the generating system. As the generating system evolves on its manifold our projected measurement becomes a time series (scalar or vector). Takens showed that under quite general conditions, sufficient measure was preserved to construct a manifold that was diffeomorphic to the original space. It is interesting to note that our visual systems reconstruct a 3-D world from the 2-D projection on the back of our eye. The projective plane was probably the first manifold projection studied during the Renaissance in Italy; the cross-ratio being the preserved measure.
I also find it interesting that Plato's Theory of Forms first tackled this "unshadowing" notion.
FWIW I've found this site searching for ideas to apply convolutional networks in measurement derived manifolds.
thanks
pat