Eric Laukien, Richard Crowder, Fergal Byrne
(Submitted on 13 Sep 2016)
Efforts at understanding the computational processes in the brain have met with limited success, despite their importance and potential uses in building intelligent machines. We propose a simple new model which draws on recent findings in Neuroscience and the Applied Mathematics of interacting Dynamical Systems. The Feynman Machine is a Universal Computer for Dynamical Systems, analogous to the Turing Machine for symbolic computing, but with several important differences. We demonstrate that networks and hierarchies of simple interacting Dynamical Systems, each adaptively learning to forecast its evolution, are capable of automatically building sensorimotor models of the external and internal world. We identify such networks in mammalian neocortex, and show how existing theories of cortical computation combine with our model to explain the power and flexibility of mammalian intelligence. These findings lead directly to new architectures for machine intelligence. A suite of software implementations has been built based on these principles, and applied to a number of spatiotemporal learning tasks.
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Dynamical Systems (math.DS)
“Evolution is often limited by its inability to escape local maxima in the fitness landscape”
I like that line.
We’re having the customary vigorous banter on Reddit if anyone wants to join in/vote.
Video prediction demo now on Youtube:
Recalling video sequences stored in Ogma’s Feynman machines.
Network size: 4 layers, 128x128. Connection radius of 6.
Training time: Approx. 20 seconds per video (depending on video length).
Rolling MNIST conveyor belt anomaly detector demo on Youtube:
Detecting anomalous MNIST digits as they rotate by.
Network size: 4 layers, 64x64, 48x48, 32x32, 16x16.
Training time: Approx. 30 seconds.
Various digits from the regular class (threes) rotate by; the system detects non-three digits when prediction errors pass a certain threshold.
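The thresholding step described above can be sketched in a few lines. This is only an illustration, not the code behind the demo: the function names, the mean-squared-error measure, the toy frames and the threshold value of 0.5 are all assumptions.

```python
def mean_squared_error(predicted, actual):
    """Average squared difference between a predicted and an observed frame."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def flag_anomalies(errors, threshold):
    """Return the time steps whose prediction error exceeds the threshold."""
    return [t for t, e in enumerate(errors) if e > threshold]


# Hypothetical toy data: three tiny "frames" and the model's predictions for them.
predicted = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
observed = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 0.9, 0.0]]

errors = [mean_squared_error(p, o) for p, o in zip(predicted, observed)]
anomalies = flag_anomalies(errors, threshold=0.5)  # threshold is an assumed value
```

Only the second frame, which the predictor gets badly wrong, is flagged; the small mismatch in the third frame stays under the threshold.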
I’m sure the MNIST video will perk up some ears on reddit
I upvoted on reddit but prefer to discuss here.
First I found the ideas interesting. In particular the 4 layer hierarchy
demonstration suggests it scales. This is particularly impressive given
your local circuits between encoder and decoder look inherently unstable,
because they modify the data fed as input to each other. Obviously they are
stable, which is difficult to achieve in practice.
The rolling MNIST digit example is impressive but I’m anxious about what I
can ‘take away’ from it - how big were the training and testing sets of
digit images? I guess not that big if you only trained for 30 seconds? Also
in the video I didn’t see any 3-ambiguous digits such as 8. The set sizes
make a big difference to the quality of the problem posed.
I may have missed it but I think it would be good to see the effect of
adding layers - 1,2,3,4 compared to larger, flat architectures. Ideally
there would be an efficiency benefit to deep rather than wide.
I’d really like to see some more conventional benchmarks even if your
method doesn’t do very well. This is because it’s fine to have a general
method that does X, but also does new feature Y, even if it’s not as good
at X. But to understand the capabilities/limitations we need transferable
My biggest worry right now would be the name. As someone mentioned on
Reddit, you’re doing yourselves a massive disservice with the Feynman name.
Neural Turing machines are a derivation of Turing’s concept. Boltzmann
machines use his equation. Using Feynman doesn’t seem like a homage, it
will have people looking for the connection to your algorithm… QED?! If
nothing else, this will serve as a distraction from the work itself. It
would be better to name it after yourselves, or better, some aspect of the
uniqueness of the approach (e.g. Takens-something).
Final thought - the training time seems really quick. It would be nice to
put some context in there e.g. measuring it in FLOPs to convergence and
comparing against another method (most neuroscientists use e.g. Matlab so
clock-based times can’t be fairly compared to your GPU implementation).
Thanks for the great questions @davidra!
Yes, we can scale arbitrarily deep. Normally, we just add layers until the performance (on the task) starts to saturate versus the slowdown due to the added layers. The best depth depends on the task; many tasks are fine with just two layers, most are in the 4-6 range, and a few need up to 8 levels. We’ll include details of these hyperparameters with each demo when we release them.
It’s a little buried in the paper, but we’re really excited about using the Feynman Machine in distributed environments, where the bottom few layers run on a low-power client (phone, tablet, Raspberry Pi), and we can dynamically augment it with a stack of upper layers running in the cloud. Inter-region comms are just SDRs, so it’s really robust and performant.
In terms of width, we normally start with a guess of 128 (ie 128x128 units per region), and then go bigger or smaller as appropriate (on GPUs we can make pretty big regions). The heuristic is that a region is either too small, or big enough, and we can see a big drop in the capability if we drop below that boundary. We should do a paper on depth/width tradeoffs, thanks for the question.
I believe the regions are stable because of the fixed spatial and temporal sparsity - ahem, density - of the encoder hidden outputs, and because the decoder is used only for learning in the encoder, not in the activation phase. The temporal sparsity is important, and there are a few ways to achieve it, like having activity-dependent decaying biases on each unit, or zeroing the ramping activations after firing. Also, the top-down feedback tends to smooth out the effect of the fast-changing encoder on the downward pass. Finally, the use of local receptive fields and local inhibition tends to limit instability. We’ve tried a number of connection schemes and rejected the ones that cause runaway activity.
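One possible reading of "activity-dependent decaying biases" combined with fixed spatial sparsity is sketched below. It is an assumption-laden toy, not the scheme used in the paper's software: `k_sparse_step`, the `penalty` and `decay` values, and the use of plain top-k selection in place of local inhibition are all hypothetical.

```python
def k_sparse_step(activations, biases, k, penalty=1.0, decay=0.5):
    """One time step: pick the k highest-scoring units, then update the biases.

    Units that fire receive a negative bias (penalty) that decays back towards
    zero on later steps, which discourages the same units from firing again
    immediately (temporal sparsity). Exactly k units are on (spatial sparsity).
    """
    scores = [a + b for a, b in zip(activations, biases)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    active = set(ranked[:k])
    sdr = [1 if i in active else 0 for i in range(len(scores))]
    new_biases = [
        b * decay - (penalty if i in active else 0.0)
        for i, b in enumerate(biases)
    ]
    return sdr, new_biases


# Hypothetical constant input: without the biases, unit 0 would win every step.
activations = [0.9, 0.8, 0.1, 0.05]
biases = [0.0, 0.0, 0.0, 0.0]
history = []
for _ in range(3):
    sdr, biases = k_sparse_step(activations, biases, k=1)
    history.append(sdr)
```

With a constant input the winner alternates between the two strongest units instead of locking onto one, while the number of active units per step stays fixed at `k`.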
I’ll ask @ericlaukien to answer this one in depth, but briefly it’s not meant to be an exhaustive demonstration of performance on MNIST itself, but a demo of visual odd-man-out detection, using failure to model the spatiotemporal evolution of the stimulus to indicate a visual anomaly.
The training time usually refers to the network running and learning at full-speed (hundreds of fps on a consumer GPU), not the framerate in the videos, which is slowed right down so we humans can see what’s going on.
We test each iteration on more than a dozen tasks to identify how it rates relative to previous designs. Some of the tasks are proxies for standard benchmarks, reduced in size to speed the workflow, others are full-size. Most designs we’ve tried are either really poor at most things, or pretty good at most things and great at some subset. We’ll be pushing out updates about performance on standard tasks over the next few weeks and months.
I understand that point, but we really like the name for many reasons. I genuinely believe that if Feynman had lived, he would have discovered this connection years ago, as he spent the last decade of his life working with Hopfield, Wolfram and others, on cellular automata, dynamical systems and neural networks. His sister, Joan, who used Takens’ Theorem in one of her seminal papers, would surely have been a part of it too - if so, we might have to call it the Feynman-Feynman Machine. She’s still working after retiring a few years ago, so perhaps we can ask her.
Yes, one of @ericlaukien’s main quality criteria is whether something takes seconds or minutes to train. Any more than a few minutes usually indicates you’ve gone wrong somewhere. As I said earlier, on many tasks you need quite a big layer size, because you want to have very sparse SDRs but still have many units on, as temporal sparsity requires more active units per SDR. And the only way to run big layers (512x512 or bigger) is on a GPU; the CPU versions are hundreds of times slower.
@riccro is looking at Matlab integration, so people will be able to plug this into their current workflow.
Thanks again for the great questions and comments. It’s a pleasure to be able to talk details now after several months of radio silence!
To elaborate on the MNIST anomaly detection demo:
There are several classes of 3’s being used, but more importantly, it never sees what an anomalous digit looks like at training time. It only sees 3’s. The demo will be made available soon, so people can experiment with it to test for any criteria they like. Right now it gets 100% accuracy; we can make the task more difficult until it doesn’t, though, and report where the problems lie.
For performance tests, we will definitely look into doing some proper FLOPS tests as you suggest. Right now it’s not very scientific: we just run it, and if we get bored in 1 minute then it’s too slow.
Overall though, it should be several hundreds of times faster than backpropagation-based systems, if only for the fact that it doesn’t need to run several iterations on a replay buffer. However, whether this hurts the power of the algorithm is unknown right now.
Thanks for the detailed answers Fergal and Eric
Of course it’s easy for me to sit back and demand more rigor and more
results and more analysis but I know these things take a lot of time.
One thing though - training on all the ‘3’s would be good to see, because
this changes the quality of the problem posed. When only a fraction of the
3 digit images are used, the algorithm is able to overfit these templates
and anomaly classification becomes easier, because the non-anomalies are
very well defined. If you had a wider range of ‘3’ samples (ideally use the
whole training set, then test on all the test set ‘3’ images) this would be
I don’t quite understand the paper’s Predictive Hierarchies section (p 11-12) – why do you say you need to add extra perceptrons or a feed-forward net in order to get predictions? Aren’t these predictions already formed by the decoder (eq. 14)?
My guess: when you say “add several perceptrons” for predictive hierarchy type 1 that is just referring to the decoders already described.
Also, you seem to have real-valued inputs x but given that the signals between layers are binary SDRs, I guess it would be fine also to use binary SDR input, as HTM does. Obviously there is still a need for a stimulus encoding step, whether into real or binary form, so I guess you will want to build stimulus encoders for sound, text etc.
I’m excited to see what can come from this new approach.
Thanks for the comments/questions. I think we’ll stick with Feynman
Yes, the standard hierarchy - the Sparse Predictive Hierarchy - is just the encoder-decoder pairs as described in the previous section (pp 7-11). The alternative described in pp 11-12 is called the Routed Predictive Hierarchy, and is basically a stack of encoders each paired with a layer from another kind of network (the pseudocode in S2 uses an inverted feedforward multi-layer perceptron for the slave network).
The difference between the two is that in the SPH, feedback and lateral inputs are combined additively (either before, or more usually after, being transformed), while in the RPH they are multiplied. Because the encoder outputs are usually SDRs, this causes the encoder to route only a subset of the feedback signals passing down the hierarchy. This is a kind of “spatiotemporal Dropout”, which chooses unique sparse subnetworks of the controlled network for each timestep. This acts as a regulariser for the slave network, and also makes it faster to train, because only a few units need be considered for each update, and the gradients are typically much simpler.
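The additive versus multiplicative distinction can be seen on a toy vector. This is a sketch only: the function names and the numbers are made up, and the real hierarchies apply learned transforms that are omitted here.

```python
def combine_additive(lateral, feedback):
    """SPH-style combination: every feedback value contributes to the sum."""
    return [l + f for l, f in zip(lateral, feedback)]


def combine_multiplicative(sdr, feedback):
    """RPH-style routing: feedback passes only where the encoder SDR is active."""
    return [f if s else 0.0 for s, f in zip(sdr, feedback)]


# Hypothetical values: a 2-of-6 sparse encoder output and a dense feedback signal.
sdr = [0, 1, 0, 0, 1, 0]
feedback = [0.3, -0.7, 0.2, 0.9, 0.5, -0.1]

added = combine_additive(sdr, feedback)
routed = combine_multiplicative(sdr, feedback)
units_to_update = [i for i, s in enumerate(sdr) if s]
```

In the multiplicative case only the units selected by the SDR carry any signal, so an update need only touch `units_to_update`; a different SDR at the next timestep selects a different sparse subnetwork.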
Yes, the external inputs can be either real or binary, whatever your data is, and we typically find k-sparse SDRs are best for internal connections for a number of reasons. We prefer to do sensory encoding (in the HTM sense of X-to-SDR) in the bottom encoder, as we want to get the predictions out in the same or similar form as the data, and this allows both encoding and decoding to be jointly learned.
The bottom encoders described in the paper are just randomly-initialised affine transformations, with the temporal persistence at derived input and stimulus levels, followed by k-sparse inhibition. This works directly for most kinds of data we’ve thrown at it, and is what’s normally used in every level above the bottom. We’ve also been looking at encoders for the bottom level which do more work in generating SDRs in order to give them better properties for feeding to the upper levels; we’ll be talking about these in an upcoming paper.
We have a verification task which trains on text, character-by-character, and then produces text off its own predictions. This consumes just one-hot vectors and produces a softmax output to do the prediction of the next character. We plan to examine what the system does with cortical.io SDRs in coming months.
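For anyone unfamiliar with the input and output formats mentioned here, a minimal sketch of one-hot character encoding and a softmax readout follows. The three-letter alphabet and the output scores are invented for illustration; nothing here reproduces the actual verification task.

```python
import math


def one_hot(ch, alphabet):
    """Encode a character as a vector with a single 1.0 at its alphabet index."""
    return [1.0 if c == ch else 0.0 for c in alphabet]


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


alphabet = "abc"                  # hypothetical tiny alphabet
x = one_hot("b", alphabet)        # input for the current character
scores = [0.1, 0.2, 2.0]          # hypothetical raw outputs of a predictor
probs = softmax(scores)
next_char = alphabet[probs.index(max(probs))]
```

To generate text from its own predictions, `next_char` would be one-hot encoded and fed back in as the next input.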
A couple of updates:
The paper made MIT Tech Review’s Best of the Physics arXiv for the week it appeared.
We’ll be releasing the first of the software from the paper tomorrow on our Github page.
The question is if the human brain is mostly an adaptive memory system that molds itself to the lower dimensional manifolds that real world data exists on. That would greatly reduce the complexity necessary for any planning or reasoning system that might exist in the human brain.
That kind of makes sense because evolving a billion logic gates as in a CPU is not really so possible. A black box memory model (with say stacked autoencoders) could be an easy way to get to human level AI. Maybe all you need is one of these Dell EMC products:
Also cent per actuator robotics is on the doorstep:
You may be already familiar with this, but I came across a machine learning technique that uses dynamical systems called reservoir computing. Here’s the article where I encountered it: http://phys.org/news/2016-10-self-learning-tackles-problems-previous.html
Thanks for the reference. As we said on Reddit, these ideas are closely related to Reservoir Computing as well as Echo State networks (which are often described as evolutions of Hopfield Nets). It looks from our experiments that very many kinds of networks can be used inside the encoders (including HTM L4-L2/3 networks) as long as they fulfil a few key principles.
Some updates: George Sugihara (who is one of the leading people working on analysis using Takens’ Theorem) sent me this link:
This paper provides empirical evidence of dynamical systems coupling of brain areas at the mesoscale. Ironically, the authors used Takens’ Theorem and Sugihara Causality to extract the spatiotemporal causal structure of the measured signals. Most of the causal structure disappeared when the monkeys were unconscious.
I’m unfamiliar with diffeomorphism. In your 2015 paper you explain it “means that key analytic properties of the observed system, in particular its divergence characteristics, are replicated in the reconstruction. Predictions made using the replica have the same properties as the near-term future of the observed system.” I’d like to know more about these key analytic properties. Do you have a good resource where I can read about diffeomorphism?
Let’s use a Feynman Machine vision example. Say the input into a Feynman Machine is a 32x32 grid of pixels. Is this 32x32 grid therefore just one observable local state variable of the entire global state space manifold? If so, what are the other state variables or are they just unknown?
EDIT: I answered one of my own questions.
Hi @ddigiorg, thanks for your excellent gloss on the paper.
On Diffeomorphism, as ever, Wikipedia is your friend:
A diffeomorphism is an isomorphism of smooth manifolds. It is an invertible function that maps one differentiable manifold to another such that both the function and its inverse are smooth. [my emphasis]
A diffeomorphism is just a particularly user-friendly function, which allows you to take a problem in one space, solve it in some “easier” space, and translate your solution back to the “difficult” space to apply it. For example, Bengio describes the function of a Deep Learning network as “untangling” the high-curvature manifolds of data in input space by transforming them into a space where they can be linearly separated, say for classification.
Brains, HTMs, and our systems take this idea to the next level, by taking into account that the data not only lives on and near a low-dimensional manifold, but that it moves around on the manifold as time passes, and that the laws of motion are what causes the data to stay in the manifold, or alternatively that the manifold itself is defined by those laws of motion. In essence, the manifold is the union of a dense but zero measure set of trajectories, which locally look like near-parallel bundles of streamlines, and in general form a vector field dictating the future flow of states in the system.
Takens’ Theorem states that, for smoothly evolving systems, we can use delay embedding to reconstruct a manifold in “our space” which has essentially the same laws of motion (up to some smooth spatial warping - the diffeomorphism). The preserved properties of most interest for predictive modelling are the Lyapunov exponents, which describe how nearby streamlines diverge (exponentially in a chaotic system) as time goes forward.
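Delay embedding itself is simple to write down. The sketch below builds delay vectors from a scalar series; `delay_embed`, the embedding dimension and the lag are illustrative choices, and picking good values for real data is a separate problem not addressed here.

```python
def delay_embed(series, dim, lag=1):
    """Build delay vectors (x[t], x[t - lag], ..., x[t - (dim - 1) * lag]).

    Each returned tuple is one point on the reconstructed manifold; the
    sequence of points traces the reconstructed trajectory.
    """
    span = (dim - 1) * lag  # how far back the oldest coordinate reaches
    return [
        tuple(series[t - j * lag] for j in range(dim))
        for t in range(span, len(series))
    ]


# Hypothetical scalar observations of some system.
series = [0, 1, 2, 3, 4, 5]
points = delay_embed(series, dim=3, lag=2)
```

Here each observation is paired with the values two and four steps earlier, giving the points `(4, 2, 0)` and `(5, 3, 1)`.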
Timothy Sauer showed (links from this excellent short review) that this result (for a 1-dimensional signal lagged k times) extends to k-dimensional signals (with or without lags), and Sugihara’s work uses mixtures of lagged and unlagged signals.
In HTM, in brains and in the FM, we just absorb whatever the input stream gives us and rely on competitive inhibition to extract the most predictive signals for reconstruction, and on learning to improve predictive power with experience.
To answer your second question, perhaps I should clarify. The input to the FM is a time series of snapshots of data coming from the external system. In your example, each snapshot is a 32x32 or 1024-dimensional vector of pixel values. The whole point of this theory is that the signal need only be derived lawfully (possibly in some unknown, complex way) from the hidden state variables of the external system, and the theorem says the signal carries all the information needed for effective reconstruction of the process generating the signal. This last is important: from the point of view of the agent (brain, HTM or FM), the phenomenon being observed is the signal it generates; anything happening in the “real system” which does not causally affect the signal will simply not be part of the agent’s reality.
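Concretely, the input stream in that example is just a sequence of 1024-dimensional vectors, one per time step. A toy illustration, in which the checkerboard frames and the `flatten` helper are made up:

```python
def flatten(frame):
    """Turn a 2-D grid of pixel values into one flat vector, row by row."""
    return [pixel for row in frame for pixel in row]


# Hypothetical stream of three 32x32 frames: a checkerboard that flips each step.
frames = [
    [[(t + r + c) % 2 for c in range(32)] for r in range(32)]
    for t in range(3)
]

signal = [flatten(frame) for frame in frames]  # one 1024-vector per time step
```

Everything the agent learns has to come from `signal`; whatever hidden state produced the frames is never observed directly.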
Outside of the regime in which the theorem must hold (ie most of reality), empirical evidence indicates that the success of reconstruction and thus predictive modelling will gradually degrade as the causal coupling breaks down. The agent can only learn to model some proportion of the reality from the signal alone, the proportion rising with more exposure to the signal (and with improved adaptive learning in the agent), and being capped at some limit by the extent to which useful information is carried by the signal, and/or by capacity limits in the agent. HTM, FM and cortical agents usually try to have surplus capacity to capture the essentials of the phenomenon, and employ coordinated learning algorithms to maximise their ability to quickly adapt to the inputs.
This is perhaps the primary distinction between Deep Learning and HTM/FM/cortical systems: operators of the former use huge datasets, sequential decorrelation of inputs and regularisation to provide far more “moments” than parameters, in an effort to avoid overfitting, while the latter systems exploit the known properties of initially random connections and temporal causal structure, treat the temporal connections between data points following trajectories as carrying as much or more of the information as each spatial snapshot, and thus build models greedily from the minimum of data.
Thanks for the clarification Fergal. I think I now understand the gist of diffeomorphism and how it applies to Takens’ theorem. I was getting lost in the formal mathematical definition. Classic Dave…
I’m starting to get a better intuition for these concepts, and now I can really appreciate how powerful dynamical systems principles are for understanding intelligence. It gives a really succinct and formal definition for how our brains, HTM, FM and other machine intelligence systems work.
Basically our universe has “laws” which manifest “order” which our brains “observe”. We may think of the universe as a big chaotic dynamical system and picture it as a huge manifold of states.
- The “laws” are the hidden states of the manifold. Let’s use gravity as an example of a hidden law. We can’t see gravity waves.
- The “order” is events or behaviors that reoccur, often repeatedly. These are the trajectories that occur close to each other on our manifold. In our gravity example, a ball when dropped always goes to the floor. Planets revolve around the sun. Black holes “suck”.
- The “observation” is the lens through which the brain, HTM, or FM can see the effects of the universe’s laws. By observing something through time we come to understand the hidden principles behind it. By seeing the ball fall to the ground many times we can infer a hidden law is occurring and predict similar events could occur in the future.
Therefore, the FM with a 1024 vector of pixels is just like our eyes: it may see events happening through time and build a model of the hidden “laws” behind the events’ occurrence. I’m curious to see if an FM would be able to predict where objects go in a simple 2d environment with some simulated physics.
I know it’s an oldish thread but I’d like to chip in if I may. My master’s was focused almost entirely on Takens’ theorem. My supervisor, the late Jaroslav Stark (BTW his Stark’s Embedding theorem incorporates stochastic elements) described diffeomorphisms as the ‘shadow space’ of the generating dynamical system. If we think of measurements on the generating space as mapping from Rn to R1 (can’t see how to do superscripts; it’s my first post) they are a low dimensional projection of the generating system. As the generating system evolves on its manifold our projected measurement becomes a time series (scalar or vector). Takens showed that under quite general conditions, sufficient measure was preserved to construct a manifold that was diffeomorphic to the original space. It is interesting to note that our visual systems reconstruct a 3-D world from the 2-D projection on the back of our eye. The projective plane was probably the first manifold projection studied during the Renaissance in Italy; the cross-ratio being the preserved measure.
I also find it interesting that Plato’s Theory of Forms first tackled this ‘unshadowing’ notion.
FWIW I’ve found this site searching for ideas to apply convolutional networks in measurement derived manifolds.