A different point of view on building AI system

stepan · November 24, 2017, 3:09pm

Hi the HTM community,

I want to present a different approach to developing an AI model. I hope you find it interesting.
All criticism, questions, and opinions are welcome.

First, a little about my work. I developed a new type of artificial neural network (this type is closest to auto-associative networks, I suppose). I am well aware that the HTM community’s position is “the model have to implicate the real structure of the brain”, but “we don’t need to create an exact replica, just a system that demonstrates the important properties” (Subutai Ahmad).

Instead of modeling the brain (as you do) or modeling a specific task (as many AI researchers do), I was trying to make a network that could work on any task without adjusting the architecture for each one.
I started with a model drawn from the book «On Intelligence» and from other scientists’ work. That model worked on a classification task. However, when I tried to apply the network to controlling “an animal in a maze”, I discovered that the model has to be able to process temporal sequences of patterns to do that task.
Therefore, I modified the model, made it work on the current task, and moved on to another. Unlike most researchers, I wasn’t trying to get state-of-the-art results on just one task. I was curious: “Could I make the network do this? Would it be able to do that?”. So I kept finding tasks that my model would fail on and improving the architecture.
I noticed that the more complex the model, the more “narrow” it is. So, in the process of testing on different tasks, my model became very simple. Currently, it has a very general architecture.

When I started comparing my network with HTM, I was surprised by how similar they are. Not in structure or algorithms, but they are based on the same principles and have a lot of common properties.

Let’s see:

Temporal sequence processing. My network works with sequences of patterns and uses prediction to do that. Like your model.
Continuous learning. I completely agree that the model has to work in “real time” on a continuous stream of data. The network simultaneously learns new data and processes input.
The model processes any type of input (visual, audio, etc.) the same way. Like HTM with its SDRs, my network operates on sequences of binary patterns. For the network, there is no difference between a pattern from vision and one from hearing. That property also allows combining different kinds of input (as you combined a sensory pattern with the location in the recent paper).
Predictions are the key. Actually, for me, that was the most doubtful part of your theory. Nevertheless, it turns out that my model works in a similar way. I just call it differently: “associative activation”. For example, when we recall what we had for lunch, that’s not really a “prediction”, but we use the same “prediction” mechanism to do it.
Inhibition. The initial model did not include any inhibition, but it turns out that inhibitory connections are absolutely necessary for some tasks.
Creating new connections between neurons.
Only a small part of the neurons activates on each timestep.
Hierarchy. There is enough evidence that the brain processes information hierarchically and uses generalizations.

But now, let me point out some differences between my network and HTM.

My network doesn’t have layers. Yes, the network is hierarchical but has no separate layers. I started with a fixed hierarchical structure (where the receptive field of a neuron in layer N can only include neurons from layers N and N-1), but I discovered that it just doesn’t work for some tasks. Now, the receptive area of a neuron can consist of neurons all over the network and combine representations from different levels. I would be glad to provide more details and examples. What do you think: could a biological neuron have a receptive field in multiple layers, regardless of their location?
Actually, my network doesn’t have parameters at all (*). The network is universal for every task. Its architecture is dynamic. At the beginning, the network has zero neurons. I know that is against biology, but you do something similar in your model. You make connections between existing neurons, but for me it is easier to just make a new neuron with the necessary connections. And this approach actually does work.
Different patterns don’t mix up. Similar patterns associatively connect, but new data do not overwrite existing data. That property gives the network the ability to learn new tasks without spoiling old knowledge.
By the way, the previous two points (the dynamic architecture and the fact that patterns don’t overwrite each other) lead to another interesting property.
Theoretically infinite capacity. The network can receive new inputs, learn new things, and grow. But because on each timestep only a small, associatively connected part of the network becomes active, even a huge network can work with the same performance as a small one. And we can create neurons and connections until we run out of space on the hard drive.
Hierarchical learning. This one is best explained by an example:
First, we train the network to recognize squares and circles.

Then we train it to recognize a “button”, which consists of this circle and this square:

Next, we feed the network with this picture:

This picture has almost no overlapping inputs with the previous one, but it will still be recognized as a “button” (because it still consists of a circle and a square). We don’t need to train the network on all combinations of squares and circles for robust recognition. This property allows the network to learn a new thing much faster when it consists of features the network already knows.
This example is very simplified. If you are interested, I’ll give you the real results and a detailed description of how the network does this task.
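To make the “button” example concrete, here is a minimal Python sketch. It is not the actual network: the class `HierarchicalMemory`, the helpers `normalize` and `shift`, and the crude pixel shapes are all hypothetical, and position invariance is faked by shifting each blob to the origin (the real system uses saccades instead). It also assumes the picture has already been split into separate blobs.

```python
def normalize(pixels):
    """Shift a set of (x, y) pixels so that its bounding box starts at (0, 0)."""
    min_x = min(x for x, _ in pixels)
    min_y = min(y for _, y in pixels)
    return frozenset((x - min_x, y - min_y) for x, y in pixels)


def shift(pixels, dx, dy):
    """Return a copy of the pixel set moved by (dx, dy)."""
    return {(x + dx, y + dy) for x, y in pixels}


class HierarchicalMemory:
    """Hypothetical two-level memory: pixels -> shapes -> composites."""

    def __init__(self):
        self.shapes = {}      # normalized pixel set -> shape name
        self.composites = {}  # frozenset of shape names -> composite name

    def learn_shape(self, name, pixels):
        self.shapes[normalize(pixels)] = name

    def learn_composite(self, name, parts):
        self.composites[frozenset(parts)] = name

    def recognize(self, blobs):
        """Name each blob, then look up the set of names as one composite."""
        parts = frozenset(self.shapes.get(normalize(b)) for b in blobs)
        return self.composites.get(parts)


SQUARE = {(0, 0), (1, 0), (0, 1), (1, 1)}
CIRCLE = {(1, 0), (0, 1), (2, 1), (1, 2)}  # crude 3x3 ring standing in for a circle

memory = HierarchicalMemory()
memory.learn_shape("square", SQUARE)
memory.learn_shape("circle", CIRCLE)
memory.learn_composite("button", {"circle", "square"})

# Two pictures that share no pixels at all.
picture_1 = [shift(CIRCLE, 0, 0), shift(SQUARE, 5, 0)]
picture_2 = [shift(CIRCLE, 10, 10), shift(SQUARE, 3, 7)]

print(memory.recognize(picture_1))
print(memory.recognize(picture_2))
```

The “button” is learned once, as a combination of two known features, so the second picture is recognized without any training on its raw pixels.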
For my network, there is no difference between input and output. That property gives it the ability to produce really flexible output, not predefined in any way. It also allows the network to do cool things like “imaginary” input. Let me show you:
First, we train the network to recognize a square:

Then we show it a partial square:

It will still be recognized as a square. You may notice that in case B2 the receptive field doesn’t contain anything. So the network will activate the “predicted” input (red dotted line) and keep moving until it encounters the bottom corner.

By the way, the network controls the moves of the receptive area. That’s like saccades. This move is an output, but simultaneously it’s an input (because the next prediction clearly depends on the direction of the move). This approach actually makes visual recognition very robust.
Generalization and inhibition. It turns out that generalization cannot be “automatic”. It has to be dynamic. This is quite contrary to HTM, so let’s look at the picture:

In case A, you would probably recognize figure 1 as some kind of fruit. In case B, figure 3 is exactly the same as A1, but I suppose you’ll take it for a ball.
This task can’t be done by just relying on self-organizing topology. That’s why the model needs inhibitory connections, which can deactivate a representation depending on the input and the context.
"One-shot learning". Some tasks require learning from just a few examples. Hierarchical learning allows that.
Self-learning. The network can evaluate its own actions. That’s done the same way, through associative activations. It turns out that this property is absolutely necessary for dynamic generalization.

Everything described above can be done by just creating and using associative connections.
The network clearly goes against biology. But I didn’t intend to model the brain. I was simply trying to make a universal architecture. Over the past 3 years, I have tested the network on such tasks as classification, visual recognition (with saccades), text dialog, dialog + visual recognition, “animal in a maze”, “Tic-tac-toe” (including a quite interesting “blindfold” version), the “Pac-Man” game, and some logical tasks (including the “Winograd Schema Challenge”).

My network isn’t meant to displace HTM. I think it’s more like a “look from a different point of view”. And I would love to see your “neuroscientific” opinion about my model.
I greatly appreciate any questions and criticism. I would be glad to provide more information if you are interested in something specific.

(*) The network doesn’t have the usual parameters, like learning rate, number of layers, activation function, etc. It has only one parameter: the number of representations (neurons) that can be recognized (activated) simultaneously. It has a huge effect on the complexity of the tasks the network is able to perform. And it actually correlates in a pretty interesting way with some anthropology experiments.

P.S. In addition, on the topic “How the allocentric locations are encoded for SMI?” @rhyolight mentioned that “we’re trying to figure out how and where this location signal is generated”.
My system doesn’t include a location signal. It uses another technique to achieve the same goal. Can I propose my method for your consideration?

rhyolight · November 24, 2017, 3:38pm

Interesting work! I assume you are still using an “HTM Neuron”? Meaning you are simulating the dendritic spike and predicted states of neurons?

There’s got to be some parameters. How many other cells can a neuron connect to? What is the maximum number of dendritic segments a neuron can grow? How many max synapses on a segment? How easily are they created? How fast do they grow and degrade? These are all global properties of groups of HTM neurons. I don’t really want the answers to all these questions, just wondering how they can be hard-coded.

fine2100 · November 24, 2017, 4:18pm

Hi Stepan
Interesting to read your story, seems like you have cleverly composed something very interesting.

I have one question: you say you don’t have any layers… do you mean physical or virtual layers? In my opinion, with any time-based separation of data (some data are past and anticipated from the past now, some are future or assumed in the future now, some are present and streaming in real time, repeated in the moment of decision, and some come from the now, like start/stop moving)… no matter how data are physically stored and recalled, they belong to different data sources in time, and they are thus layered virtually if stored in the same physical media… Looking forward to reading your reaction… because one thing is what is done, another is how the thinking behind it is aligned to certain principles that we are not always completely aware we are following…

Basically all this modelling, taking biology or not biology as an axiom, ends up being modelling some aspects of physics.

But I really like that input is no different than output - do you mean representation and/or content/meaning wise?

Regards
Finn

stepan · November 24, 2017, 5:05pm

Hello fine2100,
Really glad to see your reaction. I’ll try my best to answer your questions.

Yes, you’re right. There always should be some separation. Many current researchers propose that it comes from the time delay between layers, so that we would have a few patterns simultaneously.
In my model, there is no timing. The separation goes through the context.
About classic timing separation, here is an example where it wouldn’t work:
Let’s say I send you a message: “My”. I guess you would wonder what it’s about.
The next day I send you a message: “name is”. And now you would be getting a clue. But how? No delay could possibly keep the previous pattern “My” active for a whole day.
I propose that the context, like “a message from Stepan”, associatively activates the pattern “My”. Then “My” → “name is” connects to the current pattern. What do you think?

The shortest description of my theory: “Everything goes by associative activations from current input and context”. A context is also just associatively activated neurons.
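The “My” / “name is” example can be sketched in a few lines of Python. This is only an illustration of the idea that context, not a time delay, brings the previous pattern back; the class `ContextMemory` and its fields are hypothetical names, not part of my program.

```python
from collections import defaultdict


class ContextMemory:
    """Hypothetical sketch: the context re-activates the last pattern seen in it."""

    def __init__(self):
        self.last_in_context = {}             # context -> last pattern seen there
        self.next_pattern = defaultdict(set)  # (context, pattern) -> patterns that followed

    def observe(self, context, pattern):
        """Store the pattern and return what the context associatively re-activated."""
        previous = self.last_in_context.get(context)
        if previous is not None:
            # "previous -> pattern" is connected, no matter how much time has passed.
            self.next_pattern[(context, previous)].add(pattern)
        self.last_in_context[context] = pattern
        return previous


memory = ContextMemory()
first = memory.observe("message from Stepan", "My")
memory.observe("message from Finn", "Hello")  # unrelated input in between
recalled = memory.observe("message from Stepan", "name is")

print(first)     # nothing to recall on the first message
print(recalled)  # the context brings "My" back
```

Nothing in the sketch measures time: the second message could arrive a day or a year later, and the context would still activate “My”.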

I’m sorry, I’m not a native speaker. Could you please rephrase?
The network is autoassociative. The output was an input once. Then the network got the ability to associatively activate it.
For example, we have a receptor R1 (a dot) and an acceptor M1 (a move “up”). First, we activate R1 and M1 simultaneously - they connect together. Then when the system sees a dot R1 it activates M1 and makes a move “up”.
Exactly the same goes for other inputs (R1 activates R2) and even high-level representation.

stepan · November 24, 2017, 5:28pm

Your questions are very insightful. The model of the neuron is the key. Explanations would be much easier if I just drew diagrams of the neuron model and the structure of the network.
But I haven’t published my work yet, and the details of the structure and the algorithm are all I have right now.
I hope it won’t stop you from reading. Your opinion is very valuable to me.

But I’ll try to answer your questions in as much detail as possible.

“I assume you are still using an “HTM Neuron”?”. No, I’m not using the “HTM Neuron”. In fact, my model of a neuron is even simpler than the conventional one. But it allows using associative connections between neurons and creates a hierarchy.
“you are simulating the dendritic spike and predicted states of neurons” I kind of simulate the dendritic spikes and predicted states of neurons, but in a very simplified form. However, it turns out that this is enough to perform most tasks.
“There’s got to be some parameters.” Nope, the only parameter that can be changed by the user is the one I mentioned in the first post. The network doesn’t have weights. Everything is defined by the associative connections that neurons have between each other.
“What is the maximum number of dendritic segments a neuron can grow?”. In my model, the neuron’s dendrites actually define the neuron (there can’t be two neurons with the same set of dendrites). The dendrites don’t change. But axons (or their terminals, I guess) are constantly being added.
I can give examples of that: once we remember some pattern, it never changes. But this pattern can associatively connect to new patterns, like a feature can become a part of a more complex object.
In my program, I restrict the max count of dendrites per neuron, but only because it does not work in parallel. Theoretically, the number of dendrites could have no limit.
“How easily are they created?”. There is no activation function. They literally “fire together, wire together”. Simultaneously active neurons associatively connect into a pattern (which is also represented by a neuron).
“How fast do they grow and degrade?”. My system is not biological. Connections grow instantly. And on test tasks, I didn’t see a need for degrading neurons or connections.

These are not just thoughts. I tried a lot of different approaches (I have over a hundred versions of the program, and each version has some significant change in the network’s algorithm). The only criterion was “would it work on every task”. Most of them didn’t.
But the current version of the program can do everything I described. And the theory explains a lot.

curt · November 24, 2017, 5:54pm

Can you give us a little more specifics of how your neurons work rather than just talk in abstractions?

Are the inputs binary 1 0 values? Or numeric? Or something else? It sounds like you might be using binary input signals (two-value signals).

Do your neurons produce a single 1 or 0 as output and what you call “fire” means a 1 output?

Is your network synchronous? Meaning at a given time step, your network is given a new input vector of binary values, then your code calculates the activation of all the neurons? And are some of the neurons in your large and growing set of neurons defined as the “outputs” of the system? Or are all the neurons considered possible “outputs”?

You talked about training your network without ever telling us what type of learning is taking place. Is it all an unsupervised learning system where it’s just learning to recognize patterns in the inputs (like HTM)? Unsupervised systems can’t learn to do something like solve a maze. Such a learning problem requires some type of definition of a goal. How do you define goals? How does it learn that getting through the maze is “good”? Are you doing reinforcement learning (reward based) or some other way of defining the output behavior goal?

You talked about association learning without being specific. You talked about “wire together fire together” but not about how an association is created. Is the association a hard binary association (these two signals are associated or not associated)? To implement stronger learning in noisy environments you need a weighted probabilistic version of associations, in some form – like a threshold measure where the association doesn’t get made until they “fire together enough” or something like that.

When you were asked about “Growth” they were not talking about ability. They were asking how fast your number of neurons grows over time for a typical learning experiment. Does the number of neurons your system dynamically adds tend to just keep growing linearly for as long as you train? Does the growth rate (neurons added per time step) tend to slow over time as it learns all there is to know?

In a real-world noisy environment, the growth rate is likely going to be unbounded and problematic for your design if I’m guessing correctly about what you are doing. I think you said you never deleted neurons but only added them.

I have more questions, but if you are willing to share more specifics I would like to see them.

I think you are on the right path to more general learning, and I always like to see the design approaches people have come up with for general learning systems.

Curt

rhyolight · November 24, 2017, 6:15pm

Hierarchy is certainly possible without the HTM Neuron.

I would be interested in how you did this without separating out proximal vs other dendritic segments (the HTM Neuron).

Same in HTM. No two neurons have the same dendrites, as a property of the system.

Ok, maybe not, but synapses on those dendrites should change, even go away entirely. If axons are constantly connecting to dendrites, then synapses are being created. How can there be no parameter to decide how quickly these synapses are created? And there must be no concept of synaptic permanence either then? HTM depends on Hebbian learning, which comes along with a set of parameters. Does your method rely on Hebbian learning?

This is not how it works in the brain or HTM. The brain forgets things and learns new things. I understand that your system is supposed to grow infinitely, but then the search space increases for matching patterns.

I don’t understand this. I was asking how easily new synapses are created, that is not an activation function. When your system starts, you said there are no synapses. So what happens to create a synapse?

This is a core concept in HTM. How do you get unsupervised learning without it?

Let’s see the code!

stepan · November 24, 2017, 7:27pm

Hello Curt,

It’s a pleasure to see how you formulate your questions.
Ok, convinced, let’s see some specifics.

Basically, that’s the structure:

Simple, isn’t it?
The input is a binary pattern (a receptor can be active or not active). Dimensionality is not fixed.
The algorithm just connects together everything active on the current timestep through another neuron (which is usually created on the current timestep).

The network is synchronous. The network is auto-associative.

On one timestep we activate receptor R1 (dot) and acceptor M1 (move). They connect through a new neuron O1. On some other timestep, when we give the input R1, it activates the move M1 and the representation of that pattern, O1.
The system doesn’t have any predefined inputs and outputs. So it has to be trained to make them (in the program I mostly use “reflexes” to create initial connections). But that makes the system really flexible.
After such initialization, the network starts to activate every connection and every neuron. And we have to inhibit the wrong actions.
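A minimal Python sketch of this step, for illustration only. It is not my actual program: the class `AssociativeNet` and its fields are hypothetical names, and inhibition, context, and feedback are all left out. It shows just two things: everything active on a timestep connects through one new pattern neuron, and activation only follows direct connections from the current input.

```python
class AssociativeNet:
    """Hypothetical sketch of 'connect everything active through a new neuron'."""

    def __init__(self):
        self.pattern_of = {}  # frozenset of active units -> pattern neuron name
        self.members = {}     # pattern neuron name -> frozenset of its units
        self.fan_out = {}     # unit -> set of pattern neurons it connects to

    def learn(self, active):
        """Connect all currently active units through one pattern neuron."""
        key = frozenset(active)
        if key in self.pattern_of:
            # A neuron is defined by its set of inputs, so no duplicate is created.
            return self.pattern_of[key]
        name = f"O{len(self.pattern_of) + 1}"
        self.pattern_of[key] = name
        self.members[name] = key
        for unit in key:
            self.fan_out.setdefault(unit, set()).add(name)
        return name

    def activate(self, inputs):
        """Follow direct connections only; the rest of the network is never visited."""
        inputs = set(inputs)
        patterns = set()
        for unit in inputs:
            patterns |= self.fan_out.get(unit, set())
        recalled = set()
        for name in patterns:
            recalled |= self.members[name]
        return patterns, recalled - inputs


net = AssociativeNet()
created = net.learn({"R1", "M1"})       # dot and move are active together
patterns, recalled = net.activate({"R1"})  # later, only the dot is shown

print(created)
print(patterns, recalled)
```

Because `activate` walks only the connections of the current input, its cost depends on how many neurons are attached to that input, not on the total size of the network.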

My model uses reinforcement learning. The first versions were unsupervised; they could even do classification and clustering, but, as you noticed, that required a threshold. Again, it turns out that this approach doesn’t work for some tasks.
By the way, feedback can be either positive (+1) or negative (-1). And it’s just another input (receptor neuron).

About supervised and unsupervised. At first, the system is fully supervised. But as I mentioned, reinforcement is just another neuron, which can also be associatively activated.
By activating previous feedback, the network can receive a new reinforcement. It starts to evaluate its own actions (activations) and corrects its own mistakes without supervision. It never becomes completely unsupervised, but it doesn’t require feedback on every timestep.
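As a rough illustration of “feedback is just another receptor”, here is a small Python sketch. It is an assumption-laden simplification, not my algorithm: the names `FeedbackMemory`, `FB+`, and `FB-` are hypothetical, and the scoring rule (summing recalled feedback) is only one possible way to use the recalled reinforcement.

```python
POSITIVE, NEGATIVE = "FB+", "FB-"  # feedback receptors, stored like any other unit


class FeedbackMemory:
    """Hypothetical sketch: past feedback is recalled by association, not recomputed."""

    def __init__(self):
        self.episodes = []  # each episode is the set of units active on one timestep

    def remember(self, active):
        self.episodes.append(frozenset(active))

    def recalled_feedback(self, situation, action):
        """Sum the feedback that was stored together with this situation and action."""
        score = 0
        for episode in self.episodes:
            if situation in episode and action in episode:
                score += (POSITIVE in episode) - (NEGATIVE in episode)
        return score

    def choose(self, situation, actions):
        """Pick the action with the highest recalled feedback (first one wins ties)."""
        return max(actions, key=lambda a: self.recalled_feedback(situation, a))


fb = FeedbackMemory()
fb.remember({"R1", "M_up", NEGATIVE})    # supervised step: moving up was wrong
fb.remember({"R1", "M_down", POSITIVE})  # supervised step: moving down was right

# Later, with no external feedback, the stored feedback guides the choice.
print(fb.choose("R1", ["M_up", "M_down"]))
```

After the two supervised steps, the choice for R1 needs no new external signal, which is the sense in which feedback is not required on every timestep.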

About sequences, hierarchy, and context. I guess, the picture will give you a hint:

Any questions?

About “Growth” and how fast new neurons are created. For real-world tasks, the speed is always one new neuron per timestep. The network memorizes the current situation (the pattern of active neurons). And on a constant stream of data, all patterns are new.
Yes, that causes some problems in my very unoptimized program. However, the algorithm doesn’t calculate the activation rate for every neuron in the network, only for those directly connected to the current input and context. But, indeed, on real-world tasks, that could be a very large number of neurons.
In my program, I do have a function that deletes neurons. But theoretically, that’s “a crutch” for my program, and a perfect parallel system wouldn’t need it.

A little thought about deleting and forgetting.
In my life, I have heard a lot of jokes and funny stories. But if you asked me to tell you some, I would be able to remember maybe a couple dozen of them at most.
However, if you start telling me one that I have already heard somewhere, I would suddenly remember it (even if it seemed to be completely forgotten). Perhaps I would even be able to tell you the ending before you finish it.
So, I don’t know how we forget things. But the memories can be restored (patterns activated) through the right context.
By the way, jokes and funny stories are much easier to remember (activate) because those sequences of patterns include positive feedback. But that’s a topic for a completely different conversation.

I look forward to your opinion and further questions.

stepan · November 24, 2017, 8:15pm

Thanks for your interest, Matt.

It all goes through associative connections. You can look at the pictures in my answer to @curt.
I guess, in my model, the dendritic spikes are activations of neurons from the same pattern (in the picture, R1 activates M1). The predicted state is the neuron that represents the currently recognized situation (pattern) (in the picture, that is O1). If there are dendritic connections between O1 and something else (like O1 connected to M2 in the last picture), then M2 would be more likely to activate than M1 in the context of activated O1.

Uff, I hope it is understandable with the picture.

I guess I answered those questions in the reply I just wrote to @curt. Please read it there. If something is unclear, don’t hesitate to ask.

Unfortunately, I don’t know anything about forgetting. On small and simplified test tasks it mostly works without it. But you are right, that mechanism is necessary for a “normal” system.
The search space increases in the process, but it is limited. In the program, a neuron is a class that stores links to associatively connected neurons. So calculations are performed only for neurons that are directly connected to the current input and context. Thus we don’t have to go over the whole network for “matching patterns”.

Sorry, I didn’t mention this in the main post. The learning is mostly supervised. Some details are in my reply to @curt.

Sorry again, but I haven’t published my work yet, and right now I’m preparing to. The details of the algorithm are too important.
Maybe you would like some other way of demonstration? Any suggestions?

By the way, to be able to debug the program I had to know exactly how activations have to go in each case, on each task. That’s actually very interesting. The theory allows explaining a lot of “human” tasks in a general way.

For example, “how the model (or your model) could learn to count the number of occurrences of some letter in a word”?
This task seems extremely simple; it could easily be done by a simple “narrow” algorithm. But it requires a lot of knowledge when you do it “in general”. And the general way gives a lot of advantages when it comes to real “human” tasks (for example, see “The Winograd schema challenge”).
Would you be interested to see this?

curt · November 24, 2017, 9:11pm

So, I don’t know how we forget things. But the memories can be restored (patterns activated) through the right context.

I don’t know how the brain works, but there are some obvious issues to understand due to the nature of the problem.

My interest is in real-world learning problems such as a robot learning to interact with its environment – trying to duplicate human and animal learning skills basically – general AI stuff.

In this type of environment, we have sensors that produce massive amounts of very noisy data. We can see 1000 images of a cat with the robot “eyes” and no two images will ever be the same – not even “close” to the same.

So in your network, you seem to be looking at actual binary patterns as activation for your nodes. So if the inputs are “100110101”, that’s the pattern a node is activating on. In smaller toy (low-dimension) environments, an approach like that can work well. In a high-bandwidth, high-noise environment, we will see small patterns all the time (a pixel value of 40 will happen a lot, but so will all the other 255 possible pixel values for a single color channel). If you try to build nodes to recognize precise pixel patterns (say 3 RGB values, 8 bits each) you will have a 24-bit pattern that shows up in something close to 2^24 different combinations of 1’s and 0’s. You would need something near 2^24 neurons (about 16.8 million nodes) just to recognize all the patterns that come from a single pixel of a video image (robot eye) – and the number of combinations of these patterns would be massive and grow out of control very quickly if you tried to create nodes dynamically to recognize new patterns as they showed up.
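The arithmetic behind that estimate is easy to check. The short Python snippet below only counts exact bit patterns; the function name `exact_patterns` is made up for the illustration.

```python
BITS_PER_CHANNEL = 8
CHANNELS = 3  # R, G, B


def exact_patterns(n_pixels):
    """Number of distinct exact bit patterns that n_pixels RGB pixels can show."""
    return 2 ** (BITS_PER_CHANNEL * CHANNELS * n_pixels)


print(exact_patterns(1))  # 16777216 (2^24, one pixel)
print(exact_patterns(2))  # 281474976710656 (2^48, two pixels)
```

Each extra pixel multiplies the count by another 2^24, which is why one node per exact pattern cannot scale to an image.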

This is a classic scaling problem – we don’t have enough hardware in the universe to make it work. Not even “fill up your disk” with virtual nodes will begin to touch the nature of the problem of recognizing a real-world cat walking in front of a real-world robot using video data as “eyes”.

So we must generalize. We must compress in some way. We must take a massive fire-hose stream of data from the external sensors and compress it down to internal representations of N nodes, where N is some number that represents how much hardware we are willing to throw at the problem. Our hardware is always limited.

Real-world learning is an inherently lossy compression problem. We must throw most of the data away. Guaranteed. The brain has to be doing this as well. What we can store in the brain is vastly smaller than what the sensors are sending to the brain over our lives.

All this relates to your “forgetting” issue. We must “forget” almost ALL of what we receive in our sensory stream, and these learning systems must do that as well. Forgetting is not the hard problem. We can’t remember everything in real-world problems, so we must answer the question of what we choose to remember – how do we know which 1 bit out of a billion we “remember”?

Our learning system must implement some system for forgetting just by the fact it’s impossible to remember everything.

The way I like to look at the general problem is that we must use these learning networks to represent the state of the environment as accurately as possible, using the limited N nodes we have to work with (N could be 100 nodes, or a trillion nodes – the problem is the same either way). If you have a high-bandwidth sensory feed like a video stream, how would you compress it down to only 10 internal signals in 10 nodes of a learning network so that those 10 bits of data (node activations) at each time step best represent the 100 million bits that come in for each time step?

If you have a general algorithm to compress any large raw sensory input stream to some small internal representation, then you just pick the size of your internal network to give you the resolution needed to solve a given learning problem. This is much like how, as engineers, we can pick what number of pixels to use in a camera to give us the resolution needed to solve some problem at the lowest cost. The light entering the lens has massive amounts of data, but we reduce that massive data down to X pixels of information in the camera hardware, where we can choose any number X we want to build a camera for.

The learning network needs to work the same way. We pick the number of nodes we want to use to set the resolution of “understanding” the system can have – and feed it any massive stream of data we want to, and it compresses that data down to N bits by throwing MOST of the data away, but ending up with the best possible N-bit representation of the data we can create.

The general approach to this problem that seems to lead to some useful results is to compress using both spatial and temporal predictions. If input bit X at time T predicts input bit Y at time T+1, then we can represent this temporal pattern by one bit internally, so we have created compression.

But in real life, there are no (or very few) 100% predictions. Bits seldom correlate at a 100% rate, so we can’t compress them down without loss of data.

But what we can do, is pick the compression mappings that lose the least amount of data possible.

If input bit A correlates at a 10% rate to input bit B, and an 80% rate to C, then the AC pattern is more useful to “remember” than the AB pattern because having a node that “recognizes” the more common AC pattern allows the system to “remember” more data than wasting an entire node on the “AB” pattern which is rare.
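A small Python sketch of that selection rule, under the assumption that “most useful” simply means “co-occurs most often” in the observed stream. The function `most_common_pairs` and the toy stream are invented for illustration; a real system would have to estimate this incrementally.

```python
from collections import Counter
from itertools import combinations


def most_common_pairs(stream, n_nodes):
    """Return the n_nodes input pairs that were active together most often."""
    counts = Counter()
    for active in stream:
        for pair in combinations(sorted(active), 2):
            counts[pair] += 1
    return [pair for pair, _ in counts.most_common(n_nodes)]


# Ten time steps in which A is active: C joins it 8 times (80%), B once (10%).
stream = [{"A", "C"}] * 8 + [{"A", "B"}] + [{"A"}]

print(most_common_pairs(stream, n_nodes=1))
```

With only one node to spend, the node goes to the AC pattern, and the rare AB pattern is the part that gets “forgotten”.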

Your approach to creating one new node per cycle (remembering one more new bit of correlation data) is a way of allocating your hardware to the stuff that is most common which produces a good internal representation.

But in a very high-information environment of real-world sensors, you can’t get away with remembering “actual” bit patterns (HTM suffers from this as well). There are just too many to remember.

So we must move to systems that use a probabilistic approach to pattern recognition. So bit inputs that activate a node need something like a weight that represents some measure of how likely that bit is to be on when the node it feeds is on. Then the system learns to adjust these weights to make the nodes activate in ways that best represent all the inputs.
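One simple way to get such weights, sketched in Python below, is a running average of each input bit over the time steps when the node is on. This is only one possible choice, made up for illustration: the class `WeightedNode`, the learning rate, and the match score are all assumptions, not a description of any particular system.

```python
class WeightedNode:
    """Hypothetical node whose weights estimate P(input bit is on | node is on)."""

    def __init__(self, n_inputs, rate=0.1):
        self.rate = rate
        self.weights = [0.5] * n_inputs  # start with no preference

    def update(self, bits):
        """Move each weight a small step toward the bit seen while the node was on."""
        for i, bit in enumerate(bits):
            self.weights[i] += self.rate * (bit - self.weights[i])

    def match(self, bits):
        """Average agreement between the input and the learned probabilities (0..1)."""
        agreement = sum(w if bit else 1.0 - w for w, bit in zip(self.weights, bits))
        return agreement / len(bits)


node = WeightedNode(n_inputs=4)
for _ in range(20):
    node.update([1, 1, 0, 0])  # the usual pattern
node.update([1, 0, 0, 0])      # one noisy sample with a flipped bit

print(round(node.match([1, 1, 0, 0]), 2))  # still a strong match
print(round(node.match([0, 0, 1, 1]), 2))  # the opposite pattern matches poorly
```

A single noisy sample only nudges the weights, so the node keeps responding to its usual pattern instead of needing a new node for every variation.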

Your “only N nodes fire at once” logic (I think that’s what you implied by your “one” parameter), I assume, must be some measure of how “well” the nodes match their patterns or something to pick which N nodes will be active? Or to adjust weights to keep more than N from firing at once or something?

The end result is that the internal nodes need to learn to represent the input data patterns that happen the most but which don’t overlap with the patterns other nodes are representing.

It all boils down to a big lossy compression problem where the number of bits we compress down to, is fixed (N nodes to represent the state of the environment with). We can’t escape the need to solve this problem by dynamic allocation because we will then just have to figure out how to throw out nodes as fast as we are creating them because we will run out of hardware very quickly. But it’s possible that some nice efficient shortcuts to solving this class of problem can be found by dynamically building a network based on what patterns are seen, and dynamically pruning the least used nodes.

Computer compression algorithms like Lempel–Ziv–Welch use this technique to build an expanding pattern-recognizing tree based on what is seen, but since the goal is lossless compression, they never prune the tree or throw any data out. For these AI problems, we have to limit our trees to N nodes and throw MOST of the data away (to reach human-like learning).
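
The “limit to N nodes and prune the least used” idea can be sketched as a fixed-capacity pattern table. This is not LZW itself and not either network discussed here; the class name `BoundedPatternMemory` and the simple “evict the lowest count” rule are assumptions chosen to keep the example short.

```python
class BoundedPatternMemory:
    """Remember at most `capacity` patterns, dropping the least used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # pattern -> how many times it has been seen

    def observe(self, pattern):
        if pattern in self.counts:
            self.counts[pattern] += 1
            return
        if len(self.counts) >= self.capacity:
            # Out of "hardware": forget the pattern with the lowest count.
            least_used = min(self.counts, key=self.counts.get)
            del self.counts[least_used]
        self.counts[pattern] = 1

    def remembers(self, pattern):
        return pattern in self.counts


memory = BoundedPatternMemory(capacity=2)
for pattern in ["AC", "AC", "AB", "AC", "XY"]:
    memory.observe(pattern)
```

With room for only two patterns, the common “AC” survives while the rare “AB” is dropped to make room for “XY”.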

So your answer to how we forget is just the inverse of what we are able to remember – which is a small tip of the iceberg of the most common patterns we are exposed to. What a system like this can remember, is only what is most common in its environment.

So, when what is common shifts over time, what we can remember shifts with it, and what we forget is what is no longer common. A learning system like this can only remember the very small tip of the iceberg of what is common in the environment. The less we are exposed, the more we “forget”. If we live in a world full of cats, our brain fills up with lots of details about cats. But if we then move to a world that has no animals, and live there for years, our brain slowly erases the details of the cats because the nodes used to recognize cat figures are slowly being re-tuned for new features in the new environment, making the resolution of our memory of cats fade over time.

Our entire ability to recognize a pattern like “cat” can be understood in this “remember what is most likely” approach, because a real-world cat creates lots of redundant data in our sensory streams – the real cat makes our sensory stream predictable, so the learning network forms an abstract pattern of “cat” as a way to label the predictability that exists in the sensory data. But to be effective in a real-world high-bandwidth, high-noise environment, it must all be done using very probabilistic learning systems – not systems based on absolute bit patterns.

stepan · November 25, 2017, 1:39am

Thank you, Curt. That is very informative.
Your work is clearly important and you’re great at explaining it in simple terms.

You’re absolutely right. When I applied my network to a visual recognition task, it worked well only on very small and very simple images. When I tried to make it process a relatively big image, the system just ran out of performance margin.

But then I found a very simple and, it appears, very promising and general approach to visual recognition. I think it solves “the scaling problem” and makes the network very robust (not only to noise but to pretty big deformations).
It’s quite different from the conventional approach, but please try to understand. I’ll do my best to explain it. And I have some figures.

But first, some notes:

Maybe that’s my bad; I didn’t explain it well enough. The network creates a new node to memorize the new pattern, while recognition goes through old patterns (existing neurons).

Pruning is clearly necessary. Something like this approach works in my current program. But I don’t know if this is it.

About 15 years ago I saw a real-life hummingbird. I saw it for just a few seconds. I have never seen one since, but I still remember the bird very vividly. Are you getting my point?

Now, let’s move on to cat recognition.
The main differences from the common visual recognition approach:

The receptive field is not the whole image, just a small area (it can even be 1 pixel, but that’s not interesting). The bigger it is, the faster recognition is (but it requires exponentially more performance). From here on, I will call this receptive field on the image the scanning window.
The scanning window can be moved, and the network completely controls those moves. Thus, it chooses for itself in which direction to look.

I only have figures with simple examples, but I guess you’ll get the idea.

The input looks like (3x3 scanning window):

That’s easy to convert into binary form (the network only has 9 receptors in this case!).
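
A minimal sketch of reading a 3x3 scanning window as 9 binary receptor values, assuming the image is a plain list of rows of 0/1 pixels. The function name `read_window` and the choice to treat pixels outside the image as 0 are illustrative assumptions, not details of the actual program.

```python
def read_window(image, top, left, size=3):
    """Return the size x size patch at (top, left) as a flat list of bits."""
    height, width = len(image), len(image[0])
    bits = []
    for r in range(top, top + size):
        for c in range(left, left + size):
            inside = 0 <= r < height and 0 <= c < width
            bits.append(1 if inside and image[r][c] else 0)
    return bits


# A 4x4 image containing a short vertical line.
image = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
at_origin = read_window(image, 0, 0)   # line is in the middle column
moved_right = read_window(image, 0, 1)  # line is now in the left column
```

Moving the window changes which of the 9 receptors are on, while the number of receptors stays the same no matter how large the image is.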

I have two possible ways to train the network. Let’s name those ways “supervised” and “unsupervised”.

“Supervised” way - control the network and give feedback as much as possible (the fastest way).
“Unsupervised” way - make a set of simple “reflexes”, which give the network initial actions and periodic feedback (a slow but automated way).

Thus, the network would be able to learn to move the scanning window and receive sequences of input patterns, and to produce some output and receive feedback.

But what’s going on inside the network, how could it recognize anything?

First, the network is always trying to generalize: to activate all neurons and connect them together. That’s the really annoying thing. The network has to receive negative feedback to eliminate wrong generalizations. Fortunately, after a while, it starts to correct its own mistakes by self-evaluation.
Through generalization, the network connects those features which represent similar objects (and lead to positive feedback). The generalization goes like this:

There is a pixel missing in the second image. Maybe it is noise, or maybe it is the key difference. The network doesn’t know. It just connects them.

The neurons I1, I2, I3 are input dots from the second picture (I2 is not an actual input, it is “imaginary”). The neurons O1 and O2 represent situations (patterns) from the previous two pictures. They are activated together, so they become connected through neuron H1. Thus activation of neuron O1 will cause the activation of neuron O2.
Conventional neural networks actually do the same thing; they just use other methods to do it.

Now comes the hard part. Hierarchy, high-level representations and sequences are the same thing in the model. That’s not obvious, but it works. Let’s see:

Neurons I1, I2, I3 - first timestep of the sequence; the neuron O1 - represents the second timestep; the neuron H1 - the third timestep of the sequence; etc.
Neurons I1, I2, I3 - input features (dots); the neuron O1 - represents high-level feature (line); the neuron H1 represents even “higher” feature (robust line, regardless of noise); etc.

Don’t forget that those recognized neurons (O1, O2) become the context for the next timestep. The context has a huge influence on the activation of the next patterns (that’s like prediction in the HTM). And new input patterns continue to come. So, in a real system, representations from different levels mix.

Thus, the network:

Operates only on input from a small scanning window (solves “the scaling problem”). It doesn’t matter how big the image is; it only takes more time to look through it.
Makes and uses generalizations and high-level representations. We don’t need 2^24 combinations for recognition. By the way, this makes the network resistant to noise.
Uses sequences of patterns. So, for the model, there is no actual difference between a static image and a video. By the way, the ability to move the scanning window and use these moves for recognition makes the network resistant to deformation of the object.
Also solves the location signal problem. Every input pattern is perceived relative to the previous patterns and the moves made.

No need for specific location signal.

By the way, we can really improve results by using hierarchical learning. Or by using connections to words and sentences.
Something like this: if the network is already trained to recognize “wheels”, we can just tell it in text something like “a bike has 2 wheels”. Then the network would be able to recognize a “bike” without training on images of bikes.
This is not an actual example, but I tested on similar tasks. Associative connections to words work as well.

On test recognition tasks, 3-5 moves were usually enough to successfully recognize the object. But I used simple objects (MNIST, for example) and a small scanning window.
I can make a demo video, if necessary.

Uff, I tried to be brief, but I wrote a wall of text and still skipped some parts. I hope they weren’t critical for understanding. Please ask if something important is missing or seems to be wrong.

Looking forward to seeing your reply.

ali_m · November 25, 2017, 4:13am

My goodness… you’re great

we will be so glad.

stepan · November 25, 2017, 10:48am

Hello, Ali.

I’m glad to see your interest. In your topic, you asked a very important question.
I think, no matter how “allocentric” the location is, there will be problems with generating a robust and universal signal.

It’s known that the human brain uses saccadic eye movements for recognition.

I propose that those movements are very important for the recognition task and provide the necessary “location information”.
I also want to point out that those patterns of movement are not “hardcoded” in our brain. There has been research (I’ll find links, if necessary) showing that we learn to make saccades.

Along with saccades (which we do not control consciously), an important role is played by “controlled” eye movements, head movements, our body and everything else that changes the input from the retina.

Now, about how it goes in my system.
Let’s look at those pictures:

At first, the model is empty. Then it starts to make some reflex (or even random) movements of the scanning area. Some of those don’t help, but others lead to generalization and successful recognition.
For example, if the model sees an eye, it moves right (arrow A on figure A) until it encounters another eye or the end of the face. But then the model would probably make a move to see if there is a nose. The move from the second eye is made in the context of the current sequence of visual patterns and the moves made. Like this:

That makes the model resistant to changes in the size of the image (fig. B1). But I want to note that the network would still be able to distinguish those images. It’s just a dynamic generalization.
Also, there are mechanisms that make the model resistant to very significant deformations of an image (like fig. B2, for example). I doubt I ever saw a face from this angle before, but I am still able to recognize it.

Thus, the ability to take into account the relative position of elements is enough for robust recognition. No separate location signal.
Same goes for sensory recognition (especially, when we don’t see the object).
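
A toy illustration of why “move until the next feature” gives a description that ignores size. It assumes a face is a small character grid with `E` for an eye, `N` for the nose and `.` for empty space, and that the moves are a fixed list; in the model described above the moves are learned, so the helper names `move_until_feature` and `trace` and the fixed move list are purely hypothetical.

```python
def move_until_feature(grid, row, col, d_row, d_col):
    """Slide from (row, col) in one direction until a labelled cell or the edge."""
    r, c = row + d_row, col + d_col
    while 0 <= r < len(grid) and 0 <= c < len(grid[0]):
        if grid[r][c] != ".":
            return r, c
        r, c = r + d_row, c + d_col
    return None


def trace(grid, start, moves):
    """Record the labels seen along a list of (name, direction) moves."""
    row, col = start
    seen = [grid[row][col]]
    for name, (d_row, d_col) in moves:
        found = move_until_feature(grid, row, col, d_row, d_col)
        if found is None:
            seen.append((name, "edge"))
            break
        row, col = found
        seen.append((name, grid[row][col]))
    return seen


small_face = ["E.E",
              ".N."]
large_face = ["E...E",
              ".....",
              "..N.."]
moves = [("right", (0, 1)), ("down-left", (1, -1))]

small_trace = trace(small_face, (0, 0), moves)
large_trace = trace(large_face, (0, 0), moves)
```

Both faces produce the same sequence of features and moves, even though no absolute coordinates are stored anywhere in the trace.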

Maybe it’s not so obvious, but the same applies to time. @jhawkins has proposed that there are internal timing signals. But that can all be done through context and associative activations.
For example, think about what the difference is between last year and this year, yesterday and today, 2 hours ago and 3 hours ago. That can be handled without “narrowing” the model with any timing signal.

All criticism and questions are welcome. I simplify the explanations, so I may unintentionally skip some important parts.

ali_m · November 25, 2017, 6:49pm

Right… and if the recognized output object is associated with its location, that’s enough to form a hierarchy of objects. But on its own, that’s probably not enough to form a hierarchy of concepts, like horse [intersect] elephant [includes] four-feet; that’s why the location should be represented in such a way that it can be intersected with the pattern of the feature. So the location should also have some soft (fuzzy) format.
Anyway, your dynamic neuron addition sounds like an impressive idea. We’re looking forward to seeing your work as soon as it’s published.

bela.berde · November 25, 2017, 7:16pm

Really sorry, but this image is difficult to understand.

thanh-binh.to · November 25, 2017, 9:43pm

@stepan thanks for your work. What is the maximum image resolution that your system will work with? Could you please make some videos visualizing the results of image recognition tasks like MNIST?

stepan · November 26, 2017, 4:33am

Thank you for your reply, Ali.

Actually, the ability to form a hierarchy of concepts and to use indirect connections between concepts is one of the strongest sides of my model.
The network just doesn’t have a fixed hierarchy; it’s all done dynamically by associative activations.

When we see a foot of an elephant, it invokes a lot of associatively connected concepts. And this context can activate horse’s foot.
I want to point out that the scheme is very simplified and doesn’t include a scheme of activations (because it’s too complex to draw). But I can assure you that it can all work through activations of associatively connected neurons in the network.

About “four-feet”. To use that concept, first, we have to teach the network the concept of numbers. Then teach it to count abstract objects. And only then the network would be able to use “four-feet” concept.
Those are very interesting tasks. I did solve them and tested the model on them. They require the ability to perform a number of cognitive processes, which are currently unreachable for “conventional” machine learning methods.
Would you be interested in seeing how the network is able to perform some “human” cognitive processes?

By the way, I can give some examples showing that a fixed hierarchical representation (even a very soft and fuzzy one) cannot be universal and would fail in some cases (in most cases, when it comes to abstractions).

stepan · November 26, 2017, 5:41am

Hello, @bela.berde.

Thank you for your question.
Yep, it’s really different from a “conventional” network structure. But it’s really simple. What’s not simple is explaining how such an elementary structure is able to perform a variety of complex tasks.

Let me change the scheme a little.

First timestep:

The network receives an input from the receptor R1. Neuron R1 becomes active.
Let’s say R1 is a neuron from a scanning area (receptive field) of size 1x1.

When the model sees a dot in this receptive area, the neuron R1 becomes active.

For this example, we already have neuron O1, which is connected to neuron R1 and to the “acceptor” M1 (I don’t know if this is the right term; for the model it is just a neuron that makes the system perform some action, such as making a move or typing a letter). Let’s say M1 makes the system move the scanning area left.

R1 sends an impulse to O1. O1 activates M1. So the system moves the scanning area left.

And now we also have a context, which consists of neuron O1. It is an additional input on the next timestep (we analyze the next input in the context of the previously activated representation; that’s how sequences and hierarchy work).

Second timestep:

Neuron O1 is active from the previous timestep. Also, the system sees the dot in the receptive area, so neuron R1 becomes active too.

The activation of neuron R1 activates M1 through O1. So now we have the activated “context” O1 from the previous timestep and the newly recognized O1 (it will become the context for the next timestep).

Also, we create a new neuron O2. Its receptive field consists of the currently active neurons (O1, R1, M1).

By using negative feedback we can make the network move up (M2) instead of left (M1) when it has already moved left (in the context of O1). I didn’t show inhibition in this scheme.
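
The two timesteps above can be traced with a very small Python sketch. The neuron names (R1, O1, M1, O2) follow the example, but the class `TinyNetwork`, its “fire when the whole receptive field is active” rule and the explicit `create_neuron` call are simplifying assumptions, not the actual algorithm. Inhibition and negative feedback are left out.

```python
class TinyNetwork:
    """Toy trace of the two-timestep example with a carried-over context."""

    def __init__(self):
        # Fixed receptive field of each neuron, and the action neuron it drives.
        self.fields = {"O1": frozenset({"R1"})}
        self.drives = {"O1": "M1"}
        self.context = set()  # neurons recognized on the previous timestep

    def step(self, receptors):
        """Run one timestep and return every neuron active during it."""
        active = set(receptors) | self.context
        recognized = {n for n, field in self.fields.items() if field <= active}
        actions = {self.drives[n] for n in recognized if n in self.drives}
        self.context = recognized  # extra input for the next timestep
        return active | recognized | actions

    def create_neuron(self, name, active):
        """Memorize the active set as the fixed receptive field of a new neuron."""
        self.fields[name] = frozenset(active)


net = TinyNetwork()
first = net.step({"R1"})    # R1 -> O1 -> M1 (move left); context becomes {O1}
second = net.step({"R1"})   # context O1 plus R1 again; O1 and M1 fire again
net.create_neuron("O2", second)
```

After the second timestep, O2 has the receptive field (O1, R1, M1), matching the description above.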

I hope, I clarified it for you, @bela.berde. Do you want to know something in particular?

This structure allows using any kind of sequence and very complex hierarchical representations, which include features with different levels of abstraction. That makes the network really flexible and suitable for performing “general intelligence action”.

ccoo · November 26, 2017, 6:57am

Hello, stepan, thanks for your work. I’m a little confused about the process. Here is my understanding and my questions about the connecting process.

The neurons I1, I2, I3 - input dots from the second picture (I2 is not an actual input, it is “imaginary”). The neurons O1 and O2 - represent situations (patterns) from previous two pictures. They activated together, so they become connected through neuron H1. Thus activation of neuron O1 will cause the activation of neuron O2.

First time-step: I1, I2, I3 are active and they are connected through O1.
Second time-step: I1 and I3 are active. But is O1 active too? If O1 is active, why not connect it to O2? If O1 is not active, then in which time-step will O1 and O2 be active together and connect through H1?

stepan · November 26, 2017, 9:15am

Hello, @ccoo.

Thank you for your interest.
I just combined all time-steps into one picture. But apparently, I missed some important explanation.

Pictures:

Representations for those pictures respectively:

First time-step:
The input is this picture:

Thus, the input is neurons I1 and I3.

Neuron I2 activates through O1, so it’s as if the system sees (imagines) this picture:

That’s generalization.
But we can train the network to distinguish those pictures, if necessary.
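
The “imagined” I2 can be sketched as simple pattern completion. This assumes a stored pattern activates when a large enough fraction of its inputs is present; the function name `complete_pattern` and the 0.6 threshold are illustrative assumptions, not parameters of the real network.

```python
def complete_pattern(active_inputs, stored_patterns, threshold=0.6):
    """Return (everything 'seen', the inputs that were only imagined)."""
    active = set(active_inputs)
    imagined = set()
    for members in stored_patterns.values():
        overlap = len(active & members) / len(members)
        if overlap >= threshold:
            # The pattern fires and fills in its missing inputs.
            imagined |= members - active
    return active | imagined, imagined


stored = {"O1": {"I1", "I2", "I3"}}
seen, imagined = complete_pattern({"I1", "I3"}, stored)
```

With I1 and I3 as real input, O1 matches two of its three inputs, so I2 is filled in; with only I1 present, nothing is imagined.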

By the second time-step, the network has 2 recognized representations (neurons): O1 and O2. Those neurons will make up the context of the next time-step.

Second time-step:
In this example, there is no actual input (from receptors), only representations from the previous time-step.

On the third time-step, only H1 would be active.

The connections of a neuron can’t be changed after it’s created. But we can create a new neuron to “overrule” the previous one. That way we don’t spoil the memory and are able to remember old things in the right context.
Did I answer your question?
