Overcoming catastrophic forgetting


I have been thinking about how to avoid catastrophic forgetting while evolving neural nets, and I decided to intensively evolve one partitioned subset of the weights at a time. That's based on an assumption that current deep nets have far more weights than strictly needed, which is what allows gradient descent to find its way down search paths free of local minima. With fewer weights, I'm making the assumption that lots of harmful local minima would appear.
Then this information appeared:
I can try both. I also got my deleted email account back by some kind of Lazarus effect.


The recent papers about saddle points in deep neural nets are very revealing in a number of ways. One thing they indicate is how very overcomplete current deep neural nets are: there are massive numbers of high-quality solutions to the same training set to be found in the weight space of the net. My idea at the moment is that if you pick even 100 weight variables in a net with, say, 1 million variables, and really pound those 100 with an evolutionary optimization algorithm, the net should be able to make good progress. Such a small number out of 1 million makes the next choice of 100 almost orthogonal to (non-interfering with) the last choice. Interesting; I'll try it.
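A minimal sketch of the subset idea, assuming a toy linear "net" with 1,000 weights standing in for the million, and a simple (1+1) evolution strategy; the data, sizes, and hyperparameters here are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "net": one linear layer with 1,000 weights (stand-in for 1 million).
# 200 samples for 1,000 weights keeps the problem overcomplete on purpose.
n_weights = 1000
subset_size = 100
X = rng.normal(size=(200, n_weights))
true_w = rng.normal(size=n_weights)
y = X @ true_w
w = np.zeros(n_weights)

def loss(w):
    return np.mean((X @ w - y) ** 2)

def evolve_subset(w, steps=200, sigma=0.05):
    """Pound one random 100-weight subset with a (1+1)-ES, freezing the rest."""
    idx = rng.choice(n_weights, size=subset_size, replace=False)
    best = loss(w)
    for _ in range(steps):
        trial = w.copy()
        trial[idx] += sigma * rng.normal(size=subset_size)
        trial_loss = loss(trial)
        if trial_loss < best:  # keep a mutation only if it helps
            w, best = trial, trial_loss
    return w, best

before = loss(w)
for _ in range(20):  # each pass picks a nearly orthogonal fresh subset
    w, after = evolve_subset(w)
print(before, after)
```

Each call to `evolve_subset` draws a fresh random subset, so successive passes mostly don't interfere with one another, which is the point of the scheme.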


I’m running the code now. Even with limited hardware it is running fast enough that I should be able to tell if the concept is working. I find the evolutionary algorithms do their best and most creative work in about 100 dimensions so I’m trying that first.
There do seem to be at least three parts to intelligence:
1/ Instinct. Knowledge learnt over many generations and hardwired in, comparable to current deep neural networks.
2/ Manifold learning. This can be a fairly passive form of data absorption from the natural world, where you learn what things go together and what things follow one another. Equivalent to what Numenta is doing, or to Hopfield networks.
3/ Dynamic learning and planning. Where you use the knowledge in the learnt manifold to plan actions.
The third one I don’t have much of a clue how to do. Oh, well.


Temporal pooling, surely.


Okay, I’ll research temporal pooling a bit.
I should be able to tell if the neural net evolution code is working out by tomorrow. If so, I'll put the code in the public domain and likely do a Java version.
What parts of AI are actually working out well at the moment?

Evolutionary algorithms have been working well on engineering problems for a number of years.

Deep neural networks clearly are working. What is less clear is the best way to train them. There is a ton of engineering work you could do to find far better methods than gradient descent. I like the idea of getting creative in, say, 100 dimensions at a time, relying on the overcompleteness of the net to allow more freedom of choice. That way little clusters of intelligence form inside a larger net.

Maybe not too many people are looking at what I call manifold learning, driven purely by the input data, but I think it could be very useful for an agent trying to plan its future actions.
You could imagine a deep neural net being allowed to access a manifold memory system rather than having to try to design its own memory storage/access behavior. That is a much simpler problem.

Reinforcement learning and temporal pooling I haven't coded, so I don't know.


Back to account switching, but that is not my fault (malware). Anyway, adapting the neural net training to what is best for the evolution strategies (ES) algorithm looks like it is doing a good job.

I guess you have a choice: something that is maladaptive for the net but good for the optimizer, or something that is good for the net but maladaptive for the optimizer.
There would seem to be good reasons to keep the optimizer happy first and foremost: you end up with a net full of little reusable units of intelligence. Trying to optimize all the weights together in one gigantic jumble is maybe not so good.
Anyway these are engineering problems that will be sorted out over the next few years to the satisfaction of all.


That concept looks like it works:
I used y = sqrt(x) for x >= 0, y = -sqrt(-x) for x < 0 as a nonlinear activation function.
It is actually a continuous version of y = 1 for x >= 0, y = -1 for x < 0, in the sense that the square root function has 1 as a fixed point under iteration.
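As a quick sketch (Python just for illustration), the activation and its fixed-point behaviour look like this:

```python
import math

def signed_sqrt(x):
    """Continuous activation: sqrt(x) for x >= 0, -sqrt(-x) for x < 0."""
    return math.sqrt(x) if x >= 0 else -math.sqrt(-x)

# Iterating the activation drives any nonzero input toward +1 or -1,
# which is the sense in which it is a continuous version of sign(x).
x = 0.3
for _ in range(30):
    x = signed_sqrt(x)
print(round(x, 6))  # very close to 1.0
```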
I will use a much faster approximation if I decide to keep that activation function.
If you ran the code on an Nvidia Jetson board (or a better setup) you could definitely see the evolution process happening in real time. On a 2-core Intel CPU it still takes 2 or 3 hours to see some progress; on a Jetson that would reduce to 2 or 3 minutes.


After 10 hours on a cheap laptop:
Using a GPU it would take about 10 minutes to get to the same stage.