Artificial Life concept


The concept is to have a (randomly initialized) single layer network doing a dimensional reduction (DR) to a much smaller evolved controller network. The controller network decides when and what the DR net learns. Also the controller net will be connected to an information reservoir (a random projection feeding back into itself) to act as short term memory and to allow timing/sequencing.
I can evolve deep networks on my laptop as long as the number of dimensions isn’t too big, less than 2^12 (=4096). I already have the other components.


I guess rather than having a separate gating function to control when data is entered into either of the 2 memory types mentioned, you can do something along the lines of 1970s tri-state logic chips. That is, have a truncation function that will lop off a certain amount of the magnitude of values intended to be stored in memory. Obviously if you lop off more than there is, the result is zero, zero being a no-store indicator. Any remaining values (of magnitude greater than zero) are to be entered into the memory. That should allow good control of the statistics of memory update while reducing system complexity.
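A minimal sketch of that truncation gate, assuming the "lop off" operation is a plain subtraction from the magnitude with the sign kept. The names `truncate` and `gated_store` and the threshold value are made up for illustration, nothing here is from a finished implementation.

```python
def truncate(values, amount):
    """Lop `amount` off the magnitude of each value; smaller values become 0.0."""
    out = []
    for v in values:
        magnitude = abs(v) - amount
        if magnitude <= 0.0:
            out.append(0.0)  # zero is the "do not store" indicator
        else:
            out.append(magnitude if v > 0 else -magnitude)
    return out


def gated_store(memory, candidate, amount):
    """Overwrite only those memory slots whose candidate survives truncation."""
    gated = truncate(candidate, amount)
    return [g if g != 0.0 else m for m, g in zip(memory, gated)]


memory = [0.5, 0.5, 0.5, 0.5]
candidate = [2.0, -0.25, 0.75, -3.0]
truncated = truncate(candidate, 1.0)
updated = gated_store(memory, candidate, 1.0)
print(truncated)  # [1.0, 0.0, 0.0, -2.0]
print(updated)    # [1.0, 0.5, 0.5, -2.0]
```

Raising `amount` makes stores rarer, which is the knob for controlling the statistics of memory updates.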


Incoherent explanation.


And I still haven’t implemented any version of it yet. You can look at this paper regarding very similar random projection based neural networks in the meantime:

I have to provide an environment for the Alife to exist in and decide on an exact, concise architecture.
Rome wasn’t built in a day.


I was having a problem evolving deep networks where the information loss caused by the dot product weight and sum operations was greater than the computational gains per layer.
By allowing more ways for information to pass through each layer I am seeing markedly improved results. Maybe I can get on with the Alife thing soon.
Oddly enough the word Incoherent is apt, since extensive use is made of random orthogonal projections. Anyway:


You’ll learn to appreciate him :slight_smile:


The reason for the delay is that I was investigating evolved deep neural networks for the concept but I wasn’t seeing the improvements over shallow nets that I had hoped for. The reason is likely limited hardware. I’m looking at category orthogonal associative memory as an alternative.
The idea is to make a very large number of weak associations using a large training set (and low learning rate) and then layer by layer correct and refine those associations down to specific categories. The hope is that this approach will lead to smart decision boundaries. Anyway the idea is ahead of any mathematics to support it, i.e. these are empirical experiments.
This code is illustrative only:


This paper on Linear Maps with Point Rules has given me some good clues about what I was doing wrong with deep neural networks:
There is a lot of maths in that document but you can just scan through it and pick out things that are useful to you.
It’s very unusual for a PhD student just to disappear off the map after completing his or her studies. Where’s Venkatesh gone?

The Venkatesh thesis would seem to be quite significant and provide many
improvements you could make to current deep neural networks to get
better rotation and translation invariance etc. I am already getting much better results from evolved deep neural networks using the ideas.

Edit 2: Presumably the same guy:


Using the hints from the thesis paper above I have managed to arrange for quite rapid evolution of deep random projection neural networks using the product of 2 linear random projections as the activation function. That gives far better and quicker results than rectified linear units or similar.


Off topic there is quite a good science magazine here:
You can expect the output of a random projection to have a Gaussian
distribution for most input sets. The product of 2 of those is no
longer Gaussian, but rather some sort of spiky distribution. That would
encourage the formation of sparse, distinct, spiky vector response
patterns in a deep neural network. I found by experiment that bringing
that spiky pattern back to a more evenly distributed Gaussian with a
further random projection helped. Presumably by making efficient use of
all the weights.
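A rough way to check the spiky-versus-Gaussian claim numerically. This is only a sketch: it uses dense Gaussian matrices as a stand-in random projection, which is an assumption and not necessarily the projection used in the real networks. The ratio mean(|v|) / rms(v) is about 0.80 for Gaussian data and drops for spikier data (about 0.64 for a product of two independent Gaussians).

```python
import math
import random

rng = random.Random(0)


def random_matrix(rows, cols):
    """Dense Gaussian random projection matrix (assumed stand-in)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(matrix, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def abs_to_rms_ratio(values):
    """mean(|v|) / rms(v): lower means a spikier distribution."""
    mean_abs = sum(abs(v) for v in values) / len(values)
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return mean_abs / rms


n_in, n_out = 32, 600
x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]

a = project(random_matrix(n_out, n_in), x)   # first random projection
b = project(random_matrix(n_out, n_in), x)   # second, independent projection
product = [ai * bi for ai, bi in zip(a, b)]  # product "activation"
reprojected = project(random_matrix(n_out, n_out), product)

ratio_single = abs_to_rms_ratio(a)
ratio_product = abs_to_rms_ratio(product)
ratio_reprojected = abs_to_rms_ratio(reprojected)
print(ratio_single, ratio_product, ratio_reprojected)
```

The middle number should come out clearly lower than the other two: the product is spiky, and one more random projection spreads it back out to something close to Gaussian.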
Maybe it all sounds very nebulous. I think there is progress though. There’s the associative memory aspect done, there’s an evolvable deep network that hints at smart behavior. I need to link the 2 together with some truncation function for gating. Then deal with the real difficulty of providing a digital environment for it.


Sometimes Google just logs me in whatever way it wants (i.e. Fraz_J.)
Micro-concept: If you multiply a point-wise sparse vector by some weight matrix only a few of the weights are involved. If you multiply a pattern-wise sparse vector then all weights will have an effect. You can go from a point-wise sparse vector to a pattern-wise sparse vector using a random projection without loss of information (because you can use an invertible RP.)

Going the other way isn’t really possible, from some arbitrary pattern-wise sparse vector to a point-wise sparse vector without prior knowledge of the patterns (or having to learn those patterns) - due to entropy???
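To make the first direction concrete, here is a small sketch. It assumes one particular invertible random projection: a random sign flip followed by a normalised Walsh-Hadamard transform. That choice, and the names `rp_forward` and `rp_inverse`, are illustrative assumptions only.

```python
import math
import random


def wht(x):
    """Normalised fast Walsh-Hadamard transform (orthogonal, self-inverse)."""
    x = list(x)
    n = len(x)  # must be a power of two
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def make_signs(n, seed):
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rp_forward(x, signs):
    """Random sign flip, then WHT."""
    return wht([s * v for s, v in zip(signs, x)])


def rp_inverse(y, signs):
    """Undo rp_forward: WHT first, then the same sign flip."""
    return [s * v for s, v in zip(signs, wht(y))]


n = 16
signs = make_signs(n, seed=1)
point_sparse = [0.0] * n
point_sparse[3] = 1.0  # a single active element

pattern_sparse = rp_forward(point_sparse, signs)
recovered = rp_inverse(pattern_sparse, signs)
print(pattern_sparse)  # every entry is +0.25 or -0.25
print(recovered)       # the original one-hot vector again
```

The single non-zero element gets spread over all 16 outputs, so a following weight matrix would have every weight involved, and the inverse recovers the original exactly.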


Java code for product of 2 linear random projections neural network:


I guess the simplest “digital environment” for an alife is just reading text and seeing how well it can predict. The alternative is some sort of game like system which would take a lot of programming effort.


I have written code for an alife. I still have to create an environment for it.
I decided to use random projections for (noisy) multiplexing and demultiplexing access to external short term and long term memory from the controller network. That simplifies the system considerably.
One aspect that is causing some cognitive dissonance is that the experts insist that non-linearity is necessary for worthwhile deep networks. My practical experiments show me that linear all the way is quite fine. All I can say is that higher dimensional dot products are not really linear in their response. Maybe you can make an analogy and say each layer in a linear net is like a collection of convex and concave lenses, stretching and shrinking different aspects of the input.
I’ll watch this video and see if it can explain things:


I suppose then that a deep neural network is a partition forest. Where each layer is funneling information to some among many branches going upward (forward) and such a process is being predominately driven by the dot product operation. The nonlinear activation functions only being there to break symmetries.
It might be unhelpful for a symmetry such as y=net(x), -y=net(-x) to exist. Such a net couldn’t learn y=net(x), z=net(-x), y not equal z.
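A quick numerical illustration of that symmetry, as a sketch with made-up tiny networks (random weights, no biases, all names hypothetical). A stack of bias-free linear layers always satisfies net(-x) = -net(x). For comparison, the element-wise product of two linear maps satisfies net(-x) = net(x), so it swaps the odd symmetry for an even one rather than removing symmetry altogether.

```python
import random

rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


n = 8
w1, w2 = random_matrix(n, n), random_matrix(n, n)


def linear_net(x):
    """Two bias-free linear layers: an odd function of x."""
    return matvec(w2, matvec(w1, x))


def product_net(x):
    """Element-wise product of two linear maps: an even function of x."""
    return [a * b for a, b in zip(matvec(w1, x), matvec(w2, x))]


x = [rng.gauss(0.0, 1.0) for _ in range(n)]
neg_x = [-v for v in x]

odd_gap = max(abs(a + b) for a, b in zip(linear_net(x), linear_net(neg_x)))
even_gap = max(abs(a - b) for a, b in zip(product_net(x), product_net(neg_x)))
print(odd_gap, even_gap)  # both are zero up to rounding
```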

How is a dot product nonlinear? If you take the dot product of some random vectors with a fixed vector the result will have a low magnitude in higher dimensions due to cancellation effects. Only a small subset of possible vectors will result in any significant output (a selective filter.)
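The cancellation effect is easy to measure. The sketch below (illustrative names, Gaussian-derived random unit vectors as an assumption about what "random vectors" means) averages the absolute dot product between one fixed unit vector and many random unit vectors. For unit vectors the typical magnitude shrinks roughly like 1/sqrt(d) as the dimension d grows.

```python
import math
import random

rng = random.Random(0)


def random_unit_vector(d):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def mean_abs_dot(d, trials=200):
    """Average |dot product| of random unit vectors with one fixed unit vector."""
    fixed = random_unit_vector(d)
    total = 0.0
    for _ in range(trials):
        r = random_unit_vector(d)
        total += abs(sum(a * b for a, b in zip(fixed, r)))
    return total / trials


results = {d: mean_abs_dot(d) for d in (16, 256, 1024)}
for d, value in results.items():
    print(d, value)
```

Only vectors that happen to line up with the fixed vector give a large response, which is the selective filter behaviour described above.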

Then you have a (fuzzy) decision forest whose branches are determined by dot products. Depending on the net topology after each branching all the results can be merged before another branching process. It is maybe a little more dense in some respects than an ordinary decision forest.


The perspective I am forming then is that a deep neural network is a branch and merge, branch and merge soft/fuzzy decision forest. There are two main factors you can then control. First, you can increase the selectivity of the dot product filters by (say) taking the square, or reduce the selectivity by using a squashing function (i.e. different choices of activation function). Second, you can control exactly the way you merge the results. For example the early stages of a convolutional network put great effort into controlling the merging process.


One thing I am doing differently is evolving the deep linear networks rather than using backpropagation (BP.) And that’s linear as in linear algebra.
If you follow Tishby’s argument
you can see that by using BP you are learning the leaves and higher branches of the decision forest first. Which is kinda the wrong way around. That explains why randomly skipping layers during training with BP can be helpful. With evolution you are learning all the layers in concert.
I suppose if the network is plastic enough it may not matter if you use BP or evolution.
Anyway I am starting to be able to evolve 10 layer deep linear networks of dimension 4096 on a cheap laptop with a 2 core CPU. That’s a good sign that if you moved up to GPU based hardware things might get interesting.
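For anyone wondering what evolving all the layers in concert can look like, here is a toy sketch. It assumes the simplest possible scheme, a mutate-and-keep-if-no-worse hill climber on a tiny 3 layer linear net. The mutation rate, the scale, the sizes and the random toy data are all arbitrary illustrative choices, not the settings or the algorithm used for the 4096 dimensional nets.

```python
import random

rng = random.Random(0)


def matvec(matrix, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def forward(layers, x):
    for layer in layers:
        x = matvec(layer, x)  # purely linear layers, no activation
    return x


def cost(layers, data):
    """Sum of squared errors over all (input, target) pairs."""
    total = 0.0
    for x, target in data:
        y = forward(layers, x)
        total += sum((a - b) ** 2 for a, b in zip(y, target))
    return total


def mutate(layers, rate=0.05, scale=0.1):
    """Copy the net, perturbing a random subset of weights in every layer."""
    return [[[w + rng.gauss(0.0, scale) if rng.random() < rate else w
              for w in row] for row in layer] for layer in layers]


dim, depth = 4, 3
layers = [[[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]
          for _ in range(depth)]
data = [([rng.gauss(0.0, 1.0) for _ in range(dim)],
         [rng.gauss(0.0, 1.0) for _ in range(dim)]) for _ in range(4)]

initial_cost = cost(layers, data)
best_cost = initial_cost
for _ in range(500):
    child = mutate(layers)
    child_cost = cost(child, data)
    if child_cost <= best_cost:  # keep the child only if it is no worse
        layers, best_cost = child, child_cost

print(initial_cost, best_cost)
```

Each mutation touches weights in every layer at once, so early and late layers are adjusted together rather than leaves-first as with BP.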


It’s hardly worth saying but a deep neural network is then a list of
if patternA1 then pointA1
if patternA2 then pointA2

Then the points are lined up and it’s
if patternB1 then pointB1
if patternB2 then pointB2


In logic terms you can figure out that there are both (fuzzy) AND and OR operations, what’s missing is a NOT operator.


You could describe such a system as pattern based fuzzy logic.
I think you don’t need a NOT operator because sparsity can stand in its place? I’m not super sure about that.
Anyway here is the best reference I could find on the dot product in higher dimensional space:
( Number of almost-orthogonal vectors section)

Maybe you could organize a convolutional deep neural network along the lines of the FFT algorithm using such notions. I’ll try to code such a thing.


I found this modern paper about fuzzy logic:
So useful.
I also did a small blog post about the matter in general:


I was kind of diverted time wise figuring out at least some aspects of how deep neural networks work. Anyway this is a candidate Alife solution I just got to the alpha 0.0 stage.
I still have to find some simple problems to try it on. If it shows any signs of doing anything interesting I will do fuller testing and validation. The comment on the GitHub page is a bit disjointed due to copy and paste time saving.