Alternating Loss Functions for Evolution Algorithms

One problem with evolution algorithms is that they get trapped, maybe in some local minimum or at a difficult saddle point. One way out that avoids local restarts is to alternate between different loss (error measurement) functions during evolution.

For a neural network etc., the L2 loss function is the sum of the squares of the differences between the outputs and the targets; the L1 loss function is the sum of the absolute values of those differences.
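
To make the two definitions concrete, here is a minimal sketch in plain Python. The function names `l1_loss` and `l2_loss` and the sample numbers are only illustrative assumptions, not from any particular library.

```python
def l1_loss(outputs, targets):
    """Sum of the absolute differences between outputs and targets."""
    return sum(abs(o - t) for o, t in zip(outputs, targets))


def l2_loss(outputs, targets):
    """Sum of the squared differences between outputs and targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))


# Made-up example values, chosen so the arithmetic is easy to check by hand.
outputs = [0.5, 2.0, -1.0]
targets = [1.0, 0.0, -1.0]

print(l1_loss(outputs, targets))  # 0.5 + 2.0 + 0.0 = 2.5
print(l2_loss(outputs, targets))  # 0.25 + 4.0 + 0.0 = 4.25
```

Note how the single large error (2.0) dominates the L2 value much more than the L1 value, which is one reason the two losses can disagree about which of two candidates is better.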

If the system is not trapped, a reduction in one loss function usually results in a reduction in the other. As the system gets trapped, however, a decrease in one tends to produce an increase in the other: measured by the second loss, the system has actually stepped uphill. When you then switch to that second loss, the climb may be high enough to allow an escape from the local minimum, or at least give it a better chance. At the very least, the system is being pulled from one configuration to another via random walks, and there are certainly more chances of finding a way down that way than when permanently stuck in one place.
I tried it and it works quite well. I don’t want to put up a web page showing it right now because of some hacking-type activity going on at the moment.
Of course you can use more loss functions.
For biological evolution the loss function is never that consistent, which may help speed it along despite having less than ideal material to work with.
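
The alternation described above could be sketched roughly like this. It is not the code I actually used, just a minimal illustration assuming a simple (1+1) mutate-and-keep-if-not-worse scheme; the names `evolve`, `switch_every` and `sigma`, and the toy line-fitting task, are all made up for the example.

```python
import random


def l1_loss(params, model, data):
    return sum(abs(model(params, x) - y) for x, y in data)


def l2_loss(params, model, data):
    return sum((model(params, x) - y) ** 2 for x, y in data)


def evolve(params, model, data, losses,
           generations=2000, switch_every=50, sigma=0.1, seed=0):
    """(1+1)-style evolution that cycles through `losses` every `switch_every` generations."""
    rng = random.Random(seed)
    parent = list(params)
    for g in range(generations):
        # Pick the loss that is active in this phase; more than two losses also work.
        loss = losses[(g // switch_every) % len(losses)]
        child = [p + rng.gauss(0.0, sigma) for p in parent]
        # Parent and child are both scored with the currently active loss,
        # so the comparison stays consistent right after a switch.
        if loss(child, model, data) <= loss(parent, model, data):
            parent = child
    return parent


# Toy task (an assumption for illustration): fit y = 2x + 1 with a line.
model = lambda p, x: p[0] * x + p[1]
data = [(float(x), 2.0 * x + 1.0) for x in range(-5, 6)]
start = [0.0, 0.0]
best = evolve(start, model, data, [l1_loss, l2_loss])
print(best, l2_loss(best, model, data))
```

This toy problem has no local minima, so it only shows the mechanics of the switching, not the escape effect itself.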

Anyway it is something that is obvious when you have been told it. If you ever want to cite it in a paper you can say “it is obvious that…”


Could you not then apply the same idea when training conventional deep neural networks? Alternating the loss function from time to time!
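
For gradient-based training, the idea might look something like the sketch below. To keep it self-contained it uses a one-variable linear model and plain gradient descent rather than a deep network; the function name `train`, the `switch_every` schedule and the learning rate are assumptions chosen only for illustration.

```python
def sign(v):
    return (v > 0) - (v < 0)


def train(data, steps=2000, switch_every=100, lr=0.01):
    """Gradient descent on w*x + b, alternating between mean L1 and mean L2 loss."""
    w, b = 0.0, 0.0
    n = len(data)
    for step in range(steps):
        # Even-numbered phases use L1, odd-numbered phases use L2.
        use_l1 = (step // switch_every) % 2 == 0
        gw = gb = 0.0
        for x, y in data:
            err = (w * x + b) - y
            # Derivative of the per-sample loss with respect to the prediction:
            # sign(err) for |err| (a subgradient at 0), 2*err for err**2.
            g = float(sign(err)) if use_l1 else 2.0 * err
            gw += g * x / n
            gb += g / n
        w -= lr * gw
        b -= lr * gb
    return w, b


# Toy data (an assumption for illustration): points on the line y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(-5, 6)]
w, b = train(data)
print(w, b)
```

In a real framework the same schedule would just swap which loss is passed to the optimizer step every so many iterations.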


I’m surprised this has not been tried yet.


Interesting… I’m sure someone must have tried it.

Ahh… yeah. After Googling: people have tried switching loss functions from time to time, but it mostly doesn’t matter, as the loss landscapes of neural networks, while not actually convex, mostly behave well enough in practice that local minima are rarely the bottleneck. Switching loss functions simply switches your error landscape, which might help training by avoiding local minima. But batch-norm, dropout, momentum, etc. can also help avoid getting stuck at a local minimum, and most of the time that’s enough. (Adam and Adamax are especially good at this.)

I’ll try it anyway.


I guess for neural networks batch training has a similar effect. I suppose it is worth seeing if there is any positive effect on any of the metrics.
For evolutionary algorithms it seems more interesting.

What is it that allows batch training without catastrophic forgetting between batches? I guess it is down to heavy over-parameterization acting like a repetition-code form of error correction, which is then somewhat resistant to the noise of the changes the next batch makes.
You can’t really use batching with the sort of lean neural networks I try to evolve because you definitely get catastrophic forgetting between batches. That’s where the amount of compute available likely matters, or I could try with more parameters.

After messing with switching loss functions: it doesn’t work well.
Depending on the task and the loss functions being switched, the results vary from no improvement to not working at all. It seems that either one loss function is good enough or the two are working against each other.