Why not apply SGD after every data sample?

My first thread on the #engineering:machine-learning forum! So it is fitting that this is probably a complete n00b question. :blush:

Why not just apply Stochastic Gradient Descent (SGD) after each data sample instead of batching them into epochs?

This would give the data a temporal context and would let us more easily code Temporal Memory logic into the hidden layers.

I assume the answer is because it is too computationally expensive? But that is not obvious to me.


See the “Mini-batch gradient descent” section of https://cs231n.github.io/optimization-1/

Also note that epochs and batches are two different things. In each epoch we train the model on the entire training dataset, splitting it into some number of batches.
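The distinction can be sketched in a few lines of Python (the dataset size and batch size here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # hypothetical dataset: 100 samples, 4 features
batch_size = 25
n_epochs = 3

updates = 0
for epoch in range(n_epochs):                   # one epoch = one full pass over X
    for start in range(0, len(X), batch_size):  # each pass is split into batches
        batch = X[start:start + batch_size]
        # ... compute the loss on `batch` and apply one gradient update here ...
        updates += 1

print(updates)  # 3 epochs * 4 batches = 12 weight updates
```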


Thanks. Another article I’m reading to inform the discussion (thanks @subutai): A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size.

If you are talking about neural networks and training them on temporally-coherent data: no, this is a super, extra super bad idea. Unlike HTM, NNs are prone to catastrophic forgetting, where after a few weight updates that don’t take the original data distribution into account, the NN completely forgets how to perform the functions it had learned. (They overfit to your small sample of data.)

This is also why we shuffle the dataset before training.
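As a minimal sketch (NumPy, toy data), shuffling means drawing a fresh permutation before each epoch and applying it to inputs and labels together:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10)  # stand-in for a temporally ordered dataset
y = X * 2          # labels, paired with X

perm = rng.permutation(len(X))  # fresh permutation, drawn before each epoch
X_shuf, y_shuf = X[perm], y[perm]

# the input/label pairing survives the shuffle, but the temporal order does not
assert np.all(y_shuf == X_shuf * 2)
assert not np.all(X_shuf == X)
```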


This was obviously an uninformed question, but I appreciate the responses; the material was very helpful. The replies on Twitter were also helpful:

I think I understand the problem with SGD now. The “stochastic” part of SGD basically keeps any kind of fine-grained temporal context from getting into the system. Applying GD in batches gets you part of the way there, but the mean in the cost function is necessary to smooth out noisy gradients, which would otherwise prevent the GD algorithm from finding a decent descent :wink: .
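That point can be checked numerically. Below is a small sketch on a made-up linear-regression loss: averaging per-sample gradients over a batch shrinks the gradient variance (for independent samples, roughly by a factor of the batch size):

```python
import numpy as np

rng = np.random.default_rng(0)
w, w_true = 0.0, 3.0
x = rng.normal(size=1000)
y = w_true * x + rng.normal(scale=0.5, size=1000)  # noisy toy targets

# per-sample gradient of the squared error 0.5 * (w*x - y)**2 w.r.t. w
per_sample_grads = (w * x - y) * x

# a single-sample gradient is noisy; the batch mean tames that noise
single_var = per_sample_grads.var()
batch_means = per_sample_grads.reshape(100, 10).mean(axis=1)  # batches of 10
batch_var = batch_means.var()
assert batch_var < single_var
```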


… unless, of course, you are feeding time in as one of the input axes, so that it becomes part of the manifold being formed.

Not as the sequence of presentation, but as part of the data set itself.


Do you know of any experiments where this was useful?

Some of what we work with is time sequences of response to a tire passing over the scale.
There are many factors involved in both the wheel/chassis system of the truck and the various spring-rate and masses of the scale assembly.

I have experimented with this and so far - have not achieved enlightenment.

The hope is that I will find an application of a neural network at work, in this case, the filter section. Alas - not yet.


It’s also related to how GPUs work.

Both the feedforward and the feedback passes happen in parallel per minibatch.
Thus, it HAS access to all of the information from the previous “time step” (i.e., the activations caused by the previous minibatch input).
You can make use of that information if you want.

Individual backpropagations (gradient descent through the layers) aren’t computationally expensive either.
They have about the same computational complexity as the feedforward stage.
The formulas for the feedforward and the feedback stages are quite similar, too.
That’s where the name “backpropagation” comes from.
Training deep learning models is costly because of the number of iterations it takes to work well.

What you’re suggesting (a minibatch size of 1) might not be parallel enough to be computed efficiently by a GPU.
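To make that concrete, here is a small NumPy sketch (the layer sizes are arbitrary): one batched matrix multiply computes exactly the same result as a loop of single-sample passes, but only the batched form exposes the parallelism a GPU can exploit:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))       # weights of one hypothetical layer
batch = rng.normal(size=(32, 4))  # minibatch of 32 input vectors

# one batched matmul: the form a GPU parallelises across the whole minibatch
out_batched = batch @ W.T

# mathematically identical: 32 separate single-sample passes (minibatch size 1),
# but the sequential loop leaves the hardware's parallelism unused
out_loop = np.stack([x @ W.T for x in batch])
assert np.allclose(out_batched, out_loop)
```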

I’m currently implementing Deep HTM for the GPU, and indeed it uses a minibatch size of a few dozen.
I’m not there yet, but I don’t think that will cause any particular problem.
If you think it might, please let me know how.


It would be extremely dependent on the type of data being processed and the semantic characteristics of the encoding.


I think that shouldn’t be the case, because the encoder layers process the input before it is fed to the SP layer.
The input doesn’t get fed in directly, so the result is less affected by the type of the input.
But there will be some kind of effect for sure.
I’ll keep that in mind. Thanks! :slight_smile:

@rhyolight, just to clarify: SGD will work just fine with a batch size of 1, but you will need to do a lot more weight updates. I trained small convnets like that and saw no accuracy drop.
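A toy illustration of this point: pure SGD (minibatch size 1) on a made-up one-parameter regression problem still converges, it just spends many more updates doing so:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x  # noiseless toy targets: the true weight is 3.0

w, lr = 0.0, 0.1
updates = 0
for epoch in range(5):
    for xi, yi in zip(x, y):          # minibatch size 1: update after every sample
        w -= lr * (w * xi - yi) * xi  # gradient of 0.5 * (w*xi - yi)**2
        updates += 1

assert abs(w - 3.0) < 1e-3  # it converges fine...
print(updates)              # ...it just takes 5 * 200 = 1000 weight updates
```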

Ilya Sutskever seems to agree:

It would actually be nice to use minibatches of size 1, and they would probably result in improved performance and lower overfitting; but the benefit of doing so is outweighed by the massive computational gains provided by minibatches.

In case you don’t know who Ilya Sutskever is: https://en.wikipedia.org/wiki/Ilya_Sutskever


Oh, and for temporal context you would typically use attention mechanisms. It’s strange that no one has mentioned them.
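For completeness, here is a minimal NumPy sketch of scaled dot-product self-attention (the shapes and data are made up); each timestep’s output is a softmax-weighted mix of the values at all timesteps, which is how the temporal context gets in:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: each query scores every timestep's key,
    # then takes a softmax-weighted mix of the corresponding values
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, d = 5, 8                     # 5 timesteps, 8-dimensional vectors
seq = rng.normal(size=(T, d))
out = attention(seq, seq, seq)  # self-attention over one sequence
assert out.shape == (T, d)      # one context-mixed vector per timestep
```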