Why not apply SGD after every data sample?

My first thread on the #engineering:machine-learning forum! So it is fitting that this is probably a complete n00b question. :blush:

Why not just apply Stochastic Gradient Descent (SGD) after each data sample instead of batching them into epochs?

This would give the data a temporal context, and allow us to more easily build Temporal Memory logic into hidden layers.

I assume the answer is that it is too computationally expensive? But that is not obvious to me.
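To make the question concrete, here is a rough sketch (plain NumPy, with a made-up squared-error gradient just for illustration) of the two update schemes I am comparing:

```python
import numpy as np

def grad(w, x, y):
    # gradient of a squared error for a linear model; a stand-in, not a real network
    return 2 * (w @ x - y) * x

# what I'm asking about: one weight update per data sample, in presentation order
def sgd_per_sample(w, X, Y, lr=0.01):
    for x, y in zip(X, Y):
        w = w - lr * grad(w, x, y)
    return w

# the usual approach: one weight update per batch, using the averaged gradient
def sgd_minibatch(w, X, Y, lr=0.01, batch_size=32):
    for i in range(0, len(X), batch_size):
        xb, yb = X[i:i + batch_size], Y[i:i + batch_size]
        g = np.mean([grad(w, x, y) for x, y in zip(xb, yb)], axis=0)
        w = w - lr * g
    return w
```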

5 Likes

See the Mini-batch gradient descent section of https://cs231n.github.io/optimization-1/

Also note that epochs and batches are two different things. In each epoch we train the model on the entire training dataset, splitting it into some number of batches.
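A minimal sketch of how the two nest, using PyTorch with toy data (the sizes and numbers here are arbitrary, just to show the structure):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(1000, 20), torch.randn(1000, 1)      # toy dataset
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for epoch in range(10):          # one epoch = one full pass over the dataset
    for xb, yb in loader:        # the epoch is served in batches of 32 samples
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()               # one weight update per batch
```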

3 Likes

Thanks, and here is another article I'm reading to inform the discussion (thanks @subutai): A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size.

If you are talking about neural networks and training them on temporally coherent data: no, this is a super, extra super bad idea. Unlike HTM, NNs are prone to catastrophic forgetting, where after a few weight updates that don't take the original distribution into account, the NN completely forgets how to perform the functions it had already learned (it overfits to your small sample of data).

This is also why we shuffle the dataset before training.
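A sketch of what that per-epoch shuffling looks like (plain NumPy; the function name is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffled_batches(X, Y, batch_size):
    # reshuffle every epoch so consecutive batches are not temporally correlated;
    # each batch then looks roughly like an i.i.d. sample of the whole dataset
    order = rng.permutation(len(X))
    for i in range(0, len(X), batch_size):
        idx = order[i:i + batch_size]
        yield X[idx], Y[idx]
```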

5 Likes

This was obviously an uninformed question, but I appreciate the responses. The material was very helpful. There were also helpful replies on Twitter:

I think I understand the problem with SGD. The "stochastic" part of SGD basically prevents any kind of fine-grained temporal context from getting into the system. Applying GD in batches gets you part of the way there, but the mean in the cost function is needed to average out noisy gradients, which would otherwise prevent the GD algorithm from finding a decent descent :wink: .
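Here is a small NumPy experiment (simulated gradients, not a real model) showing how averaging over a batch shrinks that noise:

```python
import numpy as np

rng = np.random.default_rng(1)
true_grad = np.array([1.0, -2.0])

# simulate noisy per-sample gradients scattered around the true gradient
samples = true_grad + rng.normal(scale=1.0, size=(10000, 2))

for b in (1, 8, 64):
    usable = (10000 // b) * b
    batch_means = samples[:usable].reshape(-1, b, 2).mean(axis=1)
    print(b, batch_means.std(axis=0))   # noise shrinks roughly like 1/sqrt(b)
```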

3 Likes

… unless of course you are feeding time in as one of the axis parameters, so it becomes part of the manifold formed.

Not as the sequence of presentation, but as part of the data set itself.
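Something like this, where the time stamp just becomes another column of the data set (toy numbers standing in for real readings):

```python
import numpy as np

readings = np.random.randn(500, 3)        # 500 time steps of 3-channel readings
t = np.arange(500) / 500.0                # normalized time stamp for each row

# time is now just another input dimension; the rows can be shuffled freely
# because the temporal information travels with each sample
X = np.column_stack([t, readings])
```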

2 Likes

Do you know of any experiments where this was useful?

Some of what we work with is time sequences of the response to a tire passing over the scale.
There are many factors involved, both in the wheel/chassis system of the truck and in the various spring rates and masses of the scale assembly.

I have experimented with this and so far - have not achieved enlightenment.

The hope is that I will find an application for a neural network at work, in this case in the filter section. Alas - not yet.

2 Likes

It's also related to how GPUs work.

Both the feedforward and the feedback passes happen in parallel per minibatch.
Thus, the network HAS access to all of the information from the previous "time step" (i.e. the activations caused by the previous minibatch input).
You can make use of that information if you want.

Individual backpropagations (gradient descent through the layers) aren't computationally expensive either.
They have about the same computational complexity as the feedforward stage.
The formulas for the feedforward and the feedback stages are quite similar, too.
That's where the name "backpropagation" comes from.
Training deep learning models is costly because of the number of iterations it takes to work well.
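For a single dense layer, both directions boil down to matrix multiplies of the same size; a rough NumPy sketch (not any particular framework's code):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))   # weights of one dense layer
x = rng.standard_normal((32, 64))    # a minibatch of 32 inputs

# feedforward: one (32x64) @ (64x128) matrix multiply
z = x @ W.T

# feedback: the gradients need matrix multiplies of the same size,
# so the cost per pass is comparable to the feedforward cost
dz = rng.standard_normal(z.shape)    # gradient arriving from the next layer
dx = dz @ W                          # gradient w.r.t. the layer input
dW = dz.T @ x                        # gradient w.r.t. the weights
```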

What you're suggesting (a minibatch size of 1) might not be parallel enough to be computed efficiently by a GPU.
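You can get a feel for this even on a CPU by timing the same pass over the data with batch size 1 versus 64 (toy model, arbitrary sizes):

```python
import time
import torch

model = torch.nn.Linear(1024, 1024)
data = torch.randn(4096, 1024)

def time_updates(batch_size):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    start = time.perf_counter()
    for i in range(0, len(data), batch_size):
        xb = data[i:i + batch_size]
        loss = model(xb).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return time.perf_counter() - start

# same amount of data, very different wall-clock cost:
# batch_size=1 means 4096 tiny, mostly serial updates
print("batch 1:  %.2f s" % time_updates(1))
print("batch 64: %.2f s" % time_updates(64))
```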

I'm currently implementing Deep HTM for the GPU, and indeed it uses a minibatch size of a few dozen.
I'm not there yet, but I don't think that will cause any particular problem.
If you think it might cause a problem of some kind, please let me know how.

1 Like

It would be extremely dependent on the type of data being processed and the semantic characteristics of the encoding.

1 Like

I think that might not be the case, because of the encoder layers that process the input and feed it to the SP layer.
The input doesn't get fed in directly, so the model is less affected by the type of the input.
But there will be some kind of effect, for sure.
I'll keep that in mind. Thanks! :slight_smile:

@rhyolight, just to clarify: SGD will work just fine with a batch size of 1, but you will need to do a lot more weight updates. I trained small convnets like that and saw no accuracy drop.
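Something along these lines, with batch_size=1 so the loop makes one update per sample (a toy stand-in, not the actual convnets I trained):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=1, shuffle=True)

model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3), torch.nn.ReLU(), torch.nn.Flatten(),
    torch.nn.Linear(8 * 26 * 26, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for xb, yb in loader:    # 256 weight updates per epoch instead of 256 / batch_size
    loss = torch.nn.functional.cross_entropy(model(xb), yb)
    opt.zero_grad()
    loss.backward()
    opt.step()
```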

Ilya Sutskever seems to agree:

It would actually be nice to use minibatches of size 1, and they would probably result in improved performance and lower overfitting; but the benefit of doing so is outweighed by the massive computational gains provided by minibatches.

In case you don't know who Ilya Sutskever is: Ilya Sutskever - Wikipedia

1 Like

Oh, and for temporal context you would typically use attention mechanisms. It's strange that no one has mentioned that yet.
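For example, a bare-bones scaled dot-product self-attention over a sequence (a sketch of the idea, not a full Transformer):

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d_model); every position can attend to every other,
    # which is how temporal context gets mixed in without per-sample updates
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

seq = torch.randn(16, 32)   # 16 time steps, 32 features each
out = scaled_dot_product_attention(seq, seq, seq)   # self-attention over the sequence
```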