Why not apply SGD after every data sample?

My first thread on the #engineering:machine-learning forum! So it is fitting that this is probably a complete n00b question. :blush:

Why not just apply Stochastic Gradient Descent (SGD) after each data sample instead of batching them into epochs?

This would give the data a temporal context, and allow us to more easily build Temporal Memory logic into hidden layers.

I assume the answer is that it is too computationally expensive? But that is not obvious to me.
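To make the question concrete, here is a rough sketch (plain NumPy, with a made-up squared-error gradient just for illustration) of the two update schemes I am comparing:

```python
import numpy as np

def grad(w, x, y):
    # gradient of a squared error for a linear model; a stand-in, not a real network
    return 2 * (w @ x - y) * x

# what I'm asking about: one weight update per data sample, in presentation order
def sgd_per_sample(w, X, Y, lr=0.01):
    for x, y in zip(X, Y):
        w = w - lr * grad(w, x, y)
    return w

# the usual approach: one weight update per batch, using the averaged gradient
def sgd_minibatch(w, X, Y, lr=0.01, batch_size=32):
    for i in range(0, len(X), batch_size):
        xb, yb = X[i:i + batch_size], Y[i:i + batch_size]
        g = np.mean([grad(w, x, y) for x, y in zip(xb, yb)], axis=0)
        w = w - lr * g
    return w
```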

5 Likes

See the Mini-batch gradient descent section of https://cs231n.github.io/optimization-1/

Also note that epochs and batches are two different things. In each epoch we train the model on the entire training dataset, splitting it into some number of batches.
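A minimal sketch of how the two nest, using PyTorch with toy data (the sizes and numbers here are arbitrary, just to show the structure):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(1000, 20), torch.randn(1000, 1)      # toy dataset
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for epoch in range(10):          # one epoch = one full pass over the dataset
    for xb, yb in loader:        # the epoch is served in batches of 32 samples
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()               # one weight update per batch
```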

3 Likes

Thanks, and here is another article I'm reading to inform the discussion (thanks @subutai): A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size.

If you are talking about neural networks and training them on temporally coherent data: no, this is a super, extra super bad idea. Unlike HTM, NNs are prone to catastrophic forgetting, where after a few weight updates that don't take the original distribution into account, the NN completely forgets how to perform the functions it had already learned (it overfits to your small sample of data).

This is also why we shuffle the dataset before training.
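A sketch of what that per-epoch shuffling looks like (plain NumPy; the function name is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffled_batches(X, Y, batch_size):
    # reshuffle every epoch so consecutive batches are not temporally correlated;
    # each batch then looks roughly like an i.i.d. sample of the whole dataset
    order = rng.permutation(len(X))
    for i in range(0, len(X), batch_size):
        idx = order[i:i + batch_size]
        yield X[idx], Y[idx]
```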

5 Likes

This was obviously an uninformed question, but I appreciate the responses. The material was very helpful. There were also helpful replies on Twitter:

I think I understand the problem with SGD. The "stochastic" part of SGD basically prevents any kind of fine-grained temporal context from getting into the system. Applying GD in batches gets you part of the way there, but the mean in the cost function is needed to average out noisy gradients, which would otherwise prevent the GD algorithm from finding a decent descent :wink: .
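Here is a small NumPy experiment (simulated gradients, not a real model) showing how averaging over a batch shrinks that noise:

```python
import numpy as np

rng = np.random.default_rng(1)
true_grad = np.array([1.0, -2.0])

# simulate noisy per-sample gradients scattered around the true gradient
samples = true_grad + rng.normal(scale=1.0, size=(10000, 2))

for b in (1, 8, 64):
    usable = (10000 // b) * b
    batch_means = samples[:usable].reshape(-1, b, 2).mean(axis=1)
    print(b, batch_means.std(axis=0))   # noise shrinks roughly like 1/sqrt(b)
```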

3 Likes

… unless of course you are feeding time in as one of the axis parameters, so it becomes part of the manifold formed.

Not as the sequence of presentation, but as part of the data set itself.
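Something like this, where the time stamp just becomes another column of the data set (toy numbers standing in for real readings):

```python
import numpy as np

readings = np.random.randn(500, 3)        # 500 time steps of 3-channel readings
t = np.arange(500) / 500.0                # normalized time stamp for each row

# time is now just another input dimension; the rows can be shuffled freely
# because the temporal information travels with each sample
X = np.column_stack([t, readings])
```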

2 Likes

Do you know of any experiments where this was useful?

Some of what we work with is time sequences of the response to a tire passing over the scale.
There are many factors involved, both in the wheel/chassis system of the truck and in the various spring rates and masses of the scale assembly.

I have experimented with this and so far - have not achieved enlightenment.

The hope is that I will find an application for a neural network at work, in this case in the filter section. Alas - not yet.

2 Likes

It's also related to how GPUs work.

Both the feedforward and the feedback passes happen in parallel per minibatch.
Thus, the network HAS access to all of the information from the previous "time step" (i.e. the activations caused by the previous minibatch input).
You can make use of that information if you want.

Individual backpropagations (gradient descent through the layers) aren't computationally expensive either.
They have about the same computational complexity as the feedforward stage.
The formulas for the feedforward and the feedback stages are quite similar, too.
That's where the name "backpropagation" comes from.
Training deep learning models is costly because of the number of iterations it takes to work well.
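For a single dense layer, both directions boil down to matrix multiplies of the same size; a rough NumPy sketch (not any particular framework's code):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))   # weights of one dense layer
x = rng.standard_normal((32, 64))    # a minibatch of 32 inputs

# feedforward: one (32x64) @ (64x128) matrix multiply
z = x @ W.T

# feedback: the gradients need matrix multiplies of the same size,
# so the cost per pass is comparable to the feedforward cost
dz = rng.standard_normal(z.shape)    # gradient arriving from the next layer
dx = dz @ W                          # gradient w.r.t. the layer input
dW = dz.T @ x                        # gradient w.r.t. the weights
```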

What you're suggesting (a minibatch size of 1) might not be parallel enough to be computed efficiently by a GPU.
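You can get a feel for this even on a CPU by timing the same pass over the data with batch size 1 versus 64 (toy model, arbitrary sizes):

```python
import time
import torch

model = torch.nn.Linear(1024, 1024)
data = torch.randn(4096, 1024)

def time_updates(batch_size):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    start = time.perf_counter()
    for i in range(0, len(data), batch_size):
        xb = data[i:i + batch_size]
        loss = model(xb).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return time.perf_counter() - start

# same amount of data, very different wall-clock cost:
# batch_size=1 means 4096 tiny, mostly serial updates
print("batch 1:  %.2f s" % time_updates(1))
print("batch 64: %.2f s" % time_updates(64))
```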

I'm currently implementing Deep HTM for the GPU, and indeed it uses a minibatch size of a few dozen.
I'm not there yet, but I don't think that will cause any particular problem.
If you think it might cause a problem of some kind, please let me know how.

1 Like

It would be extremely dependent on the type of data being processed and the semantic characteristics of the encoding.

1 Like

I think that might not be the case, because of the encoder layers that process the input and feed it to the SP layer.
The input doesn't get fed in directly, so the model is less affected by the type of the input.
But there will be some kind of effect, for sure.
I'll keep that in mind. Thanks! :slight_smile:

@rhyolight, just to clarify: SGD will work just fine with a batch size of 1, but you will need to do a lot more weight updates. I trained small convnets like that and saw no accuracy drop.
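Something along these lines, with batch_size=1 so the loop makes one update per sample (a toy stand-in, not the actual convnets I trained):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=1, shuffle=True)

model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3), torch.nn.ReLU(), torch.nn.Flatten(),
    torch.nn.Linear(8 * 26 * 26, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for xb, yb in loader:    # 256 weight updates per epoch instead of 256 / batch_size
    loss = torch.nn.functional.cross_entropy(model(xb), yb)
    opt.zero_grad()
    loss.backward()
    opt.step()
```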

Ilya Sutskever seems to agree:

It would actually be nice to use minibatches of size 1, and they would probably result in improved performance and lower overfitting; but the benefit of doing so is outweighed by the massive computational gains provided by minibatches.

In case you don't know who Ilya Sutskever is: Ilya Sutskever - Wikipedia

1 Like

Oh, and for temporal context you would typically use attention mechanisms. It's strange that no one has mentioned that yet.
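For example, a bare-bones scaled dot-product self-attention over a sequence (a sketch of the idea, not a full Transformer):

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d_model); every position can attend to every other,
    # which is how temporal context gets mixed in without per-sample updates
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

seq = torch.randn(16, 32)   # 16 time steps, 32 features each
out = scaled_dot_product_attention(seq, seq, seq)   # self-attention over the sequence
```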