STREAMER: Streaming Representation Learning and Event Segmentation in a Hierarchical Manner

My bet is that this is the next big thing, or at least that the architecture/ideas/principles from this work will inspire it.

Thanks Numenta for sharing the author’s presentation on your channel.


Which one do you think is important in a model?

  1. A predictor mechanism
  2. Creating a suitable environment for making good connections, based on the higher-intensity parts of the input information.

And how do we know which part of the input information is necessary to encode?


All three above are important.

The last one is quite new, and they mention:

In our approach, each layer is trained with two primary objectives: making accurate predictions into the future and providing necessary information to other levels for achieving the same objective. The event hierarchy is constructed by detecting prediction error peaks at different levels, where a detected boundary triggers a bottom-up information flow. At an event boundary, the encoded representation of inputs at one layer becomes the input to a higher-level layer.

So in other words, each layer not only attempts to do its best at predicting what immediately follows, but also to know when its prediction will inevitably be poor(er). Those sudden dips in predictability are used as boundaries between chunks, which are packed and encoded to be passed to higher levels.

To understand why this idea could be a breakthrough, let’s assume we have a toy language model reading text and attempting to predict the next character. Since the model is relatively small and narrow-sighted, and since there’s a limited number of words in any dictionary, after “diction” the following “a”, “r”, “y” will have higher and higher likelihood. That happens with every word in the sequence: later letters are easier to predict, but once the word itself ends, it is much harder to guess the first letter of the following word.

That’s what triggers a delimiter and prompts the toy model to pack/encode the last letters since the previous delimiter into a new “token” and pass it to the higher level.

This way the toy model becomes a tokenizer/embedder for the layer above, that sees mostly words, at a different time scale.

It segments a raw stream of data into “things”. And now we also have a pretty general definition of what makes a “thing”: it is an island of predictability, a region/patch in the data where seeing only parts of it makes it easy to predict the other parts. And of what isn’t a thing: patches where, when partially covered, the model cannot guess how the hidden parts look.
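As a toy illustration of cutting at dips in predictability (entirely my own sketch, not the paper’s method), a simple character-bigram model can stand in for the predictor: transitions inside a familiar word are unsurprising, while the first character after a word boundary is hard to guess, and that is where the stream gets cut:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count character-bigram frequencies in a training corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return counts

def surprise(counts, a, b):
    """1 - P(b | a): low inside familiar words, high across word boundaries."""
    total = sum(counts[a].values())
    if total == 0:
        return 1.0
    return 1.0 - counts[a][b] / total

def segment(counts, stream, threshold=0.5):
    """Cut the stream wherever the next character is too surprising."""
    chunks, current = [], stream[0]
    for a, b in zip(stream, stream[1:]):
        if surprise(counts, a, b) > threshold:
            chunks.append(current)  # boundary hit: pack the chunk, send it "up"
            current = b
        else:
            current += b
    chunks.append(current)
    return chunks
```

Trained on repetitions of two “words”, the within-word transitions are deterministic while the post-space transitions are uncertain, so the cuts land at word boundaries.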


Yes… understood… It looks interesting: the model takes in new information when there is an error, in order to correct the error, similar to the brain.


In the context of an LLM (if at all possible), it’s not clear from the video whether each token in each island (e.g. a sentence is an island of words, or a paragraph is an island of sentences) can attend to any other tokens at the same level with some form of multi-headed attention module, or whether each token can only attend to tokens in the same island. I think in LLMs each input token needs to have a direct (or indirect) way to attend to every other input token.


This is the kind of research I have been looking for. My question is: why 3 layers? Was that for simplicity, to prove the model has merit before scaling? Or is there a technical reason in the design/algorithm that causes this as a requirement? Why not six? Perhaps that is to come.


@Jeremiah_Barkman I guess yes, that’s the reason. And for the use cases they show, it seems to work.

@roboto I’m speculating a bit here from what I understood:

The concept can be implemented with whatever basic sequence predictor one prefers: LSTM, transformer, whatever. It should be able to use HTM’s TM too, by using its anomaly detection events as the delimiter signal.

What is important to clarify is that each level runs its own predictor (its own RNN, LSTM, TM or transformer).
How the predictor within a given level represents its own context (aka past “tokens”) is specific to its implementation - e.g. RNNs have a hidden state, transformers have a window of N past tokens. In the video they mentioned using transformers with a window size of 50 tokens.
What is also important to notice is

  • the first, lowest-level transformer will have in its window the past 50 characters and generate “words” by packing every 2-10 characters,
  • the second will see the past 50 “words” received from level 1 and group them into “sentences” by packing the words seen since the last boundary,
  • the one above it (level 3) will see the past 50 “sentences” it received.

I put “words” and “sentences” in quotation marks because these might not match our comprehension of what a word or sentence means; it is just how each level decides to cluster or segment its input by itself, passing each block up as a single embedding.

And there is also a lateral “exit” predictor, which is fed predictions from this 3-level stack to make a better prediction for the next character.
The exit predictor might have in its window:

  • from level 3 the past 20 sentences + predicted next one,
  • from level 2 past 20 words + predicted next word
  • from level 1 past 20 characters + predicted next character

And it makes its own guess about what the real next character should be, based on this hybrid, 63-token context.
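A minimal sketch of how such a hybrid context could be assembled (the function name, the per-level size k=20, and the data layout are all my assumptions, just to make the 3 × (20 + 1) = 63 arithmetic concrete):

```python
def exit_context(level_histories, level_predictions, k=20):
    """Concatenate, per level, the last k tokens plus that level's
    prediction for the next one: 3 levels * (20 + 1) = 63 entries."""
    context = []
    for history, predicted_next in zip(level_histories, level_predictions):
        context.extend(history[-k:])      # last k tokens seen at this level
        context.append(predicted_next)    # plus this level's own prediction
    return context
```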

At least that’s what I understand from it.


So instead of a single large predictor with a huge window of characters, there are several predictors, each with its own perspective on the past, feeding each other.

A somewhat similar idea of stacking transformers is in Meta’s Megabyte yet that seems to differ by:

  • having only two levels: a small “tokenizer” transformer and a large one that is fed tokens from the small one,
  • it doesn’t seem to apply any surprise-driven logic to segment a level’s input; it just splits it into fixed-size chunks, e.g. every 8 characters the lowest transformer compresses them into a token
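For contrast, that fixed-stride segmentation amounts to nothing more than cutting every N items (a trivial sketch of my own, just to highlight the difference from surprise-driven cuts):

```python
def fixed_chunks(stream, size=8):
    """Megabyte-style chunking: cut every `size` items, regardless of content."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]
```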

OK, I understand what you mean. If the predictor uses a transformer and if each of the last 50 tokens can attend to each other (it doesn’t matter whether the token being attended to belongs to the same segment/island or not), then I’m satisfied, even if the context window length is only 50. The 50-long context window might need the aid of retrieval-augmented generation. If that’s the case, in what ways can this research improve current LLMs?

Aside from the matter of having to define a fixed number of levels, as someone else mentioned: at higher levels, segments/islands may not be made up of contiguous tokens (i.e. not made up of sequential/adjacent tokens). Also it might be the case that each token could belong to one or more segments/islands at a higher level… maybe.



My thinking is that a prediction mechanism can be used to detect co-occurring inputs (forming a pattern) based on the frequency of occurrence of these two - or more - inputs relative to other patterns in the data. So if the pattern (“A”, “B”) occurs in the data with higher frequency than the pattern (“A”, “C”), then a prediction model given “A” will have lower prediction error with “B” than with “C”. This amounts to recognizing patterns in the data at multiple levels of abstraction. These patterns don’t have to be just temporal; spatial patterns help us recognize objects, forming a part-whole hierarchy.

The prediction process itself forms the representations (through temporal encoding) and decides on which part of the input is necessary to be encoded, not only for predicting the future input at the same level, but also for predicting higher level and lower level future inputs that cover different levels of event detail and context.



At level (l), each token can only attend to other inputs in the same “predictability island” during temporal encoding. Attending to previous islands is done indirectly through the top-down and bottom-up inference connection during hierarchical prediction. For example, a word can predict the next word by attending to a summarized representation of the previous sentence, at level (l+1) and the previous paragraph, at level (l+2), as well as the most recent character from level (l-1). These levels provide information that help with prediction at each level, without having to encode the full context explicitly as currently done in LLMs.

Current LLMs struggle to encode large context because they don’t summarize inputs as they process the tokens. By summarizing and using these summarized representations (top-down connections) for future predictions, we can reduce the memory requirements of such models. The main premise here is that summarizing should be based on predictability/redundancy; the information being summarized at level (l) must be redundant to level (l+1). We should not use a fixed kernel size to summarize.


Thanks @Jeremiah_Barkman

There is no technical reason that limits stacking up more layers, except for the availability of data with long videos. STREAMER is initially trained with one layer; then, as training progresses, we stack more layers based on data availability. There needs to be enough data to train the higher layers, because they receive sparser inputs and therefore require more time to be trained.

@cezar_t Thanks for starting the discussion :slight_smile:

Yes, any recurrent or transformer architecture can summarize the inputs. The ablations show that there isn’t a big difference between the performance of a transformer, LSTM and a GRU, simply because the context to be summarized is short and is being reset at every prediction peak.

However, the number of tokens to be summarized depends entirely on the data and the prediction capability of the layer, not on a fixed N value. The sequence is reset at event boundaries, causing the input sequence to the transformer to start over. The transformer only summarizes what it can predict and what is considered redundant to the layer above. The value N refers to the last N prediction-error values, which are used for calculating the threshold with a running-window average, but it does not affect the sequence length given to the temporal encoding function.


Creating a suitable environment for making connections (only if there is a higher-intensity part in the input information) creates a kind of plasticity. Doesn’t that invariably lead to the birth of a prediction mechanism in the model?


Hi @Ramy_Mounir, thanks for joining here.
I might bother you with a few more questions, starting with:

Assume the first (lowest) layer has a history size of 50. Since the last message to the upper layer it has received 10 new tokens (actually embeddings from the CNN), and now it decides to trigger a new message to the layer above it in the hierarchy.

  1. How does it compress these newest 10 vectors into a single one?
  2. By resetting its context, do you mean it clears its own history context (of up to 50 embeddings) after every message to the higher level? I thought that keeping it, by simply rolling over the past 50 tokens, should help it improve its own predictions.


Interesting! I’m unfamiliar with the idea of using input intensity to drive connections. Can you elaborate and perhaps give a toy example of what a higher-intensity part is in text or vision, and the intuition behind how modeling intensity leads to predictive behavior?

@cezar_t Sure, happy to.

Here is a toy example. Let’s say we have the characters (“b”, “i”, “g”, " ", “d”, “o”, “g”); the temporal encoding transformer is always applied on the set of inputs. So we start with ‘b’ and try to predict ‘i’: only ‘b’ goes through the transformer and we create a representation of it (i.e., z_{t}), then this representation tries to predict ‘i’. If it can, ‘i’ is added to the set of inputs (i.e., {‘b’, ‘i’}). If at any point it can’t predict the next input, we send the current representation (i.e., z_{\text{latest}}) to the layer above and reset the set of inputs to the input we could not predict. When the set of inputs {“b”, “i”, “g”, " "} tries to predict “d” and can’t, the current representation we have for {“b”, “i”, “g”, " "} is sent up and the set of inputs becomes {“d”}. With more layers added, the prediction process attends to other levels’ representations (words, sentences, etc.) to predict better with context.
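The walk-through above can be sketched as a greedy loop (my simplification: `can_predict` is a placeholder oracle standing in for the learned predictor and its z representations):

```python
def segment_by_prediction(stream, can_predict):
    """Grow a set of inputs while its representation can predict the next
    input; on failure, send the chunk up and restart from the missed input."""
    chunks, current = [], [stream[0]]
    for nxt in stream[1:]:
        if can_predict(current, nxt):
            current.append(nxt)      # prediction succeeded: absorb the input
        else:
            chunks.append(current)   # z_latest is sent to the layer above
            current = [nxt]          # reset the set to the missed input
    chunks.append(current)
    return chunks
```

With a toy oracle that "knows" the words "big " and "dog", the stream b-i-g-space-d-o-g splits exactly as in the example: {“b”, “i”, “g”, " "} is sent up when “d” cannot be predicted.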

  1. How does it compress these newest 10 vectors into a single one?

The premise here is that if a z representation can predict the next input, then this z is a good representation of the current set of inputs + this new input. If it can’t, then it is only a good representation of the current set of inputs. Learning to predict drives the representation quality to become better by extracting the features that are helpful for the prediction.

  1. By resetting its context, do you mean it clears its own history context (of up to 50 embeddings) after every message to the higher level? I thought that keeping it, by simply rolling over the past 50 tokens, should help it improve its own predictions.

The set of inputs is not the same as the history of 50 prediction errors. The history of prediction errors only helps to figure out when to trigger an event boundary and reset the actual set of input embeddings. It is best not to think too much about the history of 50 errors, as it can be replaced by a fixed threshold and still work to some extent. The set of inputs is the actual context that gets summarized; the history of 50 error values does not go through the transformer.

Using a fixed context size for the encoder is exactly what I’m trying to avoid in this architecture. It’s much easier to summarize some observations if they are part of one coherent event (or one object), with all of its observations being predictable and frequently co-occurring. For example, it would be easy to summarize the representation of a cup, because all of the features on the cup are predictable given the other features on the cup. But if I give you a window of half a cup and half background, it would be very hard to come up with a consistent representation that would be redundant to the higher level, because the background keeps changing relative to the cup and therefore does not form one object.

Maybe this would help:
x_t^{(l)} is the observation at time t and level (l)
\hat{x}_t^{(l)} is the prediction of the observation at time t at level (l)
\cal{X}_t^{(l)} is the set of inputs at level (l)
z_{t}^{(l)} is the representation of set \cal{X}_t^{(l)} after going through the transformer f^{(l)}
p^{(l)} is the hierarchical prediction function that collects the latest representations from all layers and predicts the next input \hat{x}_t^{(l)}
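Putting this notation together, one step might read as follows (my reconstruction from the definitions above, not the paper’s exact equations):

```latex
% My reconstruction, using the notation defined above.
z_t^{(l)} = f^{(l)}\!\big(\mathcal{X}_t^{(l)}\big), \qquad
\hat{x}_{t+1}^{(l)} = p^{(l)}\!\big(z_t^{(l-1)},\, z_t^{(l)},\, z_t^{(l+1)}\big)

\mathcal{X}_{t+1}^{(l)} =
\begin{cases}
\mathcal{X}_t^{(l)} \cup \{x_{t+1}^{(l)}\} & \text{if } \hat{x}_{t+1}^{(l)} \text{ was accurate,} \\[2pt]
\{x_{t+1}^{(l)}\} & \text{otherwise } (z_t^{(l)} \text{ is sent up as the next input to level } l{+}1).
\end{cases}
```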


Thanks, it seems I didn’t get it right; all I explained above is quite… approximative :sunglasses: .
I’ll have to reread the paper.

A few more general questions:

  • I see some of the data sets also have sound, and subsets of them recorded inertial sensor data or even eye movement.
    Have you considered encoding this extra info and somehow including it in the low-level embedding stream?
    Intuitively, the more context the better - e.g. certain noises might predict a certain moving pattern, or certain head movements could “predict” image changes.
    Or are there some important obstacles preventing you from running such tests?
    One can think of different channels - image, sound, head movement, eye movement, etc. - and somehow they could either provide each its own embedding at every time step, or combine all into a single “multimodal” embedding, etc.

  • I expect wildlife recordings have lots of boring passages with no actual change in the frame - e.g. when the bird isn’t in the nest or is sleeping.
    Generating time-passing embeddings when the image encoder does not significantly change its output could be more useful than trying to predict the future from a context filled with the same, unchanged embedding,
    e.g. having tokens meaning “no change for the past 2 frames”, “…4 frames” … “100000 frames” …
    Or, in these situations, will the lowest level simply not trigger any event?


I don’t see any obstacles to building multimodal representations with STREAMER. Other than just adding the audio or other modalities to the input layer, we can build multiple hierarchies for the different modalities sharing the same hierarchical structure (event boundaries). The task here would be not only to predict the future in one modality, but also to predict the modalities from each other. We end up with a model that can predict/generate audio from video and video from audio. What I’m trying to build is a pattern recognition model that detects and predicts co-occurrences of inputs. These patterns can be temporal (as in STREAMER), spatial (recognizing patterns between pixels or small patches), or multimodal (recognizing co-occurring patterns between modalities), or ideally all of them together conditioned on motor signal. Since this is tested on egocentric videos, it would be very interesting to condition the predictions on head movement. I believe this would result in much better predictions, and therefore more precise event boundary detection.

Good point, the way I designed the loss threshold is to move with the error signal such that if we have long boring events with nothing happening, the prediction error over the moving average window becomes small and the model ends up being sensitive to subtle changes. I am not sure if there are cognitive psychology experiments that support this theory. It would be interesting to look it up…


If my inputs become boring, the inbuilt LLM starts chatting to keep me entertained.


Damn, just when I thought I finally had an original idea worth investigating… someone else comes along and publishes a full paper about the exact same thing before me.

Welp, it’s my fault for procrastinating so much.