Alright, now it is clear. Thank you!
Edit: I ran some simulations, and the following idea could successfully detect and learn new objects, but it had some bad effects on the inference part, so it’s not the right solution.
Just guessing, but it may not be that difficult:
Let’s assume that a learned feature/location pattern (P1) causes the output layer to represent an already learned object (the layer is in ‘infer mode’, since the feed-forward activation was strong enough).
Let’s call the output layer’s activation A1.

After P1, a new, unseen pattern (P2) arrives. Because it isn’t a learned pattern, the feed-forward activation won’t be as “strong”, so the output layer switches to ‘learn mode’ and chooses another set of active cells to represent the new object (A2).

Now, as described in the paper, the output layer’s active neurons adapt their connections to the previous time step’s activations.

So the A2 neurons will be more likely to become active if the A1 neurons are active.
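A rough NumPy sketch of that kind of lateral learning (the matrix name `lateral_perms`, the increment/decrement sizes, and the per-cell loop are my own assumptions, not the paper’s pseudocode):

```python
import numpy as np

def learn_lateral(lateral_perms, current_active, prev_active,
                  increment=0.1, decrement=0.02):
    """Hebbian-style sketch: cells that are active now (e.g. A2) strengthen
    their basal/lateral permanences to cells that were active in the
    previous time step (e.g. A1) and weaken them to the rest."""
    all_cells = np.arange(lateral_perms.shape[1])
    prev_inactive = np.setdiff1d(all_cells, prev_active)
    for cell in current_active:
        lateral_perms[cell, prev_active] += increment    # reinforce A1 -> A2 links
        lateral_perms[cell, prev_inactive] -= decrement  # weaken unused links
    np.clip(lateral_perms, 0.0, 1.0, out=lateral_perms)
    return lateral_perms
```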
At the next epoch we feed P1 to the system again. If we follow the output layer’s activation rule, then only the A1 cells will be active (because the A2 neurons don’t receive feed-forward activation).

However, if we assume that the feature/location signal is a “strong” signal, then the A1 cells could be active enough to make the A2 cells active as well.

After the activation of the output layer is determined, active cells in the output layer grow connections to the input layer’s active cells. So A2 neurons will also learn the P1 pattern…
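As a sketch, that proximal learning step could be as simple as the following (the boolean `connected` matrix and the function name are hypothetical):

```python
import numpy as np

def grow_feedforward_connections(connected, output_active, input_active):
    """Every active output cell (A1 and, with the modified rule below, A2 too)
    grows connected synapses to the currently active input cells, so the A2
    cells also become responsive to the P1 pattern the next time it is shown."""
    for cell in output_active:
        connected[cell, input_active] = True
    return connected
```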
The only change I made to the output layer’s activation rule is that if the layer is in ‘infer mode’ (i.e. if the feed-forward signal is strong enough), then cells in the layer don’t need feed-forward input to activate; it’s enough if the basal signal is strong enough.
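In code, that modified rule might look roughly like this (the thresholds and names are illustrative assumptions, not values from the paper):

```python
import numpy as np

def infer_mode_activation(ff_overlaps, basal_overlaps,
                          strong_threshold=9, basal_threshold=5):
    """In 'infer mode' a cell can become active either because its
    feed-forward overlap is 'strong' (the A1 cells) or because its basal
    signal alone crosses a threshold (the A2 cells pulled in by A1)."""
    ff_driven = np.flatnonzero(ff_overlaps >= strong_threshold)
    basally_driven = np.flatnonzero(basal_overlaps >= basal_threshold)
    return np.union1d(ff_driven, basally_driven)
```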
I don’t know whether this can happen in the brain (probably not), but to me it sounds logical that if a neuron gets a strong activation signal, it can activate other cells in the same layer.
BTW, when I say “strong” feed-forward signal, I mean that a cell’s overlap is above a threshold. If the number of active cells in the input layer is, let’s say, 10, and during initialization we set 50% of the permanences to be connected, then we could say that a neuron has a “strong” signal if its overlap is >= 9. (With 50% of synapses connected at random, a cell’s expected overlap with 10 active inputs is only about 5, so an overlap of 9 or more is well above chance.)
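For example, with the numbers above (10 active input cells, threshold 9), the check is just a thresholded overlap count; `connected` is again a hypothetical boolean matrix of connected synapses:

```python
import numpy as np

def has_strong_signal(connected, cell, input_active, threshold=9):
    """A cell's overlap is the number of currently active input cells it has
    a connected synapse to; 'strong' means overlap >= threshold (9 of 10 here)."""
    overlap = np.count_nonzero(connected[cell, input_active])
    return overlap >= threshold
```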
So if the signal is a strong one, then we switch to ‘infer mode’, and during inference cells can be activated without a feed-forward signal.
However, if the feed-forward signal is not strong enough, then we select ‘learn mode’ (i.e. fix the output layer’s activation and learn the new object).
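Putting the two modes together, one step of the output layer under this proposal could look like the sketch below; the SDR size, the thresholds, and the random cell selection (`rng` is a NumPy `default_rng()` generator) are all assumptions of mine, not the paper’s algorithm:

```python
import numpy as np

def output_layer_step(ff_overlaps, basal_overlaps, num_cells, sdr_size, rng,
                      strong_threshold=9, basal_threshold=5):
    """A 'strong' feed-forward signal anywhere puts the layer in infer mode
    (feed-forward plus basal-only activation); otherwise the layer enters
    learn mode and fixes a fresh random SDR to represent the new object."""
    ff_driven = np.flatnonzero(ff_overlaps >= strong_threshold)
    if ff_driven.size > 0:
        # infer mode: cells can also activate on basal input alone
        basally_driven = np.flatnonzero(basal_overlaps >= basal_threshold)
        return np.union1d(ff_driven, basally_driven), 'infer'
    # learn mode: choose a new random set of cells (A2) to learn the object
    new_sdr = np.sort(rng.choice(num_cells, size=sdr_size, replace=False))
    return new_sdr, 'learn'
```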