One important limitation of transformers is their limited, fixed-size active memory: how much of the recent past they can “see” in the current conversation.
This is exacerbated by prompting, which uses a “prelude” of text to steer the LLM towards a desirable output. A longer prompt can provide more complex context, but it consumes more of the already limited history window. That is made worse by the need to reinject the same prompt again and again to prevent the model from drifting away from the prompt’s influence. Or should it be called a spell?
Here’s one potential trick that might either work or fail miserably, but I think it is easy to test by whoever has the means and the knowledge to … mess with an LLM during inference.
The trick would be:
- prompt the model as well as we see fit.
- during inference of the last prompt token, save the intermediate output of each attention head in each transformer block.
- during the following conversation, instead of simply forwarding the new output of each attention block, compute a weighted average between the corresponding saved prompt-time result and the current one, so that at every new token the model is slightly biased towards the context computed during the prompt phase.
Well, it most likely fails, but it doesn’t sound like something that is too expensive or difficult to try.
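For whoever wants to try it, the three steps above could be sketched roughly like this in PyTorch. This is a toy, not a real LLM: the module, its dimensions, and the blending weight `alpha` are all illustrative assumptions, and a real experiment would wrap the attention modules of an actual pretrained model instead.

```python
# Hypothetical sketch of the trick: cache an attention block's output at
# the last prompt token, then blend it into every later output.
import torch
import torch.nn as nn

class BlendedAttention(nn.Module):
    """Toy self-attention layer that can cache its prompt-time output
    and later average it into new outputs (all names are illustrative)."""
    def __init__(self, embed_dim=16, num_heads=4, alpha=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.alpha = alpha   # weight given to the saved prompt-time state
        self.cached = None   # intermediate output saved at the last prompt token

    def forward(self, x, cache_prompt=False):
        out, _ = self.attn(x, x, x)  # plain self-attention
        if cache_prompt:
            # step 2: save the intermediate output at the final prompt token
            self.cached = out[:, -1:, :].detach()
        elif self.cached is not None:
            # step 3: weighted average with the saved prompt-time result,
            # nudging every new token towards the prompt-phase context
            out = (1 - self.alpha) * out + self.alpha * self.cached
        return out

torch.manual_seed(0)
layer = BlendedAttention()
prompt = torch.randn(1, 5, 16)        # batch of 1, five prompt tokens
_ = layer(prompt, cache_prompt=True)  # step 1–2: run the prompt, save state

new_tokens = torch.randn(1, 3, 16)    # later conversation tokens
biased = layer(new_tokens)            # outputs nudged towards the prompt state
print(biased.shape)
```

In a real model one would register forward hooks on each transformer block’s attention module rather than subclassing, but the blending arithmetic would be the same.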
I think the same effect could be obtained by slightly changing the biases after the prompt towards “magnifying” that particular state, so there’s no need to mess with the execution itself: just nudge (the biases of) the model towards a state or “perspective” we want to influence the further conversation, then keep going without re-prompting.
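The “bias the biases” variant might look like the following sketch: a one-off edit to a projection layer’s bias vector, shifted towards the hidden state observed at prompt time. Again everything here (the layer, `prompt_state`, `alpha`) is an illustrative assumption rather than a recipe for any particular model.

```python
# Hypothetical sketch: instead of blending at runtime, permanently nudge
# a layer's bias towards the prompt-time hidden state.
import torch
import torch.nn as nn

torch.manual_seed(0)
proj = nn.Linear(16, 16)          # stand-in for some projection in the model
prompt_state = torch.randn(16)    # hidden state captured at the last prompt token

alpha = 0.05                      # how strongly to magnify the prompt state
with torch.no_grad():
    # one-off weight edit: every future forward pass is now slightly
    # pulled towards the prompt-time state, with no re-prompting needed
    proj.bias += alpha * prompt_state

x = torch.randn(1, 16)
out = proj(x)                     # output of the edited layer
print(out.shape)
```

The appeal of this variant is that inference code stays untouched: the “spell” lives entirely in the model’s parameters until the biases are restored.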