SDRs for unlimited-size dictionaries in language models?

Browsing some old papers on language modelling, I found out why they used limited-size dictionaries, and even more than a decade later LLM tokenizers are still capped at 30k- or 50k-entry dictionaries. And they are so goofy, splitting many words into unintuitive … pieces.

The answer is that they use an end-to-end one-hot representation for both input and output.
Which means that for an embedding size of e.g. 4k and a 50k-word vocabulary, the encoder or decoder needs a 4k x 50k trainable array, roughly 200M parameters, at each end. Which is a significant cost.
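To make that concrete, here is the back-of-the-envelope arithmetic in a few lines of Python; the 4k embedding size and 50k vocabulary are just the figures above, not anyone's actual model config:

```python
# Back-of-the-envelope size of the one-hot "ends" of the model
# (assumed figures: 4096-dim embeddings, 50k-word vocabulary).
vocab_size = 50_000
embed_dim = 4_096

input_embedding = vocab_size * embed_dim    # token -> vector lookup table
output_projection = embed_dim * vocab_size  # vector -> 50k-way softmax logits

print(f"per end: {input_embedding / 1e6:.0f}M trainable parameters")  # ~205M
```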

For a “real” intelligence one would expect the opposite - to pop out new words for new things.

And here is how SDRs might help - instead of fixed one-out-of-all-50k-known-words matrices at the input and output, use a random 64-bits-out-of-4k SDR per word, which could handle all the words the AGI could ever come up with. And a few extra.
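A minimal sketch of what minting such codes could look like, assuming nothing beyond NumPy and the 64-of-4096 numbers above; the `random_sdr` helper and the example words are purely illustrative:

```python
import numpy as np

SDR_SIZE = 4_096     # total bits per code
ACTIVE_BITS = 64     # active bits per word

rng = np.random.default_rng(0)

def random_sdr(rng, n_bits=SDR_SIZE, n_active=ACTIVE_BITS):
    """Indices of the active bits of a freshly minted word code."""
    return np.sort(rng.choice(n_bits, size=n_active, replace=False))

# A new word simply gets a new random code - no fixed vocabulary cap.
dictionary = {"hello": random_sdr(rng), "world": random_sdr(rng)}
```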

That idea popped up when I counted the collisions among 200k randomly generated SDRs: they averaged about 0.5 overlapping bits per pair. A simple ANN search would make decoding (at the output) pretty fast, while the trainable “dictionary” matrices would be an order of magnitude smaller, e.g. 4k x 4k instead of 4k x 30k.
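Something along these lines reproduces both checks. It is a brute-force sketch with assumed sizes (4096-bit codes, 64 active bits, 200k words); a real ANN index such as FAISS would stand in for the exhaustive scoring on a serious dictionary. For uniformly random codes the expected pairwise overlap works out to about 64·64/4096 = 1 bit, in the same ballpark as the figure above:

```python
import numpy as np

# (a) how often random word codes "collide" (share active bits)
# (b) decoding a noisy output vector by picking the best-overlapping code
SDR_SIZE, ACTIVE_BITS, N_WORDS = 4_096, 64, 200_000
rng = np.random.default_rng(0)

# Store each word's SDR as its 64 active-bit indices (200k x 64 ints ~ 50 MB).
codebook = np.stack([
    rng.choice(SDR_SIZE, ACTIVE_BITS, replace=False) for _ in range(N_WORDS)
])

# (a) average overlap between randomly sampled word pairs.
i, j = rng.integers(0, N_WORDS, size=(2, 10_000))
overlap = np.array([len(np.intersect1d(codebook[a], codebook[b]))
                    for a, b in zip(i, j) if a != b])
print("mean overlap bits:", overlap.mean())   # ~1 for 64-of-4096 random codes

# (b) decode: score each code by how much of the output's mass sits on its
# active bits, then take the argmax (exhaustive here, ANN in practice).
target = 42
output = np.zeros(SDR_SIZE, dtype=np.float32)
output[codebook[target]] = 1.0
output += rng.normal(0.0, 0.3, SDR_SIZE).astype(np.float32)   # add some noise
scores = output[codebook].sum(axis=1)          # (N_WORDS,) overlap scores
print("decoded word id:", int(scores.argmax()))  # should recover 42
```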

PS: it would also save a lot of the expensive context window. Models now use ~1.25 tokens for each real word, and since attention cost grows roughly with the square of the sequence length, cutting off the extra 0.25 tokens per word would save ~36% of it…
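The ~36% is just the rough quadratic-attention arithmetic, assuming the ~1.25 BPE tokens per word drop to exactly 1.0 whole-word codes:

```python
# 1 - (1.0 / 1.25)^2 = 1 - 0.64 = 0.36, i.e. ~36% less attention compute.
bpe_tokens_per_word = 1.25
sdr_tokens_per_word = 1.00

attention_saving = 1 - (sdr_tokens_per_word / bpe_tokens_per_word) ** 2
print(f"attention cost saved: {attention_saving:.0%}")   # 36%
```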


I think what you are proposing is similar to the semantic folding approach of CorticalAI. The main difference is that they are attempting to bind token strings to SDRs through correlations with other nearby tokens, in the hope of capturing some of their semantic context.


That’s a very competitive space nowadays; bert-small-uncased alone has 34 million downloads in the last month! RAG at a very large scale, I presume.
