SDRs for unlimited-size dictionaries in language models?

Browsing some old papers on language modelling, I found why they used limited-size dictionaries, and even more than a decade later LLM tokenizers are still capped at 30k- or 50k-entry dictionaries. And they are so goofy, splitting many words into unintuitive … pieces.

The answer is that they use an end-to-end one-hot representation for both input and output.
Which means that for an embedding size of e.g. 4k and a 50k-word dictionary, the encoder or decoder needs a ~200M-parameter trainable matrix at each end. Which is a significant cost.
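For concreteness, a quick back-of-the-envelope in Python, using only the 4k and 50k figures above (nothing model-specific):

```python
# Rough size of a dense one-hot vocabulary projection at either end of the model.
embedding_size = 4096      # "4k"
vocab_size = 50_000        # "50k" dictionary

params_per_end = embedding_size * vocab_size
print(f"{params_per_end:,} trainable parameters per end")   # 204,800,000 ≈ 200M
```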

For a “real” intelligence one would expect the opposite - to pop out new words for new things.

And here is how SDRs might help - instead of fixed one-out-of-all-50k-known-words matrices at the input and output, use a random 64-bits-out-of-4k SDR per word, which could handle all the words the AGI could ever come up with. And a few extra.
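A minimal sketch of what "assign a random 64-out-of-4k SDR to a word" could look like - the 64 and 4096 are the figures above, everything else is just illustration:

```python
import numpy as np

rng = np.random.default_rng()

def new_word_sdr(k: int = 64, n: int = 4096) -> np.ndarray:
    """Pick k random active bit positions out of n for a brand-new word."""
    return np.sort(rng.choice(n, size=k, replace=False))

sdr_car = new_word_sdr()   # 64 active indices somewhere in [0, 4096)
```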

That idea popped out when I counted the collisions among 200k randomly generated SDRs: they averaged 0.5 bits of overlap. A simple ANN search would make decoding (at the output) pretty fast, while the trainable “dictionary” matrices would be an order of magnitude smaller, e.g. 4k x 4k instead of 4k x 30k.
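Here is roughly how such a collision count and the ANN-style decode could be reproduced - this is not the original experiment, and the average overlap depends on the chosen k and N (for 64-of-4096 the expected overlap is about k²/N ≈ 1 bit):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, WORDS = 4096, 64, 200_000

# One random K-of-N SDR per word, each stored as its K active indices.
codes = np.array([rng.choice(N, size=K, replace=False) for _ in range(WORDS)])

# Estimate the average overlap between random pairs of word SDRs.
pairs = rng.integers(0, WORDS, size=(50_000, 2))
mean_overlap = np.mean([len(np.intersect1d(codes[i], codes[j])) for i, j in pairs])
print(f"mean pairwise overlap: {mean_overlap:.2f} bits")   # roughly K*K/N for random codes

# Decoding = find the stored code with the largest overlap with the model's output SDR.
# Brute force here; an ANN index (FAISS, Annoy, ...) would make this fast at scale.
def decode(output_sdr: np.ndarray) -> int:
    overlaps = np.isin(codes, output_sdr).sum(axis=1)
    return int(overlaps.argmax())
```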

PS: and it would save a lot of the expensive context window. Models now use ~1.25 tokens for each real word; since attention is still the expensive part, cutting off the extra 0.25 tokens would save ~36% of it…
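Assuming the ~36% refers to the quadratic attention cost, the arithmetic would be:

```python
# Attention cost grows with the square of the sequence length.
tokens_per_word_now = 1.25    # current tokenizers, per the estimate above
tokens_per_word_sdr = 1.00    # one SDR code per whole word
cost_ratio = (tokens_per_word_sdr / tokens_per_word_now) ** 2
print(f"remaining attention cost: {cost_ratio:.2f}")   # 0.64, i.e. ~36% saved
```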

2 Likes

I think what you are proposing is similar to the semantic folding approach of CorticalAI. The main difference is that they are attempting to bind token strings to SDRs through correlations with other nearby tokens, in the hopes of capturing some of their semantic context.

3 Likes

That’s a very competitive space nowadays - bert-small-uncased alone has 34 million downloads in the last month! RAG at a very large scale, I presume.

1 Like

Actually no. What I propose here is to replace the single number per token produced by an LLM tokenizer, e.g. between 1 and 30000 if the dictionary size is 30k.

There is no semantic association between the assigned number and its word; the only purpose is to have a unique numeric representation for each token.

The dictionary size is chosen to balance expressivity (a larger dictionary is more … complete) against the computing cost of multiplying a quite hefty matrix of size embedding_size x dictionary_size at the output of the model.

What I suggest here is an unlimited dictionary size WITH limited input/output matrices, by replacing one-of-N with top-k-of-N for each word.

top-k-of-N allows for a number of tokens much larger than N, by assigning k random values to each token - e.g. k = 4 and N = 10k.
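The combinatorics of why k-of-N blows past a plain one-of-N dictionary, with those example values:

```python
from math import comb

N, k = 10_000, 4
print(f"one-of-N codes: {N:,}")            # 10,000 distinct tokens
print(f"{k}-of-N codes : {comb(N, k):,}")  # ~4.2e14 possible distinct codes
```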

In current LLMs each token gets assigned an integer, e.g.
“car” becomes 7522 and “boy” is 14522, and there is a limited number of tokens. So for words that are not found within the 30k-size dictionary, the tokenizer must implement “clever” ways to split them into a series of multiple tokens that are in the dictionary. That is why the “dictionary” needs to contain lots of sub-words.

In what I propose, N is limited to e.g. 10k and
“car” will be assigned an arbitrary k-long SDR, e.g. [344, 1922, 6291, 8159]
instead of the single value 7522 above
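A toy sketch of such an assignment, where a previously unseen word simply gets a fresh random k-of-N code instead of being split into sub-words (the words and the random indices here are made up, so they will differ from the example above):

```python
import numpy as np

rng = np.random.default_rng()
N, K = 10_000, 4
codebook: dict[str, tuple[int, ...]] = {}   # grows without any fixed cap

def tokenize(word: str) -> tuple[int, ...]:
    """Return the word's k-of-N SDR, assigning a fresh random one on first sight."""
    if word not in codebook:
        codebook[word] = tuple(sorted(int(i) for i in rng.choice(N, size=K, replace=False)))
    return codebook[word]

print(tokenize("car"))   # e.g. four indices like (344, 1922, 6291, 8159)
print(tokenize("boy"))   # a different random 4-of-10k code
```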

Since the current integer tokens have no semantic value (uniqueness is their purpose), a very sparse, arbitrary SDR for each word would do just as well.

E.g. by assigning a 100-out-of-10000 SDR to each word, with the simple rule of minimizing overlap between SDRs so they can be uniquely and unambiguously identified, we get a virtually unlimited dictionary size with fixed-size-N matrices at the input and output of the model.
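And a shapes-only sketch of why those matrices stay fixed at N no matter how many words the codebook ends up holding - 100-of-10000 as in the example, with random placeholder weights, not a trained model:

```python
import numpy as np

rng = np.random.default_rng()
N, d, K = 10_000, 4096, 100                  # SDR length, embedding size, active bits

E_in  = rng.standard_normal((N, d)) * 0.02   # input "dictionary" matrix, always N x d
W_out = rng.standard_normal((d, N)) * 0.02   # output projection, always d x N

def embed(word_sdr: np.ndarray) -> np.ndarray:
    # A word's input vector is just the sum of its K active rows of E_in.
    return E_in[word_sdr].sum(axis=0)

def predict_sdr(hidden: np.ndarray) -> np.ndarray:
    # The output is a length-N score vector; its top-K positions form the predicted
    # SDR, which would then be matched to the nearest known word code (ANN search).
    scores = hidden @ W_out
    return np.sort(np.argpartition(scores, -K)[-K:])

word_sdr = np.sort(rng.choice(N, size=K, replace=False))   # a 100-of-10k word code
x = embed(word_sdr)        # d-dimensional vector fed into the model
y = predict_sdr(x)         # K active bits out of N coming back out
```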

1 Like