At 1d/2d, assuming you can get a functioning encoding system, this keeps things in the realm of being somewhat intuitively interpretable to mere mortals. More dimensions might help, but any of these encoding systems are essentially fuzzy-hashing, which could be used to generate keys for storing input values in a hashmap (where the xy coordinates are the key and the inputs are the value)… or, by delaying the input of the values by a timestep (given a chain of tokens), the key for N-1 receiving the value for N would create some forward chaining. At least that’s my humble thought at the moment.
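To make that concrete, here’s a minimal sketch in Python (the `encode_xy` function is just a stand-in for whatever 1d/2d fuzzy-hash encoder you’d actually use, and everything else is my own toy framing): the encoded key of token N-1 stores token N, so looking up the current token’s key chains forward to a likely successor.

```python
from collections import defaultdict, Counter

def encode_xy(token):
    # Placeholder for the real 1d/2d fuzzy-hash encoder; here we just
    # derive fake "coordinates" from the token's hash so the sketch runs.
    h = hash(token)
    return (h % 64, (h // 64) % 64)

# Associative memory: key = encoded coordinates of token N-1,
# value = counts of which tokens followed it (token N).
memory = defaultdict(Counter)

def train(tokens):
    for prev_tok, next_tok in zip(tokens, tokens[1:]):
        memory[encode_xy(prev_tok)][next_tok] += 1   # key N-1 receives value N

def predict(token):
    # Forward chaining: look up the current token's key and return the
    # most frequently stored successor, if any.
    followers = memory.get(encode_xy(token))
    return followers.most_common(1)[0][0] if followers else None

train("the cat sat on the mat".split())
print(predict("the"))   # -> 'cat' (the first stored follower of 'the')
```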
We may honestly be referring to similar things, varying only in our chosen vocabulary.
That scheme is interesting. A similar one has been bugging me, where the associative memory, instead of directly storing the following tokens for given context embeddings:
splits contexts into short and long; the long ones are used as memory keys.
the memory stores partial MLP matrix slices.
which are used to dynamically assemble a one-hidden-layer MLP that is trained to learn short context -> next token relations.
Think of it as a LoRA, except there is no “core” matrix upon which the LoRA adaptations are added; the matrix is dynamically built from rank-1 LoRA slices restored from the associative memory.
If the long context SDR key sparsity is e.g. 7%, then the memorized parameter space is >200x larger than the active one.
Given that the long-term context is slowly changing, a very large-scale memory, with hundreds of billions of parameters retrievable from disk, should be feasible.
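To make the assembly step concrete, here’s a rough numpy sketch under my own assumed shapes and sizes (nothing here comes from an existing implementation): each active bit of the long-context SDR key pulls one rank-1 slice out of the memory, the slices are summed into the hidden-layer matrix, and that one-hidden-layer MLP maps the short-context embedding to next-token logits.

```python
import numpy as np

d_in, d_hidden, d_vocab = 256, 256, 1000   # assumed sizes, purely illustrative
n_keys, k_active = 2048, 16                # SDR key space and its sparsity

rng = np.random.default_rng(0)
# Associative memory: one rank-1 slice (u, v) per SDR bit.  In a real system
# these would be learned and retrieved from disk, not random.
mem_u = rng.normal(0, 0.02, (n_keys, d_hidden))
mem_v = rng.normal(0, 0.02, (n_keys, d_in))

def assemble_mlp(active_bits):
    """Build the hidden-layer weight matrix from the rank-1 slices selected
    by the active bits of the long-context SDR key."""
    W_hidden = sum(np.outer(mem_u[i], mem_v[i]) for i in active_bits)
    return W_hidden                          # shape (d_hidden, d_in)

W_out = rng.normal(0, 0.02, (d_vocab, d_hidden))   # shared readout, just for the sketch

def forward(short_ctx_emb, active_bits):
    W_hidden = assemble_mlp(active_bits)
    h = np.maximum(W_hidden @ short_ctx_emb, 0.0)   # one hidden layer, ReLU
    return W_out @ h                                # next-token logits

active = rng.choice(n_keys, size=k_active, replace=False)   # stand-in SDR key
logits = forward(rng.normal(size=d_in), active)
print(logits.shape)   # (1000,)
```

Only the slices touched by the current long context need to be in RAM at any moment; the rest of the memory can sit on disk.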
I kinda got interested in trying to implement a system for that but training LLMs is impractical for most of us.
I wonder if anyone here is a linguist or someone versed in languages in general.
I’m thinking about trying to build a TLM (Tiny Language Model) using a made-up, simplified proxy language.
By proxy language I mean simplified strings that could act as a replacement for real-world data and could be programmatically generated on the fly, but that have most of the properties we think a real language must have, in a simplified form that a small model could learn.
For example, a typical Markov chain parrot might generate something like:
“cats are cute, that’s why humans have developed lungs, and I want food for my door.”
But your typical GPT will generate something more like:
“cats are cute, that’s why humans have developed a liking to them as pets, and my cat wants more food”
There’s an obvious long-range dependency between the word “cat” and “pet”, so a good model must be capable of remembering what’s been said previously.
This is just an example, but I bet there are several other properties a language has that we must be able to model properly.
I wonder if we could make a simplified “language” with only a handful of possible words that still captures those properties and is programmatically generatable.
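Something like this toy generator, maybe (the vocabulary, templates, and agreement rule are all made up just to illustrate the kind of long-range constraint I mean): the animal named at the start of the sentence has to reappear, with its matching sound word, after a stretch of filler.

```python
import random

ANIMALS = ["cat", "dog", "bird"]
SOUNDS  = {"cat": "meows", "dog": "barks", "bird": "sings"}
FILLER  = ["and", "then", "the", "people", "nearby", "smiled", "quietly"]

def sample_sentence(filler_len=6, rng=random):
    """Generate one sentence with a long-range dependency: the animal named
    at the start must agree with the animal and sound words at the end."""
    animal = rng.choice(ANIMALS)
    filler = [rng.choice(FILLER) for _ in range(filler_len)]
    return " ".join(["the", animal, "appeared", *filler,
                     "so", "the", animal, SOUNDS[animal]])

for _ in range(3):
    print(sample_sentence())
# e.g. "the dog appeared then quietly the nearby and smiled so the dog barks"
```

A model would only count as having learned this “language” if it predicts the right animal and sound after the filler, which is trivial to score automatically.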
Thank you, this seems perfect for me. I thought about taking an LLM and masking out the least common words while it generates text, but it honestly seemed like a pain to edit other models and I’m not even sure my hardware could handle it.
A very simple “language” can be generated by having a bot walk over MNIST digits.
Imagine there are words for commands and words that describe what the robot “sees”.
The robot’s actions can be turn left/right/back and step forward. After each action the “environment” responds with what the robot “sees”: the contents of the 3x3 pixel square it is looking at.
So there are 4 action words and 2**9 = 512 environment response words.
This simple simulator, even with random actions, would generate a textual description grounded in whatever underlying “reality” the robot walks through.
PS: this can be turned into an RL problem by rewarding the bot for harvesting the “ink” from the 28x28 MNIST digit and penalizing each new action.
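Here’s a rough sketch of that simulator (I substitute a random binary image for a real MNIST digit so it runs standalone, and I take the view to be the 3x3 square centered on the bot rather than in front of it, just to keep it short): the bot emits an action word and the environment answers with one of the 512 percept words.

```python
import numpy as np

rng = np.random.default_rng(0)
digit = (rng.random((28, 28)) > 0.8).astype(int)   # stand-in for a binarized MNIST digit

ACTIONS = ["left", "right", "back", "forward"]
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]      # up, right, down, left

def percept(pos, grid):
    """Return the 'word' (0..511) encoding the 3x3 square around the bot."""
    r, c = pos
    patch = np.zeros((3, 3), dtype=int)
    for dr in range(-1, 2):
        for dc in range(-1, 2):
            rr, cc = r + dr, c + dc
            if 0 <= rr < 28 and 0 <= cc < 28:
                patch[dr + 1, dc + 1] = grid[rr, cc]
    return int("".join(map(str, patch.flatten())), 2)

def rollout(steps=10):
    pos, heading = [14, 14], 1                     # start in the middle, facing right
    words = []
    for _ in range(steps):
        act = rng.choice(ACTIONS)
        if act == "left":    heading = (heading - 1) % 4
        elif act == "right": heading = (heading + 1) % 4
        elif act == "back":  heading = (heading + 2) % 4
        else:                                      # forward, clipped at the border
            dr, dc = HEADINGS[heading]
            pos = [min(27, max(0, pos[0] + dr)), min(27, max(0, pos[1] + dc))]
        words += [act, f"see{percept(pos, digit)}"]
    return " ".join(words)

print(rollout())   # e.g. "forward see32 left see32 forward see100 ..."
```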
That’s basically the same as training the model on synthetic tasks.
Either way, you’re embedding priors through data. It’s definitely something that’s studied on a small scale, but you can try really expanding the dataset and researching other improvements over this idea.
Yeah, I think what I really want is to capture the essence of what makes natural language hard in a very simple synthetic dataset with a small vocabulary.
I like this idea because it allows for rapid prototyping of new architectures.
Very interesting paper. Very strange conceptual process… asking an AI to effectively teach (provide the material) and then grade an AI.
From the results in Fig. 4, it would have been interesting to see whether the incremental gains continued for the model with a hidden size of 512 at 16 and 20 layers, particularly for grammar and consistency.
Figure 9… the 21M-parameter, 1-layer model…
“Can cows fly?”, Alice asked her mother.
“Yes!”, her mother replied. Alice and her mother went to the barn.
Then the effects of the LSD wore off…
I think training the model on synthetic tasks makes sense. We could actually look for ‘reversible’ facts in training data, reverse them, and train on the reverse too. For example, “Dave and John are programmers.” could also imply: “Dave is a programmer.” “John is a programmer.” or “If you see a programmer, it may be Dave.” / “If you see a programmer, it may be John.” This would perhaps have the effect of generalizing on categorization tasks, and the model would, in effect, ‘see through’ the concept from both angles. Obviously not everything works that way. But there is a human labelling the inputs at some point, and we already have a certain amount of science done here, so MAYBE we could use some of it to determine, at training time, which labels we can get for free by reversing or re-wording the things we are saying, and then train on ALL of it.

It’s kind of what we do as humans. We learn about a concept; we don’t just think of it ‘forward’ and then play it out of order in our brains later. We are exposed to it from all sides in many interactions throughout our lives. So, essentially, humans in real life ARE training on the forward and backward scenarios. To reproduce this behavior we could most likely use a rather stupid language processor to prepare our training data. It doesn’t decrease the effort required to get it going, but it may eventually reduce the amount of human effort once we have a good baseline set.

I think the existing LLMs would be able to deal with it, because all we would effectively be doing is feeding in a ton more training data. (To a casual observer it would seem to be just a restatement of existing training data; my hypothesis is that this would effectively cause the model to generalize, for lack of a better term for what you call that, at the cost of more compute during training.) I know it’s kind of disappointing that it costs so much to get this effect using a method like this, but you really only have to do it so many times before you can automate it. Perhaps I’m in left field, but I think we humans do a lot more brute-force training to learn the things we know than most people realize.
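A crude illustration of what that preprocessing could look like (the regex and templates here are entirely made up, and a real pipeline would need an actual parser or a small helper model): take a statement matching a known ‘reversible’ pattern and emit all the implied restatements as extra training examples.

```python
import re

def expand_fact(sentence):
    """Expand 'X and Y are Zs.' into the forward and reversed restatements.
    Purely a toy pattern; real data would need actual parsing or a helper model."""
    m = re.match(r"(\w+) and (\w+) are (\w+?)s\.", sentence)
    if not m:
        return [sentence]          # no known reversible pattern, keep as-is
    x, y, z = m.groups()
    return [
        sentence,
        f"{x} is a {z}.",
        f"{y} is a {z}.",
        f"If you see a {z}, it may be {x}.",
        f"If you see a {z}, it may be {y}.",
    ]

for line in expand_fact("Dave and John are programmers."):
    print(line)
```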
I hope you are a creative short story writer and a liar (this time), or a plagiarist; or, if you aren’t, that ChatGPT is programmed and trained to be so scarily efficient a plagiarist that it ought to be prohibited from competing with people as a short story writer.