At 1d/2d, assuming you can get a functioning encoding system, this keeps things in the realm of being somewhat intuitively interpretable to mere mortals. More dimensions might help, but any of these encoding systems are essentially fuzzy-hashing, which could be used to generate keys for storing input values in a hashmap (where the xy coordinates are the key and the inputs are the value)… or, by delaying the input of the values by a timestep (given a chain of tokens), the key for N-1 receiving the value for N would create some forward chaining. At least that’s my humble thought at the moment.
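To make that concrete, here’s a minimal sketch in Python (the `encode_xy` function is just a stand-in for whatever 1d/2d fuzzy-hash encoder you’d actually use, and everything else is my own toy framing): the encoded key of token N-1 stores token N, so looking up the current token’s key chains forward to a likely successor.

```python
from collections import defaultdict, Counter

def encode_xy(token):
    # Placeholder for the real 1d/2d fuzzy-hash encoder; here we just
    # derive fake "coordinates" from the token's hash so the sketch runs.
    h = hash(token)
    return (h % 64, (h // 64) % 64)

# Associative memory: key = encoded coordinates of token N-1,
# value = counts of which tokens followed it (token N).
memory = defaultdict(Counter)

def train(tokens):
    for prev_tok, next_tok in zip(tokens, tokens[1:]):
        memory[encode_xy(prev_tok)][next_tok] += 1   # key N-1 receives value N

def predict(token):
    # Forward chaining: look up the current token's key and return the
    # most frequently stored successor, if any.
    followers = memory.get(encode_xy(token))
    return followers.most_common(1)[0][0] if followers else None

train("the cat sat on the mat".split())
print(predict("the"))   # -> 'cat' (the first stored follower of 'the')
```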
We may honestly be referring to similar things, varying only in our chosen vocabulary.
That scheme is interesting. A similar one has been bugging me, where the associative memory, instead of directly storing the following tokens for given context embeddings:
splits contexts into short and long; the long ones are used as memory keys.
the memory stores partial MLP matrix slices.
which are used to dynamically assemble a one-hidden-layer MLP that is trained to learn short context -> next token relations.
Think of it as a LoRA, except there is no “core” matrix upon which the LoRA adaptations are added; the matrix is dynamically built from rank-1 LoRA slices restored from the associative memory.
If the long context SDR key sparsity is e.g. 7%, then the memorized parameter space is >200x larger than the active one.
Given that the long-term context is slowly changing, a very large-scale memory, with hundreds of billions of parameters retrievable from disk, should be feasible.
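To make the assembly step concrete, here’s a rough numpy sketch under my own assumed shapes and sizes (nothing here comes from an existing implementation): each active bit of the long-context SDR key pulls one rank-1 slice out of the memory, the slices are summed into the hidden-layer matrix, and that one-hidden-layer MLP maps the short-context embedding to next-token logits.

```python
import numpy as np

d_in, d_hidden, d_vocab = 256, 256, 1000   # assumed sizes, purely illustrative
n_keys, k_active = 2048, 16                # SDR key space and its sparsity

rng = np.random.default_rng(0)
# Associative memory: one rank-1 slice (u, v) per SDR bit.  In a real system
# these would be learned and retrieved from disk, not random.
mem_u = rng.normal(0, 0.02, (n_keys, d_hidden))
mem_v = rng.normal(0, 0.02, (n_keys, d_in))

def assemble_mlp(active_bits):
    """Build the hidden-layer weight matrix from the rank-1 slices selected
    by the active bits of the long-context SDR key."""
    W_hidden = sum(np.outer(mem_u[i], mem_v[i]) for i in active_bits)
    return W_hidden                          # shape (d_hidden, d_in)

W_out = rng.normal(0, 0.02, (d_vocab, d_hidden))   # shared readout, just for the sketch

def forward(short_ctx_emb, active_bits):
    W_hidden = assemble_mlp(active_bits)
    h = np.maximum(W_hidden @ short_ctx_emb, 0.0)   # one hidden layer, ReLU
    return W_out @ h                                # next-token logits

active = rng.choice(n_keys, size=k_active, replace=False)   # stand-in SDR key
logits = forward(rng.normal(size=d_in), active)
print(logits.shape)   # (1000,)
```

Only the slices touched by the current long context need to be in RAM at any moment; the rest of the memory can sit on disk.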
I kinda got interested in trying to implement a system for that but training LLMs is impractical for most of us.
I wonder if anyone here is a linguist or someone versed in languages in general.
I’m thinking about trying to build a TLM (Tiny Language Model) using a made-up, simplified proxy language.
By proxy language I mean simplified strings that could act as a replacement for real-world data and could be programmatically generated on the fly, but that have most of the properties we think a real language must have, in a simplified form that a small model could learn.
For example, a typical Markov chain parrot might generate something like:
“cats are cute, that’s why humans have developed lungs, and I want food for my door.”
But your typical GPT will generate something more like:
“cats are cute, that’s why humans have developed a liking to them as pets, and my cat wants more food”
There’s an obvious long-range dependency between the word “cat” and “pet”, so a good model must be capable of remembering what’s been said previously.
This is just an example, but I bet there are several other properties a language has that we must be able to model properly.
I wonder if we could make a simplified “language” with only a handful of possible words that still captures those properties and is programmatically generatable.
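Something like this toy generator, maybe (the vocabulary, templates, and agreement rule are all made up just to illustrate the kind of long-range constraint I mean): the animal named at the start of the sentence has to reappear, with its matching sound word, after a stretch of filler.

```python
import random

ANIMALS = ["cat", "dog", "bird"]
SOUNDS  = {"cat": "meows", "dog": "barks", "bird": "sings"}
FILLER  = ["and", "then", "the", "people", "nearby", "smiled", "quietly"]

def sample_sentence(filler_len=6, rng=random):
    """Generate one sentence with a long-range dependency: the animal named
    at the start must agree with the animal and sound words at the end."""
    animal = rng.choice(ANIMALS)
    filler = [rng.choice(FILLER) for _ in range(filler_len)]
    return " ".join(["the", animal, "appeared", *filler,
                     "so", "the", animal, SOUNDS[animal]])

for _ in range(3):
    print(sample_sentence())
# e.g. "the dog appeared then quietly the nearby and smiled so the dog barks"
```

A model would only count as having learned this “language” if it predicts the right animal and sound after the filler, which is trivial to score automatically.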
Thank you, this seems perfect for me. I thought about taking an LLM and masking out the least common words while it generates text, but it honestly seemed like a pain to edit other models and I’m not even sure my hardware could handle it.
A very simple “language” can be generated by having a bot walk over MNIST digits.
Imagine there are words for commands and words that describe what the robot “sees”.
The robot’s actions can be turn left/right/back and step forward. After each action the “environment” responds with what the robot “sees”: the contents of the 3x3 pixel square it is looking at.
So there are 4 action words and 2**9 = 512 environment response words.
This simple simulator, even with random actions, would generate a textual description grounded in whatever underlying “reality” the robot walks through.
PS: this can be turned into an RL problem by rewarding the bot for harvesting the “ink” from the 28x28 MNIST digit and penalizing each new action.
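Here’s a rough sketch of that simulator (I substitute a random binary image for a real MNIST digit so it runs standalone, and I take the view to be the 3x3 square centered on the bot rather than in front of it, just to keep it short): the bot emits an action word and the environment answers with one of the 512 percept words.

```python
import numpy as np

rng = np.random.default_rng(0)
digit = (rng.random((28, 28)) > 0.8).astype(int)   # stand-in for a binarized MNIST digit

ACTIONS = ["left", "right", "back", "forward"]
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]      # up, right, down, left

def percept(pos, grid):
    """Return the 'word' (0..511) encoding the 3x3 square around the bot."""
    r, c = pos
    patch = np.zeros((3, 3), dtype=int)
    for dr in range(-1, 2):
        for dc in range(-1, 2):
            rr, cc = r + dr, c + dc
            if 0 <= rr < 28 and 0 <= cc < 28:
                patch[dr + 1, dc + 1] = grid[rr, cc]
    return int("".join(map(str, patch.flatten())), 2)

def rollout(steps=10):
    pos, heading = [14, 14], 1                     # start in the middle, facing right
    words = []
    for _ in range(steps):
        act = rng.choice(ACTIONS)
        if act == "left":    heading = (heading - 1) % 4
        elif act == "right": heading = (heading + 1) % 4
        elif act == "back":  heading = (heading + 2) % 4
        else:                                      # forward, clipped at the border
            dr, dc = HEADINGS[heading]
            pos = [min(27, max(0, pos[0] + dr)), min(27, max(0, pos[1] + dc))]
        words += [act, f"see{percept(pos, digit)}"]
    return " ".join(words)

print(rollout())   # e.g. "forward see32 left see32 forward see100 ..."
```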
That’s basically the same as training the model on synthetic tasks.
Either way, you’re embedding priors through data. It’s definitely something that’s studied on a small scale, but you can try really expanding the dataset and researching other improvements over this idea.
Yeah, I think what I really want is to capture the essence of what makes natural language hard in a very simple synthetic dataset with a small vocabulary.
I like this idea because it allows for rapid prototyping of new architectures.
Very interesting paper. Very strange conceptual process… asking an AI to effectively teach (provide the material) and then grade an AI.
From the results in Fig. 4, it would have been interesting to see whether the incremental gains continued for the model with a hidden size of 512 at 16 and 20 layers, particularly for grammar and consistency.
Figure 9… the 21M-parameter, 1-layer model…
“Can cows fly?”, Alice asked her mother.
“Yes!”, her mother replied. Alice and her mother went to the barn.
Then the effects of the LSD wore off…
I think training the model on synthetic tasks makes sense. We could actually look for ‘reversible’ facts in training data, reverse them, and train on the reverse too. For example, “Dave and John are programmers.” could also imply: “Dave is a programmer.” “John is a programmer.” or “If you see a programmer, it may be Dave.” / “If you see a programmer, it may be John.” This would perhaps have the effect of generalizing on categorization tasks, and the model would, in effect, ‘see through’ the concept from both angles. Obviously not everything works that way. But there is a human labelling the inputs at some point, and we already have a certain amount of science done here, so MAYBE we could use some of it to determine, at training time, which labels we can get for free by reversing or re-wording the things we are saying, and then train on ALL of it.

It’s kind of what we do as humans. We learn about a concept; we don’t just think of it ‘forward’ and then play it out of order in our brains later. We are exposed to it from all sides in many interactions throughout our lives. So, essentially, humans in real life ARE training on the forward and backward scenarios. To reproduce this behavior we could most likely use a rather stupid language processor to prepare our training data. It doesn’t decrease the effort required to get it going, but it may eventually reduce the amount of human effort once we have a good baseline set.

I think the existing LLMs would be able to deal with it, because all we would effectively be doing is feeding in a ton more training data. (To a casual observer it would seem to be just a restatement of existing training data; my hypothesis is that this would effectively cause the model to generalize, for lack of a better term for what you call that, at the cost of more compute during training.) I know it’s kind of disappointing that it costs so much to get this effect using a method like this, but you really only have to do it so many times before you can automate it. Perhaps I’m in left field, but I think we humans do a lot more brute-force training to learn the things we know than most people realize.
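A crude illustration of what that preprocessing could look like (the regex and templates here are entirely made up, and a real pipeline would need an actual parser or a small helper model): take a statement matching a known ‘reversible’ pattern and emit all the implied restatements as extra training examples.

```python
import re

def expand_fact(sentence):
    """Expand 'X and Y are Zs.' into the forward and reversed restatements.
    Purely a toy pattern; real data would need actual parsing or a helper model."""
    m = re.match(r"(\w+) and (\w+) are (\w+?)s\.", sentence)
    if not m:
        return [sentence]          # no known reversible pattern, keep as-is
    x, y, z = m.groups()
    return [
        sentence,
        f"{x} is a {z}.",
        f"{y} is a {z}.",
        f"If you see a {z}, it may be {x}.",
        f"If you see a {z}, it may be {y}.",
    ]

for line in expand_fact("Dave and John are programmers."):
    print(line)
```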
I hope you are a creative short story writer and a liar (this time), or a plagiarist; or, if you aren’t, that ChatGPT is programmed and trained to be so scarily efficient a plagiarist that it ought to be prohibited from competing with people as a short story writer.