Can Transformers generate a story backwards, from the conclusion?

Sorry, I haven’t followed up on a few comments (I blame my nomadic lifestyle); I still plan to answer them.
The question remains: given a prompt “… and they lived happily ever after!”, can a GPT-like model generate the text backwards, one word before the other? I’m guessing it cannot, but maybe I’m mistaken?


If they are not trained for it, they are going to have a hard time, but that’s true for virtually every system, and I don’t think it’s very useful for a system to be able to do this naturally.

It’s the same for humans: we can’t speak backwards without a lot of practice.


Meaning a definite “no”, correct? They collect probabilities of the next token, so backward generation should be impossible by design… again, if I understand it correctly.
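A caveat worth noting: backward generation is indeed impossible for an already-trained forward model, but the same next-token machinery works right-to-left if you simply train on reversed sequences. A toy sketch, with a bigram Markov chain standing in for the Transformer (all names here are illustrative):

```python
import random
from collections import defaultdict

def train_bigrams(tokens):
    """Count next-token frequencies, exactly like a (tiny) causal LM."""
    table = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        table[a][b] += 1
    return table

def generate(table, start, n, seed=0):
    """Sample n continuation tokens, weighted by counts."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        nxt = table.get(out[-1])
        if not nxt:
            break
        choices, weights = zip(*nxt.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out

corpus = "they lived happily ever after".split()
# Train on the REVERSED corpus, generate "forward", then flip the result:
backward = train_bigrams(corpus[::-1])
story = generate(backward, "after", 4)[::-1]
print(story)
```

This is the standard trick for a “backwards” language model: reverse the training corpus, train as usual, then reverse the generated output.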

Then I’ve heard about two distinct tasks: restoring a token from its context, as opposed to restoring the context from a token (or tokens). Please remind me whether the same model can solve both.

Speaking backwards does not seem very useful, but reasoning about the possible causes of an outcome could be quite instrumental. “Murder, She Wrote” or something.

Looks like the architecture I’m working on is “generationally omnidirectional”. Besides having fun, what would be a use case for that? (I’m not saying it understands - it’s the same stochastic machine, just omnidirectional.)


The Portia jumping spider’s front eyes are very sharp, but its retinas are actually two 1D vertical stripes that it swipes left and right like a barcode scanner.

Somehow they manage to have excellent sight even with this arrangement. If you think about it, they must be doing some kind of sequence processing, and they are able to do it in both directions.


They can be trained to complete any missing token from the context window, not only the missing first or last one.
But that’s of little use on its own; encoder-decoder models are used in (e.g.) translation, where the input is an English text and the output is its translation in Chinese.

How omnidirectional is that?


Thank you for the spider, good to know. I think I can model that kind of sight; I did something similar with MNIST. It could be that it’s been done already. Thanks anyway.


@cezar_t very omnidirectional - always here and always sharp :-). Thank you, you saved me from reinventing the wheel. Still, it could be a good teaser to see how ChatGPT makes a story from the end.


“I’ve been doing some thinking” (c) Tom & Jerry: “any token” means Transformers are good at capturing a sequence’s structure. But there are simpler ways of memorizing structure. Tell me where you lose me:

Bags of words are the worst; bags of overlapping sub-strings/n-grams are better; bags of sparse overlapping sub-sequences (skip-grams) are more economical because they omit noise - they are the best.

The full collection (a multiset: a set with frequencies/counts) grows exponentially with the size of the “skipTokens”, but that growth is controlled by removing irrelevant skipTokens (patterns) from the collection. A space with overlapping patterns as axes is an overcomplete basis: adding or deleting some patterns does not affect the ability to remember the structure. It is close to an SDR set, but with categorical values as opposed to binary ones.
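For concreteness, a minimal sketch of such a multiset of overlapping skip-grams with counts (function and parameter names are mine, purely illustrative):

```python
from collections import Counter
from itertools import combinations

def skipgram_multiset(tokens, n=2, window=3):
    """Collect every n-token sparse sub-sequence (skip-gram) whose
    span fits within `window` consecutive positions, with counts."""
    bag = Counter()
    for i in range(len(tokens)):
        span = tokens[i:i + window]
        # every way of picking n positions from the window, keeping order
        for combo in combinations(range(len(span)), n):
            if combo[0] == 0:  # anchor at window start to avoid duplicates
                bag[tuple(span[j] for j in combo)] += 1
    return bag

bag = skipgram_multiset("the cat sat on the mat".split())
# both the contiguous bigram and the one-gap skip-gram are stored:
print(bag[("the", "cat")], bag[("the", "sat")])
```

The overcompleteness shows up directly: the same positions contribute to several overlapping patterns, so dropping some entries still leaves the structure recoverable from the rest.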

Categorical pattern-SDRs have identities - they represent particular patterns. Any pattern can be associated with an arbitrary number of parameters representing frequencies, class associations, context associations, reward associations, …

So, an effective dictionary of overlapping patterns is enough to build stochastic engines as good as Transformers, but much simpler. (Lots of experiments.)
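As a reader’s guess at what “dictionary as stochastic engine” could mean in the simplest case (this is classic n-gram backoff, not necessarily the poster’s actual scheme; all names are hypothetical):

```python
import random
from collections import defaultdict

class BackoffDict:
    """Contexts -> next-token counts, sampled with longest-suffix-match
    backoff: try the longest stored context first, fall back to shorter."""
    def __init__(self, max_context=2):
        self.max_context = max_context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, tokens):
        for i in range(1, len(tokens)):
            # record tokens[i] under every context suffix of length 0..max
            for c in range(min(i, self.max_context) + 1):
                self.counts[tuple(tokens[i - c:i])][tokens[i]] += 1

    def sample_next(self, history, rng):
        # back off from the longest known context to the empty one
        for c in range(min(len(history), self.max_context), -1, -1):
            ctx = tuple(history[len(history) - c:])
            if ctx in self.counts:
                nxt = self.counts[ctx]
                toks, weights = zip(*sorted(nxt.items()))
                return rng.choices(toks, weights=weights)[0]

rng = random.Random(0)
d = BackoffDict(max_context=2)
d.train("a b a c a b a c a b".split())
out = ["a"]
for _ in range(5):
    out.append(d.sample_next(out, rng))
print(" ".join(out))
```

Whether such dictionaries can match Transformers at scale is exactly the open question of this thread; the sketch only shows that the generation mechanics are simple.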

And of course, those dictionaries could be represented/implemented as directed graphs - ANNs with many interesting properties.

But, basically: Transformers vs Dictionaries? What is your take?


Sequence-to-sequence tasks are bidirectional - you generate the entire sequence at once.

You can train a model to predict a token at any arbitrary position - but you mathematically can’t use a pre-trained LM to predict the first token due to the causal mask.
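The causal-mask point can be made concrete with a minimal sketch (pure Python, uniform attention scores just for illustration): position 0 has nothing to its left, so the mask leaves it attending only to itself.

```python
import math

T = 5  # sequence length

def causal_attention_row(i, scores):
    """Softmax over positions 0..i only; positions after i are
    masked out, which is exactly what the causal mask enforces."""
    visible = scores[:i + 1]
    exps = [math.exp(s) for s in visible]
    z = sum(exps)
    return [e / z for e in exps] + [0.0] * (T - i - 1)

rows = [causal_attention_row(i, [0.0] * T) for i in range(T)]
print(rows[0])  # token 0 attends only to itself: [1.0, 0.0, 0.0, 0.0, 0.0]
```

Since the first position can never condition on anything, a causally-trained LM has no mechanism for “predict the token before this one”.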

@Bullbash as a baseline, you could always try doing that with an off-the-shelf pretrained BERT and see what you get out of it.

idk why you wanna do that tho


idk why I wanna do that. The whole point was to get wise people’s advice on that “Transformers vs Dictionaries” problem. I did state that dictionaries are a much simpler and more powerful solution to omnidirectional generation, other NLP tasks, and beyond. But I can live without that advice just fine :-).


No, they’re not? Are you implying that simple n-grams can outperform Transformers?


Umm, I don’t think dictionaries are all that powerful, unless you do some really out-of-the-box stuff.

From personal experience, if you wish to make dictionary-based generators, you’d better have very, very… very big RAM and a beefy CPU.

And the puny context size would still pose a problem; the furthest in the past I could get them to remember was 3 tokens.

Still, if you want omnidirectional dictionary-based generators, maybe have a look at the wavefunction collapse algorithm.
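For reference, the 1D analogue of wavefunction collapse is small enough to sketch. This toy version (my own simplification, not the full algorithm) learns adjacency constraints from a sample, then repeatedly fixes the most constrained cell and propagates - which is indeed direction-agnostic:

```python
import random
from collections import defaultdict

def learn_adjacency(sample):
    """Record which token may sit to the right/left of which."""
    right, left = defaultdict(set), defaultdict(set)
    for a, b in zip(sample, sample[1:]):
        right[a].add(b)
        left[b].add(a)
    return right, left

def collapse(length, sample, seed=0):
    """Tiny 1D wavefunction-collapse-style generator: every cell starts
    with all tokens possible; repeatedly fix the most constrained
    undecided cell and propagate adjacency constraints both ways."""
    rng = random.Random(seed)
    right, left = learn_adjacency(sample)
    cells = [set(sample) for _ in range(length)]

    def propagate():
        changed = True
        while changed:
            changed = False
            for i in range(length - 1):
                # prune the right neighbour to tokens reachable from cell i
                allowed = set().union(*(right[t] for t in cells[i]))
                if not cells[i + 1] <= allowed:
                    cells[i + 1] &= allowed
                    changed = True
                # and prune cell i against its right neighbour
                allowed = set().union(*(left[t] for t in cells[i + 1]))
                if not cells[i] <= allowed:
                    cells[i] &= allowed
                    changed = True

    while any(len(c) > 1 for c in cells):
        i = min((j for j in range(length) if len(cells[j]) > 1),
                key=lambda j: len(cells[j]))
        cells[i] = {rng.choice(sorted(cells[i]))}
        propagate()
    return [next(iter(c)) for c in cells]

print("".join(collapse(8, "ababab")))
```

Note the constraints propagate leftward as well as rightward, so seeding the last cell instead of the first works just as well - the “omnidirectional” property the thread is after.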


Easy to check, but you would not dare - it breaks your universe. A “red pill or blue pill” dilemma, and I’m not nudging anybody in that direction. Peace.


I have tried building dict-based language models and failed miserably so many times that I was starting to suspect it would end up the same way as when I tried following those DIY infinite-energy tutorials as a kid.

If dictionaries really work, you are not going to convince anyone by making them sound like those tutorials; please show a working example, or at least minimal pseudocode.

Using cryptic descriptions that make it sound like you know more than you wish to tell actually makes you sound like those flat-earthers instead.


I have that generator, and it is not a unique feature - @cezar_t cleared that up. Let’s forget about it.

From my old LI post (mind: zero post-processing - just raw stochastic parroting):

Pseudo-Shakespeare generation.
Single “epoch” - it is actually trained continuously; it could stop at any time.
Took 15 minutes to train, 10 seconds to generate, 4 GB of RAM, no GPU, ~5 million nodes (patterns, grams).
No CNN, no optimization, no gradients, just naked synapses. Enjoy.
Compare to the official TensorFlow example below.

Dost grant me, hedgehog? then, God grants that we have no staff, no stay.
O Clifford, devise excuses for thy faults.

Yet let us all to death:
That dog, that had his teeth before his
death: you know our king, is dead.

Return unto thy lord; commend me to the pedlar;
Money’s a medler.
That doth utter all men’s ware-a.

Marry, so I mean, sweet Kate; or else shall I?
What’s this? mutton?

First Lady:
Come, my captain knew I were here, he would
When first I did embrace him.

I’ve got 18 on an old 24 GB PC. Hierarchical dictionaries.


18 tokens is certainly better. Still not nearly enough; we’d need it to remember at least a thousand tokens to get close to the state of the art.

But hierarchical dictionaries? How do they work - are they running on different timescales, or is it more like max pooling in convnets?


Ok, so you mean something like this.

I still have no idea how you generate new examples from it though.

WARNING: if you are going to run this code, beware of RAM usage;
it might crash your system if you use a file that is too big.
I have not done any kind of optimization.

from collections import defaultdict

class SkipGramStorage:
    def __init__(self, size=2, skip=1):
        self.ngram_freqs = defaultdict(int)
        self.seen_ngrams = dict()
        self.next_index = 0
        self.size = size
        self.skip = skip
        self.offsets = [i * skip for i in range(size)]

    def process(self, sequence):
        output = []
        # iterate over every window that fits in the sequence
        last_offset = (self.size - 1) * self.skip
        for i in range(len(sequence) - last_offset):
            ngram_key = tuple(sequence[i + x] for x in self.offsets)
            # assign an integer index to ngrams not seen yet
            if ngram_key not in self.seen_ngrams:
                self.seen_ngrams[ngram_key] = self.next_index  # to save memory
                self.next_index += 1
            # update occurrence frequency
            self.ngram_freqs[ngram_key] += 1
            # replace the ngram tuple with its integer index
            output.append(self.seen_ngrams[ngram_key])
        # output is a sequence the next layer can process like any other
        return output

class SkipGramStack:
    def __init__(self, size=2, skip=2, height=5):
        self.layers = [SkipGramStorage(size, skip) for _ in range(height)]

    def process(self, sequence):
        stack_results = [sequence]
        for layer in self.layers:
            stack_results.append(layer.process(stack_results[-1]))
        return stack_results

if __name__ == '__main__':

    input_file = 'news.txt'  # put your input data here (beware of RAM usage; use a small one)

    stack = SkipGramStack(2, 2, 5)
    with open(input_file, 'r') as f:
        data = f.read().split(' ')
    stack.process(data)

    # print the most common level-1 2-grams
    freqs = stack.layers[0].ngram_freqs
    print(sorted(freqs, key=freqs.get, reverse=True)[:300])

Honestly @Bullbash, you’re quite cryptic. Assuming you want your ideas/code to spread, a starting point would be to find someone able to spend the time to understand them well enough to eventually translate them into Python, which would (hopefully) expand the ‘audience’ among ML-oriented people exponentially. Yet for that to happen, your effort to give feedback is required too, since nobody signals that they actually understand what you talk about. An ML/Python person versed in Java too, and willing to spend their expensive expertise on figuring out what your code is about, is pretty unlikely to be hooked by a couple of threads in a low-impact forum like this one.

It is cool (and potentially impressive) that it can recite Shakespeare in reverse, yet when ML people can’t get a good readout of either the core ideas or the example code, they will be tempted to just pass by towards the next cooler new thing in AI (of which there are plenty).

Now, strictly on your last message above: it clicks some “inner” links. You might want to check out Dileep George and his concept of mapping cognitive spaces with sequences.

Like other people related to Numenta, he shares the opinion that DL/Transformers are not the appropriate tool for learning sequences or intelligence in general.


Ok hang on.

I think I get the dumping-to-HD part; that’s to clear the RAM and remove duplicated substrings that take up unnecessary space, so I’m assuming it doesn’t have much to do with generation.

First, what do you mean by “image sorted order”?

If I got it right, you look for skip-grams that match the current prompt, then look for partial matches on the layer above and generate a kind of feedback pooling where partial matches “vote” for tokens on the layer below. Even if an exact match is not found, you can still generate something if you find several overlapping partial matches.

But directly accessing partial matches using dictionaries is impossible. I assume you are not looping over a full 8 GB of data for every word you want to generate, and storing all possible partial matches would take an even more ridiculous amount of memory. So you need to have some kind of approximate lookup system in there as well, right?
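On the lookup question, a standard way to avoid both a full scan and materializing all partial matches is an inverted index from (slot, token) to pattern ids; partial matches then come from merging small posting lists. A sketch of this guess (all names hypothetical, not the poster’s actual system):

```python
from collections import defaultdict, Counter

class PatternIndex:
    """Inverted index: (slot, token) -> pattern ids, so partially
    matching patterns are found without scanning the whole store."""
    def __init__(self):
        self.patterns = []
        self.postings = defaultdict(set)

    def add(self, pattern):
        pid = len(self.patterns)
        self.patterns.append(pattern)
        for slot, tok in enumerate(pattern):
            self.postings[(slot, tok)].add(pid)
        return pid

    def partial_matches(self, probe, min_hits=2):
        """Vote: every (slot, token) agreement counts as one hit."""
        votes = Counter()
        for slot, tok in enumerate(probe):
            if tok is not None:  # None = wildcard slot
                for pid in self.postings[(slot, tok)]:
                    votes[pid] += 1
        return [self.patterns[pid] for pid, v in votes.items() if v >= min_hits]

idx = PatternIndex()
idx.add(("the", "cat", "sat"))
idx.add(("the", "dog", "sat"))
idx.add(("a", "dog", "ran"))
print(idx.partial_matches(("the", None, "sat")))
```

Lookup cost is proportional to the posting lists touched, not to the total number of stored patterns, which is why this scales better than a per-word scan over gigabytes.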

Man, sorry to say it, but either you want to share a discovery or you don’t.

I recognize a fellow chuunibyou when I see one; we never grow out of it, but we have to learn how to actually sound cool, or at least not sound like we are pretending.

It’s very annoying when people partially explain stuff and then keep saying “eh? you have trouble with only this level?” - it only works well in FUNA’s webnovels.