I had a thought experiment this morning about a hypothetical machine that could maybe be implemented with current technology.
This machine would use sentence embeddings, or maybe word triplets, as its main datatype on a stack.
The machine would be split between several systems, all of which operate on the stack:
The base knowledge system, a simple a->b associator, which I’ll call system 1.
The common sense system, which would be similar to system 1 except that it outputs two triplets from one input triplet; let’s call it system 2.
And the evaluator system, which could be seen as the inverse of system 2: it takes two triplets and outputs a single triplet; let’s call it system 3.
System 1 could be thought of as a shortcut for chaining systems 2 and 3 together by skipping steps, sort of like a permanent cache memory.
I haven’t worked out the details yet, but I have a hunch that such a machine could start with a certain amount of data in its systems and elaborate on it, producing more data and filling in the blanks.
Still, it’s just a hazy shower thought, but I believe it doesn’t hurt to share.
Well, trying to do meaningful stuff with sentence embeddings is my neck of the woods…
What you’re talking about would have the same advantage my old year-2000 system did, in terms of actually generating new “pairs” (actually much higher “stacks”) for each submitted triplet. This is in contrast to current tech, which will just map it to a learned “stack”.
Or maybe not. Maybe you would implement this based on mapping to a learned “stack” too. In which case what you are suggesting would differ from current tech mostly by explicitly identifying the internal “stacks” it maps to.
Exactly on what basis do you imagine going from pairs to singles, and singles to pairs? What would be the meaning of the mapping? Based on embeddings?
You could say my old 2000 system was a system for going from singles to multiples in exactly this way. It expanded single sentences to many triplets (and pairs, and single words) on the basis of shared embeddings.
What was interesting was that these generated “stacks” not only had similar meaning to the submitted string, they also generated a hierarchical structure for it (because it turns out there’s a preferred order for sub-embeddings.)
Maybe I should spin up my old online demo so you can play with that.
I think I made that a little vague, but to be fair it’s as vague as my current thoughts.
The basis would be that if we could store embedding-to-embedding mappings in a neural net, maybe we could use it to guess other mappings, even ones not explicitly stored in it, and then update the network with this information so that it generalizes better.
It would go kinda like this.
First, we have a stack:
(“is cat quadruped”)
System 1 has no answer to that specific question, but it still has something related, so it just spits out whatever it has.
(“is cat quadruped”, “cat is mammal”)
System 2 detects an inconsistency and elaborates a new question.
(“is cat quadruped”, “cat is mammal”, “is mammal quadruped”)
System 1 runs again, but this time it has an answer.
(“is cat quadruped”, “cat is mammal”, “is mammal quadruped”, “mammal is quadruped”)
System 3 detects a consistency, so it collapses one level of the stack.
(“is cat quadruped”, “cat is mammal”, “mammal is quadruped”)
Then it collapses again.
(“is cat quadruped”, “cat is quadruped”)
Now the answer is stored directly in system 1, so it can answer right away next time.
I’m pretty sure one could implement that in Python using nothing but dictionaries.
But I still wonder what could happen if we were to implement this with neural nets and kept training them on their own generated answers.
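As a rough sketch of the dictionary-only version (the triplet wording, the helper names, and the toy facts below are just illustrative assumptions, not a worked-out design), it might look something like this:

```python
# Minimal sketch of the three-system loop over triplet strings.
# The "X is Y" phrasing and the toy facts are made up for illustration.

# System 1: direct question -> answer associations (the "permanent cache").
system1 = {
    "is cat mammal": "cat is mammal",
    "is mammal quadruped": "mammal is quadruped",
}

def spit_out_related(subject):
    """System 1 has no direct answer, so it spits out any stored fact
    that at least mentions the subject."""
    for fact in system1.values():
        if fact.startswith(subject + " is "):
            return fact
    return None

def system2(question, fact):
    """One triplet in, two out: elaborate the missing intermediate question,
    e.g. ("is cat quadruped", "cat is mammal") -> "is mammal quadruped"."""
    predicate = question.split()[-1]   # "quadruped"
    middle = fact.split()[-1]          # "mammal"
    return f"is {middle} {predicate}"

def system3(fact_a, fact_b):
    """Two triplets in, one out: collapse "cat is mammal" +
    "mammal is quadruped" into "cat is quadruped"."""
    return f"{fact_a.split()[0]} is {fact_b.split()[-1]}"

def answer(question):
    if question in system1:                    # system 1 answers directly
        return system1[question]
    subject = question.split()[1]              # "cat"
    related = spit_out_related(subject)        # "cat is mammal"
    sub_question = system2(question, related)  # "is mammal quadruped"
    sub_answer = system1[sub_question]         # "mammal is quadruped"
    new_fact = system3(related, sub_answer)    # collapse -> "cat is quadruped"
    system1[question] = new_fact               # store it, as in the last step above
    return new_fact

print(answer("is cat quadruped"))  # derived by chaining systems 2 and 3
print(answer("is cat quadruped"))  # answered directly by system 1 this time
```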
Well, same thing again. I think there’s a parallel with what I did. Only you are evaluating “truth” values, whereas I evaluated sequence predictions.
In my system, to try to build a parallel with yours, you might have a question whether AX is a valid sequence in the language:
(AX?)
The system doesn’t know, so it “spits out” what it does know, which is that AB is an observed sequence.
(AX?, AB)
The system detects there’s no connection between X and B, so it “asks” if one can be found, and finds it in observed sequences (CX, CB):
(AX?, AB, CX, CB)
The system detects a link between X and B, in the sense that they share the context C, so it “collapses” X and B into a class which shares contexts, and allows AX on the basis that AB is observed and X and B are observed to share context C.
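To pin that down with a toy sketch (the observed pairs and the function names below are invented just for illustration):

```python
# Sketch: accept a novel pair AX if some observed AB exists where B and X
# are seen to share a context (here, C). Toy data, for illustration only.

observed = {("A", "B"), ("C", "X"), ("C", "B")}

def shared_contexts(x, b, pairs):
    """Left-contexts c such that both (c, x) and (c, b) were observed."""
    return ({c for (c, y) in pairs if y == x}
            & {c for (c, y) in pairs if y == b})

def support_for(a, x, pairs):
    """Observed pairs (a, b) whose b shares at least one context with x."""
    return {b: shared_contexts(x, b, pairs)
            for (a2, b) in pairs
            if a2 == a and shared_contexts(x, b, pairs)}

print(support_for("A", "X", observed))
# {'B': {'C'}}  ->  AX is allowed because AB is observed,
#                   and X and B share the context C
```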
I’m doing sequences, and you’re doing “truth” values, but I think it comes to the same thing. You’re chaining associations to build new sets.
“What happens” I think is that you’ll find you get contradictions between those sets of associations, and you’ll need to base the “collapse” mechanism on a balance of associations, not just one association, like “cat is a mammal”.
The utility depends on the utility of your association principle. But I do think there is value in chaining associations in new ways, because I think you DO get contradictions. And building sets of associations on the fly is the only way to capture contradictions, and to resolve them, by picking which of several contradictory sets of associations is relevant/“true” in a given situation.
What is crucial is the realization that the sets of associations you build will have contradictions. If you didn’t get contradictions, then you wouldn’t bother chaining associations at run-time, you’d just “learn” all the sets beforehand.
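A crude way to picture that balance, with made-up facts and counts rather than anything principled, would be to collapse only when the supporting associations outweigh the contradicting ones:

```python
# Toy "balance of associations" collapse rule: the candidate fact and its
# supporting/contradicting associations are invented for illustration.

evidence = {
    "cat is quadruped": {
        "for": ["cat is mammal", "mammal is (typically) quadruped"],
        "against": ["some mammals are bipedal"],
    },
}

def should_collapse(candidate):
    """Collapse only if supporting associations outnumber contradicting ones."""
    e = evidence.get(candidate, {"for": [], "against": []})
    return len(e["for"]) > len(e["against"])

print(should_collapse("cat is quadruped"))  # True: 2 supporting vs 1 contradicting
```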
BTW, I think what transformers do (just to hammer back on my continual theme), is they “learn” as many such contradictory association sets as their compute budget will allow, and select the appropriate one using prompts at run time. But they hide the fact of contradictory sets in the black box of their network, so we don’t realize they are working as well as they do because they are capturing contradictory sets.
If you do get contradictory sets, the more efficient way to do it would be to accept you can get contradictory sets, and only build the ones you need at run time, through an association chaining mechanism, much as you describe.
The burden of proof for the utility of a real-time association-chaining mechanism such as you describe would be whether the sets of associations you can build result in contradictory sets.
Maybe you have developed some primitive ops / data structures to automate this?
I’m aware that as a human myself, “at run time”, I’ll simulate multiple “theories-in-question” against a few critical “scenarios” from memory+trusted-theories, and see how well each “theory-in-question” “explains” the observed “facts” during those simulated processes.
I have never seen data structures capable enough to represent the “associations” in meaningful ways, so as to fully automate my reasoning process. Even the “well-explaining” part is quite problem-specific, with details crying out for proper quantitative modeling by human labor. I wonder if there can be some primitive data structures + algorithms to substitute machine labor for human labor, for most parts if not all.
Ops and data structures to automate this? Yes, absolutely; for my task of predicting the next word in a sequence, I automated this. Initially I automated it using actual “stacks” of word sequences. For new sequences of the type AX above, I would take observed sequences like AB and check whether there were any shared sequences CX, CB to support AX.
As I suggest above, just one supporting association was not enough, the evidence needed to be evaluated as a set.
The nicest thing was that for longer sequences like AXY you could compare the size of the supporting set for AX and XY, and generate a hierarchical structure ((AX)Y) or (A(XY)).
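Something like this, with invented support counts standing in for the real supporting sets:

```python
# Sketch: bracket a three-element sequence by which adjacent pair has the
# larger supporting set. The support sizes here are invented toy numbers.

toy_support = {("A", "X"): 12, ("X", "Y"): 3}

def bracket(a, x, y, support):
    """Group the more strongly supported pair first."""
    if support.get((a, x), 0) >= support.get((x, y), 0):
        return f"(({a} {x}) {y})"
    return f"({a} ({x} {y}))"

print(bracket("A", "X", "Y", toy_support))  # ((A X) Y): AX has the bigger set
```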
But I later realized that explicitly listing “stacks” was both overkill and lossy. The lossy bit is the important part. I didn’t think of it for some time, but you can’t squeeze all the information in a network into a set or vector and not lose information. Obvious. The conclusion I came to after quite a long time was that the best way to do it would be to leave all the associations in a network, and just evaluate the size of different “sets” of associations by… clustering oscillations, as described in the Brazilian paper.
Easy, but it often takes a long time to come to the simplicity necessary to make something easy.
I would suggest that the way your brain is doing that, is that it is assessing the supporting associations of different such “sets” of associations. And the way it is assessing those supporting associations… is by setting your network of associations oscillating, and seeing what sets of supporting associations are densest, causing those sets to synchronize their oscillations.
Well, for data structures capable of representing “associations” in meaningful ways, I actually think you can both get beyond the need for human labour, and create much more effective associations, by in a sense inverting what I did with word sequences. So my problem might be in many senses the inverse of @JarvisGoBrr 's problem. You see, just as sets of word associations (like CB, CX, etc.) provide evidence for new sequences, you can also think of them as providing evidence for things which can be substituted for each other. Which is a good definition for new “concepts” (= sets of things which can be substituted for each other.) So, you get evidence for what sequences an element can participate in by holding elements constant and comparing sets of their observed sequences. But you can also get evidence for what elements can be substituted for each other, by holding sequences constant and comparing sets of elements that can be substituted!
Those sets of things that can be substituted for each other might function very well as “concepts”. And the sequences they participate in, as a set, might function very well as “knowledge” about them.
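As a rough sketch of that inversion (the (context, element) pairs below are toy data, made up for illustration): hold contexts constant and see which elements they admit.

```python
from collections import defaultdict
from itertools import combinations

# Toy observed (context, element) pairs, invented for illustration.
pairs = [("C", "X"), ("C", "B"), ("A", "B"), ("D", "X"), ("D", "B")]

# For each element, collect the contexts it has been observed in.
contexts = defaultdict(set)
for c, e in pairs:
    contexts[e].add(c)

# Elements that share contexts are candidates for the same "concept";
# the size of the shared set is the strength of the evidence.
for e1, e2 in combinations(contexts, 2):
    shared = contexts[e1] & contexts[e2]
    if shared:
        print(f"{e1} ~ {e2}  (shared contexts: {sorted(shared)})")
# X ~ B  (shared contexts: ['C', 'D'])
```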
I would guess a lot of this is indeed what is happening implicitly in transformers. But we can’t see it, because as I keep insisting, these sets turn out to contradict, so the transformers “learn” enormously complex webs of contradictory sets, totally opaque to look at as a whole, and only picked apart by the right prompt at run time.
The people building transformers don’t see this, because they are not expecting contradictory sets.
Working blindly, they stumbled on something that worked by ignoring the sets completely, and dealing only with “transformations”: the inputs and outputs :-b
If the sets didn’t contradict, none of this would be necessary. Any learning mechanism would soon narrow in on a nice clean hierarchy. But it doesn’t.