Can someone explain how the fMRI data is integrated into the text data embeddings of ChatGPT, please?
"The work relies in part on a transformer model, similar to the ones that power Open AI’s ChatGPT and Google’s Bard. "
Transformers do not have to be limited to text.
Thanks. But I am very ignorant of how one describes the embeddings of images or fMRI data so that they relate to the text embeddings that can be obtained, for example, from the last-but-one stage of the trained neural net once all the words (or phonemes) have been translated to numeric values. I don’t understand how to put a value on the word “chat” versus some image, unless it is just the image’s associated text that gets mapped to a number (really an embedding). Sorry if this is below the level of the rest of the readers.
The actual paper is paywalled, so one can only speculate about details such as how they produced the token encodings.
A reading from fMRI, pixels, or sound waveforms can be encoded as groups of numbers: tokens.
Think of it like an encoder for SDRs.
The size of the tokens should be selected to sample meaningful clusters of data.
These groups of numbers work a lot like the tokens generated by text encoding, but instead of letters they are groups of numbers in whatever problem space you are encoding. These arrangements can be 2D, as in pictures.
All the transformer is looking for is relationships/arrangements between the tokens.
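Here is a rough Python sketch of what I mean by “tokens” for fMRI. All the shapes and the random projection are invented for illustration; none of this comes from the paper.

```python
import numpy as np

# Toy example: turn one fMRI scan into a sequence of "tokens".
# All shapes and the random projection are invented for illustration.
rng = np.random.default_rng(0)
n_timepoints, n_voxels = 120, 5000              # e.g. 120 TRs, 5000 voxels
scan = rng.standard_normal((n_timepoints, n_voxels))

# Each time step's voxel pattern is one "token": a group of numbers,
# analogous to a word embedding in the text case.
d_model = 768                                   # width a transformer expects
projection = rng.standard_normal((n_voxels, d_model)) / np.sqrt(n_voxels)
tokens = scan @ projection                      # shape: (120, 768)

print(tokens.shape)  # a sequence of 120 tokens a transformer could attend over
```

The point is only that the transformer sees a sequence of fixed-width vectors; whether they came from words, pixels, or voxels does not matter to it.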
This is not a great model, as it does not fit the arrangement part well; it is just how I think of it until I can develop a better version. In my mental model, the prompt is turned into tokens and the answer is an extension of those tokens. I like to think of this as a very abstract version of Bayesian prediction: the tokens are the prior, and the output is like the posterior probability distribution. As the session continues, the entire exchange of prompts and answers extends the population of priors.
Exactly. I confess I haven’t had a chance to read the paper yet, but it is popping up all over the place on the “science journalism” sites.
One theory is that language is shaped both by embodiment and by metaphor. How the brain weaves what it senses into syntactic, recursive language is unknown, but this work is a very important step towards that understanding. Once a machine is able to do that, we will be at the so-called singularity (Kurzweil).
For those who wish to play (or understand better when there is code!):
…and then there is this.
The pre-print paper is available here:
From the paper (pg 22 Encoding Methods):
“A stimulus matrix was constructed from the training stories. For each word-time pair (s_i, t_i) in each story, we provided the word sequence (s_{i-k}, s_{i-k+1}, . . . , s_{i-1}, s_i) to the GPT language model and extracted semantic features of s_i from the ninth layer. This yields a new list of vector-time pairs (M_i, t_i) where M_i is a 768-dimensional semantic embedding for s_i …"
It goes on to many other things, but AFAIK, GPT-2 was used to generate semantic embeddings from the words in each time interval, and the voxel maps (pre-located per person and heavily cleaned/purged) were then also embedded; a regression model was used to find the correspondence, retrospectively.
In essence, they needed to compare roughly 10 seconds of active voxels with roughly 10 seconds of word embeddings (a mid-stack layer vector, often used for semantic matching), as generated by their own trained GPT-2.
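For anyone who wants to poke at this, here is a minimal sketch of pulling a ninth-layer embedding out of GPT-2 with the Hugging Face transformers library. The authors trained their own GPT on story transcripts, so the stock “gpt2” checkpoint and the example sentence below are just stand-ins.

```python
# Minimal sketch: extract a ninth-layer embedding from GPT-2.
# The stock "gpt2" checkpoint and this sentence are stand-ins for the
# authors' own GPT trained on story transcripts.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

context = "i walked out onto the porch and the rain had finally stopped"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states[0] is the embedding layer; hidden_states[9] is the output of
# the ninth transformer block: one 768-dimensional vector per token.
layer9 = outputs.hidden_states[9]          # shape: (1, n_tokens, 768)
last_word_embedding = layer9[0, -1]        # the M_i for the final word s_i
print(last_word_embedding.shape)           # torch.Size([768])
```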
I may have misunderstood, so others, please wade in with corrections.
Thanks, that is a great help in understanding the paper. Except I still don’t understand how the voxels were embedded, since they had not been related to the text details yet. Can you explain this detail? Sorry again if I am asking dumb questions out of ignorance of machine learning techniques. Rob
There is a lot of detail, but roughly: the text and the voxel activity are paired using pre-set words that are imagined (or heard) by the subject at a certain time. This is a known set of words. That gives a (known) set of words versus a (measured) set of active voxels per time slot, and they learn this correspondence per person during the training phase.
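If it helps, here is a toy sketch of that training-phase idea as a regularized regression from word embeddings to voxel activity. The random data, the sizes, and the single shared regularization strength are my simplifications, not the paper’s actual setup.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy training phase: fit a regularized linear map from word embeddings
# (one per time slot) to that subject's measured voxel activity.
# Random data and a single shared alpha are simplifications; the paper's
# actual regression setup is more involved than this.
rng = np.random.default_rng(1)
n_trs, n_features, n_voxels = 300, 768, 2000

X_train = rng.standard_normal((n_trs, n_features))   # GPT embeddings per TR
Y_train = rng.standard_normal((n_trs, n_voxels))     # active voxels per TR

encoding_model = Ridge(alpha=10.0)
encoding_model.fit(X_train, Y_train)

# Later, candidate word sequences can be scored by how well their predicted
# voxel activity matches what was actually recorded.
X_candidates = rng.standard_normal((10, n_features))
predicted_voxels = encoding_model.predict(X_candidates)   # (10, 2000)
print(predicted_voxels.shape)
```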
Does this help?