This is an excellent and perceptive question. Because of how humans like to break problems down and recombine them in disciplines such as engineering or physics, we are used to seeing hierarchy as something that combines and condenses information as you ascend the levels. This leads to a central command and control node somewhere towards the top of the logical structure. Much of the classical literature assumes that this is what is going on in the brain and presents it as fact without any actual support from the known wiring in the brain.
When we try to apply this model to the brain we can see that there are layers of processing, but the wiring just does not support the concept of the information merging into some central node - it seems to stay mostly in a “parallel” format as it courses from area to area in the brain. I struggled with this for the longest time.
It is hard to grasp, but it seems that recognition is distributed as a cooperative effort using short-range lateral connections within each area of the brain. This allows each individual column to recognize its possible bit of the overall picture and vote with its neighbors on which of many possible larger-scale things it may be part of. All the computations are local but they build to a global picture.
With sequential recognition the bit that is being voted on is also the transition between this current pattern and the next pattern. This adds temporal recognition to the spatial recognition.
In the example you provide - the eyes move around and keep placing a small group of letters in the center of the visual field. There, one macro-column recognizes a letter and the next macro-column recognizes a different letter. They are forming a guess that this is part of a pattern (in this case, a word) that they have learned, so these two columns are voting on a two-letter digraph. The second and third macro-columns are likewise voting on a different two-letter digraph. This process is happening over the entire visual field at the same time. None of the macro-columns know they are part of a particular word or phrase, just the little bit they can see. Jeff Hawkins describes this as looking at the world through a straw. The larger local group of macro-columns rapidly settles on some pattern that we might consider a representation of a word or phrase.
I am saying two-letter digraph for this explanation but of course it is all the surrounding mini-columns at the same time.
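To make the digraph voting concrete, here is a minimal toy sketch in Python. It is only an illustration under my own assumptions, not Numenta's code or a model of real cortex: each "column" sees a single letter, each adjacent pair votes for every known word that has their digraph at the same position, and the function name `digraph_votes` and the tiny vocabulary are made up for the example.

```python
from collections import Counter

def digraph_votes(letters, vocabulary):
    """Toy local voting: each adjacent pair of columns votes for every
    known word containing their two-letter digraph at the same position.
    (Hypothetical helper; the name and data layout are assumptions.)"""
    votes = Counter()
    for i in range(len(letters) - 1):
        # This pair of neighbors only knows its own two letters.
        digraph = letters[i] + letters[i + 1]
        for word in vocabulary:
            if word[i:i + 2] == digraph:
                votes[word] += 1
    return votes

vocabulary = ["cat", "car", "cot", "bat"]   # made-up learned patterns
votes = digraph_votes("cat", vocabulary)
winner, count = votes.most_common(1)[0]
print(winner, count)
```

No pair ever sees the whole word, yet the tally settles on the one word that every local pair agrees with.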
The transition/time element would chain a sequence of changing input patterns into a stable output constellation that stands for an object or word or phrase. This stable constellation pattern could persist for several eye or hand/finger movements building up to longer word groupings as you learn more patterns.
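A rough way to picture the chaining of transitions into a stable output is the toy sketch below. It is my own simplification, not an HTM implementation: every learned sequence is stored as a set of (previous, current) transitions, and each new transition narrows the set of sequences that are still consistent with what has been seen. The names `transitions` and `stable_output` are hypothetical, and the sketch ignores where in the sequence a transition occurs.

```python
def transitions(seq):
    """Set of (previous, current) pairs in a sequence (assumed encoding)."""
    return set(zip(seq, seq[1:]))

def stable_output(inputs, learned):
    """After each input transition, keep only the learned sequences that
    contain it. Returns the surviving candidate set after every step."""
    candidates = set(learned)
    history = []
    for prev, cur in zip(inputs, inputs[1:]):
        candidates = {name for name in candidates
                      if (prev, cur) in learned[name]}
        history.append(frozenset(candidates))
    return history

# Made-up learned sequences for illustration only.
learned = {w: transitions(w) for w in ["the", "then", "this"]}
history = stable_output("then", learned)
for step in history:
    print(sorted(step))
```

The input changes at every step, but the representation that stands for "then" is present in the output the whole time, which is the sense in which the output is stable.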
Keep in mind that each macro-column in maps after the primary sensing areas can center on a different mini-column that was the winner in recognizing the local pattern; in mammalian cortex this macro-column neighborhood is about 220 to 250 mini-columns. This means that the center of a macro-column is not in a fixed place - it depends on the pattern that is sensed and which mini-column was most certain that it recognized the pattern, and so won in voting with the neighbors. This is true for all macro-columns, so the output pattern is not fixed to a rigid location or grid. The output is the constellation of macro-columns that won in this competitive/cooperative process. Each global input state would result in a collection of local output patterns of bits. Matt Taylor has described this as a constellation of stars; I think that is an apt description. Every learned input pattern or sequence results in a different stable constellation.
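Here is a small sketch of the "floating center" idea, again only as a toy under stated assumptions: the mini-columns are laid out in one dimension, each has a made-up certainty score, and a mini-column wins if it is the most certain one within `radius` positions on either side. The function name `constellation`, the scores, and the 1-D layout are all my inventions for illustration.

```python
def constellation(certainty, radius):
    """Return the indices of mini-columns that win their neighborhood.
    A winner has a nonzero score equal to the local maximum; exact ties
    would all win in this simplified version."""
    winners = []
    for i, score in enumerate(certainty):
        lo = max(0, i - radius)
        hi = min(len(certainty), i + radius + 1)
        if score > 0 and score == max(certainty[lo:hi]):
            winners.append(i)
    return winners

# Two made-up input patterns over the same seven mini-columns.
pattern_a = [0.1, 0.9, 0.2, 0.0, 0.3, 0.8, 0.1]
pattern_b = [0.7, 0.1, 0.2, 0.9, 0.1, 0.2, 0.6]
stars_a = constellation(pattern_a, radius=2)
stars_b = constellation(pattern_b, radius=2)
print(stars_a, stars_b)
```

The same sheet of mini-columns produces winners at different positions for the two inputs, so the centers are set by the input rather than by a fixed grid.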
Note: there are variations on this plan: the primary sensing areas and output drivers are anchored to the body structures that they are attached to so they have fixed processing structures like the cortical columns documented in V1. Function is defined by connectivity.
This general process is happening at all levels so higher levels are perceiving and voting on the groupings formed by lower levels. Keep in mind that this is not strictly a pipeline as there are huge numbers of fiber tracts crossing up and down these hierarchies and between processing streams. This gives additional things for the local columns to perceive and vote on. Since you are perceiving your entire environment at the same time these perceptions are likely to be different aspects of the thing you are perceiving. Some aspects could be sensor-somatic positions of the body or eyes, and sensations from the skin or retina.
As you go up the hierarchy you learn space/time sequences, then sequences of sequences with mixing from other areas, and so on, until you reach the association areas. The representations in the association areas are the fusion of all the sensory processing streams: sequences of sequences forming the most stable object representation in time, still in a distributed form.
Numenta is working on explaining how this might work using the Thousand Brains model. There are many posts on this topic in this forum. Here are some examples:
I am personally pursuing a slightly different take on this problem. It is almost the same as the TBT but differs in that I think the information organizes into a unique internal structure I have been calling hex-grids.
As I said, there is considerable overlap in the basic concepts of hex-grids and the Thousand Brains model.
This will probably blow your mind but - how does this information come together to make decisions and initiate actions?
I maintain that the cortex contents are shared with the subcortical structures (the lizard brain inside us all) and that these older parts (in an evolutionary sense) decide on and direct actions through projections to the forebrain. I suppose that if you could say there is an executive spot in the brain, it would turn out to be the thalamic nucleus. I think that modeling this area will end up being the part that finally makes AI react in a way that we consider having “intelligence.”
Image credit: Madhero88 (own work by uploader)