I have been experimenting with a feed-forward learning model based on what I think the hippocampus is doing in terms of dealing with language and interacting with the cortex. The early stages use a method of hierarchy and chain formation similar to HVC chains (see "Building a state space for song learning" on YouTube). I think it fits with resonant theory as to how structures are learnt.
It's very much a work in progress and does not follow any existing formal language rules, so I'm a bit cautious about being the village idiot at the moment. It's quite interesting seeing how different changes impact the way the model evolves in real time. The method/rationale, I think, is closer to how the biology deals with language than to our retrospective cortical (over)thinking about how language is structured. It is based around the rule of 7 (+/- X) and short-term memory constraints that determine the structure irrespective of content (language use and variability), which may then explain why some languages have some parts in reverse order: parts of the sequence orientation do not matter in relation to how the cortex deals with them. I think the rule of 7 has a sort of second layer, which is what we class as a working memory of sorts, but it's then mixed with activation decay (priming effects) in the cortex. But that's a further leap from an already big conjecture around hierarchy formation from the ground up.
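To make the rule-of-7 plus decay idea concrete, here is a minimal sketch of a fixed-capacity short-term buffer with exponential activation decay standing in for priming effects. All names here (`ShortTermBuffer`, `capacity`, `decay`) are illustrative assumptions of mine, not part of any existing implementation:

```python
class ShortTermBuffer:
    """Toy short-term store: capacity-limited (7 +/- 2) with activation decay."""

    def __init__(self, capacity=7, decay=0.8):
        self.capacity = capacity
        self.decay = decay
        self.items = {}  # token -> activation level

    def observe(self, token):
        # Decay every existing activation (a stand-in for priming fade-out),
        # then boost the newly observed token to full activation.
        for t in self.items:
            self.items[t] *= self.decay
        self.items[token] = 1.0
        # Capacity constraint: evict the weakest item once over the limit,
        # so structure is shaped by the constraint regardless of content.
        if len(self.items) > self.capacity:
            weakest = min(self.items, key=self.items.get)
            del self.items[weakest]


buf = ShortTermBuffer()
for tok in "the quick brown fox jumps over the lazy sleeping dog".split():
    buf.observe(tok)
print(len(buf.items))  # never exceeds the 7-item capacity
```

The point of the sketch is that the hierarchy a learner can build is bounded by the buffer, not by the language being fed in.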
The thinking is also based around the polar opposites of memory savants and HM.
Having a smaller model that can work, learn fast, and show feedback in short timeframes changes the perception a lot. This is where I think research should focus, rather than on trillion-parameter Wizard of Oz exhibitions, waiting an eon between iterations to be told that a dog can't have three legs because it would fall over.
It’s only running on a CPU, no GPU involved. Sparse.
I experimented with a few billion nodes (memory chains) on an 18-node cluster at home (heats the house, but not good mid-summer, lol) to research what works and what doesn't from a systems perspective, so that I don't end up coding something that needs a complete rewrite or takes a day to load and save between experiments. It also eliminates scalability requirements that would involve remortgaging the house. Raw compute, I think, is just a small fraction of the problem; sparsity makes for a huge amount of random memory access at scale.
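The random-memory-access point can be illustrated with a toy sketch: if a memory chain is stored as successor links in a hash map, every traversal step is a lookup at an unpredictable address, unlike the dense contiguous tensors that current hardware streams through efficiently. The representation and names below are my own illustration, not the actual system described above:

```python
import random

random.seed(0)
NUM_NODES = 1_000_000

# One chain of 10 nodes whose ids are scattered across a large id space,
# as happens when chains are grown sparsely rather than allocated densely.
chain_ids = random.sample(range(NUM_NODES), 10)
successor = {a: b for a, b in zip(chain_ids, chain_ids[1:])}

def traverse(start, links):
    """Follow successor links to the end of the chain.

    Each hop lands somewhere unpredictable in memory (pointer chasing),
    which is the cache-unfriendly access pattern that dominates at scale.
    """
    path = [start]
    while path[-1] in links:
        path.append(links[path[-1]])
    return path

path = traverse(chain_ids[0], successor)
print(len(path))  # 10: the whole chain, reached via scattered lookups
```

With billions of such chains, the cost is dominated by these scattered lookups rather than by arithmetic, which is why raw FLOPS alone don't solve it.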
I had a scenario at a company I worked for a few years back where I was trying to explain how electricity demand changed in a particular way at different times of the year. Figuring it would take an hour of discussion and presentation, I sat back and had a think. What I did was animate the 8,760 points of data (one per hour of the year) into a short five-second repeating clip, give a short description of what they were going to see, and then watch the reactions as I played it. Sixty seconds later, job done, lots of "ahhhh" moments. That's how I think very fast progress can be made.
DL I think of as just a data-compression mechanism that allows chains to be compressed into instincts. Current DL methods seem to me like one oversized single cortical column at the moment, which cripples scalability.
That's how I see it at the moment, and I still have a huuuuge amount to learn. This forum tends to be a paradox for progress, as I end up reading and thinking for hours, but it's a great way of sense-checking back to biological reality, lol.