Intelligence is embarrassingly simple. Part 2

Questionable, but that is not the biggest problem for a POC.
GC takes up to a minute to collect… I never saw Java plain-out freeze for so long…
I'd need to place the links and weights in a few long arrays, control them manually and switch off GC for good.
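Something like this is what I have in mind - a minimal sketch (made-up names, not actual code), assuming the links and weights fit into preallocated primitive arrays so the GC never sees billions of small objects:

```java
// Minimal sketch (hypothetical names): links and weights kept in flat primitive
// arrays instead of per-synapse objects, so there is almost nothing for GC to collect.
public class FlatSynapses {
    // parallel arrays: synapse i connects node source[i] with weight weight[i]
    private final int[] source;
    private final float[] weight;
    private int count = 0;              // manual "allocation" pointer

    public FlatSynapses(int capacity) {
        source = new int[capacity];     // one big allocation up front
        weight = new float[capacity];
    }

    public int add(int sourceNode, float w) {
        source[count] = sourceNode;     // no object churn, no GC pressure
        weight[count] = w;
        return count++;
    }

    public int   sourceOf(int synapse) { return source[synapse]; }
    public float weightOf(int synapse) { return weight[synapse]; }
}
```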

What is the setup? My single 128GB machine consumes 400 W, so how many kW is your stack burning?

1 Like

“Continually” is an odd requirement - technically, even training a brand-new GPT-4 every millisecond counts as “continual” if you have the hardware resources to do so. So defining an upper bound here for a truly CL system is next to impossible - all methods are just compared through a qualitative, relative lens.

My problem is more about the novelty of your method:

  1. What precisely are its advantages over the thousands of other CL alternatives, like Numenta’s methods as proposed in their papers or the more DL grounded ones?

  2. Why, in theory, would your CL-based system be better in any particular field/task/domain when, even with CL abilities, it already underperforms on fairly simple tasks?

No offense - you’re working your way up, but MNIST is a literal meme. Any paper relying on MNIST is just not going to scale (case in point, Hinton’s CapsNets and every undergrad paper claiming to have solved AGI :wink: )

Fun fact: you can actually get more than 85% accuracy on MNIST by simply thresholding the average of the pixel intensities. Adding a few simple rules bumps that up to the low 90s…

What I’m concerned about is that a lot of prior ideas, challenges and concepts are ignored in your approach. So I would love it if you could clarify those points in more detail than just a couple of paragraphs :slight_smile:

2 Likes

You might meet interesting opponents:
OpenAI cannot (is not willing to?) afford to retrain ChatGPT every day (or even every year). Makes me wonder how Bard does this.
Jeff Hawkins with Vincenzo Lomonaco, PhD | LinkedIn might not agree with you.
In purely technical terms, I train IMDB to full convergence (~87% 2-class accuracy, 40% 8-class accuracy) on the first 10% (5K) of the set - the very first “epoch”.

As much as that is true (MNIST is a meme), it is not a proof. The first ICE engine was weaker than a horse. At some point a new paradigm could start working. So we’re having the pleasure of leisurely discussing it.

The main thing is that a dictionary entry (a node, a neuron) in my approach represents a specific [multimodal] pattern (input, motor or inference). It is a superstructure over limited instance-based learning. Specific patterns can connect and communicate “sideways” (thinking) and “backward” (context injection), and form more complex patterns/nodes (speculating?! philosophy?! - not necessarily proper). A dedicated node makes the difference. A “shared” node from conventional models could never meaningfully make new connections between nodes. Dedicated nodes memorize “the structure” of the environment. The attention of Transformers tries to remember that structure [the attn. matrix] by brute power - too many shared nodes approximating a smaller number of dedicated ones. That’s for starters.
And I’m repeating it again [and again] - the architecture [like a salad] combines a lot of different features in one bowl: locality, explainability, plasticity, stability over distribution drifts, knowledge transfer (additivity? shareability? merging is easy - I collect fresh patterns up to RAM capacity, then merge with a huge disk file). It is multimodal, stable to noise and irregular sampling, and tries to explain the neural code and dendritic computations. If I recall more, I will add them.
Just recalled: it can compare strings/vectors of different lengths :slight_smile:
Magically, it applies Minkowski (Euclidean) distance perfectly well when comparing vectors of 10^6 components (why?!), despite the Curse of Dimensionality. For limited strings it estimates the uncomputable Kolmogorov complexity - via the complexity of the comprising dictionaries.

At a conference, a professor reported on attempts to classify noisy, irregularly sampled live sequences of heartbeat, temperature and such - a known doctor at a big children’s hospital. He described it as no light at the end of the tunnel, no theory available. I said: I might just have a solution for you, lives are at stake. The only and last thing I ever heard back was a “wow”. I cannot sell, so I drive a truck :-).

I cannot beat CV (MNIST) alone and part-time, and I won’t try.

P.S. And I don’t work cheap. I now demand two beers from Jeff.

2 Likes

Around 3 kW at idle, then around 5.5 kW when the CPUs are at 100%, so I don’t tend to run them for long as they are also quite loud and hot.
The main stack is 18x DL360 with 2x X5650 and 72GB memory in each. Only DDR3, but 6 channels per machine. 4GB panels because that seemed to be the better option per IO per GB. 20Gb InfiniBand and Gb Ethernet. Ethernet is for broadcast packets, Infini for the main data load/save and faster machine-to-machine traffic (older InfiniBand does not really support broadcast packets as such). A couple of others with 144GB in each.
All very old kit that cost less than £2k, including about 150x 600GB SAS drives (loads 0.5TB at 8GB/sec - loaded and indexed)… it was amazing before covid how cheap stuff was going for on eBay… servers £30 each… memory £1 per 4GB panel… 600GB SAS drives £1.30 each.
Main reality… memory timing on DDR3 access is still relatively fast, and if memory access speed is the main constraint per GB for performance, it was rather good value. Think address access latency rather than throughput - it’s the initial address access delay that matters if you’re only getting 2-3x 64 bits per request (2-3 address reads per calc).
The other aspect is I firmly believe in prototyping on older, slower kit, so you code more efficiently and can then scale out massively (at a massive cost) if needed just by swapping to faster kit. I think a couple of top-end EPYC servers would deliver the same performance as my 18+2, but mine cost around 25x less.

I use arrays with hash indexing, where the hash indexes are one of the two types of connections in the network… new input is added to the arrays, so it’s continual. The arrays are broken into blocks and these are spread around the machines. The arrays contain the incoming sensory stream with relative time tags, to allow the decay to vary between the incoming senses when calculated. Sparse calculation is then done with the relative times, so there is no need to constantly recalculate to keep track of the decays and activations - when needed, the last calculation time provides a temporal delta. The complexity vs performance trade-off is significant when you get to multi-bn inputs / sensory points.
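In spirit it’s something like this - a minimal sketch with made-up names (the real thing is arrays split into blocks across machines, but the lazy-decay trick is the same): the decay is only ever computed from the stored time delta at read time, never ticked in the background.

```java
// Minimal sketch (hypothetical names): activations are decayed only when read,
// using the elapsed relative time since the last calculation, instead of
// recalculating every input on every time step.
public class LazyDecay {
    private final double[] activation;  // last computed activation per sensory point
    private final long[]   lastTime;    // relative time of that computation
    private final double[] halfLife;    // decay can differ per incoming sense

    public LazyDecay(int size, double defaultHalfLife) {
        activation = new double[size];
        lastTime   = new long[size];
        halfLife   = new double[size];
        java.util.Arrays.fill(halfLife, defaultHalfLife);
    }

    // fold a new sensory event into slot i at relative time 'now'
    public void update(int i, double input, long now) {
        activation[i] = read(i, now) + input;   // decay first, then accumulate
        lastTime[i]   = now;
    }

    // current (decayed) value - the only place decay is ever computed
    public double read(int i, long now) {
        double dt = now - lastTime[i];
        return activation[i] * Math.pow(0.5, dt / halfLife[i]);
    }
}
```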

Your code has me thinking, as it’s a full hash-type implementation. In early testing I moved away from a full hash-hash type approach due to the memory (and GC) issues. At one point the CPU was showing sub-1% whilst memory was throttling performance at a million class-object accesses per second, and I wondered what was going on. I wanted those 99%, but now settle for 10-20% busy CPU and 1000x faster calcs. Now I have different problems, like brief sleep iterations to purge and re-order the arrays.

End of the day, still very much a work in progress.

4 Likes

Private lab or what?

1 Like

Home, side hobby. I tend to do things a bit differently.

2 Likes

Ah, but the important question is - do you need to, until you actually get AGI or a self-improving agent?

You could have fairly simple algorithms with CL, but that doesn’t guarantee abstract reasoning.

CL is a critical component of AGI, but it seems to be the most trivially solvable one - compared to the harder task of building a system capable of reasoning.

For instance, LLMs already learn in-context. So Continual Learning is mathematically equivalent to in-context learning in the limit \lim_{ctxlen \rightarrow \infty}. In that case, would actually extending the ctxlen offer any real, tangible benefits?

The problem is this argument can be used to literally justify almost every scientific approach.

Isn’t that just associative learning (like Hopfield networks) coupled with an underlying graph structure?

Are you aware of Doug Lenat’s Cyc project, by any chance?

What.

2 Likes

Don’t mind neel_g here - MNIST was worthy enough for Yann LeCun and Schmidhuber.

It is worth giving it a test, the point is not about “beating” it.

I’m almost sure there are. One line of my amateurish work deals with experimenting with various encodings.

2 Likes

About dealing with hash tables and garbage-collecting unused entries.

One thing to notice about them is that they were designed with an aversion to forgetting, so getting an entry out requires an explicit delete.
“Intelligent” systems don’t need the absolute precision of our computers; they employ an inherent ability to defeat mishaps with statistics. The more used (== useful) some data is, the less likely it is to be forgotten.

The system below uses a stack of tables (1, 2, 3, 4, or N, but let’s stick with 4); each one, besides holding key/value entries, also keeps a readout counter for each key.
Writes are made in table 1, reads start at table 4, with the following rules:

  • The “write” method of the API encapsulating all tables into a single one only puts items in the first table.
  • At each reading of the same entry from table 1, its readout counter is incremented. If the counter reaches a threshold (e.g. 10), the entry is copied to table 2 and its counter reset to 1. There the readout counter grows 10 times more slowly, yet when its threshold is reached the entry is copied to table 3, and so on.
  • This way the most used entries tend to move into a higher-order table.
  • The API exposing “read” first looks in the last table - if the entry is found, that’s what it returns; if not, it looks in the previous one, etc., until a valid entry is found.

Having to read from a list of tables seems like a sacrifice in performance, yet the most used entries will crawl up to the first (or second) table searched.

And obviously, do not deliberately erase anything. Unused entries will simply be overwritten.
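For anyone who prefers code to prose, here is a minimal single-machine sketch of the scheme (my own names and thresholds, plain HashMaps underneath - a real fixed-size version would overwrite stale slots instead of growing):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the stacked-tables idea: writes go only to table 0,
// reads promote frequently used entries toward the last (most stable) table.
public class StackedTables<K, V> {
    private static class Entry<V> { V value; int readouts; Entry(V v) { value = v; } }

    private final Map<K, Entry<V>>[] tables;
    private final int[] thresholds;     // e.g. 10, 100, 1000 - 10x slower per level

    @SuppressWarnings("unchecked")
    public StackedTables(int levels, int baseThreshold) {
        tables = new Map[levels];
        thresholds = new int[levels];
        for (int i = 0; i < levels; i++) {
            tables[i] = new HashMap<>();
            thresholds[i] = baseThreshold * (int) Math.pow(10, i);
        }
    }

    // the "write" of the API only puts items in the first table
    public void write(K key, V value) {
        tables[0].put(key, new Entry<>(value));
    }

    // "read" looks in the last table first, then the previous one, etc.;
    // each hit bumps the readout counter and promotes a copy on threshold.
    public V read(K key) {
        for (int i = tables.length - 1; i >= 0; i--) {
            Entry<V> e = tables[i].get(key);
            if (e == null) continue;
            if (++e.readouts >= thresholds[i] && i + 1 < tables.length) {
                e.readouts = 1;                               // reset counter
                tables[i + 1].put(key, new Entry<>(e.value)); // copy upward
            }
            return e.value;
        }
        return null;    // never written, or already forgotten
    }
}
```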

3 Likes

You seem to be describing a relatively standard hierarchical caching strategy.
CPU cache ↔ Main memory ↔ SSD ↔ HDD ↔ Tape
Some database index storage strategies (B-tree) have the indexing stored hierarchically in 4k pages (initial SQL Server versions), where the index is layered in a similar manner so the pages form an index tree - though based on the outright index value rather than hit frequency.
Add a decay into the layers and you make it a real-time type of cache.

With B-tree type spatial polygon indexing, you keep the position reference within the index tree and move through the index tree to search for the new location’s index position. This is different from always starting at one end, but it implements persistence for the things you need the index for.

3 Likes

Very smart tricks! TIL, and I will put them to some use of mine :smile:

Does the 1st table grow freely, or does an overwrite happen on hash collision?

Or, to ask it another way, do you cap the size of your tables somehow?

1 Like

Feels very promising! Though honestly I have difficulties comprehending how it works.

1 Like

I was thinking of a fixed size. I guess the table size is proportional to the “half-life”? Which would be after how many writes a random entry has a 50% chance of being overwritten. So keys that are repeated more often have a higher chance to “ascend” into the next-level table, where writes (and overwrites) are significantly less frequent.
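A quick sanity check on that proportionality (assuming each write lands in a uniformly random slot of a fixed-size table with S entries):

(1 − 1/S)^k = 1/2  ⇒  k = ln 2 / (−ln(1 − 1/S)) ≈ S·ln 2 ≈ 0.69·S

So a random, never re-read entry has a 50% chance of being overwritten after roughly 0.69·S writes - the half-life is indeed proportional to the table size.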

2 Likes

Well gentlemen, I’m becoming the notorious blabberer of the month, according to the Discourse engine, while being at the keyboard at most 2 days a week. Trying to answer what I can and stay humble/concise.
@ BrainVx On the setup: nice firepower you have, good for you!

I never was fundable, so I started/run a trucking company to finance another project:

speaking about “a bit differently” :slight_smile:

neel_g
Many pluses for keeping your integrity [being stubborn] :slight_smile:

Whoever I’ve spoken/conversed with has shied away. I cannot sell anything; even you guys have problems understanding what I’m saying. That’s why I drive a truck :slight_smile:

Thank you for the kind words. The good news - somebody from this forum is trying to run my naive implementation: GitHub - MasterAlgo/Simply-Spiking
It’s not MNIST yet, but it’s the first step - trying to tackle The Curse and Kolmogorov complexity. Let’s see what happens.

I saw a few mentions that my code utilizes hash tables. I do have a few lines of uncommented hashing, but I never used them - I do use sorted trees and arrays. Next time (if willing), please quote a few lines of me “hashing” - it puzzles me. I will look it up.
BTW, the best hash implementation I’ve found is here: SmoothieMap 2: the lowest memory hash table | by Roman Leventov | Medium
It’s a science in itself…

I think we must wrap up this thread. If you guys ever want to “walk the walk”, let’s start with very simple steps with very definitive results. Otherwise we talk and talk and talk. Start with choosing a respectable speaker/admin who’s not away all the time and can/is willing to run a community project (not me). Please suggest a person… if not - I’m still happy, I appreciate the conversation, and good luck to everyone!

2 Likes

tokensContainer = new TreeMap()

Learning about Java… reading up on the differences between HashMap and TreeMap, I had assumed that TreeMap was a HashMap by default because of the performance benefits. Within .NET the dictionaries (hash) have a sorted option if required, so one “Map” object effectively does both functions if required.

Dictionaries / maps are typically implemented with linked lists per bucket (to allow any object as the key, with per-entry memory allocation, compared to fixed-dimension arrays). The hash index allows the lookup to jump to the linked-list memory position. Starting from 8 bits, I’m guessing you already know this :slight_smile:

Using a TreeMap, you’re taking a performance hit if you don’t need the sorted order, and the performance benefit vs the memory should be worth it. Unless you’re assuming log(n) wins because you have few values, in which case hard-code a static array indexed by char (0-255) to avoid any lookups… chars are bytes (or 2 bytes, depending on how you treat the text)…
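To make the static-array suggestion concrete - a small sketch (my names, assuming 8-bit tokens), not your actual code:

```java
import java.util.TreeMap;

// Sketch (hypothetical names): for 8-bit tokens a flat array beats any Map -
// a single array read instead of hashing or an O(log n) tree walk per lookup.
public class TokenCounts {
    private final int[] counts = new int[256];      // one slot per possible byte value

    public void increment(char token) { counts[token & 0xFF]++; }
    public int  countOf(char token)   { return counts[token & 0xFF]; }

    // TreeMap version kept only for comparison - same result, more work per access
    private final TreeMap<Character, Integer> treeCounts = new TreeMap<>();
    public void incrementTree(char token) { treeCounts.merge(token, 1, Integer::sum); }
    public int  countOfTree(char token)   { return treeCounts.getOrDefault(token, 0); }
}
```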

What I call and mean by “hashing” is the method of disconnecting the input (or internal) layers from a hard-coded, fixed-array type structure.

Your dendritic pattern is used as a Map key… which uses a hash of the key as the indexing method internally (in Java):
Map<DendriticPattern, Integer>   (line 23)

The compiled code in .NET is quite decent for hashing performance (dictionary use and lookup speed).

1 Like

Almost ready to leave… a romantic trip to Kansas… not sure if it helps, but…
the conventional neuron model is something like:
<list of input links/synapses, real-number weights of those links/synapses>
plus a fancy activation function.

my neuron model is:
<list of “structural” links/synapses, with an integer “activation” threshold>
<logical AND activation [dendritic] function over the structural synapses>
in plain English - all inputs ON → the neuron is ON (activated like a transistor)

<list of “reward/attitude/utility/class” links/synapses, each with a real-number association with a class>
In plain English: a neuron gets activated - its associations get activated. Seems to be working.

There is no categorical difference between unsupervised, supervised and reinforced learning.
Unsupervised means the attitude value is a frequency of occurrence.
Supervised means the attitude/class values are frequencies of occurrence for particular classes.
Reinforced mirrors supervised if the reward/punishment is represented as a vector (which it is) of rewards/punishments (which they are).
Complete heresy, I guess, but it seems to be working :slight_smile:
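If it helps, here’s how I’d sketch it in code - not my actual implementation, just a minimal illustration with made-up names: structural synapses with an integer threshold and a logical-AND dendritic function, plus class associations accumulated as frequencies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Minimal illustration (made-up names): a neuron fires only when enough of its
// structural inputs are ON (logical AND up to an integer threshold); on firing,
// it votes for classes with its accumulated association frequencies.
public class PatternNeuron {
    private final int[] structuralInputs;    // ids of the input links/synapses
    private final int activationThreshold;   // == structuralInputs.length for a pure AND
    private final Map<Integer, Double> classAssociation = new HashMap<>(); // class -> frequency

    public PatternNeuron(int[] structuralInputs, int activationThreshold) {
        this.structuralInputs = structuralInputs;
        this.activationThreshold = activationThreshold;
    }

    // dendritic function: count how many structural inputs are currently ON
    public boolean isActive(Set<Integer> activeInputs) {
        int on = 0;
        for (int input : structuralInputs) {
            if (activeInputs.contains(input)) on++;
        }
        return on >= activationThreshold;
    }

    // unsupervised = always reinforce the same pseudo-class (frequency of occurrence);
    // supervised   = reinforce the observed class; reinforced = add a reward vector.
    public void reinforce(int classId, double reward) {
        classAssociation.merge(classId, reward, Double::sum);
    }

    // a neuron gets activated -> its associations get activated
    public Map<Integer, Double> vote(Set<Integer> activeInputs) {
        return isActive(activeInputs) ? classAssociation : Map.of();
    }
}
```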

Be a man, take responsibility - be the admin of the project - we [you] will come up with a lot of unorthodox discoveries… songs will be sung… Nobels will be given… People will hate you - because it might prove we [humans] are all walking hierarchical pattern dictionaries.

In any case: being [slightly] drunk and bored [myself] is a dangerous combination… [I’m] getting too verbose [like Java] - still, try opposing reproducible experiments. Let’s slowly rewrite my Java into whatever PL you’re comfortable with. Talk to you in a week! Cheers!

3 Likes