I put a few hours of thought into what it would take to simulate this in terms of memory and cost, using existing hardware and aiming for a minimal memory footprint. I'll write a little later about using customized hardware (ASIC/FPGA) for the same activity, and its costs.
Using standard, existing hardware
If we do away with weighted cells in the minicolumns, instead essentially using them as keys in a dictionary/map that contains a list of “proximal connections”, we can shrink the overall memory footprint, enabling us to write an actor model of sorts in C/C++. I propose the following model, in which all variables are statically allocated at compile time, an approach strongly influenced by embedded development. Rough C sketches follow each of the field lists below.
Minicolumn object (struct):
- array_cell_vote_boxes[ ] // uint16/32 value per cell. Incremented by cells that won at timestep - 1.
- ptr_idx_id // pointer where column updates its overlap score in pool object.
- cells[ cell_connections[ ] ] // works as a ring buffer, per cell, holding ptrs to winning cells’ vote boxes.
- cells_buffer_idx[ cell_connections_idx_counter ] // keep track of where each cell’s connections are in its buffer
- cells_connection_buffer_full_bmp // track if cell connection buffer is full.
- col_connect_bmp // bitstring representing distal connections of minicol to input space
- col_act_state_bmp // bitstring representing overlap to current encoding for timestep
- col_dstl_conn_strngth[ ] // uint8; connection strength for each connected bit.
- bool_previously_activated // were we fired up on the previous timestep?
- flag_minicol_access // semaphore/flag to indicate if another column is accessing this one.
- winner_candidates[ ] // ptrs to previous winning cells’ vote boxes. Passed in with the input encoding. --> could also simply be a pointer to a single memory location where the current previous winners are located.
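To make the sizes concrete, here is a rough C sketch of that struct. The constants (CELLS_PER_COL, MAX_CELL_CONNECTIONS, INPUT_BITS) are illustrative choices of mine, not fixed by the design, and I've represented each cell's connection buffer as 32-bit pool-local offsets into the vote boxes rather than full 64-bit pointers, matching the addressing assumption used in the memory estimate below:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdatomic.h>

/* Illustrative sizes only; the real values are tuning parameters. */
#define CELLS_PER_COL        120
#define MAX_CELL_CONNECTIONS 512
#define INPUT_BITS           2048

typedef struct minicolumn {
    /* One vote box per cell, incremented by cells that won at timestep - 1. */
    uint32_t array_cell_vote_boxes[CELLS_PER_COL];

    /* Where this column writes its overlap score in the pool object. */
    uint16_t *ptr_idx_id;

    /* Per-cell ring buffers of 32-bit pool-local offsets to winning cells'
       vote boxes (full pointers would also work, at twice the size). */
    uint32_t cells[CELLS_PER_COL][MAX_CELL_CONNECTIONS];

    /* Current write position within each cell's ring buffer. */
    uint16_t cells_buffer_idx[CELLS_PER_COL];

    /* One bit per cell: has that cell's connection buffer filled/wrapped? */
    uint8_t cells_connection_buffer_full_bmp[(CELLS_PER_COL + 7) / 8];

    /* Bitstring of the minicolumn's connections to the input space. */
    uint8_t col_connect_bmp[INPUT_BITS / 8];

    /* Bitstring of the overlap with the current encoding for this timestep. */
    uint8_t col_act_state_bmp[INPUT_BITS / 8];

    /* Connection strength for each connected input bit. */
    uint8_t col_dstl_conn_strngth[INPUT_BITS];

    /* Were we fired up on the previous timestep? */
    bool bool_previously_activated;

    /* Flag indicating another column is currently accessing this one. */
    atomic_flag flag_minicol_access;

    /* Pointer to the list of previous winners' vote boxes (or to a single
       shared "current previous winners" location, as noted above). */
    uint32_t **winner_candidates;
} minicolumn_t;
```

Even in this sketch, the per-cell connection buffers dwarf every other field in the struct, which is exactly the point made in the Memory Usage section below.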
Pool Object (struct):
- array_of_winners[ ] // list of pointers to winning cells vote boxes
- avg_overlap_score // updated as minicolumn overlap scores flow in
- completed_count // uint32 to keep track of how many minicols have reported overlap
- min_score // find lowest activation score
- max_score // find max score
- poolstate_bmp // bitstring representing activation state of pool for current timestep.
- minicol_pool_list [ ] // list of pointers to minicolumn structs
- minicol_overlap_score [ ] // uint16, enough bits to hold a score value up to ~64k per minicolumn. Updated by the minicols themselves.
- flag_incrementing_ctr // semaphore/lock for counter
- flag_min_val // semaphore/lock for pool.min_score
- flag_max_val // semaphore/lock for pool.max_score
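A matching sketch of the pool struct; MINICOLS_PER_POOL and MAX_WINNERS are illustrative assumptions of mine, and minicolumn_t is the struct sketched above:

```c
#include <stdint.h>
#include <stdatomic.h>

#define MINICOLS_PER_POOL 4000000u  /* ~4M columns per pool, per the 32-bit addressing assumption */
#define MAX_WINNERS       81920u    /* assumed cap on winners per timestep (~2% of columns) */

struct minicolumn;                  /* defined in the sketch above */

typedef struct pool {
    /* Pointers to the winning cells' vote boxes for the current timestep. */
    uint32_t *array_of_winners[MAX_WINNERS];

    /* Updated as minicolumn overlap scores flow in. */
    uint32_t avg_overlap_score;

    /* How many minicols have reported their overlap so far. */
    _Atomic uint32_t completed_count;

    /* Lowest and highest overlap scores reported this timestep. */
    uint16_t min_score;
    uint16_t max_score;

    /* Activation state of the pool for the current timestep, one bit per minicolumn. */
    uint8_t poolstate_bmp[(MINICOLS_PER_POOL + 7) / 8];

    /* Pointers to the minicolumn structs and their reported overlap scores. */
    struct minicolumn *minicol_pool_list[MINICOLS_PER_POOL];
    uint16_t minicol_overlap_score[MINICOLS_PER_POOL];

    /* Locks protecting the shared counter and extrema. */
    atomic_flag flag_incrementing_ctr;
    atomic_flag flag_min_val;
    atomic_flag flag_max_val;
} pool_t;
```

Declared statically at file scope, as proposed above, a single pool instance at these sizes is on the order of 40 MB, before counting the minicolumns it points to.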
Operation (high level):
- Load current input encoding into its variable.
- Iterating over the list of minicols (can be split across different processes), send the pointer location of the encoding, asking each for its overlap score.
- Wait for columns to finish (columns directly increment pool.completed_count and write pool.minicol_overlap_score).
- Based on average, min, max, choose winning threshold
- Depending on pool size pick winners based on score (random sampling, random starting location in list, etc… pick a scheme).
- For winning minicols, set minicol.bool_previously_activated to true.
- Strengthen connection score where overlap existed (optionally weaken non-overlap at random?)
- Choose winning cell in minicol.
- Update (at random?) previous winning minicols’ connections (the minicol.cells array), grabbing flag_minicol_access, incrementing cells_buffer_idx as needed, and flipping the “buffer_full” state if needed.
- (minicol) Update the winning cell’s address in the pool object.
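Putting those steps together, here is a loose sketch of one timestep for a single pool. The helper functions are placeholders named after the steps above (compute overlap, choose threshold/winners, pick a winning cell, update previous winners), not an actual implementation, and the structs are the ones sketched earlier:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder prototypes for the steps listed above; minicolumn_t and
   pool_t are the structs sketched earlier. */
void          minicolumn_compute_overlap(minicolumn_t *col, const uint8_t *enc);
uint16_t      choose_threshold(const pool_t *pool);
size_t        choose_winners(pool_t *pool, uint16_t threshold);  /* fills poolstate_bmp */
minicolumn_t *winner_column(pool_t *pool, size_t w);
void          strengthen_overlapping_connections(minicolumn_t *col);
uint32_t     *choose_winning_cell(minicolumn_t *col);            /* returns its vote box */
void          update_previous_winner_connections(pool_t *pool, size_t n_winners);

void pool_timestep(pool_t *pool, const uint8_t *encoding)
{
    /* 1. Hand the encoding to every minicolumn; this is the loop that gets
          split across threads/processes. Each minicolumn writes its own
          slot in pool->minicol_overlap_score and bumps completed_count. */
    atomic_store(&pool->completed_count, 0);
    for (size_t i = 0; i < MINICOLS_PER_POOL; i++)
        minicolumn_compute_overlap(pool->minicol_pool_list[i], encoding);

    /* 2. Wait for all columns to report (spin here; a barrier in practice). */
    while (atomic_load(&pool->completed_count) < MINICOLS_PER_POOL)
        ;

    /* 3. Pick a winning threshold from avg/min/max, then pick the winners. */
    uint16_t threshold = choose_threshold(pool);
    size_t n_winners = choose_winners(pool, threshold);

    /* 4. For each winner: mark it, strengthen overlapping connections,
          choose a winning cell, and record that cell's vote box. */
    for (size_t w = 0; w < n_winners; w++) {
        minicolumn_t *col = winner_column(pool, w);
        col->bool_previously_activated = true;
        strengthen_overlapping_connections(col);
        pool->array_of_winners[w] = choose_winning_cell(col);
    }

    /* 5. Previous winners add the new winners into their cells' ring
          buffers, grabbing flag_minicol_access and advancing
          cells_buffer_idx / buffer-full bits as needed. */
    update_previous_winner_connections(pool, n_winners);
}
```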
IO Block (very high level):
- Shard work across systems, handing information back and forth.
Memory Usage:
The main multiplier of memory for a system like this is how many cells per minicol you want to simulate, and how many proximal connections you want each of those cells to support. The memory for everything else is almost trivial by comparison.
For example, a system of 100 million minicols (instead of 120 or 150 million), with 120 cells in each and each cell supporting up to 512 proximal connections (roughly 6.1 trillion connection slots), would take ~7.3 TB of memory, assuming no pool has more than ~4 million columns (32-bit addressing within each pool) and ignoring everything else on a running system. Not insurmountable, but it might be worth examining that number, considering that people who receive hemispherectomies still seem to mostly get by in life after a period of adjustment. Maybe we only need half of our neocortex??
Processing Speed:
A+ Server 2124BT-HTR with four nodes, each with two EPYC 7742 CPUs (64 cores, 128 threads each) --> 1024 x ~2 GHz threads, 4 TB RAM, costs up to ~$95k (USD). Our simulated brain-in-a-box above, for memory/RAM reasons, would take a couple of these puppies, meaning you have a ~$190k (USD) distributed system across 8 nodes on a rack, supporting 100m minicolumns.
Each of those 2048 threads, chomping through each of those minicolumn objects in memory and synchronizing only at the end of each round (sharing winners), should be quite fast, as most of the time is spent reading/writing memory locations rather than doing any heavy computing. Each of those cores will probably be happily hopping from column to column as data streams in and out of the CPUs…
Writing a simple single-threaded example and then roughly multiplying by the number of threads should give us our per-round execution speed. Since memory for each pool is pre-allocated at pool startup, no time is wasted requesting and freeing memory.
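As a first pass at that single-threaded measurement, a toy harness along these lines would do: it scores a million dummy columns against one encoding via bitmap AND plus popcount, approximating the memory-bound inner loop. The sizes are arbitrary choices of mine, and it assumes GCC/Clang (__builtin_popcountll) on a POSIX system (clock_gettime):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative sizes only: a toy "pool" small enough to run on a workstation. */
#define N_COLS     1000000
#define INPUT_BITS 2048
#define INPUT_U64  (INPUT_BITS / 64)

typedef struct {
    uint64_t connect_bmp[INPUT_U64];  /* stand-in for col_connect_bmp */
    uint16_t overlap;                 /* stand-in for the reported overlap score */
} toy_column;

int main(void)
{
    toy_column *cols = calloc(N_COLS, sizeof *cols);
    uint64_t encoding[INPUT_U64];
    if (!cols) return 1;

    /* Fill with arbitrary bit patterns so the popcounts do real work. */
    for (int i = 0; i < INPUT_U64; i++)
        encoding[i] = 0x9E3779B97F4A7C15ull * (uint64_t)(i + 1);
    for (int c = 0; c < N_COLS; c++)
        for (int i = 0; i < INPUT_U64; i++)
            cols[c].connect_bmp[i] = encoding[i] ^ ((uint64_t)c * 0x2545F4914F6CDD1Dull);

    /* One "round" of overlap scoring, single-threaded. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int c = 0; c < N_COLS; c++) {
        unsigned score = 0;
        for (int i = 0; i < INPUT_U64; i++)
            score += __builtin_popcountll(cols[c].connect_bmp[i] & encoding[i]);
        cols[c].overlap = (uint16_t)score;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("%d columns scored in %.1f ms (single thread)\n", N_COLS, ms);
    free(cols);
    return 0;
}
```

Build with something like cc -O2 harness.c, then scale the measured per-round time by column count and thread count for a ballpark figure.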
Using this approach keeps overall network delay to a minimum, since fewer nodes are required. The primary bottleneck in this system becomes RAM and CPU data-transfer speeds, rather than the processing itself.
Feel free to poke holes. I’ll spend a couple days thinking about customized hardware inserted as PCI cards, and see what the cost comparison would be.