Distributed HTM?

I have dabbled with Thesbian, which is an agent based environment in Python. I did wonder about how nupic HTM might be distributed and the use of agent based concurrency made some sense? I note an amalgam of nupic and torch. What about ray.io? Is it too soon to dream?

Cheers,
A

5 Likes

A year and a half ago, I’d done some work starting an HTM implementation in Elixir, where each column acts as an agent sending messages around in a pool.

Seeing as how we’ve been given some time here, I’ve been dusting that off and looking at revamping it as a web API.

I’ll resurrect my old thread in the morning and bring my repo online.

Be well.

1 Like

IMO it would be interesting to see topology implemented in a distributed manner.

2 Likes

From my experience. Running a single HTM layer on multiple systems is likely a very bad idea. HTM is entirely bounded by the memory bandwidth available to it. Having a distributed system to process a HTM layer is going to drastically decease the effective bandwidth due to communication latency and networks are just slow compare to DRAM.

On the other hand, distributing HTM to multiple systems sounds promising. And is a a planned feature of my HTM library. But I didn’t got around to build it nor have a piratical use. It could be useful to scale 1k brain theory to human scale. But we need a fully working TBT theory before that.

1 Like

It doesn’t make sense to distribute the same task (htm layer) in multiple distributed units. At least in the perspective of distributed computing. If distributed computing is used in HTM I would recommend using decoupled tasks that can be run independently in distributed units, therefore drastically minimizing communication between units. Communication is only done ideally during aggregation of results.

1 Like

Currently, the topology looks like this:

server–> pool manager --> columns.

Each part is its own process, using the BEAM VM’s concept of lighweight threads. Each process holds its own state, and communicates with others via messaging. This allows each column to act as an independent entity, which makes modelling a bit easier. Where Elixir is built on top of Erlang, and Erlang’s BEAM VM already includes the ability to scale and distribute work across hosts, reducing the amount of engineering we’d have to do. The VM automatically distributes processes across all the cores in a system, and across the hosts whenever a new node is added.

Current process flow:

Step one:
encoder(s) ---- SDR —> pool —(broadcast message)–> column(s); check overlaps and predictive state

Step two:
pool <—(connection count messages)— column(s)

Step three:
pool —(to winners)—> column(s): strengthen overlap connections; if in predictive state, strengthen. If not, burst.

Step four:
Update pool state, so that it is able to be queried from remote host

Planned expansions: remote callbacks for “IO block system” (state machine which amplifies/dampens pool’s connections to IO block depending on current state, whose parameters can be configurable).

Intended use is for writing controllers for simulators, where controller would encode simulator conditions and send SDRs to pool server, which would then asynchronously send a callback with output data. Using this model, could even connect to robotics projects using something like ESP32 or its cousins for real world experiments.

Anyway, had actual work come in yesterday afternoon, which might throw back my repo setup just a bit. I’ll revive that old thread when I get my code up.

4 Likes

I applaud you for leveraging BEAM / Erlang / Elixir to do this. I surmise you are also making heavy use of the OTP?

2 Likes

Heavy OTP use, with GenServers out the wazoo. Tweaking a bit to see if I can get some better efficiency gains. Think I should have time to upload my current stuff this evening.

1 Like

I put a few hours of thought into what it would take to simulate this, memory and costs, using existing hardware, and aiming for minimal memory footprint. I’ll write a little later about using customized hardware (ASIC/FPGA) for doing the same activity, and its costs.

Using standard, existing hardware

If we do away with weighted cells in the minicolumns, instead essentially using them as keys in a dictionary/map that contain a list of “proximal connections”, we can shrink the overall size footprint in terms of memory, enabling us to write an actor model of sorts in C/C++. I propose the following model, in which all variables are statically declared at compile time. Strongly influenced by embedded devolopment:

Minicolumn object (struct):

  • array_cell_vote_boxes[ ] // uint16/32 value per cell. Incremented by “timestep - 1” winning cells.
  • ptr_idx_id // pointer where column updates its overlap score in pool object.
  • cells[ cell_connections[ ] ] // work work as a ring buffer, per cell, holding ptrs to winning cells vote boxes.
  • cells_buffer_idx[ cell_connections_idx_counter ] // keep track of where each cell’s connections are in its buffer
  • cells_connection_buffer_full_bmp // track if cell connection buffer is full.
  • col_connect_bmp // bitstring representing distal connections of minicol to input space
  • col_act_state_bmp // bitstring representing overlap to current encoding for timestep
  • col_dstl_conn_strngth[ ] // uint8; for connection strength for each connected bit.
  • bool_previously_activated // were we fired up on the previous timestep?
  • flag_minicol_access // semaphore/flag to incidate if another column is accessing this one.
  • winner_candidates[ ] // ptrs of prevwinning cells voteboxes. Passed in with input encoding. --> could also simply be a pointer to a single memory location where current previous winners are located.

Pool Object (struct):

  • array_of_winners[ ] // list of pointers to winning cells vote boxes
  • avg_overlap_score // updated as minicolumn overlap scores flow in
  • completed_count // uint32 to keep track of how many minicols have reported overlap
  • min_score // find lowest activation score
  • max_score // find max score
  • poolstate_bmp // bitstring representing activation state of pool for current timestep.
  • minicol_pool_list [ ] // list to pointers of minicolumn structs
  • minicol_overlap_score [ ] //uint16, enough bits to hold up to ~64k score value per cell. Updated by minicols themselves.
  • flag_incrementing_ctr // semaphore/lock for counter
  • flag_min_val // semaphore/lock for pool.max_score
  • flag_max_val // semaphore/lock for pool.min_score

Operation (high level):

  1. Load current input encoding into its variable.
  2. iterating over list of minicols, (can be broken to different processes) send pointer location of encoding, asking for overlap score.
  3. Wait for columns to finish ( columns directly increment pool.completed_count and pool.minicol_overlap_score)
  4. Based on average, min, max, choose winning threshold
  5. Depending on pool size pick winners based on score (random sampling, random starting location in list, etc… pick a scheme).
  6. For winning minicols, change minicol.previously_activated to true.
  7. Strengthen connection score where overlap existed (optionally weaken non-overlap at random?)
  8. Choose winning cell in minicol.
  9. Update (at random?) previous winning minicols (grabbing f"lag_minicol_access") connections (minicol.cells array), increment cell_buffer_idx as needed, and flip “buffer_full” state if needed.
  10. (minicol) update in pool object address of winning cell.

IO Block (very high level):

  • Shard work across systems, handing information back and forth.

Memory Usage:
The main multiplier of memory for a system like this, is how many cells per minicol you want to simulate, and how many proximal connections you want each of those cells to support. The memory of everything else is almost trivial by comparison.

For example, in a system of 100 million minicols (instead of 120 or 150m), with 120 cells in each, with each cell supporting up to 512 proximal connections, would take ~7.3tb of memory, assuming no pool has more than ~4 million columns (32bit addressing for each pool), and ignoring everything else on a running system. Not insurmountable, but might be worth examining that number, considering that people who receive hemispherectomies still seem to mostly get by in life after a period of adjustment. Maybe we only need half of our neocortex??

Processing Speed:
A+ Server 2124BT-HTR with four nodes, each with two EPYC 7742 CPUs (64 cores, 124 threads each) --> 1024 x 2Ghz threads, 4tb ram, costs up to ~95k (USD). Our simulated brain-in-a-box above, for memory/RAM reasons, would take a couple of these puppies, meaning you have $190k USD distributed system across 8 nodes on a rack, supporting 100m minicolumns.

Each of those 2048 threads, chomping through each of those minicolumn objects in memory, synchronizing only at the ends of each turn (sharing winners), should be quite fast, as most of the time is spent on reading/writing memory locations, rather than doing any heavy computing. Each of those cores will probably be happily hopping from column to column as memory is coming and going into the CPUs…

Writing a simple single-threaded example, then roughly multiplying that by number of threads should give us our per-round execution speed. Since memory for each pool is pre-allocated at pool startup, no time is wasted requesting and freeing up memory.

Using this approach reduces the overall network delay as much as possible, as there are fewer nodes required. The primary bottleneck in this system becomes RAM and CPU data speeds, rather than processing itself.

Feel free to poke holes. I’ll spend a couple days thinking about customized hardware inserted as PCI cards, and see what the cost comparison would be.

4 Likes

Hopefully I can share some insight to hardware limitations.

I can’t stress how slow synchronizing across CPU threads can be, you could evaluate several minicolumns in the same amount of time to synchronize two threads. Syncing across networks is another magnitude slower.

Also (assuming you have directly access to the DRAM controller) writing to non-continuous addresses in small chunks are very slow. DRAM works in banks and accessing nearby banks are faster than randomly accessing. Like most cases, the core won’t have direct access to DRAM but have to go through a system bus (AXI, WishBone, etc…). In this case initiating a DRAM transfer takes ~64 cycles each time and is mostly a blocking operation; simultaneous transfers are not allowed further limiting the effective bandwidth. In practice frequent, small reads are 10x slower then a single large read. A solution is to build a write-back buffer to only write data when needed. Or plain the access pattern carefully to make memory r/w efficient.

BTW, which FPGA you are using?

1 Like

If we’re distributing, where each column is doing its own work, there’s almost no need to sync threads. Each one can act, for the most part, independently of the others, so long as they all have access to the memory.

My thought approach is to avoid syncing as much as possible. Keep column memory in contiguous arrays of memory, and copy only when needed (i.e. utilize DRDY pins on a per-area basis). In any case, RAM will always be faster than any FPGA we can reasonably get our hands on.

BTW, which FPGA you are using?

At the moment, using an ICE40LP8K as provided on a TinyFPGA BX board, but have a DE10 Nano on the way with a CycloneV SoC (with embedded ARM hard core) which will give a nice jump up from 8K LUT to 110k LE.

For FPGA implementation ideas, we should continue this discussion here.

2 Likes

But less work to weave thespian into the python implementation and so as not to drag the question into the BEAM model, which is perversely the same model Thespian is offering. So, much off topic herein on Erland, Elixir and FPGA when the question was back fitting Thespian into the standing python implementation. Still, the discussion of how to carve up appropriately is obviously apt.