In my original implementation, I just used the text blocks placed by the authors (this seemed reasonable, since the folks who wrote the text chose those points as natural breaks). My latest implementation allows online learning by borrowing the concept of eligibility traces from RL, removing the need to define text snippets up front.
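Sketched very roughly, the trace mechanism works like this: each word seen in the stream carries a decaying eligibility value, and a new word is associated with every recently seen word in proportion to that word's remaining trace. The decay constant and pruning threshold below are illustrative assumptions, not the parameters I actually use:

```python
from collections import defaultdict

DECAY = 0.8        # per-step trace decay (illustrative value)
MIN_TRACE = 0.05   # prune traces below this threshold

trace = {}                  # word -> current eligibility
assoc = defaultdict(float)  # unordered word pair -> accumulated association

def observe(word):
    """Process one word from a stream: associate it with recently seen
    words in proportion to their eligibility, then refresh its own trace."""
    for prev, e in trace.items():
        if prev != word:
            assoc[tuple(sorted((prev, word)))] += e
    # decay all traces, prune tiny ones, and reset the new word's trace
    for w in list(trace):
        trace[w] *= DECAY
        if trace[w] < MIN_TRACE:
            del trace[w]
    trace[word] = 1.0

for w in "the cat sat on the mat".split():
    observe(w)
```

Because the trace decays smoothly, nearby words end up more strongly associated than distant ones ("cat"/"sat" beats "cat"/"mat" above), without ever deciding in advance where one snippet ends and the next begins.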
This bit took me the longest to figure out. Given the properties of SDRs, however, it isn't actually necessary to encode word semantics with this type of topology. My initial implementation did not have the "similar meanings closer to one another" topology, but was still able to replicate some of the frequently cited word-SDR math, like "Jaguar - Lion = Porsche".
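A toy example shows why this kind of arithmetic works on SDRs even without topology. The bit assignments below are entirely made up for illustration (real SDRs are thousands of bits with only a few percent active); the point is that "jaguar" shares bits with both the animal and the car, so subtracting "lion" leaves the car-like bits:

```python
# Toy SDRs as sets of active bit indices (hand-picked for illustration)
cat_bits = {1, 2, 3, 4}      # "big cat" semantic bits
car_bits = {10, 11, 12, 13}  # "sports car" semantic bits

sdrs = {
    "jaguar":  cat_bits | car_bits,  # ambiguous: both animal and car
    "lion":    cat_bits | {5, 6},
    "porsche": car_bits | {14, 15},
}

def overlap(a, b):
    """Similarity between two SDRs = number of shared active bits."""
    return len(a & b)

# "Jaguar - Lion": remove lion's active bits from jaguar's SDR...
residue = sdrs["jaguar"] - sdrs["lion"]

# ...then find the word whose SDR overlaps the residue the most
best = max((w for w in sdrs if w != "jaguar"),
           key=lambda w: overlap(sdrs[w], residue))
print(best)
```

The subtraction strips the "big cat" component out of "jaguar", and the nearest remaining SDR by overlap is "porsche". No spatial arrangement of bits is needed; set overlap alone carries the semantics.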
I’ve since discovered that hex grids can be used to distill topology from semantics. I’m working on a project to generalize this concept into a “universal encoder” algorithm. The idea originally started from a conversation with @jordan.kay a couple of years ago.