I don’t know how strong your background is in algorithms and various coding techniques but I will try to keep this as general as possible considering the source material.
RatSLAM borrows some of the observed organization of the hippocampus.
The body of the hippocampus and related entorhinal cortex are known to encode the environment as responses in “grid cells” - cells that fire in relation to where a critter is in a given environment. This grid pattern varies in spacing as you go from front to back.
In HTM we generally call this an encoder; in this case it encodes spatial features.
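To make the encoder idea concrete, here is a minimal sketch of a grid-cell-style position encoder. It is my own illustration, not code from RatSLAM or from any HTM library: the function name `grid_encode`, the three spacings, and the 5x5 cells per module are all made-up assumptions, and it tiles space with squares rather than the hexagonal pattern real grid cells show.

```python
def grid_encode(x, y, spacings=(0.3, 0.5, 0.8), cells_per_axis=5):
    """Return the set of active cell indices for position (x, y).

    Each "module" tiles space with a repeating square pattern of a
    different spacing (hypothetical values, in arbitrary units).
    One cell per module is active, so the result is a sparse code.
    """
    active = set()
    for module, spacing in enumerate(spacings):
        # Position inside the repeating tile, as a phase in [0, 1).
        px = (x % spacing) / spacing
        py = (y % spacing) / spacing
        # Quantize the phase to a cell; the min() guards against rounding.
        col = min(int(px * cells_per_axis), cells_per_axis - 1)
        row = min(int(py * cells_per_axis), cells_per_axis - 1)
        # Give each module its own block of cell indices.
        offset = module * cells_per_axis ** 2
        active.add(offset + row * cells_per_axis + col)
    return active
```

Each module on its own is ambiguous because its pattern repeats, but the combination across several spacings narrows down the location. Nearby positions tend to share active cells, which is the property you want from an encoder feeding SDRs.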
If you were to put lots of SDR dendrites through this tissue it would be sampling features of the environment. These samples are collected in an area commonly called “place cells” which are a store of significant places (good or bad) inside this environment. If something good or bad happens you trigger learning of this place.
This is not strictly based on visual cues - a rat traveling through dark tunnels also updates these grid cells. In the ratslam model, we also keep track of head/body orientation and use the distance traveled to get a rough idea of where we are.
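Tracking orientation plus distance traveled is ordinary dead reckoning (path integration). A minimal sketch, assuming each movement step arrives as a (turn in radians, distance) pair; the function name and the step format are my own choices for illustration, not RatSLAM's interface:

```python
import math

def integrate_path(steps, start=(0.0, 0.0), heading=0.0):
    """Estimate position from a sequence of (turn, distance) steps.

    turn is the change in heading in radians (positive = counter-clockwise),
    distance is how far we moved after turning. No vision is involved.
    """
    x, y = start
    for turn, distance in steps:
        # Update orientation first, then move along the new heading.
        heading = (heading + turn) % (2 * math.pi)
        x += distance * math.cos(heading)
        y += distance * math.sin(heading)
    return x, y, heading

# Walking the four sides of a unit square brings us back to the start.
square = [(0.0, 1.0), (math.pi / 2, 1.0), (math.pi / 2, 1.0), (math.pi / 2, 1.0)]
x, y, heading = integrate_path(square)
```

In practice every step carries a small error and the estimate drifts, which is why the position has to be grounded against something observed in the environment.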
RatSLAM does use vision to ground the location in observations. The observed environment in this model is visual strips of the entire local environment. Presented directly to a memory, these strips have lots of high-frequency edges that would mostly translate into noise, so they do a rather clever trick - they analyze the frequency content of the image. This translates spatial data into frequency data. (This is the Fourier domain they are speaking of.) Large objects would contribute a low-frequency component in that part of the image. Smaller objects would be a higher-frequency blip in a certain location. Two close views of the same scene would have the same frequency blips in about the same places.
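Here is a rough sketch of the frequency trick using NumPy. It is an illustration of the idea, not RatSLAM's actual code: I assume the visual strip has already been reduced to a 1D array of brightness values, and `frequency_signature` and `n_bins` are names I made up. Note that this simple version takes one spectrum over the whole strip, so it tells you which frequencies are present but not where in the strip they occur.

```python
import numpy as np

def frequency_signature(strip, n_bins=16):
    """Summarize a 1D brightness strip by its low-order frequency content."""
    strip = np.asarray(strip, dtype=float)
    # Remove the mean so overall brightness does not dominate.
    strip = strip - strip.mean()
    # Magnitude of the real FFT: how much of each spatial frequency is present.
    spectrum = np.abs(np.fft.rfft(strip))
    # Skip the DC term and keep only the first n_bins frequencies,
    # discarding the noisy high-frequency edge detail.
    sig = spectrum[1:n_bins + 1]
    norm = np.linalg.norm(sig)
    return sig / norm if norm > 0 else sig

positions = np.arange(128) / 128
wide = np.sin(2 * np.pi * 2 * positions)     # broad features: 2 cycles per strip
narrow = np.sin(2 * np.pi * 10 * positions)  # fine features: 10 cycles per strip

wide_sig = frequency_signature(wide)
narrow_sig = frequency_signature(narrow)
```

The broad pattern puts its energy in a low bin and the fine pattern in a higher one. Because only magnitudes are kept, a slightly shifted view of the same strip gives nearly the same signature.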
(For you fellow theorists: the pattern in grid cells, with spacing that varies as you travel along the area, sure looks a lot like a spatial FFT plot. The body already does this encoding trick in the ear, which works as a mechanical FFT from waveform to frequencies. (The brain does chords in processing.) The neural equivalent would be the long curled limbic structure with oscillation going on: the length vs width must have a resonant frequency in terms of propagation, just as it would in the sound domain. Phase encoding against the predominant 10 Hz wave allows some amazing computational tricks.)
This is a very crude but effective object recognition method which also encodes relative arrangement between two objects.
A view can be compared to stored views to select the one that matches most closely.
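A minimal sketch of that matching step, assuming each view has already been reduced to a feature vector of equal length. The cosine-similarity measure, the `best_match` name, and the 0.9 threshold are my assumptions for illustration; the paper may well score matches differently.

```python
import numpy as np

def best_match(view, stored_views, threshold=0.9):
    """Return (index, score) of the closest stored view, or (None, score)
    if nothing is similar enough to count as a known place."""
    view = np.asarray(view, dtype=float)
    best_index, best_score = None, -1.0
    for i, stored in enumerate(stored_views):
        stored = np.asarray(stored, dtype=float)
        denom = np.linalg.norm(view) * np.linalg.norm(stored)
        # Cosine similarity: 1.0 means the vectors point the same way.
        score = float(view @ stored / denom) if denom > 0 else 0.0
        if score > best_score:
            best_index, best_score = i, score
    if best_score < threshold:
        return None, best_score
    return best_index, best_score

stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
known, known_score = best_match([0.9, 0.1, 0.0], stored)    # close to view 0
unknown, unknown_score = best_match([0.0, 0.0, 1.0], stored)  # matches nothing
```

When nothing clears the threshold, the natural move is to store the current view as a new entry so it can be recognized next time.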
The various metrics (orientation, distance traveled, scenes) can be fused into a real-time map of the environment. One would hope that they are stored in a way where the scene location key is close to the other position data keys in the map as it would be in the brain.
I hope this gives you the tools to read the ratslam paper and understand what they are doing.
For bonus points consider a closely related structure - the amygdala. It also samples the encoded information in the hippocampus and entorhinal cortex. It reacts to things like faces and expressions and various archetypical things like snakes and looming overhead objects. These sensations are colored with emotional tones that drive action and memory. Think about what this means - somehow this soup of encoded features can be recognized as high-level objects. These are hardwired primitives in our lizard brain.