Initial real-time vision-to-SDR encoder



Well, it’s been a while. I’ve been pretty busy, but whenever I had enough spare time, I’ve continued working on something inspired by this forum:

This uses OpenCV and TensorFlow to translate RGB camera input into a sparse set of points, in as close to real time as possible. It uses n-dimensional algorithms and should be modifiable to work on any spatial input.

Currently, it doesn’t handle rotation correctly, so circles get exploded into seemingly random points instead of one or two off-center circles. That might not make much sense before I explain how it works, though.

How it works:

  1. To gain scale invariance, the input is split into different images, ranging from very zoomed in to the entire view. This is the top row of images in the top window.

  2. The input pixel colors are compared to their surroundings: red to red, green to green, blue to blue. Well, they should be. I don’t think it does exactly that, because the output would look like a mostly white edge detector if it did, so I probably added some of the original image back in. This one still needs some work.

  3. The colors are again compared, but this time it’s red to green, blue to yellow, green to red, and yellow to blue.

  4. A set of orientation detectors are applied to the previous set. These activate a pixel color more strongly if there’s a 3x3 stripe of white-black-white in the right orientation. There are three orientation detectors in this case. They seem to lock to red, green, or blue instead of going in between, so that needs some work.

  5. A set of line end detectors are applied to the previous set. These should activate pixel colors more when at the end of an oriented line. They don’t seem to do it quite right, but I still get a sparse, stable set of points. Then, to get individual points, I use max pooling and select only the pixel that’s equal to the max pooling value.

  6. In the second window, where it looks like an almost random cloud of points, I use the differences in positions of each activated pixel to set the location:

    for a in pixels:
        for b in pixels:
            c = new_pixel()
            c.position = b.position - a.position + center_position

    The output is bigger because pixels can be to the left, right, above, or below each other, which can produce negative offsets, so I resize it.
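The point-selection trick at the end of step 5 can be sketched in plain NumPy (the real code uses TensorFlow’s pooling ops; `local_max_points` is just an illustrative name):

```python
import numpy as np

def local_max_points(activation, pool=3):
    """Keep only pixels equal to the max of their pool x pool
    neighborhood, zeroing everything else: a sparse set of peaks."""
    h, w = activation.shape
    r = pool // 2
    # pad with -inf so border pixels still have a full neighborhood
    padded = np.pad(activation, r, mode="constant", constant_values=-np.inf)
    pooled = np.empty_like(activation)
    for i in range(h):
        for j in range(w):
            pooled[i, j] = padded[i:i + pool, j:j + pool].max()
    # select only the pixel that's equal to the max pooling value
    return np.where(activation == pooled, activation, 0.0)

a = np.array([[0.1, 0.9, 0.2],
              [0.3, 0.4, 0.8],
              [0.7, 0.0, 0.1]])
print(local_max_points(a))
```

Only pixels that are the maximum of their own neighborhood survive, which is what keeps the output sparse.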
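Step 6’s pairwise loop, written out as runnable NumPy — a sketch, where `pairwise_difference_map` and its fixed output size are my own stand-ins for the real code:

```python
import numpy as np

def pairwise_difference_map(points, out_size):
    """For every ordered pair of active points, light up the pixel at
    (b - a + center). The same arrangement of points produces the same
    pattern no matter where it sits in the frame."""
    out = np.zeros((out_size, out_size))
    center = out_size // 2
    for a in points:
        for b in points:
            y = b[0] - a[0] + center
            x = b[1] - a[1] + center
            if 0 <= y < out_size and 0 <= x < out_size:
                out[y, x] += 1
    return out

# two translated copies of the same two-point pattern give the same map
m1 = pairwise_difference_map([(2, 2), (2, 4)], out_size=11)
m2 = pairwise_difference_map([(7, 5), (7, 7)], out_size=11)
print(np.array_equal(m1, m2))   # True
```

Because only relative offsets matter, translating the whole pattern leaves the map unchanged, which is where the spatial invariance comes from.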

It does result in a spatially invariant, sparse representation of the input, but adding in orientation would make it much better. Orientation still seems to lock to whichever vectors I input, though, so I need to find a way to allow more in-between orientations before I can use it. I could definitely use some help there.
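One possible direction for the in-between orientations, sketched here as an assumption rather than anything the repo does: generate the stripe kernel from a continuous angle instead of hand-coding a few fixed ones.

```python
import numpy as np

def oriented_stripe_kernel(theta, size=3):
    """Build a size x size kernel that responds to a bright stripe at
    angle theta: weight falls off with perpendicular distance from the
    line through the center, and the kernel is zero-mean so flat
    regions produce no response."""
    r = size // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    # perpendicular distance from each cell to the line at angle theta
    dist = np.abs(xs * np.sin(theta) - ys * np.cos(theta))
    k = np.exp(-dist ** 2)   # close to the line -> high weight
    return k - k.mean()      # zero-mean

k0 = oriented_stripe_kernel(0.0)          # horizontal stripe
k45 = oriented_stripe_kernel(np.pi / 4)   # a genuinely in-between angle
```

Since `theta` is continuous, kernels can be generated at any orientation, rather than being limited to the three hand-coded detectors.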

Anyway, the code is available at

I’ll have a lot more free time a little into December, so please tell me anything you’d need to use this as a library. I’ll probably only be able to finish one or two of these before I have to work on other stuff, but here’s what I’m thinking so far:

  • A callback with direct translation to NuPIC SDRs (1)
  • Inputting individual images (1)
  • Translating a different aspect of vision to SDR, like the difference between the current and previous frame
  • Easy size selection
  • Improved orientation detection
  • Limiting the pixels chosen for the space-invariance loop by distance and activation strength
  • Rudimentary real-time sound-to-SDR conversion, with a streaming audio input library like sounddevice
  • Creating GPU-optimized sparse tensor ops and starting to optimize the spatial/temporal poolers

It’s not installable via pip or conda yet, but you can pull and run it after installing tensorflow, opencv, and my cvpubsubs library.


Is there a way I could just feed the algorithm a PNG or JPEG image and get an SDR as output, in a form NuPIC is comfortable using?


That’s the “A callback with direct translation to NuPIC SDRs” item on the todo list. Adding PNG or JPEG conversion would be another item.

Currently, it takes in a camera feed, so you could just send one image through instead. The output is a sparse tensor translated into a dense one for display, so you could keep it in sparse form and translate the sparse tensor into NuPIC’s sparse format, which I think is fairly close.
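The translation I have in mind is roughly the following sketch, assuming NuPIC ultimately wants a flat list of active bit indices (worth checking against NuPIC’s own SDR docs; `dense_to_sdr_indices` is a hypothetical helper name):

```python
import numpy as np

def dense_to_sdr_indices(dense, threshold=0.0):
    """Flatten a dense activation map and return the indices of the
    active bits, which is roughly the sparse form NuPIC works with."""
    flat = np.asarray(dense).ravel()
    return np.flatnonzero(flat > threshold)

dense = np.array([[0.0, 0.9, 0.0],
                  [0.0, 0.0, 0.7]])
print(dense_to_sdr_indices(dense))   # indices into the flattened array
```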

Still, that’s probably easier for me to do, so I’ll bump the priority of those on my todo list.



Looks a little bit laggy, but still pretty cool.

Also, have you thought about just using an edge-detecting convolution kernel? Python and Java both likely have very optimized libraries for that, and essentially that is what you are trying to do. It reduces to a matrix multiplication problem as well, and I know for a fact CUDA has optimized code for those.

From my understanding, the only real requirements for SDRs are that like inputs need to have like encodings and that they are sparse. So if you just use a convolution kernel, the output will be sparse and similar transitional states will be close in relation to one another. I don’t think you actually need to do much for images/visuals to adhere to the sparse and distributed ideas. There are audio convolution kernels as well, so it would probably work with that too.

I see what you are trying to do with the scale invariance, but you should try to avoid generating new samples of the same thing to make it work. The current machine learning industry does that already because the models are not very robust against variation and noise. So instead of fixing the problem, in machine learning you just add a bunch more data until the problem fixes itself — a very statistical approach to solving the problem.

What you could do is take a trick from the machine learning industry that helped it leaps and bounds. Max pooling not only helps capture prominent features and discards low values (which you might want to explicitly avoid, the low-value discarding part), but also helps center those prominent features. Maybe you can figure out a small “max pool” kernel that follows that inspiration. That’s completely theoretical and experimental though, so it might not work at all.



I actually am using an edge detector, in step 2. The only difference is I’m adding back in the original image at half brightness. I could use the optimized libraries, but I think even with a full frame, pure edge detection still worked at around 60 fps, so it really didn’t need any optimization no matter what algorithm I used. That might’ve been with the GPU, though.
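Step 2 as described — an edge-detection kernel with the original blended back at half brightness — can be sketched in plain NumPy; `cv2.filter2D` and `cv2.addWeighted` would do the same thing faster. The Laplacian kernel and the 0.5 weight here are my guesses at the details, not what the repo necessarily uses:

```python
import numpy as np

# a standard 4-neighbor Laplacian edge-detection kernel
LAPLACIAN = np.array([[ 0, -1,  0],
                      [-1,  4, -1],
                      [ 0, -1,  0]], dtype=float)

def convolve2d(img, kernel):
    """Naive same-size 2D convolution with zero padding."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

def edges_plus_half_original(img):
    """Edge response plus the original image at half brightness."""
    return convolve2d(img, LAPLACIAN) + 0.5 * img

img = np.zeros((5, 5))
img[:, 2] = 1.0   # a vertical bright line
print(edges_plus_half_original(img))
```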

Yup. It’s the “like inputs need to have like encodings” part that makes visual input hard. If you’re viewing a set of objects, you want to have an input that’s the same no matter where the objects are located, for recognizing the objects, as well as an input that’s different, for recognizing the relation between the objects’ positions. The last step should take care of that, but only just.

These are n-dimensional algorithms, so I should be able to apply them to audio input too, with a bit of modification.

The images aren’t quite the same though. I tried to mimic human vision, where the center part of the image is very clear, but the outer part is very blurry. This might lose some scale invariance in the outer regions, but I’m not actually sure humans have scale invariance in a lot of cases.
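The clear-center/blurry-periphery idea can be sketched by blending the image with a degraded copy, weighted by distance from the center. This is only an illustration: the mean-value “blur” stand-in, the radius, and the `foveate` name are placeholders, not what the code does:

```python
import numpy as np

def foveate(img, sharp_radius=0.3):
    """Keep the center of the image sharp and degrade the periphery,
    loosely like foveal vision. The 'blur' here is just the image mean,
    a crude stand-in for a real Gaussian blur."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2, (w - 1) / 2
    # normalized distance from the center: 0 at center, max at corners
    dist = np.hypot((ys - cy) / h, (xs - cx) / w)
    blurry_weight = np.clip(dist / max(dist.max(), 1e-9), 0, 1)
    blurry_weight[dist < sharp_radius * dist.max()] = 0.0  # sharp fovea
    blurred = np.full_like(img, img.mean())
    return (1 - blurry_weight) * img + blurry_weight * blurred
```

A real version would substitute an actual blur (or the pyramid levels from step 1) for the mean-value placeholder.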

I am using max pooling. See step 5, or ctrl-f “max pooling”. It was really helpful, and I’m not sure I would’ve been able to get sparse output without it.


Hey Sim,

Was looking at the code, and it looks like you are doing a lot just to get the results of TensorFlow’s convolution filters. All of what you are doing can be done with a few arrays, a couple of extra functions, and of course your hand-coded edge detectors/filters. Was there a reason you relied on tensor variables and some of TensorFlow’s functionality?

As a side note, I do like how you thought multiple types of filters were important. There is some evidence that suggests there are permanent patterns in the eye, and I think they are important for organizing visual data.


Yup! I wanted to be able to modify any code I wrote as much as possible. I ended up going through several iterations for the orientation tensor and the line-end tensor, so it proved useful. Also, having the instant GPU optimization option is nice. It looks like I’ll have to leave TensorFlow behind for a bit now, though.

Thanks! I actually tried to make some things as one layer, but they just didn’t work until they were split. I decided to split up the code on those for the most part, since it helped keep things organized. Also, a lot of the layers are based on biology, so they’re separated as similarly to human neuron systems as I could think to make them.


Right on, well keep up the hacking, looks fun.

I’d like to caution against using TensorFlow as your default GPU optimization. Sure, it’s arguably the fastest framework on the market right now, but it’s still really bloated. If you cut out the bloat and just focus on the math, you can build a tit-for-tat network, equal in all respects, and run it 2-3 times faster on a single GPU.

But if it was just for sandboxing some ideas, I suppose there is no harm in it. Personally, I hate doing work all over again, so I just keep experiments as lightweight as possible.

Anyways, good luck, man! Catch you later.