SDR of chemicals and proteins

I want to be able to input some chemicals into HTMs and have them predict the reaction, but it turns out encoding chemicals is actually really hard.

The thing is, most chemicals, especially proteins, are naturally encoded as graphs. The problem is that the nodes (atoms) have no inherent ordering or fixed positions, and different orderings/positions produce representations that look completely different even though they describe the same molecule. Something as simple as water can have three representations depending on where the oxygen atom is placed in the list of atoms.
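To make the ordering problem concrete, here’s a minimal sketch (plain adjacency matrices, nothing HTM-specific): the same water molecule produces different raw representations depending on how the atom list happens to be ordered.

```python
import numpy as np

# Water (H2O) as an adjacency matrix: a 1 marks a bond between two atoms.
# Ordering 1: [O, H, H] -- oxygen first, bonded to both hydrogens.
atoms_a = ["O", "H", "H"]
adj_a = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 0, 0]])

# Ordering 2: [H, O, H] -- the same molecule, oxygen second in the list.
atoms_b = ["H", "O", "H"]
adj_b = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])

# The raw representations differ even though the molecule is identical.
print(np.array_equal(adj_a, adj_b))  # False
```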

One thing I thought of was actually representing the chemicals using 3D ‘pictures’, because atoms in a chemical are typically close to each other. However, the orientation can change, and some complex chemicals can be flexible and twist around a little without changing how they work.

The main problem with representing a chemical in 3D, though, is that the outer parts of the chemical matter most for things like bonding, and a water molecule would occupy only a tiny region of the space a much larger protein molecule fills, which violates “Semantically similar data should result in SDRs with overlapping active bits”.

One thing I could do is list the outermost, most reactive chemicals/atoms first, say, on the top left side of a cube. Then I would list the connected chemicals in order of reactivity/closeness to the surface. There’s still the problem of chemicals with nearly the same reactivity swapping positions, but perhaps I could solve that by encoding the electric field in addition to the chemicals/atoms.
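A rough sketch of that surface-first ordering, assuming “closeness to surface” can be approximated by distance from the molecule’s centroid (the atoms and coordinates below are made up for illustration):

```python
import numpy as np

# Hypothetical sketch: order atoms by distance from the molecule's centroid,
# farthest (most "surface-exposed") first, so outer atoms lead the encoding.
# Ties between equally distant atoms could later be broken by a field value.
atoms = ["O", "H", "H", "C"]
coords = np.array([[0.0, 0.0, 0.0],
                   [0.9, 0.0, 0.3],
                   [-0.9, 0.0, 0.3],
                   [0.0, 1.4, 0.0]])

centroid = coords.mean(axis=0)
dist = np.linalg.norm(coords - centroid, axis=1)
order = np.argsort(-dist, kind="stable")   # descending: outermost first
ordered_atoms = [atoms[i] for i in order]
print(ordered_atoms)  # → ['C', 'H', 'H', 'O']
```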

Now, if I wanted to predict something else, like the structural integrity of a large chemical or protein, that would likely need a completely different representation.

Alternatively, I could just put the proteins/chemicals or their pieces in as 3D images and have HTMs or some other neural network classify the images. I still think that would create a problem with semantic similarity though.

Sorry, a lot of that may have been me thinking out loud. I really would like some help with ideas on creating a good SDR for chemicals/proteins, because it’s a lot harder than I thought it would be. I don’t know how a water molecule could be represented with similar sparsity to a massive protein, or even a slightly larger molecule than water, or if trying to do so would actually be a good idea.


I think you should try to uncover some tactics for chemical encodings that you can apply in different situations. Doing this across the entire input space of chemicals might be too hard.

Here are some general suggestions…

  1. While investigating how to encode complex things like chemicals, start with a smaller subset of the whole input space. In this example, choose only a few very common atoms that can be put together in different ways to create a bunch of molecules.
  2. Give up on the graphic idea. There certainly must be a better way.
  3. The better you understand the input space, the better encoder you’ll create.
  4. For each instance of the thing you need to encode, ask yourself what properties of it are important to represent to the system. Then think about a programmatic way to extract each of those properties.
  5. Take advantage of existing encoders. Just like the DateEncoder, you might find that a combination of ScalarEncoders, each focused on a different semantic meaning, works fine.
  6. Remember the encoding does not have to be sparse, it just needs to contain the meaning. The spatial pooling algorithm will give it a consistent sparsity.
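As a sketch of suggestion 5, here’s what combining per-property scalar encodings might look like. The scalar encoder below is a hand-rolled stand-in, not NuPIC’s actual ScalarEncoder, and the property choices (atomic number, electronegativity, valence electrons) are purely illustrative:

```python
import numpy as np

def scalar_encode(value, vmin, vmax, n=40, w=5):
    """Minimal scalar-encoder sketch: a run of w active bits whose
    position reflects where value falls in [vmin, vmax]."""
    bits = np.zeros(n, dtype=np.uint8)
    frac = (min(max(value, vmin), vmax) - vmin) / (vmax - vmin)
    start = int(round(frac * (n - w)))
    bits[start:start + w] = 1
    return bits

# Concatenate one scalar encoding per chemically meaningful property,
# analogous to how the DateEncoder combines several sub-encoders.
def encode_atom(atomic_number, electronegativity, valence_electrons):
    return np.concatenate([
        scalar_encode(atomic_number, 1, 118),
        scalar_encode(electronegativity, 0.7, 4.0),
        scalar_encode(valence_electrons, 1, 8),
    ])

oxygen = encode_atom(8, 3.44, 6)
print(oxygen.size, int(oxygen.sum()))  # 120 15
```

Note the result is not sparse by design; per suggestion 6, the spatial pooler takes care of enforcing a consistent sparsity.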

I don’t know much about chemistry, but it might help for you to list out things within the chemicals that are important to represent in the encoding.


Thanks for the advice!

I thought about it a lot more, looking into Daylight’s SMILES chemical notation and thinking about other chemical analysis software like Foldit. The thing is, especially once you get to larger molecules like proteins, chemistry is spatial in nature. A protein composed of exactly the same elements, with exactly the same chemical formula, can act completely differently when folded differently. Misfolded proteins of this kind are called prions, and they cause diseases like mad cow disease.

I really think it is better to feed nupic the spatial representation of proteins or chemicals. There are problems with recognizing large molecules and with rotation/translation, but I think I know of good solutions for those. For rotation/translation, there are two options: the first is representing parts of chemicals as one atom connected to its nearby atoms, and the second is using physics software to rotate/translate the molecule until nupic recognizes it. The problem of recognizing large proteins should be solved by the hierarchical part of nupic, by recognizing smaller, then larger, parts of the protein, but I’d still need to figure out how to let nupic interact with physics software, and how to determine whether it recognizes something well enough.

I wonder whether I should use a system with set dictionary names for the elements, attached elements as ordered lists, and numbers representing each atom’s location relative to the central one; a system with elements and relative 3D spatial coordinates; or just an actual 3D picture with all the electron clouds.

I think the set of elements, connection types, and relative 3D spatial coordinates should work best, actually. That one keeps all the relevant data. And if I need to ‘rotate’/‘translate’ it, that just means moving the coordinates a little or swapping the order of elements…
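A sketch of that representation, with made-up approximate coordinates for water; a rotation is just a matrix applied to the coordinate block, leaving the elements and bonds untouched:

```python
import numpy as np

# Sketch of the proposed representation: element symbols, bond list,
# and 3D coordinates relative to a chosen central atom.
water = {
    "elements": ["O", "H", "H"],
    "bonds": [(0, 1, "single"), (0, 2, "single")],
    "coords": np.array([[0.0, 0.0, 0.0],      # O at the origin
                        [0.757, 0.586, 0.0],  # rough H positions
                        [-0.757, 0.586, 0.0]]),
}

# "Rotating" the molecule is a rotation matrix applied to the coordinates.
theta = np.pi / 4
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0],
                  [np.sin(theta),  np.cos(theta), 0],
                  [0, 0, 1]])
rotated = water["coords"] @ rot_z.T

# Pairwise distances are preserved, so the bond geometry stays intact.
print(np.allclose(np.linalg.norm(rotated[1] - rotated[0]),
                  np.linalg.norm(water["coords"][1] - water["coords"][0])))
```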

Another thing used for simplifying 3D data is octrees. I could set up the hierarchical system using octrees, give the coordinates or bounds of each cube in the octree, and use that to represent the spatial information while limiting the neighboring nodes to six. For choosing which groups of atoms to treat as molecules in the octree, the most recognized grouping would be picked… I believe that would take care of the spatial problem.
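A toy version of that octree subdivision might look like this. The splitting rule and depth limit are arbitrary illustrative choices, and boundary points may land in more than one octant in this simplistic version:

```python
# Toy octree sketch: recursively split a cube into eight octants until
# each leaf holds at most one atom. Depth then mirrors spatial detail,
# which gives the hierarchy described above.
def build_octree(points, center, half, max_depth=5):
    if len(points) <= 1 or max_depth == 0:
        return {"center": center, "half": half, "points": points}
    children = []
    for dx in (-0.5, 0.5):
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                c = (center[0] + dx * half,
                     center[1] + dy * half,
                     center[2] + dz * half)
                inside = [p for p in points
                          if all(abs(p[i] - c[i]) <= half / 2 for i in range(3))]
                if inside:
                    children.append(build_octree(inside, c, half / 2,
                                                 max_depth - 1))
    return {"center": center, "half": half, "children": children}

# Two nearby atoms and one far away: only occupied octants are kept,
# and the close pair ends up deeper in the tree.
tree = build_octree([(0.1, 0.1, 0.1), (0.2, 0.1, 0.1), (3.0, 3.0, 3.0)],
                    center=(2.0, 2.0, 2.0), half=2.0)
print("children" in tree, len(tree["children"]))
```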

That is a lot more programming than I thought it would be though. It would be easier if I trained nupic to recognize small molecules first… but recognizing smaller problems of one type doesn’t necessarily mean larger problems of the same type are recognized. I would need octrees eventually…

Well, I might try training nupic on chemical reactions with molecules that can be represented by a star graph first, and see if I can go from there. I feel a little lost about the ‘more recognizable’ part, the feedback part, and the hierarchical representation part though.


This is an interesting topic. I wonder if it would be worth the work to figure out how to encode every element in the periodic table into a small bit array representing all the most important properties of each element in the same input space. This might be the first step towards encoding molecules into bit arrays.
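One toy way to get overlapping bit arrays for related elements would be to derive the bits from periodic-table position, so same-group elements share active bits. The field sizes and the group/period choice here are illustrative assumptions, not a worked-out encoder:

```python
import numpy as np

# Toy sketch: derive each element's bit array from its periodic-table
# position, so elements in the same group (column) share active bits.
BITS = 4              # bits per group/period value (arbitrary choice)
N_GROUP = 18 * BITS   # groups 1-18
N_PERIOD = 7 * BITS   # periods 1-7

def element_sdr(group, period):
    bits = np.zeros(N_GROUP + N_PERIOD, dtype=np.uint8)
    g = (group - 1) * BITS
    p = N_GROUP + (period - 1) * BITS
    bits[g:g + BITS] = 1
    bits[p:p + BITS] = 1
    return bits

sodium = element_sdr(group=1, period=3)
potassium = element_sdr(group=1, period=4)   # same group as sodium
chlorine = element_sdr(group=17, period=3)   # same period as sodium
neon = element_sdr(group=18, period=2)       # shares neither

# Overlap reflects shared chemistry-relevant structure.
print(int((sodium & potassium).sum()),
      int((sodium & chlorine).sum()),
      int((sodium & neon).sum()))  # 4 4 0
```

Real atomic properties (electronegativity, radius, valence) could replace or join the group/period fields; the point is just that chemically similar elements should share bits.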

I still think that giving NuPIC an image will not work well. I would try to crack the periodic table first.


This is indeed an interesting topic. I have seen projects that try to learn vector representations of chemical names using recurrent neural networks (e.g., LSTMs), based on known chemical reactions from databases such as chemnet. It is claimed that the learned feature vectors reflect chemical fingerprints. I think we could do something similar with Cortical.io’s fingerprint technique, combined with a good database.

Bringing this thread back: I’ve got thoughts on 1) fingerprint association-embedding, 2) adjacency matrices & other sequential feeding, and 3) an evolutionary question I can’t figure out.

I’ve been thinking more on chemical graph networks lately - the computation scaling is something HTM would work wonders for if we could figure out a good encoding.
It’s definitely tricky, since graphs are more… arbitrary, I guess, than most datatypes. A scalar has a definite place in the hierarchy between minimum and maximum, and images can be… sort of figured out with topology, sometimes. But a molecule graph could have three atoms or three thousand (can molecules have three thousand atoms? large proteins certainly do) and any number of connections, yet both are considered one data-point for input. 571,346 is larger than 5, but not more complex, dimensionally speaking.

Still looking into SMILES embeddings and whatever Spektral is using for its magic, but here are the three points:

  1. Fingerprint folding
    Convert individual atoms to fingerprints like @ycui mentioned Cortical did with words. @Paul_Lamb informed me that semantic unfolding is actually quite doable, in a sense of “deriving encodings based on co-occurrence in sentences/paragraphs”, so encoding atoms based on co-occurrence in molecules (from a chemical database, rather than wikipedia for text) seems rather canny.

  2. Adjacency matrices, sequential feeding
    First, I thought: Why not just use these SDR-fingerprints and feed in atoms one at a time, perhaps using some bits of the encoding to signal “connected to previous and/or next atom”? tm.reset() to reset the sequence and tell the HTM “End Of Chemical”. Of course there are no “free floating/disconnected subgraphs” in a molecule, the way there can be in some abstract graphs (social media networks etc.), but it’s a fun thought.

  3. The brain can’t “perceive” an entire graph at once. I think.
    I thought about “how the brain understands a graph/molecule” as opposed to how the brain understands the eyes looking at a cat, or memorizing some word associations. I can’t help but feel like we’re not really… set up for graphs, in terms of evolutionary mental circuitry. So maybe trying to convert one graph to one SDR isn’t the easiest way forward.
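Point 1 above could be sketched like this: build each atom’s “fingerprint” from the set of molecules it occurs in, so atoms appearing in similar chemical contexts get overlapping bits. The tiny molecule list is a stand-in for a real chemical database:

```python
from collections import defaultdict

# Sketch of co-occurrence fingerprints: each molecule an atom type
# appears in contributes one active bit to that atom's fingerprint,
# the way word fingerprints come from co-occurrence in text.
molecules = [
    ["H", "H", "O"],             # water
    ["C", "H", "H", "H", "H"],   # methane
    ["C", "O", "O"],             # carbon dioxide
    ["N", "H", "H", "H"],        # ammonia
]

fingerprint = defaultdict(set)
for mol_id, mol in enumerate(molecules):
    for atom in set(mol):
        fingerprint[atom].add(mol_id)  # bit mol_id is active for this atom

# Atoms that co-occur in the same molecules get overlapping fingerprints:
# H and O share only the water bit here.
print(sorted(fingerprint["H"] & fingerprint["O"]))  # [0]
```

With a real database the bit space would need folding down to a fixed width, but the co-occurrence principle is the same.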

But then, there are examples of graph structures that were evolutionarily advantageous to understand: several villages linked by roads and trade routes, even a mental ‘family tree’ or some other social hierarchy.

Apes growing into early humans were social, tribal - even if you don’t draw it out, a social network is a graph, and since communication and cooperation are what enabled Homo sapiens to rise so far, I reckon a lot of our brain is at work on social tasks.
So what part of our brain lets us understand complex social relationships? How do we encode our “graph” social network/hierarchy? My feeling is that we encode it piece by piece.

Consider your own social network as an example. Spend ~15 seconds trying to visualize “your entire social circle”, then think of one specific person. Who are they connected to? You can probably “focus” on that person and a few other people (who you know they closely associate with), that may or may not also be your friends.
Now think of someone “on the opposite end” of your social circle - perhaps you met them in a different country, or time in your life, and they probably don’t know each other. Or maybe your circle is pretty localized and they maybe know the first person, but you’re not sure.

You could do your best to mentally ‘trace’ from person to person, like a graph running an A->B pathfinding algorithm. But you’re thinking of one or a couple people at a time, and each of them is stored as a different neuron-wiring-cluster in your brain, each friend with their own associated neuronal pathways that inadvertently trigger when you think of them.

So it seems to me that your brain stores this “graph” by effectively building a connected series of nodes and edges in your brain out of neurons and synapses. If you try to think of “my whole extended social circle”, it’s sort of hard to visualize that graph with ~20+ people at once, so you zoom in and explore it piecewise.
Thus I can’t help but feel like our HTM-based architecture can’t “focus” on an entire graph the way we can focus on reading one specific word or seeing a face (even though the visual recognition task is quite multifarious itself). The brain doesn’t encode or record graphs the way it encodes “I know this image to be a platypus”; it instead builds the graph on the fly (continuous learning!) as a “meta-knowledge-structure”.

Does this sound at all sensible? I realize that I’m simplifying a great deal - your neuron “map” that lights up when you see a certain person’s face isn’t necessarily a tight “cluster” like a neat node in a digital graph, and most knowledge is encoded in a similarly distributed fashion.