Sparse dataset (input sparse matrix in neural model)



Hi dear colleagues!

I am currently involved in a project where we are working with a neural model.
The point is that we are starting to train our model on a sparse input (a sparse matrix), and I cannot find the best way to preprocess the dataset. I am not even sure whether it is better to use the sparse input directly or a pre-processed version of it.

On the one hand, one dataset's sparsity is about 80% to 90%; on the other hand, another dataset's sparsity is around 20%. The datasets have the same origin but were collected under different environmental conditions.
So far, I have trained my model using only the columns with nonzero values, but I realize that doing so wastes information.

Here is where my question comes in: how could I train my neural network with a sparse input, and what methods could I implement to improve the performance of my NN with this kind of input?

Many thanks,


Are you using NuPIC as your “neural model”? What does your data represent? When you say that your input is already sparse, do you mean it is already in binary format? 80% is NOT sparse.


Thanks for the reply. I have built my NN model in TensorFlow core. When I say the sparsity ranges from 80% to 90%, I mean that in a vector of 10 elements, 8 or 9 are zero values.

[0,1,0,0,0,0,1,0,0,0] => 80% of the values are zero. I hope that's clear enough.
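In code, that measure of sparsity is just the fraction of zero entries. A minimal sketch in Python/NumPy, using the example vector from above (the variable names are my own, not from the thread):

```python
import numpy as np

# The example vector from the post: 8 of its 10 entries are zero.
x = np.array([0, 1, 0, 0, 0, 0, 1, 0, 0, 0])

# Sparsity = fraction of entries that are exactly zero.
sparsity = np.count_nonzero(x == 0) / x.size
print(sparsity)  # 0.8, i.e. 80% zeros
```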


So you are not using NuPIC, and not using HTM. I think you are in the wrong forum?


I am pretty sure this topic cuts across any language or library. So far, you have only asked me how I built my model and what 80% sparsity means. :confused:


You are free to discuss it here. But I moved it from the #nupic forum into #other-topics:community-lounge.


From your description, it sounds like when you say sparse input, you are referring to missing datapoints (you mentioned “columns with nonzero values”, which sounds like input rows that contained multiple fields). Is that correct?

We would probably need some further details about the nature of the data. How many total fields are being measured? Is there a regular interval between each measurement (taken every second, for example)? Are there gaps between inputs?

Also, what kind of problem are you trying to solve? Classification? Prediction? Anomaly detection?


I suppose that if you are using ML NNs rather than memory-prediction, you will have compatibility and accuracy issues with such varying input sparsities (20% to 80%); 20% sparsity, as you used the term, is not suitable for HTMs. You might have to do some input conditioning before the data passes on to the network, to make it more reasonable for the network to work on.
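One simple form of conditioning, without throwing away the zero columns, is to keep the whole dataset sparse in memory and densify only one minibatch at a time before handing it to the network. A minimal sketch in Python, assuming the data lives in a SciPy CSR matrix; the random stand-in matrix and the `dense_batches` helper are purely illustrative, not anything from the thread:

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the dataset in the thread: ~80% zeros.
rng = np.random.default_rng(0)
dense = rng.random((1000, 64)) * (rng.random((1000, 64)) > 0.8)
X = sparse.csr_matrix(dense)

def dense_batches(X, batch_size):
    """Yield dense minibatches from a SciPy sparse matrix.

    Storing the full matrix sparse and densifying one batch at a
    time keeps memory use low while still giving a dense network
    the full input, zeros included.
    """
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size].toarray()

batches = list(dense_batches(X, 128))
```

Each yielded batch is a plain dense NumPy array (here 128×64, with a smaller final batch), so it can be fed to an ordinary TensorFlow training loop unchanged.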

The preprocessing you are after might be the Spatial Pooler.