What should the criteria be for deciding the optimal values of n and w for given time series data? Would the default values of n and w work in most cases?
Have you seen this video with a description of how changing n, w, and resolution affects encodings?
Yes, but I am still not sure how to decide on specific values based on my data.
So what does your data look like? floats, ints? min/max?
Two sets of data, both integers. One ranges between 0 and 10,000 and the second ranges between 0 and 400,000.
Here is the calculation we have used in the past:
So you choose how many buckets you want to represent your data between min and max. For example, if your data ranged from 0 to 9,999 and you chose 5,000 buckets, each bucket would represent two values, so your resolution would be 2. That means adjacent numbers could end up with the same encoding. Increasing the number of buckets lowers the resolution value, so fewer values share a bucket and the encoding becomes finer.
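To make that concrete, here is a minimal sketch of that calculation (the 5,000 bucket count is just the example above, and 130 below is an arbitrary illustrative choice; pick whatever bucket count fits your data):

```python
def resolution_for(min_val, max_val, num_buckets):
    """Resolution is roughly the span of raw values that fall into one bucket."""
    # The small floor just guards against a zero resolution for tiny ranges.
    return max(0.001, (float(max_val) - float(min_val)) / num_buckets)

# The 0 - 9,999 example with 5,000 buckets -> resolution of about 2
print(resolution_for(0, 9999, 5000))       # ~2.0

# The two fields discussed above, with an arbitrary choice of 130 buckets
print(resolution_for(0, 10000, 130))       # ~77
print(resolution_for(0, 400000, 130))      # ~3077
```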
Looking at it this way, you don’t need to specify n and w.
Oh… that’s really useful. But not specifying n and w would mean n = 400 and w = 21… By specifying just resolution, would the same n and w work for both sets of data?
You might want different params for each RDSE encoder. See the constructor docs for RandomDistributedScalarEncoder.
What I meant was: would just changing resolution be enough, without specifying values of n and w for the different sets of data?
Yes, you’ll only need to find a different resolution value for each field, which might depend on your min/max and how granular you want the encoding for each field.
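For example, here is a rough sketch of one encoder per field where only resolution differs (this assumes NuPIC’s RandomDistributedScalarEncoder; the resolution values below are illustrative, and n and w are left at their defaults of 400 and 21):

```python
from nupic.encoders.random_distributed_scalar import RandomDistributedScalarEncoder

# One encoder per field; only resolution is chosen per field,
# n and w keep their default values (400 and 21).
small_field_encoder = RandomDistributedScalarEncoder(resolution=77)    # field in 0 - 10,000
large_field_encoder = RandomDistributedScalarEncoder(resolution=3077)  # field in 0 - 400,000

# Each call returns a 400-bit array with 21 active bits.
print(small_field_encoder.encode(5000))
print(large_field_encoder.encode(200000))
```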
Look at it like this… the higher the resolution, the more values go into each bucket, so there are fewer encodings overall, each representing more values. A high resolution smudges a bunch of values together, while a lower resolution encodes more of the numbers uniquely.
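A rough way to see that trade-off with the bucket arithmetic from earlier (this is only the conceptual bucket index; the real RDSE places buckets around an internal offset and hashes them into the output, so the actual indices will differ):

```python
def bucket_index(value, resolution):
    """Values landing in the same bucket end up with essentially the same encoding."""
    return int(value // resolution)

# resolution = 2: only pairs of neighboring values share a bucket
print(bucket_index(9997, 2), bucket_index(9998, 2), bucket_index(9999, 2))  # 4998 4999 4999

# resolution = 100: a run of 100 consecutive values is smudged into one bucket
print(bucket_index(9900, 100), bucket_index(9999, 100))                     # 99 99
```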