I don't get it. Couldn't you just encode the original and delta sequences as two separate scalar inputs? That way the input would carry both the original and the derived information.
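To make the suggestion concrete, here's a minimal sketch (my own hypothetical helper, using NumPy, not any particular framework) of feeding a sequence alongside its first-order deltas as two parallel channels:

```python
import numpy as np

def stack_original_and_delta(seq):
    """Stack a 1-D sequence with its first-order deltas as two channels.

    The delta channel is np.diff padded with a leading zero (via
    prepend) so both channels have the same length. Hypothetical
    illustration only.
    """
    seq = np.asarray(seq, dtype=float)
    delta = np.diff(seq, prepend=seq[0])  # delta[0] is 0 by construction
    return np.stack([seq, delta])         # shape: (2, len(seq))

x = stack_original_and_delta([1, 3, 6, 10])
# x[0] is the original sequence, x[1] its deltas: [0, 2, 3, 4]
```

The model then sees both representations at once and doesn't have to discover the delta transformation on its own.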
Oh, are you saying the machine should somehow be able to figure out on its own that the data sets share that underlying similarity once transformed? If so, I don't believe that's possible without designing it specifically to do that.
My reasoning for that is this: There is a lot of data in videos that we don't perceive [example] that could be useful in some cases. There is also a lot of data that we do perceive, sometimes whether we want to or not, that isn't actually in the videos at all [example]. This implies our perception is skewed in some ways but not others, likely because that skew helps us figure out and react to our environment in ways that improve our odds of survival. [I might as well link this guy while I'm at it.] So, our brains are specifically built to notice patterns in some transformed versions of our input, but not in others.
If we're able to see some transformations but not others, then I think any machines we design would have to be built to see the transformations we want them to see. (Though that doesn't mean a machine can't modify itself to see more transformations if it ever notices them, much like we modify ourselves with glasses, 3D glasses, etc.)