In DropConnect, a random subset of weights is set to zero at each training step. At inference, all weights are used, so it is still a dense model: every connection has a weight. A common interpretation of dropout-style techniques (though not the only one) is that they let you train many different models within a single network, so you are effectively learning an ensemble of smaller networks that share some parameters.
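To make the train/inference difference concrete, here is a minimal sketch of a DropConnect-style linear layer in plain Python. The names `dropconnect_forward` and `drop_prob` are illustrative, not from any library, and rescaling the kept weights by `1 / (1 - drop_prob)` during training is an assumption I made so that inference can use the raw dense weights; the original method handles inference differently.

```python
import random


def dropconnect_forward(weights, x, drop_prob=0.5, training=True, rng=None):
    """Compute y = W x, masking individual weights while training.

    weights: list of rows, each a list of floats.
    x: list of floats, same length as each row.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = rng or random.Random()
    # Assumption: rescale kept weights so the expected output matches the dense one.
    scale = 1.0 / (1.0 - drop_prob)
    out = []
    for row in weights:
        total = 0.0
        for w, xi in zip(row, x):
            if training:
                # A fresh random mask is drawn per weight on every call.
                if rng.random() < drop_prob:
                    continue  # this connection is dropped for this step
                total += scale * w * xi
            else:
                # Inference: dense, every connection contributes.
                total += w * xi
        out.append(total)
    return out


W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
print(dropconnect_forward(W, x, training=False))  # dense output
print(dropconnect_forward(W, x, training=True, rng=random.Random(0)))  # masked output
```

Each training call sees a different sparse sub-network, which is where the ensemble interpretation comes from, while the inference call always uses the full weight matrix.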
In the paper you cited, the weights are sparse both at initialization and at inference. More importantly, what leads to robustness is not sparse weights alone, but the combination of sparse weights and sparse activations (k-winners with boosting).
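For intuition about the sparse activation part, here is a rough sketch of a k-winners step in plain Python: only the `k` largest activations in a layer are kept and the rest are zeroed. The function name `k_winners` and the optional `boost` argument are hypothetical. I am not reproducing the paper's boosting rule here; `boost` is just a per-unit factor that influences which units win, under the assumption that boosting affects the ranking but not the values that are passed on.

```python
def k_winners(activations, k, boost=None):
    """Keep the k highest-scoring activations and zero out the rest.

    activations: list of floats for one layer.
    boost: optional per-unit factors used only for ranking (hypothetical
           stand-in for the paper's boosting, not its actual formula).
    """
    n = len(activations)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and the number of units")
    if boost is None:
        scores = list(activations)
    else:
        scores = [a * b for a, b in zip(activations, boost)]
    # Indices of the k units with the highest scores.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    winners = set(ranked[:k])
    # Winners pass their original (unboosted) activation; all others output 0.
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


acts = [0.1, 0.9, 0.5, 0.3]
print(k_winners(acts, k=2))                         # [0.0, 0.9, 0.5, 0.0]
print(k_winners(acts, k=2, boost=[10, 1, 1, 1]))    # [0.1, 0.9, 0.0, 0.0]
```

Unlike the DropConnect mask, this sparsity is not random and is not switched off at inference: the same top-k rule applies in both training and testing, so the activations stay sparse.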