There are two aspects to consider. The first is the word encoding. You could go with random SDRs for each word, but IMO you’ll be better off with SDRs that encode the semantics of the word, so that related words share active bits.
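To make the trade-off concrete, here is a minimal sketch of the random option. The SDR size (2048 bits) and the number of active bits (40, roughly 2%) are my assumptions, not anything fixed by HTM, and `random_sdr` / `overlap` are just illustrative helper names. The point is that random SDRs are easy to generate, but any two different words only overlap at chance level, so the encoding tells the downstream model nothing about which words are related.

```python
import random

SDR_SIZE = 2048    # total number of bits (assumed value)
ACTIVE_BITS = 40   # number of active bits, about 2% sparsity (assumed value)

def random_sdr(word, size=SDR_SIZE, active=ACTIVE_BITS):
    """Return a random SDR for `word` as a set of active bit indices.

    The generator is seeded with the word itself, so the same word
    always maps to the same SDR.
    """
    rng = random.Random(word)
    return frozenset(rng.sample(range(size), active))

def overlap(a, b):
    """Number of active bits two SDRs have in common."""
    return len(a & b)

cat = random_sdr("cat")
dog = random_sdr("dog")

print(overlap(cat, cat))  # a word fully overlaps with itself
print(overlap(cat, dog))  # unrelated to meaning: chance-level overlap only
```

With these sizes the expected overlap between two different words is under one bit, which is why "cat" and "dog" look just as unrelated as "cat" and "stocks".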
For word SDRs that encode semantics, I would typically recommend getting them from Cortical IO, but it seems they no longer support their lower-level APIs (which you would need to access the word SDRs), so you might want to start by implementing the semantic folding algorithm yourself. Cortical IO has published a lot of information on how it works, so it should not be too difficult to implement. Semantic folding is patented, though, so whether you can use it depends on your application.
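As a rough illustration of the core idea only (this is a toy sketch, not Cortical IO's implementation): each text snippet is treated as one context and assigned one bit position, and a word's SDR is the set of positions of the contexts it appears in. As I understand the published description, the real algorithm also lays the contexts out on a 2D map so that similar contexts end up near each other, and then sparsifies each fingerprint; both steps are left out here. The corpus, the stopword list, and the function names are all made up for the example.

```python
from collections import defaultdict

# Toy corpus: each snippet is one "context" and gets one bit position.
# Hypothetical data, grouped by topic by hand.
snippets = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the dog fetched the ball",
    "stocks fell as the market closed",
    "the market rallied and stocks rose",
    "investors sold stocks",
]

STOPWORDS = {"the", "as", "and"}

def build_fingerprints(snippets):
    """Map each word to the set of context positions (bits) it occurs in."""
    fingerprints = defaultdict(set)
    for position, text in enumerate(snippets):
        for word in text.split():
            if word not in STOPWORDS:
                fingerprints[word].add(position)
    return {word: frozenset(bits) for word, bits in fingerprints.items()}

def overlap(a, b):
    """Number of active bits two fingerprints have in common."""
    return len(a & b)

fp = build_fingerprints(snippets)

print(sorted(fp["cat"]))                    # [0, 1]
print(overlap(fp["cat"], fp["dog"]))        # 1  (they share a context)
print(overlap(fp["stocks"], fp["market"]))  # 2
print(overlap(fp["cat"], fp["stocks"]))     # 0  (no shared context)
```

Words that occur in the same contexts share bits, which is the property you want the sequence model to exploit. A usable version would need a much larger corpus and the layout and sparsification steps mentioned above.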
The second thing you’ll need is the speech generation component. This is where I assume you are considering applying HTM. There are a couple of projects here on the forum that have explored speech generation, but AFAIK none with great results. This one, for example.
I believe the lack of hierarchy is probably part of the reason for the less-than-satisfactory results. I also think adding an object/output layer from the Thousand Brains Theory would help. Either or both of these might be worth exploring if you are looking for something that isn’t “out of the box” and want to contribute to HTM theory.