The applicability of the preprocessing CB is largely restricted to topic modeling, since its output is not preprocessed text but the document-term matrix (DTM). Storing the DTM also creates huge files.
Suggestion:
- Restrict the preprocessing CB to general text preprocessing with preprocessed text as output.
- Build the DTM as part of the LDA topic modeling CB. Do not save the DTM and vocab (or, if possible, save them only on demand).
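
A rough sketch of how the split could look, in Python and using only the standard library. All names here (`preprocess_text`, `build_dtm`, `save_dir`) are hypothetical and not taken from the existing CBs; the cleaning steps are placeholders for whatever the preprocessing CB currently does, and the LDA fit itself is left out.

```python
import json
import re
from collections import Counter
from pathlib import Path


def preprocess_text(docs):
    """Preprocessing CB (hypothetical): returns cleaned text, not a DTM."""
    cleaned = []
    for doc in docs:
        # Placeholder cleaning: lowercase, drop punctuation, collapse whitespace.
        text = re.sub(r"[^a-z0-9\s]", " ", doc.lower())
        cleaned.append(" ".join(text.split()))
    return cleaned


def build_dtm(cleaned_docs, save_dir=None):
    """Topic modeling CB (hypothetical): builds the DTM in memory.

    The DTM is a list of sparse rows (term index -> count). The DTM and
    vocab are written to disk only when save_dir is given.
    """
    vocab = sorted({tok for doc in cleaned_docs for tok in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    dtm = []
    for doc in cleaned_docs:
        counts = Counter(doc.split())
        dtm.append({index[term]: n for term, n in counts.items()})

    if save_dir is not None:
        out = Path(save_dir)
        out.mkdir(parents=True, exist_ok=True)
        (out / "vocab.json").write_text(json.dumps(vocab))
        (out / "dtm.json").write_text(json.dumps(dtm))

    return dtm, vocab


if __name__ == "__main__":
    cleaned = preprocess_text(["The cat sat.", "The dog sat, the dog ran!"])
    dtm, vocab = build_dtm(cleaned)  # nothing is written to disk by default
    print(cleaned)
    print(vocab)
```

With this layout the preprocessing output stays reusable for other text CBs, and the large DTM files appear only when a caller explicitly passes `save_dir`.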