restrict preprocessing to text normalization and move document term matrix to lda CB #4

@pselzner

Description

The preprocessing CB is currently useful almost only for topic modeling, because its output is the document-term matrix (DTM) rather than preprocessed text. Storing the DTM also creates huge files.

Suggestion:

  • Restrict the preprocessing CB to general text preprocessing, with preprocessed text as its output.
  • Build the DTM inside the LDA topic modeling CB. Do not save the DTM and vocab by default (or, if possible, save them only on demand).
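
The split suggested above could look roughly like the following sketch. It is not the project's code: `preprocess`, `build_dtm`, and the `save_dir` parameter are hypothetical names, the normalization rule (lowercase, letters only) is only an assumption, and the DTM is kept as a simple sparse structure using the Python standard library.

```python
import json
import re
from collections import Counter
from pathlib import Path


def preprocess(texts):
    """Preprocessing CB (hypothetical): return normalized text, not a DTM."""
    normalized = []
    for text in texts:
        # Assumed normalization: lowercase and keep runs of ASCII letters only.
        tokens = re.findall(r"[a-z]+", text.lower())
        normalized.append(" ".join(tokens))
    return normalized


def build_dtm(normalized_texts, save_dir=None):
    """LDA CB step (hypothetical): build the DTM in memory from preprocessed text.

    The DTM is stored sparsely as one {term_index: count} dict per document.
    Nothing is written to disk unless save_dir is given (save on demand).
    """
    vocab = sorted({tok for doc in normalized_texts for tok in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    dtm = [
        dict(Counter(index[tok] for tok in doc.split()))
        for doc in normalized_texts
    ]
    if save_dir is not None:
        out = Path(save_dir)
        out.mkdir(parents=True, exist_ok=True)
        (out / "vocab.json").write_text(json.dumps(vocab))
        (out / "dtm.json").write_text(json.dumps(dtm))
    return dtm, vocab


# The preprocessing output stays reusable for tasks other than topic modeling.
docs = preprocess(["Topic models need counts.", "Counts, counts, and topics!"])
# The LDA CB builds the DTM itself and, by default, saves nothing.
dtm, vocab = build_dtm(docs)
```

With this shape, other CBs can consume the text returned by `preprocess` directly, and the large DTM file is only written when a caller passes `save_dir`.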

Labels: enhancement (New feature or request)
