NLP Utils
Standard tools for aiding natural language processing.
Preprocess
Tools for preprocessing text.
tokenize_document(text, remove_stops=False)
Preprocess a whole raw document.
Parameters:
- text (str) – Raw string of text.
- remove_stops (bool) – Flag to remove English stopwords.
Returns: List of preprocessed and tokenized documents
clean_and_tokenize(text, remove_stops)
Preprocess a raw string/sentence of text.
Parameters:
- text (str) – Raw string of text.
- remove_stops (bool) – Flag to remove English stopwords.
Returns: Preprocessed tokens.
Return type: list of str
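To make the relationship between the two functions above concrete, here is a minimal sketch of sentence-level and document-level tokenization. It is not the library's implementation: the `_sketch` function names, the tiny `STOPWORDS` set, and the regex-based sentence splitting are all assumptions chosen for illustration.

```python
import re

# Hypothetical stand-in stopword list; a real pipeline would likely use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}


def clean_and_tokenize_sketch(text, remove_stops):
    """Lowercase the text, keep alphanumeric runs, optionally drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stops:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def tokenize_document_sketch(text, remove_stops=False):
    """Split on sentence-ending punctuation, then tokenize each sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [clean_and_tokenize_sketch(s, remove_stops) for s in sentences if s]


doc = "The cat sat. Is the dog in the garden?"
print(tokenize_document_sketch(doc, remove_stops=True))
# [['cat', 'sat'], ['dog', 'garden']]
```

In this sketch the document-level function returns one token list per sentence, which is one plausible reading of "list of preprocessed and tokenized documents".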
filter_by_idf(documents, lower_idf_limit, upper_idf_limit)
Remove (from documents) terms which are in a range of IDF values.
Parameters:
- documents (list) – Either a list of str or a list of list of str to be filtered.
- lower_idf_limit (float) – Lower percentile (between 0 and 100) on which to exclude terms by their IDF.
- upper_idf_limit (float) – Upper percentile (between 0 and 100) on which to exclude terms by their IDF.
Returns: Filtered documents
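The percentile-based filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes documents are already tokenized (a list of list of str), that IDF is computed as log(N / document frequency), and that terms whose IDF falls below the lower percentile or above the upper percentile are the ones removed. The names `percentile` and `filter_by_idf_sketch` are hypothetical.

```python
import math
from collections import Counter


def percentile(sorted_values, q):
    """Linear-interpolation percentile of a sorted list, with q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def filter_by_idf_sketch(documents, lower_idf_limit, upper_idf_limit):
    """Keep only terms whose IDF lies between the two percentile cut-offs."""
    n_docs = len(documents)
    # Document frequency: number of documents each term appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    if not doc_freq:
        return [list(doc) for doc in documents]
    # Assumed IDF definition: log(N / df), with no smoothing.
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    values = sorted(idf.values())
    lower = percentile(values, lower_idf_limit)
    upper = percentile(values, upper_idf_limit)
    return [[t for t in doc if lower <= idf[t] <= upper] for doc in documents]


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
    ["the", "zebra", "slept"],
]
print(filter_by_idf_sketch(docs, 10, 100))
# [['cat', 'sat'], ['dog', 'ran'], ['cat', 'ran'], ['zebra', 'slept']]
```

Here "the" appears in every document, so its IDF is 0 and it falls below the 10th-percentile cut-off. Raising the lower limit removes more of the common terms, and lowering the upper limit removes the rarest ones.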