sasctl.services.text_parsing

class sasctl._services.text_parsing.TextParsing

Bases: Service

The Text Parsing API parses natural language text documents.

Parsing is a key operation in understanding your data. Parsing a document involves the following analyses:

Identifying terms used in the document
Recognizing parts of speech for each term
Identifying which terms are entities (person, country, and so on)
Resolving synonyms, misspellings, and so on

The output tables that are generated during parsing can also be used in downstream analyses such as topic generation.

classmethod parse_documents(documents, caslib=None, id_column=None, text_column=None, description=None, standard_entities=False, noun_groups=False, min_doc_count=10, concept_model=None, output_postfix=None, spell_check=False, override_list=None, stop_list=None, start_list=None, synonym_list=None, language='en')

Performs natural language parsing on the input data.

Creates a text parsing job that executes asynchronously. Two modes of input are supported: parsing documents that are already stored in a CAS table, and parsing documents that are uploaded directly.

Parameters:

documents (str or dict or Iterable) – Documents to parse. May be either the URI to a CAS table where the documents are currently stored, or an iterable of strings containing the documents’ text.
caslib (str or dict, optional) – URI of a caslib in which the documents will be stored. Required if documents is a list of strings.
id_column (str, optional) – The column in documents that contains a unique id for each document. Required if documents is a CAS table URI.
text_column (str, optional) – The column in documents that contains the document text to parse. Required if documents is a CAS table URI.
description (str, optional) – Description to add to the text parsing job.
standard_entities (bool, optional)
noun_groups (bool, optional)
min_doc_count (int, optional) – Minimum number of documents in which a term must appear to be kept. Defaults to 10.
concept_model (str or dict, optional) – URI of a table containing the concept LITI binaries to apply during parsing.
output_postfix (str, optional) – Text to be added to the end of all output table names.
spell_check (bool, optional) – Whether spell checking should be performed during parsing.
override_list (str or dict, optional) – URI of a table containing overrides for the keep and drop terms.
language (str, optional) – Two-letter ISO 639-1 code indicating the source language. Defaults to ‘en’.

Returns:

RestObj – The submitted text parsing job.
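
Example:

The parameter requirements above differ by input mode, so it can help to check them before submitting a job. The following is a minimal sketch, not part of sasctl: `build_parse_kwargs` and `submit` are hypothetical helpers, `"CASLIB_URI"` is a placeholder rather than a real caslib URI, and treating any str or dict value of documents as a CAS table reference is an assumption. The `submit` helper assumes sasctl is installed, that `text_parsing` can be imported from `sasctl.services`, and that a sasctl Session to a SAS Viya server is already active.

```python
from collections.abc import Iterable


def build_parse_kwargs(documents, caslib=None, id_column=None,
                       text_column=None, **options):
    """Validate the documented requirements and return keyword arguments.

    Hypothetical helper for illustration; it is not part of sasctl.
    """
    if isinstance(documents, (str, dict)):
        # Assumption: a str or dict refers to an existing CAS table.
        missing = [name for name, value in
                   (("id_column", id_column), ("text_column", text_column))
                   if value is None]
        if missing:
            raise ValueError("CAS table input requires: " + ", ".join(missing))
    elif isinstance(documents, Iterable):
        # Documents passed as text are uploaded, so a caslib is needed.
        documents = list(documents)
        if caslib is None:
            raise ValueError(
                "caslib is required when documents is a list of strings")
    else:
        raise TypeError(
            "documents must be a table reference or an iterable of strings")

    kwargs = {"documents": documents, "caslib": caslib,
              "id_column": id_column, "text_column": text_column, **options}
    # Drop unset values so parse_documents falls back to its own defaults.
    return {key: value for key, value in kwargs.items() if value is not None}


def submit(kwargs):
    """Submit the job; needs sasctl installed and an active sasctl Session."""
    from sasctl.services import text_parsing  # imported lazily on purpose
    return text_parsing.parse_documents(**kwargs)


if __name__ == "__main__":
    # "CASLIB_URI" is a placeholder, not a real caslib URI.
    kwargs = build_parse_kwargs(
        ["The first document.", "A second document."],
        caslib="CASLIB_URI",
        min_doc_count=1,
        noun_groups=True,
    )
    print(sorted(kwargs))
```

Keeping the validation separate from the call means the argument checks can run without a server connection, and only `submit` needs a live session. Lowering min_doc_count from its default of 10 is usually sensible for a handful of documents, since otherwise few terms would appear in enough documents to be kept.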