text_extraction.py Functions

text_extraction.align_tokens_on_whitespace(dictionary, out_file)
text_extraction.create_annotation_entry(begin_pos=-1, begin_pos_mapped=None, end_pos=-1, end_pos_mapped=None, raw_text=None, pivot_attr=None, pivot_value=None, parity=None, tag_name=None)
text_extraction.extract_annotations(ingest_file, namespaces, document_data, patterns, skip_chars=None, out_file=None)
text_extraction.extract_annotations_brat_standoff(ingest_file, offset_mapping, type_prefix, tag_name, line_type, optional_attributes=[], normalization_engines=[])
text_extraction.extract_annotations_csv(csv_file, delimiter, tag_name, begin_column=None, end_column=None, text_column=None, optional_attributes=[])
text_extraction.extract_annotations_json(ingest_file, raw_content, offset_mapping, annotation_path, tag_name, begin_attribute=None, end_attribute=None, optional_attributes=[], normalization_engines=[])
text_extraction.extract_annotations_plaintext(offset_mapping, raw_content, delimiter, tag_name)
text_extraction.extract_annotations_semeval_pipes(ingest_file, offset_mapping, tag_name, optional_attributes=[])
text_extraction.extract_annotations_tsv(tsv_file, raw_content, offset_mapping, tag_name, optional_attributes=[])
text_extraction.extract_annotations_xml(ingest_file, offset_mapping, annotation_path, tag_name, namespaces={}, begin_attribute=None, end_attribute=None, text_attribute=None, optional_attributes=[], normalization_engines=[])
text_extraction.extract_annotations_xml_spanless(ingest_file, annotation_path, tag_name, pivot_attribute, parity, namespaces={}, text_attribute=None, optional_attributes=[])
text_extraction.extract_brat_attribute(ingest_file, annot_line, optional_attributes=[])
text_extraction.extract_brat_equivalence(ingest_file, annot_line, optional_attributes=[])
text_extraction.extract_brat_event(ingest_file, annot_line, tag_name, optional_attributes=[])
text_extraction.extract_brat_normalization(ingest_file, annot_line, normalization_engines=[])
text_extraction.extract_brat_relation(ingest_file, annot_line, tag_name, optional_attributes=[])
text_extraction.extract_brat_text_bound_annotation(ingest_file, annot_line, offset_mapping, tag_name, line_type, optional_attributes=[])
text_extraction.extract_chars(ingest_file, namespaces, document_data, skip_chars=None)
text_extraction.extract_json_chars(ingest_file, document_data, skip_chars=None)
text_extraction.extract_piped_text(ingest_file, skip_chars)
text_extraction.extract_plaintext(ingest_file, skip_chars)
text_extraction.map_position(offset_mapping, position, direction)

Convert a character position to the closest non-skipped position.

Use the offset mapping dictionary to convert a position to the closest valid character position. The direction matters because a start position should map to the closest valid position to its right, while an end position should map to the closest valid position to its left.

Parameters:
  • offset_mapping – a dictionary mapping each character position to None if the character is in the skip list, or to an int otherwise
  • position – current character position
  • direction – 1 if moving right; -1 if moving left
Returns:
  the character position as it would be after all skipped characters were removed from the document and positions re-assigned, or None on KeyError
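
The behaviour described above can be illustrated with a minimal sketch. This is not the module's actual implementation: the name map_position_sketch is hypothetical, and the sketch assumes the offset mapping is keyed by integer positions and that running off either end of the document raises the KeyError mentioned under Returns.

```python
def map_position_sketch(offset_mapping, position, direction):
    """Hypothetical re-creation of the documented behaviour of map_position.

    Assumes offset_mapping uses int keys (an assumption, not confirmed by
    the documentation) and maps skipped characters to None.
    """
    try:
        mapped = offset_mapping[position]
        # Step in the requested direction until a non-skipped character is found.
        while mapped is None:
            position += direction
            mapped = offset_mapping[position]
    except KeyError:
        # The position is outside the document, or the walk ran off its edge.
        return None
    return mapped


# Toy document "a\nb" with the newline (position 1) in the skip list.
toy_mapping = {0: 0, 1: None, 2: 1}

start = map_position_sketch(toy_mapping, 1, 1)    # move right for a start position
end = map_position_sketch(toy_mapping, 1, -1)     # move left for an end position
missing = map_position_sketch(toy_mapping, 5, 1)  # position not in the mapping
```

In this toy mapping, the skipped newline resolves to the mapped position of "b" when moving right and to the mapped position of "a" when moving left, while a position outside the mapping yields None.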

text_extraction.split_content(raw_text, offset_mapping, skip_chars)
text_extraction.write_annotations_to_disk(annotations, out_file)