folia.main.Document

class folia.main.Document(*args, **kwargs)

Bases: object

This is the FoLiA Document and holds all its data in memory.

All FoLiA elements have to be associated with a FoLiA document. Besides holding elements, the document may hold metadata including declarations, and an index of all IDs.

Method Summary

`__init__`(*args, **kwargs)	Start/load a FoLiA document.
`add`(text)	Alias for `Document.append()`
`alias`(annotationtype, set[, fallback])	Return the alias for a set if one exists; otherwise return the unaltered set, but only if fallback is enabled
`append`(text)	Add a text (or speech) to the document:
`attachexternal`(type, set, **kwargs)
`count`(Class[, set, recursive, ignore])	See `AbstractElement.count()`
`create`(Class, *args, **kwargs)	Create an element associated with this Document.
`date`([value])	Get or set the document's date from/in the metadata.
`declare`(annotationtype[, set])	Declare new annotation types, sets or annotators to be used in the document.
`declared`(annotationtype[, set])	Checks if the annotation type is present (i.e. declared) in the document.
`defaultannotator`(annotationtype[, set])	Obtain the default annotator for the specified annotation type and set.
`defaultannotatortype`(annotationtype[, set])	Obtain the default annotator type for the specified annotation type and set.
`defaultdatetime`(annotationtype[, set])	Obtain the default datetime for the specified annotation type and set.
`defaultset`(annotationtype)	Obtain the default set for the specified annotation type.
`done`()	Signal that you are done editing the document; this performs any pending post-processing operations
`elements`()	Returns a depth-first flat list of all elements in the document
`erase`(Class[, annotationset])	Erases all annotations of a particular type and annotation set (unless set is False in which case it applies to all elements regardless of set).
`findwords`(*args, **kwargs)
`getannotators`(annotationtype, annotationset)	Get all annotators for the given annotationtype and set.
`getdefaultprocessor`(annotationtype, ...)
`getprocessors`(annotationtype, annotationset)	Get all processors associated with the given annotationtype and set. This is a generator yielding Processor instances. See also `AbstractElement.getannotators()`
`hasannotators`(annotationtype, annotationset)	Alias for `Document.hasprocessors()`: Does this annotationtype and set have associated processors/annotators? (FoLiA v2 provenance data)
`hasdefaultprocessor`(annotationtype, ...)	Does this annotationtype and set have defaults? (new style FoLiA v2 with provenance data)
`hasdefaults`(annotationtype, annotationset)	Does this annotationtype and set have associated defaults? (old style FoLiA v1 without provenance data)
`hasprocessors`(annotationtype, annotationset)	Does this annotationtype and set have associated processors/annotators? (FoLiA v2 provenance data)
`items`()	Returns a depth-first flat list of all items in the document
`json`()	Serialise the document to a `dict` ready for serialisation to JSON.
`jsondeclarations`()	Return all declarations in a form ready to be serialised to JSON.
`jsonprovenance`()	Internal method to serialize provenance data to JSON
`language`([value])	No arguments: get the document's language (ISO-639-3) from the metadata. Argument: set the document's language (ISO-639-3) in the metadata.
`license`([value])	No arguments: get the document's license from the metadata. Argument: set the document's license in the metadata.
`load`(filename)	Load a FoLiA XML file.
`paragraphs`([index])	Return a generator of all paragraphs found in the document.
`parsemetadata`(node)	Internal method to parse metadata
`parsesubmetadata`(node)
`parsexml`(node[, ParentClass])	Internal method.
`parsexmldeclarations`(node)	Internal method to parse XML declarations
`parsexmlprovenance`(node)
`pendingsort`([warnonly])	Perform any pending sorts on span annotation elements (per layer, in turn recurses into all span annotations)
`pendingvalidation`([warnonly])	Perform any pending validations
`publisher`([value])	No arguments: get the document's publisher from the metadata. Argument: set the document's publisher in the metadata.
`save`([filename, form])	Save the document to file.
`select`(Class[, set, recursive, ignore])	See `AbstractElement.select()`
`sentences`([index])	Return a generator of all sentences found in the document.
`text`([cls, retaintokenisation, hidden, ...])	Return the text of the entire document as a single string
`title`([value])	Get or set the document's title from/in the metadata
`unalias`(annotationtype, alias)	Return the set for an alias (if applicable, raises an exception otherwise)
`words`([index])	Return a generator of all active words found in the document.
`xml`([form])	Serialise the document to XML.
`xmldeclarations`()	Internal method to generate XML nodes for all declarations
`xmlmetadata`()	Internal method to serialize metadata to XML
`xmlprovenance`()	Internal method to serialize provenance data to XML
`xmlstring`([form])	Return the XML representation of the document as a string.
`xpath`(query)	Run Xpath expression and parse the resulting elements.

Attributes

IDSEPARATOR

Method Details

__init__(*args, **kwargs)

Start/load a FoLiA document:

There are four sources of input for loading a FoLiA document:

Create a new document by specifying an ID:
```
doc = folia.Document(id='test')
```

Load a document from a FoLiA or D-Coi XML file:
```
doc = folia.Document(file='/path/to/doc.xml')
```

Load a document from an XML string:
```
doc = folia.Document(string='<FoLiA>....</FoLiA>')
```

Load a document by passing a parsed XML tree (lxml.etree):
```
doc = folia.Document(tree=xmltree)
```

You will often want to associate a Processor when you instantiate a document. The processor encapsulates information about the tool that is processing the document (i.e. your script) and adds this to the document’s provenance chain. Any new annotations you add to the document are automatically related to the processor:
```
doc = folia.Document(id="example", processor=Processor.create(name="my-tool", version="0.1"))
```

Keyword Arguments:

setdefinition (dict) – A dictionary of set definitions, the key corresponds to the set name, the value is a SetDefinition instance
loadsetdefinitions (bool) – download and load set definitions (default: False)
deepvalidation (bool) – Do deep validation of the document (default: False), implies loadsetdefinitions
textvalidation (bool) – Do validation of text consistency (default: False); this value is always forced to True for FoLiA v1.5 and above
preparsexmlcallback (function) – Callback for a function taking one argument (node, an lxml node). Will be called whenever an XML element is parsed into FoLiA. The function should return an instance inherited from folia.AbstractElement, or None to abort parsing this element (and all its children)
parsexmlcallback (function) – Callback for a function taking one argument (element, a FoLiA element). Will be called whenever an XML element is parsed into FoLiA. The function should return an instance inherited from folia.AbstractElement, or None to abort adding this element (and all its children)
keepversion (bool) – attempt to keep the FoLiA version (use with caution)
version (str) – force a particular FoLiA version when creating a new document (use with caution)
declare (list) – Declare the specified annotation types. Consists of a list or tuple of annotation types, (annotationtype, set) tuples, or (annotationtype, set, processor) tuples
processor (Processor) – Register the current processor in the provenance data and use this processor in all subsequent declarations.
reprocessor (Processor) – As above, but will take pro-active ownership of any declarations already present but not tied to a processor yet.
debug (bool) – Boolean to enable/disable debug
autodeclare (bool) – Automatically declare annotation types and annotators whenever possible (enabled by default for FoLiA v2)
mode – The mode for loading a document, one of: folia.Mode.MEMORY, in which the entire FoLiA document is loaded into memory (this is the default mode and the only mode in which documents can be manipulated and saved again); or folia.Mode.XPATH, in which the full XML tree is still loaded into memory, but conversion to FoLiA classes occurs only when queried. The latter mode can be used when the full power of XPath is required.
checkreferences (bool) – Check whether references are valid upon loading (default: True)
fixunassignedprocessor (bool) – If set, fixes invalid FoLiA that does not explicitly assign a processor to an annotation when multiple processors are possible (and there is therefore no default). The last processor will be used in this case. (default: False)
fixinvalidereferences (bool) – Do not serialise an invalid reference, remove the reference and output a comment instead. (default: False)
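
The contract described for parsexmlcallback (return the element to keep it, or None to drop it together with all its children) can be illustrated without the library. The following is a minimal sketch under stated assumptions, not the library's parser: it uses the standard library's xml.etree.ElementTree in place of lxml, plain dicts in place of FoLiA elements, and the names parse and drop_notes are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse(node, callback=None):
    """Convert an XML node into a nested dict, consulting the callback first."""
    element = {"tag": node.tag, "children": []}
    if callback is not None:
        element = callback(element)
        if element is None:
            # The callback aborted: this element and its children are dropped
            return None
    for child in node:
        parsed = parse(child, callback)
        if parsed is not None:
            element["children"].append(parsed)
    return element


def drop_notes(element):
    # Hypothetical filter: skip every <note> subtree, keep everything else
    return None if element["tag"] == "note" else element


root = ET.fromstring("<text><p><w/><note><w/></note></p></text>")
tree = parse(root, drop_notes)
paragraph = tree["children"][0]
kept_tags = [child["tag"] for child in paragraph["children"]]
```

Here kept_tags contains only the first w element, because the note subtree (including the w inside it) was pruned by the callback.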

add(text): Alias for Document.append()

alias(annotationtype, set, fallback=False): Return the alias for a set if one exists; otherwise return the unaltered set, but only if fallback is enabled

append(text)

Add a text (or speech) to the document:

Example 1:

doc.append(folia.Text)

Example 2:

doc.append(folia.Text(doc, id='example.text'))

Example 3:

doc.append(folia.Speech)

attachexternal(type, set, **kwargs)

count(Class, set=False, recursive=True, ignore=True): See AbstractElement.count()

create(Class, *args, **kwargs): Create an element associated with this Document. This method may be obsolete and removed later.

date(value=None)

Get or set the document’s date from/in the metadata.

No arguments: get the document’s date from the metadata.
Argument: set the document’s date in the metadata.

declare(annotationtype, set=None, *args, **kwargs)

Declare new annotation types, sets or annotators to be used in the document.

This is typically done by associating an annotationtype and set with a processor; the processor contains annotator information and will be recorded in the provenance data.

Parameters:

annotationtype – The type of annotation, this is conveyed by passing the corresponding annotation class (such as PosAnnotation for example), or a member of AnnotationType, such as AnnotationType.POS.
set (str) – the set, should formally be a URL pointing to the set definition

Positional Arguments:

processor (Processor or str) – A processor to declare; can be a processor instance or the ID of an existing processor. The processor encapsulates all information about an annotator. If you specify multiple processors, they are parsed as a hierarchy, the first one being the root and the others subprocessors.

Keyword Arguments:

alias (str) – Defines alias that may be used in set attribute of elements instead of the full set name
generator (bool) – Automatically append a subprocessor with generator information on the FoLiA library used? (default: True)

Keyword Arguments (<= FoLiA 1.5 behaviour, i.e. without provenance data):

annotator (str) – Sets a default annotator old-style, i.e. without full provenance
annotatortype – Old-style, should be either AnnotatorType.MANUAL or AnnotatorType.AUTO, indicating whether the annotation was performed manually or by an automated process. Please use processor= instead.
datetime (datetime.datetime) – Sets the default datetime

Example 1 (with provenance):

doc.declare(folia.PosAnnotation, 'http://some/path/brown-tag-set', Processor(name="mytagger") )

Example 2 (with provenance; nested processors):

main_processor = Processor(name="myNLPtool", version="2.2")
doc.declare(folia.PosAnnotation, 'http://some/path/brown-tag-set', main_processor, Processor(name="mytagger"))
doc.declare(folia.LemmaAnnotation, 'http://some/set', main_processor, Processor(name="mylemmatiser"))

Example 2b (with provenance; nested processors, same as above but setting main processor on Document instantiation instead):

doc = folia.Document(id="mydoc", processor=Processor(name="myNLPtool", version="2.2"))
doc.declare(folia.PosAnnotation, 'http://some/path/brown-tag-set', Processor(name="mytagger"))
doc.declare(folia.LemmaAnnotation, 'http://some/set', Processor(name="mylemmatiser"))

Example 3 (with provenance; nested processors):

main_processor = Processor(name="myEditor", version="1.2")
doc.declare(folia.PosAnnotation, 'http://some/path/brown-tag-set', main_processor, Processor(name="alice", type=AnnotatorType.MANUAL))
doc.declare(folia.PosAnnotation, 'http://some/path/brown-tag-set', main_processor, Processor(name="bob", type=AnnotatorType.MANUAL))
doc.declare(folia.PosAnnotation, 'http://some/path/brown-tag-set', main_processor, Processor(name="john", type=AnnotatorType.MANUAL))

Example 4 (without provenance, for backward compatibility, the use of proper provenance is always preferred!):

doc.declare(folia.PosAnnotation, 'http://some/path/brown-tag-set', annotator="mytagger", annotatortype=folia.AnnotatorType.AUTO)

Returns: Processor instance of the last processor added (or None if no provenance is used)

declared(annotationtype, set=False)

Checks if the annotation type is present (i.e. declared) in the document.

Parameters:

annotationtype – The type of annotation, this is conveyed by passing the corresponding annotation class (such as PosAnnotation for example), or a member of AnnotationType, such as AnnotationType.POS.
set (str/None/False) – the set, should formally be a URL pointing to the set definition (aliases are also supported). If set to False, checks regardless of set (i.e. matching any set). If set to None, there is no associated set.

Example:

if doc.declared(folia.PosAnnotation, 'http://some/path/brown-tag-set'):
    ...

Returns: bool
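
The three possible values of the set parameter (a string, None, or False) are easy to confuse. The following is a minimal sketch of the matching rule as described above, not the library's implementation: declarations are modelled as a plain Python set of (type, set) pairs, annotation types are plain strings, aliases are not handled, and the parameter is named annotationset here only to avoid shadowing the built-in set.

```python
def declared(declarations, annotationtype, annotationset=False):
    """Model of the lookup; declarations is a set of (type, set) pairs."""
    if annotationset is False:
        # False: match the annotation type regardless of set
        return any(t == annotationtype for t, _ in declarations)
    # A string or None must match exactly; None means "no associated set"
    return (annotationtype, annotationset) in declarations


declarations = {
    ("pos", "http://some/path/brown-tag-set"),
    ("token", None),  # a setless declaration
}

results = [
    declared(declarations, "pos"),          # any set
    declared(declarations, "pos", None),    # setless only
    declared(declarations, "token", None),  # setless only
    declared(declarations, "lemma"),        # never declared
]
```

In this model, asking for "pos" with None fails even though "pos" is declared, because the only "pos" declaration carries a set.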

defaultannotator(annotationtype, set=False)

Obtain the default annotator for the specified annotation type and set.

Parameters:

annotationtype – The type of annotation, this is conveyed by passing the corresponding annotation class (such as PosAnnotation for example), or a member of AnnotationType, such as AnnotationType.POS.
set (str/None/False) – the set, should formally be a URL pointing to the set definition or None for setless annotations. If set to False, the default set will be inferred automatically, but an exception will occur if there is none!

Returns:

the default annotator (str)

Raises:

NoDefaultError – if there is no default

defaultannotatortype(annotationtype, set=False)

Obtain the default annotator type for the specified annotation type and set.

Parameters:

annotationtype – The type of annotation, this is conveyed by passing the corresponding annotation class (such as PosAnnotation for example), or a member of AnnotationType, such as AnnotationType.POS.
set (str/None/False) – the set, should formally be a URL pointing to the set definition or None for setless annotations. If set to False, the default set will be inferred automatically, but an exception will occur if there is none!

Returns:

AnnotatorType.AUTO or AnnotatorType.MANUAL

Raises:

NoDefaultError – if there is no default

defaultdatetime(annotationtype, set=False)

Obtain the default datetime for the specified annotation type and set.

Parameters:

annotationtype – The type of annotation, this is conveyed by passing the corresponding annotation class (such as PosAnnotation for example), or a member of AnnotationType, such as AnnotationType.POS.
set (str/None/False) – the set, should formally be a URL pointing to the set definition or None for setless annotations. If set to False, the default set will be inferred automatically, but an exception will occur if there is none!

Returns:

the default datetime

Raises:

NoDefaultError – if there is no default

defaultset(annotationtype)

Obtain the default set for the specified annotation type.

Parameters: annotationtype – The type of annotation, this is conveyed by passing the corresponding annotation class (such as PosAnnotation for example), or a member of AnnotationType, such as AnnotationType.POS.
Returns: the set (str or None), or False if there is no default set. Take care to explicitly distinguish between False and None!
Raises: NoSuchAnnotation
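
Because both False and None are falsy in Python, a plain truthiness test on the return value cannot tell "no default set" from None. The following is a minimal sketch of caller-side handling; describe_default is a hypothetical helper, and reading None as "setless" is an assumption carried over from the set parameter of the other default* methods.

```python
def describe_default(default):
    """Classify a value shaped like the return value of defaultset()."""
    if default is False:
        return "no default set"
    if default is None:
        # Assumption: None stands for a setless annotation type
        return "setless"
    return "default set: " + default


samples = [False, None, "http://some/path/brown-tag-set"]
descriptions = [describe_default(sample) for sample in samples]

# A truthiness test treats the first two cases identically
ambiguous = (not False) == (not None)
```

Identity checks (is False, is None) keep the cases apart, which is why the sketch avoids "if not default".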

done(): Signal that you are done editing the document, this will perform any pending post-processing operation

elements(): Returns a depth-first flat list of all elements in the document

erase(Class, annotationset=False): Erases all annotations of a particular type and annotation set (unless set is False, in which case it applies to all elements regardless of set). Also removes the declarations (i.e. the opposite of declare())

findwords(*args, **kwargs)

getannotators(annotationtype, annotationset): Get all annotators for the given annotationtype and set. This is a generator that yields Annotator instances; these resolve to a Processor when called. See also AbstractElement.getprocessors() to obtain processors directly, which is most likely what you want.

getdefaultprocessor(annotationtype, annotationset)

getprocessors(annotationtype, annotationset): Get all processors associated with the given annotationtype and set. This is a generator yielding Processor instances. See also AbstractElement.getannotators()

hasannotators(annotationtype, annotationset): Alias for Document.hasprocessors(): Does this annotationtype and set have associated processors/annotators? (FoLiA v2 provenance data)

hasdefaultprocessor(annotationtype, annotationset): Does this annotationtype and set have defaults? (new style FoLiA v2 with provenance data)

hasdefaults(annotationtype, annotationset): Does this annotationtype and set have associated defaults? (old style FoLiA v1 without provenance data)

hasprocessors(annotationtype, annotationset): Does this annotationtype and set have associated processors/annotators? (FoLiA v2 provenance data)

items(): Returns a depth-first flat list of all items in the document

json()

Serialise the document to a dict ready for serialisation to JSON.

Example:

import json
jsondoc = json.dumps(doc.json())

jsondeclarations()

Return all declarations in a form ready to be serialised to JSON.

Returns: list of dict

jsonprovenance(): Internal method to serialize provenance data to JSON

language(value=None)

No arguments: get the document’s language (ISO-639-3) from the metadata.
Argument: set the document’s language (ISO-639-3) in the metadata.

license(value=None)

No arguments: get the document’s license from the metadata.
Argument: set the document’s license in the metadata.

load(filename)

Load a FoLiA XML file.

Parameters: filename (str) – The file to load

paragraphs(index=None)

Return a generator of all paragraphs found in the document.

If an index is specified, return the n’th paragraph only (starting at 0)
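
The dual behaviour of the index parameter (a generator over everything when omitted, a single item when given) can be sketched as follows. This is an illustration only: nth_or_all is a hypothetical helper over a plain list of strings, and raising IndexError for an out-of-range index is an assumption, since the behaviour in that case is not documented here.

```python
def nth_or_all(items, index=None):
    """Return a generator over items, or only the item at the given index."""
    if index is None:
        return (item for item in items)
    for position, item in enumerate(items):
        if position == index:
            return item
    # Assumption: out-of-range handling is not specified in the text above
    raise IndexError(index)


paragraphs = ["first paragraph", "second paragraph", "third paragraph"]

everything = list(nth_or_all(paragraphs))
second = nth_or_all(paragraphs, 1)  # indices start at 0
```

Note the different return types: callers must iterate in the first case, but receive the item itself in the second.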

parsemetadata(node): Internal method to parse metadata

parsesubmetadata(node)

parsexml(node, ParentClass=None)

Internal method.

This is the main XML parser, will invoke class-specific XML parsers.

parsexmldeclarations(node): Internal method to parse XML declarations

parsexmlprovenance(node)

pendingsort(warnonly=None): Perform any pending sorts on span annotation elements (per layer, in turn recurses into all span annotations)

pendingvalidation(warnonly=None)

Perform any pending validations

Parameters: warnonly (bool) – Warn only (True) or raise exceptions (False). If set to None, the value is determined by the document’s FoLiA version (warn only before FoLiA v1.5).
Returns: bool
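
How a warnonly value of None is resolved from the document version can be sketched as follows. This is a minimal model of the rule stated above, not the library's code: resolve_warnonly and parse_version are hypothetical helpers, and versions are assumed to be dotted strings of integers.

```python
def parse_version(version):
    """Turn a dotted version string such as '1.5.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def resolve_warnonly(warnonly, version):
    """Return the effective warn-only flag for a document version."""
    if warnonly is None:
        # Warn only before FoLiA v1.5; raise exceptions from v1.5 onwards
        return parse_version(version) < (1, 5)
    # An explicit True or False always wins over the version-based default
    return warnonly


old_default = resolve_warnonly(None, "1.4.0")
new_default = resolve_warnonly(None, "2.0.0")
forced = resolve_warnonly(True, "2.0.0")
```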

publisher(value=None)

No arguments: get the document’s publisher from the metadata.
Argument: set the document’s publisher in the metadata.

save(filename=None, form=0)

Save the document to file.

Parameters: filename (str) – The filename to save to. If not set (None, default), saves to the same file as loaded from.

select(Class, set=False, recursive=True, ignore=True): See AbstractElement.select()

sentences(index=None)

Return a generator of all sentences found in the document, except for sentences in quotes.

If an index is specified, return the n’th sentence only (starting at 0)

text(cls='current', retaintokenisation=False, hidden=False, trim_spaces=True, correctionhandling=1): Return the text of the entire document as a single string

See also: AbstractElement.text()

title(value=None)

Get or set the document’s title from/in the metadata

No arguments: get the document’s title from the metadata.
Argument: set the document’s title in the metadata.
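
The combined get/set convention used by title() (and likewise by date(), language(), license() and publisher()) can be sketched with a small stand-in class. Metadata is hypothetical and is not how the library stores metadata. Returning the stored value after a set is an assumption, and the sketch cannot reset a field to None, because None is what signals "no argument".

```python
class Metadata:
    """Hypothetical stand-in for a document's metadata store."""

    def __init__(self):
        self._fields = {}

    def title(self, value=None):
        # With an argument: store it first. Without one: only read.
        if value is not None:
            self._fields["title"] = value
        return self._fields.get("title")


meta = Metadata()
before = meta.title()       # nothing stored yet
meta.title("My document")   # set
after = meta.title()        # get
```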

unalias(annotationtype, alias): Return the set for an alias (if applicable, raises an exception otherwise)
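
The relationship between alias() and unalias() can be sketched as a pair of lookups in opposite directions. AliasTable and its register method are hypothetical and only model the behaviour described above. Raising KeyError is an assumption: the exception type is not documented here, and neither is what alias() does when no alias exists and fallback is disabled.

```python
class AliasTable:
    """Hypothetical stand-in for a document's set aliases."""

    def __init__(self):
        self._alias = {}    # (annotationtype, set) -> alias
        self._unalias = {}  # (annotationtype, alias) -> set

    def register(self, annotationtype, annotationset, alias):
        self._alias[(annotationtype, annotationset)] = alias
        self._unalias[(annotationtype, alias)] = annotationset

    def alias(self, annotationtype, annotationset, fallback=False):
        key = (annotationtype, annotationset)
        if key in self._alias:
            return self._alias[key]
        if fallback:
            return annotationset  # the unaltered set
        raise KeyError(annotationset)  # assumption, see the note above

    def unalias(self, annotationtype, alias):
        # Raises KeyError when the alias is unknown (assumed exception type)
        return self._unalias[(annotationtype, alias)]


table = AliasTable()
table.register("pos", "http://some/path/brown-tag-set", "brown")

short = table.alias("pos", "http://some/path/brown-tag-set")
full = table.unalias("pos", "brown")
unchanged = table.alias("pos", "http://other/set", fallback=True)
```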

words(index=None)

Return a generator of all active words found in the document. Does not descend into annotation layers, alternatives, originals, suggestions.

If an index is specified, return the n’th word only (starting at 0)
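
The rule that words() does not descend into annotation layers, alternatives, originals or suggestions amounts to a depth-first traversal that prunes certain subtrees. The following is a minimal sketch of that idea on a toy tree of dicts; the "kind" labels, the IGNORED set and active_words are hypothetical and do not correspond to the library's classes.

```python
# Kinds of node whose subtrees are skipped entirely (illustrative labels)
IGNORED = {"layer", "alternative", "original", "suggestion"}


def active_words(node):
    """Yield the text of word nodes depth-first, pruning ignored subtrees."""
    if node["kind"] in IGNORED:
        return
    if node["kind"] == "w":
        yield node["text"]
    for child in node.get("children", []):
        yield from active_words(child)


document = {"kind": "text", "children": [
    {"kind": "s", "children": [
        {"kind": "w", "text": "the"},
        {"kind": "correction", "children": [
            {"kind": "new", "children": [{"kind": "w", "text": "cat"}]},
            {"kind": "original", "children": [{"kind": "w", "text": "kat"}]},
        ]},
        {"kind": "w", "text": "sleeps"},
    ]},
]}

words = list(active_words(document))
```

The word inside the "original" branch is never visited, so only the currently active reading of the sentence is returned.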

xml(form=0)

Serialise the document to XML.

Returns: lxml.etree.Element