Library API¶
Tree()
is the main class of this package. An instance of the class
represents the tree information.
Construction of the tree representation¶
To construct the tree representation, information about the nodes of the tree and their parent nodes is required.
Constructing from a tabular file¶
Tree.construct_from_tabular(filename, separator="\t", elem_field_num=0, parent_field_num=1)
allows the construction of a Tree()
object from a tabular file.
Header or comment lines starting with a #
are ignored.
The elem_field_num
and parent_field_num
parameters are the 0-based
numbers of the fields (columns) in the tabular file containing, respectively,
element IDs and the IDs of the element parents. Parents can be defined before
or after their children.
If the tabular file is an NCBI Taxonomy nodes dump file, the method
Tree.construct_from_ncbi_dump(filename)
can be used, which sets the
separator and field numbers appropriately.
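As a minimal sketch of the tabular input described above, the following writes a small three-column file and builds a tree from its first and third columns. The example rows, the helper names `write_example_table` and `build_tree`, the `from fastsubtrees import Tree` import path, and the convention that the root lists itself as its own parent are assumptions made for illustration.

```python
# Hypothetical example data: element ID, label, parent ID.
# Assumption: the root (1) lists itself as its own parent.
EXAMPLE_ROWS = [
    (1, "root", 1),
    (2, "a", 1),
    (4, "c", 3),  # a child listed before its parent (3)
    (3, "b", 1),
]


def write_example_table(path, rows=EXAMPLE_ROWS, separator="\t"):
    """Write the rows to a tabular file, preceded by a '#' header line."""
    with open(path, "w") as f:
        f.write("# " + separator.join(["element", "label", "parent"]) + "\n")
        for row in rows:
            f.write(separator.join(str(value) for value in row) + "\n")


def build_tree(path):
    """Build a Tree from columns 0 (element ID) and 2 (parent ID)."""
    from fastsubtrees import Tree  # assumed import path

    return Tree.construct_from_tabular(
        path, separator="\t", elem_field_num=0, parent_field_num=2
    )
```

Calling `write_example_table("nodes.tsv")` followed by `build_tree("nodes.tsv")` would then construct the tree; the header line is skipped because it starts with a #.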
Constructing from a different data source¶
For constructing the tree from a different data source (e.g. a database)
the Tree.construct(generator)
class method can be used.
The parameter must be a generator that yields pairs of values:
the ID of each node and the ID of its parent node.
Parents can be defined before or after their children.
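The database case could look roughly like the sketch below, which reads the pairs from an SQLite table using the standard `sqlite3` module. The table name `nodes`, its columns `id` and `parent_id`, the helper names, and the `from fastsubtrees import Tree` import path are assumptions, not part of the library.

```python
import sqlite3


def node_parent_pairs(connection):
    """Yield (node_id, parent_id) pairs from a hypothetical `nodes` table."""
    cursor = connection.execute("SELECT id, parent_id FROM nodes")
    for node_id, parent_id in cursor:
        yield node_id, parent_id


def build_tree_from_db(connection):
    """Construct a Tree from the pairs stored in the database."""
    from fastsubtrees import Tree  # assumed import path

    return Tree.construct(node_parent_pairs(connection))
```

Since parents may be defined before or after their children, the query needs no particular ordering.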
Saving to and loading from file¶
A tree representation can be stored to a file using the instance method
tree.to_file(filename)
and re-loaded from such a file using
the class method Tree.from_file(filename)
.
Modifying an existing tree representation¶
A tree representation (constructed from an input source or loaded from a file) can be modified by adding or deleting leaf nodes or entire subtrees.
Adding nodes¶
For adding new nodes, the tree.add_nodes(generator)
method is used.
The parameter must be a generator that yields pairs of values:
the ID of each node and the ID of its parent node.
If attributes have been defined (see below), the corresponding files
will be automatically adapted.
The optional parameter list_added
can be set to a list, to which
the IDs of added nodes are appended.
The optional parameter total
can be set to the number of tuples
yielded by the generator, for displaying a tqdm progress bar.
The method raises an exception if an added node already exists in the tree,
or if a parent node exists neither in the tree nor among the nodes
added in the same call to add_nodes
.
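A small wrapper, sketched here under the assumption that `list_added` and `total` are accepted as keyword arguments, shows how the two optional parameters fit together. The function name `add_nodes_with_report` is hypothetical, and `tree` stands for an existing Tree instance.

```python
def add_nodes_with_report(tree, new_pairs):
    """Add (node_id, parent_id) pairs to `tree` and return the added IDs."""
    new_pairs = list(new_pairs)  # materialise the pairs to know the total
    added = []                   # add_nodes appends the added IDs here
    tree.add_nodes(
        (pair for pair in new_pairs),  # the method expects a generator
        list_added=added,
        total=len(new_pairs),          # used for the tqdm progress bar
    )
    return added
```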
Removing nodes¶
For removing a subtree, the tree.delete_subtree(node_number)
method is used.
If the provided ID is that of a leaf node, only that node is removed.
If it is that of an internal node, the node and the entire subtree under it
are removed.
If attributes have been defined (see below), the corresponding files
will be automatically adapted.
The optional parameter list_deleted
can be set to a list, to which
the IDs of the deleted nodes are appended.
The method will raise an exception if the node_number
does not exist.
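To see which nodes a deletion actually removed, the `list_deleted` parameter can be used as in the following sketch. The wrapper name is hypothetical, and passing `list_deleted` as a keyword argument is an assumption.

```python
def delete_subtree_with_report(tree, node_id):
    """Delete the subtree rooted at `node_id` and return the removed IDs."""
    deleted = []  # delete_subtree appends the deleted IDs here
    tree.delete_subtree(node_id, list_deleted=deleted)
    return deleted
```

For a leaf node the returned list would be expected to hold a single ID; for an internal node it would hold the IDs of the whole subtree.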
Moving a node¶
A node (except the root) can be moved, by detaching it from its parent
and re-attaching it to a different parent node in the tree.
For this operation the tree.move_subtree(subtree_root, new_parent)
method is used.
If attributes have been defined (see below), the corresponding files
will be automatically adapted.
The method will raise an exception if the new parent node is not present in the tree.
Updating or resetting a tree¶
A tree can be updated to reflect the contents of a data source of
tuples (node_id, parent)
. In the process, all nodes that are present in the
tree but not generated by the data source are removed, nodes
with a changed parent are moved, and nodes not yet present in the
tree are added.
The method for updating a tree is tree.update(generator)
.
The method will fail if the root is changed, or if the tree data
is invalid.
Updating can be faster than creating a new tree
if there are few changes.
If there are many changes, it is more efficient to recreate
the tree data using the tree.reset(generator)
method.
This is equivalent to creating a new tree, but the attribute values
of nodes are kept, if the nodes are still present in the tree
after the reset.
Both update and reset have a _from_tabular
and a _from_ncbi_dump
variant for working with tabular files and
NCBI Taxonomy nodes dump files, respectively.
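The choice between the two methods can be sketched as follows. The function name `refresh_tree`, the `many_changes` flag and the `make_pairs` callable (which returns a fresh generator of `(node_id, parent)` tuples) are hypothetical; how many changes justify a reset is left to the caller.

```python
def refresh_tree(tree, make_pairs, many_changes=False):
    """Bring `tree` in line with a data source of (node_id, parent) tuples.

    `make_pairs` is a zero-argument callable returning a new generator.
    """
    if many_changes:
        # Recreate the tree data; attribute values of surviving nodes are kept.
        tree.reset(make_pairs())
    else:
        # Remove, move and add only the nodes that differ.
        tree.update(make_pairs())
```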
Working with attributes¶
Attributes are stored in files and accessed from them. The filenames are automatically
computed from the tree filename. If the tree object has been constructed but not
yet saved to a file, the filename can be set with tree.set_filename
.
If a tree topology changes, the attribute values files will be adapted
accordingly.
Creating attributes¶
To create an attribute, a source of element IDs and associated attribute values
is required. This can be either a generator that yields tuples
(element_ID, attribute_value)
, which is passed to the method
tree.create_attribute(attribute_name, generator)
, or a tabular file,
whose name is passed to
tree.create_attribute_from_tabular(attribute_name, tabular_filename)
.
In both cases, a single-argument casting function can be passed as the argument
casting_fn
, which converts the values from the values source, e.g. from
strings to another datatype.
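The generator variant with a casting function might be used as in this sketch, which parses lines of the form `ID<TAB>value` and lets `casting_fn=int` convert the values. The line format, the helper names and the assumption that `casting_fn` is passed as a keyword argument are illustrative only.

```python
def attribute_pairs(lines, separator="\t"):
    """Yield (element_ID, attribute_value) tuples from text lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and empty lines
        element_id, value = line.rstrip("\n").split(separator)[:2]
        yield int(element_id), value  # the value is still a string here


def create_int_attribute(tree, attribute_name, lines):
    """Create an attribute whose string values are converted to int."""
    tree.create_attribute(
        attribute_name, attribute_pairs(lines), casting_fn=int
    )
```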
Modifying attributes¶
To delete the value of an attribute for a list of nodes, the method
tree.delete_attribute_values(attribute_name, nodes_list)
is used.
To append new values to some nodes, the new values must
be passed as a dict in the form {node_id: attr_values_list}
to
the method tree.append_attribute_values(attribute_name, new_values)
.
This leaves the existing values untouched. To replace them instead,
use tree.replace_attribute_values(attribute_name, new_values)
.
This replaces all values for the nodes in the dictionary and leaves
the rest of the values intact.
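The difference between appending and replacing can be illustrated with two thin wrappers; both build the `{node_id: attr_values_list}` dictionary for a single node. The wrapper names are hypothetical.

```python
def record_extra_value(tree, attribute_name, node_id, value):
    """Append one value to a node, keeping the values it already has."""
    tree.append_attribute_values(attribute_name, {node_id: [value]})


def overwrite_values(tree, attribute_name, node_id, values):
    """Replace all values of one node; other nodes are left untouched."""
    tree.replace_attribute_values(attribute_name, {node_id: list(values)})
```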
It is also possible to directly edit the list of attribute values for
some nodes. For this purpose a dictionary of the form
{node_id: attr_values_list}
is obtained using the method
attrvalues = tree.load_attribute_values(attribute_name)
.
After loading, the dictionary entries are changed by deleting, adding or
modifying values, and the dictionary is finally passed to the method
tree.save_attribute_values(tree, attribute_name, attrvalues)
.
Checking if an attribute exists¶
To check if an attribute exists, the tree.has_attribute(attribute_name)
method can be used. The list of all attributes is returned by
tree.list_attributes()
.
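A typical use of the existence check is to create an attribute only when it is missing, as in this sketch. The function name `ensure_attribute` and the `make_values` callable, which returns a generator of `(element_ID, attribute_value)` tuples, are hypothetical.

```python
def ensure_attribute(tree, attribute_name, make_values):
    """Create the attribute unless it exists; return True if it was created."""
    if tree.has_attribute(attribute_name):
        return False
    tree.create_attribute(attribute_name, make_values())
    return True
```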
Destroying attributes¶
To remove an attribute completely, the
tree.destroy_attribute(attribute_name)
method is used. To remove all
attributes, the tree.destroy_all_attributes()
method is used.
Subtree queries¶
List of IDs of a subtree¶
The list of IDs of a subtree whose root is node subtree_root
is obtained using
the method: tree.subtree_ids(subtree_root)
.
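Two simple uses of the returned list are sketched below: counting the nodes of a subtree and testing whether a node belongs to it. The helper names are hypothetical, and the sketch assumes only that the method returns a list of node IDs, as stated above.

```python
def subtree_size(tree, subtree_root):
    """Return the number of IDs listed for the subtree."""
    return len(tree.subtree_ids(subtree_root))


def is_in_subtree(tree, node_id, subtree_root):
    """Return True if `node_id` is among the IDs of the subtree."""
    return node_id in set(tree.subtree_ids(subtree_root))
```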
Attribute values in a subtree¶
The method tree.query_attribute(subtree_root, attributes)
returns the values of multiple attributes for the given subtree as
a dictionary in the form {'attribute_name': [values]}
.
The optional parameter show_stats
can be set to True, to compute
and output some statistics to the logger info channel.
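The returned dictionary can be post-processed like any other; the sketch below counts the entries of each value list. The function name is hypothetical, and it is an assumption that `attributes` is a list of attribute names and that `show_stats` is accepted as a keyword argument.

```python
def entries_per_attribute(tree, subtree_root, attributes, show_stats=False):
    """Return {'attribute_name': number of entries} for a subtree."""
    result = tree.query_attribute(
        subtree_root, attributes, show_stats=show_stats
    )
    return {name: len(values) for name, values in result.items()}
```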
To associate the attribute values with node IDs, and optionally collect further
information, such as the ID of the parent of each node, the method
tree.subtree_info(subtree_root, attributes)
can be used.
Optional parameters can be set to include the subtree sizes (debug information
only, since they still include deleted nodes) and the parents
(include_parents
option). The dictionary
keys for the node IDs, subtree sizes and parent IDs can be set
using the node_id_key
, subtree_size_key
and parent_key
parameters.
Verbosity¶
For slow operations, such as constructing a new tree, optional progress bars
(based on the tqdm library) can be displayed. They are enabled by setting
fastsubtrees.PROGRESS_ENABLED
to True
(by default the value is False
).
The output verbosity can be controlled by setting the log level.
By default the logger is disabled.
Log messages can be activated by using fastsubtrees.enable_logger("INFO")
.
For debugging, additional messages can be displayed by using
fastsubtrees.enable_logger("DEBUG")
.
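Both settings can be applied in one place, for example at the start of a script, as in this sketch. The function name `configure_output` and its parameters are hypothetical; the module attribute and the `enable_logger` call are the ones described above.

```python
def configure_output(show_progress=False, log_level=None):
    """Enable progress bars and/or log messages for fastsubtrees."""
    import fastsubtrees  # imported here so the sketch loads without the package

    fastsubtrees.PROGRESS_ENABLED = show_progress
    if log_level is not None:
        fastsubtrees.enable_logger(log_level)  # e.g. "INFO" or "DEBUG"
```

Calling `configure_output(show_progress=True, log_level="INFO")` before constructing a tree would then show progress bars and informational messages.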