Library API

Tree() is the main class of this package. An instance of the class represents the tree information.

Construction of the tree representation

To construct the tree representation, the ID of each node and the ID of its parent node are required.

Constructing from a tabular file

Tree.construct_from_tabular(filename, separator="\t", elem_field_num=0, parent_field_num=1) allows the construction of a Tree() object from a tabular file. Header or comment lines starting with a # are ignored. The elem_field_num and parent_field_num parameters are 0-based field numbers of the fields (columns) in the tabular file containing, respectively, element IDs and the IDs of the element parents. Parents can be defined before or after their children.
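
The parsing rules described above can be illustrated with a small pure-Python sketch. It is not the library's implementation: read_pairs is a hypothetical helper that only mimics the documented behaviour (lines starting with # are skipped, the line is split on the separator, and two 0-based fields are picked). Skipping blank lines and letting the root list itself as its own parent are assumptions made for the example.

```python
import os
import tempfile


def read_pairs(filename, separator="\t", elem_field_num=0, parent_field_num=1):
    """Yield (element_id, parent_id) string pairs from a tabular file.

    Hypothetical helper that mirrors the documented rules; it is not
    part of the library.
    """
    with open(filename) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # header or comment line
            line = line.rstrip("\n")
            if not line:
                continue  # assumption: blank lines carry no data
            fields = line.split(separator)
            yield fields[elem_field_num], fields[parent_field_num]


# Node 4 is listed before node 3; the order of the lines does not matter.
EXAMPLE = "# node\tparent\n1\t1\n2\t1\n4\t2\n3\t1\n"

with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write(EXAMPLE)
    example_path = tmp.name

pairs = list(read_pairs(example_path))
os.remove(example_path)
```

For a file with this layout the default arguments of Tree.construct_from_tabular should apply unchanged; other column layouts only need different elem_field_num and parent_field_num values.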

If the tabular file is an NCBI Taxonomy dump nodes file, the method Tree.construct_from_ncbi_dump(filename) can be used instead; it sets the separator and field numbers appropriately.

Constructing from a different data source

To construct the tree from a different data source (e.g. a database), the Tree.construct(generator) class method can be used. The parameter must be a generator that yields pairs of values: the ID of each node and the ID of its parent node. Parents can be defined before or after their children.
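
As a sketch of what such a generator could look like, the following example reads node and parent IDs from an SQLite database. The table name nodes, its columns, the helper names and the import path from fastsubtrees import Tree are assumptions made for illustration; only Tree.construct(generator) is taken from the description above.

```python
import sqlite3


def node_parent_pairs(connection):
    """Yield (node_id, parent_id) pairs from a hypothetical 'nodes' table."""
    cursor = connection.execute("SELECT id, parent_id FROM nodes")
    for node_id, parent_id in cursor:
        yield node_id, parent_id


def build_tree(connection):
    """Construct a tree from the database (requires the package)."""
    # Imported lazily so the generator can be tried without the package
    # installed; the import path is an assumption.
    from fastsubtrees import Tree
    return Tree.construct(node_parent_pairs(connection))


# Small in-memory database; node 5 is inserted before its parent 2.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent_id INTEGER)"
)
connection.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [(1, 1), (5, 2), (2, 1), (3, 1)],
)
example_pairs = list(node_parent_pairs(connection))
```

Keeping the data access in its own generator function means the same function can be reused wherever the documentation asks for a generator of (node, parent) pairs.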

Saving to and loading from file

A tree representation can be stored to a file using the instance method tree.to_file(filename) and re-loaded from that file using the class method Tree.from_file(filename).

Modifying an existing tree representation

A tree representation (constructed from an input source or loaded from a file) can be modified by adding, deleting or moving leaf nodes or entire subtrees.

Adding nodes

To add new nodes, the tree.add_nodes(generator) method is used. The parameter must be a generator that yields pairs of values: the ID of each node and the ID of its parent node. If attributes have been defined (see below), the corresponding files are adapted automatically.

The optional parameter list_added can be set to a list, to which the IDs of added nodes are appended. The optional parameter total can be set to the number of tuples yielded by the generator, for displaying a tqdm progress bar.

The method raises an exception if an added node already exists in the tree, or if a parent node exists neither in the tree nor among the nodes added in the same call to add_nodes.
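
A minimal sketch of how the optional parameters could be combined is shown below. The helper add_batch is hypothetical, and tree is assumed to be an existing Tree instance; the keyword names list_added and total are those described above.

```python
def add_batch(tree, pairs):
    """Add (node_id, parent_id) pairs and return the IDs that were added.

    Hypothetical convenience wrapper around tree.add_nodes.
    """
    pairs = list(pairs)   # materialise once, so the total is known
    added = []            # filled in by add_nodes through list_added
    tree.add_nodes(
        (pair for pair in pairs),  # add_nodes expects a generator
        list_added=added,
        total=len(pairs),          # only used for the progress bar
    )
    return added
```

Since a parent may be added in the same call as its children, a whole new subtree can be passed in one batch, for example add_batch(tree, [(10, 1), (11, 10)]) if node 1 already exists.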

Removing nodes

To remove a subtree, the tree.delete_subtree(node_number) method is used. If the provided ID belongs to a leaf node, only that node is removed. If it belongs to an internal node, the node and the entire subtree under it are removed. If attributes have been defined (see below), the corresponding files are adapted automatically.

The optional parameter list_deleted can be set to a list, to which the IDs of the deleted nodes are appended.

The method will raise an exception if the node_number does not exist.
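
The following sketch shows one way to find out which nodes a deletion actually removed. The helper delete_and_report is hypothetical, and tree is assumed to be an existing Tree instance; list_deleted is the optional parameter described above.

```python
def delete_and_report(tree, node_id):
    """Delete the subtree rooted at node_id and return the removed IDs.

    Hypothetical convenience wrapper around tree.delete_subtree.
    """
    deleted = []  # filled in by delete_subtree through list_deleted
    tree.delete_subtree(node_id, list_deleted=deleted)
    return deleted
```

For a leaf node the returned list is expected to contain a single ID; for an internal node it should also contain the IDs of all nodes below it.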

Moving a node

Any node except the root can be moved by detaching it from its parent and re-attaching it to a different parent node in the tree. For this operation, the tree.move_subtree(subtree_root, new_parent) method is used. If attributes have been defined (see below), the corresponding files are adapted automatically.

The method will raise an exception if the new parent node is not present in the tree.

Updating or resetting a tree

A tree can be updated to reflect the contents of a data source that yields tuples (node_id, parent). Nodes that are present in the tree but not yielded by the data source are removed, nodes with a changed parent are moved, and nodes not yet present in the tree are added. The method for updating a tree is tree.update(generator). It fails if the root is changed or if the tree data is invalid.
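
The three kinds of changes can be illustrated with a small pure-Python sketch. It is not how the library implements update; plan_update is a hypothetical function that only classifies nodes by comparing a {node_id: parent_id} mapping of the existing tree with the pairs from the data source.

```python
def plan_update(current, source_pairs):
    """Classify nodes as removed, moved or added.

    current: dict {node_id: parent_id} describing the existing tree.
    source_pairs: iterable of (node_id, parent_id) from the data source.
    """
    target = dict(source_pairs)
    removed = sorted(n for n in current if n not in target)
    added = sorted(n for n in target if n not in current)
    moved = sorted(
        n for n in target if n in current and current[n] != target[n]
    )
    return {"removed": removed, "moved": moved, "added": added}


current = {1: 1, 2: 1, 3: 1, 4: 2}
source = [(1, 1), (2, 1), (4, 1), (5, 2)]
plan = plan_update(current, source)
# plan == {"removed": [3], "moved": [4], "added": [5]}
```

Here node 3 is missing from the source and would be removed, node 4 changes its parent from 2 to 1 and would be moved, and node 5 is new and would be added.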

Updating can be faster than creating a new tree if there are only a few changes. If there are many changes, it is more efficient to recreate the tree data using the tree.reset(generator) method. This is equivalent to creating a new tree, except that the attribute values of nodes that are still present in the tree after the reset are kept.

Both update and reset have a _from_tabular and a _from_ncbi_dump variant, for working with tabular files and with the NCBI taxonomy nodes dump file, respectively.

Working with attributes

Attributes are stored in files and accessed from them. The filenames are automatically computed from the tree filename. If the tree object has been constructed but not yet saved to file, the filename can be set with tree.set_filename. If the tree topology changes, the attribute value files are adapted accordingly.

Creating attributes

To create an attribute, a source of element IDs and associated attribute values is required. This can be a generator that yields tuples (element_ID, attribute_value), which is passed to the method tree.create_attribute(attribute_name, generator), or a tabular file, whose name is passed to tree.create_attribute_from_tabular(attribute_name, tabular_filename). In both cases, a single-argument casting function can be passed as the argument casting_fn; it converts the values from the source, e.g. from strings to another datatype.
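
A minimal sketch of the generator variant follows. The attribute name genome_size, the example values and the two helper functions are hypothetical, and tree is assumed to be an existing Tree instance; create_attribute and casting_fn are used as described above.

```python
# Hypothetical raw values, stored as strings as they might come from a file.
GENOME_SIZES = {2: "4600000", 3: "12100000"}


def genome_size_values(table):
    """Yield (element_ID, attribute_value) tuples from a dict."""
    for element_id, raw_value in table.items():
        yield element_id, raw_value


def create_genome_size_attribute(tree, table):
    """Create the attribute, converting the raw strings to integers."""
    tree.create_attribute(
        "genome_size",
        genome_size_values(table),
        casting_fn=int,  # applied to each value coming from the generator
    )
```

Passing int as casting_fn keeps the generator free of conversion logic, so the same generator could be reused with a different cast, such as float.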

Modifying attributes

To delete the value of an attribute for a list of nodes, the method tree.delete_attribute_values(attribute_name, nodes_list) is used. To append new values to some nodes, the new values must be passed as a dict in the form {node_id: attr_values_list} to the method tree.append_attribute_values(attribute_name, new_values). This does not touch the existing values. To replace them, instead use tree.replace_attribute_values(attribute_name, new_values). This replaces all values for the nodes in the dictionary and leaves the rest of the values intact.
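
The difference between deleting, appending and replacing can be sketched on a plain dictionary of the form {node_id: attr_values_list}. This only illustrates the semantics described above under the assumption that values behave like per-node lists; the three functions are hypothetical and do not use the library.

```python
def append_values(store, new_values):
    """Add values to the listed nodes, keeping what is already stored."""
    for node_id, values in new_values.items():
        store.setdefault(node_id, []).extend(values)


def replace_values(store, new_values):
    """Overwrite the values of the listed nodes; other nodes are untouched."""
    for node_id, values in new_values.items():
        store[node_id] = list(values)


def delete_values(store, nodes_list):
    """Remove all values of the listed nodes."""
    for node_id in nodes_list:
        store.pop(node_id, None)


store = {2: ["a"], 3: ["b"]}

append_values(store, {2: ["c"], 4: ["d"]})
after_append = {k: list(v) for k, v in store.items()}
# {2: ["a", "c"], 3: ["b"], 4: ["d"]}

replace_values(store, {3: ["x", "y"]})
after_replace = {k: list(v) for k, v in store.items()}
# {2: ["a", "c"], 3: ["x", "y"], 4: ["d"]}

delete_values(store, [2])
after_delete = {k: list(v) for k, v in store.items()}
# {3: ["x", "y"], 4: ["d"]}
```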

It is also possible to edit the list of attribute values for some nodes directly. For this purpose, a dictionary of the form {node_id: attr_values_list} is obtained using the method attrvalues = tree.load_attribute_values(attribute_name). After loading, the dictionary entries are changed by deleting, adding or modifying some values, and the dictionary is finally passed to the method tree.save_attribute_values(tree, attribute_name, attrvalues).

Checking if an attribute exists

To check if an attribute exists, the tree.has_attribute(attribute_name) method can be used. The list of all attributes is returned by tree.list_attributes().

Destroying attributes

To remove an attribute completely, the tree.destroy_attribute(attribute_name) method is used. To remove all attributes, the tree.destroy_all_attributes() method is used.

Subtree queries

List of IDs of a subtree

The list of IDs of the subtree whose root is the node subtree_root is obtained using the method tree.subtree_ids(subtree_root).
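
For small inputs, the result of such a query can be cross-checked against a naive traversal of the (node, parent) pairs. The sketch below is a hypothetical reference function, not the method used by the library. It assumes that the root lists itself as its own parent, and it returns the IDs sorted, since the order returned by the library is not specified here.

```python
from collections import defaultdict


def naive_subtree_ids(pairs, subtree_root):
    """Return the sorted IDs of subtree_root and all of its descendants."""
    children = defaultdict(list)
    for node_id, parent_id in pairs:
        if node_id != parent_id:  # assumption: the root is its own parent
            children[parent_id].append(node_id)
    result, stack = [], [subtree_root]
    while stack:  # iterative depth-first traversal
        node_id = stack.pop()
        result.append(node_id)
        stack.extend(children[node_id])
    return sorted(result)


pairs = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 4)]
subtree_of_2 = naive_subtree_ids(pairs, 2)
# subtree_of_2 == [2, 4, 5]
```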

Attribute values in a subtree

The method tree.query_attribute(subtree_root, attributes) returns the values of one or more attributes for the given subtree, as a dictionary of the form {'attribute_name': [values]}. The optional parameter show_stats can be set to True to compute some statistics and output them to the logger info channel.

To associate the attribute values with node IDs, and optionally collect further information such as the ID of the parent of each node, the method tree.subtree_info(subtree_root, attributes) can be used. Optional parameters can be set to include the subtree sizes (debug information only, since they still count deleted nodes) and the parents (include_parents option). The dictionary keys for the node IDs, subtree sizes and parent IDs can be set using the node_id_key, subtree_size_key and parent_key parameters, respectively.

Verbosity

For slow operations, such as constructing a new tree, optional progress bars (based on the tqdm library) can be displayed. They are enabled by setting fastsubtrees.PROGRESS_ENABLED to True (the default value is False).

The output verbosity can be controlled by setting the log level. By default the logger is disabled. Log messages can be activated by using fastsubtrees.enable_logger("INFO"). For debugging, additional messages can be displayed by using fastsubtrees.enable_logger("DEBUG").
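
Both settings can be combined in a small helper, sketched below. The function enable_verbose_output is hypothetical; it only uses the PROGRESS_ENABLED flag and the enable_logger function named above, and it imports the package lazily so that defining the helper does not require the package to be installed.

```python
def enable_verbose_output(level="INFO"):
    """Turn on progress bars and log messages at the given level."""
    import fastsubtrees  # imported when the helper is called

    fastsubtrees.PROGRESS_ENABLED = True  # show tqdm progress bars
    fastsubtrees.enable_logger(level)     # "INFO" or "DEBUG"
```

Calling enable_verbose_output("DEBUG") before a slow operation, such as constructing a tree from a large file, would then show both the progress bar and the additional debugging messages.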