Command line interface¶
The command line interface of the library consists of a script, fastsubtrees,
that can be used to construct and modify a tree, add attributes to nodes, and
query subtree IDs and attribute values.
The list of subcommands is displayed using fastsubtrees --help.
Using --help after a subcommand (e.g. fastsubtrees tree --help)
displays the syntax and details of the subcommand.
The following subcommands are available:
tree Create or modify a tree.
attribute Create, modify or remove an attribute.
query List node IDs and/or attributes in a subtree.
Tree construction¶
The subcommand fastsubtrees tree is used to construct the tree
representation from data consisting of node IDs and the corresponding parent
IDs. The data can be obtained from a tabular file or from a different data
source.
Construction from a tabular file¶
If the IDs of the elements and their parents are contained in a tabular
file, the filename is given as an argument to fastsubtrees tree, e.g.:
fastsubtrees tree my.tree
Details of the format can be specified using options. The separator
is specified using --separator (default: tab), the columns containing
the IDs using --elementscol and --parentscol (as 1-based column numbers,
default: 1 and 2), and the prefix of comment/header lines using
--commentchar (default: #).
When using nodes.dmp from the NCBI taxonomy tree dump, the preset --ncbi
can be used.
Example:
fastsubtrees tree my.tree --ncbi ntdumpsdir/nodes.dmp
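To make the meaning of the format options more concrete, the following Python sketch extracts element and parent IDs from tabular lines. It is not the code used by fastsubtrees: the function name read_element_parent_ids is hypothetical, its parameters only mirror the options described above, and the conversion of IDs to integers is an assumption.

```python
def read_element_parent_ids(lines, separator="\t", elementscol=1,
                            parentscol=2, commentchar="#"):
    """Yield (element_id, parent_id) pairs from an iterable of text lines.

    Hypothetical helper: the parameter names mirror the command line
    options, but this is not the fastsubtrees implementation.
    """
    for line in lines:
        # Skip comment/header lines and empty lines
        if line.startswith(commentchar) or not line.strip():
            continue
        fields = line.rstrip("\n").split(separator)
        # Column numbers are 1-based, Python indices are 0-based
        element = fields[elementscol - 1].strip()
        parent = fields[parentscol - 1].strip()
        # Assumption of this sketch: IDs are integers
        yield int(element), int(parent)


# In this example the root (1) is listed as its own parent,
# as is the case in the NCBI nodes.dmp file
example = ["# element\tparent\n", "1\t1\n", "2\t1\n", "3\t2\n"]
print(list(read_element_parent_ids(example)))
```

With the default values, the first two tab-separated columns are read and lines starting with # are ignored.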
Generalized tree construction¶
In the generalized tree construction mode, the path to a Python module is
passed using the option --module. The module defines a function that yields
the node and parent IDs. The default function name is element_parent_ids;
it can be changed using the option --fn.
All positional arguments given to the script are passed to the function.
If they contain a =, they are passed as keyword arguments (unless
the option --nokeys is used).
Examples:
fastsubtrees tree my.tree --module my_module.py a b c
fastsubtrees tree my.tree --module my_module.py k1=v1 k2=v=2 x --fn myfn
fastsubtrees tree my.tree --module my_module.py k1=v1 k2=v=2 x --nokeys
These are the function calls made in the three cases:
element_parent_ids("a", "b", "c")
myfn("x", k1="v1", k2="v=2")
element_parent_ids("k1=v1", "k2=v=2", "x")
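The rule for turning script arguments into positional and keyword arguments can be sketched in Python as follows. The helper split_arguments is hypothetical and is not taken from the fastsubtrees source code; it only reproduces the behavior shown in the calls above, including splitting at the first = so that k2=v=2 becomes k2="v=2".

```python
def split_arguments(args, nokeys=False):
    """Split script arguments into positional and keyword arguments.

    Hypothetical helper illustrating the rule described above; it is not
    taken from the fastsubtrees source code.
    """
    positional, keyword = [], {}
    for arg in args:
        if "=" in arg and not nokeys:
            # Split at the first '=' only, so 'k2=v=2' gives k2='v=2'
            key, value = arg.split("=", 1)
            keyword[key] = value
        else:
            positional.append(arg)
    return positional, keyword


def element_parent_ids(*args, **kwargs):
    # Stand-in for a user-defined function; it only reports how it was called
    return args, kwargs


positional, keyword = split_arguments(["k1=v1", "k2=v=2", "x"])
print(element_parent_ids(*positional, **keyword))

# With nokeys=True all arguments stay positional
positional, keyword = split_arguments(["k1=v1", "k2=v=2", "x"], nokeys=True)
print(element_parent_ids(*positional, **keyword))
```

Note that all values arrive as strings, so a function expecting numbers has to convert them itself.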
A module implementing the described interface for reading
from a tabular file is provided under
fastsubtrees/ids_modules/ids_from_tabular_file.py.
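A user-defined module for a different data source could look like the following sketch. Everything specific in it is hypothetical: the JSON input format (a list of records), the parameter names idkey and parentkey, and the input file name used below. It also assumes that the function yields (element_id, parent_id) tuples of integers, in this order.

```python
# my_module.py (hypothetical example module)
import json


def element_parent_ids(filename, idkey="id", parentkey="parent"):
    """Yield (element_id, parent_id) pairs read from a JSON file.

    Assumptions of this sketch: the file contains a list of records, and
    each record stores the node ID and the parent ID under the given keys.
    """
    with open(filename) as f:
        records = json.load(f)
    for record in records:
        # Assumption: IDs are integers
        yield int(record[idkey]), int(record[parentkey])
```

Following the argument passing rules described above, such a module would be used as:
fastsubtrees tree my.tree --module my_module.py nodes.json parentkey=parent_id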
Modifying an existing tree representation¶
Existing tree representations can be modified using fastsubtrees tree
with the options --update, --reset, --add or --delete.
Updating or resetting a tree¶
If the option --update or --reset is provided, the tree is modified so as
to reflect the given source of IDs of elements and their parents (tabular file
or Python function). The result is a tree that is functionally equivalent to
a new tree constructed from the same data source.
The difference between the two is that --update edits the existing tree and
attribute data, while --reset recomputes the tree data from scratch: the
attribute values are dumped first, and the attribute files are then
reconstructed from the dumped values.
The common advantage of using --update or --reset is that the attribute
files are not lost.
Generally the reset operation performs better, since tree creation is fast.
The update operation can be faster, if the tree is large and is only slightly
modified.
Adding leaf nodes or new subtrees¶
If the option --add is used, new elements are added to an existing tree.
The elements must not yet be present in the tree, and each of them must be
connected to a node that is already present in the tree or is added in the
same operation.
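The two conditions on added elements can be illustrated with a small Python check. The function check_addable is a hypothetical helper and not part of fastsubtrees; it assumes that the new elements are given as (element_id, parent_id) pairs and that their order within the operation does not matter.

```python
def check_addable(tree_ids, new_pairs):
    """Return a list of problems found in the (element_id, parent_id) pairs.

    Hypothetical helper, not part of fastsubtrees. An empty list means
    that both conditions described above are satisfied.
    """
    tree_ids = set(tree_ids)
    new_parent = {}
    problems = []
    for element, parent in new_pairs:
        # Condition 1: the element must not be present yet
        if element in tree_ids or element in new_parent:
            problems.append(f"{element}: already present")
        else:
            new_parent[element] = parent
    for element in new_parent:
        # Condition 2: walk up through the nodes added in the same
        # operation until a node of the existing tree is reached
        node, seen = element, set()
        while node in new_parent and node not in seen:
            seen.add(node)
            node = new_parent[node]
        if node not in tree_ids:
            problems.append(f"{element}: not connected to the tree")
    return problems


existing = {1, 2}
print(check_addable(existing, [(3, 2), (4, 3)]))  # 4 hangs under the new node 3
print(check_addable(existing, [(2, 1), (5, 9)]))  # 2 exists, 9 is unknown
```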
Removing leaf nodes or subtrees¶
If the option --delete is used, all remaining positional arguments of the
script are IDs of nodes. If a node is a leaf node, it is removed from the tree.
If it is an internal node, the entire subtree under that node is removed.
Attributes when editing a tree¶
If attributes have been defined, as described in the following section, the attribute files are automatically detected and modified as well when nodes are added or deleted.
If nodes have been added by --add or --update, new attribute values for
those nodes can be added using fastsubtrees attribute --add, as explained below.
Tree attributes¶
Besides the IDs, the tree can contain further information in the form of
attributes. Attribute values can be integers, floats or strings.
Not all nodes necessarily have an attribute value associated
with them. Some nodes can have multiple values for an attribute.
Attributes are managed by the subcommand fastsubtrees attribute.
Adding an attribute¶
To create a new attribute, a source of attribute values is
passed to fastsubtrees attribute. As in the tree construction case,
the source can be a tabular file or a Python module specifying a function
that yields node IDs and attribute values.
The same node ID can appear multiple times, in which case the
attribute values will all be stored, as a list.
By default, attribute values are stored as strings. The option
--type f can be used to apply a function f() to each attribute
value. The function can be either from the standard library (e.g.
int or float) or, if --module is used, from that module.
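The combined effect of repeated node IDs and of the --type option can be pictured with the following Python sketch. The function collect_attribute_values and the in-memory dictionary are hypothetical and serve only as an illustration; how fastsubtrees actually stores attribute values on disk is not shown here.

```python
from collections import defaultdict


def collect_attribute_values(pairs, type_fn=str):
    """Group attribute values by node ID, converting each with type_fn.

    Hypothetical helper: type_fn plays the role of the function given
    with --type, and the default (str) keeps the values as strings.
    """
    values = defaultdict(list)
    for node_id, raw_value in pairs:
        # A node ID appearing several times gets a list with all its values
        values[node_id].append(type_fn(raw_value))
    return dict(values)


pairs = [(2, "10"), (3, "7"), (2, "12")]
print(collect_attribute_values(pairs))               # values kept as strings
print(collect_attribute_values(pairs, type_fn=int))  # values converted by int()
```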
Attribute values from a tabular file¶
To create an attribute from a tabular file, the filename is passed, e.g.
fastsubtrees attribute my.tree myattr value.tsv
Also in this case, the format options can be used for changing the separator
and the comment character, and for specifying the columns containing the IDs
of the nodes (--elementscol) and the attribute values (--valuescol).
Generalized source of attribute values¶
As a generalized source of attribute values, the path to a Python module
is passed using the option --module. The module defines a function that yields
tuples (node_ID, attribute_value).
The default function name is attribute_values; it can be changed using
the option --fn.
All positional arguments given to the script are passed to the function.
If they contain a =, they are passed as keyword arguments (unless
the option --nokeys is used).
Examples:
fastsubtrees attribute my.tree myattr --module my_module.py a b c
fastsubtrees attribute my.tree myattr --module my_module.py k1=v1 k2=v=2 x --fn myfn
fastsubtrees attribute my.tree myattr --module my_module.py k1=v1 k2=v=2 x --nokeys
These are the function calls made in the three cases:
attribute_values("a", "b", "c")
myfn("x", k1="v1", k2="v=2")
attribute_values("k1=v1", "k2=v=2", "x")
An example module implementing the described interface is provided
under fastsubtrees/ids_modules/attrs_from_tabular_file.py.
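A user-defined module for attribute values could be written along the following lines. This is only a sketch: the CSV layout, the column name node_id, the parameter column and the file name used below are all hypothetical, and node IDs are assumed to be integers.

```python
# my_module.py (hypothetical example module)
import csv


def attribute_values(filename, column="genome_size"):
    """Yield (node_ID, attribute_value) tuples from a CSV file with a header.

    Assumptions of this sketch: the file has a column named 'node_id', and
    rows without a value in the requested column are skipped.
    """
    with open(filename, newline="") as f:
        for row in csv.DictReader(f):
            value = row[column]
            if value:
                # Values are yielded as strings; a node ID may be
                # yielded more than once
                yield int(row["node_id"]), value
```

Following the argument passing rules described above, such a module would be used as:
fastsubtrees attribute my.tree myattr --module my_module.py values.csv column=gc_content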
Listing defined attributes¶
To list the attributes that have been created, use fastsubtrees attribute
with the option --list.
Editing attribute values¶
To add new values for an attribute, fastsubtrees attribute with the option
--add is used. New values of the attribute for a node are appended to the
existing ones. If the existing values should be replaced by the new ones, use
the option --replace instead of --add.
To remove the values of an attribute for a list of given nodes,
use fastsubtrees attribute --delete, specifying the nodes.
To remove an attribute completely, use fastsubtrees attribute --delete
without specifying any node.
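The difference between adding, replacing and deleting values can be summarized with a small Python model, in which an attribute is represented as a dictionary mapping node IDs to lists of values. The three functions are hypothetical and do not belong to fastsubtrees. The sketch also assumes that replacing only affects the nodes for which new values are given.

```python
def add_values(attribute, new_values):
    """Append new values to the existing ones (as with --add)."""
    for node_id, values in new_values.items():
        attribute.setdefault(node_id, []).extend(values)


def replace_values(attribute, new_values):
    """Replace the existing values of the given nodes (as with --replace)."""
    for node_id, values in new_values.items():
        attribute[node_id] = list(values)


def delete_values(attribute, node_ids):
    """Remove all values of the given nodes (as with --delete and node IDs)."""
    for node_id in node_ids:
        attribute.pop(node_id, None)


myattr = {2: ["a"], 3: ["b"]}
add_values(myattr, {2: ["c"], 4: ["d"]})
print(myattr)  # node 2 now has two values, node 4 is new
replace_values(myattr, {2: ["e"]})
print(myattr)  # the values of node 2 are replaced
delete_values(myattr, [3])
print(myattr)  # node 3 has no values anymore
```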
Subtree queries¶
The subcommand fastsubtrees query loads a tree representation from a file
and performs a subtree query, returning a list of node IDs and/or attributes
of the subtree under a given node.
To run the query, two parameters are required:
tree: File containing the tree.
subtreeroot: ID of the root of the subtree for which the IDs have to be queried.
To query the values of attributes in a subtree, the attribute names are passed as further arguments after the subtree root argument. The output is tabular, with a header line that summarizes the content of each column.
To hide the node IDs when attributes are printed, use the option
--attributes-only. In this case, only nodes for which
some attribute value exists are shown, unless the option --missing is used.
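What a subtree query returns can be pictured with the following naive Python model. It is not how fastsubtrees stores or queries the tree, and the function names, the dictionary-based tree and the row format are hypothetical. The sketch further assumes that the subtree root itself is part of the result, and it does not try to reproduce the node order or the output format of the real command.

```python
def subtree_ids(parents, root):
    """Return the IDs of the subtree under root (root included)."""
    children = {}
    for node, parent in parents.items():
        if node != parent:  # a root listed as its own parent has no parent
            children.setdefault(parent, []).append(node)
    result, stack = [], [root]
    while stack:
        node = stack.pop()
        result.append(node)
        stack.extend(reversed(children.get(node, [])))
    return result


def query(parents, attribute, root, attributes_only=False, missing=False):
    """Return one row per node, mimicking the options described above."""
    rows = []
    for node in subtree_ids(parents, root):
        values = attribute.get(node)
        if attributes_only:
            # Nodes without values are skipped unless 'missing' is set
            if values is None and not missing:
                continue
            rows.append((values,))
        else:
            rows.append((node, values))
    return rows


parents = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}
myattr = {3: ["a"], 4: ["b", "c"]}
print(subtree_ids(parents, 2))
print(query(parents, myattr, 2))
print(query(parents, myattr, 2, attributes_only=True))
print(query(parents, myattr, 2, attributes_only=True, missing=True))
```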