NtMirror¶
NtMirror is a tool for creating and keeping up-to-date a local mirror of the NCBI Taxonomy database.
Requirements¶
The software is distributed as a Python 3 package.
Installation with pip is only possible if the mariadb module is installed (which requires installation of MariaDB and its C and Python connectors).
CLI¶
The CLI has been developed and tested exclusively using MariaDB as an RDBMS.
Loading into the database¶
The script ntmirror-dbload is used to load the dump files into the database. It takes as arguments the database username, its password, the database name and the path to the connection socket, followed by the directory where the dump files are located.
If the database tables do not exist, they are created. If dump files are found, they are loaded into the database and deleted. If no dump file is found, then nothing is done.
The exit code of the script is 0 on success and 1 if an error occurs. If the option --exitcode is used and no dump file is found, the exit code of the script is 100 (instead of 0).
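A calling script can branch on these exit codes, for example to skip follow-up work when no new dump was found. The following is a minimal sketch and not part of ntmirror: the helper names describe_exit_code and run_and_describe are hypothetical, and the command list (including where --exitcode is placed on the command line) is left to the caller.

```python
import subprocess
import sys

# Exit code reported when --exitcode is used and no dump file is found.
NO_DUMPS_EXIT_CODE = 100


def describe_exit_code(code):
    """Translate an exit code into a short label (hypothetical helper)."""
    if code == 0:
        return "success"
    if code == NO_DUMPS_EXIT_CODE:
        return "no-dumps"
    # The script documents 1 for errors; any other value is treated alike.
    return "error"


def run_and_describe(command):
    """Run a command given as a list of strings and label its exit code."""
    completed = subprocess.run(command, check=False)
    return describe_exit_code(completed.returncode)


if __name__ == "__main__":
    # Stand-in child process that exits with 100, so that the sketch can be
    # tried without a database; replace it with the real ntmirror-dbload call.
    stand_in = [sys.executable, "-c", "raise SystemExit(100)"]
    print(run_and_describe(stand_in))
```

Returning a label instead of acting directly keeps the decision (for example, whether to rerun a subtree extraction) in the calling code.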
Subtree search using hierarchical SQL¶
To list the IDs of a subtree of the NCBI taxonomy tree, the ntmirror-extract-subtree script can be used. It takes as arguments the database username, its password, the database name and the path to the connection socket, followed by the subtree root ID.
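The idea of a hierarchical subtree query can be illustrated with a recursive common table expression. The sketch below is an assumption about the general technique, not the query used by ntmirror: the table name nodes and the columns tax_id and parent_tax_id are hypothetical, the tree is a toy example, and SQLite (from the Python standard library) stands in for MariaDB so that the example is self-contained.

```python
import sqlite3

# Recursive CTE: start at the root, then repeatedly add the children of
# the nodes collected so far. Table and column names are hypothetical.
SUBTREE_QUERY = """
WITH RECURSIVE subtree(tax_id) AS (
    SELECT tax_id FROM nodes WHERE tax_id = :root
    UNION ALL
    SELECT n.tax_id
    FROM nodes AS n
    JOIN subtree AS s ON n.parent_tax_id = s.tax_id
    -- the NCBI root node is its own parent; skip it to avoid a loop
    WHERE n.tax_id <> n.parent_tax_id
)
SELECT tax_id FROM subtree ORDER BY tax_id
"""


def subtree_ids(conn, root_id):
    """Return the sorted IDs of the subtree rooted at root_id."""
    rows = conn.execute(SUBTREE_QUERY, {"root": root_id})
    return [row[0] for row in rows]


def build_toy_db():
    """Create an in-memory database holding a small toy tree."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes ("
        "tax_id INTEGER PRIMARY KEY, parent_tax_id INTEGER NOT NULL)"
    )
    # (tax_id, parent_tax_id); node 1 is the root and points to itself
    toy_tree = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 1)]
    conn.executemany("INSERT INTO nodes VALUES (?, ?)", toy_tree)
    return conn


if __name__ == "__main__":
    print(subtree_ids(build_toy_db(), 2))
```

The filter on self-parented nodes matters for the real taxonomy, where the root node lists itself as its parent; without it, the UNION ALL recursion would not terminate when started at the root.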
Example usage¶
The following example uses the ntdownload package to download the dumps, then uses ntmirror to load them into the database and extract a subtree.
ntdownload ntdumpsdir
ntmirror-dbload myuser mypass mydb /path/to/db.socket ntdumpsdir
ntmirror-extract-subtree myuser mypass mydb /path/to/db.socket 562
API¶
Database setup¶
To create the database tables, a SQLAlchemy connection object is necessary. It is passed to the dbschema.create(connection) function, which creates the tables if they do not exist yet.
Loading the data using MariaDB¶
With MariaDB, data loading is performed using the mysql library rather than SQLAlchemy, since this is faster.
The dbloader_mysql.load_all() function is used for this. It takes the path of the directory containing the dump files, followed by the database hostname, the database username, its password, the database name and the path to the connection socket.
If no dump files are found, nothing happens. Otherwise, the dump files are loaded into the database and the function returns a list of tuples (filepfx, filepath), one for each dump file which was loaded into the database.
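The returned (filepfx, filepath) tuples can be used, for instance, to log which dump files were processed. The following sketch makes several assumptions: the helper group_loaded_files is hypothetical, the example prefixes and paths are invented for illustration, and an empty or missing result is treated as "nothing loaded", since the return value in that case is not documented here.

```python
def group_loaded_files(loaded):
    """Map each file prefix to the list of paths loaded for it.

    `loaded` is expected to look like the documented return value of
    load_all(): a sequence of (filepfx, filepath) tuples. An empty
    sequence or None is treated as "nothing was loaded" (assumption).
    """
    by_prefix = {}
    for filepfx, filepath in loaded or []:
        by_prefix.setdefault(filepfx, []).append(filepath)
    return by_prefix


if __name__ == "__main__":
    # Invented example data standing in for a real load_all() result.
    example = [
        ("names", "ntdumpsdir/names.dmp"),
        ("nodes", "ntdumpsdir/nodes.dmp"),
    ]
    for prefix, paths in group_loaded_files(example).items():
        print(prefix, paths)
```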
Example usage¶
The following example uses the ntdownload package to download the dumps, then uses ntmirror to create the database tables and load the dumps into the database.
from ntdownload import Downloader
from ntmirror import dbschema, dbloader_mysql
# this assumes that a SQLAlchemy connection object is available as `connection`
dbschema.create(connection)
d = Downloader("ntdumpsdir")
d.run()
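# arguments after the directory: hostname, username, password,
# database name and socket path, in the order described above
dbloader_mysql.load_all("ntdumpsdir", dbhostname, dbusername, dbpassword,
                        dbname, path_to_db_socket)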
Loading the data using another RDBMS¶
Data loading using SQLAlchemy was also implemented, so that the package can be used with other RDBMS, although it is slower than using the dbloader_mysql module.
To load the dump files into the database, the dbloader_sqlalchemy module can be used. The database must implement a LOAD DATA LOCAL INFILE SQL command.
The dbloader_sqlalchemy version of load_all() takes two arguments: the dump file directory and a SQLAlchemy connection object.
Tests¶
To run the test suite, a YAML configuration file for the database connection must be provided. It must specify the database name, the database username and password, the hostname and port, the driver name and the path to the connection socket file.
For example:
database: ntmirror_test
username: ntmirror_user
password: ntmirror_pass
host: localhost
port: 3306
socket: ntmirror.sock
drivername: "mysql+mysqldb"
Then the test suite is run using pytest (or make tests).
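The configuration keys shown in the example above correspond to the parts of a SQLAlchemy database URL. As an illustration only, the sketch below builds such a URL string from a dictionary holding the same keys. How the test suite actually consumes the file is not described here; the helper build_db_url is hypothetical, and passing the socket path as a unix_socket query parameter is an assumption that fits the mysql+mysqldb driver.

```python
from urllib.parse import quote_plus, urlencode


def build_db_url(config):
    """Build a SQLAlchemy-style URL string from a configuration dict.

    Hypothetical helper: the keys mirror the YAML example
    (database, username, password, host, port, socket, drivername).
    """
    # Percent-encode the credentials in case they contain special characters.
    auth = quote_plus(config["username"]) + ":" + quote_plus(config["password"])
    location = f'{config["host"]}:{config["port"]}'
    url = f'{config["drivername"]}://{auth}@{location}/{config["database"]}'
    if config.get("socket"):
        # Assumption: the driver accepts the socket path as `unix_socket`.
        url += "?" + urlencode({"unix_socket": config["socket"]})
    return url


if __name__ == "__main__":
    example_config = {
        "database": "ntmirror_test",
        "username": "ntmirror_user",
        "password": "ntmirror_pass",
        "host": "localhost",
        "port": 3306,
        "socket": "ntmirror.sock",
        "drivername": "mysql+mysqldb",
    }
    print(build_db_url(example_config))
```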