Developer Guide
===============

Stelaro is a metagenomic software tool written in Rust that can be used through the command line and
a Python binding.


Data Analysis Pipeline
++++++++++++++++++++++

.. note::

   This pipeline is subject to changes

The pipeline **classifies contigs** into taxonomic groups to profile metagenomic samples and
**identifies annotated sequences** in metagenomic samples such as antimicrobial resistance genes.

.. image:: images/pipeline.svg


+-----------------------+---------------------------------------+----------------------------------+
| Component             | Integration / Implementation          | Progress                         |
+=======================+=======================================+==================================+
| Reference genome      | Rust interface of the NCBI database   | Completed                        |
| database              |                                       |                                  |
+-----------------------+---------------------------------------+----------------------------------+
| Metagenomic reads     | Domain-specific datasets              | Not tested                       |
+-----------------------+---------------------------------------+----------------------------------+
| Taxonomic database    | Rust interface to GTDB                | Not done                         |
+-----------------------+---------------------------------------+----------------------------------+
| Annotation database   | Rust interface to CARD                | Not done                         |
+-----------------------+---------------------------------------+----------------------------------+
| Read simulator        | Rust program                          | Done                             |
+-----------------------+---------------------------------------+----------------------------------+
| Quality control       | External tools (FastQC, Trimmomatic)  | Not integrated                   |
+-----------------------+---------------------------------------+----------------------------------+
| Sequence assembler    | Rust program or external tool         | Not done                         |
+-----------------------+---------------------------------------+----------------------------------+
| Taxonomic profiler    | Rust program or external tool         | Not done                         |
+-----------------------+---------------------------------------+----------------------------------+
| Functional element    | Attention-based neural network        | Not done                         |
| identifier            |                                       |                                  |
+-----------------------+---------------------------------------+----------------------------------+


Input
-----

- **Quality-controlled reads** from a sequencing device.
   - The pipeline itself does not comprise facilities that can trim adapters and discard
     low-quality reads.
   - External tools like Trimmomatic are to be used to pre-process raw reads.
- **Synthetic reads** generated from reference genomes. Used when training the models. The reference
  genomes are taken from the NCBI.
   - A **read simulator** generates data from taxonomic profiles :cite:`fritz2019`. This is more
     representative of real metagenomic samples than random sampling.
- A **Taxonomy database** to profile the reads such as GTDB or the one of the NCBI.
- A **Sequence annotation database** to identify elements of interest such as antimicrobial
  resistance genes. Retrieved from the CARD database :cite:`hackenberger2024`.


Data Processing
---------------

- The **sequence assembler** creates contigs from pre-processed sequence reads.
   - This component uses either De Bruijn graphs or overlap-layout-consensus to assemble the reads.
   - The assembler will be either hardware-accelerated on GPUs or performed by an external tool.
- The **sequence processor** converts the contigs into a compressed format.
   - Techniques used in natural language processing, such as tokenization (BERT), are to be used by
     this component.
   - Conversion into tetra-mers are also considered.
- The **taxonomic binning model** uses processed contigs to assign them to taxonomic profiles.
   - This is currently envisioned as an attention-based neural network, but a rule-based program
     will be used if performances are disappointing.
- The **functional element identifier** finds relevant elements in the genomes, such as AMR genes.
   - This is envisioned as an attention-based neural network.


Output
------

- **Taxonomic profiles** ascribing taxonomic levels to reads.
- **Functional element predictions** ascribing potential functions to reads.


Organization
++++++++++++

The source code is organized as follows:

- `src`: Rust source code.
   - `data`: Download data.
   - `io`: Functions to read and write genome sequence files.
   - `kernels`: Hardware acceleration kernels.
   - `utils`: Utility modules (e.g. console output formatting).


References
----------

.. bibliography:: references.bib