.. _user-guide: User Guide ========== This page presents how to use the ``diurnal`` library to predict RNA secondary structures. The information presented on this page is also available in a `Jupyter notebook `_. List of Modules --------------- The ``diurnal`` library comprises the following modules. The ``numpy`` and ``scikit-learn`` libraries must be installed in the Python virtual environment to use them. - `diurnal.database `__: Download and use RNA datasets. - `diurnal.evaluate `__: Evaluate the performances of predictive models. - `diurnal.family `__: Manipulate data related to RNA families. - `diurnal.structure `__: Format RNA structures into different representations. - `diurnal.train `__: Prepare data for training models. - `diurnal.visualize `__: Visualize data with ``matplotlib`` graphs or output in the terminal. The library also contains the `diurnal.models `__ subpackage that comprises predictive models. The `pytorch `_ library must be installed to use this package. Usage ----- Download Data ^^^^^^^^^^^^^ The code snippet below illustrates how to download RNA datasets: .. code-block:: python from diurnal import database # `data` is the directory in which data is downloaded. `archiveII` is the # name of the dataset. database.download("./data/", "archiveII") This code will download the `archiveII` dataset and write the raw data into the `data` directory. If the directory does not exist, it is created. Three datasets can be downloaded: - `archiveII`: 3975 molecules. Originally retrieved from ``_ :cite:`rochester`. - `RNASTRalign`: 37149 molecules. Originally retrieved from ``_ :cite:`rochester`. - `RNA_STRAND`: 4666 molecules. Originally retrieved from ``_ :cite:`rnastrand`. One may instead use the function ``diurnal.database.download_all()`` to download all available datasets. Molecules are represented in these datasets with CT files, which is a tab-separated format. The first line of a CT file contains (1) the number of nucleotides in the molecule and (2) the name of the molecule. The other lines of the files all follow the same syntax, which is presented in the table below: +---------+-------------------------------------------------------------------+ | Column | Content | +=========+===================================================================+ | 1 | Index of the nucleotide (starts at 1) | +---------+-------------------------------------------------------------------+ | 2 | Nucleotide (A, C, G, or U) | +---------+-------------------------------------------------------------------+ | 3 | Neighboring nucleotide index in the 5' direction (i.e. upstream) | +---------+-------------------------------------------------------------------+ | 4 | Neighboring nucleotide index in the 3' direction (i.e. downstream)| +---------+-------------------------------------------------------------------+ | 5 | Index of the paired nucleotide. If the nucleotide is unpaired, | | | the value ``0`` is used. | +---------+-------------------------------------------------------------------+ | 6 | Index of the nucleotide (same as column 1) | +---------+-------------------------------------------------------------------+ One can read the content of CT files with the following code: .. code-block:: python # Read the first molecule of the `archiveII` dataset. filename = "./data/archiveII/5s_Acanthamoeba-castellanii-1.ct" name, primary, secondary = diurnal.utils.rna_data.read_ct_file(filename) which returns the following data: - The **name** of the molecule. - The **primary structure** of the molecule, represented as a list of ``A``, ``C``, ``G``, and ``U`` characters. - The **secondary structure** of the molecule, represented as a list of pairings in which the *i* th element of the list is paired with the index that it contains (for instance, ``(((...)))`` can be represented as ``[8, 7, 6, -1, -1, -1, 2, 1, 0]``). Since Python uses zero-based indices, ``-1`` is used for unpaired nucleotides instead of ``0`` like CT files do. Format Data ^^^^^^^^^^^ The RNA structure data of CT files must be converted into numerical vectors to train predictive models. The module ``diurnal.structure`` can encode the data into other formats, as shown below: .. code-block:: python from diurnal.structure import Primary, Secondary # Encode the list of bases into a one-hot vector. For instance, if `primary` # contains the value `['A', 'C']`, the encoded structure will be # `[[1, 0, 0, 0], [0, 1, 0, 0]]`. primary_onehot = Primary.to_onehot(primary) # Encode the list of pairings into a one-hot vector. For instance, if # `secondary` contains the value `[2, -1, 0]`, the encoded structure will be # `[[1, 0, 0], [0, 1, 0], [0, 0, 1]]`, which correspond to `(.)`. secondary_onehot = Secondary.to_onehot(secondary) # Obtain the list of bases from an encoded vector. primary = Primary.to_sequence(primary_onehot) # Obtain the bracket notation from an encoded vector. bracket = Secondary.to_bracket(secondary_onehot) For convenience, the library can encode a whole dataset of CT files into another representation and store them in Numpy files. Users can subsequently read these already-formatted files instead of reading CT files every time. The following code snippet shows how to do that: .. code-block:: python from diurnal import database, structure database.format_basic( "./data/archiveII", # Directory of the raw data to format. "./data/formatted", # Formatted data output directory. 512, # Normalized size. Short molecules are padded at the 3' end. structure.Primary.to_onehot, # Primary to vector map. structure.Secondary.to_onehot # Secondary to vector map. ) Executing this function will generate the following files: - ``families.npy``: Encoded RNA families. - ``readme.rst``: Metadata such as the file creation date. - ``names.txt``: The list of molecule names. - ``primary_structures.npy``: Encoded primary structures. - ``secondary_structures.npy``: Encoded secondary structures. The ``.npy`` files can be read with the function ``numpy.load(filename)``, which returns a ``numpy.array`` object. Prepare Data for Training ^^^^^^^^^^^^^^^^^^^^^^^^^ Formatted data can be loaded and split for training. In the context of RNA secondary structure prediction, there are a few ways to divide data: - In **inter-family testing** (also called *family-wise cross-validation* by Sato et al. :cite:`mxfold2`), the model is trained and tested with datasets that comprise different RNA families. Therefore, training and testing data are structurally different. The point of this type of training is to measure how well the model can predict the structure of unfamiliar molecules. - In **sequence testing** (also called *sequence-wise cross-validation* by Sato et al. :cite:`mxfold2`), the model is trained and tested with datasets that comprise the same RNA families. Therefore, training and testing data are structurally similar. Consequently, this type of testing is expected to yield more accurate results than inter-family testing. - In **intra-family testing**, models are trained and tested with RNA molecules that belong to the same family. Therefore, training and testing data are structurally very similar and results are expected to yield more accurate results than sequence testing. This type of testing does not appear to be discussed in published work, but it can be useful to validate models. The code snippet below shows how to load data for inter-family and sequence testing. .. code-block:: python from diurnal import train, family # Inter-family testing. test_set = train.load_families("./data/formatted", "5s") train_set = train.load_families("./data/formatted", family.all_but("5s")) # Sequence testing. data = train.load_data("./data/formatted", randomize = True) # Divide data in training (80 % of points) and test sets (20 % of points). train_set, test_set = train.split_data(data, [0.8, 0.2]) One may also divide data to perform K-fold validation, as shown below: .. code-block:: python from diurnal import train # Do five K-fold splits. K = 5 data = train.load_data("./data/formatted", randomize = True) for i in range(K) train_set, test_set = train.k_fold_split(data, [0.8, 0.2], K, i) # Train and test a model for this K-split. Train Models ^^^^^^^^^^^^ One can load predictive models comprised within the ``diurnal`` library, as demonstrated in the code snippet below: .. code-block:: python import torch from diurnal import models # Load a `diurnal` neural network based on the `Dot_Bracket` architecture. model = models.NN( model=models.networks.cnn.Dot_Bracket, N=SIZE, n_epochs=3, optimizer=torch.optim.Adam, loss_fn=torch.nn.MSELoss, optimizer_args={"eps": 1e-4}, loss_fn_args=None, verbosity=1) # Train the model model.train(train_set) In the example above, the class ``diurnal.models.NN`` is a wrapper around a ``pytorch`` neural network. Another type of ``diurnal`` models are baselines, which make basic predictions. For instance, in the code below, .. code-block:: python from diurnal.models import baseline model = baseline.Random() model.train(train_set) the model makes random predictions. This can be useful to compare performances with other models and ensure that the data processing pipeline works well. Predict Structures ^^^^^^^^^^^^^^^^^^ You can predict structures as shown below: .. code-block:: python from diurnal import structure # Assume that `model` is a trained `diurnal.model` object. The method # `predict` accepts primary structures encoded in the same format # that was used for training (in this case, one-hot encoding). primary_structure = list("AAAACCCCUUUU") encoded_primary_structure = structure.Primary.to_onehot(primary_structure) prediction = model.predict(encoded_primary_structure) The data format returned by the ``predict`` method depends on the architecture of the ``model`` object. For example, a model may return a one-hot encoded bracket notation of the secondary structure. Evaluate Results ^^^^^^^^^^^^^^^^ The are two main ways to evaluate secondary structure predictions. The first and most widespread method consists in using the **recall** and **precision** :cite:`cnnfold` :cite:`mxfold2` :cite:`attfold` :cite:`ufold` :cite:`cdpfold`. This evaluation method uses the following metrics: - True positives (TP): number of paired bases that are correctly predicted to be paired. - True negative (TN): number of unpaired bases that are correctly predicted to be unpaired. - False positives (FP): number of paired bases that are erroneously predicted to be unpaired. - False negatives (FN): number of unpaired bases that are erroneously predicted to be paired. Recall (or *true positive rate* or *sensitivity*) is the probability that a positive prediction is actually positive. It is computed with the following equation: .. math:: recall = \frac{TP}{TP + FN} Precision (or *positive predictive value*) is the fraction of relevant elements among retrieved elements. It is computed with the following equation: .. math:: precision = \frac{TP}{TP + FP} The geometric mean of these two values is the **F1-score**, which is also called *F1*, *F1-measure*, *F-score*, or *F-measure*: .. math:: F1 = 2 \times \frac{recall \times precision}{recall + precision} These evaluation metrics can be computed with the function ``diurnal.evaluate.recall_precision_f1``, as shown below: .. code-block:: python from diurnal import evaluate true = list("(((....)))") prediction = list("((......))") r, p, f1 = evaluate.recall_precision_f1(true, prediction) One drawback of these evaluation metrics is that they do not make a distinction about whether a nucleotide is paired with a base in the 5' or 3' direction. Therefore, when comparing the structures ``(((....)))`` and ``)))....(((``, the precision, recall, and f1-score all have a perfect value of 1 even though the predicted structure is inaccurate. The ``diurnal`` library therefore uses another metric, the *micro f1-score*, which generalizes precision and recall to classification problems that rely on more than two classes. One can also obtain the confusion matrix of predicted structures. The code snippet below shows how to compute the micro f1-score and confusion matrix: .. code-block:: python from diurnal import evaluate true = list("(((....)))") prediction = list("((......))") micro_f1 = evaluate.micro_f1(true, prediction) confusion_matrix = evaluate.get_confusion_matrix(true, prediction) Save Models ^^^^^^^^^^^ Predictive models can be written in files for subsequent reuse, as shown below: .. code-block:: python # Assume that `model` is a trained `diurnal.model` object. model.save(directory = "saved_model") In addition to writing the model in the provided directory, the library also generates: - a file containing the list of the names of the molecules that were used for training the model, and - an informative file containing metadata. Load Models ^^^^^^^^^^^ Predictive models can be loaded from saved files, as shown below: .. code-block:: python from diurnal import models model = models.NN( cnn.Dot_Bracket, SIZE, None, torch.optim.Adam, torch.nn.MSELoss, {"eps": 1e-4}, None, verbosity=1) model.load("saved_model") Visualize Results ^^^^^^^^^^^^^^^^^ The module ``diurnal.visualize`` contains utility functions that can help users visualize results with graphs or console output. Elaborate Predictive Models --------------------------- The class ``diurnal.models.Basic`` represents a basic predictive model. One may derive this class to create a new predictive model. Four methods need to be implemented in the derived class: - ``_train(data)``: Train the model. - ``_predict(primary)``: Predict and return a secondary structure. - ``_save(directory)``: Write the model in files. - ``_load(directory)``: Read the model from files. The class ``diurnal.models.NN`` is an example of a predictive model that is derived from the class ``diurnal.models.Basic``. The ``NN`` class is used to represent predictive models based on neural networks. References ---------- .. bibliography:: references.bib