User Guide
This page presents how to use the diurnal
library to predict RNA secondary
structures.
The information presented on this page is also available in a Jupyter notebook.
List of Modules
The diurnal
library comprises the following modules. The numpy
and
scikit-learn
libraries must be installed in the Python virtual environment
to use them.
diurnal.database: Download and use RNA datasets.
diurnal.evaluate: Evaluate the performances of predictive models.
diurnal.family: Manipulate data related to RNA families.
diurnal.structure: Format RNA structures into different representations.
diurnal.train: Prepare data for training models.
diurnal.visualize: Visualize data with
matplotlib
graphs or output in the terminal.
The library also contains the diurnal.models subpackage that comprises predictive models. The pytorch library must be installed to use this package.
Usage
Download Data
The code snippet below illustrates how to download RNA datasets:
from diurnal import database
# `data` is the directory in which data is downloaded. `archiveII` is the
# name of the dataset.
database.download("./data/", "archiveII")
This code will download the archiveII dataset and write the raw data into the data directory. If the directory does not exist, it is created. Three datasets can be downloaded:
archiveII: 3975 molecules. Originally retrieved from https://rna.urmc.rochester.edu/pub/archiveII.tar.gz [3].
RNASTRalign: 37149 molecules. Originally retrieved from https://www.urmc.rochester.edu/rna/ [3].
RNA_STRAND: 4666 molecules. Originally retrieved from http://www.rnasoft.ca/strand/ [4].
One may instead use the function diurnal.database.download_all()
to
download all available datasets.
Molecules are represented in these datasets with CT files, which is a tab-separated format. The first line of a CT file contains (1) the number of nucleotides in the molecule and (2) the name of the molecule. The other lines of the files all follow the same syntax, which is presented in the table below:
Column |
Content |
---|---|
1 |
Index of the nucleotide (starts at 1) |
2 |
Nucleotide (A, C, G, or U) |
3 |
Neighboring nucleotide index in the 5’ direction (i.e. upstream) |
4 |
Neighboring nucleotide index in the 3’ direction (i.e. downstream) |
5 |
Index of the paired nucleotide. If the nucleotide is unpaired,
the value |
6 |
Index of the nucleotide (same as column 1) |
One can read the content of CT files with the following code:
# Read the first molecule of the `archiveII` dataset.
filename = "./data/archiveII/5s_Acanthamoeba-castellanii-1.ct"
name, primary, secondary = diurnal.utils.rna_data.read_ct_file(filename)
which returns the following data:
The name of the molecule.
The primary structure of the molecule, represented as a list of
A
,C
,G
, andU
characters.The secondary structure of the molecule, represented as a list of pairings in which the i th element of the list is paired with the index that it contains (for instance,
(((...)))
can be represented as[8, 7, 6, -1, -1, -1, 2, 1, 0]
). Since Python uses zero-based indices,-1
is used for unpaired nucleotides instead of0
like CT files do.
Format Data
The RNA structure data of CT files must be converted into numerical vectors to
train predictive models. The module diurnal.structure
can encode the data
into other formats, as shown below:
from diurnal.structure import Primary, Secondary
# Encode the list of bases into a one-hot vector. For instance, if `primary`
# contains the value `['A', 'C']`, the encoded structure will be
# `[[1, 0, 0, 0], [0, 1, 0, 0]]`.
primary_onehot = Primary.to_onehot(primary)
# Encode the list of pairings into a one-hot vector. For instance, if
# `secondary` contains the value `[2, -1, 0]`, the encoded structure will be
# `[[1, 0, 0], [0, 1, 0], [0, 0, 1]]`, which correspond to `(.)`.
secondary_onehot = Secondary.to_onehot(secondary)
# Obtain the list of bases from an encoded vector.
primary = Primary.to_sequence(primary_onehot)
# Obtain the bracket notation from an encoded vector.
bracket = Secondary.to_bracket(secondary_onehot)
For convenience, the library can encode a whole dataset of CT files into another representation and store them in Numpy files. Users can subsequently read these already-formatted files instead of reading CT files every time. The following code snippet shows how to do that:
from diurnal import database, structure
database.format_basic(
"./data/archiveII", # Directory of the raw data to format.
"./data/formatted", # Formatted data output directory.
512, # Normalized size. Short molecules are padded at the 3' end.
structure.Primary.to_onehot, # Primary to vector map.
structure.Secondary.to_onehot # Secondary to vector map.
)
Executing this function will generate the following files:
families.npy
: Encoded RNA families.readme.rst
: Metadata such as the file creation date.names.txt
: The list of molecule names.primary_structures.npy
: Encoded primary structures.secondary_structures.npy
: Encoded secondary structures.
The .npy
files can be read with the function numpy.load(filename)
,
which returns a numpy.array
object.
Prepare Data for Training
Formatted data can be loaded and split for training. In the context of RNA secondary structure prediction, there are a few ways to divide data:
In inter-family testing (also called family-wise cross-validation by Sato et al. [5]), the model is trained and tested with datasets that comprise different RNA families. Therefore, training and testing data are structurally different. The point of this type of training is to measure how well the model can predict the structure of unfamiliar molecules.
In sequence testing (also called sequence-wise cross-validation by Sato et al. [5]), the model is trained and tested with datasets that comprise the same RNA families. Therefore, training and testing data are structurally similar. Consequently, this type of testing is expected to yield more accurate results than inter-family testing.
In intra-family testing, models are trained and tested with RNA molecules that belong to the same family. Therefore, training and testing data are structurally very similar and results are expected to yield more accurate results than sequence testing. This type of testing does not appear to be discussed in published work, but it can be useful to validate models.
The code snippet below shows how to load data for inter-family and sequence testing.
from diurnal import train, family
# Inter-family testing.
test_set = train.load_families("./data/formatted", "5s")
train_set = train.load_families("./data/formatted", family.all_but("5s"))
# Sequence testing.
data = train.load_data("./data/formatted", randomize = True)
# Divide data in training (80 % of points) and test sets (20 % of points).
train_set, test_set = train.split_data(data, [0.8, 0.2])
One may also divide data to perform K-fold validation, as shown below:
from diurnal import train
# Do five K-fold splits.
K = 5
data = train.load_data("./data/formatted", randomize = True)
for i in range(K)
train_set, test_set = train.k_fold_split(data, [0.8, 0.2], K, i)
# Train and test a model for this K-split.
Train Models
One can load predictive models comprised within the diurnal
library, as
demonstrated in the code snippet below:
import torch
from diurnal import models
# Load a `diurnal` neural network based on the `Dot_Bracket` architecture.
model = models.NN(
model=models.networks.cnn.Dot_Bracket,
N=SIZE,
n_epochs=3,
optimizer=torch.optim.Adam,
loss_fn=torch.nn.MSELoss,
optimizer_args={"eps": 1e-4},
loss_fn_args=None,
verbosity=1)
# Train the model
model.train(train_set)
In the example above, the class diurnal.models.NN
is a wrapper around a
pytorch
neural network. Another type of diurnal
models are baselines,
which make basic predictions. For instance, in the code below,
from diurnal.models import baseline
model = baseline.Random()
model.train(train_set)
the model makes random predictions. This can be useful to compare performances with other models and ensure that the data processing pipeline works well.
Predict Structures
You can predict structures as shown below:
from diurnal import structure
# Assume that `model` is a trained `diurnal.model` object. The method
# `predict` accepts primary structures encoded in the same format
# that was used for training (in this case, one-hot encoding).
primary_structure = list("AAAACCCCUUUU")
encoded_primary_structure = structure.Primary.to_onehot(primary_structure)
prediction = model.predict(encoded_primary_structure)
The data format returned by the predict
method depends on the architecture
of the model
object. For example, a model may return a one-hot encoded
bracket notation of the secondary structure.
Evaluate Results
The are two main ways to evaluate secondary structure predictions.
The first and most widespread method consists in using the recall and precision [1] [5] [6] [7] [8]. This evaluation method uses the following metrics:
True positives (TP): number of paired bases that are correctly predicted to be paired.
True negative (TN): number of unpaired bases that are correctly predicted to be unpaired.
False positives (FP): number of paired bases that are erroneously predicted to be unpaired.
False negatives (FN): number of unpaired bases that are erroneously predicted to be paired.
Recall (or true positive rate or sensitivity) is the probability that a positive prediction is actually positive. It is computed with the following equation:
Precision (or positive predictive value) is the fraction of relevant elements among retrieved elements. It is computed with the following equation:
The geometric mean of these two values is the F1-score, which is also called F1, F1-measure, F-score, or F-measure:
These evaluation metrics can be computed with the function
diurnal.evaluate.recall_precision_f1
, as shown below:
from diurnal import evaluate
true = list("(((....)))")
prediction = list("((......))")
r, p, f1 = evaluate.recall_precision_f1(true, prediction)
One drawback of these evaluation metrics is that they do not make a distinction
about whether a nucleotide is paired with a base in the 5’ or 3’ direction.
Therefore, when comparing the structures (((....)))
and )))....(((
, the
precision, recall, and f1-score all have a perfect value of 1 even though the
predicted structure is inaccurate. The diurnal
library therefore uses
another metric, the micro f1-score, which generalizes precision and recall to
classification problems that rely on more than two classes. One can also obtain
the confusion matrix of predicted structures. The code snippet below
shows how to compute the micro f1-score and confusion matrix:
from diurnal import evaluate
true = list("(((....)))")
prediction = list("((......))")
micro_f1 = evaluate.micro_f1(true, prediction)
confusion_matrix = evaluate.get_confusion_matrix(true, prediction)
Save Models
Predictive models can be written in files for subsequent reuse, as shown below:
# Assume that `model` is a trained `diurnal.model` object.
model.save(directory = "saved_model")
In addition to writing the model in the provided directory, the library also generates:
a file containing the list of the names of the molecules that were used for training the model, and
an informative file containing metadata.
Load Models
Predictive models can be loaded from saved files, as shown below:
from diurnal import models
model = models.NN(
cnn.Dot_Bracket,
SIZE,
None,
torch.optim.Adam,
torch.nn.MSELoss,
{"eps": 1e-4},
None,
verbosity=1)
model.load("saved_model")
Visualize Results
The module diurnal.visualize
contains utility functions that can help users
visualize results with graphs or console output.
Elaborate Predictive Models
The class diurnal.models.Basic
represents a basic predictive model. One may
derive this class to create a new predictive model. Four methods need to be
implemented in the derived class:
_train(data)
: Train the model._predict(primary)
: Predict and return a secondary structure._save(directory)
: Write the model in files._load(directory)
: Read the model from files.
The class diurnal.models.NN
is an example of a predictive model that
is derived from the class diurnal.models.Basic
. The NN
class is used
to represent predictive models based on neural networks.
References
- 1
Mehdi Saman Booy, Alexander Ilin, and Pekka Orponen. Rna secondary structure prediction with convolutional neural networks. BMC Bioinformatics, 23(1):58, 2022. URL: https://doi.org/10.1186/s12859-021-04540-7, doi:10.1186/s12859-021-04540-7.
- 2
Accessed: 2023-07-15. URL: http://rna.tbi.univie.ac.at/forna/.
- 3(1,2)
David Mathews. Mathews lab. Accessed: 2023-04-15. URL: https://rna.urmc.rochester.edu/.
- 4
Andronescu, Bereg, Hoos, and Condon. Rna strand v2.0 - the rna secondary structure and statistical analysis database. 2009. Accessed: 2023-04-15. URL: http://www.rnasoft.ca/strand/.
- 5(1,2,3)
Kengo Sato, Manato Akiyama, and Yasubumi Sakakibara. Rna secondary structure prediction using deep learning with thermodynamic integration. Nature Communications, 12(1):941, 2021. URL: https://doi.org/10.1038/s41467-021-21194-4, doi:10.1038/s41467-021-21194-4.
- 6
Yili Wang, Yuanning Liu, Shuo Wang, Zhen Liu, Yubing Gao, Hao Zhang, and Liyan Dong. Attfold: rna secondary structure prediction with pseudoknots based on attention mechanism. Frontiers in Genetics, 2020. URL: https://www.frontiersin.org/articles/10.3389/fgene.2020.612086, doi:10.3389/fgene.2020.612086.
- 7
Laiyi Fu, Yingxin Cao, Jie Wu, Qinke Peng, Qing Nie, and Xiaohui Xie. UFold: fast and accurate RNA secondary structure prediction with deep learning. Nucleic Acids Research, 50(3):e14–e14, 11 2021. URL: https://doi.org/10.1093/nar/gkab1074, arXiv:https://academic.oup.com/nar/article-pdf/50/3/e14/42544496/gkab1074\_supplemental\_file.pdf, doi:10.1093/nar/gkab1074.
- 8
Hui Zhang, Cong Zhang, Zhe Li, Chunhua Li, Xiaopeng Wei, Baofeng Zhang, and Yunlong Liu. A new method of rna secondary structure prediction based on convolutional neural network and dynamic programming. Frontiers in Genetics, 10:467, 2019. doi:10.3389/fgene.2019.00467.
- 9
missing author in redfold
- 10
Jaswinder Singh, Jack Hanson, Kuldip Paliwal, and Yaoqi Zhou. Rna secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nature Communications, 10:2041–1723, 2019. doi:10.1038/s41467-019-13395-9.
- 11
Marcell Szikszai, Michael Wise, Amitava Datta, Max Ward, and David H Mathews. Deep learning models for RNA secondary structure prediction (probably) do not generalize across families. Bioinformatics, 38(16):3892–3899, 06 2022. URL: https://doi.org/10.1093/bioinformatics/btac415, arXiv:https://academic.oup.com/bioinformatics/article-pdf/38/16/3892/45300927/btac415\_supplementary\_data.pdf, doi:10.1093/bioinformatics/btac415.