.. _lit_review:

Literature Review
=================

.. bibliography:: references.bib
   :filter: False

This page presents research efforts conducted in the field of RNA structure
prediction.

View the section :ref:`overview` for background information about RNA molecules.


Problem Statement
-----------------

The objective of RNA secondary structure prediction is to determine the pairing
of nucleotide from their sequence. This can be described as a
**prediction problem** in which:

- The **input** is a sequence of four characters: ``A``, ``C``, ``G``, and
  ``U`` (i.e. primary structure).
- The **output** is s sequence of characters describing the pairing of
  nucleotides (i.e. secondary structure). This structure can be described in
  different ways:
  - The **bracket notation** uses the characters ``.``, ``(``, and ``)`` to
    designate nucleotides that are unpaired, paired with a nucleotide in the
    3' direction, or paired with a nucleotide in the 5' direction, respectively.
    It cannot represent pseudo-knots.
  - The **shadow** of the secondary structure is a sequence of Boolean values
    to distinguish paired and unpaired nucleotides (e.g. ``0`` and ``1``). The
    shadow describes the secondary structure in less detail than the bracket
    notation.
  - The **list of pairing indices** is a sequence of integers that each indicate
    to which nucleotide a nucleotide in the sequence is paired. For instance, in
    the zero-based list ``[2, -1, 0]``, the first nucleotide is paired to the
    third nucleotide. The value ``-1`` designates an unpaired nucleotide. CT
    files use one-based lists of pairing indices. It can represent pseudo-knots.
  - The **scoring matrix** presents a list of pairing indices in a 2D matrix.
    It can represent pseudo-knots.

Alternatively, this problem can be described as "multiple binary classification
problems that predict whether each pair of nucleotides forms a base pair or not"
:cite:`mxfold2`.

RNA secondary structure prediction methods usually comprise the following
stages:

- **Pre-processing** transforms the primary structure (i.e. sequence of letters)
  into another representation.
- **Prediction** converts the input into an output using a previously
  trained model.
- **Post-processing** transforms the output of the model into a secondary
  structure (e.g. bracket notation or shadow).


Definitions
-----------

+---------------------+-------------------------------------------------------+
| Term                | Definition                                            |
+=====================+=======================================================+
| contact matrix      | A 2D matrix in which 0 indicates a non-paired base    |
| or pairing scoring  | and 1 indicates a paired base. Can represent a        |
| matrix              | secondary structure with pseudo-knots :cite:`cnnfold` |
|                     | :cite:`ufold`.                                        |
+---------------------+-------------------------------------------------------+


Classical Prediction Models
---------------------------

The term "classical model" is used in this context to refer to predictive models
that do not rely on neural network.

- MXFold2
- RNAfold
- RNAstructure
- TORNADO


Prediction Models Based on Neural Networks
------------------------------------------

This section summarizes a few recent RNA secondary structure prediction models.


REDfold (2023)
''''''''''''''

REDfold :cite:`redfold` uses a contact matrix as the input and an
encoder-decoder for the processing. It supports pseudoknots.


CNNFold (2022)
''''''''''''''

CNNFold :cite:`cnnfold` is a predictive model designed by Booy, Ilin, and
Orponen that *uses no energy model and can predict pseudo-knots*.

The **input** of the model is a **matrix of possible pairings**. Each cell
indicates if the pairing is possible and, if so, what bases would be paired to
one another, as shown in the figure below:

.. figure:: images/potential_pairings.png
   :alt: Example of a pairing matrix.

   Example of a matrix of potential pairings.

This representation makes the predictive model more efficient because it
captures global information about the structure of the molecule. Instead of
only using the local information of a primary structure (i.e. a nucleotide and
its neighboring nucleotides), the model can rely on the pre-processed potential
pairings.

The **predictive model** is a **deep convolutional neural network**. It uses
only the pairing matrix as the input and does not consider any energy model.

The **output** is a **scoring matrix** that is converted to a secondary
structure.


ATTFold (2020)
''''''''''''''

ATTFold :cite:`attfold` is a predictive model designed by Wang, Liu, Gao, Zhang,
and Dong that *uses no energy model and can predict pseudo-knots*.

The **input** is a one-hot encoded primary structure.

The **model** is a pipeline of components:

- The first component is a transformer network that encodes the primary
  structure.
- A CNN follows the transformer and decodes the output of the transformer into
  a base pairing scoring matrix.
- **Hard constraints** are then applied to the scoring matrix to (1) remove
  impossible pairings (e.g. ``A`` to ``C``) and (2) remove pairings less than
  three bases apart.

The **output** is a base pairing scoring matrix.


UFold (2022)
''''''''''''

UFold :cite:`ufold` is a predictive model designed by Fu, Cao, Wu, Peng, Nie,
and Xie that *uses no energy model and can predict pseudo-knots*.

The **input** is a contact matrix, which is generated by computing the Kronecker
product of the one-hot encoded primary structure with itself. This data
structure differs from the matrix of possible pairings used in CNNFold.

The **model** is a U-Net neural network.

The **output** is a base pairing scoring matrix.


MXFold2 (2021)
''''''''''''''

MXFold2 :cite:`mxfold2` is a predictive model designed by Sato, Akiyama, and
Sakakibara that *uses an energy model and cannot predict pseudo-knots*.

The **input** is the one-hot encoded primary structure.

The **model** is a pipeline of components:

- A BiLSTM recurrent neural network.
- Convolutional and concatenation layers.
- A Zucker-style dynamic programming algorithm (THAT's the energy model, right
  there).

The **output** is the secondary structure in dot-bracket notation.


CDPFold (2019)
''''''''''''''

CDPFold :cite:`cdpfold` is a predictive model designed by Zhang, Li, and Wei
that *uses no energy model and cannot predict pseudo-knots*.

The **input** of the model is a matrix-encoded encoded primary structure whose
element represent possible base pairings. This is similar to CNNFold, but the
elements of the matrix are scalars instead of vectors.

The **model** is a pipeline of components:

- A CNN predicts base pairing probabilities.
- A maximum probability sum algorithm (a modified Nussimov dynamic programming
  method) corrects predictions to obtain valid secondary structures.

The **output** is the secondary structure in dot-bracket notation.


SPOT-RNA (2019)
'''''''''''''''

SPOT-RNA :cite:`spotrna` is a predictive model designed by Singh, Hanson,
Paliwal, and Zhou that *uses no energy model and can predict pseudo-knots*.

The model uses ensemble and transfer learning to predict structures.


General Observations
''''''''''''''''''''

Szikszai et al. :cite:`szikszai22` show that most models based on neural
networks perform poorly at family-wise cross validation. In other words, they
have a hard time generalizing predictions to unfamiliar structures.


Datasets
--------

:cite:`rnastrand`

- RNAStralign: https://www.urmc.rochester.edu/rna/
- RNAalign: https://drive.google.com/drive/folders/19KPRYJjjMJh1qdMhtmUoYA_ncw3ocAHc
- RFam: https://rfam.org/


References
----------

.. bibliography:: references.bib