diurnal package

Subpackages

Submodules

diurnal.database module

RNA secondary structure database utility module.

This module contains functions to install (i.e. download and unwrap) RNA datasets and manipulate the data into matrix formats usable by processing algorithms. Note: the word dataset is used to refer to a given set of RNA secondary structures (e.g. archiveII or RNASTRalign). The collection of datasets is referred to as the database.

import diurnal.database as db
db.download("./data/", ["archiveII"])
db.format_basic("./data/archiveII", "./data/formatted", 512)
diurnal.database.download(dst: str, datasets: list, cleanup: bool = True, verbosity: int = 1) None[source]

Download and unpack RNA secondary structure databases.

Download the datasets listed in the datasets argument, place them in the dst directory, and unpack the downloaded files.

Parameters
  • dst (str) – Directory path in which the files are downloaded and unwrapped.

  • datasets (list(str)) – The list of datasets to download. The allowed datasets are archiveII and RNASTRalign.

  • cleanup (bool) – If True, the raw, compressed file is deleted. If False, that file is not deleted.

  • verbosity (int) – Verbosity of the function. 1 (default) prints informative messages. 0 silences the function.

diurnal.database.download_all(dst: str, cleanup: bool = True, verbosity: int = 1) None[source]

Download all available RNA secondary structure datasets (archiveII and RNASTRalign).

Parameters
  • dst (str) – Directory path in which the files are downloaded and unwrapped.

  • cleanup (bool) – If True, the raw, compressed file is deleted. If False, that file is not deleted.

  • verbosity (int) – Verbosity of the function. 1 (default) prints informative messages. 0 silences the function.

diurnal.database.format_basic(src: str, dst: str, max_size: int, primary_structure_map: any = <function Primary.to_onehot>, secondary_structure_map: any = <function Secondary.to_onehot>, verbosity: int = 1) None[source]

Transform the original datasets into the representation provided by the arguments.

This function reads the RNA dataset files contained in the src directory, applies the encoding schemes defined by the arguments, and writes the result in the dst directory. All encoded elements are zero-padded to obtain elements of dimensions [1 x max_size].

The function writes the following files:

  • info.rst describes the data.

  • primary_structure.np contains the encoded primary structures of the molecules.

  • secondary_structure.np contains the encoded secondary structures of the molecules.

  • families.np contains the encoded family of the molecules.

  • names.txt contains the newline-delimited names of the molecules.

Parameters
  • src (str) – The directory in which RNA datasets are located. The function searches for RNA files recursively.

  • dst (str) – The directory in which the encoded RNA structures are written. If the directory does not exist, it is created.

  • max_size (int) – Maximal number of nucleotides in an RNA structure. If an RNA structure has more nucleotides than max_size, it is not included in the formatted dataset.

  • primary_structure_map – A dictionary or function that maps an RNA primary structure symbol to a vector (e.g. map A to [1, 0, 0, 0]). If None, the primary structure file is not written.

  • secondary_structure_map – A dictionary or function that maps an RNA secondary structure symbol to a vector (e.g. map ‘.’ to [0, 1, 0]). If None, the secondary structure file is not written.

  • verbosity (int) – Verbosity level of the function. 1 (default) prints informative messages. 0 silences the function.

diurnal.database.format_filenames(src: str, dst: str = None, size: int = 0, families: list[str] = [], randomize: bool = True, verbosity: int = 1) list[str][source]

Obtain all file names that satisfy the arguments.

Parameters
  • src (str) – Directory of the sequence files.

  • dst (str) – Output file name. Set to None for no output.

  • size (int) – Maximum length of a sequence. Provide 0 for no maximum length.

  • families (list[str]) – Set of RNA families to include. Provide [] to include all families.

  • randomize (bool) – If True, shuffle the filenames.

  • verbosity (int) – Verbosity level. 0 to disable the output.

Returns (list[str]): List of file names.

diurnal.database.format_primary_secondary_structure(names: str, dst: str, size: int, map: Callable, verbosity: int = 1, epsilon: float = 0.01) str[source]

Convert a combination of primary and secondary structures into a Numpy file.

Parameters
  • names (list[str]) – List of sequence file names.

  • dst (str) – Output file name.

  • size (int) – Maximum length of a sequence.

  • map (Callable) – Function that transforms the sequence of bases into a formatted primary structure.

  • verbosity (int) – Verbosity level. 0 to disable the output.

  • epsilon (float) – Manhattan distance threshold used when comparing two matrices. Used to account for rounding.

diurnal.database.format_primary_structure(names: str, dst: str, size: int, map: Callable, verbosity: int = 1, epsilon: float = 0.01) str[source]

Convert primary structures into a Numpy file.

Parameters
  • names (list[str]) – List of sequence file names.

  • dst (str) – Output file name.

  • size (int) – Maximum length of a sequence.

  • map (Callable) – Function that transforms the sequence of bases into a formatted primary structure.

  • verbosity (int) – Verbosity level. 0 to disable the output.

  • epsilon (float) – Manhattan distance threshold used when comparing two matrices. Used to account for rounding.

diurnal.database.format_secondary_structure(names: str, dst: str, size: int, map: Callable, verbosity: int = 1, epsilon: float = 0.01) str[source]

Convert secondary structures into a Numpy file.

Parameters
  • names (list[str]) – List of sequence file names.

  • dst (str) – Output file name.

  • size (int) – Maximum length of a sequence.

  • map (Callable) – Function that transforms the secondary structure into a formatted secondary structure.

  • verbosity (int) – Verbosity level. 0 to disable the output.

  • epsilon (float) – Manhattan distance threshold used when comparing two matrices. Used to account for rounding.

diurnal.database.save(matrix: numpy.ndarray, name: str) None[source]

Save a matrix into a file.

Parameters
  • matrix – Input matrix.

  • name – Complete filename. Directories are created.

diurnal.database.summarize(path: str, primary_structure_map, secondary_structure_map) str[source]

Summarize the content of the formatted file directory.

Parameters
  • path (str) – File path of the formatted data.

  • primary_structure_map – A dictionary or function that maps an RNA primary structure symbol to a vector (e.g. map A to [1, 0, 0, 0]). If None, the file x.np is not written.

  • secondary_structure_map – A dictionary or function that maps an RNA secondary structure symbol to a vector (e.g. map ‘.’ to [0, 1, 0]). If None, the file y.np is not written.

Returns (str): Informative file containing:
  • Title

  • Generation date and time

  • Number of structures

  • Structure size (number of nucleotides)

  • Primary structure encoding example

  • Secondary structure encoding example

diurnal.evaluate module

RNA secondary structure prediction evaluation module.

This module contains functions to evaluate RNA secondary structure predictions by comparing a predicted structure to a reference structure.

class diurnal.evaluate.Bracket[source]

Bases: object

Evaluate predictions made with the bracket notation.

confusion_matrix(pred: list[str], symbols: str = '(.)') float[source]

Get the confusion matrix of the prediction.

Parameters
  • true (list-like) – Vector of the true structure.

  • pred (list-like) – Vector of the predicted structure.

  • symbols – Set of possible elements.

Returns (tuple): A tuple containing the confusion matrix and a list of symbols that correspond to each row of the matrix.

convert_to_scalars(pred: list[str], symbols: tuple[str]) tuple[source]

Convert a vector of vectors into a vector of scalars. For instance, [[0, 1], [0, 1], [1, 0]] and [‘.’, ‘.’, ‘(‘] are converted to [0, 0, 1].

Parameters
  • true (list-like) – Vector of the true structure.

  • pred (list-like) – Vector of the predicted structure.

  • symbols – Set of possible elements.

Returns (tuple): Tuple containing the scalar vectors.

crop(length: int) list[str | int][source]

Return a cropped secondary structure to exclude padding.

Parameters
  • bracket – Bracket notation of the secondary structure.

  • length – Number of bases in the primary structure.

Returns: The bracket argument from element 0 to length.

micro_f1(pred: list[str], symbols: str = '(.)') float[source]

Compute the micro F1-score by considering the secondary structure symbols ‘(’, ‘.’, and ‘)’ as three different classes.

Parameters
  • true (list-like) – Vector of the true structure.

  • pred (list-like) – Vector of the predicted structure.

  • symbols – Set of possible elements.

Returns (float): F1-score of the prediction, i.e. a value between 0 and 1.
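As a rough illustration of the micro F1-score described above, the sketch below pools true positives, false positives, and false negatives over the three classes before computing a single score. It is an independent pure-Python example; the helper name micro_f1_sketch is hypothetical and is not part of diurnal.

```python
def micro_f1_sketch(true, pred, symbols="(.)"):
    """Micro-averaged F1 over the classes in `symbols` (hypothetical helper)."""
    tp = fp = fn = 0
    for symbol in symbols:
        for t, p in zip(true, pred):
            if p == symbol and t == symbol:
                tp += 1  # correctly predicted as `symbol`
            elif p == symbol and t != symbol:
                fp += 1  # predicted as `symbol`, but it is another class
            elif p != symbol and t == symbol:
                fn += 1  # missed an element of class `symbol`
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


true = list("((..))")
pred = list("((...)")  # one wrong symbol out of six
score = micro_f1_sketch(true, pred)
```

When every base carries exactly one label from symbols, the pooled score equals the fraction of correctly predicted symbols.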

class diurnal.evaluate.ContactMatrix[source]

Bases: object

Evaluate predictions made with contact matrices.

FN(pred: numpy.ndarray) int[source]

Get the number of false negatives.

FP(pred: numpy.ndarray) int[source]

Get the number of false positives.

TN(pred: numpy.ndarray) int[source]

Get the number of true negatives.

TP(pred: numpy.ndarray) int[source]

Get the number of true positives.

crop(length: int) list[int][source]

Return a cropped contact matrix to exclude padding.

Parameters
  • contact – Contact matrix of the secondary structure.

  • length – Number of bases in the primary structure.

Returns: The length by length upper left square of the contact matrix.

f1(pred) float[source]

Compute the F1 score, the harmonic mean of precision and recall.

precision(pred: numpy.ndarray) float[source]

Compute the precision obtained by comparing two secondary structures. Precision is defined as:

\[TP / (TP + FP).\]
recall(pred: numpy.ndarray) float[source]

Compute the recall obtained by comparing two secondary structures. Recall is defined as:

\[TP / (TP + FN).\]
class diurnal.evaluate.Shadow[source]

Bases: object

Evaluate predictions made with secondary structure shadows, i.e. a sequence of paired / unpaired bases.

FN(pred: list[int]) float[source]

Compute the false negative value (predicted unpaired bases that are actually paired).

FP(pred: list[int]) float[source]

Compute the false positive value (predicted paired bases that are actually unpaired).

TN(pred: list[int]) float[source]

Compute the true negative value (predicted unpaired bases that are actually unpaired).

TP(pred: list[int]) float[source]

Compute the true positive value (predicted paired bases that are actually paired).

crop(length: int) list[int][source]

Return a cropped shadow to exclude padding.

Parameters
  • shadow – Shadow of the secondary structure.

  • length – Number of bases in the primary structure.

Returns: The shadow argument from element 0 to length.

precision(pred) float[source]

Compute the precision obtained by comparing two secondary structures. Precision is defined as:

\[TP / (TP + FP).\]
recall(pred) float[source]

Compute the recall value obtained by comparing two secondary structures. Recall is defined as:

\[TP / (TP + FN).\]
recall_precision_f1(pred)[source]
Compute the F1-score obtained by comparing two secondary structures. The F1-score is defined as:

\[F1 = 2 \times \frac{recall \times precision}{recall + precision}\]
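The recall, precision, and F1 definitions used by the Shadow class can be sketched on plain lists as follows. This is a minimal standalone example, not the library's implementation: the function name shadow_scores is hypothetical, and the convention that 1 marks a paired base is an assumption made for the illustration.

```python
def shadow_scores(true, pred):
    """Return (recall, precision, f1) for two shadows (hypothetical helper).

    Assumption: 1 marks a paired base and 0 an unpaired base.
    """
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return recall, precision, f1


true = [1, 1, 0, 0, 1, 1]
pred = [1, 0, 0, 1, 1, 1]  # one missed pairing, one spurious pairing
recall, precision, f1 = shadow_scores(true, pred)
```

The guards against empty denominators return 0.0, which is one possible convention when no paired base is present or predicted.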

diurnal.evaluate.summarize_results(f1_scores: list, name: str) None[source]

Summarize the f1-scores.

Parameters
  • f1_scores (list(float)) – List of f1-scores.

  • name (str) – Name of the results printed along with the summary.

diurnal.evaluate.to_shadow(bracket: list[str] | str) list[int][source]

Convert a bracket notation to a secondary structure shadow.

Parameters

bracket – Secondary structure represented in bracket notation with the characters (, ., and ).

Returns: Secondary structure shadow in which 0 stands for ( or ) and 1 stands for the dot character (.).

diurnal.family module

RNA family utility module.

This module simplifies operations related to the encoding of RNA families into other representations.

diurnal.family.all_but(families: list[str]) bool[source]

Return all RNA family names except those provided as arguments.

Parameters

families (List(str) | str) – RNA families to exclude.

Returns (List(str)): The list of selected RNA families.

diurnal.family.get_name(filename: str) str[source]

Attempt to determine the family of an RNA molecule based on its filename.

Parameters

filename (str) – Name of the file containing the representation of the RNA molecule.

Returns (str): RNA family if found, empty string otherwise.

diurnal.family.is_known(family: str) bool[source]

Check if an RNA family is recognized.

Parameters

family (str) – Name of the family to test.

Returns (bool): True if the family is recognized, False otherwise.

diurnal.family.select(names: list[str], families: str | list[str]) list[str][source]

Return a list of molecule names that belong to a provided family.

Parameters
  • names (list[str]) – List of names to filter.

  • families (str | list[str]) – Family or families to preserve.

Returns (list[str]): List of names.

diurnal.family.split(names: list[str]) dict[source]

Split a list of molecule names into a dictionary of names organized by family.

Parameters

names – List of molecule names.

Returns: Dictionary formatted as {“family”: [names]}.

diurnal.family.to_name(vector: list) str[source]

Convert a one-hot-encoded family back into its name.

Parameters

vector (list) – A one-hot encoded family.

Returns (str): Family name.

diurnal.family.to_onehot(family: str, map: dict = {'16s': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0], '23s': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], '5s': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'RNaseP': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0], 'SRP': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0], 'grp1': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0], 'grp2': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0], 'tRNA': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1], 'telomerase': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], 'tmRNA': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]}) list[source]

Encode a family into a one-hot vector.

Parameters
  • family (str) – RNA family.

  • map (dict) – A dictionary that assigns a family to a vector.

Returns (list(int)): One-hot encoded family.
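The default map shown in the signature assigns each family one position in a ten-element vector. The sketch below reproduces that layout with an ordered list and shows the round trip between a name and its vector. The names FAMILIES, family_to_onehot, and onehot_to_family are hypothetical helpers written for this illustration, not diurnal functions.

```python
# Order chosen so that the index of each family matches the position of the 1
# in the default map of diurnal.family.to_onehot shown above.
FAMILIES = ["5s", "16s", "23s", "grp1", "grp2",
            "RNaseP", "SRP", "telomerase", "tmRNA", "tRNA"]


def family_to_onehot(family):
    """Encode a family name into a one-hot list (hypothetical helper)."""
    vector = [0] * len(FAMILIES)
    vector[FAMILIES.index(family)] = 1
    return vector


def onehot_to_family(vector):
    """Decode a one-hot list back into the family name (hypothetical helper)."""
    return FAMILIES[list(vector).index(1)]


encoded = family_to_onehot("tRNA")
decoded = onehot_to_family(encoded)
```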

diurnal.structure module

Transform RNA structures into useful representations.

class diurnal.structure.Constants[source]

Bases: object

Set of physical values that constrain RNA structures.

LOOP_MIN_DISTANCE

Minimum number of nucleotides between two bases paired to each other. For instance, in the sequence ACCCU, the bases A and U can be paired because they are separated by three bases. However, in the sequence ACU, the bases A and U cannot be paired because they are too close.

Type

int

LOOP_MIN_DISTANCE = 3
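The constraint can be checked with a one-line comparison, as in the sketch below. The function far_enough is a hypothetical helper; it assumes that "minimum number of nucleotides between two bases" means the count of bases strictly between the two positions, which matches the ACCCU and ACU examples above.

```python
LOOP_MIN_DISTANCE = 3  # value documented for diurnal.structure.Constants


def far_enough(i, j, min_distance=LOOP_MIN_DISTANCE):
    """True if at least `min_distance` bases lie strictly between i and j."""
    return abs(i - j) - 1 >= min_distance


can_pair_acccu = far_enough(0, 4)  # A and U in "ACCCU": three bases in between
can_pair_acu = far_enough(0, 2)    # A and U in "ACU": one base in between
```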
class diurnal.structure.Primary[source]

Bases: object

Transform RNA primary structures into useful formats.

to_mask(size: int = 0) numpy.ndarray[source]

Make a primary structure pairing mask.

Return a copy of the input matrix in which impossible pairings are set to 0 and possible pairings are set to 1.

Parameters
  • pairings (np.ndarray) – Primary structure potential pairing matrix.

  • size (int) – Matrix dimension. 0 for no padding.

Returns (np.ndarray): Pairing matrix mask.

to_matrix(size: int = 0, map: dict = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}) numpy.ndarray[source]

Encode a primary structure in a matrix of potential pairings.

Create an n by n matrix, where n is the number of bases, in which each element represents a potential RNA base pairing. For instance, the pairing AA is not possible and will be assigned the invalid value of the map parameter. AU is a valid pairing and the corresponding element will be assigned its value in the map.

Parameters
  • bases (list(str)) – Primary structure (sequence of bases).

  • size (int) – Matrix dimension. 0 for no padding.

  • map (dict) – Assign a pairing to a matrix element. The elements of the map must be (1) convertible to a Numpy array and (2) of the same dimension.

Returns (np.ndarray): Encoded matrix.
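To make the idea of a potential pairing matrix concrete, the sketch below builds one with nested lists. It uses the scalar values listed for Schemes.IUPAC_PAIRINGS_SCALARS rather than the default one-hot map, only to keep the output compact. The function potential_pairings is a hypothetical stand-in and makes no claim about how diurnal implements Primary.to_matrix.

```python
# Values copied from Schemes.IUPAC_PAIRINGS_SCALARS as listed in this reference.
PAIRING_SCALARS = {"AU": 2, "UA": 2, "CG": 3, "GC": 3,
                   "GU": 1, "UG": 1, "invalid": 0, "-": 0}


def potential_pairings(bases, size=0, mapping=PAIRING_SCALARS):
    """Build a padded matrix of potential pairings (hypothetical helper)."""
    n = len(bases)
    dim = max(size, n)  # size=0 means no padding
    matrix = [[mapping["-"]] * dim for _ in range(dim)]
    for i in range(n):
        for j in range(n):
            pair = bases[i] + bases[j]
            # Pairings absent from the map (e.g. AA) receive the invalid value.
            matrix[i][j] = mapping.get(pair, mapping["invalid"])
    return matrix


matrix = potential_pairings("AUG", size=4)
```

Row i and column j hold the value assigned to the pairing of base i with base j; the last row and column are padding.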

to_onehot(size: int = 0, map: dict = {'-': (0, 0, 0, 0), '.': (0, 0, 0, 0), 'A': (1, 0, 0, 0), 'B': (0, 1, 1, 1), 'C': (0, 1, 0, 0), 'D': (1, 0, 1, 1), 'G': (0, 0, 1, 0), 'H': (1, 1, 0, 1), 'K': (0, 0, 1, 1), 'M': (1, 1, 0, 0), 'N': (1, 1, 1, 1), 'R': (1, 0, 1, 0), 'S': (0, 1, 1, 0), 'T': (0, 0, 0, 1), 'U': (0, 0, 0, 1), 'V': (1, 1, 1, 0), 'W': (1, 0, 0, 1), 'Y': (0, 1, 0, 1)}) numpy.ndarray[source]

Transform a sequence of bases into a one-hot encoded vector.

Parameters
  • bases (List[str] | str) – A sequence of bases. E.g.: ['A', 'U'] or AU.

  • size (int) – Size of a normalized vector. 0 for no padding.

  • map (dict) – Assign an input to a vector.

Returns (np.ndarray): One-hot encoded primary structure.

E.g.: [[1, 0, 0, 0], [0, 1, 0, 0]]
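A minimal sketch of one-hot encoding with zero padding is given below. It uses a subset of the IUPAC map shown in the signature; the function bases_to_onehot is a hypothetical helper and returns nested lists instead of a Numpy array.

```python
# Subset of the IUPAC one-hot map shown in the signature above.
ONEHOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0),
          "U": (0, 0, 0, 1), "-": (0, 0, 0, 0)}


def bases_to_onehot(bases, size=0, mapping=ONEHOT):
    """One-hot encode a sequence and pad it up to `size` (hypothetical helper)."""
    encoded = [list(mapping[base]) for base in bases]
    padding = max(size - len(encoded), 0)  # size=0 means no padding
    encoded.extend(list(mapping["-"]) for _ in range(padding))
    return encoded


encoded = bases_to_onehot("AC", size=3)
```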

to_sequence(strip: bool = True, map: dict = {'-': (0, 0, 0, 0), '.': (0, 0, 0, 0), 'A': (1, 0, 0, 0), 'B': (0, 1, 1, 1), 'C': (0, 1, 0, 0), 'D': (1, 0, 1, 1), 'G': (0, 0, 1, 0), 'H': (1, 1, 0, 1), 'K': (0, 0, 1, 1), 'M': (1, 1, 0, 0), 'N': (1, 1, 1, 1), 'R': (1, 0, 1, 0), 'S': (0, 1, 1, 0), 'T': (0, 0, 0, 1), 'U': (0, 0, 0, 1), 'V': (1, 1, 1, 0), 'W': (1, 0, 0, 1), 'Y': (0, 1, 0, 1)}) list[source]

Transform a one-hot encoded vector into a sequence of bases.

Parameters
  • vector (list-like) – One-hot encoded primary structure.

  • strip (bool) – Remove empty elements at the vector’s right end.

  • map – A dictionary or function that maps bases to vectors.

Returns (list): A sequence of bases. E.g.: ['A', 'U'].

unpad_matrix(map: dict = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}) numpy.ndarray[source]

Strip a matrix of its padding elements.

Parameters
  • matrix – Input matrix (Numpy array of Python lists).

  • map (dict) – Assign a pairing to a matrix element.

Returns (list): Unpadded matrix.

class diurnal.structure.Schemes[source]

Bases: object

RNA structure schemes

The attributes of this class are used to transform raw RNA sequence data into other representations that can be used for prediction problems.

IUPAC_TO_ONEHOT

One-hot encoding dictionary for IUPAC symbols. See: https://www.bioinformatics.org/sms/iupac.html

Type

dict

IUPAC_ONEHOT_PAIRINGS_VECTOR

One-hot encoded nucleotide pairings, including normal ones (AU, UA, CG, and GC) and wobble pairs (GU and UG). Taken from CNNFold by Booy et al.

Type

dict

BRACKET_TO_ONEHOT

One-hot encoding dictionary for a secondary structure that relies on the bracket notation. . is an unpaired base. ( is a base paired to a downstream base. ) is a base paired to an upstream base. - is a padding (i.e. empty) base.

Type

dict

SHADOW_ENCODING

One-hot encoding dictionary to encode the shadow of the secondary structure (i.e. the symbols ( and ) of the bracket notation are considered identical).

Type

dict

BRACKET_TO_ONEHOT = {'(': (1, 0, 0), ')': (0, 0, 1), '-': (0, 0, 0), '.': (0, 1, 0)}
IUPAC_ONEHOT_PAIRINGS_VECTOR = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}
IUPAC_PAIRINGS_SCALARS = {'-': 0, 'AU': 2, 'CG': 3, 'GC': 3, 'GU': 1, 'UA': 2, 'UG': 1, 'invalid': 0}
IUPAC_PAIRINGS_SCALARS_NEGATIVE_PADDING = {'-': -1, 'AU': 2, 'CG': 3, 'GC': 3, 'GU': 1, 'UA': 2, 'UG': 1, 'invalid': 0}
IUPAC_TO_ONEHOT = {'-': (0, 0, 0, 0), '.': (0, 0, 0, 0), 'A': (1, 0, 0, 0), 'B': (0, 1, 1, 1), 'C': (0, 1, 0, 0), 'D': (1, 0, 1, 1), 'G': (0, 0, 1, 0), 'H': (1, 1, 0, 1), 'K': (0, 0, 1, 1), 'M': (1, 1, 0, 0), 'N': (1, 1, 1, 1), 'R': (1, 0, 1, 0), 'S': (0, 1, 1, 0), 'T': (0, 0, 0, 1), 'U': (0, 0, 0, 1), 'V': (1, 1, 1, 0), 'W': (1, 0, 0, 1), 'Y': (0, 1, 0, 1)}
SHADOW_ENCODING = {'(': 1, ')': 1, '-': 0, '.': 0}
class diurnal.structure.Secondary[source]

Bases: object

Transform RNA secondary structures into useful formats.

normalize_distance_matrix() numpy.ndarray[source]

Normalize the distance matrix.

This function returns a new distance matrix whose elements are normalized within the range 0.0 (farthest from a paired base) to 1.0 (paired base).

Parameters

distance_matrix (np.ndarray) – Result of the function to_distance_matrix.

Returns (np.ndarray): Normalized distance matrix.

quantize(mask: numpy.ndarray, threshold: float = None) numpy.ndarray[source]

Eliminate invalid pairings in a secondary structure matrix.

Let the following represent a secondary structure matrix:

```
[[_, _, _, _, c, b],
 [_, _, _, _, _, a],
 [_, _, _, _, _, _],
 [_, _, _, _, _, _],
 [x, _, _, _, _, _],
 [y, z, _, _, _, _]]
```

It follows that (x, a), (y, b), and (z, c) must all be pairs of identical elements because they represent either paired or unpaired bases. Differing elements would indicate that a base is both paired and unpaired, which is impossible. This function assigns the value 0 to all impossible pairings and 1 to all other values.

Steps:

  • Symmetrize the matrix by multiplying it by its transpose.

  • Determine a threshold value from the average of non-paired elements.

  • Assign 0 to all the elements below the threshold.

  • Quantize the matrix along both axes and multiply the result with each other.

Parameters
  • matrix (np.ndarray) – Contact matrix.

  • mask (np.ndarray) – Valid pairing mask.

  • threshold (float) – Value below which elements are discarded. Determined at runtime if not provided.

Returns (np.ndarray): Folded pairing matrix.

quantize_distance_matrix() numpy.ndarray[source]

Create a contact matrix from a distance matrix.

Parameters

distance_matrix (np.ndarray) – Result of the function to_distance_matrix.

Returns (np.ndarray): Contact matrix.

quantize_vector() numpy.ndarray[source]

Quantize a secondary structure vector.

Convert a vector of predicted brackets into a one-hot vector. For instance, [[0.9, 0.5, 0.1], [0.0, 0.5, 0.1]] is converted to [[1, 0, 0], [0, 1, 0]].

Parameters

prediction (list-like) – Secondary structure prediction.

Returns: Reformatted secondary structure.

to_bracket() list[source]

Convert a list of nucleotide pairings into a secondary structure bracket notation, e.g. ‘(((...)))’.

Parameters

pairings (list(int)) – A list of nucleotide pairings, e.g. the pairing (((...))) is represented as [8, 7, 6, -1, -1, -1, 2, 1, 0].

Returns (list): Secondary structure bracket notation.
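The conversion can be sketched in a few lines: an unpaired base (-1) becomes a dot, a base paired with a later index becomes an opening bracket, and a base paired with an earlier index becomes a closing bracket. The function pairings_to_bracket is a hypothetical helper written for this illustration.

```python
def pairings_to_bracket(pairings):
    """Convert pairing indices into bracket symbols (hypothetical helper)."""
    bracket = []
    for i, j in enumerate(pairings):
        if j < 0:
            bracket.append(".")  # unpaired base
        elif j > i:
            bracket.append("(")  # paired with a downstream base
        else:
            bracket.append(")")  # paired with an upstream base
    return bracket


bracket = pairings_to_bracket([8, 7, 6, -1, -1, -1, 2, 1, 0])
```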

to_distance_matrix(size: int = 0, normalize: bool = True, power: float = 1) numpy.ndarray[source]

Encode a secondary structure into a score contact matrix.

Transform the sequence of pairings into an n by n matrix, where n is the number of pairings, whose elements can be 1 for a paired base and x for unpaired bases, where x is given by: x = 1 - (d / n), in which d is the Manhattan distance with the closest paired base.

Parameters
  • pairings (list(int)) – List of base pairings.

  • size (int) – Dimension of the matrix. 0 for no padding.

  • normalize (bool) – If True, scale distances so that paired elements are 1 and the farthest elements are 0.

  • power (float) – Power to apply to normalized distances.

Returns (np.ndarray): Encoded matrix of the secondary structure.
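The formula x = 1 - (d / n) can be illustrated with a brute-force sketch that searches the closest paired element for every cell. This is only an illustration of the formula under stated assumptions: the function distance_matrix_sketch is hypothetical, it does not model the size, normalize, or power parameters, and it returns nested lists instead of a Numpy array.

```python
def distance_matrix_sketch(pairings):
    """Score each cell by its Manhattan distance to the closest contact."""
    n = len(pairings)
    # A contact is the cell (i, j) in which base i is paired to base j.
    contacts = [(i, j) for i, j in enumerate(pairings) if j >= 0]
    matrix = [[0.0] * n for _ in range(n)]
    if not contacts:
        return matrix  # no paired base: leave every element at 0.0
    for row in range(n):
        for col in range(n):
            d = min(abs(row - i) + abs(col - j) for i, j in contacts)
            matrix[row][col] = 1 - d / n  # 1.0 on a contact, lower farther away
    return matrix


matrix = distance_matrix_sketch([4, -1, -1, -1, 0])
```

Because a Manhattan distance can exceed n in larger matrices, the raw formula may yield negative values; rescaling them is outside the scope of this sketch.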

to_elements() str[source]

Convert pairings into secondary structure elements.

The possible elements or loops are:

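+-----------------+-----------+
| element         | character |
+=================+===========+
| bulge           | b         |
+-----------------+-----------+
| external loop   | e         |
+-----------------+-----------+
| hairpin loop    | h         |
+-----------------+-----------+
| internal loop   | i         |
+-----------------+-----------+
| multiloop       | m         |
+-----------------+-----------+
| stem / stacking | s         |
+-----------------+-----------+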

Parameters

pairings – List of pairings as indices or bracket notations.

Returns (str): List of elements.

to_matrix(size: int = 0) numpy.ndarray[source]

Encode a secondary structure into a contact matrix.

Transform the sequence of pairings into an n by n matrix, where n is the number of pairings, whose elements can be 0 for an unpaired base and 1 for a paired base.

Parameters
  • pairings (list(int)) – List of base pairings.

  • size (int) – Dimension of the matrix. 0 for no padding.

Returns (np.ndarray): Encoded matrix of the secondary structure.
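A contact matrix can be sketched by writing a 1 in cell (i, j) whenever base i is paired to base j. The function below is a hypothetical helper that assumes -1 marks an unpaired base, as in the pairing lists used elsewhere in this reference, and that padding is filled with 0.

```python
def pairings_to_contact_matrix(pairings, size=0):
    """Build a padded 0/1 contact matrix (hypothetical helper)."""
    n = len(pairings)
    dim = max(size, n)  # size=0 means no padding
    matrix = [[0] * dim for _ in range(dim)]
    for i, j in enumerate(pairings):
        if j >= 0:
            matrix[i][j] = 1  # base i is paired to base j
    return matrix


contact = pairings_to_contact_matrix([4, -1, -1, -1, 0], size=6)
```

Since each pairing appears twice in the list (once per partner), the resulting matrix is symmetric.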

to_onehot(size: int = 0, map: dict = {'(': (1, 0, 0), ')': (0, 0, 1), '-': (0, 0, 0), '.': (0, 1, 0)}) numpy.ndarray[source]

Encode pairings in a one-hot encoded dot-bracket secondary structure.

Parameters
  • pairings (List[int|str]) – A list of nucleotide pairings. The pairing (((...))) can be represented as [8, 7, 6, -1, -1, -1, 2, 1, 0] or [‘(’, ‘(’, ‘(’, ‘.’, ‘.’, ‘.’, ‘)’, ‘)’, ‘)’].

  • size (int) – Size of the output. 0 for no padding.

  • map (dict) – Assign an input to a vector.

Returns (np.ndarray): One-hot encoded secondary structure.

to_pairings() list[source]

Convert the bracket notation to a list of pairings.

Parameters

bracket (List[str] | str) – Secondary structure.

Returns (List[int]): List of pairings.
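Bracket notation is commonly parsed with a stack: each opening bracket is pushed, and each closing bracket is matched with the most recent unmatched opening bracket. The sketch below follows that approach; bracket_to_pairings is a hypothetical helper, and the use of -1 for unpaired bases mirrors the pairing lists shown in this reference.

```python
def bracket_to_pairings(bracket):
    """Convert bracket notation into pairing indices (hypothetical helper)."""
    pairings = [-1] * len(bracket)  # -1 marks an unpaired base
    stack = []
    for i, symbol in enumerate(bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            j = stack.pop()  # raises IndexError if brackets are unbalanced
            pairings[i], pairings[j] = j, i
    return pairings


pairings = bracket_to_pairings("(((...)))")
```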

to_shadow(size: int = 0) list[source]

Return the shadow of a secondary structure.

Parameters
  • pairings (List[str]) – Secondary structure.

  • size (int) – Final sequence length.
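A shadow can be sketched by mapping each bracket symbol through the values listed for Schemes.SHADOW_ENCODING. The function bracket_to_shadow is a hypothetical helper; padding on the right with the value of the padding symbol is an assumption of this example.

```python
# Values copied from Schemes.SHADOW_ENCODING as listed in this reference.
SHADOW_ENCODING = {"(": 1, ")": 1, ".": 0, "-": 0}


def bracket_to_shadow(bracket, size=0):
    """Map brackets to paired (1) / unpaired (0) flags (hypothetical helper)."""
    shadow = [SHADOW_ENCODING[symbol] for symbol in bracket]
    # Assumption: pad on the right with the padding value up to `size`.
    shadow.extend([SHADOW_ENCODING["-"]] * max(size - len(shadow), 0))
    return shadow


shadow = bracket_to_shadow("((..))", size=8)
```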

diurnal.train module

RNA secondary structure training utility module.

diurnal.train.categorize_matrix(prediction: numpy.ndarray) numpy.ndarray[source]
diurnal.train.categorize_vector(prediction: list) list[source]

Convert a vector of predicted pairings into a one-hot vector. For instance, [[0.9, 0.5, 0.1], [0.0, 0.5, 0.1]] is converted to [[1, 0, 0], [0, 1, 0]].

Parameters

prediction (list-like) – Secondary structure prediction.

Returns: Reformatted secondary structure.
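The conversion described above amounts to a row-wise argmax. The sketch below reproduces the documented example with plain lists; categorize_vector_sketch is a hypothetical name, and resolving ties in favor of the first maximum is an assumption of this example.

```python
def categorize_vector_sketch(prediction):
    """Set the largest element of each row to 1 and the others to 0."""
    result = []
    for row in prediction:
        # Index of the first maximum element of the row.
        best = max(range(len(row)), key=lambda k: row[k])
        result.append([1 if k == best else 0 for k in range(len(row))])
    return result


categorized = categorize_vector_sketch([[0.9, 0.5, 0.1], [0.0, 0.5, 0.1]])
```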

diurnal.train.clean_matrices(primary: list, true: list, pred: list) tuple[source]

Prepare a secondary structure prediction for evaluation.

Parameters
  • primary (list) – Vector-encoded primary structure.

  • true (list) – True vector-encoded secondary structure.

  • pred (list) – Predicted vector-encoded secondary structure.

Returns (tuple): A tuple of elements organized as:
  • sequence of bases

  • stripped true secondary structure matrix

  • stripped predicted secondary structure matrix

diurnal.train.clean_vectors(primary: list, true: list, pred: list) tuple[source]

Prepare a secondary structure prediction for evaluation.

Parameters
  • primary (list) – Vector-encoded primary structure.

  • true (list) – True vector-encoded secondary structure.

  • pred (list) – Predicted vector-encoded secondary structure.

Returns (tuple): A tuple of elements organized as:
  • sequence of bases

  • stripped true secondary structure

  • stripped predicted secondary structure

diurnal.train.k_fold_indices(fractions: list, k: int, n: int) list[source]

Return tuples of indices for K-fold splits.

Parameters
  • fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].

  • k – Number of folds.

  • n – Number of indices.

Returns (list): k tuples containing len(fractions) of index lists.

diurnal.train.k_fold_split(data, fractions: list, k: int, i: int) list[source]

Split the data to make a K-fold split.

Parameters
  • data – Array-like object containing the data to split.

  • fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].

  • k – Number of folds.

  • i – Zero-based index of the fold.

Returns

A tuple containing the split data object.

diurnal.train.load_data(path: str, randomize: bool = True) tuple[source]

Read formatted data into tensors.

Parameters
  • path (str) – Name of the directory that contains the Numpy files written by the function diurnal.database.format.

  • randomize (bool) – Randomize data if set to True.

Returns

Loaded data represented as [primary structure, secondary structure, family].

Return type

list

diurnal.train.load_families(path: str, families: list, randomize=True, verbose: bool = True) list[source]

Read formatted molecules of the specified RNA family.

Parameters
  • path (str) – Name of the directory that contains the Numpy files written by the function diurnal.database.format.

  • families (List(str) | str) – Families to read.

  • randomize (bool) – Randomize data if set to True.

  • verbose (bool) – Print informative messages.

Returns (dict): Loaded data represented as `{"input": tuple[list], "secondary": list, "names": list(str), "family": list}`

diurnal.train.quantize_matrix(matrix: list[list[float]], dim: int = 0) None[source]

Quantize a matrix.

All the rows of the matrix are formatted as follows:

  • The maximum element is set to 1.

  • The other elements are set to 0.

Parameters
  • matrix – Input matrix

  • dim – Dimension along which to quantize the matrix.

diurnal.train.shuffle_data(*args) tuple[source]

Shuffle vectors to preserve one-to-one original pairings.

For instance, consider:

  • a = [ 0, 1, 2 ]

  • b = ['a', 'b', 'c']

Shuffling lists a and b may result in:

  • a = [ 2, 1, 0 ]

  • b = ['c', 'b', 'a']

Parameters

args – List-like elements to be shuffled. They need to be of the same dimensions.

Returns (tuple): Shuffled data. The vectors are returned in the same order as they were provided.
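One common way to shuffle several sequences while keeping their elements aligned is to shuffle a single list of indices and apply it to every sequence. The sketch below shows that technique; shuffle_together and its seed parameter are hypothetical and are not part of the diurnal API.

```python
import random


def shuffle_together(*args, seed=None):
    """Shuffle equally sized sequences with one shared permutation."""
    indices = list(range(len(args[0])))
    random.Random(seed).shuffle(indices)  # seed is only for reproducibility
    return tuple([sequence[i] for i in indices] for sequence in args)


a, b = shuffle_together([0, 1, 2], ["a", "b", "c"], seed=0)
```

Whatever permutation is drawn, element k of a still corresponds to element k of b.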

diurnal.train.split(data, fractions: tuple[float], offset: int = 0) list[source]

Split an array of data.

Parameters
  • data (any) – Array-like data to split.

  • fractions (tuple[float]) – Fraction of data in each resulting set. Elements must sum to 1.

  • offset (int) – Index offset.

Returns (list[any]): List of split sets.

Example:

>>> data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> split(data, (0.2, 0.8), 1)
[[1, 2], [3, 4, 5, 6, 7, 8, 9, 0]]
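The output shown in the example is consistent with rotating the data by offset before cutting it into consecutive subsets. The sketch below implements that interpretation; it is an assumption inferred from the example, not a description of the library's code, and split_sketch is a hypothetical name.

```python
def split_sketch(data, fractions, offset=0):
    """Rotate `data` by `offset`, then cut it according to `fractions`."""
    n = len(data)
    shift = offset % n if n else 0
    rotated = list(data[shift:]) + list(data[:shift])
    subsets, start = [], 0
    for k, fraction in enumerate(fractions):
        is_last = k == len(fractions) - 1
        # The last subset takes the remainder to absorb rounding effects.
        stop = n if is_last else start + round(fraction * n)
        subsets.append(rotated[start:stop])
        start = stop
    return subsets


parts = split_sketch(list(range(10)), (0.2, 0.8), 1)
```

Varying offset from one fold to the next moves the boundary of each subset, which is how an offset of this kind can support K-fold splits.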
diurnal.train.split_data(data, fractions: list, offset: int = 0) list[source]

Split data in subsets according to the specified fractions.

Parameters
  • data – Array-like object containing the data to split.

  • fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].

  • offset – Number of indices to offset to assemble the subsets. Used for K-fold data splits.

Returns

A list containing the split data object.

diurnal.train.split_indices(fractions: list, n: int) list[source]

Split a range of indices in subsets according to the specified fractions.

Parameters
  • fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].

  • n – Number of indices.

Returns (list): A list containing the split data object.

diurnal.visualize module

Data visualization module.

diurnal.visualize.compare_pairings(true: numpy.ndarray, prediction: numpy.ndarray, title: str = 'Comparison of Secondary Structures') None[source]

Compare secondary structures.

Parameters
  • true – Contact matrix of the true secondary structure.

  • prediction – Contact matrix of the predicted secondary structure.

  • title – Graph title.

diurnal.visualize.heatmap(matrices: numpy.ndarray, title: str = '', label: bool = False) None[source]

Visualize heatmaps.

The function opens a plot that visualizes the matrices argument. If matrices is a 3D array, the heatmap is the sum of all arrays along the 0 axis. If matrices is a 2D array, it is used as the heatmap.

Parameters
  • matrices – Set of 2D matrices or one 2D matrix.

  • title (str) – Graph title.

  • label (bool) – If True, label each axis.

diurnal.visualize.lengths(data) None[source]

Display a histogram of the length of the data.

diurnal.visualize.potential_pairings(primary: str | list[str] | numpy.ndarray, secondary: list | tuple[list] = None, title: str = 'RNA Molecule Potential Pairings', map: dict = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}) None[source]

Display a heatmap of potential pairings.

Parameters
  • primary (str) – List of bases or potential pairing matrix.

  • secondary (list) – Secondary structure or tuple of secondary structures represented as contact matrices or lists of pairings.

  • title (str) – Name of the graph.

  • map – Potential pairing to string map.

diurnal.visualize.prediction(primary, true, pred) None[source]

Compare true and predicted secondary structures.

diurnal.visualize.primary_structure(primary) None[source]

Print the sequence of nucleotides from a one-hot encoded primary structure.

Parameters

primary – Primary structure.

diurnal.visualize.print_contact_matrix(matrix: numpy.ndarray)[source]

Print a contact matrix in the terminal.

diurnal.visualize.secondary_structure(matrix, primary: list = None, title: str = 'RNA Molecule Pairings') None[source]

Display a heatmap of the secondary structure.

Parameters
  • primary (List[str]) – Primary structure.

  • matrix (List[List[bool]]) – Secondary structure.

  • title (str) – Graph title.

diurnal.visualize.shadow(primary, true, pred) None[source]

Compare shadows.

diurnal.visualize.structure_length_per_family(path: str, max_size: int = None) None[source]

Display a histogram of RNA lengths.

Parameters
  • path (str) – Database file path.

  • max_size (int) – If provided, reject larger molecules.

Module contents

Diurnal is a Python library designed to predict RNA secondary structures.