diurnal package

Subpackages

Submodules

diurnal.database module

RNA secondary structure database utility module.

This module contains functions to install (i.e. download and unwrap) RNA datasets and convert the data into matrix formats usable by processing algorithms. Note: the word dataset refers to a given set of RNA secondary structures (e.g. archiveII or RNASTRalign). The collection of datasets is referred to as the database.

import diurnal.database as db
db.download("./data/", "archiveII")
db.format_basic("./data/archiveII", "./data/formatted", 512)

Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: April 2023
License: MIT

diurnal.database.download(dst: str, datasets: list, cleanup: bool = True, verbosity: int = 1) → None[source]

Download and unpack RNA secondary structure databases.

Download the datasets listed in the datasets argument, place them in the dst directory, and unpack the downloaded files.

Parameters

dst (str) – Directory path in which the files are downloaded and unwrapped.
datasets (list(str)) – The list of datasets to download. The allowed datasets are archiveII and RNASTRalign.
cleanup (bool) – If True, the raw, compressed file is deleted. If False, that file is not deleted.
verbosity (int) – Verbosity of the function. 1 (default) prints informative messages. 0 silences the function.

diurnal.database.download_all(dst: str, cleanup: bool = True, verbosity: int = 1) → None[source]

Download all available RNA secondary structure datasets (archiveII and RNASTRalign).

Parameters

dst (str) – Directory path in which the files are downloaded and unwrapped.
cleanup (bool) – If True, the raw, compressed file is deleted. If False, that file is not deleted.
verbosity (int) – Verbosity of the function. 1 (default) prints informative messages. 0 silences the function.

diurnal.database.format_basic(src: str, dst: str, max_size: int, primary_structure_map: any = <function Primary.to_onehot>, secondary_structure_map: any = <function Secondary.to_onehot>, verbosity: int = 1) → None[source]

Transform the original datasets into the representation provided by the arguments.

This function reads the RNA dataset files contained in the src directory, applies the encoding schemes defined by the arguments, and writes the result to the dst directory. All encoded elements are zero-padded to obtain elements of dimensions [1 x max_size].

The function writes the following files:

info.rst describes the data.
primary_structure.np contains the encoded primary structures of the molecules.
secondary_structure.np contains the encoded secondary structures of the molecules.
families.np contains the encoded family of the molecules.
names.txt contains the newline-delimited names of the molecules.

Parameters

src (str) – The directory in which RNA datasets are located. The function searches for RNA files recursively.
dst (str) – The directory in which the encoded RNA structures are written. If the directory does not exist, it is created.
max_size (int) – Maximal number of nucleotides in an RNA structure. If an RNA structure has more nucleotides than max_size, it is not included in the formatted dataset.
primary_structure_map – A dictionary or function that maps an RNA primary structure symbol to a vector (e.g. map A to [1, 0, 0, 0]). If None, the file x.np is not written.
secondary_structure_map – A dictionary or function that maps an RNA secondary structure symbol to a vector (e.g. map ‘.’ to [0, 1, 0]). If None, the file y.np is not written.
verbosity (int) – Verbosity level of the function. 1 (default) prints informative messages. 0 silences the function.

diurnal.database.format_filenames(src: str, dst: str = None, size: int = 0, families: list[str] = [], randomize: bool = True, verbosity: int = 1) → list[str][source]

Obtain all file names that satisfy the arguments.

Parameters

src (str) – Directory of the sequence files.
dst (str) – Output file name. Set to None for no output.
size (int) – Maximum length of a sequence. Provide 0 for no maximum length.
families (list[str]) – Set of RNA families to include. Provide [] to include all families.
randomize (bool) – If True, shuffle the filenames.
verbosity (int) – Verbosity level. 0 to disable the output.

Returns (list[str]): List of file names.

diurnal.database.format_primary_secondary_structure(names: str, dst: str, size: int, map: Callable, verbosity: int = 1, epsilon: float = 0.01) → str[source]

Convert a combination of primary and secondary structures into a Numpy file.

Parameters

names (list[str]) – List of sequence file names.
dst (str) – Output file name.
size (int) – Maximum length of a sequence.
map (Callable) – Function that transforms the sequence of bases into a formatted primary structure.
verbosity (int) – Verbosity level. 0 to disable the output.
epsilon – Maximum Manhattan distance between two matrices for them to be considered identical. Used to account for rounding.

diurnal.database.format_primary_structure(names: str, dst: str, size: int, map: Callable, verbosity: int = 1, epsilon: float = 0.01) → str[source]

Convert primary structures into a Numpy file.

Parameters

names (list[str]) – List of sequence file names.
dst (str) – Output file name.
size (int) – Maximum length of a sequence.
map (Callable) – Function that transforms the sequence of bases into a formatted primary structure.
verbosity (int) – Verbosity level. 0 to disable the output.
epsilon – Maximum Manhattan distance between two matrices for them to be considered identical. Used to account for rounding.

diurnal.database.format_secondary_structure(names: str, dst: str, size: int, map: Callable, verbosity: int = 1, epsilon: float = 0.01) → str[source]

Convert secondary structures into a Numpy file.

Parameters

names (list[str]) – List of sequence file names.
dst (str) – Output file name.
size (int) – Maximum length of a sequence.
map (Callable) – Function that transforms the secondary structure into a formatted secondary structure.
verbosity (int) – Verbosity level. 0 to disable the output.
epsilon – Maximum Manhattan distance between two matrices for them to be considered identical. Used to account for rounding.

diurnal.database.save(matrix: numpy.ndarray, name: str) → None[source]

Save a matrix into a file.

Parameters

matrix – Input matrix.
name – Complete filename. Directories are created.

diurnal.database.summarize(path: str, primary_structure_map, secondary_structure_map) → str[source]

Summarize the content of the formatted file directory.

Parameters

path (str) – File path of the formatted data.
primary_structure_map – A dictionary or function that maps an RNA primary structure symbol to a vector (e.g. map A to [1, 0, 0, 0]). If None, the file x.np is not written.
secondary_structure_map – A dictionary or function that maps an RNA secondary structure symbol to a vector (e.g. map ‘.’ to [0, 1, 0]). If None, the file y.np is not written.

Returns (str): Informative file containing:

Title
Generation date and time
Number of structures
Structure size (number of nucleotides)
Primary structure encoding example
Secondary structure encoding example

diurnal.evaluate module

RNA secondary structure prediction evaluation module.

This module contains functions to evaluate RNA secondary structure predictions by comparing a predicted structure to a reference structure.

Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: April 2023
License: MIT

class diurnal.evaluate.Bracket[source]

Bases: object

Evaluate predictions made with the bracket notation.

confusion_matrix(pred: list[str], symbols: str = '(.)') → float[source]

Get the confusion matrix of the prediction.

Parameters

true (list-like) – Vector of the true structure.
pred (list-like) – Vector of the predicted structure.
symbols – Set of possible elements.

Returns (tuple): A tuple containing the confusion matrix and a list of symbols that correspond to each row of the matrix.

convert_to_scalars(pred: list[str], symbols: tuple[str]) → tuple[source]

Convert a vector of vectors into a vector of scalars. For instance, [[0, 1], [0, 1], [1, 0]] and [‘.’, ‘.’, ‘(’] are converted to [0, 0, 1].

Parameters

true (list-like) – Vector of the true structure.
pred (list-like) – Vector of the predicted structure.
symbols – Set of possible elements.

Returns (tuple): Tuple containing the scalar vectors.

crop(length: int) → list[str | int][source]

Return a cropped secondary structure to exclude padding.

Parameters

bracket – Bracket notation of the secondary structure.
length – Number of bases in the primary structure.

Returns: The bracket argument from element 0 to length.

micro_f1(pred: list[str], symbols: str = '(.)') → float[source]

Compute the micro F1-score by considering the secondary structure symbols ‘(’, ‘.’, and ‘)’ as three different classes.

Parameters

true (list-like) – Vector of the true structure.
pred (list-like) – Vector of the predicted structure.
symbols – Set of possible elements.

Returns (float): F1-score of the prediction, i.e. a value between 0 and 1.
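The micro-averaged score described above can be illustrated with a minimal, standalone sketch. It does not call the library: the function names and the explicit true argument are assumptions made for the example. Counts are pooled over the three classes before the F1 formula is applied.

```python
def confusion_matrix(true, pred, symbols="(.)"):
    """Count how often each true symbol (row) is predicted as each symbol (column)."""
    index = {s: i for i, s in enumerate(symbols)}
    matrix = [[0] * len(symbols) for _ in symbols]
    for t, p in zip(true, pred):
        matrix[index[t]][index[p]] += 1
    return matrix, list(symbols)


def micro_f1(true, pred, symbols="(.)"):
    """Micro-averaged F1: pool TP, FP, and FN over all classes."""
    matrix, _ = confusion_matrix(true, pred, symbols)
    n = len(symbols)
    tp = sum(matrix[i][i] for i in range(n))
    # False positives of class i: other true classes predicted as i.
    fp = sum(matrix[j][i] for i in range(n) for j in range(n) if j != i)
    # False negatives of class i: true class i predicted as something else.
    fn = sum(matrix[i][j] for i in range(n) for j in range(n) if j != i)
    if 2 * tp + fp + fn == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


true = list("((..))")
pred = list("((.).)")
matrix, labels = confusion_matrix(true, pred)
score = micro_f1(true, pred)  # 4 of 6 symbols match
```

With exactly one label per position, the micro-averaged F1 equals the fraction of correctly predicted symbols.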

class diurnal.evaluate.ContactMatrix[source]

Bases: object

Evaluate predictions made with contact matrices.

FN(pred: numpy.ndarray) → int[source]: Get the number of false negatives.

FP(pred: numpy.ndarray) → int[source]: Get the number of false positives.

TN(pred: numpy.ndarray) → int[source]: Get the number of true negatives.

TP(pred: numpy.ndarray) → int[source]: Get the number of true positives.

crop(length: int) → list[int][source]

Return a cropped contact matrix to exclude padding.

Parameters

contact – Contact matrix of the secondary structure.
length – Number of bases in the primary structure.

Returns: The length by length upper left square of the contact matrix.

f1(pred) → float[source]: Compute the F1 score, a harmonic mean of precision and recall.

precision(pred: numpy.ndarray) → float[source]: Compute the precision obtained by comparing two secondary structures. Precision is defined as:

\[TP / (TP + FP).\]

recall(pred: numpy.ndarray) → float[source]: Compute the recall obtained by comparing two secondary structures. Recall is defined as:

\[TP / (TP + FN).\]
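As an illustration of the definitions above, the following standalone sketch computes the four counts and the derived scores for two binary contact matrices. It is not the library's code: the helper names and the explicit true argument are assumptions, and empty denominators are mapped to 0.0 by choice.

```python
import numpy as np


def confusion_counts(true, pred):
    """Return (TP, FP, FN, TN) for two binary matrices of the same shape."""
    true = np.asarray(true, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    tp = int(np.sum(true & pred))
    fp = int(np.sum(~true & pred))
    fn = int(np.sum(true & ~pred))
    tn = int(np.sum(~true & ~pred))
    return tp, fp, fn, tn


def precision_recall_f1(true, pred):
    """Compute precision, recall, and their harmonic mean (F1)."""
    tp, fp, fn, _ = confusion_counts(true, pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


true = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [1, 0, 0, 0]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 0, 0],
                 [0, 0, 0, 1],
                 [1, 0, 1, 0]])
counts = confusion_counts(true, pred)
scores = precision_recall_f1(true, pred)
```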

class diurnal.evaluate.Shadow[source]

Bases: object

Evaluate predictions made with secondary structure shadows, i.e. a sequence of paired / unpaired bases.

FN(pred: list[int]) → float[source]: Compute the false negative value (predicted unpaired bases that are actually paired).

FP(pred: list[int]) → float[source]: Compute the false positive value (predicted paired bases that are actually unpaired).

TN(pred: list[int]) → float[source]: Compute the true negative value (predicted unpaired bases that are actually unpaired).

TP(pred: list[int]) → float[source]: Compute the true positive value (predicted paired bases that are actually paired).

crop(length: int) → list[int][source]

Return a cropped shadow to exclude padding.

Parameters

shadow – Shadow of the secondary structure.
length – Number of bases in the primary structure.

Returns: The shadow argument from element 0 to length.

precision(pred) → float[source]: Compute the precision obtained by comparing two secondary structures. Precision is defined as:

\[TP / (TP + FP).\]

recall(pred) → float[source]: Compute the recall value obtained by comparing two secondary structures. Recall is defined as:

\[TP / (TP + FN).\]

recall_precision_f1(pred)[source]

Compute the F1-score obtained by comparing two secondary structures. The F1-score is defined as:

\[F1 = 2 \times \frac{recall \times precision}{recall + precision}\]

diurnal.evaluate.summarize_results(f1_scores: list, name: str) → None[source]

Summarize the f1-scores.

Parameters

f1_scores (list(float)) – List of f1-scores.
name (str) – Name of the results printed along with the summary.

diurnal.evaluate.to_shadow(bracket: list[str] | str) → list[int][source]

Convert a bracket notation to a secondary structure shadow.

Parameters: bracket – Secondary structure represented in bracket notation with the characters (, ., and ).

Returns: Secondary structure shadow in which 0 stands for ( or ) and 1 stands for the . character.

diurnal.family module

RNA family utility module.

This module simplifies operations related to the encoding of RNA families into other representations.

Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: June 2023
License: MIT

diurnal.family.all_but(families: list[str]) → bool[source]

Return all RNA family names except those provided as arguments.

Parameters: families (List(str) | str) – RNA families to exclude.

Returns (List(str)): The list of selected RNA families.

diurnal.family.get_name(filename: str) → str[source]

Attempt to determine the family of an RNA molecule based on its filename.

Parameters: filename (str) – Name of the file containing the representation of the RNA molecule.

Returns (str): RNA family if found, empty string otherwise.

diurnal.family.is_known(family: str) → bool[source]

Check if an RNA family is recognized.

Parameters: family (str) – Name of the family to test.

Returns (bool): True if the family is recognized, False otherwise.

diurnal.family.select(names: list[str], families: str | list[str]) → list[str][source]

Return a list of molecule names that belong to a provided family.

Parameters

names (list[str]) – List of names to filter.
families (str | list[str]) – Family or families to preserve.

Returns (list[str]): List of names.

diurnal.family.split(names: list[str]) → dict[source]

Split a list of molecule names into a dictionary of names organized by family.

Parameters: names – List of molecule names.

Returns: Dictionary formatted as {“family”: [names]}.

diurnal.family.to_name(vector: list) → str[source]

Convert a one-hot-encoded family back into its name.

Parameters: vector (list) – A one-hot encoded family.

Returns (str): Family name.

diurnal.family.to_onehot(family: str, map: dict = {'16s': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0], '23s': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], '5s': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'RNaseP': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0], 'SRP': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0], 'grp1': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0], 'grp2': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0], 'tRNA': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1], 'telomerase': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], 'tmRNA': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]}) → list[source]

Encode a family into a one-hot vector.

Parameters

family (str) – RNA family.
map (dict) – A dictionary that assigns a family to a vector.

Returns (list(int)): One-hot encoded family.
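The encoding and its inverse (see to_name) can be sketched without the library as a dictionary lookup. The map below reproduces the default values shown in the signature; the function names are hypothetical, and returning an empty string for an unknown vector is an assumption.

```python
# Family order matches the position of the 1 in the default map.
FAMILIES = ["5s", "16s", "23s", "grp1", "grp2",
            "RNaseP", "SRP", "telomerase", "tmRNA", "tRNA"]
FAMILY_MAP = {
    name: [int(i == j) for j in range(len(FAMILIES))]
    for i, name in enumerate(FAMILIES)
}


def family_to_onehot(family, mapping=FAMILY_MAP):
    """Return a copy of the one-hot vector assigned to a family name."""
    return list(mapping[family])


def onehot_to_family(vector, mapping=FAMILY_MAP):
    """Return the family whose vector matches, or '' if none does."""
    for name, code in mapping.items():
        if list(vector) == list(code):
            return name
    return ""


encoded = family_to_onehot("tRNA")
decoded = onehot_to_family(encoded)
```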

diurnal.structure module

Transform RNA structures into useful representations.

Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: June 2023
License: MIT

class diurnal.structure.Constants[source]

Bases: object

Set of physical values that constrain RNA structures.

LOOP_MIN_DISTANCE

Minimum number of nucleotides between two bases paired to each other. For instance, in the sequence ACCCU, the bases A and U can be paired because they are separated by three bases. However, in the sequence ACU, the bases A and U cannot be paired because they are too close.

Type: int

LOOP_MIN_DISTANCE = 3
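A short sketch of how this constant could be applied to two base indices, following the ACCCU and ACU examples above. The helper name is hypothetical and is not part of the documented API.

```python
LOOP_MIN_DISTANCE = 3


def can_pair_by_distance(i, j, min_distance=LOOP_MIN_DISTANCE):
    """Return True if at least min_distance bases lie strictly between i and j."""
    return abs(i - j) - 1 >= min_distance


# In ACCCU, A (index 0) and U (index 4) are separated by three bases.
accu_ok = can_pair_by_distance(0, 4)
# In ACU, A (index 0) and U (index 2) are separated by a single base.
acu_ok = can_pair_by_distance(0, 2)
```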

class diurnal.structure.Primary[source]

Bases: object

Transform RNA primary structures into useful formats.

to_mask(size: int = 0) → numpy.ndarray[source]

Make a primary structure pairing mask.

Return a copy of the input matrix in which impossible pairings are set to 0 and possible pairings are set to 1.

Parameters

pairings (np.ndarray) – Primary structure potential pairing matrix.
size (int) – Matrix dimension. 0 for no padding.

Returns (np.ndarray): Pairing matrix mask.

to_matrix(size: int = 0, map: dict = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}) → numpy.ndarray[source]

Encode a primary structure in a matrix of potential pairings.

Create an n by n matrix, where n is the number of bases, in which each element represents a potential RNA base pairing. For instance, the pairing AA is not possible and is assigned the invalid value of the map parameter. AU is a valid pairing and the corresponding element is assigned its value in the map.

Parameters

bases (list(str)) – Primary structure (sequence of bases).
size (int) – Matrix dimension. 0 for no padding.
map (dict) – Assign a pairing to a matrix element. The elements of the map must be (1) convertible to a Numpy array and (2) of the same dimension.

Returns (np.ndarray): Encoded matrix.
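The construction can be illustrated with a standalone sketch. For brevity it uses the scalar values listed under Schemes.IUPAC_PAIRINGS_SCALARS rather than the default vector map, and it only checks the type of each pair; whether the library also applies the minimum loop distance is not stated here, so the sketch does not. The function name is hypothetical.

```python
import numpy as np

PAIRING_SCALARS = {"-": 0, "AU": 2, "CG": 3, "GC": 3,
                   "GU": 1, "UA": 2, "UG": 1, "invalid": 0}


def potential_pairings(bases, size=0, mapping=PAIRING_SCALARS):
    """Build a matrix whose element (i, j) encodes the pair bases[i] + bases[j]."""
    n = len(bases)
    dim = max(size, n)
    # Start from the padding value so that rows and columns beyond n stay padded.
    matrix = np.full((dim, dim), mapping["-"], dtype=int)
    for i in range(n):
        for j in range(n):
            pair = bases[i] + bases[j]
            matrix[i, j] = mapping.get(pair, mapping["invalid"])
    return matrix


m = potential_pairings("GAUC", size=5)
```

Here m[0, 3] holds the value of GC, m[1, 2] the value of AU, and the fifth row and column hold the padding value.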

to_onehot(size: int = 0, map: dict = {'-': (0, 0, 0, 0), '.': (0, 0, 0, 0), 'A': (1, 0, 0, 0), 'B': (0, 1, 1, 1), 'C': (0, 1, 0, 0), 'D': (1, 0, 1, 1), 'G': (0, 0, 1, 0), 'H': (1, 1, 0, 1), 'K': (0, 0, 1, 1), 'M': (1, 1, 0, 0), 'N': (1, 1, 1, 1), 'R': (1, 0, 1, 0), 'S': (0, 1, 1, 0), 'T': (0, 0, 0, 1), 'U': (0, 0, 0, 1), 'V': (1, 1, 1, 0), 'W': (1, 0, 0, 1), 'Y': (0, 1, 0, 1)}) → numpy.ndarray[source]

Transform a sequence of bases into a one-hot encoded vector.

Parameters

bases (List[str] | str) – A sequence of bases. E.g.: ['A', 'U'] or AU.
size (int) – Size of a normalized vector. 0 for no padding.
map (dict) – Assign an input to a vector.

Returns (np.ndarray): One-hot encoded primary structure. E.g.: [[1, 0, 0, 0], [0, 1, 0, 0]]

to_sequence(strip: bool = True, map: dict = {'-': (0, 0, 0, 0), '.': (0, 0, 0, 0), 'A': (1, 0, 0, 0), 'B': (0, 1, 1, 1), 'C': (0, 1, 0, 0), 'D': (1, 0, 1, 1), 'G': (0, 0, 1, 0), 'H': (1, 1, 0, 1), 'K': (0, 0, 1, 1), 'M': (1, 1, 0, 0), 'N': (1, 1, 1, 1), 'R': (1, 0, 1, 0), 'S': (0, 1, 1, 0), 'T': (0, 0, 0, 1), 'U': (0, 0, 0, 1), 'V': (1, 1, 1, 0), 'W': (1, 0, 0, 1), 'Y': (0, 1, 0, 1)}) → list[source]

Transform a one-hot encoded vector into a sequence of bases.

Parameters

vector (list-like) – One-hot encoded primary structure.
strip (bool) – Remove empty elements at the vector’s right end.
map – A dictionary or function that maps bases to vectors.

Returns (list): A sequence of bases. E.g.: ['A', 'U'].
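The round trip between a sequence and its padded one-hot encoding can be sketched as follows. This is an independent illustration with hypothetical function names. It uses a reduced map (A, C, G, U, and the padding symbol) because the full IUPAC map assigns the same vector to several symbols (for instance T and U), which makes decoding ambiguous.

```python
import numpy as np

BASE_TO_ONEHOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
                  "G": (0, 0, 1, 0), "U": (0, 0, 0, 1),
                  "-": (0, 0, 0, 0)}


def encode_bases(bases, size=0, mapping=BASE_TO_ONEHOT):
    """One-hot encode a sequence and zero-pad it on the right up to size."""
    vectors = [mapping[b] for b in bases]
    vectors += [mapping["-"]] * max(size - len(vectors), 0)
    return np.array(vectors)


def decode_bases(vectors, strip=True, mapping=BASE_TO_ONEHOT):
    """Map each vector back to its symbol, optionally dropping right padding."""
    inverse = {code: base for base, code in mapping.items()}
    bases = [inverse[tuple(int(x) for x in v)] for v in vectors]
    if strip:
        while bases and bases[-1] == "-":
            bases.pop()
    return bases


encoded = encode_bases("AU", size=4)
decoded = decode_bases(encoded)
```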

unpad_matrix(map: dict = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}) → numpy.ndarray[source]

Strip a matrix of its padding elements.

Parameters

matrix – Input matrix (Numpy array of Python lists).
map (dict) – Assign a pairing to a matrix element.

Returns (list): Unpadded matrix.

class diurnal.structure.Schemes[source]

Bases: object

RNA structure schemes

The attributes of this class are used to transform raw RNA sequence data into other representations that can be used for prediction problems.

IUPAC_TO_ONEHOT

One-hot encoding dictionary for IUPAC symbols. See: https://www.bioinformatics.org/sms/iupac.html

Type: dict

IUPAC_ONEHOT_PAIRINGS_VECTOR

One-hot encoded nucleotide pairings, including normal ones (AU, UA, CG, and GC) and wobble pairs (GU and UG). Taken from CNNFold by Booy et al.

Type: dict

BRACKET_TO_ONEHOT

One-hot encoding dictionary for a secondary structure that relies on the bracket notation. . is an unpaired base. ( is a base paired to a downstream base. ) is a base paired to an upstream base. - is a padding (i.e. empty) base.

Type: dict

SHADOW_ENCODING

One-hot encoding dictionary to encode the shadow of the secondary structure (i.e. the symbols ( and ) of the bracket notation are considered identical).

Type: dict

BRACKET_TO_ONEHOT = {'(': (1, 0, 0), ')': (0, 0, 1), '-': (0, 0, 0), '.': (0, 1, 0)}

IUPAC_ONEHOT_PAIRINGS_VECTOR = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}

IUPAC_PAIRINGS_SCALARS = {'-': 0, 'AU': 2, 'CG': 3, 'GC': 3, 'GU': 1, 'UA': 2, 'UG': 1, 'invalid': 0}

IUPAC_PAIRINGS_SCALARS_NEGATIVE_PADDING = {'-': -1, 'AU': 2, 'CG': 3, 'GC': 3, 'GU': 1, 'UA': 2, 'UG': 1, 'invalid': 0}

IUPAC_TO_ONEHOT = {'-': (0, 0, 0, 0), '.': (0, 0, 0, 0), 'A': (1, 0, 0, 0), 'B': (0, 1, 1, 1), 'C': (0, 1, 0, 0), 'D': (1, 0, 1, 1), 'G': (0, 0, 1, 0), 'H': (1, 1, 0, 1), 'K': (0, 0, 1, 1), 'M': (1, 1, 0, 0), 'N': (1, 1, 1, 1), 'R': (1, 0, 1, 0), 'S': (0, 1, 1, 0), 'T': (0, 0, 0, 1), 'U': (0, 0, 0, 1), 'V': (1, 1, 1, 0), 'W': (1, 0, 0, 1), 'Y': (0, 1, 0, 1)}

SHADOW_ENCODING = {'(': 1, ')': 1, '-': 0, '.': 0}

class diurnal.structure.Secondary[source]

Bases: object

Transform RNA secondary structures into useful formats.

normalize_distance_matrix() → numpy.ndarray[source]

Normalize the distance matrix.

This function returns a new distance matrix whose elements are normalized within the range 0.0 (farthest from a paired base) to 1.0 (paired base).

Parameters: distance_matrix (np.ndarray) – Result of the function to_distance_matrix.

Returns (np.ndarray): Normalized distance matrix.

quantize(mask: numpy.ndarray, threshold: float = None) → numpy.ndarray[source]

Eliminate invalid pairings in a secondary structure matrix.

Let the following represent a secondary structure matrix:

```
[[_, _, _, _, c, b],
 [_, _, _, _, _, a],
 [_, _, _, _, _, _],
 [_, _, _, _, _, _],
 [x, _, _, _, _, _],
 [y, z, _, _, _, _]]
```

It follows that (x, c), (y, b), and (z, a) must all be pairs of identical elements because the two elements of each pair occupy mirrored positions in the matrix and therefore describe the same two bases, which are either paired or unpaired. Differing elements would indicate that a base is both paired and unpaired, which is impossible. This function assigns the value 0 to all impossible pairings and 1 to all other values.

Steps:

Symmetrize the matrix by multiplying it by its transpose.
Determine a threshold value from the average of non-paired elements.
Assign 0 to all the elements below the threshold.
Quantize the matrix along both axes and multiply the results with each other.

Parameters

matrix (np.ndarray) – Contact matrix.
mask (np.ndarray) – Valid pairing mask.
threshold (float) – Value below which elements are discarded. Determined at runtime if not provided.

Returns (np.ndarray): Folded pairing matrix.
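The steps listed above can be approximated by the sketch below. It is a loose interpretation, not the library's implementation: the product with the transpose is taken element-wise, the runtime threshold is replaced by the mean of the symmetrized matrix (an assumption, since the exact rule is not given here), and the helper names are hypothetical.

```python
import numpy as np


def keep_axis_maxima(matrix, axis):
    """Return a 0/1 matrix marking the maximum of each row (axis=1) or column (axis=0)."""
    maxima = matrix.max(axis=axis, keepdims=True)
    return ((matrix == maxima) & (matrix > 0)).astype(int)


def quantize(matrix, mask, threshold=None):
    """Turn a matrix of pairing scores into a binary, symmetric contact matrix."""
    matrix = np.asarray(matrix, dtype=float) * np.asarray(mask)
    sym = matrix * matrix.T          # element-wise product with the transpose
    if threshold is None:
        threshold = sym.mean()       # assumption: stand-in for the runtime threshold
    sym[sym < threshold] = 0.0       # discard weak candidates
    # Keep an element only if it is the maximum of both its row and its column.
    return keep_axis_maxima(sym, 1) * keep_axis_maxima(sym, 0)


scores = np.array([[0.0, 0.1, 0.2, 0.9],
                   [0.1, 0.0, 0.8, 0.3],
                   [0.2, 0.7, 0.0, 0.1],
                   [0.8, 0.2, 0.1, 0.0]])
mask = np.ones((4, 4)) - np.eye(4)   # forbid pairing a base with itself
contacts = quantize(scores, mask)
```

In this example, only the pairs (0, 3) and (1, 2) survive because both of their mirrored scores are high.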

quantize_distance_matrix() → numpy.ndarray[source]

Create a contact matrix from a distance matrix.

Parameters: distance_matrix (np.ndarray) – Result of the function to_distance_matrix.

Returns (np.ndarray): Contact matrix.

quantize_vector() → numpy.ndarray[source]

Quantize a secondary structure vector.

Convert a vector of predicted brackets into a one-hot vector. For instance, [[0.9, 0.5, 0.1], [0.0, 0.5, 0.1]] is converted to [[1, 0, 0], [0, 1, 0]].

Parameters: prediction (list-like) – Secondary structure prediction.

Returns: Reformatted secondary structure.
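A minimal sketch of this conversion, assuming that the largest element of each row selects the class. The function below is written for illustration and is not the library's code.

```python
import numpy as np


def quantize_vector(prediction):
    """Set the largest element of each row to 1 and all others to 0."""
    prediction = np.asarray(prediction, dtype=float)
    onehot = np.zeros_like(prediction, dtype=int)
    rows = np.arange(len(prediction))
    onehot[rows, prediction.argmax(axis=1)] = 1
    return onehot


result = quantize_vector([[0.9, 0.5, 0.1], [0.0, 0.5, 0.1]])
```

Note that argmax returns the first index on ties, so an all-zero row (for instance padding) is mapped to the first class in this sketch.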

to_bracket() → list[source]

Convert a list of nucleotide pairings into a secondary structure bracket notation, e.g. ‘(((...)))’.

Parameters: pairings (list(int)) – A list of nucleotide pairings, e.g. the pairing (((...))) is represented as [8, 7, 6, -1, -1, -1, 2, 1, 0].

Returns (list): Secondary structure bracket notation.
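The conversion can be sketched in a few lines, assuming that -1 marks an unpaired base and that any other value is the zero-based index of the partner. The function name is hypothetical.

```python
def pairings_to_bracket(pairings):
    """Convert partner indices into bracket symbols."""
    symbols = []
    for i, j in enumerate(pairings):
        if j < 0:
            symbols.append(".")   # unpaired base
        elif j > i:
            symbols.append("(")   # paired to a downstream base
        else:
            symbols.append(")")   # paired to an upstream base
    return symbols


bracket = pairings_to_bracket([8, 7, 6, -1, -1, -1, 2, 1, 0])
```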

to_distance_matrix(size: int = 0, normalize: bool = True, power: float = 1) → numpy.ndarray[source]

Encode a secondary structure into a score contact matrix.

Transform the sequence of pairings into an n by n matrix, where n is the number of pairings, whose elements can be 1 for a paired base and x for unpaired bases, where x is given by: x = 1 - (d / n), in which d is the Manhattan distance with the closest paired base.

Parameters

pairings (list(int)) – List of base pairings.
size (int) – Dimension of the matrix. 0 for no padding.
normalize (bool) – If True, scale distances so that paired elements are 1 and the farthest elements are 0.
power (float) – Power to apply to normalized distances.

Returns (np.ndarray): Encoded matrix of the secondary structure.
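The formula x = 1 - (d / n) can be illustrated with a brute-force sketch. It is not the library's implementation: the function name is hypothetical, the normalize and power options are left out, and padding elements are simply set to 0.

```python
import numpy as np


def pairings_to_distance_matrix(pairings, size=0):
    """Score each element by its Manhattan distance to the closest contact."""
    n = len(pairings)
    dim = max(size, n)
    contacts = [(i, j) for i, j in enumerate(pairings) if j >= 0]
    matrix = np.zeros((dim, dim))
    if not contacts:
        return matrix
    for row in range(n):
        for col in range(n):
            d = min(abs(row - i) + abs(col - j) for i, j in contacts)
            matrix[row, col] = 1 - d / n   # contacts have d = 0, hence a score of 1
    return matrix


# A five-base molecule whose first and last bases are paired.
m = pairings_to_distance_matrix([4, -1, -1, -1, 0])
```

Because d can exceed n for elements far from any contact, values produced by this sketch can be negative for some structures; a normalization step would rescale them.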

to_elements() → str[source]

Convert pairings into secondary structure elements.

The possible elements or loops are:

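+-----------------+-----------+
| element         | character |
+=================+===========+
| bulge           | b         |
+-----------------+-----------+
| external loop   | e         |
+-----------------+-----------+
| hairpin loop    | h         |
+-----------------+-----------+
| internal loop   | i         |
+-----------------+-----------+
| multiloop       | m         |
+-----------------+-----------+
| stem / stacking | s         |
+-----------------+-----------+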

Parameters: pairings – List of pairings as indices or bracket notations.

Returns (str): List of elements.

to_matrix(size: int = 0) → numpy.ndarray[source]

Encode a secondary structure into a contact matrix.

Transform the sequence of pairings into an n by n matrix, where n is the number of pairings, whose elements can be 0 for an unpaired base and 1 for a paired base.

Parameters

pairings (list(int)) – List of base pairings.
size (int) – Dimension of the matrix. 0 for no padding.

Returns (np.ndarray): Encoded matrix of the secondary structure.
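A minimal sketch of the encoding, assuming that -1 marks an unpaired base. The function name is hypothetical.

```python
import numpy as np


def pairings_to_contact_matrix(pairings, size=0):
    """Set element (i, j) to 1 when base i is paired with base j."""
    n = len(pairings)
    dim = max(size, n)
    matrix = np.zeros((dim, dim), dtype=int)
    for i, j in enumerate(pairings):
        if j >= 0:
            matrix[i, j] = 1
    return matrix


# Five bases, first and last paired, zero-padded to a 6 by 6 matrix.
contact = pairings_to_contact_matrix([4, -1, -1, -1, 0], size=6)
```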

to_onehot(size: int = 0, map: dict = {'(': (1, 0, 0), ')': (0, 0, 1), '-': (0, 0, 0), '.': (0, 1, 0)}) → numpy.ndarray[source]

Encode pairings in a one-hot encoded dot-bracket secondary structure.

Parameters

pairings (List[int|str]) – A list of nucleotide pairings. The pairing (((...))) can be represented as [8, 7, 6, -1, -1, -1, 2, 1, 0] or [‘(’, ‘(’, ‘(’, ‘.’, ‘.’, ‘.’, ‘)’, ‘)’, ‘)’].
size (int) – Size of the output. 0 for no padding.
map (dict) – Assign an input to a vector.

Returns (np.ndarray): One-hot encoded secondary structure.

to_pairings() → list[source]

Convert the bracket notation to a list of pairings.

Parameters: bracket (List[str] | str) – Secondary structure.

Returns (List[int]): List of pairings.
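The usual way to perform this conversion is to match brackets with a stack, as in the sketch below. The function name is hypothetical, and raising ValueError on unbalanced input is a choice made for the example, not documented behavior.

```python
def bracket_to_pairings(bracket):
    """Return, for each position, the index of its partner or -1 if unpaired."""
    pairings = [-1] * len(bracket)
    stack = []  # indices of opening brackets that are still unmatched
    for i, symbol in enumerate(bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at index {i}")
            j = stack.pop()
            pairings[i], pairings[j] = j, i
    if stack:
        raise ValueError(f"unmatched '(' at index {stack[-1]}")
    return pairings


pairs = bracket_to_pairings("(((...)))")
```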

to_shadow(size: int = 0) → list[source]

Return the shadow of a secondary structure.

Parameters

pairings (List[str]) – Secondary structure.
size (int) – Final sequence length.
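A small sketch of a shadow computed from bracket symbols. It applies the values listed under Schemes.SHADOW_ENCODING (1 for paired, 0 for unpaired or padding); the function name and the right-padding behavior are assumptions.

```python
SHADOW_ENCODING = {"(": 1, ")": 1, ".": 0, "-": 0}


def bracket_to_shadow(bracket, size=0, mapping=SHADOW_ENCODING):
    """Map each symbol to paired (1) or unpaired (0), then pad up to size."""
    shadow = [mapping[s] for s in bracket]
    shadow += [mapping["-"]] * max(size - len(shadow), 0)
    return shadow


shadow = bracket_to_shadow("((..))", size=8)
```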

diurnal.train module

RNA secondary structure training utility module.

Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: April 2023
License: MIT

diurnal.train.categorize_matrix(prediction: numpy.ndarray) → numpy.ndarray[source]

diurnal.train.categorize_vector(prediction: list) → list[source]

Convert a vector of predicted pairings into a one-hot vector. For instance, [[0.9, 0.5, 0.1], [0.0, 0.5, 0.1]] is converted to [[1, 0, 0], [0, 1, 0]].

Parameters: prediction (list-like) – Secondary structure prediction.

Returns: Reformatted secondary structure.

diurnal.train.clean_matrices(primary: list, true: list, pred: list) → tuple[source]

Prepare a secondary structure prediction for evaluation.

Parameters

primary (list) – Vector-encoded primary structure.
true (list) – True vector-encoded secondary structure.
pred (list) – Predicted vector-encoded secondary structure.

Returns (tuple): A tuple of elements organized as:

sequence of bases
stripped true secondary structure matrix
stripped predicted secondary structure matrix

diurnal.train.clean_vectors(primary: list, true: list, pred: list) → tuple[source]

Prepare a secondary structure prediction for evaluation.

Parameters

primary (list) – Vector-encoded primary structure.
true (list) – True vector-encoded secondary structure.
pred (list) – Predicted vector-encoded secondary structure.

Returns (tuple): A tuple of elements organized as:

sequence of bases
stripped true secondary structure
stripped predicted secondary structure

diurnal.train.k_fold_indices(fractions: list, k: int, n: int) → list[source]

Return tuples of indices for K-fold splits.

Parameters

fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].
k – Number of folds.
n – Number of indices.

Returns (list): k tuples containing len(fractions) of index lists.

diurnal.train.k_fold_split(data, fractions: list, k: int, i: int) → list[source]

Split the data to make a K-fold split.

Parameters

data – Array-like object containing the data to split.
fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].
k – Number of folds.
i – Zero-based index of the fold.

Returns

A tuple containing the split data object.

diurnal.train.load_data(path: str, randomize: bool = True) → tuple[source]

Read formatted data into tensors.

Parameters

path (str) – Name of the directory that contains the Numpy files written by the function diurnal.database.format.
randomize (bool) – Randomize data if set to True.

Returns

Loaded data represented as: [primary structure, secondary structure, family].

Return type

list

diurnal.train.load_families(path: str, families: list, randomize=True, verbose: bool = True) → list[source]

Read formatted molecules of the specified RNA family.

Parameters

path (str) – Name of the directory that contains the Numpy files written by the function diurnal.database.format.
families (List(str) | str) – Families to read.
randomize (bool) – Randomize data if set to True.
verbose (bool) – Print informative messages.

Returns (dict): Loaded data represented as {“input”: tuple[list], “secondary”: list, “names”: list(str), “family”: list}.

diurnal.train.quantize_matrix(matrix: list[list[float]], dim: int = 0) → None[source]

Quantize a matrix.

All the rows of the matrix are formatted as follows:

The maximum element is set to 1.
The other elements are set to 0.

Parameters

matrix – Input matrix
dim – Dimension along which to quantize the matrix.

diurnal.train.shuffle_data(*args) → tuple[source]

Shuffle vectors while preserving their original one-to-one pairings.

For instance, consider:

a = [ 0, 1, 2 ]
b = [‘a’, ‘b’, ‘c’]

Shuffling lists a and b may result in:

a = [ 2, 1, 0 ]
b = [‘c’, ‘b’, ‘a’]

Parameters: args – List-like elements to be shuffled. They need to be of the same dimensions.

Returns (tuple): Shuffled data. The vectors are returned in the same order as they were provided.
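One way to obtain this behavior is to draw a single random permutation and apply it to every input, as sketched below. The function name and the seed parameter are hypothetical and are not part of the documented signature.

```python
import random


def shuffle_together(*args, seed=None):
    """Apply the same random permutation to every input sequence."""
    if not args:
        return ()
    n = len(args[0])
    if any(len(a) != n for a in args):
        raise ValueError("all inputs must have the same length")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return tuple([a[i] for i in order] for a in args)


a, b = shuffle_together([0, 1, 2], ["a", "b", "c"], seed=0)
```

Whatever permutation is drawn, element k of a and element k of b still come from the same original position.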

diurnal.train.split(data, fractions: tuple[float], offset: int = 0) → list[source]

Split an array of data.

Parameters

data (any) – Array-like data to split.
fractions (tuple[float]) – Fraction of data in each resulting set. Elements must sum to 1.
offset (int) – Index offset.

Returns (list[any]): List of split sets.

Example:

>>> data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> split(data, (0.2, 0.8), 1)
[[1, 2], [3, 4, 5, 6, 7, 8, 9, 0]]
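The result shown in the example is consistent with rotating the data by offset before cutting it into consecutive slices. The sketch below implements that reading with a hypothetical function name; the rounding rule and the choice to give the remainder to the last subset are assumptions.

```python
def split_with_offset(data, fractions, offset=0):
    """Rotate data by offset, then cut it into consecutive subsets."""
    n = len(data)
    rotated = [data[(i + offset) % n] for i in range(n)]
    subsets, start = [], 0
    for k, fraction in enumerate(fractions):
        # The last subset takes whatever remains to avoid rounding gaps.
        end = n if k == len(fractions) - 1 else start + round(fraction * n)
        subsets.append(rotated[start:end])
        start = end
    return subsets


data = list(range(10))
train_test = split_with_offset(data, (0.2, 0.8), 1)
```

Increasing offset by the size of the first subset for each fold is one way to build K-fold splits from this function.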

diurnal.train.split_data(data, fractions: list, offset: int = 0) → list[source]

Split data in subsets according to the specified fractions.

Parameters

data – Array-like object containing the data to split.
fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].
offset – Number of indices to offset to assemble the subsets. Used for K-fold data splits.

Returns

A list containing the split data object.

diurnal.train.split_indices(fractions: list, n: int) → list[source]

Split a range of indices in subsets according to the specified fractions.

Parameters

fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].
n – Number of indices.

Returns (list): A list containing the split data object.

diurnal.visualize module

Data visualization module.

Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: April 2023
License: MIT

diurnal.visualize.compare_pairings(true: numpy.ndarray, prediction: numpy.ndarray, title: str = 'Comparison of Secondary Structures') → None[source]

Compare secondary structures.

Parameters

matrices – Contact matrices. Must contain two (2) matrices.
labels – Name of each contact matrix.
title – Graph title.

diurnal.visualize.heatmap(matrices: numpy.ndarray, title: str = '', label: bool = False) → None[source]

Visualize heatmaps.

The function opens a plot that visualizes the matrices argument. If matrices is a 3D array, the heatmap is the sum of all arrays along axis 0. If matrices is a 2D array, it is used as the heatmap.

Parameters

matrices – Set of 2D matrices or one 2D matrix.
title (str) – Graph title.
label (bool) – If True, label each axis.

diurnal.visualize.lengths(data) → None[source]: Display a histogram of the length of the data.

diurnal.visualize.potential_pairings(primary: str | list[str] | numpy.ndarray, secondary: list | tuple[list] = None, title: str = 'RNA Molecule Potential Pairings', map: dict = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}) → None[source]

Display a heatmap of potential pairings.

Parameters

primary (str) – List of bases or potential pairing matrix.
secondary (list) – Secondary structure or tuple of secondary structures represented as contact matrices or lists of pairings.
title (str) – Name of the graph.
map – Potential pairing to string map.

diurnal.visualize.prediction(primary, true, pred) → None[source]: Compare true and predicted secondary structures.

diurnal.visualize.primary_structure(primary) → None[source]

Print the sequence of nucleotides from a one-hot encoded primary structure.

Parameters: primary – Primary structure.

diurnal.visualize.print_contact_matrix(matrix: numpy.ndarray)[source]: Print a contact matrix in the terminal.

diurnal.visualize.secondary_structure(matrix, primary: list = None, title: str = 'RNA Molecule Pairings') → None[source]

Display a heatmap of the secondary structure.

Parameters

primary (List[str]) – Primary structure.
matrix (List[List[bool]]) – Secondary structure.
title (str) – Graph title.

diurnal.visualize.shadow(primary, true, pred) → None[source]: Compare shadows.

diurnal.visualize.structure_length_per_family(path: str, max_size: int = None) → None[source]

Display a histogram of RNA lengths.

Parameters

path (str) – Database file path.
max_size (int) – If provided, reject larger molecules.

Module contents

Diurnal is a Python library designed to predict RNA secondary structures.

Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: April 2023
License: MIT