diurnal package
Subpackages
Submodules
diurnal.database module
RNA secondary structure database utility module.
This module contains functions to install (i.e. download and unwrap) RNA datasets and convert the data into matrix formats usable by processing algorithms. Note: the word dataset refers to a given set of RNA secondary structures (e.g. archiveII or RNASTRalign). The collection of datasets is referred to as the database.
import diurnal.database as db
db.download("./data/", "archiveII")
db.format_basic("./data/archiveII", "./data/formatted", 512)
Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: April 2023
License: MIT
- diurnal.database.download(dst: str, datasets: list, cleanup: bool = True, verbosity: int = 1) None [source]
Download and unpack RNA secondary structure datasets.
Download the datasets listed in the datasets argument, place them in the dst directory, and unpack the downloaded files.
- Parameters
dst (str) – Directory path in which the files are downloaded and unwrapped.
datasets (list(str)) – The list of datasets to download. The allowed datasets are archiveII and RNASTRalign.
cleanup (bool) – If True, the raw, compressed file is deleted. If False, that file is not deleted.
verbosity (int) – Verbosity of the function. 1 (default) prints informative messages. 0 silences the function.
- diurnal.database.download_all(dst: str, cleanup: bool = True, verbosity: int = 1) None [source]
Download all available RNA secondary structure datasets (archiveII and RNASTRalign).
- Parameters
dst (str) – Directory path in which the files are downloaded and unwrapped.
cleanup (bool) – If True, the raw, compressed file is deleted. If False, that file is not deleted.
verbosity (int) – Verbosity of the function. 1 (default) prints informative messages. 0 silences the function.
- diurnal.database.format_basic(src: str, dst: str, max_size: int, primary_structure_map: any = <function Primary.to_onehot>, secondary_structure_map: any = <function Secondary.to_onehot>, verbosity: int = 1) None [source]
Transform the original datasets into the representation provided by the arguments.
This function reads the RNA dataset files contained in the src directory, applies the encoding schemes defined by the arguments, and writes the result to the dst directory. All encoded elements are zero-padded to obtain elements of dimensions [1 X max_size].
The function writes five files:
info.rst describes the data.
primary_structure.np contains the encoded primary structures of the molecules.
secondary_structure.np contains the encoded secondary structures of the molecules.
families.np contains the encoded family of the molecules.
names.txt contains the newline-delimited names of the molecules.
- Parameters
src (str) – The directory in which RNA datasets are located. The function searches for RNA files recursively.
dst (str) – The directory in which the encoded RNA structures are written. If the directory does not exist, it is created.
max_size (int) – Maximal number of nucleotides in an RNA structure. If an RNA structure has more nucleotides than max_size, it is not included in the formatted dataset.
primary_structure_map – A dictionary or function that maps an RNA primary structure symbol to a vector (e.g. map A to [1, 0, 0, 0]). If None, the file x.np is not written.
secondary_structure_map – A dictionary or function that maps an RNA secondary structure symbol to a vector (e.g. map ‘.’ to [0, 1, 0]). If None, the file y.np is not written.
verbosity (int) – Verbosity level of the function. 1 (default) prints informative messages. 0 silences the function.
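The zero-padding rule described for format_basic can be illustrated with a small, library-independent sketch. The helper name encode_and_pad and the four-symbol map are assumptions made for illustration only; they are not part of diurnal and do not reproduce its file output.

```python
# Hypothetical helper (not part of diurnal): one-hot encode a sequence and
# zero-pad it up to max_size elements.
ONEHOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "U": (0, 0, 0, 1)}


def encode_and_pad(bases, max_size, mapping=ONEHOT):
    """Return max_size encoded vectors, or None if the sequence is too long."""
    if len(bases) > max_size:
        # Mirrors the documented rule: oversized molecules are excluded.
        return None
    width = len(next(iter(mapping.values())))
    encoded = [list(mapping[base]) for base in bases]
    padding = [[0] * width for _ in range(max_size - len(bases))]
    return encoded + padding


padded = encode_and_pad("ACGU", 6)
print(padded)
```

With max_size set to 6, the four encoded bases are followed by two all-zero vectors; a sequence longer than max_size yields None in this sketch.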
- diurnal.database.format_filenames(src: str, dst: str = None, size: int = 0, families: list[str] = [], randomize: bool = True, verbosity: int = 1) list[str] [source]
Obtain all file names that satisfy the arguments.
- Parameters
src (str) – Directory of the sequence files.
dst (str) – Output file name. Set to None for no output.
size (int) – Maximum length of a sequence. Provide 0 for no maximum length.
families (list[str]) – Set of RNA families to include. Provide [] to include all families.
randomize (bool) – If True, shuffle the filenames.
verbosity (int) – Verbosity level. 0 to disable the output.
Returns (list[str]): List of file names.
- diurnal.database.format_primary_secondary_structure(names: str, dst: str, size: int, map: Callable, verbosity: int = 1, epsilon: float = 0.01) str [source]
Convert a combination of primary and secondary structures into a Numpy file.
- Parameters
names (list[str]) – List of sequence file names.
dst (str) – Output file name.
size (int) – Maximum length of a sequence.
map (Callable) – Function that transforms the sequence of bases into a formatted primary structure.
verbosity (int) – Verbosity level. 0 to disable the output.
epsilon – Maximum Manhattan distance between two matrices to consider them different. Used to account for rounding.
- diurnal.database.format_primary_structure(names: str, dst: str, size: int, map: Callable, verbosity: int = 1, epsilon: float = 0.01) str [source]
Convert primary structures into a Numpy file.
- Parameters
names (list[str]) – List of sequence file names.
dst (str) – Output file name.
size (int) – Maximum length of a sequence.
map (Callable) – Function that transforms the sequence of bases into a formatted primary structure.
verbosity (int) – Verbosity level. 0 to disable the output.
epsilon – Maximum Manhattan distance between two matrices to consider them different. Used to account for rounding.
- diurnal.database.format_secondary_structure(names: str, dst: str, size: int, map: Callable, verbosity: int = 1, epsilon: float = 0.01) str [source]
Convert secondary structures into a Numpy file.
- Parameters
names (list[str]) – List of sequence file names.
dst (str) – Output file name.
size (int) – Maximum length of a sequence.
map (Callable) – Function that transforms the secondary structure into its formatted representation.
verbosity (int) – Verbosity level. 0 to disable the output.
epsilon – Maximum Manhattan distance between two matrices to consider them different. Used to account for rounding.
- diurnal.database.save(matrix: numpy.ndarray, name: str) None [source]
Save a matrix into a file.
- Parameters
matrix – Input matrix.
name – Complete filename. Directories are created.
- diurnal.database.summarize(path: str, primary_structure_map, secondary_structure_map) str [source]
Summarize the content of the formatted file directory.
- Parameters
path (str) – File path of the formatted data.
primary_structure_map – A dictionary or function that maps an RNA primary structure symbol to a vector (e.g. map A to [1, 0, 0, 0]). If None, the file x.np is not written.
secondary_structure_map – A dictionary or function that maps an RNA secondary structure symbol to a vector (e.g. map ‘.’ to [0, 1, 0]). If None, the file y.np is not written.
- Returns (str): Informative file containing:
Title
Generation date and time
Number of structures
Structure size (number of nucleotides)
Primary structure encoding example
Secondary structure encoding example
diurnal.evaluate module
RNA secondary structure prediction evaluation module.
This module contains functions to evaluate RNA secondary structure predictions by comparing a predicted structure to a reference structure.
Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: April 2023
License: MIT
- class diurnal.evaluate.Bracket[source]
Bases:
object
Evaluate predictions made with the bracket notation.
- confusion_matrix(pred: list[str], symbols: str = '(.)') float [source]
Get the confusion matrix of the prediction.
- Parameters
true (list-like) – Vector of the true structure.
pred (list-like) – Vector of the predicted structure.
symbols – Set of possible elements.
Returns (tuple): A tuple containing the confusion matrix and a list of symbols that correspond to each row of the matrix.
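As an illustration of the kind of result described here, the following sketch builds a confusion matrix for bracket symbols in plain Python. The function name bracket_confusion_matrix is hypothetical, and the convention that rows index the true symbol and columns index the predicted symbol is an assumption, not a documented property of diurnal.

```python
# Hypothetical sketch (not the diurnal implementation).
# Assumption: rows index the true symbol, columns the predicted symbol.
def bracket_confusion_matrix(true, pred, symbols="(.)"):
    index = {symbol: i for i, symbol in enumerate(symbols)}
    matrix = [[0] * len(symbols) for _ in symbols]
    for t, p in zip(true, pred):
        matrix[index[t]][index[p]] += 1
    return matrix, list(symbols)


matrix, labels = bracket_confusion_matrix("((..))", "(.(.))")
print(labels)  # ['(', '.', ')']
print(matrix)  # [[1, 1, 0], [1, 1, 0], [0, 0, 2]]
```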
- convert_to_scalars(pred: list[str], symbols: tuple[str]) tuple [source]
Convert a vector of vectors into a vector of scalars. For instance, [[0, 1], [0, 1], [1, 0]] and [‘.’, ‘.’, ‘(‘] are converted to [0, 0, 1].
- Parameters
true (list-like) – Vector of the true structure.
pred (list-like) – Vector of the predicted structure.
symbols – Set of possible elements.
Returns (tuple): Tuple containing the scalar vectors.
- crop(length: int) list[str | int] [source]
Return a cropped secondary structure to exclude padding.
- Parameters
bracket – Bracket notation of the secondary structure.
length – Number of bases in the primary structure.
Returns: The bracket argument from element 0 to length.
- micro_f1(pred: list[str], symbols: str = '(.)') float [source]
Compute the micro F1-score by considering the secondary structure symbols ‘(’, ‘.’, and ‘)’ as three different classes.
- Parameters
true (list-like) – Vector of the true structure.
pred (list-like) – Vector of the predicted structure.
symbols – Set of possible elements.
Returns (float): F1-score of the prediction, i.e. a value between 0 and 1.
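A micro-averaged F1-score pools the true positives, false positives, and false negatives of every class before computing precision and recall. The sketch below shows that computation for the three bracket symbols; the function name micro_f1_sketch is hypothetical and the code is an illustration, not the diurnal implementation.

```python
# Hypothetical sketch (not the diurnal implementation) of a micro F1-score
# over the three classes '(', '.', and ')'.
def micro_f1_sketch(true, pred, symbols="(.)"):
    tp = fp = fn = 0
    for symbol in symbols:
        for t, p in zip(true, pred):
            if p == symbol and t == symbol:
                tp += 1
            elif p == symbol:
                fp += 1  # predicted this class, but the truth differs
            elif t == symbol:
                fn += 1  # missed an element of this class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


score = micro_f1_sketch("((..))", "(.(.))")
print(score)
```

In this example four of the six symbols are predicted correctly, so the pooled precision and recall are both 4/6 and the score is approximately 0.667.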
- class diurnal.evaluate.ContactMatrix[source]
Bases:
object
Evaluate predictions made with contact matrices.
- crop(length: int) list[int] [source]
Return a cropped contact matrix to exclude padding.
- Parameters
contact – Contact matrix of the secondary structure.
length – Number of bases in the primary structure.
Returns: The length by length upper left square of the contact matrix.
- class diurnal.evaluate.Shadow[source]
Bases:
object
Evaluate predictions made with secondary structure shadows, i.e. a sequence of paired / unpaired bases.
- FN(pred: list[int]) float [source]
Compute the false negative value (predicted unpaired bases that are actually paired).
- FP(pred: list[int]) float [source]
Compute the false positive value (predicted paired bases that are actually unpaired).
- TN(pred: list[int]) float [source]
Compute the true negative value (predicted unpaired bases that are actually unpaired).
- TP(pred: list[int]) float [source]
Compute the true positive value (predicted paired bases that are actually paired).
- crop(length: int) list[int] [source]
Return a cropped shadow to exclude padding.
- Parameters
shadow – Shadow of the secondary structure.
length – Number of bases in the primary structure.
Returns: The shadow argument from element 0 to length.
- precision(pred) float [source]
Compute the precision obtained by comparing two secondary structures. Precision is defined as:
\[TP / (TP + FP).\]
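The four counts and the precision formula can be sketched in plain Python as follows. The function names are hypothetical, and the sketch assumes that 1 marks a paired base and 0 an unpaired base, as in Schemes.SHADOW_ENCODING; it is not the diurnal implementation.

```python
# Hypothetical sketch (not the diurnal implementation).
# Assumption: 1 = paired base, 0 = unpaired base.
def shadow_counts(true, pred):
    """Return (TP, FP, TN, FN) for two shadows of equal length."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(true, pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def shadow_precision(true, pred):
    """Precision = TP / (TP + FP); returns 0.0 if nothing is predicted paired."""
    tp, fp, _, _ = shadow_counts(true, pred)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


true = [1, 1, 0, 0, 1, 1]
pred = [1, 0, 1, 0, 1, 1]
print(shadow_counts(true, pred))     # (3, 1, 1, 1)
print(shadow_precision(true, pred))  # 0.75
```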
- diurnal.evaluate.summarize_results(f1_scores: list, name: str) None [source]
Summarize the f1-scores.
- Parameters
f1_scores (list(float)) – List of f1-scores.
name (str) – Name of the results printed along with the summary.
- diurnal.evaluate.to_shadow(bracket: list[str] | str) list[int] [source]
Convert a bracket notation to a secondary structure shadow.
- Parameters
bracket – Secondary structure represented in bracket notation with the characters (, ., and ).
Returns: Secondary structure shadow in which 0 stands for ( or ) and 1 stands for the . character.
diurnal.family module
RNA family utility module.
This module simplifies operations related to the encoding of RNA families into other representations.
Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: June 2023
License: MIT
- diurnal.family.all_but(families: list[str]) bool [source]
Return all RNA family names except those provided as arguments.
- Parameters
families (List(str) | str) – RNA families to exclude.
Returns (List(str)): The list of selected RNA families.
- diurnal.family.get_name(filename: str) str [source]
Attempt to determine the family of an RNA molecule based on its filename.
- Parameters
filename (str) – Name of the file containing the representation of the RNA molecule.
Returns (str): RNA family if found, empty string otherwise.
- diurnal.family.is_known(family: str) bool [source]
Check if an RNA family is recognized.
- Parameters
family (str) – Family test name.
Returns (bool): True if the family is recognized, False otherwise.
- diurnal.family.select(names: list[str], families: str | list[str]) list[str] [source]
Return a list of molecule names that belong to a provided family.
- Parameters
names (list[str]) – List of names to filter.
families (str | list[str]) – Family or families to preserve.
Returns (list[str]): List of names.
- diurnal.family.split(names: list[str]) dict [source]
Split a list of molecule names into a dictionary of names organized by family.
- Parameters
names – List of molecule names.
Returns: Dictionary formatted as {“family”: [names]}.
- diurnal.family.to_name(vector: list) str [source]
Convert a one-hot-encoded family back into its name.
- Parameters
vector (list) – A one-hot encoded family.
Returns (str): Family name.
- diurnal.family.to_onehot(family: str, map: dict = {'16s': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0], '23s': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], '5s': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'RNaseP': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0], 'SRP': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0], 'grp1': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0], 'grp2': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0], 'tRNA': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1], 'telomerase': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], 'tmRNA': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]}) list [source]
Encode a family into a one-hot vector.
- Parameters
family (str) – RNA family.
map (dict) – A dictionary that assigns a family to a vector.
Returns (list(int)): One-hot encoded family.
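The default map shown in the to_onehot signature assigns one position of a ten-element vector to each family. A minimal sketch of the encoding and of the reverse lookup performed by to_name follows; the helper names are hypothetical and the code only mirrors the documented map, not the library internals.

```python
# Hypothetical sketch (not the diurnal implementation). The family order
# reproduces the positions of the default map in the to_onehot signature.
FAMILIES = ["5s", "16s", "23s", "grp1", "grp2",
            "RNaseP", "SRP", "telomerase", "tmRNA", "tRNA"]
FAMILY_MAP = {
    name: [int(i == j) for j in range(len(FAMILIES))]
    for i, name in enumerate(FAMILIES)
}


def family_to_onehot(family, mapping=FAMILY_MAP):
    return mapping[family]


def onehot_to_family(vector, mapping=FAMILY_MAP):
    """Return the family whose vector matches, or an empty string."""
    for name, encoded in mapping.items():
        if list(vector) == encoded:
            return name
    return ""


print(family_to_onehot("16s"))                           # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(onehot_to_family([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]))  # tRNA
```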
diurnal.structure module
Transform RNA structures into useful representations.
Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: June 2023
License: MIT
- class diurnal.structure.Constants[source]
Bases:
object
Set of physical values that constrain RNA structures.
- LOOP_MIN_DISTANCE
Minimum number of nucleotides between two bases paired to each other. For instance, in the sequence ACCCU, the bases A and U can be paired because they are separated by three bases. However, in the sequence ACU, the bases A and U cannot be paired because they are too close.
- Type
int
- LOOP_MIN_DISTANCE = 3
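The constraint can be expressed as a check on the number of nucleotides that separate two positions. The function can_pair below is a hypothetical helper written for illustration; it reproduces the ACCCU and ACU examples from the attribute description.

```python
# Hypothetical helper (not part of diurnal) illustrating LOOP_MIN_DISTANCE.
LOOP_MIN_DISTANCE = 3


def can_pair(i, j, min_distance=LOOP_MIN_DISTANCE):
    """Return True if enough nucleotides separate positions i and j."""
    nucleotides_between = abs(j - i) - 1
    return nucleotides_between >= min_distance


print(can_pair(0, 4))  # A and U in ACCCU: three bases in between -> True
print(can_pair(0, 2))  # A and U in ACU: one base in between -> False
```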
- class diurnal.structure.Primary[source]
Bases:
object
Transform RNA primary structures into useful formats.
- to_mask(size: int = 0) numpy.ndarray [source]
Make a primary structure pairing mask.
Return a copy of the input matrix in which impossible pairings are set to 0 and possible pairings are set to 1.
- Parameters
pairings (np.ndarray) – Primary structure potential pairing matrix.
size (int) – Matrix dimension. 0 for no padding.
Returns (np.ndarray): Pairing matrix mask.
- to_matrix(size: int = 0, map: dict = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}) numpy.ndarray [source]
Encode a primary structure in a matrix of potential pairings.
Create an n by n matrix, where n is the number of bases, in which each element represents a potential RNA base pairing. For instance, the pairing AA is not possible and is assigned the invalid value of the map parameter. AU is a valid pairing, so the corresponding element is assigned its value in the map.
- Parameters
bases (list(str)) – Primary structure (sequence of bases).
size (int) – Matrix dimension. 0 for no padding.
map (dict) – Assign a pairing to a matrix element. The elements of the map must be (1) convertible to a Numpy array and (2) of the same dimension.
Returns (np.ndarray): Encoded matrix.
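The lookup described for to_matrix can be sketched with nested lists. For readability, the sketch uses the scalar values listed in Schemes.IUPAC_PAIRINGS_SCALARS rather than the default one-hot map. The function name potential_pairing_matrix is hypothetical, and the sketch ignores padding and any minimum loop distance, which the library may handle differently.

```python
# Hypothetical sketch (not the diurnal implementation). The scalar values
# are copied from Schemes.IUPAC_PAIRINGS_SCALARS.
PAIRING_SCALARS = {"-": 0, "AU": 2, "CG": 3, "GC": 3,
                   "GU": 1, "UA": 2, "UG": 1, "invalid": 0}


def potential_pairing_matrix(bases, mapping=PAIRING_SCALARS):
    """Element [i][j] encodes the pairing of base i with base j."""
    return [
        [mapping.get(a + b, mapping["invalid"]) for b in bases]
        for a in bases
    ]


for row in potential_pairing_matrix("GAUC"):
    print(row)
# [0, 0, 1, 3]
# [0, 0, 2, 0]
# [1, 2, 0, 0]
# [3, 0, 0, 0]
```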
- to_onehot(size: int = 0, map: dict = {'-': (0, 0, 0, 0), '.': (0, 0, 0, 0), 'A': (1, 0, 0, 0), 'B': (0, 1, 1, 1), 'C': (0, 1, 0, 0), 'D': (1, 0, 1, 1), 'G': (0, 0, 1, 0), 'H': (1, 1, 0, 1), 'K': (0, 0, 1, 1), 'M': (1, 1, 0, 0), 'N': (1, 1, 1, 1), 'R': (1, 0, 1, 0), 'S': (0, 1, 1, 0), 'T': (0, 0, 0, 1), 'U': (0, 0, 0, 1), 'V': (1, 1, 1, 0), 'W': (1, 0, 0, 1), 'Y': (0, 1, 0, 1)}) numpy.ndarray [source]
Transform a sequence of bases into a one-hot encoded vector.
- Parameters
bases (List[str] | str) – A sequence of bases, e.g. ['A', 'U'] or AU.
size (int) – Size of a normalized vector. 0 for no padding.
map (dict) – Assign an input to a vector.
Returns (np.ndarray): One-hot encoded primary structure, e.g. [[1, 0, 0, 0], [0, 0, 0, 1]] for AU.
- to_sequence(strip: bool = True, map: dict = {'-': (0, 0, 0, 0), '.': (0, 0, 0, 0), 'A': (1, 0, 0, 0), 'B': (0, 1, 1, 1), 'C': (0, 1, 0, 0), 'D': (1, 0, 1, 1), 'G': (0, 0, 1, 0), 'H': (1, 1, 0, 1), 'K': (0, 0, 1, 1), 'M': (1, 1, 0, 0), 'N': (1, 1, 1, 1), 'R': (1, 0, 1, 0), 'S': (0, 1, 1, 0), 'T': (0, 0, 0, 1), 'U': (0, 0, 0, 1), 'V': (1, 1, 1, 0), 'W': (1, 0, 0, 1), 'Y': (0, 1, 0, 1)}) list [source]
Transform a one-hot encoded vector into a sequence of bases.
- Parameters
vector (list-like) – One-hot encoded primary structure.
strip (bool) – Remove empty elements at the vector’s right end.
map – A dictionary or function that maps bases to vectors.
Returns (list): A sequence of bases, e.g. ['A', 'U'].
- unpad_matrix(map: dict = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}) numpy.ndarray [source]
Strip a matrix of its padding elements.
- Parameters
matrix – Input matrix (Numpy array of Python lists).
map (dict) – Assign a pairing to a matrix element.
Returns (list): Unpadded matrix.
- class diurnal.structure.Schemes[source]
Bases:
object
RNA structure schemes
The attributes of this class are used to transform raw RNA sequence data into other representations that can be used for prediction problems.
- IUPAC_TO_ONEHOT
One-hot encoding dictionary for IUPAC symbols. See: https://www.bioinformatics.org/sms/iupac.html
- Type
dict
- IUPAC_ONEHOT_PAIRINGS_VECTOR
One-hot encoded nucleotide pairings, including normal ones (AU, UA, CG, and GC) and wobble pairs (GU and UG). Taken from CNNFold by Booy et al.
- Type
dict
- BRACKET_TO_ONEHOT
One-hot encoding dictionary for a secondary structure that relies on the bracket notation. . is an unpaired base. ( is a base paired to a downstream base. ) is a base paired to an upstream base. - is a padding (i.e. empty) base.
- Type
dict
- SHADOW_ENCODING
One-hot encoding dictionary to encode the shadow of the secondary structure (i.e. the symbols ( and ) of the bracket notation are considered identical).
- Type
dict
- BRACKET_TO_ONEHOT = {'(': (1, 0, 0), ')': (0, 0, 1), '-': (0, 0, 0), '.': (0, 1, 0)}
- IUPAC_ONEHOT_PAIRINGS_VECTOR = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}
- IUPAC_PAIRINGS_SCALARS = {'-': 0, 'AU': 2, 'CG': 3, 'GC': 3, 'GU': 1, 'UA': 2, 'UG': 1, 'invalid': 0}
- IUPAC_PAIRINGS_SCALARS_NEGATIVE_PADDING = {'-': -1, 'AU': 2, 'CG': 3, 'GC': 3, 'GU': 1, 'UA': 2, 'UG': 1, 'invalid': 0}
- IUPAC_TO_ONEHOT = {'-': (0, 0, 0, 0), '.': (0, 0, 0, 0), 'A': (1, 0, 0, 0), 'B': (0, 1, 1, 1), 'C': (0, 1, 0, 0), 'D': (1, 0, 1, 1), 'G': (0, 0, 1, 0), 'H': (1, 1, 0, 1), 'K': (0, 0, 1, 1), 'M': (1, 1, 0, 0), 'N': (1, 1, 1, 1), 'R': (1, 0, 1, 0), 'S': (0, 1, 1, 0), 'T': (0, 0, 0, 1), 'U': (0, 0, 0, 1), 'V': (1, 1, 1, 0), 'W': (1, 0, 0, 1), 'Y': (0, 1, 0, 1)}
- SHADOW_ENCODING = {'(': 1, ')': 1, '-': 0, '.': 0}
- class diurnal.structure.Secondary[source]
Bases:
object
Transform RNA secondary structures into useful formats.
- normalize_distance_matrix() numpy.ndarray [source]
Normalize the distance matrix.
This function returns a new distance matrix whose elements are normalized within the range 0.0 (farthest from a paired base) to 1.0 (paired base).
- Parameters
distance_matrix (np.ndarray) – Result of the function to_distance_matrix.
Returns (np.ndarray): Normalized distance matrix.
- quantize(mask: numpy.ndarray, threshold: float = None) numpy.ndarray [source]
Eliminate invalid pairings in a secondary structure matrix.
Let the following represent a secondary structure matrix:
[[_, _, _, _, c, b],
 [_, _, _, _, _, a],
 [_, _, _, _, _, _],
 [_, _, _, _, _, _],
 [x, _, _, _, _, _],
 [y, z, _, _, _, _]]
It follows that (x, a), (y, b), and (z, c) must all be pairs of identical elements because they represent either paired or unpaired bases. Differing elements would indicate that a base is both paired and unpaired, which is impossible. This function assigns the value 0 to all impossible pairings and 1 to all other values.
Steps:
Symmetrize the matrix by multiplying it by its transpose.
Determine a threshold value from the average of non-paired elements.
Assign 0 to all the elements below the threshold.
Quantize the matrix along both axes and multiply the result with each other.
- Parameters
matrix (np.ndarray) – Contact matrix.
mask (np.ndarray) – Valid pairing mask.
threshold (float) – Value below which elements are discarded. Determined at runtime if not provided.
Returns (np.ndarray): Folded pairing matrix.
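The steps listed for quantize can be sketched roughly as follows. This is an interpretation under several assumptions, not the diurnal implementation: the multiplication by the transpose is taken to be element-wise, the mask is applied during symmetrization, the threshold is supplied by the caller instead of being derived at runtime, and "quantize along both axes" is read as keeping the maximum of each row and of each column. The function names are hypothetical.

```python
# Hypothetical sketch of the documented steps (not the diurnal implementation).
def keep_row_maxima(matrix):
    """Set the largest non-zero element of each row to 1 and the rest to 0."""
    result = []
    for row in matrix:
        best = max(row)
        k = row.index(best)
        result.append([1 if (j == k and best > 0) else 0 for j in range(len(row))])
    return result


def quantize_sketch(matrix, mask, threshold):
    n = len(matrix)
    # Symmetrize element-wise with the transpose and apply the pairing mask.
    sym = [[matrix[i][j] * matrix[j][i] * mask[i][j] for j in range(n)]
           for i in range(n)]
    # Assign 0 to all elements below the threshold.
    sym = [[v if v >= threshold else 0.0 for v in row] for row in sym]
    # Quantize along rows, then along columns (rows of the transpose).
    rows = keep_row_maxima(sym)
    cols = [list(c) for c in zip(*keep_row_maxima([list(c) for c in zip(*sym)]))]
    # Keep only the elements selected along both axes.
    return [[rows[i][j] * cols[i][j] for j in range(n)] for i in range(n)]


prediction = [[0.1, 0.2, 0.1, 0.9],
              [0.2, 0.1, 0.3, 0.1],
              [0.1, 0.3, 0.1, 0.2],
              [0.8, 0.1, 0.2, 0.1]]
mask = [[1] * 4 for _ in range(4)]
folded = quantize_sketch(prediction, mask, threshold=0.5)
for row in folded:
    print(row)
```

In this example only the pair of elements (0, 3) and (3, 0) survives, because both directions carry a high score and their product exceeds the threshold.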
- quantize_distance_matrix() numpy.ndarray [source]
Create a contact matrix from a distance matrix.
- Parameters
distance_matrix (np.ndarray) – Result of the function to_distance_matrix.
Returns (np.ndarray): Contact matrix.
- quantize_vector() numpy.ndarray [source]
Quantize a secondary structure vector.
Convert a vector of predicted brackets into a one-hot vector. For instance, [[0.9, 0.5, 0.1], [0.0, 0.5, 0.1]] is converted to [[1, 0, 0], [0, 1, 0]].
- Parameters
prediction (list-like) – Secondary structure prediction.
Returns: Reformatted secondary structure.
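The conversion in the example amounts to setting the largest element of each vector to 1 and the others to 0. A minimal sketch, with the hypothetical name argmax_onehot, reproduces the documented example; it is an illustration rather than the diurnal implementation.

```python
# Hypothetical sketch (not the diurnal implementation): keep the largest
# score of each vector as 1 and set the other elements to 0.
def argmax_onehot(prediction):
    result = []
    for scores in prediction:
        scores = list(scores)
        best = scores.index(max(scores))
        result.append([1 if i == best else 0 for i in range(len(scores))])
    return result


print(argmax_onehot([[0.9, 0.5, 0.1], [0.0, 0.5, 0.1]]))  # [[1, 0, 0], [0, 1, 0]]
```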
- to_bracket() list [source]
Convert a list of nucleotide pairings into a secondary structure bracket notation, e.g. ‘(((...)))’.
- Parameters
pairings (list(int)) – A list of nucleotide pairings, e.g. the pairing (((...))) is represented as [8, 7, 6, -1, -1, -1, 2, 1, 0].
Returns (list): Secondary structure bracket notation.
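The relation between a pairing list and its bracket notation can be sketched as follows: an unpaired base (-1) becomes a dot, a base paired with a downstream index becomes an opening bracket, and a base paired with an upstream index becomes a closing bracket. The function name pairings_to_bracket is hypothetical; the code is an illustration, not the diurnal implementation.

```python
# Hypothetical sketch (not the diurnal implementation).
def pairings_to_bracket(pairings):
    symbols = []
    for i, j in enumerate(pairings):
        if j < 0:
            symbols.append(".")   # unpaired base
        elif j > i:
            symbols.append("(")   # paired to a downstream base
        else:
            symbols.append(")")   # paired to an upstream base
    return symbols


bracket = pairings_to_bracket([8, 7, 6, -1, -1, -1, 2, 1, 0])
print("".join(bracket))  # (((...)))
```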
- to_distance_matrix(size: int = 0, normalize: bool = True, power: float = 1) numpy.ndarray [source]
Encode a secondary structure into a score contact matrix.
Transform the sequence of pairings into an n by n matrix, where n is the number of pairings, whose elements can be 1 for a paired base and x for unpaired bases, where x is given by: x = 1 - (d / n), in which d is the Manhattan distance with the closest paired base.
- Parameters
pairings (list(int)) – List of base pairings.
size (int) – Dimension of the matrix. 0 for no padding.
normalize (bool) – If True, scale distances so that paired elements are 1 and the farthest elements are 0.
power (float) – Power to apply to normalized distances.
Returns (np.ndarray): Encoded matrix of the secondary structure.
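The formula x = 1 - (d / n) can be sketched with a brute-force search for the closest paired element. The function name distance_matrix_sketch is hypothetical, and the sketch leaves out the size, normalize, and power arguments; it illustrates the formula only and is not the diurnal implementation.

```python
# Hypothetical sketch (not the diurnal implementation) of x = 1 - d / n,
# where d is the Manhattan distance to the closest paired element.
def distance_matrix_sketch(pairings):
    n = len(pairings)
    paired = [(i, j) for i, j in enumerate(pairings) if j >= 0]
    if not paired:
        return [[0.0] * n for _ in range(n)]
    matrix = []
    for r in range(n):
        row = []
        for c in range(n):
            d = min(abs(r - i) + abs(c - j) for i, j in paired)
            row.append(1 - d / n)
        matrix.append(row)
    return matrix


scores = distance_matrix_sketch([4, -1, -1, -1, 0])
print(scores[0][4])  # paired element: 1.0
print(scores[0][3])  # one step away from a paired element: 0.8
```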
- to_elements() str [source]
Convert pairings into secondary structure elements.
The possible elements or loops are:
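+-----------------+-----------+
| element         | character |
+=================+===========+
| bulge           | b         |
+-----------------+-----------+
| external loop   | e         |
+-----------------+-----------+
| hairpin loop    | h         |
+-----------------+-----------+
| internal loop   | i         |
+-----------------+-----------+
| multiloop       | m         |
+-----------------+-----------+
| stem / stacking | s         |
+-----------------+-----------+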
- Parameters
pairings – List of pairings as indices or bracket notations.
Returns (str): List of elements.
- to_matrix(size: int = 0) numpy.ndarray [source]
Encode a secondary structure into a contact matrix.
Transform the sequence of pairings into an n by n matrix, where n is the number of pairings, whose elements can be 0 for an unpaired base and 1 for a paired base.
- Parameters
pairings (list(int)) – List of base pairings.
size (int) – Dimension of the matrix. 0 for no padding.
Returns (np.ndarray): Encoded matrix of the secondary structure.
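A contact matrix of this kind can be sketched with nested lists. The function name contact_matrix_sketch is hypothetical, and treating size as the dimension of a zero-padded output is an assumption based on the parameter description, not a confirmed detail of the diurnal implementation.

```python
# Hypothetical sketch (not the diurnal implementation).
# Assumption: size, when larger than the sequence, zero-pads the matrix.
def contact_matrix_sketch(pairings, size=0):
    n = max(len(pairings), size)
    matrix = [[0] * n for _ in range(n)]
    for i, j in enumerate(pairings):
        if j >= 0:
            matrix[i][j] = 1  # base i is paired with base j
    return matrix


contacts = contact_matrix_sketch([4, -1, -1, -1, 0])
for row in contacts:
    print(row)
```

For the pairing list [4, -1, -1, -1, 0], only the elements (0, 4) and (4, 0) are set to 1.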
- to_onehot(size: int = 0, map: dict = {'(': (1, 0, 0), ')': (0, 0, 1), '-': (0, 0, 0), '.': (0, 1, 0)}) numpy.ndarray [source]
Encode pairings in a one-hot encoded dot-bracket secondary structure.
- Parameters
pairings (List[int|str]) – A list of nucleotide pairings. The pairing (((...))) can be represented as [8, 7, 6, -1, -1, -1, 2, 1, 0] or [‘(’, ‘(’, ‘(’, ‘.’, ‘.’, ‘.’, ‘)’, ‘)’, ‘)’].
size (int) – Size of the output. 0 for no padding.
map (dict) – Assign an input to a vector.
Returns (np.ndarray): One-hot encoded secondary structure.
diurnal.train module
RNA secondary structure training utility module.
Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: April 2023
License: MIT
- diurnal.train.categorize_vector(prediction: list) list [source]
Convert a vector of predicted pairings into a one-hot vector. For instance, [[0.9, 0.5, 0.1], [0.0, 0.5, 0.1]] is converted to [[1, 0, 0], [0, 1, 0]].
- Parameters
prediction (list-like) – Secondary structure prediction.
Returns: Reformatted secondary structure.
- diurnal.train.clean_matrices(primary: list, true: list, pred: list) tuple [source]
Prepare a secondary structure prediction for evaluation.
- Parameters
primary (list) – Vector-encoded primary structure.
true (list) – True vector-encoded secondary structure.
pred (list) – Predicted vector-encoded secondary structure.
- Returns (tuple): A tuple of elements organized as:
sequence of bases
stripped true secondary structure matrix
stripped predicted secondary structure matrix
- diurnal.train.clean_vectors(primary: list, true: list, pred: list) tuple [source]
Prepare a secondary structure prediction for evaluation.
- Parameters
primary (list) – Vector-encoded primary structure.
true (list) – True vector-encoded secondary structure.
pred (list) – Predicted vector-encoded secondary structure.
- Returns (tuple): A tuple of elements organized as:
sequence of bases
stripped true secondary structure
stripped predicted secondary structure
- diurnal.train.k_fold_indices(fractions: list, k: int, n: int) list [source]
Return tuples of indices for K-fold splits.
- Parameters
fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].
k – Number of folds.
n – Number of indices.
Returns (list): k tuples containing len(fractions) of index lists.
- diurnal.train.k_fold_split(data, fractions: list, k: int, i: int) list [source]
Split the data to make a K-fold split.
- Parameters
data – Array-like object containing the data to split.
fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].
k – Number of folds.
i – Zero-based index of the fold.
- Returns
A tuple containing the split data object.
- diurnal.train.load_data(path: str, randomize: bool = True) tuple [source]
Read formatted data into tensors.
- Parameters
path (str) – Name of the directory that contains the Numpy files written by the function diurnal.database.format_basic.
randomize (bool) – Randomize data if set to True.
- Returns
- Loaded data represented as
[primary structure, secondary structure, family].
- Return type
list
- diurnal.train.load_families(path: str, families: list, randomize=True, verbose: bool = True) list [source]
Read formatted molecules of the specified RNA family.
- Parameters
path (str) – Name of the directory that contains the Numpy files written by the function diurnal.database.format_basic.
families (List(str) | str) – Families to read.
randomize (bool) – Randomize data if set to True.
verbose (bool) – Print informative messages.
Returns (dict): Loaded data represented as {“input”: tuple[list], “secondary”: list, “names”: list(str), “family”: list}.
- diurnal.train.quantize_matrix(matrix: list[list[float]], dim: int = 0) None [source]
Quantize a matrix.
All the rows of the matrix are formatted as follows:
The maximum element is set to 1.
The other elements are set to 0.
- Parameters
matrix – Input matrix
dim – Dimension along which to quantize the matrix.
- diurnal.train.shuffle_data(*args) tuple [source]
Shuffle vectors to preserve one-to-one original pairings.
For instance, consider:
a = [ 0, 1, 2 ]
b = [‘a’, ‘b’, ‘c’]
Shuffling lists a and b may result in:
a = [ 2, 1, 0 ]
b = [‘c’, ‘b’, ‘a’]
- Parameters
args – List-like elements to be shuffled. They need to be of the same dimensions.
Returns (tuple): Shuffled data. The vectors are returned in the same order as they were provided.
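One way to shuffle several sequences while keeping their elements aligned is to draw a single permutation of indices and apply it to every sequence. The sketch below takes that approach; the function name shuffle_together and its seed argument are hypothetical, and the code is not the diurnal implementation.

```python
import random


# Hypothetical sketch (not the diurnal implementation): apply one shared
# permutation to every sequence so that pairings are preserved.
def shuffle_together(*args, seed=None):
    order = list(range(len(args[0])))
    random.Random(seed).shuffle(order)
    return tuple([sequence[i] for i in order] for sequence in args)


a, b = shuffle_together([0, 1, 2], ["a", "b", "c"], seed=0)
print(a, b)
```

Whatever permutation is drawn, element k of the first result still corresponds to element k of the second.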
- diurnal.train.split(data, fractions: tuple[float], offset: int = 0) list [source]
Split an array of data.
- Parameters
data (any) – Array-like data to split.
fractions (tuple[float]) – Fraction of data in each resulting set. Elements must sum to 1.
offset (int) – Index offset.
Returns (list[any]): List of split sets.
Example:
>>> data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> split(data, (0.2, 0.8), 1)
[[1, 2], [3, 4, 5, 6, 7, 8, 9, 0]]
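The behaviour in the example is consistent with rotating the data by offset elements and then slicing it according to the fractions. The sketch below follows that reading; the function name split_with_offset is hypothetical, and the rounding rule and the handling of the last subset are assumptions rather than documented details.

```python
# Hypothetical sketch (not the diurnal implementation): rotate the data by
# `offset`, then cut it into consecutive subsets sized by `fractions`.
def split_with_offset(data, fractions, offset=0):
    n = len(data)
    rotated = list(data[offset:]) + list(data[:offset])
    subsets = []
    start = 0
    for k, fraction in enumerate(fractions):
        # Assumption: the last subset takes the remainder so that rounding
        # never drops an element.
        last = k == len(fractions) - 1
        stop = n if last else start + round(fraction * n)
        subsets.append(rotated[start:stop])
        start = stop
    return subsets


data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(split_with_offset(data, (0.2, 0.8), 1))
# [[1, 2], [3, 4, 5, 6, 7, 8, 9, 0]]
```

Varying the offset from one call to the next yields different train and test subsets from the same data, which is how an offset can support K-fold splits.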
- diurnal.train.split_data(data, fractions: list, offset: int = 0) list [source]
Split data in subsets according to the specified fractions.
- Parameters
data – Array-like object containing the data to split.
fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].
offset – Number of indices to offset to assemble the subsets. Used for K-fold data splits.
- Returns
A list containing the split data object.
- diurnal.train.split_indices(fractions: list, n: int) list [source]
Split a range of indices in subsets according to the specified fractions.
- Parameters
fractions – Proportion of each subset. For instance, to use 80% of the data for training and 20% for testing, use [0.8, 0.2].
n – Number of indices.
Returns (list): A list containing the split data object.
diurnal.visualize module
Data visualization module.
Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: April 2023
License: MIT
- diurnal.visualize.compare_pairings(true: numpy.ndarray, prediction: numpy.ndarray, title: str = 'Comparison of Secondary Structures') None [source]
Compare secondary structures.
- Parameters
matrices – Contact matrices. Must contain two (2) matrices.
labels – Name of each contact matrix.
title – Graph title.
- diurnal.visualize.heatmap(matrices: numpy.ndarray, title: str = '', label: bool = False) None [source]
Visualize heatmaps.
The function opens a plot that visualizes the matrices argument. If matrices is a 3D array, the heatmap is the sum of all arrays along axis 0. If matrices is a 2D array, it is used directly as the heatmap.
- Parameters
matrices – Set of 2D matrices or one 2D matrix.
title (str) – Graph title.
label (bool) – If True, label each axis.
- diurnal.visualize.potential_pairings(primary: str | list[str] | numpy.ndarray, secondary: list | tuple[list] = None, title: str = 'RNA Molecule Potential Pairings', map: dict = {'-': (0, 0, 0, 0, 0, 0, 0, 0), 'AU': (1, 0, 0, 0, 0, 0, 0, 0), 'CG': (0, 0, 1, 0, 0, 0, 0, 0), 'GC': (0, 0, 0, 1, 0, 0, 0, 0), 'GU': (0, 0, 0, 0, 1, 0, 0, 0), 'UA': (0, 1, 0, 0, 0, 0, 0, 0), 'UG': (0, 0, 0, 0, 0, 1, 0, 0), 'invalid': (0, 0, 0, 0, 0, 0, 1, 0)}) None [source]
Display a heatmap of potential pairings.
- Parameters
primary (str) – List of bases or potential pairing matrix.
secondary (list) – Secondary structure or tuple of secondary structures represented as contact matrices or lists of pairings.
title (str) – Name of the graph.
map – Potential pairing to string map.
- diurnal.visualize.prediction(primary, true, pred) None [source]
Compare true and predicted secondary structures.
- diurnal.visualize.primary_structure(primary) None [source]
Print the sequence of nucleotides from a one-hot encoded primary structure.
- Parameters
primary – Primary structure.
- diurnal.visualize.print_contact_matrix(matrix: numpy.ndarray)[source]
Print a contact matrix in the terminal.
Module contents
Diurnal is a Python library designed to predict RNA secondary structures.
Author: Vincent Therrien (therrien.vincent.2@courrier.uqam.ca)
Affiliation: Département d’informatique, UQÀM
File creation date: April 2023
License: MIT