pyproteonet.masking.masked_dataset.MaskedDataset
- class pyproteonet.masking.masked_dataset.MaskedDataset(dataset: Dataset, masks: Dict[str, DataFrame] = {}, hidden: Dict[str, DataFrame] | None = None)
A dataset with some molecules masked. Used for self-supervised training of predictive imputation models.
- masks
A dictionary mapping molecule names to boolean DataFrames representing the masks. Defaults to an empty dictionary.
- Type:
Dict[str, pd.DataFrame]
- hidden
A dictionary mapping molecule names to boolean DataFrames representing hidden molecules (optional). Defaults to None.
- Type:
Optional[Dict[str, pd.DataFrame]]
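As a rough illustration of the kind of data these attributes hold, the sketch below builds a mask dictionary with plain pandas. The layout (molecule ids as the index, sample names as the columns) and the names `protein`, `sample_a` and `sample_b` are assumptions made for illustration; they are not taken from the library.

```python
import pandas as pd

# Hypothetical molecule ids and sample names.
# Assumed layout: rows = molecule ids, columns = samples.
protein_ids = pd.Index([0, 1, 2, 3], name="id")
samples = ["sample_a", "sample_b"]

# Start with nothing masked, then mask protein 1 in sample_a
# and protein 3 in sample_b.
protein_mask = pd.DataFrame(False, index=protein_ids, columns=samples)
protein_mask.loc[1, "sample_a"] = True
protein_mask.loc[3, "sample_b"] = True

# The `masks` argument maps molecule names to such boolean DataFrames.
masks = {"protein": protein_mask}
```

A dictionary shaped like `masks` is what the `masks` parameter of the constructor expects according to its type annotation; `hidden` uses the same type.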
- __init__(dataset: Dataset, masks: Dict[str, DataFrame] = {}, hidden: Dict[str, DataFrame] | None = None) → None
Methods
__init__(dataset[, masks, hidden])
from_ids(dataset, mask_ids[, hidden_ids]) – Create a MaskedDataset object from a given dataset and mask IDs.
get_hidden_ids(molecule) – For a given molecule type get the ids of the hidden molecules.
get_mask_ids(molecule) – For a given molecule type get the ids of the masked molecules.
get_sample(key) – Retrieves a dataset sample by its key.
keys() – Names of samples that have either masked or hidden molecules.
set_hidden(molecule, hidden) – Sets the hidden-mask (which molecules to hide) for a specific molecule type.
set_hidden_ids(molecule, ids) – For a specific molecule type sets the hidden molecules (given by their ids).
set_mask(molecule, mask) – Set the mask for a specific molecule type.
set_mask_ids(molecule, ids) – For a specific molecule type sets the masked molecules (given by their ids).
set_samples_value_matrix(matrix, molecule, ...) – Sets the values of a value column of the underlying dataset to those given by a matrix (numpy array, torch tensor, or pandas DataFrame).
to_dgl_graph(feature_columns, mappings[, ...]) – Converts the masked dataset to a DGL heterograph.
Attributes
Check if the dataset has any hidden molecules.
- classmethod from_ids(dataset: Dataset, mask_ids: Dict[str, Index], hidden_ids: Dict[str, Index] | None = None) → MaskedDataset
Create a MaskedDataset object from a given dataset and mask IDs.
- Parameters:
dataset (Dataset) – The original dataset.
mask_ids (Dict[str, pd.Index]) – A dictionary giving for each molecule type a list of molecule ids to mask.
hidden_ids (Optional[Dict[str, pd.Index]]) – A dictionary giving for each molecule type a list of molecule ids to hide (default: None).
- Returns:
A new MaskedDataset object with the specified masked and hidden molecules.
- Return type:
MaskedDataset
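The relationship between id lists (as taken by from_ids) and boolean masks (as taken by the constructor) can be sketched with plain pandas. This is a minimal sketch, not the library's implementation: the helper names `ids_to_mask` and `mask_to_ids` are hypothetical, and it assumes that an id given in the list is masked in every sample.

```python
import pandas as pd

def ids_to_mask(all_ids: pd.Index, samples: list, ids: pd.Index) -> pd.DataFrame:
    """Boolean frame that is True for every listed id in every sample (assumed semantics)."""
    mask = pd.DataFrame(False, index=all_ids, columns=samples)
    mask.loc[ids, :] = True
    return mask

def mask_to_ids(mask: pd.DataFrame) -> pd.Index:
    """Ids that are masked in at least one sample."""
    return mask.index[mask.any(axis=1)]

# Hypothetical molecule ids and sample names.
all_ids = pd.Index([10, 11, 12, 13])
mask = ids_to_mask(all_ids, ["s1", "s2"], pd.Index([11, 13]))
recovered = mask_to_ids(mask)
```

Converting to a mask and back returns the original ids, which mirrors the pairing of the id-based setters with get_mask_ids.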
- get_hidden_ids(molecule: str)
For a given molecule type get the ids of the hidden molecules.
- Parameters:
molecule (str) – The molecule type for which to retrieve the hidden IDs.
- Returns:
The hidden IDs for the given molecule type.
- get_mask_ids(molecule: str) → Index
For a given molecule type get the ids of the masked molecules.
- Parameters:
molecule (str) – The molecule type for which to retrieve the masked IDs.
- Returns:
The masked IDs for the given molecule type.
- Return type:
pd.Index
- get_sample(key: str) → DatasetSample
Retrieves a dataset sample by its key.
- Parameters:
key (str) – The key of the sample to retrieve.
- Returns:
The dataset sample corresponding to the given key.
- Return type:
DatasetSample
Check if the dataset has any hidden molecules.
- Returns:
True if the dataset has hidden molecules, False otherwise.
- Return type:
bool
- keys() → Iterable[str]
Names of samples that have either masked or hidden molecules.
- Returns:
An iterable of keys in the dataset.
- Return type:
Iterable[str]
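The selection rule described for keys() can be illustrated with plain pandas. This is only a sketch under the assumption that masks are boolean DataFrames with sample names as columns; the sample names and ids are hypothetical.

```python
import pandas as pd

samples = ["s1", "s2", "s3"]
ids = pd.Index([0, 1, 2])

mask = pd.DataFrame(False, index=ids, columns=samples)
hidden = pd.DataFrame(False, index=ids, columns=samples)
mask.loc[0, "s1"] = True    # s1 has a masked molecule
hidden.loc[2, "s3"] = True  # s3 has a hidden molecule

# A sample qualifies if any molecule in its column is masked or hidden.
affected = (mask | hidden).any(axis=0)
sample_keys = [name for name, flag in affected.items() if flag]
```

Here `s2` is left out because it has neither masked nor hidden molecules.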
- set_hidden(molecule: str, hidden: DataFrame) → None
Sets the hidden-mask (which molecules to hide) for a specific molecule type.
- Parameters:
molecule (str) – The name of the molecule type.
hidden (pd.DataFrame) – The hidden-mask dataframe.
- Returns:
None
- set_hidden_ids(molecule: str, ids: Index) → None
For a specific molecule type sets the hidden molecules (given by their ids).
- Parameters:
molecule (str) – The name of the molecule type.
ids (pd.Index) – The IDs to be hidden.
- Returns:
None
- set_mask(molecule: str, mask: DataFrame) → None
Set the mask for a specific molecule type.
- Parameters:
molecule (str) – The name of the molecule type.
mask (pd.DataFrame) – The mask dataframe.
- Returns:
None
- set_mask_ids(molecule: str, ids: Index) → None
For a specific molecule type sets the masked molecules (given by their ids).
- Parameters:
molecule (str) – The name of the molecule type.
ids (pd.Index) – The IDs to be masked.
- Returns:
None
- set_samples_value_matrix(matrix: array | tensor | DataFrame, molecule: str, column: str, samples: List[str] | None = None, only_set_masked: bool = True) → None
Sets the values of a value column of the underlying dataset to those given by a matrix (numpy array, torch tensor, or pandas DataFrame). If only_set_masked is True, only the values of masked molecules are set. Useful for writing back the results of a model imputation.
- Parameters:
matrix (Union[np.array, torch.tensor, pd.DataFrame]) – The value matrix to be set.
molecule (str) – The name of the molecule type (e.g. protein, peptide…).
column (str) – The name of the value column.
samples (Optional[List[str]], optional) – The list of names of samples to consider. Defaults to None.
only_set_masked (bool, optional) – If True, only sets the values for masked molecules. If False, sets the values for all molecules. Defaults to True.
- Raises:
ValueError – If the provided sample names do not match the column names in the matrix.
- Returns:
None
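The effect of only_set_masked can be sketched with pandas alone. The function `write_back` below is a hypothetical stand-in, not the library method: it handles only DataFrame input, returns a new frame instead of modifying a dataset, and assumes that values, predictions and mask share the same index and column layout.

```python
import numpy as np
import pandas as pd

def write_back(values, predictions, mask, only_set_masked=True):
    """Return `values` with predictions written in, optionally only where mask is True."""
    if list(predictions.columns) != list(values.columns):
        raise ValueError("Sample names do not match the matrix columns.")
    if only_set_masked:
        # Keep the original value wherever the mask is False.
        return values.where(~mask, predictions)
    return predictions.copy()

ids = pd.Index([0, 1, 2])
samples = ["s1", "s2"]
values = pd.DataFrame(
    [[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]], index=ids, columns=samples
)
mask = pd.DataFrame(False, index=ids, columns=samples)
mask.loc[1, "s1"] = True
mask.loc[2, "s2"] = True

# Pretend a model predicted 9.0 everywhere.
predictions = pd.DataFrame(9.0, index=ids, columns=samples)
result = write_back(values, predictions, mask)
```

Only the two masked entries of `result` take the predicted value; all other entries keep the values from `values`.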
- to_dgl_graph(feature_columns: Dict[str, str | List[str]], mappings: str | List[str], molecule_columns: Dict[str, str | List[str]] = {}, mapping_directions: Dict[str, Tuple[str, str]] = {}, make_bidirectional: bool = False, features_to_float32: bool = True, samples: List[str] | None = None) → DGLGraph
Converts the masked dataset to a DGL heterograph.
- Parameters:
feature_columns (Dict[str, Union[str, List[str]]]) – Dictionary specifying the feature columns for each molecule type.
mappings (Union[str, List[str]]) – List of mapping names or a single mapping name to be used for constructing the graph.
molecule_columns (Dict[str, Union[str, List[str]]], optional) – Dictionary specifying the molecule columns for each molecule type. Defaults to {}.
mapping_directions (Dict[str, Tuple[str, str]], optional) – Dictionary specifying the mapping directions (Tuple of molecule types) for each mapping. Defaults to {}.
make_bidirectional (bool, optional) – Whether to make the graph bidirectional by adding reverse edges. Defaults to False.
features_to_float32 (bool, optional) – Whether to convert the features to float32. Defaults to True.
samples (Optional[List[str]], optional) – List of sample names to include in the graph. Defaults to None.
- Returns:
The converted DGL heterograph.
- Return type:
dgl.DGLHeteroGraph
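What make_bidirectional means conceptually can be shown without DGL. The sketch below is a pure-Python illustration, not the library's code: the edge dictionary format, the helper `add_reverse_edges`, the `_reverse` suffix and the mapping name `peptide_protein` are all assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

# (source molecule type, mapping name, destination molecule type)
EdgeType = Tuple[str, str, str]
EdgeList = List[Tuple[int, int]]

def add_reverse_edges(edges: Dict[EdgeType, EdgeList]) -> Dict[EdgeType, EdgeList]:
    """Return a copy of `edges` that also contains a reversed edge type per mapping."""
    result = dict(edges)
    for (src, name, dst), pairs in edges.items():
        # Hypothetical naming scheme for the reverse edge type.
        reverse_key = (dst, f"{name}_reverse", src)
        result[reverse_key] = [(v, u) for u, v in pairs]
    return result

# Hypothetical mapping: peptides 0 and 1 belong to protein 0, peptide 2 to protein 1.
edges = {("peptide", "peptide_protein", "protein"): [(0, 0), (1, 0), (2, 1)]}
bidirectional = add_reverse_edges(edges)
```

With reverse edges in place, information can flow from proteins to peptides as well as from peptides to proteins, which is the usual reason for enabling such an option in message-passing models.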