pyproteonet.masking.masked_dataset.MaskedDataset

class pyproteonet.masking.masked_dataset.MaskedDataset(dataset: Dataset, masks: Dict[str, DataFrame] = {}, hidden: Dict[str, DataFrame] | None = None)

A dataset with some molecules masked. Used for self-supervised training of predictive imputation models.

dataset

The original dataset.

Type:: Dataset

masks

A dictionary mapping molecule names to boolean DataFrames representing the masks. Defaults to an empty dictionary.

Type:: Dict[str, pd.DataFrame]

hidden

A dictionary mapping molecule names to boolean DataFrames representing hidden molecules (optional). Defaults to None.

Type:: Optional[Dict[str, pd.DataFrame]]

__init__(dataset: Dataset, masks: Dict[str, DataFrame] = {}, hidden: Dict[str, DataFrame] | None = None) → None

Methods

`__init__`(dataset[, masks, hidden])
`from_ids`(dataset, mask_ids[, hidden_ids])	Create a MaskedDataset object from a given dataset and mask IDs.
`get_hidden_ids`(molecule)	For a given molecule type get the ids of the hidden molecules.
`get_mask_ids`(molecule)	For a given molecule type get the ids of the masked molecules.
`get_sample`(key)	Retrieves a dataset sample by its key.
`keys`()	Names of samples that have either masked or hidden molecules.
`set_hidden`(molecule, hidden)	Sets the hidden-mask (which molecules to hide) for a specific molecule type.
`set_hidden_ids`(molecule, ids)	For a specific molecule type sets the hidden molecules (given by their ids).
`set_mask`(molecule, mask)	Set the mask for a specific molecule type.
`set_mask_ids`(molecule, ids)	For a specific molecule type sets the masked molecules (given by their ids).
`set_samples_value_matrix`(matrix, molecule, ...)	Sets the values of a value column of the underlying dataset to those given by a matrix (numpy array, torch tensor, or pandas DataFrame).
`to_dgl_graph`(feature_columns, mappings[, ...])	Converts the masked dataset to a DGL heterograph.

Attributes

has_hidden

Check if the dataset has any hidden molecules.

classmethod from_ids(dataset: Dataset, mask_ids: Dict[str, Index], hidden_ids: Dict[str, Index] | None = None) → MaskedDataset

Create a MaskedDataset object from a given dataset and mask IDs.

Parameters:

dataset (Dataset) – The original dataset.
mask_ids (Dict[str, pd.Index]) – A dictionary giving for each molecule type a list of molecule ids to mask.
hidden_ids (Optional[Dict[str, pd.Index]]) – A dictionary giving for each molecule type a list of molecule ids to hide. Defaults to None.

Returns:

A new MaskedDataset object with the specified masked and hidden molecules.

Return type:

MaskedDataset
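The relationship between a list of ids and a boolean mask DataFrame can be sketched with plain pandas. This does not call pyproteonet. The helper `ids_to_mask`, the layout (rows are molecule ids, columns are sample names), and the choice to mask an id in every sample are assumptions made for illustration and may differ from what `from_ids` actually does.

```python
from typing import Sequence

import pandas as pd


def ids_to_mask(all_ids: pd.Index, sample_names: Sequence[str], masked_ids: pd.Index) -> pd.DataFrame:
    """Hypothetical helper: build a boolean frame that is True where a molecule id is masked."""
    # One boolean per molecule id: is this id among the ids to mask?
    row_is_masked = all_ids.isin(masked_ids)
    # Assumption: an id is masked in every sample, so each column gets the same flags.
    return pd.DataFrame({name: row_is_masked for name in sample_names}, index=all_ids)


# Made-up example data.
protein_ids = pd.Index(["P1", "P2", "P3", "P4"], name="id")
mask = ids_to_mask(protein_ids, ["sample_a", "sample_b"], pd.Index(["P2", "P4"]))

# Going back from the boolean frame to ids: keep rows with at least one True.
recovered_ids = mask.index[mask.any(axis=1).to_numpy()]
```

Under the same assumptions, `recovered_ids` mirrors the kind of result `get_mask_ids` is documented to return: a `pd.Index` of the masked molecule ids.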

get_hidden_ids(molecule: str) → Index

For a given molecule type get the ids of the hidden molecules.

Parameters:

molecule (str) – The name of the molecule type.

Returns:

The hidden IDs for the given molecule type.

Return type:

pd.Index

get_mask_ids(molecule: str) → Index

For a given molecule type get the ids of the masked molecules.

Parameters:: molecule (str) – The molecule type for which to retrieve the masked IDs.
Returns:: The masked IDs for the given molecule type.
Return type:: pd.Index

get_sample(key: str) → DatasetSample

Retrieves a dataset sample by its key.

Parameters:: key (str) – The key of the sample to retrieve.
Returns:: The dataset sample corresponding to the given key.
Return type:: DatasetSample

property has_hidden: bool

Check if the dataset has any hidden molecules.

Returns:: True if the dataset has hidden molecules, False otherwise.
Return type:: bool

keys() → Iterable[str]

Names of samples that have either masked or hidden molecules.

Returns:: An iterable of keys in the dataset.
Return type:: Iterable[str]
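One way to read "samples that have either masked or hidden molecules" is sketched below with plain pandas. The helper `samples_with_entries` is hypothetical, and the sketch assumes mask frames with one column per sample; it is not the library's implementation of `keys`.

```python
import pandas as pd

# Made-up mask and hidden frames: rows are molecule ids, columns are sample names (assumed layout).
ids = pd.Index(["P1", "P2"], name="id")
mask = pd.DataFrame({"s1": [True, False], "s2": [False, False], "s3": [False, False]}, index=ids)
hidden = pd.DataFrame({"s1": [False, False], "s2": [False, False], "s3": [False, True]}, index=ids)


def samples_with_entries(*frames: pd.DataFrame) -> list:
    """Hypothetical helper: sample names with at least one True in any of the given frames."""
    names = []
    for frame in frames:
        # A column qualifies if any molecule in it is flagged.
        flagged = frame.columns[frame.any(axis=0).to_numpy()]
        for name in flagged:
            if name not in names:
                names.append(name)
    return names


sample_names = samples_with_entries(mask, hidden)
```

Here `s2` is left out because it has no masked and no hidden molecule.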

set_hidden(molecule: str, hidden: DataFrame) → None

Sets the hidden-mask (which molecules to hide) for a specific molecule type.

Parameters:

molecule (str) – The name of the molecule type.
hidden (pd.DataFrame) – The hidden-mask dataframe.

Returns:

None

set_hidden_ids(molecule: str, ids: Index) → None

For a specific molecule type sets the hidden molecules (given by their ids).

Parameters:

molecule (str) – The name of the molecule type.
ids (pd.Index) – The IDs to be hidden.

Returns:

None

set_mask(molecule: str, mask: DataFrame) → None

Set the mask for a specific molecule type.

Parameters:

molecule (str) – The name of the molecule type.
mask (pd.DataFrame) – The mask dataframe.

Returns:

None

set_mask_ids(molecule: str, ids: Index) → None

For a specific molecule type sets the masked molecules (given by their ids).

Parameters:

molecule (str) – The name of the molecule type.
ids (pd.Index) – The IDs to be masked.

Returns:

None

set_samples_value_matrix(matrix: array | tensor | DataFrame, molecule: str, column: str, samples: List[str] | None = None, only_set_masked: bool = True) → None

Sets the values of a value column of the underlying dataset to those given by a matrix (numpy array, torch tensor, or pandas DataFrame). If `only_set_masked` is True, only the values of masked molecules are set. Useful for writing back the results of a model imputation.

Parameters:

matrix (Union[np.array, torch.tensor, pd.DataFrame]) – The value matrix to be set.
molecule (str) – The name of the molecule type (e.g. protein, peptide…).
column (str) – The name of the value column.
samples (Optional[List[str]], optional) – The list of names of samples to consider. Defaults to None.
only_set_masked (bool, optional) – If True, only sets the values for masked molecules. If False, sets the values for all molecules. Defaults to True.

Raises:

ValueError – If the provided sample names do not match the column names in the matrix.

Returns:

None
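The effect of `only_set_masked` can be illustrated with a minimal pandas sketch. The function `write_back` is a hypothetical stand-in that returns a new frame rather than modifying a dataset in place, and the layout (rows are molecule ids, columns are sample names) is an assumption; the actual method may behave differently in detail.

```python
import numpy as np
import pandas as pd


def write_back(values: pd.DataFrame, predicted: pd.DataFrame, mask: pd.DataFrame,
               only_set_masked: bool = True) -> pd.DataFrame:
    """Hypothetical stand-in: copy predictions into a value matrix."""
    # Mirror the documented error case: sample names must match the matrix columns.
    if list(predicted.columns) != list(values.columns):
        raise ValueError("sample names do not match the column names of the matrix")
    if only_set_masked:
        # DataFrame.mask replaces entries where the condition is True.
        return values.mask(mask, predicted)
    return predicted.copy()


# Made-up data: NaN marks the values that were masked out before training.
ids = pd.Index(["P1", "P2", "P3"], name="id")
values = pd.DataFrame({"s1": [1.0, np.nan, 3.0], "s2": [4.0, 5.0, np.nan]}, index=ids)
mask = pd.DataFrame({"s1": [False, True, False], "s2": [False, False, True]}, index=ids)
predicted = pd.DataFrame({"s1": [9.0, 2.0, 9.0], "s2": [9.0, 9.0, 6.0]}, index=ids)

updated = write_back(values, predicted, mask)
overwritten = write_back(values, predicted, mask, only_set_masked=False)
```

With `only_set_masked=True` the unmasked measurements survive and only the masked positions receive predictions; with `False` every entry is overwritten.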

to_dgl_graph(feature_columns: Dict[str, str | List[str]], mappings: str | List[str], molecule_columns: Dict[str, str | List[str]] = {}, mapping_directions: Dict[str, Tuple[str, str]] = {}, make_bidirectional: bool = False, features_to_float32: bool = True, samples: List[str] | None = None) → DGLGraph

Converts the masked dataset to a DGL heterograph.

Parameters:

feature_columns (Dict[str, Union[str, List[str]]]) – Dictionary specifying the feature columns for each molecule type.
mappings (Union[str, List[str]]) – List of mapping names or a single mapping name to be used for constructing the graph.
molecule_columns (Dict[str, Union[str, List[str]]], optional) – Dictionary specifying the molecule columns for each molecule type. Defaults to {}.
mapping_directions (Dict[str, Tuple[str, str]], optional) – Dictionary specifying the mapping directions (Tuple of molecule types) for each mapping. Defaults to {}.
make_bidirectional (bool, optional) – Whether to make the graph bidirectional by adding reverse edges. Defaults to False.
features_to_float32 (bool, optional) – Whether to convert the features to float32. Defaults to True.
samples (Optional[List[str]], optional) – List of sample names to include in the graph. Defaults to None.

Returns:

The converted DGL heterograph.

Return type:

dgl.DGLHeteroGraph