pyproteonet.masking.masked_dataset.MaskedDataset

class pyproteonet.masking.masked_dataset.MaskedDataset(dataset: Dataset, masks: Dict[str, DataFrame] = {}, hidden: Dict[str, DataFrame] | None = None)

A dataset with some molecules masked. Used for self-supervised training of predictive imputation models.

dataset

The original dataset.

Type:

Dataset

masks

A dictionary mapping molecule names to boolean DataFrames representing the masks. Defaults to an empty dictionary.

Type:

Dict[str, pd.DataFrame]

hidden

A dictionary mapping molecule names to boolean DataFrames representing hidden molecules (optional). Defaults to None.

Type:

Optional[Dict[str, pd.DataFrame]]
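The following is a minimal pandas-only sketch of what such a masks dictionary could look like. It assumes one row per molecule id and one column per sample, with True marking a masked entry; that layout, the molecule name "protein", the ids, and the sample names are all illustrative assumptions, not taken from the library.

```python
import pandas as pd

# Hypothetical molecule ids and sample names, for illustration only.
protein_ids = pd.Index(["P1", "P2", "P3", "P4"], name="id")
sample_names = ["sample_a", "sample_b"]

# Assumed layout: rows are molecule ids, columns are samples, True = masked.
protein_mask = pd.DataFrame(False, index=protein_ids, columns=sample_names)
protein_mask.loc[["P2", "P4"], "sample_a"] = True
protein_mask.loc["P3", "sample_b"] = True

# Dictionary keyed by molecule name, matching the documented type
# Dict[str, pd.DataFrame].
masks = {"protein": protein_mask}

# Ids that are masked in at least one sample.
masked_ids = protein_mask.index[protein_mask.any(axis=1)]
```

A dictionary with the same shape could describe hidden molecules; under this assumed layout the two differ only in how the model treats the marked entries.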

__init__(dataset: Dataset, masks: Dict[str, DataFrame] = {}, hidden: Dict[str, DataFrame] | None = None) None

Methods

__init__(dataset[, masks, hidden])

from_ids(dataset, mask_ids[, hidden_ids])

Create a MaskedDataset object from a given dataset and mask IDs.

get_hidden_ids(molecule)

For a given molecule type, get the ids of the hidden molecules.

get_mask_ids(molecule)

For a given molecule type get the ids of the masked molecules.

get_sample(key)

Retrieves a dataset sample by its key.

keys()

Names of samples that have either masked or hidden molecules.

set_hidden(molecule, hidden)

Sets the hidden-mask (which molecules to hide) for a specific molecule type.

set_hidden_ids(molecule, ids)

For a specific molecule type sets the hidden molecules (given by their ids).

set_mask(molecule, mask)

Set the mask for a specific molecule type.

set_mask_ids(molecule, ids)

For a specific molecule type sets the masked molecules (given by their ids).

set_samples_value_matrix(matrix, molecule, ...)

Sets the values of a value column of the underlying dataset to those given by a matrix (numpy array, torch tensor, or pandas DataFrame).

to_dgl_graph(feature_columns, mappings[, ...])

Converts the masked dataset to a DGL heterograph.

Attributes

has_hidden

Check if the dataset has any hidden molecules.

classmethod from_ids(dataset: Dataset, mask_ids: Dict[str, Index], hidden_ids: Dict[str, Index] | None = None) MaskedDataset

Create a MaskedDataset object from a given dataset and mask IDs.

Parameters:
  • dataset (Dataset) – The original dataset.

  • mask_ids (Dict[str, pd.Index]) – A dictionary giving for each molecule type a list of molecule ids to mask.

  • hidden_ids (Optional[Dict[str, pd.Index]]) – A dictionary giving for each molecule type a list of molecule ids to hide (default: None).

Returns:

A new MaskedDataset object with the specified masked and hidden molecules.

Return type:

MaskedDataset
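As a rough illustration of the relationship between id lists and boolean mask DataFrames, the sketch below expands a dictionary of ids into per-molecule masks. The helper `ids_to_mask` is hypothetical, and the sketch assumes that an id given this way is masked in every sample; the library may handle this differently.

```python
from typing import Dict, List

import pandas as pd


def ids_to_mask(ids: pd.Index, sample_names: List[str]) -> pd.DataFrame:
    """Hypothetical helper: mark the given molecule ids as masked in all samples."""
    return pd.DataFrame(True, index=ids, columns=sample_names)


# Illustrative ids and sample names (not taken from any real dataset).
mask_ids: Dict[str, pd.Index] = {
    "protein": pd.Index(["P2", "P4"]),
    "peptide": pd.Index(["pep7"]),
}
sample_names = ["sample_a", "sample_b"]

# One boolean DataFrame per molecule type.
masks = {
    molecule: ids_to_mask(ids, sample_names)
    for molecule, ids in mask_ids.items()
}
```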

get_hidden_ids(molecule: str) Index

For a given molecule type, get the ids of the hidden molecules.

Parameters:

molecule (str) – The name of the molecule type.

Returns:

The hidden IDs for the given molecule type.

Return type:

pd.Index

get_mask_ids(molecule: str) Index

For a given molecule type get the ids of the masked molecules.

Parameters:

molecule (str) – The molecule type for which to retrieve the masked IDs.

Returns:

The masked IDs for the given molecule type.

Return type:

pd.Index

get_sample(key: str) DatasetSample

Retrieves a dataset sample by its key.

Parameters:

key (str) – The key of the sample to retrieve.

Returns:

The dataset sample corresponding to the given key.

Return type:

DatasetSample

property has_hidden: bool

Check if the dataset has any hidden molecules.

Returns:

True if the dataset has hidden molecules, False otherwise.

Return type:

bool

keys() Iterable[str]

Names of samples that have either masked or hidden molecules.

Returns:

An iterable over the names of the samples that have masked or hidden molecules.

Return type:

Iterable[str]

set_hidden(molecule: str, hidden: DataFrame) None

Sets the hidden-mask (which molecules to hide) for a specific molecule type.

Parameters:
  • molecule (str) – The name of the molecule type.

  • hidden (pd.DataFrame) – The hidden-mask dataframe.

Returns:

None

set_hidden_ids(molecule: str, ids: Index) None

For a specific molecule type sets the hidden molecules (given by their ids).

Parameters:
  • molecule (str) – The name of the molecule type.

  • ids (pd.Index) – The IDs to be hidden.

Returns:

None

set_mask(molecule: str, mask: DataFrame) None

Set the mask for a specific molecule type.

Parameters:
  • molecule (str) – The name of the molecule type.

  • mask (pd.DataFrame) – The mask dataframe.

Returns:

None

set_mask_ids(molecule: str, ids: Index) None

For a specific molecule type sets the masked molecules (given by their ids).

Parameters:
  • molecule (str) – The name of the molecule type.

  • ids (pd.Index) – The IDs to be masked.

Returns:

None

set_samples_value_matrix(matrix: array | tensor | DataFrame, molecule: str, column: str, samples: List[str] | None = None, only_set_masked: bool = True) None

Sets the values of a value column of the underlying dataset to those given by a matrix (numpy array, torch tensor, or pandas DataFrame). If only_set_masked is True, only the values of the masked molecules are set. Useful for writing back the results of a model imputation.

Parameters:
  • matrix (Union[np.array, torch.tensor, pd.DataFrame]) – The value matrix to be set.

  • molecule (str) – The name of the molecule type (e.g. protein, peptide…).

  • column (str) – The name of the value column.

  • samples (Optional[List[str]], optional) – The list of names of samples to consider. Defaults to None.

  • only_set_masked (bool, optional) – If True, only sets the values for masked molecules. If False, sets the values for all molecules. Defaults to True.

Raises:

ValueError – If the provided sample names do not match the column names in the matrix.

Returns:

None
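The effect of only_set_masked can be sketched with plain pandas. This is not the library's implementation: `write_back` is a hypothetical stand-in that works on a single molecules-by-samples DataFrame, and the ids, sample names, and values are invented for illustration.

```python
import numpy as np
import pandas as pd

ids = pd.Index(["P1", "P2", "P3"])
samples = ["sample_a", "sample_b"]

# Current values; NaN stands in for entries that were masked out.
values = pd.DataFrame(
    [[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]], index=ids, columns=samples
)
mask = pd.DataFrame(
    [[False, False], [True, False], [False, True]], index=ids, columns=samples
)

# Model output as a raw array with the same shape as the value table.
predicted = np.array([[9.0, 9.0], [3.5, 9.0], [9.0, 6.5]])


def write_back(values, matrix, mask, only_set_masked=True):
    """Hypothetical sketch: copy matrix entries into values."""
    if isinstance(matrix, pd.DataFrame) and list(matrix.columns) != list(values.columns):
        raise ValueError("Sample names do not match the matrix columns.")
    matrix = pd.DataFrame(matrix, index=values.index, columns=values.columns)
    if only_set_masked:
        # Replace only the entries where the mask is True.
        return values.mask(mask, matrix)
    return matrix


imputed = write_back(values, predicted, mask)
```

With only_set_masked=True the observed values are left untouched and only the masked positions take the predicted values, which is the behaviour one usually wants after imputation.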

to_dgl_graph(feature_columns: Dict[str, str | List[str]], mappings: str | List[str], molecule_columns: Dict[str, str | List[str]] = {}, mapping_directions: Dict[str, Tuple[str, str]] = {}, make_bidirectional: bool = False, features_to_float32: bool = True, samples: List[str] | None = None) DGLGraph

Converts the masked dataset to a DGL heterograph.

Parameters:
  • feature_columns (Dict[str, Union[str, List[str]]]) – Dictionary specifying the feature columns for each molecule type.

  • mappings (Union[str, List[str]]) – List of mapping names or a single mapping name to be used for constructing the graph.

  • molecule_columns (Dict[str, Union[str, List[str]]], optional) – Dictionary specifying the molecule columns for each molecule type. Defaults to {}.

  • mapping_directions (Dict[str, Tuple[str, str]], optional) – Dictionary specifying the mapping directions (Tuple of molecule types) for each mapping. Defaults to {}.

  • make_bidirectional (bool, optional) – Whether to make the graph bidirectional by adding reverse edges. Defaults to False.

  • features_to_float32 (bool, optional) – Whether to convert the features to float32. Defaults to True.

  • samples (Optional[List[str]], optional) – List of sample names to include in the graph. Defaults to None.

Returns:

The converted DGL heterograph.

Return type:

dgl.DGLHeteroGraph
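Several parameters accept either a single string or a list of strings. The sketch below only illustrates the shape of such arguments and one way they could be normalized to lists. It does not call to_dgl_graph; the helper `as_list`, the column names, and the mapping name are all hypothetical.

```python
from typing import Dict, List, Tuple, Union


def as_list(value: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: wrap a single name in a list, copy an existing list."""
    return [value] if isinstance(value, str) else list(value)


# Illustrative arguments; none of these names come from a real dataset.
feature_columns: Dict[str, Union[str, List[str]]] = {
    "protein": "abundance",
    "peptide": ["abundance", "is_masked"],
}
mappings: Union[str, List[str]] = "peptide-protein"
mapping_directions: Dict[str, Tuple[str, str]] = {
    "peptide-protein": ("peptide", "protein"),
}

normalized_features = {m: as_list(cols) for m, cols in feature_columns.items()}
normalized_mappings = as_list(mappings)
```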