pyproteonet.imputation.dnn.gnn.homogeneous.impute_homogeneous_gnn

pyproteonet.imputation.dnn.gnn.homogeneous.impute_homogeneous_gnn(dataset: Dataset, molecule: str, mapping: str, column: str, partner_column: str, training_fraction=0.25, feature_columns: List[str] | None = None, partner_feature_columns: List[str] | None = None, result_column: str | None = None, partner_result_column: str | None = None, max_epochs: int = 10000, validation_frequency: int | None = None, early_stopping_patience: int = 7, missing_substitute_value: float = -3, use_gatv2: bool = True, mask_partner: bool = True, train_on_partner: bool = True, uncertainty_column: str | None = None, logger: object | None = None, log_every_n_steps: int = 30, embedding_dim: int | None = None, train_sample_wise: bool = False, molecule_gt_column: str | None = None, epoch_size: int = 1) Series

Impute missing values using a homogenous graph neural network applied on the molecule graph created from two molecule types like proteins and their assigned peptides.

Parameters:
  • dataset (Dataset) – The dataset to impute.

  • molecule (str) – The main molecule type to impute (e.g. “protein”).

  • column (str) – The value column of the main molecule type to impute (e.g. “abundance”).

  • mapping (str) – The name of the mapping, connecting the main molecule type with a partner molecule type (e.g. “protein-peptide”).

  • partner_column (str) – The value column of the partner molecule type to impute.

  • training_fraction (float, optional) – Mean fraction of molecules masked during training (The masking fraction for every epoch is randomly drawn from the (0.5 * training_fraction, 1.5 * training_fraction) interval). Defaults to 0.25.

  • feature_columns (Optional[List[str]], optional) – Names of additional value columns to use as featues for the main molecule. Defaults to None.

  • partner_feature_columns (Optional[List[str]], optional) – Names of additional value columns to use as featues for the partner molecule (should be the same number as for the main molecule to allow creation of a homogeneous graph). Defaults to None.

  • result_column (Optional[str], optional) – If given, imputed results for the main molecule will be stored unders this name. Defaults to None.

  • partner_result_column (Optional[str], optional) – If given, imputed results for the partner molecule will be stored under this name. Defaults to None.

  • max_epochs (int, optional) – Maximum number of training epochs. Defaults to 10000.

  • validation_frequency (Optional[int], optional) – If given validation is run every validation_frequency epochs. Defaults to None.

  • early_stopping_patience (int, optional) – Number of epochs after which the training is stopped if the training loss does not improve. Defaults to 7.

  • missing_substitute_value (float, optional) – Value to replace missing or masked values with. Defaults to -3.

  • use_gatv2 (bool, optional) – Whether to use the DGL GATv2 graph attention layers or the original DLG GAT layers. Defaults to True.

  • mask_partner (bool, optional) – Whether to randomly mask both the main and partner molecule during training or only the main molecules. Defaults to True.

  • train_on_partner (bool, optional) – Whether to compute training loss on masked main and partner molecules or only on the main molecules. Defaults to True.

  • uncertainty_column (Optional[str], optional) – Whether to predict an uncertainty value. Defaults to None.

  • logger (Optional[object], optional) – If given this logger is used for logger (should have the lightning logger interface). Defaults to None.

  • log_every_n_steps (int, optional) – How often to log. Defaults to 30.

  • embedding_dim (Optional[int], optional) – If given every molecule will have a trainable embedding of this dimension. Defaults to None.

  • train_sample_wise (bool, optional) – Whether a training step operates only on a single sample or the whole dataset. Defaults to False.

  • molecule_gt_column (Optional[str], optional) – If given some metrics comparing predictions with ground truth values will be logged during training (helpful to evaluate training progress with respect to a ground truth). Defaults to None.

  • epoch_size (int, optional) – Number of training runs on the dataset that make up an epoch. Defaults to 1.

Raises:

ValueError – Raised when main and partner molecule feature columns are not of the same length.

Returns:

The imputed values for the main molecule.

Return type:

pd.Series