pyproteonet.data.dataset_sample.DatasetSample

class pyproteonet.data.dataset_sample.DatasetSample(dataset: Dataset, values: Dict[str, DataFrame], name: str)

Represents a dataset sample holding a set of values for every molecule. Can be thought of as a dictionary of pandas DataFrames, with one DataFrame for each molecule type.

__init__(dataset: Dataset, values: Dict[str, DataFrame], name: str)

Creates a dataset sample holding a set of values for every molecule.

Parameters:
  • dataset (Dataset) – The dataset this sample belongs to.

  • values (Dict[str, pd.DataFrame]) – Values for every molecule in the dataset.

  • name (str) – Name of the sample.
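
As a rough illustration of the values argument, the sketch below builds a dictionary with one DataFrame per molecule type. The molecule ids, the numbers, and the use of NaN for an unmeasured value are made up for this example. Only the molecule types 'protein' and 'peptide' and the 'abundance' column name appear in this reference. Constructing the sample also requires a Dataset object, so that call is shown only as a comment.

```python
import numpy as np
import pandas as pd

# One DataFrame per molecule type, indexed by molecule id (ids are hypothetical).
values = {
    "protein": pd.DataFrame(
        {"abundance": [10.5, np.nan, 7.2]},
        index=["P1", "P2", "P3"],
    ),
    "peptide": pd.DataFrame(
        {"abundance": [3.1, 4.4]},
        index=["pep1", "pep2"],
    ),
}

# With an existing Dataset object `dataset`, the documented constructor call would be:
# sample = DatasetSample(dataset=dataset, values=values, name="sample_1")
for molecule_type, frame in values.items():
    print(molecule_type, frame.shape)
```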

Methods

__init__(dataset, values, name)

Creates a dataset sample holding a set of values for every molecule.

apply(fn, *args, **kwargs)

Applies a function to the dataset sample.

copy([columns, molecule_ids])

Creates a copy of the dataset sample.

get_index_for(molecule_type)

Returns the index of molecule ids for the given molecule type.

get_node_values_for_graph(graph[, ...])

Returns the values for the given graph.

missing_mask(molecule[, column])

Returns a boolean mask indicating which values are missing for the given molecule and column.

missing_molecules(molecule[, column])

Returns all molecules of the given molecule type that are missing for the given column.

non_missing_mask(molecule[, column])

Returns a boolean mask indicating which values are non-missing for the given molecule and column.

non_missing_molecules(molecule[, column])

Returns all molecules of the given molecule type that are not missing for the given column.

plot_hist([bins])

Plots a histogram of the values for every molecule type.

Attributes

gene_mapping

missing_label_value

missing_value

molecule_set

molecules

apply(fn: Callable, *args, **kwargs) → object

Applies a function to the dataset sample. Only exists to match the interface of the Dataset class.

Parameters:

fn (Callable) – the function to apply

Returns:

the result of the function

Return type:

object
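
The calling convention can be sketched with a stand-in class. This is not the library's implementation: SampleSketch and describe are hypothetical names, and the sketch assumes that the sample itself is passed to fn as its first argument, followed by the extra positional and keyword arguments.

```python
from typing import Any, Callable


class SampleSketch:
    """Hypothetical stand-in for DatasetSample, used only for illustration."""

    def __init__(self, name: str):
        self.name = name

    def apply(self, fn: Callable, *args: Any, **kwargs: Any) -> object:
        # Assumption: the sample is handed to fn as the first argument.
        return fn(self, *args, **kwargs)


def describe(sample: SampleSketch, prefix: str = "") -> str:
    # Any callable that accepts the sample works here.
    return f"{prefix}{sample.name}"


result = SampleSketch("sample_1").apply(describe, prefix="name=")
print(result)
```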

copy(columns: Iterable[str] | Dict[str, str | Iterable[str]] | None = None, molecule_ids: Dict[str, Index] = {}) → DatasetSample

Creates a copy of the dataset sample.

Parameters:
  • columns (Optional[Union[Iterable[str], Dict[str, Union[str, Iterable[str]]]]], optional) – Columns to copy. When given as a list of strings, the same columns are copied for every molecule type. When given as a dictionary, the columns can be specified per molecule type. Defaults to None.

  • molecule_ids (Dict[str, pd.Index], optional) – Dictionary specifying, for every molecule type, the molecule ids that will be copied. If a molecule type is not part of the dictionary, all molecule ids are copied for this molecule type. Defaults to {}.

Returns:

A copy of the dataset sample.

Return type:

DatasetSample
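
The selection rules for columns and molecule_ids can be sketched on a plain dictionary of DataFrames. The helper copy_values below is hypothetical and is not part of pyproteonet. The 'score' column and the ids are invented, and the fallback for molecule types absent from a columns dictionary (keep all columns) is an assumption, since this reference does not state it.

```python
from typing import Dict, Iterable, Optional, Union

import pandas as pd

Columns = Union[Iterable[str], Dict[str, Union[str, Iterable[str]]]]


def copy_values(
    values: Dict[str, pd.DataFrame],
    columns: Optional[Columns] = None,
    molecule_ids: Optional[Dict[str, pd.Index]] = None,
) -> Dict[str, pd.DataFrame]:
    """Hypothetical helper mimicking the documented selection rules."""
    molecule_ids = molecule_ids or {}
    result = {}
    for mol_type, frame in values.items():
        if columns is None:
            cols = list(frame.columns)  # copy every column
        elif isinstance(columns, dict):
            # Assumption: types absent from the dict keep all their columns.
            cols = columns.get(mol_type, list(frame.columns))
            cols = [cols] if isinstance(cols, str) else list(cols)
        else:
            cols = list(columns)  # same columns for every molecule type
        ids = molecule_ids.get(mol_type, frame.index)  # default: all ids
        result[mol_type] = frame.loc[ids, cols].copy()
    return result


values = {
    "protein": pd.DataFrame(
        {"abundance": [1.0, 2.0, 3.0], "score": [0.1, 0.2, 0.3]},
        index=["P1", "P2", "P3"],
    ),
    "peptide": pd.DataFrame(
        {"abundance": [4.0, 5.0], "score": [0.4, 0.5]},
        index=["pep1", "pep2"],
    ),
}

subset = copy_values(
    values,
    columns={"protein": "abundance", "peptide": ["abundance", "score"]},
    molecule_ids={"protein": pd.Index(["P1", "P3"])},
)
print(subset["protein"])
print(subset["peptide"])
```

Here the protein table is reduced to two ids and one column, while the peptide table keeps all of its ids because 'peptide' is not a key of molecule_ids.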

get_index_for(molecule_type: str) → Index

Returns the index of molecule ids for the given molecule type.

Parameters:

molecule_type (str) – The molecule type to get the index for

Returns:

The index of molecule ids for the given molecule type

Return type:

pd.Index

get_node_values_for_graph(graph: MoleculeGraph, include_id_and_type: bool = True) → DataFrame

Returns the values for the given graph.

Parameters:
  • graph (MoleculeGraph) – the graph to get the values for

  • include_id_and_type (bool, optional) – Whether to include the molecule ids and molecule type into the result. Defaults to True.

Returns:

the values for the given graph

Return type:

pd.DataFrame

missing_mask(molecule: str, column: str = 'abundance') → ndarray

Returns a boolean mask indicating which values are missing for the given molecule and column.

Parameters:
  • molecule (str) – the molecule type (e.g. ‘protein’ or ‘peptide’)

  • column (str, optional) – the value column. Defaults to “abundance”.

Returns:

the boolean mask indicating which values are missing for the given molecule and column.

Return type:

np.ndarray
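
What such a mask looks like can be sketched with plain pandas. The sketch assumes that NaN marks a missing value. The class exposes a missing_value attribute whose actual value is not given in this reference, and the ids and numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical protein values; NaN is assumed to mark a missing measurement.
protein_values = pd.DataFrame(
    {"abundance": [10.5, np.nan, 7.2, np.nan]},
    index=["P1", "P2", "P3", "P4"],
)

# Boolean np.ndarray with one entry per molecule id.
missing = protein_values["abundance"].isna().to_numpy()
# The non-missing mask is the element-wise complement.
non_missing = ~missing

print(missing)
print(non_missing)
```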

missing_molecules(molecule: str, column: str = 'abundance') → DataFrame

Returns all molecules of the given molecule type that are missing for the given column.

Parameters:
  • molecule (str) – the molecule type (e.g. ‘protein’ or ‘peptide’)

  • column (str, optional) – the value column. Defaults to “abundance”.

Returns:

the dataframe containing the missing molecules and their additional information for the given molecule and column.

Return type:

pd.DataFrame

non_missing_mask(molecule: str, column: str = 'abundance')

Returns a boolean mask indicating which values are non-missing for the given molecule and column.

Parameters:
  • molecule (str) – the molecule type (e.g. ‘protein’ or ‘peptide’)

  • column (str, optional) – the value column. Defaults to “abundance”.

Returns:

the boolean mask indicating which values are non-missing for the given molecule and column.

Return type:

np.ndarray

non_missing_molecules(molecule: str, column: str = 'abundance')

Returns all molecules of the given molecule type that are not missing for the given column.

Parameters:
  • molecule (str) – the molecule type (e.g. ‘protein’ or ‘peptide’)

  • column (str, optional) – the value column. Defaults to “abundance”.

Returns:

the dataframe containing the non-missing molecules and their additional information for the given molecule and column.

Return type:

pd.DataFrame
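
The relation between the masks and the returned molecule tables can be sketched as a row selection. The two DataFrames below are hypothetical stand-ins: one holds per-molecule information (the 'sequence' column is invented), the other holds the sample values, and NaN is assumed to mark a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical per-molecule information and per-sample values sharing one index.
peptide_info = pd.DataFrame(
    {"sequence": ["AAK", "LVR", "GGK"]},
    index=["pep1", "pep2", "pep3"],
)
peptide_values = pd.DataFrame(
    {"abundance": [3.1, np.nan, 4.4]},
    index=["pep1", "pep2", "pep3"],
)

mask = peptide_values["abundance"].notna().to_numpy()
non_missing = peptide_info.loc[mask]  # molecules with a measured abundance
missing = peptide_info.loc[~mask]     # molecules without one

print(non_missing)
print(missing)
```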

plot_hist(bins: List[float] | str = 'auto')

Plots a histogram of the values for every molecule type.

Parameters:

bins (Union[List[float], str], optional) – The bins for the histogram (passed to seaborn.histplot). Defaults to “auto”.
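
The two accepted forms of bins can be compared without plotting. The seaborn documentation states that histplot hands bins to numpy.histogram_bin_edges, so the sketch below calls that function directly on made-up values. It illustrates the parameter only and is not how plot_hist is implemented.

```python
import numpy as np

abundances = np.array([1.0, 1.5, 2.0, 2.5, 8.0, 9.0])  # made-up values

# A rule name lets NumPy choose the number of bins from the data.
auto_edges = np.histogram_bin_edges(abundances, bins="auto")
# A list of floats is used as the bin edges directly.
fixed_edges = np.histogram_bin_edges(abundances, bins=[0.0, 5.0, 10.0])

print(auto_edges)
print(fixed_edges)
```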