pyproteonet.data.dataset_sample.DatasetSample

class pyproteonet.data.dataset_sample.DatasetSample(dataset: Dataset, values: Dict[str, DataFrame], name: str)

Represents a dataset sample holding a set of values for every molecule. It can be thought of as a dictionary of pandas dataframes, with one dataframe per molecule type.
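As a rough mental model, the structure passed as `values` can be pictured with plain pandas, as in the sketch below. It does not use the pyproteonet API. The molecule types, molecule ids, and the use of NaN for a missing entry are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical molecule types and ids; NaN stands in for a missing value here.
values = {
    "protein": pd.DataFrame(
        {"abundance": [10.5, np.nan]},
        index=pd.Index(["P1", "P2"], name="id"),
    ),
    "peptide": pd.DataFrame(
        {"abundance": [3.2, 4.1, np.nan]},
        index=pd.Index(["pep1", "pep2", "pep3"], name="id"),
    ),
}

# One dataframe per molecule type, each indexed by molecule id.
for molecule_type, df in values.items():
    print(molecule_type, df.shape)
```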

__init__(dataset: Dataset, values: Dict[str, DataFrame], name: str)

Creates a dataset sample holding a set of values for every molecule.

Parameters:

dataset (Dataset) – The dataset this sample belongs to.
values (Dict[str, pd.DataFrame]) – Values for every molecule in the dataset.
name (str) – Name of the sample.

Methods

`__init__`(dataset, values, name)	Creates a dataset sample holding a set of values for every molecule.
`apply`(fn, *args, **kwargs)	Applies a function to the dataset sample.
`copy`([columns, molecule_ids])	Creates a copy of the dataset sample.
`get_index_for`(molecule_type)	Returns the index of molecule ids for the given molecule type.
`get_node_values_for_graph`(graph[, ...])	Returns the values for the given graph.
`missing_mask`(molecule[, column])	Returns a boolean mask indicating which values are missing for the given molecule and column.
`missing_molecules`(molecule[, column])	Returns all molecules of the given molecule type that are missing for the given column.
`non_missing_mask`(molecule[, column])	Returns a boolean mask indicating which values are non-missing for the given molecule and column.
`non_missing_molecules`(molecule[, column])	Returns all molecules of the given molecule type that are not missing for the given column.
`plot_hist`([bins])	Plots a histogram of the values for every molecule type.

Attributes

`gene_mapping`
`missing_label_value`
`missing_value`
`molecule_set`
`molecules`

apply(fn: Callable, *args, **kwargs) → object

Applies a function to the dataset sample. Only exists to match the interface of the Dataset class.

Parameters: fn (Callable) – the function to apply
Returns: the result of the function
Return type: object

copy(columns: Iterable[str] | Dict[str, str | Iterable[str]] | None = None, molecule_ids: Dict[str, Index] = {}) → DatasetSample

Creates a copy of the dataset sample.

Parameters:

columns (Optional[ Union[Iterable[str], Dict[str, Union[str, Iterable[str]]]] ], optional) – Columns to copy. When given as a list of strings, the same columns are copied for every molecule type. When given as a dictionary, the columns can be specified per molecule type, keyed by molecule type. Defaults to None.
molecule_ids (Dict[str, pd.Index], optional) – Dictionary specifying, for every molecule type, the molecule ids that will be copied. If a molecule type is not part of the dictionary, all molecule ids are copied for that molecule type. Defaults to {}.

Returns:

A copy of the dataset sample.

Return type:

DatasetSample
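The selection rules for `columns` and `molecule_ids` can be sketched on a plain dictionary of dataframes. The helper `copy_values` below is hypothetical and is not the library's implementation. It assumes that `columns=None` keeps every column and that a molecule type absent from a `columns` dictionary also keeps every column; neither behaviour is stated in the documentation above.

```python
import pandas as pd


def copy_values(values, columns=None, molecule_ids=None):
    """Hypothetical helper mirroring the documented selection rules."""
    molecule_ids = molecule_ids or {}
    if columns is not None and not isinstance(columns, dict):
        columns = list(columns)  # same columns for every molecule type
    copied = {}
    for molecule_type, df in values.items():
        if columns is None:
            cols = list(df.columns)  # assumption: None keeps all columns
        elif isinstance(columns, dict):
            # assumption: types missing from the dict keep all columns
            cols = columns.get(molecule_type, list(df.columns))
            cols = [cols] if isinstance(cols, str) else list(cols)
        else:
            cols = columns
        # Types not listed in molecule_ids keep all of their molecule ids.
        ids = molecule_ids.get(molecule_type, df.index)
        copied[molecule_type] = df.loc[ids, cols].copy()
    return copied


values = {
    "protein": pd.DataFrame(
        {"abundance": [1.0, 2.0], "score": [0.1, 0.2]}, index=["P1", "P2"]
    ),
    "peptide": pd.DataFrame({"abundance": [3.0, 4.0, 5.0]}, index=["a", "b", "c"]),
}

subset = copy_values(
    values,
    columns={"protein": "abundance"},
    molecule_ids={"peptide": pd.Index(["a", "c"])},
)
```

Here `subset["protein"]` keeps both proteins but only the `abundance` column, while `subset["peptide"]` keeps all columns but only the ids `a` and `c`.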

get_index_for(molecule_type: str) → Index

Returns the index of molecule ids for the given molecule type.

Parameters: molecule_type (str) – The molecule type to get the index for
Returns: The index of molecule ids for the given molecule type
Return type: pd.Index

get_node_values_for_graph(graph: MoleculeGraph, include_id_and_type: bool = True) → DataFrame

Returns the values for the given graph.

Parameters:

graph (MoleculeGraph) – the graph to get the values for
include_id_and_type (bool, optional) – Whether to include the molecule ids and molecule type into the result. Defaults to True.

Returns:

the values for the given graph

Return type:

pd.DataFrame

missing_mask(molecule: str, column: str = 'abundance') → ndarray

Returns a boolean mask indicating which values are missing for the given molecule and column.

Parameters:

molecule (str) – the molecule type (e.g. ‘protein’ or ‘peptide’)
column (str, optional) – the value column. Defaults to “abundance”.

Returns:

the boolean mask indicating which values are missing for the given molecule and column.

Return type:

np.ndarray

missing_molecules(molecule: str, column: str = 'abundance') → DataFrame

Returns all molecules of the given molecule type that are missing for the given column.

Parameters:

molecule (str) – the molecule type (e.g. ‘protein’ or ‘peptide’)
column (str, optional) – the value column. Defaults to “abundance”.

Returns:

the dataframe containing the missing molecules and their additional information for the given molecule and column.

Return type:

pd.DataFrame

non_missing_mask(molecule: str, column: str = 'abundance')

Returns a boolean mask indicating which values are non-missing for the given molecule and column.

Parameters:

molecule (str) – the molecule type (e.g. ‘protein’ or ‘peptide’)
column (str, optional) – the value column. Defaults to “abundance”.

Returns:

the boolean mask indicating which values are non-missing for the given molecule and column.

Return type:

np.ndarray
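The relationship between the missing and non-missing masks can be illustrated with plain pandas and NumPy. This is a sketch rather than the library's code: the function `missing_mask_sketch` is hypothetical, and treating NaN as the default missing marker (with an optional sentinel value instead) is an assumption.

```python
import numpy as np
import pandas as pd


def missing_mask_sketch(df, column="abundance", missing_value=np.nan):
    """Hypothetical stand-in: True where the column holds the missing marker."""
    col = df[column]
    if pd.isna(missing_value):
        return col.isna().to_numpy()
    return (col == missing_value).to_numpy()


protein_values = pd.DataFrame(
    {"abundance": [10.5, np.nan, 7.0]}, index=["P1", "P2", "P3"]
)

mask = missing_mask_sketch(protein_values)
non_mask = ~mask  # the non-missing mask is the element-wise complement

print(mask)      # boolean array, one entry per molecule id
print(non_mask)
```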

non_missing_molecules(molecule: str, column: str = 'abundance')

Returns all molecules of the given molecule type that are not missing for the given column.

Parameters:

molecule (str) – the molecule type (e.g. ‘protein’ or ‘peptide’)
column (str, optional) – the value column. Defaults to “abundance”.

Returns:

the dataframe containing the non-missing molecules and their additional information for the given molecule and column.

Return type:

pd.DataFrame
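One way to picture the molecule-level results is as a boolean filter on a per-type molecule table. The sketch below uses plain pandas only. The `molecules` table, its `gene` column, and NaN as the missing marker are assumptions for illustration; the exact columns of the returned dataframe are not specified above.

```python
import numpy as np
import pandas as pd

# Hypothetical molecule table holding the "additional information" per protein.
molecules = pd.DataFrame(
    {"gene": ["G1", "G2", "G3"]},
    index=pd.Index(["P1", "P2", "P3"], name="id"),
)
# Hypothetical sample values sharing the same index.
values = pd.DataFrame({"abundance": [10.5, np.nan, 7.0]}, index=molecules.index)

mask = values["abundance"].isna().to_numpy()

missing = molecules.loc[mask]       # rows whose abundance is missing
non_missing = molecules.loc[~mask]  # rows whose abundance is present
```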

plot_hist(bins: List[float] | str = 'auto')

Plots a histogram of the values for every molecule type.

Parameters: bins (List[float] | str, optional) – The bins for the histogram, given either as a list of bin edges or as the name of a binning rule (passed to seaborn.histplot). Defaults to “auto”.
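seaborn documents that `histplot` forwards its `bins` argument to `numpy.histogram_bin_edges`, so the effect of the two accepted forms can be previewed with NumPy alone. The abundance values below are made up for illustration, and missing values are assumed to have been removed beforehand.

```python
import numpy as np

# Made-up abundance values with no missing entries.
abundances = np.array([1.0, 2.5, 2.7, 3.1, 4.8, 5.0, 7.2, 9.9, 12.4, 14.0])

# A rule name lets NumPy choose the bin edges from the data.
auto_edges = np.histogram_bin_edges(abundances, bins="auto")

# A list of floats is used directly as the bin edges.
explicit_edges = np.histogram_bin_edges(abundances, bins=[0.0, 5.0, 10.0, 15.0])

print(auto_edges)
print(explicit_edges)
```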