pyproteonet.data.dataset.Dataset

class pyproteonet.data.dataset.Dataset(molecule_set: MoleculeSet, samples: Dict[str, DatasetSample] = {}, missing_value: float = nan)

Representing a dataset consisting of a MoleculeSet specifying molecules and relations and several DatasetSamples each holding a set of values for every molecule.

__init__(molecule_set: MoleculeSet, samples: Dict[str, DatasetSample] = {}, missing_value: float = nan)

Generates a dataset based on a MoleculeSet and an optional dictionary of DatasetSamples.

Parameters:
  • molecule_set (MoleculeSet) – The MoleculeSet this dataset is based on

  • samples (Dict[str, DatasetSample], optional) – Dictionary of DatasetSamples containing samples for this dataset. Defaults to {}.

  • missing_value (float, optional) – Value used to represent missing values. Defaults to np.nan.
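The missing_value marker matters because NaN, the default, never compares equal to itself. The following is a minimal sketch of how a missing-value mask could be derived for either kind of marker. The helper missing_mask is hypothetical and not part of pyproteonet; how the library evaluates missing_value internally is not stated here.

```python
import numpy as np


def missing_mask(values, missing_value=np.nan):
    """Return a boolean array marking entries equal to the missing-value marker.

    Hypothetical helper for illustration only, not a pyproteonet function.
    """
    values = np.asarray(values, dtype=float)
    if np.isnan(missing_value):
        # NaN never compares equal to itself, so it needs a dedicated check.
        return np.isnan(values)
    return values == missing_value


nan_mask = missing_mask([1.0, np.nan, 3.0])
zero_mask = missing_mask([0.0, 2.0, 0.0], missing_value=0.0)
```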

Methods

__init__(molecule_set[, samples, missing_value])

Generates a dataset based on a MoleculeSet and an optional dictionary of DatasetSamples.

calculate_hist(molecule_name[, bins])

Calculate a histogram for the values of a given molecule type.

copy([samples, columns, copy_molecule_set, ...])

Copies the dataset.

create_sample(name, values)

Add a new sample to the dataset.

drop_values(columns[, molecules, inplace])

Drop one or several value columns.

from_mapped_dataframe(df, molecule, ...[, ...])

Transforms a pandas dataframe into a dataset.

from_pandas(dfs, mappings)

Transforms pandas dataframes into a dataset.

get_column_flat(molecule[, column, samples, ...])

Returns a single value column as a pandas Series with a MultiIndex with the levels: "id" (molecule id) and "sample".

get_lf(molecule[, columns, molecule_columns])

Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.

get_mapped(molecule, mapping[, columns, ...])

Return a dataframe containing all pairs of molecules connected by the given mapping with the values for the corresponding value columns.

get_mapping_partner(molecule, mapping)

Infer the partner molecule type for a molecule type and mapping

get_molecule_subset(molecule, ids)

Create a new dataset containing only the given molecule ids for the given molecule type.

get_samples_value_matrix(molecule[, column, ...])

Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.

get_values_flat(molecule[, columns, ...])

Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.

get_wf(molecule, column)

Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.

infer_mapping(molecule, mapping)

Infer a mapping name from a molecule type and a mapping string.

load(dir_path)

Loads a previously saved dataset from disk.

number_molecules(molecule)

The number of molecules for a given molecule type.

rename_columns(columns[, inplace])

Rename one or several value columns.

rename_mapping(mapping, new_name)

Rename a mapping.

rename_molecule(molecule, new_name)

Rename a molecule type.

rename_values(columns[, molecules, inplace])

Similar to rename_columns but uses the same mapping for all molecule types.

sample_apply(fn, *args, **kwargs)

Apply a function to every dataset sample.

save(dir_path[, overwrite])

Saves the dataset to disk as a directory containing .h5 files for the samples and a .h5 file for the molecule set.

set_column_lf(molecule, values[, column, ...])

Sets values from a Pandas Series which has a MultiIndex with the levels: "id" and "sample"

set_lf(molecule, values[, skip_foreign_ids, ...])

set_wf(matrix, molecule[, column, ...])

Sets a dataframe in wide format (molecule ids as index, sample names as columns) for the values of the given value column for the given molecule type.

to_dgl_graph(feature_columns, mappings[, ...])

Transform the dataset into a dgl graph.

write_tsvs(output_dir[, molecules, columns, ...])

Write .tsv files for the given molecules and columns to the given directory.

Attributes

mappings

molecules

names

num_samples

sample_names

samples

calculate_hist(molecule_name: str, bins='auto') → Tuple[ndarray, ndarray]

Calculate a histogram for the values of a given molecule type.

Parameters:
  • molecule_name (str) – The molecule type to generate the histogram for.

  • bins (str, optional) – The bins. Defaults to “auto”.

Returns:

The histogram values.

Return type:

Tuple[np.ndarray, np.ndarray]
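The return type and the "auto" default for bins match numpy.histogram. The sketch below assumes, without confirmation from this page, that the result can be read the same way: an array of counts plus an array of bin edges. The values are made up for illustration.

```python
import numpy as np

# Hypothetical abundance values with missing entries encoded as NaN.
values = np.array([1.0, 2.0, 2.5, np.nan, 3.0, 4.0, np.nan, 4.5])

# Drop missing values before binning, since NaN has no place in a value range.
observed = values[~np.isnan(values)]

# np.histogram returns (counts, bin_edges); there is one more edge than counts.
counts, bin_edges = np.histogram(observed, bins="auto")
```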

copy(samples: List[str] | None = None, columns: Iterable[str] | Dict[str, str | Iterable[str]] | None = None, copy_molecule_set: bool = True, molecule_ids: Dict[str, Index] = {})

Copies the dataset.

Parameters:
  • samples (Optional[List[str]], optional) – Dataset samples to include in the copy (all samples if not given). Defaults to None.

  • columns (Optional[Union[Iterable[str], Dict[str, Union[str, Iterable[str]]]]], optional) – Which value columns to copy for every molecule. Defaults to None.

  • copy_molecule_set (bool, optional) – Whether to copy the MoleculeSet or just store a reference to the original MoleculeSet. Defaults to True.

  • molecule_ids (Dict[str, pd.Index], optional) – Which molecule ids to copy for every molecule type (all molecule ids are copied if a molecule type is not specified). Defaults to {}.

Returns:

A copy of the dataset.

Return type:

Dataset

create_sample(name: str, values: Dict[str, DataFrame])

Add a new sample to the dataset.

Parameters:
  • name (str) – The name of the sample to add.

  • values (Dict[str, pd.DataFrame]) – The values for the sample. The keys are the molecule types and the values are dataframes with the molecule ids as index and the values as columns.

Raises:

ValueError – Raised if index of the given dataframes does not align with the molecule ids of the dataset.
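The values argument holds one dataframe per molecule type, indexed by molecule id. The sketch below builds such a dictionary and runs a simple alignment check. The check (every row id must be a known molecule id) is an assumption about what "align" means here, and index_is_aligned is a hypothetical helper, not a pyproteonet function.

```python
import pandas as pd

# Hypothetical molecule ids already present in the dataset.
protein_ids = pd.Index(["P1", "P2", "P3"], name="id")

# One dataframe per molecule type: molecule ids as index, value columns as columns.
values = {
    "protein": pd.DataFrame({"abundance": [10.0, 12.5, 9.1]}, index=protein_ids),
}


def index_is_aligned(frame: pd.DataFrame, molecule_ids: pd.Index) -> bool:
    """Return True if every row id of the frame is a known molecule id."""
    return bool(frame.index.isin(molecule_ids).all())


aligned = index_is_aligned(values["protein"], protein_ids)
misaligned = index_is_aligned(
    pd.DataFrame({"abundance": [1.0]}, index=["UNKNOWN"]), protein_ids
)

# The documented call would then be:
# dataset.create_sample(name="sample_4", values=values)
```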

drop_values(columns: List[str], molecules: List[str] | None = None, inplace: bool = False) → Dataset | None

Drop one or several value columns.

Parameters:
  • columns (List[str]) – The columns to drop.

  • molecules (Optional[List[str]], optional) – The molecules for which the given columns are dropped if they exist. Defaults to None.

  • inplace (bool, optional) – Whether to modify the dataset in place instead of returning a new dataset. Defaults to False.

Returns:

The resulting dataset if inplace is False, otherwise None.

Return type:

Optional[Dataset]

classmethod from_mapped_dataframe(df: DataFrame, molecule: str, sample_columns: List[str], id_column: str | None = None, result_column_name: str = 'abundance', mapping_column: str | None = None, mapping_sep: str = ',', partner_molecule: str = 'protein', mapping_name='peptide-protein') → Dataset

Transforms a pandas dataframe into a dataset. Useful for loading tabular peptide abundance data with a mapping column mapping peptides to proteins.

Parameters:
  • df (pd.DataFrame) – The dataframe containing the data.

  • molecule (str) – The molecule whose values are contained in the dataframe.

  • sample_columns (List[str]) – The list of columns representing the dataset samples.

  • id_column (Optional[str], optional) – The name of the column representing molecule ids. If None, the dataframe index is used. Defaults to None.

  • result_column_name (str, optional) – Name of the value column used for the loaded values. Defaults to “abundance”.

  • mapping_column (Optional[str], optional) – Column containing lists of partner molecule ids. Defaults to None.

  • mapping_sep (str, optional) – The separator character used to separate partner molecule ids. Defaults to “,”.

  • partner_molecule (str, optional) – The name of the partner molecule type. Defaults to “protein”.

  • mapping_name (str, optional) – The name of the mapping created from the mapping_column. Defaults to “peptide-protein”.

Returns:

The loaded dataset.

Return type:

Dataset
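The input layout described above can be sketched with plain pandas: one row per peptide, one column per sample, and a mapping column listing partner protein ids. The table contents and column names are made up. The split-and-explode step only illustrates how a separator-delimited mapping column expands into peptide-protein pairs; it is not taken from the library's implementation.

```python
import pandas as pd

# Hypothetical peptide table: one row per peptide, one column per sample,
# plus a column listing the partner protein ids separated by ",".
df = pd.DataFrame(
    {
        "sample_a": [1.0, 2.0, 3.0],
        "sample_b": [1.5, None, 2.5],
        "proteins": ["P1", "P1,P2", "P3"],
    },
    index=pd.Index(["pepA", "pepB", "pepC"], name="id"),
)

# Splitting on the separator and exploding yields one row per peptide-protein pair.
pairs = df["proteins"].str.split(",").explode()

# The corresponding call, using only parameters documented above:
# dataset = Dataset.from_mapped_dataframe(
#     df,
#     molecule="peptide",
#     sample_columns=["sample_a", "sample_b"],
#     mapping_column="proteins",
#     mapping_sep=",",
#     partner_molecule="protein",
# )
```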

classmethod from_pandas(dfs: Dict[str, Dict[str, DataFrame]], mappings: Dict[str, DataFrame]) → Dataset

Creates a dataset from pandas dataframes holding the values of every molecule type and from dataframes describing the molecule mappings.

Parameters:
  • dfs (Dict[str, Dict[str, pd.DataFrame]]) – Two-level dictionary keyed first by molecule type name and then by value name, holding n × s pandas dataframes, where n is the number of molecules and s the number of samples.

  • mappings (Dict[str, pd.DataFrame]) – A dictionary of mapping name and pandas dataframe with multilevel index describing the molecule mapping.

Returns:

The created dataset.

Return type:

Dataset
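A minimal sketch of the two arguments, built from the parameter descriptions above. All ids, sample names and values are made up. The names of the MultiIndex levels in the mapping dataframe are an assumption, since the required level names are not stated here.

```python
import pandas as pd

samples = ["sample_a", "sample_b"]

# Two-level dictionary: molecule type -> value name -> (molecules x samples) dataframe.
dfs = {
    "protein": {
        "abundance": pd.DataFrame(
            [[10.0, 11.0], [20.0, 21.0]], index=["P1", "P2"], columns=samples
        )
    },
    "peptide": {
        "abundance": pd.DataFrame(
            [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5]],
            index=["pepA", "pepB", "pepC"],
            columns=samples,
        )
    },
}

# Mapping name -> dataframe whose multilevel index lists the connected molecule pairs.
# The level names "peptide" and "protein" are assumed for illustration.
pairs = pd.MultiIndex.from_tuples(
    [("pepA", "P1"), ("pepB", "P1"), ("pepB", "P2"), ("pepC", "P2")],
    names=["peptide", "protein"],
)
mappings = {"peptide-protein": pd.DataFrame(index=pairs)}

# The documented call would then be:
# dataset = Dataset.from_pandas(dfs=dfs, mappings=mappings)
```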

get_column_flat(molecule: str, column: str = 'abundance', samples: List[str] | None = None, ids: Iterable | None = None, return_missing_mask: bool = False, drop_sample_id: bool = False) → Series

Returns a single value column as a pandas Series with a MultiIndex with the levels: “id” (molecule id) and “sample”.

Parameters:
  • molecule (str) – The molecule type to get the values for.

  • column (str, optional) – The value column to get the values for. Defaults to “abundance”.

  • samples (Optional[List[str]], optional) – The name of the samples to consider or None to consider all samples. Defaults to None.

  • ids (Optional[Iterable], optional) – The molecule ids to consider. Defaults to None.

  • return_missing_mask (bool, optional) – Whether to return a mask of missing values. Defaults to False.

  • drop_sample_id (bool, optional) – Whether to drop the sample id from the result’s index. Defaults to False.

Returns:

The resulting pandas Series.

Return type:

pd.Series
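The sketch below builds a Series of the documented shape by hand, to show how such a result can be processed with ordinary pandas operations. The data are made up. Computing the mask with isna() assumes the default NaN missing value; the exact form in which return_missing_mask delivers the mask is not described here.

```python
import numpy as np
import pandas as pd

# A Series shaped like the documented result: MultiIndex with levels "id" and "sample".
index = pd.MultiIndex.from_product(
    [["P1", "P2"], ["sample_a", "sample_b"]], names=["id", "sample"]
)
flat = pd.Series([1.0, np.nan, 3.0, 4.0], index=index, name="abundance")

# Mask of missing entries, assuming NaN is the missing-value marker.
missing_mask = flat.isna()

# Index levels can be used directly for grouping, e.g. a mean per sample.
per_sample_mean = flat.groupby(level="sample").mean()
```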

get_lf(molecule: str, columns: str | List[str] | None = None, molecule_columns: List[str] = [])

Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.

Parameters:
  • molecule (str) – The molecule type (e.g. protein, peptide …)

  • columns (Optional[Union[str, List[str]]], optional) – The value columns to include in the result (all value columns if not given). Defaults to None.

  • molecule_columns (List[str], optional) – Any molecule columns from the MoleculeSet to include in the resulting dataframe. Defaults to [].

Returns:

the resulting dataframe

Return type:

pd.DataFrame

get_mapped(molecule: str, mapping: str, columns: str | List[str] = [], samples: List | None = None, partner_columns: str | List[str] = [], molecule_columns: str | List[str] = [], molecule_columns_partner: str | List[str] = [], return_partner_index_name: bool = False) → DataFrame | Tuple[DataFrame, str]

Return a dataframe containing all pairs of molecules connected by the given mapping with the values for the corresponding value columns.

Parameters:
  • molecule (str) – A molecule type like protein, peptide…

  • mapping (str) – A mapping name.

  • columns (Union[str, List[str]], optional) – The value columns of the given molecule type to include in the results. Defaults to [].

  • samples (Optional[List], optional) – The names of the samples to include in the results. Defaults to None.

  • partner_columns (Union[str, List[str]], optional) – The value columns of the partner molecule type to include. Defaults to [].

  • molecule_columns (Union[str, List[str]], optional) – Any molecule columns from the MoleculeSet to include for the given molecule. Defaults to [].

  • molecule_columns_partner (Union[str, List[str]], optional) – Any molecule columns from the MoleculeSet to include for the given partner molecule. Defaults to [].

  • return_partner_index_name (bool, optional) – Whether to return the name of the partner index. Defaults to False.

Returns:

The resulting dataframe and, if requested, the partner index name.

Return type:

Union[pd.DataFrame, Tuple[pd.DataFrame, str]]
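Conceptually, the result pairs every molecule with its mapped partners and attaches value columns from both sides. The sketch below reproduces that idea with plain pandas joins on made-up data. The column names and the resulting layout are assumptions; the exact index and column structure returned by get_mapped is not specified here.

```python
import pandas as pd

# Hypothetical mapping: one row per connected (peptide, protein) pair.
mapping = pd.DataFrame(
    {"peptide": ["pepA", "pepB", "pepB"], "protein": ["P1", "P1", "P2"]}
)

# Hypothetical value columns for each side of the mapping.
peptide_abundance = pd.Series({"pepA": 1.0, "pepB": 2.0}, name="abundance")
protein_abundance = pd.Series({"P1": 10.0, "P2": 20.0}, name="abundance_partner")

# Attach the values of both molecule types to every connected pair.
mapped = mapping.join(peptide_abundance, on="peptide").join(
    protein_abundance, on="protein"
)
```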

get_mapping_partner(molecule: str, mapping: str) → str

Infer the partner molecule type for a molecule type and mapping

Parameters:
  • molecule (str) – One of the two molecule types of the mapping.

  • mapping (str) – The mapping name.

Returns:

The other molecule type of the mapping.

Return type:

str

get_molecule_subset(molecule: str, ids: Index)

Create a new dataset containing only the given molecule ids for the given molecule type.

Parameters:
  • molecule (str) – The molecule type to copy

  • ids (pd.Index) – The molecule ids to copy

Returns:

A new dataset containing the specified subset of the old dataset

Return type:

Dataset

get_samples_value_matrix(molecule: str, column: str = 'abundance', molecule_columns: bool | List[str] = [], samples: List[str] | None = None, ids: Iterable | None = None) → DataFrame

Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.

Parameters:
  • molecule (str) – The molecule type (e.g. protein, peptide …)

  • column (str, optional) – The value column to use. Defaults to “abundance”.

  • molecule_columns (Union[bool, List[str]], optional) – Any molecule columns from the MoleculeSet to include in the resulting dataframe. Defaults to [].

  • samples (Optional[List[str]], optional) – The names of the samples to consider for the generated result. Defaults to None.

  • ids (Optional[Iterable], optional) – The molecule ids to consider for the generated result. Defaults to None.

Returns:

the resulting dataframe

Return type:

pd.DataFrame

get_values_flat(molecule: str, columns: str | List[str] | None = None, molecule_columns: List[str] = []) → DataFrame

Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.

Parameters:
  • molecule (str) – The molecule type (e.g. protein, peptide …)

  • columns (Optional[Union[str, List[str]]], optional) – The value columns to include in the result (all value columns if not given). Defaults to None.

  • molecule_columns (List[str], optional) – Any molecule columns from the MoleculeSet to include in the resulting dataframe. Defaults to [].

Returns:

the resulting dataframe

Return type:

pd.DataFrame

get_wf(molecule: str, column: str) → DataFrame

Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.

Parameters:
  • molecule (str) – The molecule type (e.g. protein, peptide …)

  • column (str) – The value column to use.

Returns:

the resulting dataframe

Return type:

pd.DataFrame
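The wide format used by get_wf and the long format used by get_lf carry the same information in different shapes. The sketch below converts between the two with plain pandas on made-up data. It illustrates the layouts only and does not call pyproteonet.

```python
import pandas as pd

# Wide format: molecule ids as index, sample names as columns (one value column).
wide = pd.DataFrame(
    {"sample_a": [1.0, 3.0], "sample_b": [2.0, 4.0]},
    index=pd.Index(["P1", "P2"], name="id"),
)

# Long format: one row per (sample, molecule id), value columns as columns.
long_df = (
    wide.reset_index()
    .melt(id_vars="id", var_name="sample", value_name="abundance")
    .set_index(["sample", "id"])
    .sort_index()
)

# Unstacking the sample level restores the wide layout.
wide_again = long_df["abundance"].unstack(level="sample")
```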

infer_mapping(molecule: str, mapping: str) → Tuple[str, str, str]

Infer a mapping name from a molecule type and a mapping string.

Parameters:
  • molecule (str) – Molecule type like protein, peptide …

  • mapping (str) – If the name of a molecule type is given, the mapping name connecting both molecule types is inferred. If a mapping name is given, it is returned as is.

Returns:

The from molecule type, the mapping name, and the to molecule type.

Return type:

Tuple[str, str, str]
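The resolution rule described above can be sketched in plain Python. The MAPPINGS registry and the function infer_mapping_sketch are hypothetical stand-ins, not the library's data structures, and raising KeyError on an ambiguous or unknown mapping is an assumption.

```python
from typing import Dict, Tuple

# Hypothetical registry: mapping name -> the two molecule types it connects.
MAPPINGS: Dict[str, Tuple[str, str]] = {
    "peptide-protein": ("peptide", "protein"),
}


def infer_mapping_sketch(molecule: str, mapping: str) -> Tuple[str, str, str]:
    """Resolve `mapping` (a mapping name or a partner molecule type) for `molecule`."""
    if mapping in MAPPINGS:
        # A mapping name was given: use it as is.
        name = mapping
    else:
        # A molecule type was given: find the mapping connecting both types.
        candidates = [
            n for n, pair in MAPPINGS.items() if set(pair) == {molecule, mapping}
        ]
        if len(candidates) != 1:
            raise KeyError(f"No unique mapping between {molecule!r} and {mapping!r}")
        name = candidates[0]
    first, second = MAPPINGS[name]
    if molecule not in (first, second):
        raise KeyError(f"{molecule!r} is not part of mapping {name!r}")
    partner = second if molecule == first else first
    # From molecule type, mapping name, to molecule type.
    return molecule, name, partner


by_partner = infer_mapping_sketch("protein", "peptide")
by_name = infer_mapping_sketch("peptide", "peptide-protein")
```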

classmethod load(dir_path: str | Path) → Dataset

Loads a previously saved dataset from disk.

Parameters:

dir_path (Union[str, Path]) – path to the directory representing the dataset

Returns:

the loaded dataset

Return type:

Dataset

number_molecules(molecule: str) → int

The number of molecules for a given molecule type.

Parameters:

molecule (str) – The molecule type to get the number of molecules for (e.g. protein, peptide …)

Returns:

The number of molecules.

Return type:

int

rename_columns(columns: Dict[str, Dict[str, str]], inplace: bool = False) → Dataset | None

Rename one or several value columns.

Parameters:
  • columns (Dict[str, Dict[str, str]]) – A dictionary mapping old to new column names for every molecule type (protein, peptide etc.)

  • inplace (bool, optional) – Whether to perform the operation inplace or return a copy. Defaults to False.

Returns:

A copy of the dataset with the renamed columns if inplace is False, otherwise None.

Return type:

Optional[Dataset]

rename_mapping(mapping: str, new_name: str)

Rename a mapping.

Parameters:
  • mapping (str) – The old name of the mapping.

  • new_name (str) – The new name of the mapping.

rename_molecule(molecule: str, new_name: str)

Rename a molecule type.

Parameters:
  • molecule (str) – The current name.

  • new_name (str) – The new name.

Raises:

KeyError – Raised when the new name already exists.

rename_values(columns: Dict[str, str], molecules: List[str] | None = None, inplace: bool = False)

Similar to rename_columns but uses the same mapping for all molecule types.
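The relation between the two argument shapes can be sketched in plain Python: a flat old-to-new mapping is expanded into the nested per-molecule dictionary that rename_columns expects. The helper expand_rename is hypothetical and only illustrates the dictionary structures.

```python
from typing import Dict, List


def expand_rename(
    columns: Dict[str, str], molecules: List[str]
) -> Dict[str, Dict[str, str]]:
    """Apply the same old -> new column mapping to every given molecule type."""
    return {molecule: dict(columns) for molecule in molecules}


# Flat mapping as accepted by rename_values ...
flat = {"abundance": "intensity"}

# ... and the equivalent nested mapping as accepted by rename_columns.
per_molecule = expand_rename(flat, ["protein", "peptide"])
```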

sample_apply(fn: Callable, *args, **kwargs)

Apply a function to every dataset sample.

Parameters:

fn (Callable) – The function to apply.

Returns:

The transformed dataset.

Return type:

Dataset

save(dir_path: str | Path, overwrite: bool = False)

Saves the dataset to disk as a directory containing .h5 files for the samples and a .h5 file for the molecule set.

Parameters:
  • dir_path (Union[str, Path]) – Directory path to save the dataset to.

  • overwrite (bool, optional) – Whether to overwrite any existing data. Defaults to False.

Raises:

FileExistsError – Raised if the directory already exists and overwrite is False.
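The overwrite behaviour described above can be sketched with the standard library alone. The helper prepare_output_dir is hypothetical. Removing the existing directory before writing is an assumption about what overwriting means here, not a description of the library's implementation.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_output_dir(dir_path, overwrite: bool = False) -> Path:
    """Create the target directory, refusing to touch existing data by default."""
    path = Path(dir_path)
    if path.exists():
        if not overwrite:
            raise FileExistsError(f"{path} already exists")
        # Assumed overwrite semantics: discard the old directory entirely.
        shutil.rmtree(path)
    path.mkdir(parents=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my_dataset"
    prepare_output_dir(target)  # first save: the directory is created
    try:
        prepare_output_dir(target)  # second save without overwrite
        second_call_failed = False
    except FileExistsError:
        second_call_failed = True
    prepare_output_dir(target, overwrite=True)  # explicit overwrite succeeds
    overwritten_ok = target.is_dir()
```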

set_column_lf(molecule: str, values: Series | int | float, column: str | None = None, skip_foreign_ids: bool = False, fill_missing: bool = False)

Sets values from a Pandas Series which has a MultiIndex with the levels: “id” and “sample”

Parameters:
  • molecule (str) – The molecule type to set the values for.

  • values (Union[pd.Series, int, float]) – The values to set (must either be a pandas Series with a MultiIndex containing the levels “id” and “sample” or a single value).

  • column (Optional[str], optional) – If given this column name is used otherwise the name of the Series is used as column name. Defaults to None.
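A Series of the required shape can be built as follows. The ids, sample names and the column name "imputed" are made up for illustration, and the commented calls only use parameters documented above.

```python
import pandas as pd

# MultiIndex with the two required levels "id" and "sample".
index = pd.MultiIndex.from_tuples(
    [("P1", "sample_a"), ("P1", "sample_b"), ("P2", "sample_a")],
    names=["id", "sample"],
)

# The Series name serves as the column name when no explicit column is passed.
values = pd.Series([0.1, 0.2, 0.3], index=index, name="imputed")

# Documented calls:
# dataset.set_column_lf(molecule="protein", values=values)
# dataset.set_column_lf(molecule="protein", values=0.0, column="baseline")
```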

set_wf(matrix: DataFrame, molecule: str, column: str = 'abundance', create_samples_if_not_exists: bool = False)

Sets a dataframe in wide format (molecule ids as index, sample names as columns) for the values of the given value column for the given molecule type.

Parameters:
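  • matrix (pd.DataFrame) – The dataframe in wide format (molecule ids as index, sample names as columns) holding the values to set.

  • molecule (str) – The molecule type (e.g. protein, peptide …)

  • column (str, optional) – The name of the value column to store the values in. Defaults to “abundance”.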

  • create_samples_if_not_exists (bool, optional) – Whether to create new samples if they do not exist. Defaults to False.

Returns:

the resulting dataframe

Return type:

pd.DataFrame

to_dgl_graph(feature_columns: Dict[str, str | List[str]], mappings: str | List[str], molecule_columns: Dict[str, str | List[str]] = {}, mapping_directions: Dict[str, Tuple[str, str]] = {}, make_bidirectional: bool = False, features_to_float32: bool = True, samples: List[str] | None = None) → dgl.DGLHeteroGraph

Transform the dataset into a dgl graph.

Parameters:
  • feature_columns (Dict[str, Union[str, List[str]]]) – Value columns to include as features for the nodes of the graph.

  • mappings (Union[str, List[str]]) – Names of the mappings to use for the edges of the graph.

  • mapping_directions (Dict[str, Tuple[str, str]], optional) – Specifies the direction of edges between molecule types. Defaults to {}.

  • make_bidirectional (bool, optional) – Whether to make the graph edges bidirectional. Defaults to False.

  • features_to_float32 (bool, optional) – Cast all feature values to float32. Defaults to True.

  • samples (Optional[List[str]], optional) – The names of the samples to include in the graph. If not given all samples are included. Defaults to None.

Raises:

KeyError – Raised if feature columns with the reserved names ‘hidden’ and ‘mask’ are specified.

Returns:

the created graph

Return type:

dgl.DGLHeteroGraph

write_tsvs(output_dir: Path, molecules: List[str] = ['protein', 'peptide'], columns: List[str] = ['abundance'], molecule_columns: bool | List[str] = [], index_names: List[str] | None = None, na_rep='NA')

Write .tsv files for the given molecules and columns to the given directory.

Parameters:
  • output_dir (Path) – The output directory path.

  • molecules (List[str], optional) – The molecules whose columns should be written to .tsv files. Defaults to [“protein”, “peptide”].

  • columns (List[str], optional) – The columns to write. Every column produces a .tsv file with the column values for every sample. Defaults to [“abundance”].

  • molecule_columns (Union[bool, List[str]], optional) – Any columns from the MoleculeSet to add to the .tsv files. Defaults to [].

  • index_names (Optional[List[str]], optional) – How to name the index columns in the .tsv files. Defaults to None.

  • na_rep (str, optional) – How to represent missing (NaN) values in the .tsv files. Defaults to “NA”.
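The effect of na_rep can be illustrated with pandas alone. The sketch below writes a small made-up wide-format matrix as tab-separated text. Whether write_tsvs uses DataFrame.to_csv internally, and how it names the output files, are not stated here, so this shows the resulting text format only under that assumption.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical wide-format matrix for one value column, with one missing value.
matrix = pd.DataFrame(
    {"sample_a": [1.0, np.nan], "sample_b": [2.0, 4.0]},
    index=pd.Index(["P1", "P2"], name="id"),
)

# Tab-separated output with missing values rendered as "NA".
buffer = io.StringIO()
matrix.to_csv(buffer, sep="\t", na_rep="NA")
tsv_lines = buffer.getvalue().splitlines()
```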