pyproteonet.data.dataset.Dataset

class pyproteonet.data.dataset.Dataset(molecule_set: MoleculeSet, samples: Dict[str, DatasetSample] = {}, missing_value: float = nan)

Represents a dataset consisting of a MoleculeSet, which specifies molecules and their relations, and several DatasetSamples, each holding a set of values for every molecule.

__init__(molecule_set: MoleculeSet, samples: Dict[str, DatasetSample] = {}, missing_value: float = nan)

Generates a dataset based on a MoleculeSet and an optional dictionary of DatasetSamples.

Parameters:

molecule_set (MoleculeSet) – The MoleculeSet this dataset is based on
samples (Dict[str, DatasetSample], optional) – Dictionary of DatasetSamples containing samples for this dataset. Defaults to {}.
missing_value (float, optional) – Value used to represent missing values. Defaults to np.nan.

Methods

`__init__`(molecule_set[, samples, missing_value])	Generates a dataset based on a MoleculeSet and an optional dictionary of DatasetSamples.
`calculate_hist`(molecule_name[, bins])	Calculate a histogram for the values of a given molecule type.
`copy`([samples, columns, copy_molecule_set, ...])	Copies the dataset.
`create_sample`(name, values)	Add a new sample to the dataset.
`drop_values`(columns[, molecules, inplace])	Drop one or several value columns.
`from_mapped_dataframe`(df, molecule, ...[, ...])	Transforms a pandas dataframe into a dataset.
`from_pandas`(dfs, mappings)	Transforms pandas dataframes into a dataset.
`get_column_flat`(molecule[, column, samples, ...])	Returns a single value column as a pandas Series with a MultiIndex with the levels: "id" (molecule id) and "sample".
`get_lf`(molecule[, columns, molecule_columns])	Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.
`get_mapped`(molecule, mapping[, columns, ...])	Return a dataframe containing all pairs of molecules connected by the given mapping with the values for the corresponding value columns.
`get_mapping_partner`(molecule, mapping)	Infer the partner molecule type for a molecule type and mapping
`get_molecule_subset`(molecule, ids)	Create a new dataset containing only the given molecule ids for the given molecule type.
`get_samples_value_matrix`(molecule[, column, ...])	Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.
`get_values_flat`(molecule[, columns, ...])	Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.
`get_wf`(molecule, column)	Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.
`infer_mapping`(molecule, mapping)	Infer a mapping name from a molecule type and a mapping string.
`load`(dir_path)	Loads a previously saved dataset from disk.
`number_molecules`(molecule)	The number of molecules for a given molecule type.
`rename_columns`(columns[, inplace])	Rename one or several value columns.
`rename_mapping`(mapping, new_name)	Rename a mapping.
`rename_molecule`(molecule, new_name)	Rename a molecule type.
`rename_values`(columns[, molecules, inplace])	Similar to rename_columns but uses the same mapping for all molecule types.
`sample_apply`(fn, *args, **kwargs)	Apply a function to every dataset sample.
`save`(dir_path[, overwrite])	Saves the dataset to disk as a directory containing .h5 files for the samples and a .h5 file for the molecule set.
`set_column_lf`(molecule, values[, column, ...])	Sets values from a Pandas Series which has a MultiIndex with the levels: "id" and "sample"
`set_lf`(molecule, values[, skip_foreign_ids, ...])
`set_wf`(matrix, molecule[, column, ...])	Sets a dataframe in wide format (molecule ids as index, sample names as columns) for the values of the given value column for the given molecule type.
`to_dgl_graph`(feature_columns, mappings[, ...])	Transform the dataset into a dgl graph.
`write_tsvs`(output_dir[, molecules, columns, ...])	Write .tsv files for the given molecules and columns to the given directory.

Attributes

`mappings`
`molecules`
`names`
`num_samples`
`sample_names`
`samples`

calculate_hist(molecule_name: str, bins='auto') → Tuple[ndarray, ndarray]

Calculate a histogram for the values of a given molecule type.

Parameters:

molecule_name (str) – The molecule type to generate the histogram for.
bins (str, optional) – The bins. Defaults to “auto”.

Returns:

The histogram values.

Return type:

Tuple[np.ndarray, np.ndarray]
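
The return type has the same shape as the result of `numpy.histogram` (counts and bin edges). The following is only a sketch of what such a result looks like, assuming the values are binned with NumPy after missing values are removed. That assumption is not confirmed by this page, and the `values` array is made up for illustration.

```python
import numpy as np

# Hypothetical abundance values for one molecule type; NaN marks missing values.
values = np.array([1.0, 2.0, 2.5, np.nan, 4.0, 8.0])

# Assumption: missing values are excluded before binning.
observed = values[~np.isnan(values)]

# bins="auto" lets NumPy choose the bin edges from the data.
counts, bin_edges = np.histogram(observed, bins="auto")

# There is always one more edge than there are bins.
print(counts, bin_edges)
```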

copy(samples: List[str] | None = None, columns: Iterable[str] | Dict[str, str | Iterable[str]] | None = None, copy_molecule_set: bool = True, molecule_ids: Dict[str, Index] = {})

Copies the dataset.

Parameters:

samples (Optional[List[str]], optional) – Dataset samples to include in the copy (all samples if not given). Defaults to None.
columns (Optional[ Union[Iterable[str], Dict[str, Union[str, Iterable[str]]]] ], optional) – Which value columns to copy for every molecule. Defaults to None.
copy_molecule_set (bool, optional) – Whether to copy the MoleculeSet or just store a reference to the original MoleculeSet. Defaults to True.
molecule_ids (Dict[str, pd.Index], optional) – Which molecule ids to copy for every molecule type (all molecule ids are copied if a molecule type is not specified). Defaults to {}.

Returns:

The copied dataset.

Return type:

Dataset

create_sample(name: str, values: Dict[str, DataFrame])

Add a new sample to the dataset.

Parameters:

name (str) – The name of the sample to add.
values (Dict[str, pd.DataFrame]) – The values for the sample. The keys are the molecule types and the values are dataframes with the molecule ids as index and the values as columns.

Raises:

ValueError – Raised if index of the given dataframes does not align with the molecule ids of the dataset.
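
As a rough illustration of the `values` argument, the sketch below builds one dataframe per molecule type, indexed by molecule ids. The ids, the `molecule_ids` dictionary, and the `ids_are_known` helper are hypothetical; the helper only approximates the kind of alignment check that could lead to the ValueError and is not the library's actual check.

```python
import pandas as pd

# Hypothetical molecule ids, standing in for the ids defined by the MoleculeSet.
molecule_ids = {
    "protein": pd.Index(["P1", "P2"], name="id"),
    "peptide": pd.Index(["pepA", "pepB", "pepC"], name="id"),
}

# One dataframe per molecule type: molecule ids as index, value columns as columns.
values = {
    "protein": pd.DataFrame({"abundance": [10.0, 12.5]}, index=molecule_ids["protein"]),
    "peptide": pd.DataFrame({"abundance": [3.0, 4.5, 1.2]}, index=molecule_ids["peptide"]),
}


def ids_are_known(values, molecule_ids):
    """Illustrative check only: every id in the dataframes must be a known molecule id."""
    return all(df.index.isin(molecule_ids[mol]).all() for mol, df in values.items())


aligned = ids_are_known(values, molecule_ids)

# With a real dataset, the call documented above would then be:
# dataset.create_sample(name="sample_1", values=values)
```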

drop_values(columns: List[str], molecules: List[str] | None = None, inplace: bool = False) → Dataset | None

Drop one or several value columns.

Parameters:

columns (List[str]) – The columns to drop.
molecules (Optional[List[str]], optional) – The molecules for which the given columns are dropped if they exist. Defaults to None.
inplace (bool, optional) – Whether to drop the columns in place instead of returning a new dataset. Defaults to False.

Returns:

The resulting dataset if inplace is False, otherwise None.

Return type:

Optional[Dataset]

classmethod from_mapped_dataframe(df: DataFrame, molecule: str, sample_columns: List[str], id_column: str | None = None, result_column_name: str = 'abundance', mapping_column: str | None = None, mapping_sep: str = ',', partner_molecule: str = 'protein', mapping_name='peptide-protein') → Dataset

Transforms a pandas dataframe into a dataset. Useful for loading tabular peptide abundance data with a mapping column that maps peptides to proteins.

Parameters:

df (pd.DataFrame) – The dataframe containing the data.
molecule (str) – The molecule whose values are contained in the dataframe.
sample_columns (List[str]) – The list of columns representing the dataset samples
id_column (Optional[str], optional) – The name of the column representing molecule ids. If None, the dataframe index is used. Defaults to None.
result_column_name (str, optional) – Name of the value column used for the loaded values. Defaults to “abundance”.
mapping_column (Optional[str], optional) – Column containing lists of partner molecule ids. Defaults to None.
mapping_sep (str, optional) – The separator character used to separate partner molecule ids. Defaults to “,”.
partner_molecule (str, optional) – The name of the partner molecule type. Defaults to “protein”.
mapping_name (str, optional) – The name of the mapping created from the mapping_column. Defaults to “peptide-protein”.

Returns:

the loaded dataset.

Return type:

Dataset
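
The sketch below shows the kind of table this method is described as accepting, and what splitting the mapping column with `mapping_sep=","` implies. The column names (`peptide_id`, `sample_1`, `sample_2`, `proteins`) and all values are made up, and the pandas code only illustrates the peptide-protein pairs encoded in the mapping column; it is not the library's implementation.

```python
import pandas as pd

# Hypothetical peptide table: one row per peptide, one column per sample,
# plus a column listing the proteins each peptide maps to.
df = pd.DataFrame(
    {
        "peptide_id": ["pepA", "pepB", "pepC"],
        "sample_1": [3.0, 4.5, 1.2],
        "sample_2": [2.8, None, 1.0],
        "proteins": ["P1", "P1,P2", "P2"],
    }
)

# Splitting each entry on the separator yields one row per peptide-protein pair.
pairs = (
    df.set_index("peptide_id")["proteins"]
    .str.split(",")
    .explode()
    .reset_index()
    .rename(columns={"proteins": "protein_id"})
)

# The corresponding call, following the signature documented above:
# dataset = Dataset.from_mapped_dataframe(
#     df, molecule="peptide", sample_columns=["sample_1", "sample_2"],
#     id_column="peptide_id", mapping_column="proteins", mapping_sep=",",
#     partner_molecule="protein", mapping_name="peptide-protein",
# )
```

Here `pepB` contributes two pairs because its mapping entry lists two proteins.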

classmethod from_pandas(dfs: Dict[str, Dict[str, DataFrame]], mappings: Dict[str, DataFrame]) → Dataset

Transforms pandas dataframes into a dataset. Useful for loading tabular abundance data together with mappings between molecule types (for example, peptides to proteins).

Parameters:

dfs (Dict[str, Dict[str, pd.DataFrame]]) – Two-level dictionary mapping molecule type name, then value name, to n x s pandas dataframes, where n is the number of molecules and s the number of samples.
mappings (Dict[str, pd.DataFrame]) – A dictionary of mapping name and pandas dataframe with multilevel index describing the molecule mapping.

Returns:

the created dataset.

Return type:

Dataset
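
A minimal sketch of how the two arguments might be laid out, based on the parameter descriptions above. All ids and values are invented, and naming the mapping's index levels after the molecule types is an assumption; this page does not state the expected level names.

```python
import pandas as pd

samples = ["sample_1", "sample_2"]

# Level 1: molecule type, level 2: value name, leaf: n x s dataframe
# (n molecules as rows, s samples as columns).
dfs = {
    "protein": {
        "abundance": pd.DataFrame(
            [[10.0, 11.0], [12.5, 12.0]], index=["P1", "P2"], columns=samples
        ),
    },
    "peptide": {
        "abundance": pd.DataFrame(
            [[3.0, 2.8], [4.5, 4.1], [1.2, 1.0]],
            index=["pepA", "pepB", "pepC"],
            columns=samples,
        ),
    },
}

# Assumption: the index levels of a mapping are named after the molecule types.
mappings = {
    "peptide-protein": pd.DataFrame(
        index=pd.MultiIndex.from_tuples(
            [("pepA", "P1"), ("pepB", "P1"), ("pepB", "P2"), ("pepC", "P2")],
            names=["peptide", "protein"],
        )
    )
}

# With the library installed, the call documented above would be:
# dataset = Dataset.from_pandas(dfs=dfs, mappings=mappings)
```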

get_column_flat(molecule: str, column: str = 'abundance', samples: List[str] | None = None, ids: Iterable | None = None, return_missing_mask: bool = False, drop_sample_id: bool = False) → Series

Returns a single value column as a pandas Series with a MultiIndex with the levels: “id” (molecule id) and “sample”.

Parameters:

molecule (str) – The molecule type to get the values for.
column (str, optional) – The value column to get the values for. Defaults to “abundance”.
samples (Optional[List[str]], optional) – The name of the samples to consider or None to consider all samples. Defaults to None.
ids (Optional[Iterable], optional) – The molecule ids to consider. Defaults to None.
return_missing_mask (bool, optional) – Whether to return a mask of missing values. Defaults to False.
drop_sample_id (bool, optional) – Whether to drop the sample id from the result’s index. Defaults to False.

Returns:

The resulting pandas Series.

Return type:

pd.Series
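
To make the flat layout concrete, the sketch below converts a small wide matrix (molecule ids as index, sample names as columns) into a Series indexed by “id” and “sample” using plain pandas. The data is invented, and the missing-value mask is only a guess at what `return_missing_mask` might provide, assuming NaN is the missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical wide matrix: molecule ids as index, sample names as columns.
wide = pd.DataFrame(
    [[3.0, 2.8], [4.5, np.nan]],
    index=pd.Index(["pepA", "pepB"], name="id"),
    columns=pd.Index(["sample_1", "sample_2"], name="sample"),
)

# Flat (long) form: one value per (id, sample) pair, NaN entries are kept.
flat = (
    wide.melt(var_name="sample", value_name="abundance", ignore_index=False)
    .set_index("sample", append=True)["abundance"]
    .sort_index()
)

# Assumption: a missing mask marks the entries equal to the missing value (NaN here).
missing_mask = flat.isna()

# Unstacking the sample level recovers the wide layout.
restored = flat.unstack("sample")
```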

get_lf(molecule: str, columns: str | List[str] | None = None, molecule_columns: List[str] = [])

Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.

Parameters:

molecule (str) – The molecule type (e.g. protein, peptide …)
columns (Optional[Union[str, List[str]]], optional) – The value columns to include in the result. If None, all value columns are included. Defaults to None.
molecule_columns (List[str], optional) – Any molecule columns from the MoleculeSet to include in the resulting dataframe. Defaults to [].

Returns:

the resulting dataframe

Return type:

pd.DataFrame

get_mapped(molecule: str, mapping: str, columns: str | List[str] = [], samples: List | None = None, partner_columns: str | List[str] = [], molecule_columns: str | List[str] = [], molecule_columns_partner: str | List[str] = [], return_partner_index_name: bool = False) → DataFrame | Tuple[DataFrame, str]

Return a dataframe containing all pairs of molecules connected by the given mapping with the values for the corresponding value columns.

Parameters:

molecule (str) – A molecule type like protein, peptide…
mapping (str) – A mapping name.
columns (Union[str, List[str]], optional) – The value columns of the given molecule type to include in the results. Defaults to [].
samples (Optional[List], optional) – The names of the samples to include in the results. Defaults to None.
partner_columns (Union[str, List[str]], optional) – The value columns of the partner molecule type to include. Defaults to [].
molecule_columns (Union[str, List[str]], optional) – Any molecule columns from the MoleculeSet to include for the given molecule. Defaults to [].
molecule_columns_partner (Union[str, List[str]], optional) – Any molecule columns from the MoleculeSet to include for the given partner molecule. Defaults to [].
return_partner_index_name (bool, optional) – Whether to return the name of the partner index. Defaults to False.

Returns:

the resulting dataframe and, if requested, the partner index name.

Return type:

Union[pd.DataFrame, Tuple[pd.DataFrame, str]]
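
Conceptually, the result pairs each molecule with its mapping partners and attaches values from both sides. The pandas sketch below imitates that idea for a single sample with two merges. The tables, column names, and suffixes are hypothetical, and the actual layout of the returned dataframe may differ.

```python
import pandas as pd

# Hypothetical peptide-protein mapping: one row per connected pair.
mapping = pd.DataFrame(
    {"peptide": ["pepA", "pepB", "pepB"], "protein": ["P1", "P1", "P2"]}
)

# Hypothetical values of one sample for both molecule types.
peptide_values = pd.DataFrame({"peptide": ["pepA", "pepB"], "abundance": [3.0, 4.5]})
protein_values = pd.DataFrame({"protein": ["P1", "P2"], "abundance": [10.0, 12.5]})

# Attach the peptide values, then the partner (protein) values, to every pair.
pairs = mapping.merge(peptide_values, on="peptide").merge(
    protein_values, on="protein", suffixes=("_peptide", "_protein")
)
```

A peptide that maps to several proteins appears once per partner, so `pepB` occupies two rows here.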

get_mapping_partner(molecule: str, mapping: str) → str

Infer the partner molecule type for a molecule type and mapping

Parameters:

molecule (str) – One of the two molecule types of the mapping.
mapping (str) – The mapping name.

Returns:

The other molecule type of the mapping.

Return type:

str

get_molecule_subset(molecule: str, ids: Index)

Create a new dataset containing only the given molecule ids for the given molecule type.

Parameters:

molecule (str) – The molecule type to copy
ids (pd.Index) – The molecule ids to copy

Returns:

A new dataset containing the specified subset of the old dataset

Return type:

Dataset

get_samples_value_matrix(molecule: str, column: str = 'abundance', molecule_columns: bool | List[str] = [], samples: List[str] | None = None, ids: Iterable | None = None) → DataFrame

Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.

Parameters:

molecule (str) – The molecule type (e.g. protein, peptide …)
column (str, optional) – The value column to use. Defaults to “abundance”.
molecule_columns (Union[bool, List[str]], optional) – Any molecule columns from the MoleculeSet to include in the resulting dataframe. Defaults to [].
samples (Optional[List[str]], optional) – The names of the samples to consider for the generated result. Defaults to None.
ids (Optional[Iterable], optional) – The molecule ids to consider for the generated result. Defaults to None.

Returns:

the resulting dataframe

Return type:

pd.DataFrame

get_values_flat(molecule: str, columns: str | List[str] | None = None, molecule_columns: List[str] = []) → DataFrame

Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.

Parameters:

molecule (str) – The molecule type (e.g. protein, peptide …)
columns (Optional[Union[str, List[str]]], optional) – The value columns to include in the result. If None, all value columns are included. Defaults to None.
molecule_columns (List[str], optional) – Any molecule columns from the MoleculeSet to include in the resulting dataframe. Defaults to [].

Returns:

the resulting dataframe

Return type:

pd.DataFrame

get_wf(molecule: str, column: str) → DataFrame

Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.

Parameters:

molecule (str) – The molecule type (e.g. protein, peptide …)
column (str) – The value column to use.

Returns:

the resulting dataframe

Return type:

pd.DataFrame

infer_mapping(molecule: str, mapping: str) → Tuple[str, str, str]

Infer a mapping name from a molecule type and a mapping string.

Parameters:

molecule (str) – Molecule type like protein, peptide …
mapping (str) – Either a mapping name or the name of a molecule type. If a molecule type is given, the mapping name connecting both molecule types is inferred. If a mapping name is given, it is returned as is.

Returns:

The source molecule type, the mapping name, and the target molecule type.

Return type:

Tuple[str, str, str]

classmethod load(dir_path: str | Path) → Dataset

Loads a previously saved dataset from disk.

Parameters:

dir_path (Union[str, Path]) – Path to the directory representing the dataset.

Returns:

The loaded dataset.

Return type:

Dataset

number_molecules(molecule: str) → int

The number of molecules for a given molecule type.

Parameters:

molecule (str) – The molecule type to get the number of molecules for (e.g. protein, peptide …)

Returns:

The number of molecules.

Return type:

int

rename_columns(columns: Dict[str, Dict[str, str]], inplace: bool = False) → Dataset | None

Rename one or several value columns.

Parameters:

columns (Dict[str, Dict[str, str]]) – A dictionary mapping old to new column names for every molecule type (protein, peptide etc.)
inplace (bool, optional) – Whether to perform the operation inplace or return a copy. Defaults to False.

Returns:

A copy of the dataset with the renamed columns if inplace is False, otherwise None.

Return type:

Optional[Dataset]

rename_mapping(mapping: str, new_name: str)

Rename a mapping.

Parameters:

mapping (str) – The old name of the mapping.
new_name (str) – The new name of the mapping.

rename_molecule(molecule: str, new_name: str)

Rename a molecule type.

Parameters:

molecule (str) – The current name.
new_name (str) – The new name.

Raises:

KeyError – Raised when the new name already exists.

rename_values(columns: Dict[str, str], molecules: List[str] | None = None, inplace: bool = False)

Similar to rename_columns but uses the same mapping for all molecule types.
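
The difference between the two argument shapes can be sketched in plain Python. `rename_columns` takes one old-to-new mapping per molecule type, while `rename_values` takes a single mapping. The helper `expand_for_all_molecules` is hypothetical and only shows how one mapping would presumably be repeated per molecule type; it is not part of the library.

```python
from typing import Dict, List


def expand_for_all_molecules(
    columns: Dict[str, str], molecules: List[str]
) -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: repeat one old->new mapping for every molecule type."""
    return {molecule: dict(columns) for molecule in molecules}


# Single mapping, as passed to rename_values.
shared = {"abundance": "raw_abundance"}

# Per-molecule mapping, in the shape rename_columns expects.
per_molecule = expand_for_all_molecules(shared, ["protein", "peptide"])
print(per_molecule)
```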

sample_apply(fn: Callable, *args, **kwargs)

Apply a function to every dataset sample.

Parameters:

fn (Callable) – The function to apply.

Returns:

The transformed dataset.

Return type:

Dataset

save(dir_path: str | Path, overwrite: bool = False)

Saves the dataset to disk as a directory containing .h5 files for the samples and a .h5 file for the molecule set.

Parameters:

dir_path (Union[str, Path]) – Directory path to save the dataset to.
overwrite (bool, optional) – Whether to overwrite any existing data. Defaults to False.

Raises:

FileExistsError – Raised if the directory already exists and overwrite is False.

set_column_lf(molecule: str, values: Series | int | float, column: str | None = None, skip_foreign_ids: bool = False, fill_missing: bool = False)

Sets values from a Pandas Series which has a MultiIndex with the levels: “id” and “sample”

Parameters:

molecule (str) – The molecule type to set the values for.
values (Union[pd.Series, int, float]) – The values to set (must either be a pandas Series with a MultiIndex containing the levels “id” and “sample” or a single value).
column (Optional[str], optional) – If given this column name is used otherwise the name of the Series is used as column name. Defaults to None.

set_wf(matrix: DataFrame, molecule: str, column: str = 'abundance', create_samples_if_not_exists: bool = False)

Sets a dataframe in wide format (molecule ids as index, sample names as columns) for the values of the given value column for the given molecule type.

Parameters:

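matrix (pd.DataFrame) – The dataframe in wide format (molecule ids as index, sample names as columns) holding the values to set.
molecule (str) – The molecule type (e.g. protein, peptide …)
column (str, optional) – The name of the value column to store the result in. Defaults to “abundance”.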
create_samples_if_not_exists (bool, optional) – Whether to create new samples if they do not exist. Defaults to False.

Returns:

the resulting dataframe

Return type:

pd.DataFrame

to_dgl_graph(feature_columns: Dict[str, str | List[str]], mappings: str | List[str], molecule_columns: Dict[str, str | List[str]] = {}, mapping_directions: Dict[str, Tuple[str, str]] = {}, make_bidirectional: bool = False, features_to_float32: bool = True, samples: List[str] | None = None) → dgl.DGLHeteroGraph

Transform the dataset into a dgl graph.

Parameters:

feature_columns (Dict[str, Union[str, List[str]]]) – value columns to include as features for the nodes of the graph.
mappings (Union[str, List[str]]) – Names of the mappings to use for the edges of the graph.
mapping_directions (Dict[str, Tuple[str, str]], optional) – Specifies the direction of edges between molecule types. Defaults to {}.
make_bidirectional (bool, optional) – Whether to make the graph edges bidirectional. Defaults to False.
features_to_float32 (bool, optional) – Cast all feature values to float32. Defaults to True.
samples (Optional[List[str]], optional) – The names of the samples to include in the graph. If not given all samples are included. Defaults to None.

Raises:

KeyError – Raised if feature columns with the reserved names ‘hidden’ and ‘mask’ are specified

Returns:

the created graph

Return type:

dgl.DGLHeteroGraph

write_tsvs(output_dir: Path, molecules: List[str] = ['protein', 'peptide'], columns: List[str] = ['abundance'], molecule_columns: bool | List[str] = [], index_names: List[str] | None = None, na_rep='NA')

Write .tsv files for the given molecules and columns to the given directory.

Parameters:

output_dir (Path) – The output directory path.
molecules (List[str], optional) – The molecules whose columns should be written to .tsv files. Defaults to [“protein”, “peptide”].
columns (List[str], optional) – The columns to write. Every column produces a .tsv file with the column values for every sample. Defaults to [“abundance”].
molecule_columns (Union[bool, List[str]], optional) – Any columns from the MoleculeSet to add to the .tsv files. Defaults to [].
index_names (Optional[List[str]], optional) – How to name the index columns in the .tsv files. Defaults to None.
na_rep (str, optional) – How to represent missing (NaN) values in the .tsv files. Defaults to “NA”.
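
To show what `na_rep` does to the written text, the sketch below serializes a small wide matrix as tab-separated values with pandas. It assumes the files hold one wide matrix per molecule type and column, which this page does not spell out; the data is invented and the output goes to an in-memory buffer rather than a file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical wide matrix for one molecule type and one value column.
wide = pd.DataFrame(
    [[10.0, np.nan], [12.5, 12.0]],
    index=pd.Index(["P1", "P2"], name="id"),
    columns=["sample_1", "sample_2"],
)

# Tab-separated output with missing values written as "NA".
buffer = io.StringIO()
wide.to_csv(buffer, sep="\t", na_rep="NA")
tsv_text = buffer.getvalue()
print(tsv_text)
```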