pyproteonet.data.dataset.Dataset
- class pyproteonet.data.dataset.Dataset(molecule_set: MoleculeSet, samples: Dict[str, DatasetSample] = {}, missing_value: float = nan)
Represents a dataset consisting of a MoleculeSet, which specifies molecules and their relations, and several DatasetSamples, each holding a set of values for every molecule.
- __init__(molecule_set: MoleculeSet, samples: Dict[str, DatasetSample] = {}, missing_value: float = nan)
Generates a dataset based on a MoleculeSet and an optional dictionary of DatasetSamples.
- Parameters:
molecule_set (MoleculeSet) – The MoleculeSet this dataset is based on.
samples (Dict[str, DatasetSample], optional) – Dictionary of DatasetSamples containing samples for this dataset. Defaults to {}.
missing_value (float, optional) – Value used to represent missing values. Defaults to np.nan.
Methods
__init__(molecule_set[, samples, missing_value]) – Generates a dataset based on a MoleculeSet and an optional dictionary of DatasetSamples.
calculate_hist(molecule_name[, bins]) – Calculate a histogram for the values of a given molecule type.
copy([samples, columns, copy_molecule_set, ...]) – Copies the dataset.
create_sample(name, values) – Add a new sample to the dataset.
drop_values(columns[, molecules, inplace]) – Drop one or several value columns.
from_mapped_dataframe(df, molecule, ...[, ...]) – Transforms a pandas dataframe into a dataset.
from_pandas(dfs, mappings) – Transforms pandas dataframes into a dataset.
get_column_flat(molecule[, column, samples, ...]) – Returns a single value column as a pandas Series with a MultiIndex with the levels "id" (molecule id) and "sample".
get_lf(molecule[, columns, molecule_columns]) – Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.
get_mapped(molecule, mapping[, columns, ...]) – Return a dataframe containing all pairs of molecules connected by the given mapping with the values for the corresponding value columns.
get_mapping_partner(molecule, mapping) – Infer the partner molecule type for a molecule type and mapping.
get_molecule_subset(molecule, ids) – Create a new dataset containing only the given molecule ids for the given molecule type.
get_samples_value_matrix(molecule[, column, ...]) – Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.
get_values_flat(molecule[, columns, ...]) – Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.
get_wf(molecule, column) – Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.
infer_mapping(molecule, mapping) – Infer a mapping name from a molecule type and a mapping string.
load(dir_path) – Loads a previously saved dataset from disk.
number_molecules(molecule) – The number of molecules for a given molecule type.
rename_columns(columns[, inplace]) – Rename one or several value columns.
rename_mapping(mapping, new_name) – Rename a mapping.
rename_molecule(molecule, new_name) – Rename a molecule type.
rename_values(columns[, molecules, inplace]) – Similar to rename_columns but uses the same mapping for all molecule types.
sample_apply(fn, *args, **kwargs) – Apply a function to every dataset sample.
save(dir_path[, overwrite]) – Saves the dataset to disk as a directory containing .h5 files for the samples and a .h5 file for the molecule set.
set_column_lf(molecule, values[, column, ...]) – Sets values from a pandas Series which has a MultiIndex with the levels "id" and "sample".
set_lf(molecule, values[, skip_foreign_ids, ...])
set_wf(matrix, molecule[, column, ...]) – Sets a dataframe in wide format (molecule ids as index, sample names as columns) for the values of the given value column for the given molecule type.
to_dgl_graph(feature_columns, mappings[, ...]) – Transform the dataset into a dgl graph.
write_tsvs(output_dir[, molecules, columns, ...]) – Write .tsv files for the given molecules and columns to the given directory.
Attributes
mappings
molecules
names
num_samples
sample_names
samples
- calculate_hist(molecule_name: str, bins='auto') → Tuple[ndarray, ndarray]
Calculate a histogram for the values of a given molecule type.
- Parameters:
molecule_name (str) – The molecule type to generate the histogram for.
bins (str, optional) – The bin specification for the histogram. Defaults to “auto”.
- Returns:
The histogram values.
- Return type:
Tuple[np.ndarray, np.ndarray]
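The return type has the same shape as the result of `numpy.histogram` (bin counts and bin edges). The following is a minimal sketch of that computation on plain NumPy data. Whether `calculate_hist` forwards to `numpy.histogram` and how it treats missing values are assumptions made here, and the values are hypothetical.

```python
import numpy as np

# Hypothetical abundance values for one molecule type; NaN marks missing values.
values = np.array([1.0, 2.5, np.nan, 3.0, 2.0, np.nan, 4.5])

# Drop missing values before binning (assumed behaviour), then let NumPy
# choose the bin edges via the "auto" rule.
observed = values[~np.isnan(values)]
counts, bin_edges = np.histogram(observed, bins="auto")
```

There is always one more bin edge than there are bins, so `len(bin_edges) == len(counts) + 1`.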
- copy(samples: List[str] | None = None, columns: Iterable[str] | Dict[str, str | Iterable[str]] | None = None, copy_molecule_set: bool = True, molecule_ids: Dict[str, Index] = {})
Copies the dataset.
- Parameters:
samples (Optional[List[str]], optional) – Dataset samples to include in the copy (all samples if not given). Defaults to None.
columns (Optional[ Union[Iterable[str], Dict[str, Union[str, Iterable[str]]]] ], optional) – Which value columns to copy for every molecule. Defaults to None.
copy_molecule_set (bool, optional) – Whether to copy the MoleculeSet or just store a reference to the original MoleculeSet. Defaults to True.
molecule_ids (Dict[str, pd.Index], optional) – Which molecule ids to copy for every molecule type (all molecule ids are copied if a molecule type is not specified). Defaults to {}.
- Returns:
The copied dataset.
- Return type:
Dataset
- create_sample(name: str, values: Dict[str, DataFrame])
Add a new sample to the dataset.
- Parameters:
name (str) – The name of the sample to add.
values (Dict[str, pd.DataFrame]) – The values for the sample. The keys are the molecule types and the values are dataframes with the molecule ids as index and the values as columns.
- Raises:
ValueError – Raised if index of the given dataframes does not align with the molecule ids of the dataset.
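The expected shape of `values` can be sketched with plain pandas objects. The molecule ids, the column name, and the `check_alignment` helper are hypothetical; the exact alignment rule the library enforces is not stated above, so the check below (no ids unknown to the dataset) is only one plausible reading.

```python
import pandas as pd

# Hypothetical molecule ids as they might be stored in the MoleculeSet.
molecule_ids = {
    "protein": pd.Index(["P1", "P2"], name="id"),
    "peptide": pd.Index(["pepA", "pepB", "pepC"], name="id"),
}

# One dataframe per molecule type: molecule ids as index, value columns as columns.
values = {
    "protein": pd.DataFrame({"abundance": [10.0, 12.5]}, index=molecule_ids["protein"]),
    "peptide": pd.DataFrame({"abundance": [1.0, 2.0, 3.0]}, index=molecule_ids["peptide"]),
}


def check_alignment(values, molecule_ids):
    """Raise ValueError if a dataframe holds ids unknown to the dataset (assumed rule)."""
    for molecule, df in values.items():
        foreign = df.index.difference(molecule_ids[molecule])
        if len(foreign) > 0:
            raise ValueError(f"Unknown {molecule} ids: {list(foreign)}")


check_alignment(values, molecule_ids)  # passes silently for aligned data
```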
- drop_values(columns: List[str], molecules: List[str] | None = None, inplace: bool = False) → Dataset | None
Drop one or several value columns.
- Parameters:
columns (List[str]) – The columns to drop.
molecules (Optional[List[str]], optional) – The molecules for which the given columns are dropped if they exist. Defaults to None.
inplace (bool, optional) – Whether to modify the dataset in place instead of returning a new dataset. Defaults to False.
- Returns:
The resulting dataset if inplace is False, otherwise None.
- Return type:
Optional[Dataset]
- classmethod from_mapped_dataframe(df: DataFrame, molecule: str, sample_columns: List[str], id_column: str | None = None, result_column_name: str = 'abundance', mapping_column: str | None = None, mapping_sep: str = ',', partner_molecule: str = 'protein', mapping_name='peptide-protein') → Dataset
Transforms a pandas dataframe into a dataset. Useful for loading tabular peptide abundance data with a mapping column that maps peptides to proteins.
- Parameters:
df (pd.DataFrame) – The dataframe containing the data.
molecule (str) – The molecule whose values are contained in the dataframe.
sample_columns (List[str]) – The list of columns representing the dataset samples.
id_column (Optional[str], optional) – The name of the column representing molecule ids. If None, the dataframe index is used. Defaults to None.
result_column_name (str, optional) – Name of the value column used for the loaded values. Defaults to “abundance”.
mapping_column (Optional[str], optional) – Column containing lists of partner molecule ids. Defaults to None.
mapping_sep (str, optional) – The separator character used to separate partner molecule ids. Defaults to “,”.
partner_molecule (str, optional) – The name of the partner molecule type. Defaults to “protein”.
mapping_name (str, optional) – The name of the mapping created from the mapping_column. Defaults to “peptide-protein”.
- Returns:
The loaded dataset.
- Return type:
Dataset
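The kind of table this method targets, and the effect of `mapping_sep`, can be sketched with pandas alone. The column names and values are hypothetical, and the splitting shown is an illustration of the described behaviour, not the library's implementation.

```python
import pandas as pd

# Hypothetical peptide table: one row per peptide, one column per sample,
# plus a column listing the proteins each peptide maps to.
df = pd.DataFrame(
    {
        "peptide_id": ["pepA", "pepB", "pepC"],
        "sample1": [1.0, 2.0, None],
        "sample2": [1.5, None, 3.0],
        "proteins": ["P1", "P1,P2", "P2"],
    }
)

# Split the mapping column on the separator so that every
# (peptide, protein) pair ends up in its own row.
pairs = (
    df.set_index("peptide_id")["proteins"]
    .str.split(",")
    .explode()
    .rename("protein_id")
    .reset_index()
)
```

Going by the signature above, such a table would presumably be loaded with `Dataset.from_mapped_dataframe(df, molecule="peptide", sample_columns=["sample1", "sample2"], id_column="peptide_id", mapping_column="proteins")`.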
- classmethod from_pandas(dfs: Dict[str, Dict[str, DataFrame]], mappings: Dict[str, DataFrame]) → Dataset
Transforms pandas dataframes into a dataset. Useful for loading tabular abundance data together with separate dataframes that describe how molecules map onto each other (e.g. peptides to proteins).
- Parameters:
dfs (Dict[str, Dict[str, pd.DataFrame]]) – Two-level dictionary keyed first by molecule type name and then by value name, holding n×s pandas dataframes, where n is the number of molecules and s the number of samples.
mappings (Dict[str, pd.DataFrame]) – A dictionary mapping each mapping name to a pandas dataframe whose multilevel index describes the molecule mapping.
- Returns:
The created dataset.
- Return type:
Dataset
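As a rough illustration of the two arguments, the nested dictionaries could be assembled as follows. All ids, sample names, and the index level names of the mapping dataframe are assumptions; only the overall nesting follows the parameter descriptions above.

```python
import numpy as np
import pandas as pd

# n x s dataframes: molecule ids as index, one column per sample.
peptide_abundance = pd.DataFrame(
    {"sample1": [1.0, 2.0, np.nan], "sample2": [1.5, np.nan, 3.0]},
    index=pd.Index(["pepA", "pepB", "pepC"], name="id"),
)
protein_abundance = pd.DataFrame(
    {"sample1": [10.0, 20.0], "sample2": [11.0, 21.0]},
    index=pd.Index(["P1", "P2"], name="id"),
)

# First level: molecule type name; second level: value name.
dfs = {
    "peptide": {"abundance": peptide_abundance},
    "protein": {"abundance": protein_abundance},
}

# Each mapping is a dataframe whose multilevel index lists connected molecule ids.
# The level names used here are an assumption.
pairs = pd.MultiIndex.from_tuples(
    [("pepA", "P1"), ("pepB", "P1"), ("pepB", "P2"), ("pepC", "P2")],
    names=["peptide", "protein"],
)
mappings = {"peptide-protein": pd.DataFrame(index=pairs)}
```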
- get_column_flat(molecule: str, column: str = 'abundance', samples: List[str] | None = None, ids: Iterable | None = None, return_missing_mask: bool = False, drop_sample_id: bool = False) → Series
Returns a single value column as a pandas Series with a MultiIndex with the levels “id” (molecule id) and “sample”.
- Parameters:
molecule (str) – The molecule type to get the values for.
column (str, optional) – The value column to get the values for. Defaults to “abundance”.
samples (Optional[List[str]], optional) – The name of the samples to consider or None to consider all samples. Defaults to None.
ids (Optional[Iterable], optional) – The molecule ids to consider. Defaults to None.
return_missing_mask (bool, optional) – Whether to return a mask of missing values. Defaults to False.
drop_sample_id (bool, optional) – Whether to drop the sample id from the result’s index. Defaults to False.
- Returns:
The resulting pandas Series.
- Return type:
pd.Series
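A Series of the described shape can be built by hand to see what the result looks like. The ids, sample names, and values below are hypothetical; the mask and the per-sample selection are ordinary pandas operations on such a Series, not calls into the library.

```python
import numpy as np
import pandas as pd

# Hypothetical protein abundances for two samples; NaN marks a missing value.
index = pd.MultiIndex.from_product(
    [["P1", "P2"], ["sample1", "sample2"]], names=["id", "sample"]
)
flat = pd.Series([1.0, np.nan, 2.0, 4.0], index=index, name="abundance")

# Boolean mask of missing entries, aligned with the same MultiIndex.
missing_mask = flat.isna()

# Values of a single sample, indexed by molecule id only.
sample1 = flat.xs("sample1", level="sample")
```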
- get_lf(molecule: str, columns: str | List[str] | None = None, molecule_columns: List[str] = [])
Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.
- Parameters:
molecule (str) – The molecule type (e.g. protein, peptide …)
columns (Optional[Union[str, List[str]]], optional) – The value columns to include in the result. All value columns are included if None. Defaults to None.
molecule_columns (List[str], optional) – Any molecule columns from the MoleculeSet to include in the resulting dataframe. Defaults to [].
- Returns:
The resulting dataframe.
- Return type:
pd.DataFrame
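The long format described here stacks the per-sample tables on top of each other. A small pandas sketch with hypothetical ids, sample names, and value columns (`log_abundance` is made up for illustration):

```python
import numpy as np
import pandas as pd

ids = pd.Index(["P1", "P2"], name="id")

# Hypothetical per-sample tables: molecule ids as index, value columns as columns.
per_sample = {
    "sample1": pd.DataFrame(
        {"abundance": [1.0, 2.0], "log_abundance": [0.0, 1.0]}, index=ids
    ),
    "sample2": pd.DataFrame(
        {"abundance": [np.nan, 4.0], "log_abundance": [np.nan, 2.0]}, index=ids
    ),
}

# Concatenate the samples; the dict keys become the outer "sample" index level.
long_df = pd.concat(per_sample, names=["sample", "id"])
```

Each row of `long_df` is then addressed by a (sample, molecule id) pair, for example `long_df.loc[("sample2", "P2")]`.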
- get_mapped(molecule: str, mapping: str, columns: str | List[str] = [], samples: List | None = None, partner_columns: str | List[str] = [], molecule_columns: str | List[str] = [], molecule_columns_partner: str | List[str] = [], return_partner_index_name: bool = False) → DataFrame | Tuple[DataFrame, str]
Return a dataframe containing all pairs of molecules connected by the given mapping with the values for the corresponding value columns.
- Parameters:
molecule (str) – A molecule type like protein, peptide…
mapping (str) – A mapping name.
columns (Union[str, List[str]], optional) – The value columns of the given molecule type to include in the results. Defaults to [].
samples (Optional[List], optional) – The names of the samples to include in the results. Defaults to None.
partner_columns (Union[str, List[str]], optional) – The value columns of the partner molecule type to include. Defaults to [].
molecule_columns (Union[str, List[str]], optional) – Any molecule columns from the MoleculeSet to include for the given molecule. Defaults to [].
molecule_columns_partner (Union[str, List[str]], optional) – Any molecule columns from the MoleculeSet to include for the given partner molecule. Defaults to [].
return_partner_index_name (bool, optional) – Whether to return the name of the partner index. Defaults to False.
- Returns:
The resulting dataframe and an optional partner index name.
- Return type:
Union[pd.DataFrame, Tuple[pd.DataFrame, str]]
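Conceptually, the result pairs up values from both sides of a mapping. The sketch below reproduces that idea for a single sample with plain pandas joins. The ids, the suffixed column names, and the index layout are assumptions and may differ from what the library returns.

```python
import pandas as pd

# Hypothetical value tables for one sample.
peptides = pd.DataFrame(
    {"abundance": [1.0, 2.0, 3.0]},
    index=pd.Index(["pepA", "pepB", "pepC"], name="peptide"),
)
proteins = pd.DataFrame(
    {"abundance": [10.0, 20.0]},
    index=pd.Index(["P1", "P2"], name="protein"),
)

# One row per connected (peptide, protein) pair.
mapping = pd.DataFrame(
    {
        "peptide": ["pepA", "pepB", "pepB", "pepC"],
        "protein": ["P1", "P1", "P2", "P2"],
    }
)

# Attach the values of each side to every pair, then index by the pair.
mapped = (
    mapping
    .join(peptides.add_suffix("_peptide"), on="peptide")
    .join(proteins.add_suffix("_protein"), on="protein")
    .set_index(["peptide", "protein"])
)
```

A peptide shared between proteins (here `pepB`) appears once per partner, which is what makes this layout convenient for aggregation.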
- get_mapping_partner(molecule: str, mapping: str) → str
Infer the partner molecule type for a molecule type and mapping.
- Parameters:
molecule (str) – One of the two molecule types of the mapping.
mapping (str) – The mapping name.
- Returns:
The other molecule type of the mapping.
- Return type:
str
- get_molecule_subset(molecule: str, ids: Index)
Create a new dataset containing only the given molecule ids for the given molecule type.
- Parameters:
molecule (str) – The molecule type to copy.
ids (pd.Index) – The molecule ids to copy.
- Returns:
A new dataset containing the specified subset of the old dataset.
- Return type:
Dataset
- get_samples_value_matrix(molecule: str, column: str = 'abundance', molecule_columns: bool | List[str] = [], samples: List[str] | None = None, ids: Iterable | None = None) → DataFrame
Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.
- Parameters:
molecule (str) – The molecule type (e.g. protein, peptide …)
column (str, optional) – The value column to use. Defaults to “abundance”.
molecule_columns (Union[bool, List[str]], optional) – Any molecule columns from the MoleculeSet to include in the resulting dataframe. Defaults to [].
samples (Optional[List[str]], optional) – The names of the samples to consider for the generated result. Defaults to None.
ids (Optional[Iterable], optional) – The molecule ids to consider for the generated result. Defaults to None.
- Returns:
The resulting dataframe.
- Return type:
pd.DataFrame
- get_values_flat(molecule: str, columns: str | List[str] | None = None, molecule_columns: List[str] = []) → DataFrame
Returns a dataframe in long format with a MultiIndex (sample id, molecule id) representing the value columns for the specified molecule type.
- Parameters:
molecule (str) – The molecule type (e.g. protein, peptide …)
columns (Optional[Union[str, List[str]]], optional) – The value columns to include in the result. All value columns are included if None. Defaults to None.
molecule_columns (List[str], optional) – Any molecule columns from the MoleculeSet to include in the resulting dataframe. Defaults to [].
- Returns:
The resulting dataframe.
- Return type:
pd.DataFrame
- get_wf(molecule: str, column: str) → DataFrame
Returns a dataframe in wide format (molecule ids as index, sample names as columns) representing the values of the specified value column for the specified molecule type.
- Parameters:
molecule (str) – The molecule type (e.g. protein, peptide …)
column (str) – The value column to use.
- Returns:
The resulting dataframe.
- Return type:
pd.DataFrame
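The wide format holds the same information as one value column of the long format, pivoted so that each sample becomes a column. A sketch with hypothetical data, using only pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format values: one row per (sample, molecule id) pair.
index = pd.MultiIndex.from_product(
    [["sample1", "sample2"], ["P1", "P2"]], names=["sample", "id"]
)
long_df = pd.DataFrame({"abundance": [1.0, 2.0, np.nan, 4.0]}, index=index)

# Pivot the "sample" level into columns:
# one row per molecule id, one column per sample.
wide = long_df["abundance"].unstack("sample")
```

Missing values stay NaN in the pivoted matrix, so `wide.isna()` gives a per-sample missingness mask.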
- infer_mapping(molecule: str, mapping: str) → Tuple[str, str, str]
Infer a mapping name from a molecule type and a mapping string.
- Parameters:
molecule (str) – Molecule type like protein, peptide …
mapping (str) – If the name of a molecule type is given, the mapping name connecting both molecule types is inferred. If a mapping name is given, it is returned as is.
- Returns:
The from molecule type, the mapping name, and the to molecule type.
- Return type:
Tuple[str, str, str]
- classmethod load(dir_path: str | Path) → Dataset
Loads a previously saved dataset from disk.
- Parameters:
dir_path (Union[str, Path]) – Path to the directory representing the dataset.
- Returns:
The loaded dataset.
- Return type:
Dataset
- number_molecules(molecule: str) → int
The number of molecules for a given molecule type.
- Parameters:
molecule (str) – The molecule type to get the number of molecules for (e.g. protein, peptide …)
- Returns:
The number of molecules.
- Return type:
int
- rename_columns(columns: Dict[str, Dict[str, str]], inplace: bool = False) → Dataset | None
Rename one or several value columns.
- Parameters:
columns (Dict[str, Dict[str, str]]) – A dictionary mapping old to new column names for every molecule type (protein, peptide etc.)
inplace (bool, optional) – Whether to perform the operation inplace or return a copy. Defaults to False.
- Returns:
A copy of the dataset with the renamed columns if inplace is False, otherwise None.
- Return type:
Optional[Dataset]
- rename_mapping(mapping: str, new_name: str)
Rename a mapping.
- Parameters:
mapping (str) – The old name of the mapping.
new_name (str) – The new name of the mapping.
- rename_molecule(molecule: str, new_name: str)
Rename a molecule type.
- Parameters:
molecule (str) – The current name.
new_name (str) – The new name.
- Raises:
KeyError – Raised when the new name already exists.
- rename_values(columns: Dict[str, str], molecules: List[str] | None = None, inplace: bool = False)
Similar to rename_columns but uses the same mapping for all molecule types.
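The relation between the two argument shapes can be sketched in plain Python. That `rename_values` expands its flat mapping to every molecule type in this way is inferred from the description above, not taken from the implementation; the molecule and column names are hypothetical.

```python
# Argument shape for rename_columns: one old-to-new mapping per molecule type.
per_molecule = {
    "protein": {"abundance": "raw_abundance"},
    "peptide": {"abundance": "raw_abundance"},
}

# Argument shape for rename_values: a single old-to-new mapping ...
flat = {"abundance": "raw_abundance"}
molecules = ["protein", "peptide"]

# ... which is equivalent to repeating that mapping for every molecule type.
expanded = {molecule: dict(flat) for molecule in molecules}
```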
- sample_apply(fn: Callable, *args, **kwargs)
Apply a function to every dataset sample.
- Parameters:
fn (Callable) – The function to apply.
- Returns:
The transformed dataset.
- Return type:
Dataset
- save(dir_path: str | Path, overwrite: bool = False)
Saves the dataset to disk as a directory containing .h5 files for the samples and a .h5 file for the molecule set.
- Parameters:
dir_path (Union[str, Path]) – Directory path to save the dataset to.
overwrite (bool, optional) – Whether to overwrite any existing data. Defaults to False.
- Raises:
FileExistsError – Raised if the directory already exists and overwrite is False.
- set_column_lf(molecule: str, values: Series | int | float, column: str | None = None, skip_foreign_ids: bool = False, fill_missing: bool = False)
Sets values from a pandas Series which has a MultiIndex with the levels “id” and “sample”.
- Parameters:
molecule (str) – The molecule type to set the values for.
values (Union[pd.Series, int, float]) – The values to set (must either be a pandas Series with a MultiIndex containing the levels “id” and “sample” or a single value).
column (Optional[str], optional) – If given this column name is used otherwise the name of the Series is used as column name. Defaults to None.
- set_wf(matrix: DataFrame, molecule: str, column: str = 'abundance', create_samples_if_not_exists: bool = False)
Sets a dataframe in wide format (molecule ids as index, sample names as columns) for the values of the given value column for the given molecule type.
- Parameters:
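matrix (pd.DataFrame) – The dataframe in wide format (molecule ids as index, sample names as columns) holding the values to set.
molecule (str) – The molecule type (e.g. protein, peptide …)
column (str, optional) – The name of the value column to store the values in. Defaults to “abundance”.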
create_samples_if_not_exists (bool, optional) – Whether to create new samples if they do not exist. Defaults to False.
- Returns:
the resulting dataframe
- Return type:
pd.DataFrame
- to_dgl_graph(feature_columns: Dict[str, str | List[str]], mappings: str | List[str], molecule_columns: Dict[str, str | List[str]] = {}, mapping_directions: Dict[str, Tuple[str, str]] = {}, make_bidirectional: bool = False, features_to_float32: bool = True, samples: List[str] | None = None) → dgl.DGLHeteroGraph
Transform the dataset into a dgl graph.
- Parameters:
feature_columns (Dict[str, Union[str, List[str]]]) – Value columns to include as features for the nodes of the graph.
mappings (Union[str, List[str]]) – Names of the mappings to use for the edges of the graph.
mapping_directions (Dict[str, Tuple[str, str]], optional) – Specifies the direction of edges between molecule types. Defaults to {}.
make_bidirectional (bool, optional) – Whether to make the graph edges bidirectional. Defaults to False.
features_to_float32 (bool, optional) – Cast all feature values to float32. Defaults to True.
samples (Optional[List[str]], optional) – The names of the samples to include in the graph. If not given all samples are included. Defaults to None.
- Raises:
KeyError – Raised if feature columns with the reserved names ‘hidden’ and ‘mask’ are specified.
- Returns:
the created graph
- Return type:
dgl.DGLHeteroGraph
- write_tsvs(output_dir: Path, molecules: List[str] = ['protein', 'peptide'], columns: List[str] = ['abundance'], molecule_columns: bool | List[str] = [], index_names: List[str] | None = None, na_rep='NA')
Write .tsv files for the given molecules and columns to the given directory.
- Parameters:
output_dir (Path) – The output directory path.
molecules (List[str], optional) – The molecules whose columns should be written to .tsv files. Defaults to [“protein”, “peptide”].
columns (List[str], optional) – The columns to write. Every column produces a .tsv file with the column values for every sample. Defaults to [“abundance”].
molecule_columns (Union[bool, List[str]], optional) – Any columns from the MoleculeSet to add to the .tsv files. Defaults to [].
index_names (Optional[List[str]], optional) – How to name the index columns in the .tsv files. Defaults to None.
na_rep (str, optional) – How to represent missing (NaN) values in the .tsv files. Defaults to “NA”.
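The kind of file this produces can be approximated with `pandas.DataFrame.to_csv`. This is a sketch of the output format only: the ids and sample names are hypothetical, and whether `write_tsvs` uses `to_csv` internally is an assumption.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical wide-format abundance matrix with one missing value.
wide = pd.DataFrame(
    {"sample1": [1.0, np.nan], "sample2": [3.0, 4.0]},
    index=pd.Index(["P1", "P2"], name="id"),
)

# Write tab-separated text into an in-memory buffer instead of a file,
# representing missing values as "NA".
buffer = io.StringIO()
wide.to_csv(buffer, sep="\t", na_rep="NA")
tsv_text = buffer.getvalue()
```

The first line of `tsv_text` is the header (`id`, `sample1`, `sample2`), followed by one line per molecule id.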