Getting Started
Molecules, Graphs, and MoleculeSets
At the heart of PyProteoNet are types of molecules (like proteins and peptides) and connections between those molecules. Even though for PyProteoNet molecules are just nodes in a graph and molecule types are just strings we will focus on protein and peptide molecules as used during many mass spectrometry (MS) experiments in the field of proteomics. In addition, most functions of PyProteonNet are currenlty focused on proteins and peptides aggregation and imputation.
Ultimately most experiments want to measure protein abundances. However for most MS experiments proteins are digested into peptides because only those smaller peptides are measured during an MS-experiment. So we have two kind of molecules (proteins and peptides) plus a mapping between them because every peptide can be mapped to at least one protein which the peptide can result from during digestion.
In PyProteoNet such a group of different types of molecules together with mappings between those molecule types is represented by a MoleculeSet.
So lets import this:
from pyproteonet.data import MoleculeSet
Next we need some data to create a MoleculeSet. For simplicity, we do not use real data but come up with a simple toy examples of 10 proteins and 100 peptides which we identify by integers.
import pandas as pd
proteins = pd.DataFrame(index=range(10))
peptides = pd.DataFrame(index=range(100))
As you can see proteins and peptides are represented by pandas dataframes. The index of the dataframes will be used as identifiers for our molecules. With real-world data it might make sense to use the protein name or peptide sequence as index but here we just use integers. Our dataframe could also have additional columns storing other molecule attributes, however, those are not required.
The only thing missing is a mapping between proteins and peptides. Mappings are also created from pandas dataframes. To identify mapping partners those dataframes must have a multiindex with every index level containing molecules ids of the mapped molecules.
Here we just map every 10th peptide to the same protein.
peptide_protein_mapping = pd.DataFrame({'peptide':peptides.index, 'protein':peptides.index%10}).set_index(['peptide', 'protein'])
peptide_protein_mapping
| peptide | protein |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| ... | ... |
| 95 | 5 |
| 96 | 6 |
| 97 | 7 |
| 98 | 8 |
| 99 | 9 |
100 rows × 0 columns
Side Note: Internally mappings are wrapped by the
MoleculeMappingclass. This wrapper class can hold some additional information and facilitates e.g. mappings between the same molecule type. However, for most simple use cases like the protein-peptide use case this can be ignored.
From this data we can now generate a MoleculeSet. To support arbitratry molecule types and multiple mappings, molecules and mappings need to be given as dictionaries.
ms = MoleculeSet(molecules = {'protein':proteins, 'peptide':peptides},
mappings = {'peptide-protein': peptide_protein_mapping}
)
Samples and Values
The MoleculeSet alone is not that helpful. Usually, we also want to attach (abundance) values to our molecules.
With MS experiments we usually even have multiple samples measuring the same value multiple times. For this PyProteoNet provides
Datasets. A Dataset consists of a molecule graph and a variable number of DatasetSamples each representing values of one sample.
We can create a Dataset without any samples as follows:
from pyproteonet.data import Dataset
ds = Dataset(molecule_set=ms)
Next we add some samples to the dataset. Every sample is identified by a name and contains dataframes with values for our different molecule types. E.g. lets assume we measured some abundance values for our peptides which we want to add.
import numpy as np
for i in range(3):
sample_name = f'sample{i}'
peptide_values = pd.DataFrame({'abundance': np.random.uniform(size=100) * 10000}, index=range(100))
sample_molecule_values = {'peptide':peptide_values}
ds.create_sample(name=sample_name, values=sample_molecule_values)
Again we give the sample values as a dictionary to assign them to the correct molecule type.
Creating a Dataset Directly from Pandas Dataframes
Alternatively, we can just create a dataset from abundance matrices (for proteins and peptides) given as pandas DataFrames and mappings
peptide_abundance = pd.DataFrame(np.random.uniform(size=(100, 10)) * 10000, index=range(100), columns=[f'sample{i}' for i in range(10)])
peptide_abundance
| sample0 | sample1 | sample2 | sample3 | sample4 | sample5 | sample6 | sample7 | sample8 | sample9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7,238.747 | 9,477.294 | 2,482.859 | 2,813.936 | 4,393.726 | 9,029.044 | 1,113.775 | 7,041.178 | 6,878.413 | 8,024.203 |
| 1 | 321.908 | 7,953.998 | 6,135.773 | 7,011.221 | 975.855 | 2,266.122 | 4,920.040 | 3,304.337 | 694.311 | 230.674 |
| 2 | 9,593.379 | 7,722.989 | 3,229.229 | 7,529.776 | 3,863.926 | 7,613.917 | 2,756.370 | 3,592.661 | 4,966.780 | 9,382.149 |
| 3 | 7,811.972 | 4,188.765 | 6,074.888 | 8,542.109 | 3,452.596 | 1,447.429 | 6,121.701 | 4,387.109 | 9,578.632 | 6,511.629 |
| 4 | 351.829 | 6,345.102 | 5,339.551 | 9,053.414 | 54.181 | 221.845 | 5,506.478 | 5,259.496 | 2,272.418 | 8,904.757 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 765.926 | 1,498.311 | 6,617.097 | 7,533.247 | 9,910.481 | 5,687.050 | 7,992.420 | 1,737.802 | 1,155.065 | 7,174.543 |
| 96 | 9,243.786 | 5,982.204 | 4,186.412 | 4,564.316 | 7,313.234 | 1,794.702 | 7,797.560 | 8,233.446 | 8,213.435 | 3,201.958 |
| 97 | 4,453.980 | 4,693.264 | 1,459.200 | 5,572.061 | 4,018.447 | 2,316.927 | 2,034.111 | 7,180.904 | 7,644.204 | 6,497.266 |
| 98 | 2,473.045 | 1,036.253 | 283.278 | 5,436.428 | 608.993 | 3,627.040 | 762.640 | 7,235.080 | 1,142.041 | 347.414 |
| 99 | 3,092.429 | 8,029.691 | 1,078.567 | 6,065.146 | 1,864.244 | 1,307.507 | 3,048.553 | 3,207.517 | 4,792.990 | 8,438.943 |
100 rows × 10 columns
ds = Dataset.from_pandas(dfs={'peptide':{'abundance':peptide_abundance}}, mappings={'peptide-protein': peptide_protein_mapping})
Protein Aggregation
Since in proteomics it is often worked with logarithmic abundance values, also the peptide abundances of our artifical dataset should first be logarithmized.
This also helps to understand how to access and modify dataset values in PyProteoNet. One convinient way shown here is to access a single value field or column as padas DataFrame in long format, containing all abundance values with their sample and protein id as multi index. Logarithmization can then be done as shown below, saving the logarithmized values under a new column.
ds.values['peptide']['abundance_log'] = np.log(ds.values['peptide']['abundance'])
# as alternative PyProteoNet also provides a function to logarithmize whole datasets
from pyproteonet.processing import logarithmize
ds_log = logarithmize(ds)
Usually we are interested in protein abundances. Therefore, the measured peptide abundance values need to be aggregated into protein abundance values. This is done via the peptide-protein mapping using an aggregation function (also called quantification function).
Here we apply Top3 aggregation as a simple but commonly used aggregation function. This function computes protein abundance from the average of the three most abundant peptides corresponding to a protein. In real-world datasets some peptides are usually shared between different proteins. Since their abundance values cannot be uniquly assigned to a protein, shared peptides are often ignored during abundane aggregation and only unique peptides are considered.
The result is represented as a pandas Series in long format with a multiindex to identify samples and protein ids.
from pyproteonet.aggregation.partner_summarization import partner_top_n_mean
top3 = partner_top_n_mean(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance_log',
top_n=3, only_unique=True)
top3
sample id
sample0 0 8.973
1 8.965
2 9.147
3 9.063
4 8.800
...
sample9 5 8.874
6 8.961
7 9.136
8 9.041
9 9.055
Name: quanti, Length: 100, dtype: float64
To assign this to our dataset the following syntax can be used:
ds.values['protein']['top3'] = top3
Alternatively, most functions also allow the direct specification of a result column. So an alternative formulation of the Top3 aggregation could be as follows:
_ = partner_top_n_mean(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance_log',
top_n=3, only_unique=True, result_column='top3')
The long format of the Top3 aggregated protein abundance values is little intuitive and a matrix representation is often used instead. Therefore, the Dataset class provides functions to represent data in different formats allowing. To get the Top3 results as a pandas DataFrame with samples as columns ans proteins as rows the get_samples_value_matrix function can be useful:
ds.get_samples_value_matrix(molecule='protein', column='top3')
| sample0 | sample1 | sample2 | sample3 | sample4 | sample5 | sample6 | sample7 | sample8 | sample9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| id | ||||||||||
| 0 | 8.973 | 9.051 | 8.868 | 9.093 | 9.108 | 9.031 | 9.020 | 8.917 | 8.922 | 9.038 |
| 1 | 8.965 | 8.975 | 9.110 | 9.090 | 9.069 | 9.056 | 8.983 | 8.702 | 9.082 | 9.050 |
| 2 | 9.147 | 8.979 | 8.948 | 9.096 | 9.115 | 9.049 | 8.966 | 8.962 | 9.068 | 9.124 |
| 3 | 9.063 | 8.585 | 9.034 | 8.866 | 8.872 | 9.092 | 9.087 | 9.099 | 9.138 | 8.963 |
| 4 | 8.800 | 9.119 | 9.061 | 9.122 | 8.678 | 9.083 | 8.854 | 9.020 | 9.081 | 8.954 |
| 5 | 8.937 | 9.034 | 8.908 | 9.078 | 9.030 | 8.701 | 9.137 | 9.002 | 8.971 | 8.874 |
| 6 | 9.023 | 8.764 | 9.097 | 8.998 | 8.884 | 9.063 | 9.087 | 9.068 | 8.967 | 8.961 |
| 7 | 9.057 | 9.023 | 9.053 | 8.993 | 8.863 | 8.994 | 9.146 | 9.021 | 9.083 | 9.136 |
| 8 | 8.969 | 8.928 | 9.107 | 9.113 | 9.099 | 9.030 | 9.129 | 9.017 | 8.582 | 9.041 |
| 9 | 8.820 | 9.059 | 9.174 | 8.972 | 8.991 | 9.035 | 8.928 | 8.965 | 8.845 | 9.055 |
Next to the rather simple neighbor average (Top3), PyProteoNet also provides an efficient implementation of the more complex MaxLFQ protein aggregation method propose by Cox et al..
This methods takes peptide abundance ratios between all samples into account and then solves a least squares optimization probel to find protein abundance values best representing the observed peptide abundances. Similar to Cox et al. we here require at least two non missing peptide abundances. Additionally, we need to specify that the given values are already logarithmic to allows the correct calculation of peptide ratios between samples.
from pyproteonet.aggregation import maxlfq
_ = maxlfq(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance_log',
min_ratios=2, median_fallback=False, result_column='maxlfq', is_log=True)
Next to the matrix representation we can also look at all columns of a molecule using its long format representation as pandas DataFrame with multiindex. To do so we can again use the values attribute of our dataset. Using the df shortcut, we get a DataFrame of all columns for a molecule type in long format.
ds.values['protein'].df
| top3 | maxlfq | ||
|---|---|---|---|
| sample | id | ||
| sample0 | 0 | 8.973 | 8.369 |
| 1 | 8.965 | 7.666 | |
| 2 | 9.147 | 7.573 | |
| 3 | 9.063 | 8.418 | |
| 4 | 8.800 | 8.036 | |
| ... | ... | ... | ... |
| sample9 | 5 | 8.874 | 8.315 |
| 6 | 8.961 | 7.841 | |
| 7 | 9.136 | 8.543 | |
| 8 | 9.041 | 8.223 | |
| 9 | 9.055 | 8.188 |
100 rows × 2 columns
Missing Value Imputation
Pyproteonet provides a wide range of established as well as newly proposed, graph neural network (GNN) based missing value imputation functions.
The interface for most imputation functions is similar to this of the aggregation functions shown above. Next to a dataset you need to provide the molecole type as well as the column(s) to impute and, optionally, values for method specific hyperparameters.
Of course, for imputation, we first of all need some missing values. So for our example we just mask some of the Top3 values. To do so we, again, use the values attribute of our dataset to get all Top3 values as pandas DataFrame in long format. Then, we replace some of the values with Na using pandas and, finally, we write the result back as a new column in our dataset
vals = ds.values['protein']['top3']
vals.loc[vals.sample(frac=0.33).index] = np.nan
ds.values['protein']['top3_masked'] = vals
ds.values['protein'].df
| top3 | maxlfq | top3_masked | ||
|---|---|---|---|---|
| sample | id | |||
| sample0 | 0 | 8.973 | 8.369 | 8.973 |
| 1 | 8.965 | 7.666 | NaN | |
| 2 | 9.147 | 7.573 | 9.147 | |
| 3 | 9.063 | 8.418 | NaN | |
| 4 | 8.800 | 8.036 | NaN | |
| ... | ... | ... | ... | ... |
| sample9 | 5 | 8.874 | 8.315 | 8.874 |
| 6 | 8.961 | 7.841 | 8.961 | |
| 7 | 9.136 | 8.543 | NaN | |
| 8 | 9.041 | 8.223 | 9.041 | |
| 9 | 9.055 | 8.188 | 9.055 |
100 rows × 3 columns
Let’s use the commonly used MissForrest as well as BPCA imputation methods to impute missing values. PyProteoNet provides a high level api for common imputation function operating only on a single molecule type (e.g. on protein level).
from pyproteonet.imputation.high_level_api import impute_molecule
impute_molecule(dataset=ds, molecule='protein', column='top3_masked', methods=['missforest', 'bpca'],
result_columns=['top3_missforest', 'top3_bpca'])
Imputing with methods missforest, storig results in value column top3_missforest
Iteration: 0
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Imputing with methods bpca, storig results in value column top3_bpca
Alternatively, PyProteoNet provides seperate functions for all imputation methods, allowing for example specifying additional argumets
from pyproteonet.imputation.r.miss_forest import miss_forest_impute
ds.values['protein']['top3_missforest'] = miss_forest_impute(dataset=ds, molecule='protein', column='top3_masked', ntree=5)
Looking at the result we can see that the missing values are gone:
ds.values['protein'].df
| top3 | maxlfq | top3_masked | top3_missforest | top3_bpca | ||
|---|---|---|---|---|---|---|
| sample | id | |||||
| sample0 | 0 | 8.973 | 8.369 | 8.973 | 8.973 | 8.973 |
| 1 | 8.965 | 7.666 | NaN | 8.992 | 9.034 | |
| 2 | 9.147 | 7.573 | 9.147 | 9.147 | 9.147 | |
| 3 | 9.063 | 8.418 | NaN | 9.046 | 9.034 | |
| 4 | 8.800 | 8.036 | NaN | 9.073 | 9.034 | |
| ... | ... | ... | ... | ... | ... | ... |
| sample9 | 5 | 8.874 | 8.315 | 8.874 | 8.874 | 8.874 |
| 6 | 8.961 | 7.841 | 8.961 | 8.961 | 8.961 | |
| 7 | 9.136 | 8.543 | NaN | 9.050 | 9.008 | |
| 8 | 9.041 | 8.223 | 9.041 | 9.041 | 9.041 | |
| 9 | 9.055 | 8.188 | 9.055 | 9.055 | 9.055 |
100 rows × 5 columns
If you look at the import of the impute_miss_forest function you will notice that this function is part of the “r” subpackage.
Most established imputation algorithms are implemented in the R programming language and provided as R packages. To also provide those algorithms while maintaining a unified Python interfact PyProteoNet wraps those R packages.
Therefore, all algorithms in the “r” subpackage require an existing R installation (if you use a conda/mamba environment you could simply install one with the command conda install -c conda-forge r-base). For the user it is transparent whether an imputation function is implemented in Python or wrapped from an R package. All installation of R dependencies and conversion of data types between Python and R is done in the background by PyProteoNet (internally the rpy2 package is used for this).
Note: Compared to other proteomics- or imputation-focused packages PyProteoNet allows to jointly manage protein and peptide values as well as the relation between them (plus any other additional molecule types if required). This together with the implemented aggragation and imputation functions allows for a more versatile usage scenarios. E.g. imputation can be applied both on peptide level (before aggregation) as well as on protein level (after aggregation). In addition, the application and comparision of different imputation algorithmis and stratigies on the same dataset is facilitated
Graph Neural Network Imputation
While traditionally, imputation is either applied on peptide OR on protein level modelling protein and peptides a graph structure allows for flexilbe imputation strategies jointly taking information from both molecules into account. Therefore, imputation is formulated as a regression problem on the protein-peptide graph which is then solved by training a graph neural network (GNN).
While PyProteoNet provides different flavors and implementations using different network architectures and training schemes for the underlying GNN, those imputation methods can be called via a similar interface as other imputation methods. Since two types of molecules (proteins and peptides) are taken into account, the name of those molecule types as well as two value columns have to be specified.
Additional hyperparameters can be set aswell (here set to values used for real-world datasets)
from pyproteonet.imputation.dnn.gnn import impute_heterogeneous_gnn
# For the toy example with random data we use a small number of epochs and patience
impute_heterogeneous_gnn(dataset=ds, molecule='protein', column='top3_masked', mapping='peptide-protein', partner_column='abundance_log',
molecule_result_column=f'gnn_hetero', partner_result_column=f'gnn_hetero',
max_epochs=3, early_stopping_patience=3)
seed: 881061604
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
------------------------------------------------------
0 | embedding | Embedding | 50
1 | molecule_fc_model | Sequential | 11.0 K
2 | partner_fc_model | Sequential | 11.4 K
3 | molecule_gat | HeteroGraphConv | 34.4 K
4 | partner_gat | HeteroGraphConv | 50.4 K
5 | molecule_gat2 | HeteroGraphConv | 66.4 K
6 | molecule_linear | Linear | 820
7 | partner_linear | Linear | 1.2 K
8 | loss_fn | GaussianNLLLoss | 0
------------------------------------------------------
175 K Trainable params
0 Non-trainable params
175 K Total params
0.703 Total estimated model params size (MB)
step29: num_masked_molecule:67.000 || num_masked_partner:239.167 || molecule_loss:0.466 || partner_loss:0.512 || train_loss:0.978 || epoch:0.000 ||
step59: num_masked_molecule:67.000 || num_masked_partner:235.867 || molecule_loss:0.385 || partner_loss:0.470 || train_loss:0.855 || epoch:1.000 ||
`Trainer.fit` stopped: `max_epochs=3` reached.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
step89: num_masked_molecule:67.000 || num_masked_partner:249.233 || molecule_loss:0.298 || partner_loss:0.452 || train_loss:0.749 || epoch:2.000 ||
sample id
sample0 0 8.973
sample1 0 9.051
sample2 0 9.041
sample3 0 9.093
sample4 0 9.108
...
sample5 9 9.035
sample6 9 8.928
sample7 9 8.958
sample8 9 8.994
sample9 9 9.055
Length: 100, dtype: float64
ds.values['protein'].df
| top3 | maxlfq | top3_masked | top3_missforest | top3_bpca | gnn_hetero | ||
|---|---|---|---|---|---|---|---|
| sample | id | ||||||
| sample0 | 0 | 8.973 | 8.369 | 8.973 | 8.973 | 8.973 | 8.973 |
| 1 | 8.965 | 7.666 | NaN | 8.992 | 9.034 | 9.029 | |
| 2 | 9.147 | 7.573 | 9.147 | 9.147 | 9.147 | 9.147 | |
| 3 | 9.063 | 8.418 | NaN | 9.046 | 9.034 | 9.029 | |
| 4 | 8.800 | 8.036 | NaN | 9.073 | 9.034 | 9.030 | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| sample9 | 5 | 8.874 | 8.315 | 8.874 | 8.874 | 8.874 | 8.874 |
| 6 | 8.961 | 7.841 | 8.961 | 8.961 | 8.961 | 8.961 | |
| 7 | 9.136 | 8.543 | NaN | 9.050 | 9.008 | 9.007 | |
| 8 | 9.041 | 8.223 | 9.041 | 9.041 | 9.041 | 9.041 | |
| 9 | 9.055 | 8.188 | 9.055 | 9.055 | 9.055 | 9.055 |
100 rows × 6 columns