Getting Started

Molecules, Graphs, and MoleculeSets

At the heart of PyProteoNet are types of molecules (like proteins and peptides) and connections between those molecules. Even though for PyProteoNet molecules are just nodes in a graph and molecule types are just strings we will focus on protein and peptide molecules as used during many mass spectrometry (MS) experiments in the field of proteomics. In addition, most functions of PyProteonNet are currenlty focused on proteins and peptides aggregation and imputation.

Ultimately most experiments want to measure protein abundances. However for most MS experiments proteins are digested into peptides because only those smaller peptides are measured during an MS-experiment. So we have two kind of molecules (proteins and peptides) plus a mapping between them because every peptide can be mapped to at least one protein which the peptide can result from during digestion.

In PyProteoNet such a group of different types of molecules together with mappings between those molecule types is represented by a MoleculeSet. So lets import this:

from pyproteonet.data import MoleculeSet

Next we need some data to create a MoleculeSet. For simplicity, we do not use real data but come up with a simple toy examples of 10 proteins and 100 peptides which we identify by integers.

import pandas as pd

proteins = pd.DataFrame(index=range(10))
peptides = pd.DataFrame(index=range(100))

As you can see proteins and peptides are represented by pandas dataframes. The index of the dataframes will be used as identifiers for our molecules. With real-world data it might make sense to use the protein name or peptide sequence as index but here we just use integers. Our dataframe could also have additional columns storing other molecule attributes, however, those are not required.

The only thing missing is a mapping between proteins and peptides. Mappings are also created from pandas dataframes. To identify mapping partners those dataframes must have a multiindex with every index level containing molecules ids of the mapped molecules.

Here we just map every 10th peptide to the same protein.

peptide_protein_mapping = pd.DataFrame({'peptide':peptides.index, 'protein':peptides.index%10}).set_index(['peptide', 'protein'])

peptide_protein_mapping


peptide	protein
0	0
1	1
2	2
3	3
4	4
...	...
95	5
96	6
97	7
98	8
99	9

100 rows × 0 columns

Side Note: Internally mappings are wrapped by the MoleculeMapping class. This wrapper class can hold some additional information and facilitates e.g. mappings between the same molecule type. However, for most simple use cases like the protein-peptide use case this can be ignored.

From this data we can now generate a MoleculeSet. To support arbitratry molecule types and multiple mappings, molecules and mappings need to be given as dictionaries.

ms = MoleculeSet(molecules = {'protein':proteins, 'peptide':peptides},
                 mappings = {'peptide-protein': peptide_protein_mapping}
                )

Samples and Values

The MoleculeSet alone is not that helpful. Usually, we also want to attach (abundance) values to our molecules. With MS experiments we usually even have multiple samples measuring the same value multiple times. For this PyProteoNet provides Datasets. A Dataset consists of a molecule graph and a variable number of DatasetSamples each representing values of one sample. We can create a Dataset without any samples as follows:

from pyproteonet.data import Dataset
ds = Dataset(molecule_set=ms)

Next we add some samples to the dataset. Every sample is identified by a name and contains dataframes with values for our different molecule types. E.g. lets assume we measured some abundance values for our peptides which we want to add.

import numpy as np
for i in range(3):
    sample_name = f'sample{i}'
    peptide_values = pd.DataFrame({'abundance': np.random.uniform(size=100) * 10000}, index=range(100))
    sample_molecule_values = {'peptide':peptide_values}
    ds.create_sample(name=sample_name, values=sample_molecule_values)

Again we give the sample values as a dictionary to assign them to the correct molecule type.

Creating a Dataset Directly from Pandas Dataframes

Alternatively, we can just create a dataset from abundance matrices (for proteins and peptides) given as pandas DataFrames and mappings

peptide_abundance = pd.DataFrame(np.random.uniform(size=(100, 10)) * 10000, index=range(100), columns=[f'sample{i}' for i in range(10)])
peptide_abundance

	sample0	sample1	sample2	sample3	sample4	sample5	sample6	sample7	sample8	sample9
0	7,238.747	9,477.294	2,482.859	2,813.936	4,393.726	9,029.044	1,113.775	7,041.178	6,878.413	8,024.203
1	321.908	7,953.998	6,135.773	7,011.221	975.855	2,266.122	4,920.040	3,304.337	694.311	230.674
2	9,593.379	7,722.989	3,229.229	7,529.776	3,863.926	7,613.917	2,756.370	3,592.661	4,966.780	9,382.149
3	7,811.972	4,188.765	6,074.888	8,542.109	3,452.596	1,447.429	6,121.701	4,387.109	9,578.632	6,511.629
4	351.829	6,345.102	5,339.551	9,053.414	54.181	221.845	5,506.478	5,259.496	2,272.418	8,904.757
...	...	...	...	...	...	...	...	...	...	...
95	765.926	1,498.311	6,617.097	7,533.247	9,910.481	5,687.050	7,992.420	1,737.802	1,155.065	7,174.543
96	9,243.786	5,982.204	4,186.412	4,564.316	7,313.234	1,794.702	7,797.560	8,233.446	8,213.435	3,201.958
97	4,453.980	4,693.264	1,459.200	5,572.061	4,018.447	2,316.927	2,034.111	7,180.904	7,644.204	6,497.266
98	2,473.045	1,036.253	283.278	5,436.428	608.993	3,627.040	762.640	7,235.080	1,142.041	347.414
99	3,092.429	8,029.691	1,078.567	6,065.146	1,864.244	1,307.507	3,048.553	3,207.517	4,792.990	8,438.943

100 rows × 10 columns

ds = Dataset.from_pandas(dfs={'peptide':{'abundance':peptide_abundance}}, mappings={'peptide-protein': peptide_protein_mapping})

Protein Aggregation

Since in proteomics it is often worked with logarithmic abundance values, also the peptide abundances of our artifical dataset should first be logarithmized.

This also helps to understand how to access and modify dataset values in PyProteoNet. One convinient way shown here is to access a single value field or column as padas DataFrame in long format, containing all abundance values with their sample and protein id as multi index. Logarithmization can then be done as shown below, saving the logarithmized values under a new column.

ds.values['peptide']['abundance_log'] = np.log(ds.values['peptide']['abundance'])

# as alternative PyProteoNet also provides a function to logarithmize whole datasets
from pyproteonet.processing import logarithmize
ds_log = logarithmize(ds)

Usually we are interested in protein abundances. Therefore, the measured peptide abundance values need to be aggregated into protein abundance values. This is done via the peptide-protein mapping using an aggregation function (also called quantification function).

Here we apply Top3 aggregation as a simple but commonly used aggregation function. This function computes protein abundance from the average of the three most abundant peptides corresponding to a protein. In real-world datasets some peptides are usually shared between different proteins. Since their abundance values cannot be uniquly assigned to a protein, shared peptides are often ignored during abundane aggregation and only unique peptides are considered.

The result is represented as a pandas Series in long format with a multiindex to identify samples and protein ids.

from pyproteonet.aggregation.partner_summarization import partner_top_n_mean
top3 = partner_top_n_mean(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance_log',
                          top_n=3, only_unique=True)

top3

sample   id
sample0  0    8.973
         1    8.965
         2    9.147
         3    9.063
         4    8.800
               ... 
sample9  5    8.874
         6    8.961
         7    9.136
         8    9.041
         9    9.055
Name: quanti, Length: 100, dtype: float64

To assign this to our dataset the following syntax can be used:

ds.values['protein']['top3'] = top3

Alternatively, most functions also allow the direct specification of a result column. So an alternative formulation of the Top3 aggregation could be as follows:

_ = partner_top_n_mean(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance_log',
                       top_n=3, only_unique=True, result_column='top3')

The long format of the Top3 aggregated protein abundance values is little intuitive and a matrix representation is often used instead. Therefore, the Dataset class provides functions to represent data in different formats allowing. To get the Top3 results as a pandas DataFrame with samples as columns ans proteins as rows the get_samples_value_matrix function can be useful:

ds.get_samples_value_matrix(molecule='protein', column='top3')

	sample0	sample1	sample2	sample3	sample4	sample5	sample6	sample7	sample8	sample9
id
0	8.973	9.051	8.868	9.093	9.108	9.031	9.020	8.917	8.922	9.038
1	8.965	8.975	9.110	9.090	9.069	9.056	8.983	8.702	9.082	9.050
2	9.147	8.979	8.948	9.096	9.115	9.049	8.966	8.962	9.068	9.124
3	9.063	8.585	9.034	8.866	8.872	9.092	9.087	9.099	9.138	8.963
4	8.800	9.119	9.061	9.122	8.678	9.083	8.854	9.020	9.081	8.954
5	8.937	9.034	8.908	9.078	9.030	8.701	9.137	9.002	8.971	8.874
6	9.023	8.764	9.097	8.998	8.884	9.063	9.087	9.068	8.967	8.961
7	9.057	9.023	9.053	8.993	8.863	8.994	9.146	9.021	9.083	9.136
8	8.969	8.928	9.107	9.113	9.099	9.030	9.129	9.017	8.582	9.041
9	8.820	9.059	9.174	8.972	8.991	9.035	8.928	8.965	8.845	9.055

Next to the rather simple neighbor average (Top3), PyProteoNet also provides an efficient implementation of the more complex MaxLFQ protein aggregation method propose by Cox et al..

This methods takes peptide abundance ratios between all samples into account and then solves a least squares optimization probel to find protein abundance values best representing the observed peptide abundances. Similar to Cox et al. we here require at least two non missing peptide abundances. Additionally, we need to specify that the given values are already logarithmic to allows the correct calculation of peptide ratios between samples.

from pyproteonet.aggregation import maxlfq
_ = maxlfq(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance_log',
           min_ratios=2, median_fallback=False, result_column='maxlfq', is_log=True)

Next to the matrix representation we can also look at all columns of a molecule using its long format representation as pandas DataFrame with multiindex. To do so we can again use the values attribute of our dataset. Using the df shortcut, we get a DataFrame of all columns for a molecule type in long format.

ds.values['protein'].df

		top3	maxlfq
sample	id
sample0	0	8.973	8.369
	1	8.965	7.666
	2	9.147	7.573
	3	9.063	8.418
	4	8.800	8.036
...	...	...	...
sample9	5	8.874	8.315
	6	8.961	7.841
	7	9.136	8.543
	8	9.041	8.223
	9	9.055	8.188

100 rows × 2 columns

Missing Value Imputation

Pyproteonet provides a wide range of established as well as newly proposed, graph neural network (GNN) based missing value imputation functions.

The interface for most imputation functions is similar to this of the aggregation functions shown above. Next to a dataset you need to provide the molecole type as well as the column(s) to impute and, optionally, values for method specific hyperparameters.

Of course, for imputation, we first of all need some missing values. So for our example we just mask some of the Top3 values. To do so we, again, use the values attribute of our dataset to get all Top3 values as pandas DataFrame in long format. Then, we replace some of the values with Na using pandas and, finally, we write the result back as a new column in our dataset

vals = ds.values['protein']['top3']
vals.loc[vals.sample(frac=0.33).index] = np.nan
ds.values['protein']['top3_masked'] = vals

ds.values['protein'].df

		top3	maxlfq	top3_masked
sample	id
sample0	0	8.973	8.369	8.973
	1	8.965	7.666	NaN
	2	9.147	7.573	9.147
	3	9.063	8.418	NaN
	4	8.800	8.036	NaN
...	...	...	...	...
sample9	5	8.874	8.315	8.874
	6	8.961	7.841	8.961
	7	9.136	8.543	NaN
	8	9.041	8.223	9.041
	9	9.055	8.188	9.055

100 rows × 3 columns

Let’s use the commonly used MissForrest as well as BPCA imputation methods to impute missing values. PyProteoNet provides a high level api for common imputation function operating only on a single molecule type (e.g. on protein level).

from pyproteonet.imputation.high_level_api import impute_molecule

impute_molecule(dataset=ds, molecule='protein', column='top3_masked', methods=['missforest', 'bpca'],
                result_columns=['top3_missforest', 'top3_bpca'])

Imputing with methods missforest, storig results in value column top3_missforest
Iteration: 0
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Imputing with methods bpca, storig results in value column top3_bpca

Alternatively, PyProteoNet provides seperate functions for all imputation methods, allowing for example specifying additional argumets

from pyproteonet.imputation.r.miss_forest import miss_forest_impute

ds.values['protein']['top3_missforest'] = miss_forest_impute(dataset=ds, molecule='protein', column='top3_masked', ntree=5)

Looking at the result we can see that the missing values are gone:

ds.values['protein'].df

		top3	maxlfq	top3_masked	top3_missforest	top3_bpca
sample	id
sample0	0	8.973	8.369	8.973	8.973	8.973
	1	8.965	7.666	NaN	8.992	9.034
	2	9.147	7.573	9.147	9.147	9.147
	3	9.063	8.418	NaN	9.046	9.034
	4	8.800	8.036	NaN	9.073	9.034
...	...	...	...	...	...	...
sample9	5	8.874	8.315	8.874	8.874	8.874
	6	8.961	7.841	8.961	8.961	8.961
	7	9.136	8.543	NaN	9.050	9.008
	8	9.041	8.223	9.041	9.041	9.041
	9	9.055	8.188	9.055	9.055	9.055

100 rows × 5 columns

If you look at the import of the impute_miss_forest function you will notice that this function is part of the “r” subpackage. Most established imputation algorithms are implemented in the R programming language and provided as R packages. To also provide those algorithms while maintaining a unified Python interfact PyProteoNet wraps those R packages. Therefore, all algorithms in the “r” subpackage require an existing R installation (if you use a conda/mamba environment you could simply install one with the command conda install -c conda-forge r-base). For the user it is transparent whether an imputation function is implemented in Python or wrapped from an R package. All installation of R dependencies and conversion of data types between Python and R is done in the background by PyProteoNet (internally the rpy2 package is used for this).

Note: Compared to other proteomics- or imputation-focused packages PyProteoNet allows to jointly manage protein and peptide values as well as the relation between them (plus any other additional molecule types if required). This together with the implemented aggragation and imputation functions allows for a more versatile usage scenarios. E.g. imputation can be applied both on peptide level (before aggregation) as well as on protein level (after aggregation). In addition, the application and comparision of different imputation algorithmis and stratigies on the same dataset is facilitated

Graph Neural Network Imputation

While traditionally, imputation is either applied on peptide OR on protein level modelling protein and peptides a graph structure allows for flexilbe imputation strategies jointly taking information from both molecules into account. Therefore, imputation is formulated as a regression problem on the protein-peptide graph which is then solved by training a graph neural network (GNN).

While PyProteoNet provides different flavors and implementations using different network architectures and training schemes for the underlying GNN, those imputation methods can be called via a similar interface as other imputation methods. Since two types of molecules (proteins and peptides) are taken into account, the name of those molecule types as well as two value columns have to be specified.

Additional hyperparameters can be set aswell (here set to values used for real-world datasets)

from pyproteonet.imputation.dnn.gnn import impute_heterogeneous_gnn

# For the toy example with random data we use a small number of epochs and patience
impute_heterogeneous_gnn(dataset=ds, molecule='protein', column='top3_masked', mapping='peptide-protein', partner_column='abundance_log',
                         molecule_result_column=f'gnn_hetero', partner_result_column=f'gnn_hetero',
                         max_epochs=3, early_stopping_patience=3)

seed: 881061604

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type            | Params
------------------------------------------------------
0 | embedding         | Embedding       | 50    
1 | molecule_fc_model | Sequential      | 11.0 K
2 | partner_fc_model  | Sequential      | 11.4 K
3 | molecule_gat      | HeteroGraphConv | 34.4 K
4 | partner_gat       | HeteroGraphConv | 50.4 K
5 | molecule_gat2     | HeteroGraphConv | 66.4 K
6 | molecule_linear   | Linear          | 820   
7 | partner_linear    | Linear          | 1.2 K 
8 | loss_fn           | GaussianNLLLoss | 0     
------------------------------------------------------
175 K     Trainable params
0         Non-trainable params
175 K     Total params
0.703     Total estimated model params size (MB)

step29: num_masked_molecule:67.000 || num_masked_partner:239.167 || molecule_loss:0.466 || partner_loss:0.512 || train_loss:0.978 || epoch:0.000 || 
step59: num_masked_molecule:67.000 || num_masked_partner:235.867 || molecule_loss:0.385 || partner_loss:0.470 || train_loss:0.855 || epoch:1.000 || 

`Trainer.fit` stopped: `max_epochs=3` reached.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

step89: num_masked_molecule:67.000 || num_masked_partner:249.233 || molecule_loss:0.298 || partner_loss:0.452 || train_loss:0.749 || epoch:2.000 || 

sample   id
sample0  0    8.973
sample1  0    9.051
sample2  0    9.041
sample3  0    9.093
sample4  0    9.108
               ... 
sample5  9    9.035
sample6  9    8.928
sample7  9    8.958
sample8  9    8.994
sample9  9    9.055
Length: 100, dtype: float64

ds.values['protein'].df

		top3	maxlfq	top3_masked	top3_missforest	top3_bpca	gnn_hetero
sample	id
sample0	0	8.973	8.369	8.973	8.973	8.973	8.973
	1	8.965	7.666	NaN	8.992	9.034	9.029
	2	9.147	7.573	9.147	9.147	9.147	9.147
	3	9.063	8.418	NaN	9.046	9.034	9.029
	4	8.800	8.036	NaN	9.073	9.034	9.030
...	...	...	...	...	...	...	...
sample9	5	8.874	8.315	8.874	8.874	8.874	8.874
	6	8.961	7.841	8.961	8.961	8.961	8.961
	7	9.136	8.543	NaN	9.050	9.008	9.007
	8	9.041	8.223	9.041	9.041	9.041	9.041
	9	9.055	8.188	9.055	9.055	9.055	9.055

100 rows × 6 columns