Evaluating Imputation Against Ground Truth Abundance Values

%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import seaborn as sns

from pyproteonet.simulation import (
    molecule_set_from_degree_distribution,
    simulate_protein_peptide_dataset,
    simulate_mcars,
    simulate_mnars_thresholding,
)
from pyproteonet.aggregation import maxlfq
from pyproteonet.processing import logarithmize

Simulating a Dataset

# We define degree distributions that roughly resemble those of a real-world dataset
protein_deg_distribution = [0, 0.1445, 0.1221, 0.1151, 0.0933, 0.0692, 0.0655, 0.0508, 0.0472, 0.0362, 0.0311, 0.0277, 0.0209, 0.0199, 0.0163, 0.0143,
                            0.012, 0.0105, 0.0093, 0.0087, 0.0081, 0.0063, 0.0063, 0.0055, 0.0054, 0.0043, 0.0043, 0.0042, 0.0039, 0.0037, 0.0034,
                            0.0031, 0.0022, 0.0021, 0.0019, 0.0019, 0.0019, 0.0015, 0.0012, 0.001, 0.001]
peptide_deg_distribution = [0, 0.9591, 0.0341, 0.0046, 0.0014]

First, we create a set of proteins with related peptides. Next, we simulate abundance values for those peptides.

# We create a simulated dataset with 100 proteins and 10 samples
num_proteins = 100
num_samples = 10

# We use a simple heuristic to determine the number of peptides for the given number of proteins while still closely matching the degree distributions
protein_degs = np.round(num_proteins * np.array(protein_deg_distribution))
prot_edges = np.sum(np.arange(len(protein_deg_distribution)) * protein_degs)
num_peptides = 1
pep_edges = 0
while pep_edges < prot_edges:
    num_peptides += 1
    peptide_degs = np.round(num_peptides * np.array(peptide_deg_distribution))
    pep_edges = np.sum(np.arange(len(peptide_deg_distribution)) * peptide_degs)
if pep_edges > prot_edges:
    diff = pep_edges - prot_edges
    for i in range(len(peptide_degs)-1, -1, -1):
        if peptide_degs[i] > 0 and i <= diff:
            peptide_degs[i] -= 1
            diff -= i
        if diff == 0:
            break

# Create a protein peptide molecule set for the given number of proteins/peptides and degree distribution
ms = molecule_set_from_degree_distribution(molecule1_name='protein', molecule2_name='peptide', mapping_name='peptide-protein',
                                           molecule1_degree_distribution=protein_degs, molecule2_degree_distribution=peptide_degs)
# Let's simulate some abundance values for the given molecule set
ds = simulate_protein_peptide_dataset(molecule_set=ms, mapping='peptide-protein', samples=num_samples,
                                      log_abundance_mu=15.9, log_abundance_sigma=1.8,
                                      log_protein_error_sigma=0.3, peptide_noise_sigma=115005.3,
                                      flyability_alpha=0.7, flyability_beta=2.1, simulate_flyability=True)
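
The peptide-count heuristic above can be read as an edge-balancing problem: every protein-peptide link is one edge, so the degree sums on both sides have to agree. The following plain-Python restatement of the same loop is only a sketch for checking the resulting counts. The names `balance_peptide_degrees` and `edge_count` are hypothetical helpers, not part of PyProteoNet, and the sketch assumes that entry `i` of a distribution is the fraction of molecules with degree `i`, which is how the notebook code uses it.

```python
protein_deg_distribution = [0, 0.1445, 0.1221, 0.1151, 0.0933, 0.0692, 0.0655, 0.0508, 0.0472, 0.0362, 0.0311, 0.0277, 0.0209, 0.0199, 0.0163, 0.0143,
                            0.012, 0.0105, 0.0093, 0.0087, 0.0081, 0.0063, 0.0063, 0.0055, 0.0054, 0.0043, 0.0043, 0.0042, 0.0039, 0.0037, 0.0034,
                            0.0031, 0.0022, 0.0021, 0.0019, 0.0019, 0.0019, 0.0015, 0.0012, 0.001, 0.001]
peptide_deg_distribution = [0, 0.9591, 0.0341, 0.0046, 0.0014]


def edge_count(degree_counts):
    """Total number of edges implied by a list of molecule counts per degree."""
    return sum(degree * count for degree, count in enumerate(degree_counts))


def balance_peptide_degrees(protein_dist, peptide_dist, num_proteins):
    """Hypothetical helper mirroring the notebook's heuristic (not a library function)."""
    protein_degs = [round(num_proteins * p) for p in protein_dist]
    prot_edges = edge_count(protein_degs)

    # Grow the peptide set until it offers at least as many edges as the proteins need.
    num_peptides, pep_edges = 1, 0
    while pep_edges < prot_edges:
        num_peptides += 1
        peptide_degs = [round(num_peptides * p) for p in peptide_dist]
        pep_edges = edge_count(peptide_degs)

    # Trim surplus edges: remove at most one peptide per degree, highest degree first.
    diff = pep_edges - prot_edges
    for degree in range(len(peptide_degs) - 1, -1, -1):
        if peptide_degs[degree] > 0 and degree <= diff:
            peptide_degs[degree] -= 1
            diff -= degree
        if diff == 0:
            break
    return protein_degs, peptide_degs


protein_degs, peptide_degs = balance_peptide_degrees(
    protein_deg_distribution, peptide_deg_distribution, num_proteins=100
)
print(sum(protein_degs), sum(peptide_degs))
print(edge_count(protein_degs), edge_count(peptide_degs))
```

Because every degree count is rounded separately, asking for 100 proteins yields 97 proteins and 604 peptides here, which matches the row counts in the tables further down (970 = 97 × 10 samples and 6040 = 604 × 10 samples).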

Finally, we introduce missing values, both missing not at random (MNAR) and missing completely at random (MCAR).

simulate_mnars_thresholding(dataset=ds, thresh_mu=115005.3 / 2, thresh_sigma=115005.3 / 4, molecule='peptide', column='abundance',
                            result_column='abundance_missing', mask_column='is_mnar', inplace=True)
simulate_mcars(dataset=ds, amount=0.3, molecule='peptide', column='abundance', result_column='abundance_missing', mask_column='is_mcar', inplace=True)
<pyproteonet.data.dataset.Dataset at 0x7f78feb7e6e0>
df = ds.values['peptide'].df
df.is_mnar.sum() / df.shape[0], df.is_mcar.sum() / df.shape[0]
(0.0, 0.2996688741721854)
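
To make the two missingness mechanisms concrete, here is a minimal standard-library sketch. It is not PyProteoNet's implementation: it assumes that MCAR drops each value independently with probability `amount`, and that threshold-based MNAR drops a value when it falls below a detection threshold drawn from a normal distribution. The helper names and the threshold of `1e6` are illustrative choices only.

```python
import random


def mask_mcar(values, amount, rng):
    """MCAR: every value is dropped independently with probability `amount`."""
    return [rng.random() < amount for _ in values]


def mask_mnar_threshold(values, thresh_mu, thresh_sigma, rng):
    """MNAR (assumed mechanism): drop values below a noisy detection threshold."""
    return [v < rng.gauss(thresh_mu, thresh_sigma) for v in values]


rng = random.Random(0)
# Toy abundances on the raw (non-log) scale.
values = [rng.lognormvariate(15.9, 1.8) for _ in range(5000)]

is_mcar = mask_mcar(values, amount=0.3, rng=rng)
# thresh_sigma=0.0 makes the threshold deterministic, which keeps the example easy to check.
is_mnar = mask_mnar_threshold(values, thresh_mu=1e6, thresh_sigma=0.0, rng=rng)

mcar_fraction = sum(is_mcar) / len(values)
mnar_fraction = sum(is_mnar) / len(values)
print(f"MCAR fraction: {mcar_fraction:.3f}, MNAR fraction: {mnar_fraction:.3f}")
```

Under these assumptions, MCAR masks are unrelated to abundance, while MNAR masks are concentrated on low values. That difference is why simple low-value imputations can suit MNAR data better than MCAR data.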

Finally, all abundance values are logarithmized, as is common in proteomics, because log-transformed abundances are closer to normally distributed.

ds = logarithmize(data=ds, columns=['abundance', 'abundance_gt', 'abundance_missing'])
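
The effect of the log transform can be illustrated without the dataset. The sketch below draws toy log-normal abundances (reusing 15.9 and 1.8 purely as illustrative parameters) and assumes a natural logarithm. Which base `logarithmize` uses by default is not checked here.

```python
import math
import random
import statistics

rng = random.Random(0)
raw = [rng.lognormvariate(15.9, 1.8) for _ in range(5000)]
logged = [math.log(v) for v in raw]

# On the raw scale a few very large values pull the mean far above the median.
raw_ratio = statistics.mean(raw) / statistics.median(raw)
# On the log scale the distribution is symmetric, so mean and median nearly coincide.
log_gap = statistics.mean(logged) - statistics.median(logged)

print(f"raw mean / median: {raw_ratio:.2f}")
print(f"log mean - median: {log_gap:.3f}")
```

Methods that implicitly assume roughly symmetric, Gaussian-like data (many of the imputation methods used later fall into this group) therefore behave more predictably on the log scale.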

MaxLFQ aggregation

ds.values['protein']['aggregated'] = maxlfq(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance_missing',
                                            min_ratios=2, median_fallback=False, is_log=True)

Now the ‘aggregated’ column holds the aggregated values, and the ‘abundance_gt’ column, which was written during the simulation, holds the ground truth values.

ds.values['protein'].df
              abundance_gt  aggregated
sample   id
sample0  0          15.681         NaN
         1          16.349         NaN
         2          15.980         NaN
         3          15.953         NaN
         4          17.506         NaN
...      ...           ...         ...
sample9  92         16.108      16.049
         93         14.821      14.358
         94         16.364      16.816
         95         17.186      16.916
         96         17.538      17.259

970 rows × 2 columns
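
MaxLFQ derives protein abundances from pairwise peptide ratios between samples, so a protein can stay NaN when too few of its peptides are observed in both samples of a pair. The sketch below covers only that first step (the median pairwise log-ratio) and leaves out the subsequent least-squares step. It is not PyProteoNet's code, and it assumes that `min_ratios` plays the role of MaxLFQ's minimum ratio count. The function name and the toy values are hypothetical.

```python
import math
import statistics


def pairwise_median_log_ratio(sample_a, sample_b, min_ratios=2):
    """Median log-ratio between two samples over peptides observed in both.

    Inputs are log abundances of the same peptides in two samples.
    Returns NaN when fewer than `min_ratios` peptides are shared.
    """
    diffs = [a - b for a, b in zip(sample_a, sample_b)
             if not (math.isnan(a) or math.isnan(b))]
    if len(diffs) < min_ratios:
        return math.nan
    return statistics.median(diffs)


nan = math.nan
# Toy log abundances of one protein's three peptides in three samples.
s0 = [15.3, nan, nan]
s1 = [15.9, 16.5, 17.0]
s2 = [15.4, 16.1, 16.4]

ratio_12 = pairwise_median_log_ratio(s1, s2)  # three shared peptides
ratio_01 = pairwise_median_log_ratio(s0, s1)  # only one shared peptide
print(ratio_12, ratio_01)
```

With `median_fallback=False`, as in the call above, missing ratios of this kind are one plausible source of the NaN entries in the ‘aggregated’ column.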

Missing Value Imputation

PyProteoNet provides a wide range of established imputation functions, combining native Python implementations with wrappers around R packages for methods that have no Python implementation yet.

Here we use the high-level API to impute at both the protein and the peptide level with several different imputation methods.

from pyproteonet.imputation import impute_molecule

imputation_methods = ["minprob", "mindet", "bpca", "missforest", "knn", "isvd", "dae"]

impute_molecule(dataset=ds, molecule='protein', column='aggregated', methods=imputation_methods)
impute_molecule(dataset=ds, molecule='peptide', column='abundance_missing', methods=imputation_methods)
minprob minprob
[1] 0.2529305
mindet mindet
bpca bpca
missforest missforest
Iteration: 0
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 6
knn knn
isvd isvd
[IterativeSVD] Iter 1: observed MAE=0.942344
[IterativeSVD] Iter 2: observed MAE=0.350876
[IterativeSVD] Iter 3: observed MAE=0.258267
[IterativeSVD] Iter 4: observed MAE=0.227213
[IterativeSVD] Iter 5: observed MAE=0.211764
[IterativeSVD] Iter 6: observed MAE=0.204686
[IterativeSVD] Iter 7: observed MAE=0.200392
[IterativeSVD] Iter 8: observed MAE=0.197704
[IterativeSVD] Iter 9: observed MAE=0.195750
[IterativeSVD] Iter 10: observed MAE=0.194525
[IterativeSVD] Iter 11: observed MAE=0.193618
15.146899784126177
dae dae
epoch train_loss valid_loss time
0 725.728516 91.013054 00:00
1 709.095764 91.077766 00:00
2 708.331848 90.922592 00:00
3 705.660583 90.476593 00:00
4 696.783264 90.100899 00:00
5 685.555237 89.355019 00:00
6 672.083923 88.618347 00:00
7 653.671997 87.444458 00:00
8 632.348083 85.870407 00:00
9 608.691650 84.063354 00:00
10 582.108459 82.260605 00:00
11 555.385254 79.811363 00:00
12 528.457642 77.238266 00:00
13 503.068329 74.764008 00:00
14 478.512085 71.810822 00:00
15 455.935547 69.176422 00:00
16 434.320984 66.161781 00:00
17 414.498169 62.710480 00:00
18 396.038483 58.973877 00:00
19 378.899445 55.197704 00:00
20 363.228455 51.754570 00:00
21 348.546173 47.994987 00:00
22 335.909119 44.215851 00:00
23 323.789581 41.902397 00:00
24 312.636841 38.087860 00:00
25 302.076935 35.011318 00:00
26 291.629364 31.977169 00:00
27 282.320282 29.299732 00:00
28 273.569611 27.784550 00:00
29 265.416718 25.950260 00:00
30 257.404907 24.255640 00:00
31 250.069244 23.220112 00:00
32 243.216873 22.005398 00:00
33 236.291687 20.901123 00:00
34 230.121155 19.042309 00:00
35 224.284653 18.446800 00:00
36 218.718582 18.114464 00:00
37 213.279007 17.501831 00:00
38 207.986801 17.174189 00:00
39 203.142578 16.500206 00:00
40 198.317932 16.405998 00:00
41 194.001770 15.670680 00:00
42 189.928879 15.808111 00:00
43 186.135056 15.554245 00:00
44 182.421234 15.500771 00:00
45 178.827637 15.241837 00:00
46 175.299973 14.754601 00:00
47 172.087616 14.403175 00:00
48 169.119553 14.403598 00:00
49 166.035294 14.120859 00:00
minprob minprob
[1] 0.2632525
mindet mindet
bpca bpca
missforest missforest
Iteration: 0
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 6
knn knn
isvd isvd
[IterativeSVD] Iter 1: observed MAE=3.272085
[IterativeSVD] Iter 2: observed MAE=0.941136
[IterativeSVD] Iter 3: observed MAE=0.464382
[IterativeSVD] Iter 4: observed MAE=0.322353
[IterativeSVD] Iter 5: observed MAE=0.266482
[IterativeSVD] Iter 6: observed MAE=0.237536
[IterativeSVD] Iter 7: observed MAE=0.219830
[IterativeSVD] Iter 8: observed MAE=0.208195
[IterativeSVD] Iter 9: observed MAE=0.200088
[IterativeSVD] Iter 10: observed MAE=0.194486
[IterativeSVD] Iter 11: observed MAE=0.190412
[IterativeSVD] Iter 12: observed MAE=0.187479
16.010288091787228
dae dae
epoch train_loss valid_loss time
0 4493.552246 404.716217 00:00
1 4468.966309 404.322144 00:00
2 4447.557129 403.467987 00:00
3 4409.976562 402.153320 00:00
4 4379.576172 400.218506 00:00
5 4323.508301 397.358063 00:00
6 4260.140137 393.432800 00:00
7 4182.843750 388.475464 00:00
8 4094.818604 382.642151 00:00
9 3987.919189 375.565155 00:00
10 3867.635254 367.591614 00:00
11 3742.343994 356.750305 00:00
12 3611.705078 345.030212 00:00
13 3477.803711 332.897705 00:00
14 3342.454590 322.812622 00:00
15 3210.765625 310.069885 00:00
16 3085.559082 295.018188 00:00
17 2965.295166 281.148682 00:00
18 2848.843262 268.510559 00:00
19 2739.402832 256.520935 00:00
20 2636.009521 242.482666 00:00
21 2540.154541 231.278122 00:00
22 2447.782715 221.356979 00:00
23 2359.931641 213.276093 00:00
24 2279.674805 206.133545 00:00
25 2203.294678 198.306946 00:00
26 2131.332764 186.534164 00:00
27 2063.686279 176.166138 00:00
28 2000.210815 166.791840 00:00
29 1941.170044 158.309631 00:00
30 1883.979248 150.332123 00:00
31 1828.688477 141.152344 00:00
32 1778.762817 134.889053 00:00
33 1729.396118 129.041748 00:00
34 1682.929199 122.430740 00:00
35 1638.676758 117.555725 00:00
36 1597.069458 110.338028 00:00
37 1556.690430 105.001938 00:00
38 1518.916138 99.025734 00:00
39 1483.880737 92.003006 00:00
40 1449.457275 87.371155 00:00
41 1416.749023 82.971382 00:00
42 1386.405884 77.076126 00:00
43 1356.068481 72.550751 00:00
44 1328.091431 67.971817 00:00
45 1300.896484 66.230911 00:00
46 1274.791992 64.563316 00:00
47 1249.918091 62.941284 00:00
48 1226.718384 60.634460 00:00
49 1203.742554 57.658623 00:00

Looking at the result we can see that the missing values are gone:

ds.values['peptide'].df
              abundance  abundance_gt  abundance_missing  is_mnar  is_mcar  minprob  mindet    bpca  missforest     knn    isvd     dae
sample   id
sample0  0       15.295        15.681             15.295    False    False   15.295  15.295  15.295      15.295  15.295  15.295  15.295
         1       16.469        16.349                NaN    False     True   12.429  12.393  16.614      16.566  16.625  16.446  16.406
         2       15.682        15.980                NaN    False     True   12.489  12.393  15.963      15.642  16.131  16.075  15.713
         3       16.372        15.953             16.372    False    False   16.372  16.372  16.372      16.372  16.372  16.372  16.372
         4       17.208        17.506             17.208    False    False   17.208  17.208  17.208      17.208  17.208  17.208  17.208
...      ...        ...           ...                ...      ...      ...      ...     ...     ...         ...     ...     ...     ...
sample9  599     17.746        17.807             17.746    False    False   17.746  17.746  17.746      17.746  17.746  17.746  17.746
         600     18.388        18.548             18.388    False    False   18.388  18.388  18.388      18.388  18.388  18.388  18.388
         601     17.853        18.019                NaN    False     True   12.511  12.748  17.967      17.732  17.748  17.984  17.382
         602     18.369        18.595             18.369    False    False   18.369  18.369  18.369      18.369  18.369  18.369  18.369
         603     18.173        18.284             18.173    False    False   18.173  18.173  18.173      18.173  18.173  18.173  18.173

6040 rows × 12 columns
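
In the table, the mindet column fills every gap within a sample with the same low value (12.393 in sample0, 12.748 in sample9). That is the behaviour expected from a deterministic-minimum imputation. The sketch below illustrates the idea, assuming the method replaces missing values with a low quantile of the observed values of each sample, as the MinDet method of the R package imputeLCMD does (with `q=0.01` as its default, to the best of our knowledge). It is not the wrapper's actual code, and `impute_min_det` is a hypothetical name.

```python
import math


def impute_min_det(sample_values, q=0.01):
    """Replace NaNs with the q-quantile of the observed values of one sample."""
    observed = sorted(v for v in sample_values if not math.isnan(v))
    # Quantile by linear interpolation between neighbouring order statistics.
    pos = q * (len(observed) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(observed) - 1)
    fill = observed[lo] + (pos - lo) * (observed[hi] - observed[lo])
    return [fill if math.isnan(v) else v for v in sample_values]


nan = math.nan
# 'abundance_missing' of the first five sample0 peptides from the table above.
sample0 = [15.295, nan, nan, 16.372, 17.208]
imputed = impute_min_det(sample0)
imputed_min = impute_min_det(sample0, q=0.0)
print(imputed)
```

Such a fill value suits left-censored (MNAR) data, where values are missing because they are low. In this simulation almost all gaps are MCAR, which is consistent with the large errors of minprob and mindet reported below.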

Graph Neural Network Imputation

from pyproteonet.imputation.dnn.gnn import impute_heterogeneous_gnn

_ = impute_heterogeneous_gnn(dataset=ds, molecule='protein', column='aggregated', mapping='peptide-protein', partner_column='abundance_missing',
                             molecule_result_column='gnn_hetero', partner_result_column='gnn_hetero',
                             max_epochs=1000, early_stopping_patience=7, epoch_size=30, training_fraction=0.25, log_every_n_steps=30)
seed: 351895759
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name              | Type            | Params
------------------------------------------------------
0 | embedding         | Embedding       | 485   
1 | molecule_fc_model | Sequential      | 11.0 K
2 | partner_fc_model  | Sequential      | 11.4 K
3 | molecule_gat      | HeteroGraphConv | 34.4 K
4 | partner_gat       | HeteroGraphConv | 50.4 K
5 | molecule_gat2     | HeteroGraphConv | 66.4 K
6 | molecule_linear   | Linear          | 820   
7 | partner_linear    | Linear          | 1.2 K 
8 | loss_fn           | GaussianNLLLoss | 0     
------------------------------------------------------
176 K     Trainable params
0         Non-trainable params
176 K     Total params
0.705     Total estimated model params size (MB)
step29: num_masked_molecule:706.0 || num_masked_partner:1026.800048828125 || molecule_loss:0.5021055936813354 || partner_loss:0.4748190641403198 || train_loss:0.9769246578216553 || epoch:0 || 
step59: num_masked_molecule:706.0 || num_masked_partner:1160.433349609375 || molecule_loss:0.3709450662136078 || partner_loss:0.12557192146778107 || train_loss:0.49651697278022766 || epoch:1 || 
step89: num_masked_molecule:706.0 || num_masked_partner:1090.5333251953125 || molecule_loss:-0.012235956266522408 || partner_loss:-0.33215370774269104 || train_loss:-0.34438958764076233 || epoch:2 || 
step119: num_masked_molecule:706.0 || num_masked_partner:1158.4666748046875 || molecule_loss:-0.32520750164985657 || partner_loss:-0.4651690125465393 || train_loss:-0.790376603603363 || epoch:3 || 
step149: num_masked_molecule:706.0 || num_masked_partner:1099.7332763671875 || molecule_loss:-0.4504413902759552 || partner_loss:-0.510000467300415 || train_loss:-0.9604418873786926 || epoch:4 || 
step179: num_masked_molecule:706.0 || num_masked_partner:1078.0 || molecule_loss:-0.5105648040771484 || partner_loss:-0.5449689030647278 || train_loss:-1.055533766746521 || epoch:5 || 
step209: num_masked_molecule:706.0 || num_masked_partner:990.6666870117188 || molecule_loss:-0.6079190373420715 || partner_loss:-0.61372971534729 || train_loss:-1.2216488122940063 || epoch:6 || 
step239: num_masked_molecule:706.0 || num_masked_partner:968.5999755859375 || molecule_loss:-0.6419838666915894 || partner_loss:-0.6361216306686401 || train_loss:-1.2781054973602295 || epoch:7 || 
step269: num_masked_molecule:706.0 || num_masked_partner:1091.7332763671875 || molecule_loss:-0.6184263825416565 || partner_loss:-0.6072388887405396 || train_loss:-1.2256652116775513 || epoch:8 || 
step299: num_masked_molecule:706.0 || num_masked_partner:994.2000122070312 || molecule_loss:-0.7532567381858826 || partner_loss:-0.7296491861343384 || train_loss:-1.4829059839248657 || epoch:9 || 
step329: num_masked_molecule:706.0 || num_masked_partner:1048.4000244140625 || molecule_loss:-0.7736703753471375 || partner_loss:-0.7557526230812073 || train_loss:-1.5294231176376343 || epoch:10 || 
step359: num_masked_molecule:706.0 || num_masked_partner:1050.4666748046875 || molecule_loss:-0.7907215356826782 || partner_loss:-0.7705901861190796 || train_loss:-1.5613116025924683 || epoch:11 || 
step389: num_masked_molecule:706.0 || num_masked_partner:1090.3333740234375 || molecule_loss:-0.7522632479667664 || partner_loss:-0.767548143863678 || train_loss:-1.5198115110397339 || epoch:12 || 
step419: num_masked_molecule:706.0 || num_masked_partner:1026.86669921875 || molecule_loss:-0.806643545627594 || partner_loss:-0.7848774790763855 || train_loss:-1.5915215015411377 || epoch:13 || 
step449: num_masked_molecule:706.0 || num_masked_partner:1015.0333251953125 || molecule_loss:-0.8478059768676758 || partner_loss:-0.8209801316261292 || train_loss:-1.668785810470581 || epoch:14 || 
step479: num_masked_molecule:706.0 || num_masked_partner:1064.300048828125 || molecule_loss:-0.8574027419090271 || partner_loss:-0.8280267715454102 || train_loss:-1.6854294538497925 || epoch:15 || 
step509: num_masked_molecule:706.0 || num_masked_partner:958.933349609375 || molecule_loss:-0.8429433107376099 || partner_loss:-0.8150697350502014 || train_loss:-1.658013105392456 || epoch:16 || 
step539: num_masked_molecule:706.0 || num_masked_partner:1020.7333374023438 || molecule_loss:-0.8313974738121033 || partner_loss:-0.81135493516922 || train_loss:-1.6427521705627441 || epoch:17 || 
step569: num_masked_molecule:706.0 || num_masked_partner:1027.933349609375 || molecule_loss:-0.8664554357528687 || partner_loss:-0.8272954225540161 || train_loss:-1.6937506198883057 || epoch:18 || 
step599: num_masked_molecule:706.0 || num_masked_partner:1179.800048828125 || molecule_loss:-0.8586979508399963 || partner_loss:-0.8399512767791748 || train_loss:-1.6986489295959473 || epoch:19 || 
step629: num_masked_molecule:706.0 || num_masked_partner:1107.13330078125 || molecule_loss:-0.758622944355011 || partner_loss:-0.7492792010307312 || train_loss:-1.5079021453857422 || epoch:20 || 
step659: num_masked_molecule:706.0 || num_masked_partner:1056.9000244140625 || molecule_loss:-0.8923173546791077 || partner_loss:-0.8846205472946167 || train_loss:-1.7769378423690796 || epoch:21 || 
step689: num_masked_molecule:706.0 || num_masked_partner:1156.0999755859375 || molecule_loss:-0.8382619023323059 || partner_loss:-0.7892665863037109 || train_loss:-1.627528429031372 || epoch:22 || 
step719: num_masked_molecule:706.0 || num_masked_partner:998.1333618164062 || molecule_loss:-0.9267680644989014 || partner_loss:-0.9003163576126099 || train_loss:-1.8270844221115112 || epoch:23 || 
step749: num_masked_molecule:706.0 || num_masked_partner:1127.933349609375 || molecule_loss:-0.9382941722869873 || partner_loss:-0.9007239937782288 || train_loss:-1.8390179872512817 || epoch:24 || 
step779: num_masked_molecule:706.0 || num_masked_partner:1018.5999755859375 || molecule_loss:-0.9493134021759033 || partner_loss:-0.9253332614898682 || train_loss:-1.8746470212936401 || epoch:25 || 
step809: num_masked_molecule:706.0 || num_masked_partner:1107.6666259765625 || molecule_loss:-0.9543604254722595 || partner_loss:-0.9190396666526794 || train_loss:-1.873400330543518 || epoch:26 || 
step839: num_masked_molecule:706.0 || num_masked_partner:1085.300048828125 || molecule_loss:-0.933240532875061 || partner_loss:-0.895464301109314 || train_loss:-1.828704595565796 || epoch:27 || 
step869: num_masked_molecule:706.0 || num_masked_partner:1079.0999755859375 || molecule_loss:-0.9389467239379883 || partner_loss:-0.9173352718353271 || train_loss:-1.856282114982605 || epoch:28 || 
step899: num_masked_molecule:706.0 || num_masked_partner:1007.8666381835938 || molecule_loss:-0.9523510336875916 || partner_loss:-0.9178689122200012 || train_loss:-1.8702200651168823 || epoch:29 || 
step929: num_masked_molecule:706.0 || num_masked_partner:930.566650390625 || molecule_loss:-0.9857296347618103 || partner_loss:-0.9465729594230652 || train_loss:-1.932302713394165 || epoch:30 || 
step959: num_masked_molecule:706.0 || num_masked_partner:1076.4666748046875 || molecule_loss:-0.9728419184684753 || partner_loss:-0.9399837851524353 || train_loss:-1.9128258228302002 || epoch:31 || 
step989: num_masked_molecule:706.0 || num_masked_partner:1027.3333740234375 || molecule_loss:-1.0037721395492554 || partner_loss:-0.9678016304969788 || train_loss:-1.9715734720230103 || epoch:32 || 
step1019: num_masked_molecule:706.0 || num_masked_partner:1077.36669921875 || molecule_loss:-1.0080838203430176 || partner_loss:-0.9495093822479248 || train_loss:-1.957593321800232 || epoch:33 || 
step1049: num_masked_molecule:706.0 || num_masked_partner:1100.566650390625 || molecule_loss:-0.9795469641685486 || partner_loss:-0.9460999369621277 || train_loss:-1.9256469011306763 || epoch:34 || 
step1079: num_masked_molecule:706.0 || num_masked_partner:1154.0 || molecule_loss:-0.9345295429229736 || partner_loss:-0.9085943698883057 || train_loss:-1.843124270439148 || epoch:35 || 
step1109: num_masked_molecule:706.0 || num_masked_partner:950.7333374023438 || molecule_loss:-1.0188764333724976 || partner_loss:-0.9866272211074829 || train_loss:-2.0055034160614014 || epoch:36 || 
step1139: num_masked_molecule:706.0 || num_masked_partner:1053.4666748046875 || molecule_loss:-1.0177667140960693 || partner_loss:-0.9690415859222412 || train_loss:-1.9868087768554688 || epoch:37 || 
step1169: num_masked_molecule:706.0 || num_masked_partner:1112.13330078125 || molecule_loss:-1.0239351987838745 || partner_loss:-0.9681205749511719 || train_loss:-1.9920555353164673 || epoch:38 || 
step1199: num_masked_molecule:706.0 || num_masked_partner:1150.0333251953125 || molecule_loss:-1.036136269569397 || partner_loss:-0.987048327922821 || train_loss:-2.0231850147247314 || epoch:39 || 
step1229: num_masked_molecule:706.0 || num_masked_partner:982.2000122070312 || molecule_loss:-1.0431923866271973 || partner_loss:-1.0042524337768555 || train_loss:-2.0474445819854736 || epoch:40 || 
step1259: num_masked_molecule:706.0 || num_masked_partner:1129.4666748046875 || molecule_loss:-1.024781346321106 || partner_loss:-0.9864785671234131 || train_loss:-2.0112600326538086 || epoch:41 || 
step1289: num_masked_molecule:706.0 || num_masked_partner:1092.0999755859375 || molecule_loss:-1.0318939685821533 || partner_loss:-1.0066903829574585 || train_loss:-2.038583993911743 || epoch:42 || 
step1319: num_masked_molecule:706.0 || num_masked_partner:1045.199951171875 || molecule_loss:-1.0409079790115356 || partner_loss:-1.0015157461166382 || train_loss:-2.0424234867095947 || epoch:43 || 
step1349: num_masked_molecule:706.0 || num_masked_partner:1131.5 || molecule_loss:-1.0382781028747559 || partner_loss:-0.9959096312522888 || train_loss:-2.0341877937316895 || epoch:44 || 
step1379: num_masked_molecule:706.0 || num_masked_partner:983.7333374023438 || molecule_loss:-1.057854175567627 || partner_loss:-1.0095196962356567 || train_loss:-2.067373752593994 || epoch:45 || 
step1409: num_masked_molecule:706.0 || num_masked_partner:1011.4666748046875 || molecule_loss:-1.0266164541244507 || partner_loss:-0.9871013760566711 || train_loss:-2.0137178897857666 || epoch:46 || 
step1439: num_masked_molecule:706.0 || num_masked_partner:1082.2667236328125 || molecule_loss:-1.044403314590454 || partner_loss:-1.0015701055526733 || train_loss:-2.045973539352417 || epoch:47 || 
step1469: num_masked_molecule:706.0 || num_masked_partner:1065.7667236328125 || molecule_loss:-1.0337443351745605 || partner_loss:-0.9949927926063538 || train_loss:-2.0287368297576904 || epoch:48 || 
step1499: num_masked_molecule:706.0 || num_masked_partner:1039.9666748046875 || molecule_loss:-1.0541470050811768 || partner_loss:-0.9889230132102966 || train_loss:-2.0430703163146973 || epoch:49 || 
step1529: num_masked_molecule:706.0 || num_masked_partner:1022.5999755859375 || molecule_loss:-1.071032166481018 || partner_loss:-1.0220104455947876 || train_loss:-2.0930426120758057 || epoch:50 || 
step1559: num_masked_molecule:706.0 || num_masked_partner:1079.433349609375 || molecule_loss:-1.0609084367752075 || partner_loss:-1.0071407556533813 || train_loss:-2.068049192428589 || epoch:51 || 
step1589: num_masked_molecule:706.0 || num_masked_partner:997.5 || molecule_loss:-1.0327026844024658 || partner_loss:-1.0056102275848389 || train_loss:-2.038313150405884 || epoch:52 || 
step1619: num_masked_molecule:706.0 || num_masked_partner:1013.1333618164062 || molecule_loss:-1.0988285541534424 || partner_loss:-1.039958119392395 || train_loss:-2.1387863159179688 || epoch:53 || 
step1649: num_masked_molecule:706.0 || num_masked_partner:1085.6666259765625 || molecule_loss:-1.0658906698226929 || partner_loss:-1.0185890197753906 || train_loss:-2.084479570388794 || epoch:54 || 
step1679: num_masked_molecule:706.0 || num_masked_partner:923.7999877929688 || molecule_loss:-1.0951734781265259 || partner_loss:-1.0456804037094116 || train_loss:-2.1408538818359375 || epoch:55 || 
step1709: num_masked_molecule:706.0 || num_masked_partner:1113.7332763671875 || molecule_loss:-1.0575143098831177 || partner_loss:-1.0204428434371948 || train_loss:-2.0779569149017334 || epoch:56 || 
step1739: num_masked_molecule:706.0 || num_masked_partner:1064.7332763671875 || molecule_loss:-1.0925005674362183 || partner_loss:-1.0493261814117432 || train_loss:-2.141826629638672 || epoch:57 || 
step1769: num_masked_molecule:706.0 || num_masked_partner:1054.066650390625 || molecule_loss:-1.084751009941101 || partner_loss:-1.0421618223190308 || train_loss:-2.126912832260132 || epoch:58 || 
step1799: num_masked_molecule:706.0 || num_masked_partner:1163.5333251953125 || molecule_loss:-1.0990279912948608 || partner_loss:-1.0486626625061035 || train_loss:-2.147690534591675 || epoch:59 || 
step1829: num_masked_molecule:706.0 || num_masked_partner:1080.7332763671875 || molecule_loss:-1.1158238649368286 || partner_loss:-1.053543210029602 || train_loss:-2.1693673133850098 || epoch:60 || 
step1859: num_masked_molecule:706.0 || num_masked_partner:1100.8333740234375 || molecule_loss:-1.0958619117736816 || partner_loss:-1.0411535501480103 || train_loss:-2.1370155811309814 || epoch:61 || 
step1889: num_masked_molecule:706.0 || num_masked_partner:1056.066650390625 || molecule_loss:-1.1076759099960327 || partner_loss:-1.0397175550460815 || train_loss:-2.1473934650421143 || epoch:62 || 
step1919: num_masked_molecule:706.0 || num_masked_partner:1096.433349609375 || molecule_loss:-1.1148617267608643 || partner_loss:-1.0689504146575928 || train_loss:-2.183812379837036 || epoch:63 || 
step1949: num_masked_molecule:706.0 || num_masked_partner:1039.199951171875 || molecule_loss:-1.1124831438064575 || partner_loss:-1.064082384109497 || train_loss:-2.176565408706665 || epoch:64 || 
step1979: num_masked_molecule:706.0 || num_masked_partner:1049.566650390625 || molecule_loss:-1.1219178438186646 || partner_loss:-1.0634821653366089 || train_loss:-2.1854004859924316 || epoch:65 || 
step2009: num_masked_molecule:706.0 || num_masked_partner:1072.9000244140625 || molecule_loss:-1.1082804203033447 || partner_loss:-1.052480936050415 || train_loss:-2.1607613563537598 || epoch:66 || 
step2039: num_masked_molecule:706.0 || num_masked_partner:1000.2333374023438 || molecule_loss:-1.1242812871932983 || partner_loss:-1.0726218223571777 || train_loss:-2.1969032287597656 || epoch:67 || 
step2069: num_masked_molecule:706.0 || num_masked_partner:1071.9000244140625 || molecule_loss:-1.1164376735687256 || partner_loss:-1.0617098808288574 || train_loss:-2.178147315979004 || epoch:68 || 
step2099: num_masked_molecule:706.0 || num_masked_partner:1065.433349609375 || molecule_loss:-1.1219402551651 || partner_loss:-1.0614031553268433 || train_loss:-2.1833431720733643 || epoch:69 || 
step2129: num_masked_molecule:706.0 || num_masked_partner:933.8333129882812 || molecule_loss:-1.1334900856018066 || partner_loss:-1.0872844457626343 || train_loss:-2.2207746505737305 || epoch:70 || 
step2159: num_masked_molecule:706.0 || num_masked_partner:1012.0999755859375 || molecule_loss:-1.139643907546997 || partner_loss:-1.09285569190979 || train_loss:-2.232499599456787 || epoch:71 || 
step2189: num_masked_molecule:706.0 || num_masked_partner:1070.4000244140625 || molecule_loss:-1.1274845600128174 || partner_loss:-1.0885952711105347 || train_loss:-2.2160797119140625 || epoch:72 || 
step2219: num_masked_molecule:706.0 || num_masked_partner:997.0 || molecule_loss:-1.1332112550735474 || partner_loss:-1.0625876188278198 || train_loss:-2.195798635482788 || epoch:73 || 
step2249: num_masked_molecule:706.0 || num_masked_partner:1010.433349609375 || molecule_loss:-1.1446971893310547 || partner_loss:-1.0848357677459717 || train_loss:-2.2295329570770264 || epoch:74 || 
step2279: num_masked_molecule:706.0 || num_masked_partner:1121.63330078125 || molecule_loss:-1.1328314542770386 || partner_loss:-1.076156735420227 || train_loss:-2.2089881896972656 || epoch:75 || 
step2309: num_masked_molecule:706.0 || num_masked_partner:1091.5999755859375 || molecule_loss:-1.152953863143921 || partner_loss:-1.0956823825836182 || train_loss:-2.24863600730896 || epoch:76 || 
step2339: num_masked_molecule:706.0 || num_masked_partner:1058.36669921875 || molecule_loss:-1.15134596824646 || partner_loss:-1.0903767347335815 || train_loss:-2.241722583770752 || epoch:77 || 
step2369: num_masked_molecule:706.0 || num_masked_partner:993.8333129882812 || molecule_loss:-1.1373097896575928 || partner_loss:-1.0991852283477783 || train_loss:-2.23649525642395 || epoch:78 || 
step2399: num_masked_molecule:706.0 || num_masked_partner:1059.6666259765625 || molecule_loss:-1.1557230949401855 || partner_loss:-1.101866602897644 || train_loss:-2.25758957862854 || epoch:79 || 
step2429: num_masked_molecule:706.0 || num_masked_partner:1103.4666748046875 || molecule_loss:-1.1524230241775513 || partner_loss:-1.0957988500595093 || train_loss:-2.2482216358184814 || epoch:80 || 
step2459: num_masked_molecule:706.0 || num_masked_partner:1028.566650390625 || molecule_loss:-1.16344153881073 || partner_loss:-1.1027735471725464 || train_loss:-2.2662158012390137 || epoch:81 || 
step2489: num_masked_molecule:706.0 || num_masked_partner:976.2999877929688 || molecule_loss:-1.1658706665039062 || partner_loss:-1.1084386110305786 || train_loss:-2.2743096351623535 || epoch:82 || 
step2519: num_masked_molecule:706.0 || num_masked_partner:1003.7333374023438 || molecule_loss:-1.1428661346435547 || partner_loss:-1.0996185541152954 || train_loss:-2.2424845695495605 || epoch:83 || 
step2549: num_masked_molecule:706.0 || num_masked_partner:1014.3333129882812 || molecule_loss:-1.1486085653305054 || partner_loss:-1.1001842021942139 || train_loss:-2.248792886734009 || epoch:84 || 
step2579: num_masked_molecule:706.0 || num_masked_partner:958.5 || molecule_loss:-1.1815340518951416 || partner_loss:-1.1107405424118042 || train_loss:-2.2922744750976562 || epoch:85 || 
step2609: num_masked_molecule:706.0 || num_masked_partner:1176.5999755859375 || molecule_loss:-1.1537457704544067 || partner_loss:-1.0964839458465576 || train_loss:-2.250230073928833 || epoch:86 || 
step2639: num_masked_molecule:706.0 || num_masked_partner:995.0999755859375 || molecule_loss:-1.157749056816101 || partner_loss:-1.097983717918396 || train_loss:-2.255732536315918 || epoch:87 || 
step2669: num_masked_molecule:706.0 || num_masked_partner:1127.9000244140625 || molecule_loss:-1.1710950136184692 || partner_loss:-1.105139136314392 || train_loss:-2.2762341499328613 || epoch:88 || 
step2699: num_masked_molecule:706.0 || num_masked_partner:1144.13330078125 || molecule_loss:-1.1701854467391968 || partner_loss:-1.1100586652755737 || train_loss:-2.2802443504333496 || epoch:89 || 
step2729: num_masked_molecule:706.0 || num_masked_partner:1079.2332763671875 || molecule_loss:-1.1851451396942139 || partner_loss:-1.1264252662658691 || train_loss:-2.311570405960083 || epoch:90 || 
step2759: num_masked_molecule:706.0 || num_masked_partner:1087.199951171875 || molecule_loss:-1.1831681728363037 || partner_loss:-1.1189769506454468 || train_loss:-2.30214524269104 || epoch:91 || 
step2789: num_masked_molecule:706.0 || num_masked_partner:1091.36669921875 || molecule_loss:-1.1657309532165527 || partner_loss:-1.1075778007507324 || train_loss:-2.273308515548706 || epoch:92 || 
step2819: num_masked_molecule:706.0 || num_masked_partner:1115.199951171875 || molecule_loss:-1.1752351522445679 || partner_loss:-1.1235820055007935 || train_loss:-2.2988173961639404 || epoch:93 || 
step2849: num_masked_molecule:706.0 || num_masked_partner:1043.5999755859375 || molecule_loss:-1.1790860891342163 || partner_loss:-1.11478853225708 || train_loss:-2.2938742637634277 || epoch:94 || 
step2879: num_masked_molecule:706.0 || num_masked_partner:1050.2667236328125 || molecule_loss:-1.1862127780914307 || partner_loss:-1.114065408706665 || train_loss:-2.300278425216675 || epoch:95 || 
step2909: num_masked_molecule:706.0 || num_masked_partner:980.2999877929688 || molecule_loss:-1.1831430196762085 || partner_loss:-1.118802785873413 || train_loss:-2.301945686340332 || epoch:96 || 
step2939: num_masked_molecule:706.0 || num_masked_partner:1032.7332763671875 || molecule_loss:-1.1847190856933594 || partner_loss:-1.1261345148086548 || train_loss:-2.3108534812927246 || epoch:97 || 
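
The losses in the log above become negative, which is expected for the `GaussianNLLLoss` listed in the model summary: the model predicts a mean and a variance, and a confident, accurate prediction has a density above 1 and therefore a negative log-likelihood loss. The sketch below assumes the usual definition with the constant term omitted, as in PyTorch's `torch.nn.GaussianNLLLoss` defaults. It does not show how PyProteoNet wires the loss into training.

```python
import math


def gaussian_nll(mean, target, var, eps=1e-6):
    """Gaussian negative log-likelihood without the constant 0.5 * log(2 * pi)."""
    var = max(var, eps)  # clamp the variance for numerical stability
    return 0.5 * (math.log(var) + (target - mean) ** 2 / var)


# Unit variance and an error of 1: the loss is positive.
loss_uncertain = gaussian_nll(mean=16.0, target=17.0, var=1.0)
# Small predicted variance and a small error: the loss is negative.
loss_confident = gaussian_nll(mean=16.0, target=16.05, var=0.01)
print(loss_uncertain, loss_confident)
```

A steadily decreasing negative loss therefore indicates that the predicted variances shrink while the predictions remain accurate. It is not a sign of a training problem.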

Using plain pandas, we can compute the mean absolute error (MAE) of each imputation column with respect to the ground truth:

df = ds.values['peptide'].df
for imp in imputation_methods + ['gnn_hetero']:
    print(imp, (df[imp] - df['abundance_gt']).abs().mean())
minprob 1.161707908701399
mindet 1.1617361347381416
bpca 0.1944661251166557
missforest 0.2203599013751484
knn 0.2181944885286117
isvd 0.22642413639656336
dae 0.27632932655821524
gnn_hetero 0.1984486647359422
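
Note that the loop above averages over all rows, including values that were never missing, and those rows contribute the same error to every method. Restricting the comparison to originally missing entries can make differences between methods more visible. The following sketch shows the idea in plain Python on five rows copied from the sample0 part of the peptide table shown earlier. The `mae` helper is a hypothetical name, and with pandas the same restriction could be expressed through a boolean mask such as `df['abundance_missing'].isna()`.

```python
import math

nan = math.nan
# (abundance_gt, abundance_missing, minprob, bpca) for sample0, peptides 0 to 4.
rows = [
    (15.681, 15.295, 15.295, 15.295),
    (16.349, nan, 12.429, 16.614),
    (15.980, nan, 12.489, 15.963),
    (15.953, 16.372, 16.372, 16.372),
    (17.506, 17.208, 17.208, 17.208),
]


def mae(pairs):
    """Mean absolute error over (predicted, truth) pairs."""
    errors = [abs(predicted - truth) for predicted, truth in pairs]
    return sum(errors) / len(errors)


results = {}
for name, col in [("minprob", 2), ("bpca", 3)]:
    all_rows = mae((r[col], r[0]) for r in rows)
    # Only rows whose measured value was masked, i.e. truly imputed entries.
    masked_only = mae((r[col], r[0]) for r in rows if math.isnan(r[1]))
    results[name] = (all_rows, masked_only)
    print(f"{name}: all rows {all_rows:.3f}, imputed only {masked_only:.3f}")
```

On these five rows the ranking of the two methods is the same either way, but the gap between them is larger on the imputed-only subset because the shared measurement noise of observed values no longer dilutes it.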

Evaluating every sample individually gives a better idea of the variance of the imputation results. For this, compare_columns(...) can be used, which returns the evaluation results for a chosen metric (here the RMSE). The results can then be plotted with seaborn. We evaluate both protein and peptide imputation. In addition, evaluating only molecules with a missingness <= 80% allows for a more fine-grained evaluation.

from matplotlib import pyplot as plt
import seaborn as sns

from pyproteonet.metrics import compare_columns


for molecule in ['protein', 'peptide']:
    metric_df = compare_columns(
        dataset=ds,
        molecule=molecule,
        columns=imputation_methods + ['gnn_hetero'],
        comparison_column='abundance_gt',
        metric='RMSE',
        per_sample=True,
        ignore_missing=False,
        logarithmize=False,
        replace_nan_metric_with=0,
    )
    metric_df.rename(columns={"metric": 'RMSE'}, inplace=True)
    fig, ax = plt.subplots(figsize=(10, 5))
    ax = sns.boxplot(
        data=metric_df,
        x="column",
        y='RMSE',
        ax=ax
    )
    _ = ax.set_xticklabels(ax.get_xticklabels(),rotation=30, ha='right')
    ax.set_title(f'RMSE of {molecule} imputation methods')
[Figures: two box plots of per-sample RMSE by imputation method, titled "RMSE of protein imputation methods" and "RMSE of peptide imputation methods"]
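
As a rough illustration of what a per-sample RMSE means, the sketch below groups squared errors by sample and takes the root of each group's mean. This is our reading of `per_sample=True`, not a description of how `compare_columns` is implemented, and both the toy records and the `rmse_per_sample` helper are made up for the example.

```python
import math
from collections import defaultdict


def rmse_per_sample(records):
    """records: iterable of (sample, predicted, truth); returns {sample: RMSE}."""
    squared_errors = defaultdict(list)
    for sample, predicted, truth in records:
        squared_errors[sample].append((predicted - truth) ** 2)
    return {s: math.sqrt(sum(errs) / len(errs)) for s, errs in squared_errors.items()}


# Toy values, chosen only so the arithmetic is easy to follow.
records = [
    ("sample0", 1.0, 1.0), ("sample0", 2.0, 2.0), ("sample0", 3.0, 5.0),
    ("sample1", 2.0, 1.0), ("sample1", 4.0, 2.0),
]
per_sample = rmse_per_sample(records)
for sample, value in per_sample.items():
    print(f"{sample}: RMSE = {value:.4f}")
```

Each box in the plots summarizes one such value per sample, so the spread of a box reflects how much a method's error varies between samples.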