pyproteonet.metrics.abundance_comparison.compare_columns

pyproteonet.metrics.abundance_comparison.compare_columns(dataset: Dataset | List[Dataset], molecule: str, columns: List[str], comparison_column: str, ids: Index | None = None, metric: Literal['PearsonR', 'SpearmanR', 'MSE', 'MAE', 'RMSE', 'AE'] = 'PearsonR', ignore_missing: bool = True, logarithmize: bool = True, per_sample: bool = False, replace_nan_metric_with: float | None = None) → DataFrame
Compare a set of value columns to a reference column using a single evaluation metric.

This can be used to compare imputation methods by evaluating imputation results against some form of ground truth. Evaluation can be done on a per-sample basis, i.e. the metric is computed for each sample separately and the results are returned as a dataframe with one row per sample.

Parameters:
  • dataset (Union[Dataset, List[Dataset]]) – Either a single dataset or a list of datasets to evaluate. If multiple datasets are given, they should all share the same molecule, and the results are concatenated.

  • molecule (str) – The molecule type to evaluate (e.g. ‘protein’, ‘peptide’, …).

  • columns (List[str]) – The columns to compare against the reference column.

  • comparison_column (str) – The reference column to compare against.

  • ids (Optional[pd.Index], optional) – If given, only compare values for molecules with those ids. Defaults to None.

  • metric (Literal['PearsonR', 'SpearmanR', 'MSE', 'MAE', 'RMSE', 'AE'], optional) – The evaluation metric. Defaults to ‘PearsonR’.

  • ignore_missing (bool, optional) – Whether to ignore missing values or raise an error. Defaults to True.

  • logarithmize (bool, optional) – Whether to logarithmize values before comparison. Defaults to True.

  • per_sample (bool, optional) – Whether to compute one metric per dataset sample. Defaults to False.

  • replace_nan_metric_with (Optional[float], optional) – Use a constant value in case no metric can be calculated (e.g. for PearsonR if all values are constant). Defaults to None.
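To make the interplay of metric, ignore_missing and logarithmize more concrete, here is a minimal sketch in plain Python. It is not pyproteonet's implementation: the helper name error_metric is hypothetical, only the error-based metrics MSE, MAE and RMSE are covered, and the use of the natural logarithm is an assumption (the documentation above does not state the base).

```python
import math


def error_metric(values, reference, metric="MSE",
                 ignore_missing=True, logarithmize=True):
    """Illustrative helper (hypothetical, not part of pyproteonet).

    Compares one value column against a reference column, both given
    as plain lists of floats where NaN marks a missing value.
    """
    errors = []
    for value, ref in zip(values, reference):
        if math.isnan(value) or math.isnan(ref):
            if not ignore_missing:
                raise ValueError("missing value encountered")
            continue  # skip pairs where either side is missing
        if logarithmize:
            # Assumption: natural log; requires strictly positive abundances.
            value, ref = math.log(value), math.log(ref)
        errors.append(value - ref)

    if not errors:
        return float("nan")  # no valid pairs, so no metric can be computed
    if metric == "MSE":
        return sum(e * e for e in errors) / len(errors)
    if metric == "RMSE":
        return math.sqrt(sum(e * e for e in errors) / len(errors))
    if metric == "MAE":
        return sum(abs(e) for e in errors) / len(errors)
    raise ValueError(f"metric not covered by this sketch: {metric}")


# Made-up example data: the third pair is dropped because the value is missing.
imputed = [10.0, 20.0, float("nan"), 40.0]
truth = [11.0, 19.0, 30.0, 44.0]
print(error_metric(imputed, truth, metric="MAE"))
print(error_metric(imputed, truth, metric="MAE", logarithmize=False))
```

Applying the logarithm before the comparison means the error reflects relative rather than absolute differences, which is why the two printed values differ.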

Returns:

A dataframe with one row per sample if per_sample is True, otherwise one row per value column containing the evaluation results.

Return type:

pd.DataFrame
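The difference between the two result shapes, and the effect of replace_nan_metric_with, can be sketched with plain dictionaries. This is an illustration under stated assumptions rather than the library's code: the helpers pearson and evaluate, the sample names and the column names are all hypothetical; the sketch assumes that per_sample=False pools the values of all samples before computing the metric; and it returns nested dicts, whereas the real function returns a pd.DataFrame whose exact layout is not described above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; NaN when undefined (too few points or zero variance)."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denom = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                      * sum((y - mean_y) ** 2 for y in ys))
    return cov / denom if denom > 0 else float("nan")


def evaluate(samples, columns, comparison_column,
             per_sample=False, replace_nan_metric_with=None):
    """Hypothetical stand-in for the per-sample vs. pooled evaluation modes."""
    def score(xs, ys):
        result = pearson(xs, ys)
        if math.isnan(result) and replace_nan_metric_with is not None:
            return replace_nan_metric_with  # constant fallback for undefined metrics
        return result

    if per_sample:
        # One entry ("row") per sample, one score per value column.
        return {name: {col: score(table[col], table[comparison_column])
                       for col in columns}
                for name, table in samples.items()}

    # Assumption: pooled mode concatenates all samples, one entry per value column.
    pooled = {}
    for col in columns:
        xs = [x for table in samples.values() for x in table[col]]
        ys = [y for table in samples.values() for y in table[comparison_column]]
        pooled[col] = score(xs, ys)
    return pooled


# Made-up data: "imputed_b" is constant in sample_1, so PearsonR is undefined there.
samples = {
    "sample_1": {"truth": [1.0, 2.0, 3.0],
                 "imputed_a": [1.1, 2.1, 2.9],
                 "imputed_b": [5.0, 5.0, 5.0]},
    "sample_2": {"truth": [2.0, 4.0, 6.0],
                 "imputed_a": [2.2, 3.9, 6.1],
                 "imputed_b": [1.0, 2.0, 3.0]},
}
print(evaluate(samples, ["imputed_a", "imputed_b"], "truth", per_sample=True,
               replace_nan_metric_with=0.0))
print(evaluate(samples, ["imputed_a", "imputed_b"], "truth"))
```

Replacing an undefined metric with a constant keeps per-sample results numeric, which can simplify later averaging, but the constant should be chosen with care because it is indistinguishable from a genuinely computed score.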