External validation of predictive models of sexual, urinary, bowel and hormonal function after surgery in prostate cancer subjects

Background: In 2020, a research group published five longitudinal linear models that predict Expanded Prostate Cancer Index Composite-26 (EPIC-26) scores after treatment with radical prostatectomy, external beam radiotherapy or active surveillance, analysed collectively, in US patients with localized prostate cancer.
Methods: Our study externally validates the five prediction models for patient-reported outcomes after surgery for localized prostate cancer. The models' calibration, fit, variance explained and discrimination (concordance indices) were assessed. Two Australian validation cohorts, at 1 and 2 years post-prostatectomy, were constructed, consisting of 669 and 439 subjects, respectively (750 in total). Patient-reported function was assessed with the EPIC-26 questionnaire in five domains post-prostatectomy: sexual, bowel, hormonal, urinary incontinence and other urinary dysfunction (irritation/obstruction).
Results: At 1 year post-surgery, R² was highest for the sexual domain (35%, SD = 0.02), lower for the bowel (21%, SD = 0.03) and hormone (15%, SD = 0.03) domains, and close to zero for urinary incontinence (1%, SD = 0.01) and irritation/obstruction (−5%, SD = 0.04). Calibration slopes for these five models were 1.04 (SD = 0.04), 0.84 (SD = 0.06), 0.85 (SD = 0.06), 1.16 (SD = 0.13) and 0.45 (SD = 0.04), respectively. Calibration-in-the-large values were −2.2 (SD = 0.6), 2.1 (SD = 0.01), 5.1 (SD = 0.1), 9.6 (SD = 0.9) and 4.0 (SD = 0.2), respectively. Concordance indices were 0.73, 0.70, 0.70, 0.58 and 0.62, respectively (all had SD = 0.01). Mean absolute error and root mean square error were similar across the validation and development cohorts. The validation measures were largely similar at 2 years post-surgery.
Conclusions: The sexual, bowel and hormone domain models validated well and show promise for accurately predicting patient-reported outcomes in a non-US surgical population. The urinary domain models validated poorly and may require recalibration or revision.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12894-023-01373-9.


Introduction
Prostate cancer is the most common cancer globally in males, with an estimated 16,741 new cases in Australia in 2020 [1,2]. Prostatectomy is a standard-of-care treatment for localized prostate cancer [3]. However, prostate cancer often progresses slowly without noticeable symptoms [2], and patients with localized prostate cancer commonly die from other causes [4]. This complicates the rationale for treating prostate cancer. Treatment choices should incorporate both patient and clinical factors and avoid overtreatment [4]. The adverse effects of prostatectomy on quality of life, including erectile dysfunction and urinary incontinence, may outweigh the benefit of treatment [4]. Hence patient-reported outcome measures (PROMs) may be as important as clinical outcomes such as survival. PROMs can provide a more holistic interpretation of treatment benefit and improve clinical decision making, patient-doctor communication, and patient outcomes [5][6][7].
Formal prediction models can assist clinicians in accurately predicting PROMs post-prostatectomy to determine the potential benefit of the procedure [8]. Laviana et al. recently developed five publicly available online models to predict PROMs after three potential treatment options in pre-treatment US patients with clinically localized (stage T1-T2) prostate cancer [9,10]. The treatment options were radical prostatectomy (RP), external beam radiation therapy (EBRT) and active surveillance (AS); the three were analysed collectively. The models reported good calibration and model fit [9]. External validation is important to determine whether the models are useful for predicting outcomes in populations similar to, but different from, the development population, and whether they can be applied individually to each specific treatment [11]. We were also concerned that one treatment (RP) is historically associated with higher rates of sexual dysfunction and incontinence, whereas another (AS) potentially has very little incidence of either [12]. Given that the three therapies were analysed collectively and not individually, this study externally validates the online tool and determines its utility at 1 and 2 years post-surgery.

Development models
Laviana et al. developed five longitudinal linear regression models of outcomes after treatment for localized prostate cancer [9]. Each model estimated one EPIC-26 domain score: sexual function, urinary incontinence, urinary irritation/obstruction, bowel function or hormonal function. Scores ranged from 0 to 100, where higher numbers indicate better outcomes [13]. Laviana et al. excluded the last sexual domain question (utilizing five questions instead of six); we validate the 5-question sexual domain model in the main analysis and the 6-question model as a supplementary analysis. Predictor variables were time post-treatment, primary treatment choice (RP, AS or EBRT), age at diagnosis, race, baseline domain score, overall health, prostate specific antigen (PSA) and Gleason score. Overall health was assessed with the SF-36 Health Survey question "In general, would you say your health is Poor, Fair, Good, Very good or Excellent?" [14]. The exclusion criteria were receiving androgen deprivation therapy, brachytherapy or cryoablation, age > 80, PSA ≥ 50 ng/dL and non-localized prostate cancer. Laviana et al. analysed all three primary treatment choices collectively in the same models.

Validation cohort
The South Australian Prostate Cancer Clinical Outcomes Collaborative (SA-PCCOC) database is a population-based registry currently recruiting over 90% of newly diagnosed prostate cancer cases in the state of South Australia, including over 19,000 men [15]. Established in 1998, the registry collects data from collaborating public and private institutions and clinicians [15,16]. PROMs are collected before treatment commences and at 3, 6, 12, 24 and 60 months after primary treatment. Participants were mailed paper copies of the Expanded Prostate Cancer Index Composite-26 (EPIC-26) questionnaire and a reply-paid envelope for response.
We sought to verify the Laviana et al. models specifically for RP and followed their inclusion criteria in accordance with the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) checklist (Supplementary Table 1) [17]. De-identified data were obtained from SA-PCCOC for post-prostatectomy patients diagnosed between January 2007 and March 2020. The registry does not record participant race; hence we assumed all subjects were Caucasian, which likely matched both the majority of SA-PCCOC patients and 74% of subjects in Laviana's development cohort [9]. Patients were included only if they completed both a pre- and a post-surgery EPIC-26 questionnaire. Post-surgery assessments were required to be within 8-16 or 20-28 months post-prostatectomy, defining a 1-year and a 2-year validation cohort, respectively. If a patient completed multiple assessments within a time range, only the one closest to the target time was included. The 1- and 2-year survey time points are those where SA-PCCOC has the largest quantity of complete data.
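The assessment-selection rule described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' code (their analyses were run in R). The function name `pick_assessment`, the `WINDOWS` table, the inclusive window bounds and the target times of 12 and 24 months are assumptions made for illustration.

```python
from typing import Optional, Sequence

# Assumed windows in months post-prostatectomy: (lower bound, upper bound, target).
# The 8-16 and 20-28 month ranges come from the text; treating the bounds as
# inclusive and the targets as 12 and 24 months is an assumption.
WINDOWS = {
    "1-year": (8.0, 16.0, 12.0),
    "2-year": (20.0, 28.0, 24.0),
}


def pick_assessment(months_post_surgery: Sequence[float], cohort: str) -> Optional[float]:
    """Return the assessment time closest to the cohort target, or None if
    the patient has no assessment inside the cohort window."""
    low, high, target = WINDOWS[cohort]
    in_window = [m for m in months_post_surgery if low <= m <= high]
    if not in_window:
        return None  # patient does not contribute to this cohort
    # If two assessments are equally close, the earlier-listed one is kept.
    return min(in_window, key=lambda m: abs(m - target))


# A hypothetical patient with assessments at 3, 9, 13 and 22 months.
visits = [3.0, 9.0, 13.0, 22.0]
one_year = pick_assessment(visits, "1-year")
two_year = pick_assessment(visits, "2-year")
```

Under this rule a single patient can contribute to both cohorts, which is consistent with the 1-year and 2-year cohort sizes summing to more than the total number of subjects.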
SA-PCCOC data are not publicly available but can be obtained after establishing a Data Use Agreement with SA-PCCOC, which has permission from the Southern Australian Clinical Human Research Ethics Committee (SALHN HREC) to authorize data access.
Data collection by SA-PCCOC has been approved by the SALHN HREC (protocol 307.14).

Statistical methods
Model predictions were assessed for calibration, i.e. the agreement between the observed endpoints and predictions (calibration-in-the-large (CL) and calibration slope (CS)); predictive accuracy (mean absolute error (MAE) and root mean square error (RMSE)); proportion of variance explained ($R^2$); and discrimination, i.e. the effectiveness at predicting which scores would be better or worse (concordance index, abbreviated as c-index) [11,18]. CL was estimated as the intercept in a regression with the predicted EPIC-26 score as an offset, and CS was the estimated slope of the predicted score in a univariable regression. Theoretically perfect CL and CS values are 0 and 1, respectively. Because a large proportion of the EPIC-26 domain scores were close to either the upper (100) or lower (0) boundary, the two calibration metrics were calculated using quantile (median) regression, with linear (mean) regression reported as a supplementary analysis. Predicted versus observed EPIC-26 scores are shown graphically using locally weighted smoothing regressions superimposed over scatter plots. Writing $y_i$ for the observed score of subject $i$, $\hat{y}_i$ for the predicted score, $\bar{y}$ for the mean observed score and $n$ for the number of subjects, $R^2$ is calculated as one minus the ratio of residual to total variance, i.e. $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$ [18]. For predictive models, $R^2$ can be negative [18]. MAE and RMSE are defined as $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$ and $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$. The c-index is equal to the probability that, for two randomly selected subjects after surgery, the subject with the higher predicted score has the higher observed score rather than the lower one, i.e. $P\left((\hat{y}_i - \hat{y}_j)(y_i - y_j) > 0 \mid y_i \neq y_j\right)$ [19]. CL and discrimination had not previously been reported for these prediction models.
The registry database had several patients with unreported pre-surgery PSA or general health assessment; hence multiple imputation using chained equations was performed (100 datasets generated with 100 iterations each) [20]. Missing EPIC-26 domain scores were not imputed. For each validation statistic, means (standard deviations) and medians (interquartile ranges) are reported across the 100 imputed datasets. Analyses were performed in R version 4.0.5 using the packages mice and quantreg [21][22][23].
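The validation statistics defined above can be sketched in plain Python. This is an illustrative sketch only: the authors' analyses were run in R, and all function names here are hypothetical. The calibration slope is omitted because it needs a quantile-regression solver. For calibration-in-the-large, the sketch relies on the fact that an intercept-only median regression with the prediction as an offset is minimized by a median of the residuals. Pairs with tied predictions are counted as non-concordant, following the strict inequality in the c-index definition; other conventions exist.

```python
import math
from itertools import combinations
from statistics import mean, median


def r_squared(obs, pred):
    """One minus the ratio of residual to total sum of squares (can be negative)."""
    y_bar = mean(obs)
    ss_res = sum((y - p) ** 2 for y, p in zip(obs, pred))
    ss_tot = sum((y - y_bar) ** 2 for y in obs)
    return 1 - ss_res / ss_tot


def mae(obs, pred):
    """Mean absolute error."""
    return mean([abs(y - p) for y, p in zip(obs, pred)])


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(mean([(y - p) ** 2 for y, p in zip(obs, pred)]))


def c_index(obs, pred):
    """Among pairs with different observed scores, the fraction whose
    predictions are ordered the same way as the observations."""
    concordant = 0
    usable = 0
    for (y_i, p_i), (y_j, p_j) in combinations(list(zip(obs, pred)), 2):
        if y_i == y_j:
            continue  # pairs with tied observations are excluded
        usable += 1
        if (p_i - p_j) * (y_i - y_j) > 0:
            concordant += 1  # tied predictions are not counted as concordant
    return concordant / usable


def calibration_in_the_large_median(obs, pred):
    """Intercept of an intercept-only median regression with the prediction
    as an offset, which equals the median of the residuals."""
    return median([y - p for y, p in zip(obs, pred)])


# Hypothetical observed and predicted domain scores for four subjects.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
stats = {
    "R2": r_squared(observed, predicted),
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "c-index": c_index(observed, predicted),
    "CL": calibration_in_the_large_median(observed, predicted),
}
```

In a multiply imputed analysis such as the one described, these statistics would be computed once per imputed dataset and then summarized across datasets.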

Validation cohorts
Between January 2007 and March 2020, the registry recorded 1848 patients having prostatectomy for localized prostate cancer, 886 of whom completed both a pre- and a post-surgery EPIC-26 questionnaire. A total of 750 subjects satisfied the inclusion criteria, including 669 and 439 subjects at 1 and 2 years post-surgery, respectively (Fig. 1). Compared with the development cohort of Laviana et al., our study's two validation cohorts had slightly higher, but generally similar, age, PSA values and Gleason scores (Table 1). In the development cohort, 52% of subjects had a Gleason score ≤ 6, whereas half of the 1-year and 2-year validation cohorts had a score of 3 + 4 (49 and 50%, respectively), and the proportion of subjects with Gleason scores of 4 + 3 or ≥ 8 in these two cohorts was at least twice that in the development cohort.
Table 2 summarizes the validation cohort EPIC-26 domain scores pre- and post-surgery. The mean pre-surgery sexual scores ranged from 61.2 to 63 and were considerably higher than their post-surgery equivalents (ranging from 29.1 to 37.1). The 1-year and 2-year pre-surgery average urinary incontinence scores of 93.3 and 93.1, respectively, were also higher than the post-surgery scores of 74.6 and 76.1, respectively. However, the 1-year and 2-year pre-surgery mean urinary irritation/obstruction scores of 86.7 and 85.8, respectively, were slightly lower than the post-surgery scores of 92.4 and 93.3, respectively. The pre-surgery and post-surgery scores were similar across the bowel and hormone domains. The scores across the 1-year and 2-year cohorts were very similar; the largest difference was a post-surgery sexual

PROMs -sexual function
Of the five models, the sexual domain models exhibited the most accurate predictions (Table 3 and Fig. 2). At 1 and 2 years post-prostatectomy, the 5-question sexual domain models had the highest variance explained (R² = 0.35, SD = 0.02, respectively) and discrimination (c-index = 0.73 and 0.72, SD = 0.01 and 0.01, respectively), small bias (CL = −2.2 and −2.3, SD = 0.6 and 0.7, respectively), and calibration slopes close to one (CS = 1.04 and 1.08, SD = 0.04 and 0.03, respectively). The magnitude of the differences between predictions and observations was large (MAE = 19.6 and 20.4, SD = 0.4 and 0.4; RMSE = 24.4 and 25.5, SD = 0.4 and 0.4, respectively) compared to the other models. However, this was because the observations were further away from the minimum (0) and maximum (100) score values.
The 1-year and 2-year values were approximately similar in all domains. The model fit measures (RMSE and MAE) were similar to or slightly smaller than the equivalent values in the development cohort, with the exception of the urinary incontinence domain, which had slightly larger values. All models exhibited lower R² than in the development cohort. Supplementary Table 2 presents the supplementary results. The 6-question sexual domain model statistics were all similar to the 5-question statistics. The calibration statistics estimated using linear (mean) regression instead of quantile (median) regression were similar in every domain except urinary incontinence.

Discussion
This study externally validates a recently published online tool designed to predict patient symptoms after treatment for localized prostate cancer. The analysis shows excellent predictive performance of the model for sexual function at both the 1-year and 2-year assessments post-prostatectomy, reasonable performance for bowel and hormonal function, and poor predictive value for urinary incontinence and irritation/obstruction. In particular, these models demonstrated good calibration and discrimination. The online tool would be beneficial for predicting sexual, bowel and hormonal outcomes following prostatectomy. However, the hormonal predictions are not germane to surgery; as seen in Table 2, there is no difference between the three time points. Thus, using the tool to predict post-RP hormonal symptoms has little clinical application despite our validation. The same can be concluded for bowel or rectal function; the descriptive statistics for the bowel domain displayed little change following prostatectomy. In contrast, there was a large difference in sexual function before and after surgery.
The model performance was worse in the validation cohorts than in the development cohort. Although this is to be expected with validation cohorts, the discrepancy could be partially explained by a number of factors. Sexual function is particularly impeded by prostatectomy compared to other domains [12], as shown by the differences between the pre- and post-surgery scores. Our analysis only included surgery, whereas Laviana et al. also included radiotherapy and active surveillance collectively, treatment modalities associated with relatively smaller decreases in sexual scores post-treatment [9]. Compared to EBRT or AS, surgery has been associated with worse urinary incontinence but improved urinary irritation/obstruction [12,24], which matches the pre- and post-surgery domain scores in the validation cohorts. This may explain our observed poor correlation for urinary incontinence and irritation/obstruction given that we examined RP alone; the inclusion of AS and EBRT in the development cohort likely resulted in a substantial bias towards the null for surgical subjects. Although the original models considered three different treatments simultaneously, models stratified by treatment may be more specific and meaningful for the relevant therapy. Differences in RP surgical approaches between the two cohorts could also result in differences in PROM outcomes. For example, higher proportions of nerve-sparing surgeries may be associated with improved erectile function and urinary continence [4,25]. However, the inclusion of AS and EBRT in the development cohort but not the validation cohort is likely to produce larger differences between the cohorts than the choice of RP surgical technique. Differences in model performance can also result from underlying population differences. Although the development and validation cohorts' summary statistics appear reasonably similar, the validation cohorts had on average higher Gleason scores and, to a small extent, higher age and PSA. These are all risk factors associated with worse outcomes [26]. RP is more likely to be offered to higher-risk patients compared to AS [27], and in particular, AS patients generally have Gleason scores ≤ 6 [28]. Hence the validation cohorts, which only include subjects who had RP, were likely to have overall stronger risk factors compared to a cohort that includes subjects who elected for AS. Model performance can also vary because of differences in the US and South Australian healthcare systems and socioeconomic factors, which have not been addressed in these analyses. Underlying population differences were likely given the lower R² values in the validation cohorts. However, the RMSEs and MAEs were similar between the cohorts. These two statistics are less sensitive to population differences and indicate a similar model fit with both data sets despite their differences [29]. This also suggests that the original models were unlikely to be overfitted to the development population.
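The contrast drawn above between R² and the error-based statistics can be illustrated with a small numerical sketch. The scores below are invented purely for illustration and are not study data; the function name `r2_and_rmse` is hypothetical. The same prediction errors are applied to a population with widely spread scores and to one with scores clustered near the upper boundary.

```python
import math


def r2_and_rmse(obs, pred):
    """Return (R-squared, RMSE) for paired observed and predicted scores."""
    n = len(obs)
    y_bar = sum(obs) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(obs, pred))
    ss_tot = sum((y - y_bar) ** 2 for y in obs)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n)


# Identical prediction errors in both hypothetical populations.
errors = [5, -5, 5, -5]

# Population with widely spread observed scores.
wide_obs = [20, 40, 60, 80]
wide = r2_and_rmse(wide_obs, [y - e for y, e in zip(wide_obs, errors)])

# Population with observed scores clustered near the upper boundary.
narrow_obs = [80, 85, 90, 95]
narrow = r2_and_rmse(narrow_obs, [y - e for y, e in zip(narrow_obs, errors)])

# wide   -> R2 = 0.95, RMSE = 5.0
# narrow -> R2 = 0.20, RMSE = 5.0
```

The RMSE is identical in both cases because the errors are identical, whereas R² falls sharply when the observed scores vary less, since the same residual variance is compared against a smaller total variance.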
Our study's strengths include a large validation cohort collected across 13 years. The registry has data on most newly diagnosed prostate cancers in South Australia [15], reflecting the high population-level recruitment inherent in disease-specific registries. By validating the development models in a surgery-only cohort, this study's results potentially better reflect the prevalence of PROMs in surgical patients, particularly for sexual
SA-PCCOC does not record racial group, and both PSA and general health had substantial missingness; these are the most pronounced study limitations and may obscure underlying population differences between the cohorts. However, multiple imputation was implemented to minimise the impact of missing data. As subjects with advanced prostate cancer were excluded, we believe it is unlikely that missingness was greatly affected by the effect of prostate cancer on subjects' health. Thus, assuming the data were missing at random is reasonable, minimizing potential bias from the multiple imputation. Furthermore, omitting racial group would likely have only a small effect. Racial group in the development models only created a difference in the predicted scores of up to 3.8 in 99% of subjects for the sexual and urinary incontinence domains (races of "White", "Black", "Hispanic", and "Asian"), and even smaller differences, no greater than 2.4, for all subjects in the other three domains [9]. These differences are all below the minimum clinically important difference (MCID) for the EPIC-26 instrument [30]. However, our models may have underestimated urinary incontinence scores for the other 1% of subjects (the "Other" racial group), whose predicted score increase in this domain (8.5) exceeded the MCID. The visits assessed in this study differed from those in the original study, adding a 2-year visit but omitting visits at 6 months, 3 years and 5 years. Although caution should be exercised regarding the models' validity in non-US populations at these timepoints, the generally good performance of these models at 1 and 2 years suggests that the validated models will perform well across the full 5-year period analysed in the development study. A study observed similar mean EPIC-26 domain scores at 2 and 3 years in a large cohort of US patients post-prostatectomy almost identical to that of the development cohort [24].

Conclusions
The examined models perform well for predicting sexual and, to a lesser extent, bowel and hormonal symptoms at 1 and 2 years post-surgery in our prostate cancer patients. However, they perform poorly in predicting urinary incontinence and irritation/obstruction. The urinary domain models may benefit from recalibration or revision to better predict PROMs post-prostatectomy in non-US populations. The models for the other three domains are likely suitable for non-US populations and should be considered for implementation into clinical practice.

Fig. 2
Predicted vs observed EPIC-26 scores at 1 year (top row) and 2 years (bottom row) post-prostatectomy for the first imputed set. The blue solid line is a locally weighted smoothing fit with 95% confidence intervals (blue dashed lines); the black diagonal dashed line is the line of perfect prediction.

Table 1
Summary statistics for patient age and disease characteristics in the development and two validation cohorts. Demographics for the development cohort from Laviana et al. and the validation cohorts (1 year and 2 years post-surgery) from this study, from SA-PCCOC. Only subjects treated with prostatectomy alone are included.

Table 2
Pre-and post-surgery mean EPIC-26 scores of the two validation cohorts

Table 3
Mean (standard deviation) of the validation statistics from 100 imputed datasets for 1- and 2-year post-prostatectomy score predictions. Abbreviations: C-index, concordance index; CL, calibration-in-the-large; CS, calibration slope; MAE, mean absolute error; RMSE, root mean squared error; SD, standard deviation.