Prediction of cervical cancer recurrence using textural features extracted from 18F-FDG PET images acquired with different scanners

Objectives: To identify an imaging signature predicting local recurrence for locally advanced cervical cancer (LACC) treated by chemoradiation and brachytherapy from baseline 18F-FDG PET images, and to evaluate the possibility of gathering images from two different PET scanners in a radiomic study. Methods: 118 patients were included retrospectively. Two groups (G1, G2) were defined according to the PET scanner used for image acquisition. Eleven radiomic features were extracted from delineated cervical tumors to evaluate: (i) the predictive value of features for local recurrence of LACC, (ii) their reproducibility as a function of the scanner within a hepatic reference volume, (iii) the impact of voxel size on feature values. Results: Eight features were statistically significant predictors of local recurrence in G1 (p < 0.05). The multivariate signature trained in G2 was validated in G1 (AUC = 0.76, p < 0.001) and identified local recurrence more accurately than SUVmax (p = 0.022). Four features were significantly different between G1 and G2 in the liver. Spatial resampling was not sufficient to explain the stratification effect. Conclusion: This study showed that radiomic features could predict local recurrence of LACC better than SUVmax. Further investigation is needed before applying a model designed using data from one PET scanner to another.


INTRODUCTION
Cervical cancer is a significant cause of morbidity and mortality, being the fourth most common cancer in women worldwide and the sixth in Europe [1]. The five-year survival rate strongly depends on the stage at the time of diagnosis, based on the International Federation of Gynecology and Obstetrics (FIGO) classification: from 93% for stage IA to 15% for stage IVB [2]. Castelnau-Marchand et al. studied clinical outcomes of chemoradiation followed by image-guided adaptive brachytherapy in 225 patients with cervical cancer (65% with a FIGO stage ≥ IIB) and observed a local recurrence rate of 13% [3].
Several factors have been associated with the probability of local control, such as the tumor size at diagnosis, the volume of the high-risk clinical target volume at the time of brachytherapy, or the overall treatment time. However, in the era of image-guided adaptive brachytherapy, it remains necessary to refine the prediction of outcome, and in particular to better identify patients who are at high risk of local recurrence and who would require treatment intensification, such as dose escalation, or who would be candidates for clinical trials of radiosensitizing agents. Conversely, it might be clinically relevant to identify patients with a lower risk of local relapse, who are therefore less likely to benefit from dose escalation.
Medical imaging plays a key role in the initial evaluation and staging of patients, guiding subsequent treatment decisions. Magnetic Resonance Imaging (MRI) is the reference standard for the pre-therapeutic assessment of the T-stage of gynecological tumors because it provides high resolution, high soft-tissue contrast and functional imaging [4]. 18F-FDG PET (18F-Fluorodeoxyglucose Positron Emission Tomography) has proved to be more accurate for the detection of metastatic lymph nodes [5] and allows the evaluation of glucose consumption and metabolic activity within the tumor, which provides important prognostic information in patients treated with chemoradiation [6]. The literature reports accuracies of 76 to 90% for staging of Locally Advanced Cervical Cancer (LACC) using MRI [7], compared to 60 to 69% for CT (Computed Tomography) [8]. However, even though conventional semi-quantitative PET indices such as Standardized Uptake Value (SUV), Metabolic Volume (MV) or Total Lesion Glycolysis (TLG) have proved useful as prognostic factors in LACC [6,9], extracting more information from images holds promise to further increase the prognostic value of PET.
"Radiomics" is a non-invasive method of quantitative analysis of high throughput imaging traits that was first introduced to decode genomic activity of tumors [10,11] and then applied to all pathologies and imaging modalities [12,13]. Radiomics has great potential to influence patient care [14], from aiding diagnosis and classification of tumors by stage or histology [15][16][17], through the prediction of responses to radiotherapy [18] or to chemotherapy [19]. This technique can also be used to guide radiation therapy [20,21]. Among the features used in this approach, textural features are of particular interest to characterize tumor heterogeneity.
Several studies previously carried out PET texture analysis to predict recurrence and response to chemoradiation in LACC. In 2009, El Naqa et al. published a study in 14 patients [22], showing that textural features calculated from the co-occurrence matrix had a higher predictive power than SUV for determining chemo-radiation failure risk, and that a linear combination of two features led to an Area Under the ROC (Receiver Operating Characteristic) Curve (AUC) of 0.76 for the prediction of response to chemotherapy. The change of textural features during chemoradiation was investigated by Yang et al. [23], showing that several features calculated from the gray-level run length and zone size matrices were significantly different in the baseline and post-treatment images for complete metabolic responders (CMR), with p < 0.001. However, neither SUV indices nor textural features could differentiate CMR from partial metabolic responders or progressive tumors, possibly because of the low number of patients (n = 20). In a third study including 90 patients treated with chemoradiation, the same group demonstrated the better performance of four textural features to distinguish CMR vs. non-CMR (p < 0.05) compared to the performance of SUV [24].
The aim of our study was to identify a radiomic signature measured on baseline 18F-FDG PET images predictive of local recurrence after chemoradiation and brachytherapy in LACC. A second objective was to assess the robustness of selected non-redundant textural features with respect to the PET scanner used to acquire the images.

Textural features for predicting local recurrence
Mean tumor volumes were not significantly different between groups (p = 0.2), with 39.9 ± 26.1 mL for G1 and 39.0 ± 38.9 mL for G2. No significant differences were observed in the distribution of histology and stage between the groups (Table 1). A total of 39 cases of local recurrence (G1: 28, G2: 11) were clinically identified during the median follow-up time of 3 years.
Eight features identified patients who later showed local tumor recurrence in G1 (Table 2): SUVmean, SUVmax, SUVpeak, MV, LGZE and HGZE (p < 0.05), and TLG and Entropy (p < 0.01). None of the computed features differed significantly between relapsing and non-relapsing patients in G2, but Entropy had the lowest p-value (p = 0.052) and the highest AUC (0.70). No feature could distinguish relapsing from non-relapsing patients in an artificially reduced cohort from G1 (reduced to include the same number of patients as G2). DeLong's test showed a statistically significant difference between the AUC of SUVmax and the AUC of the signature trained on G2, applied to both groups (p = 0.022 for G1, p = 0.030 for G2; Figure 1).

Influence of PET characteristics and voxel size on texture feature values
Three patients were excluded from the whole cohort for liver analysis because no whole-body PET-CT image showing the liver was available. In the original images, two conventional features (SUVmax, SUVpeak) and two textural features (Homogeneity, Entropy) calculated in the liver were significantly different between the two PET scanners according to Wilcoxon's test (Table 3, Figure 2).
The ability of spatial image resampling to remove the device effect was also investigated on liver data. Resampling onto the G1 grid, the G2 grid or an isotropic grid of 2 mm voxels did not eliminate the stratification effect. In all cases, at least five features were significantly different between the two cohorts (p < 0.001) (Table 3).

Prediction of recurrence
This study, performed in two cohorts of patients scanned with PET scanners having different properties, showed that several textural features were predictive of recurrence in a cohort scanned with the same PET machine (G1), and that a multivariate signature trained in G2 identified local recurrence with AUCs of 0.86 in the training set and 0.76 in the validation set. This signature was better than SUVmax when applied to both training and validation sets (p < 0.05). Although several studies discovered associations between tumor heterogeneity as reflected by PET texture indices and treatment outcomes, special attention should be paid to the methodology [25]. As shown in a study [26] comparing tumor delineation using the Nestle adaptive method [27] and a fixed threshold set to 40% of SUVmax, different contours lead to differences in some textural feature values. As there is no consensus as to which segmentation method should be used, these authors studied the sensitivity of features to the segmentation and identified a list of robust features for three types of cancer. Further, the correlation between indices is of foremost importance. Starting from 41 indices calculated in three cohorts, they identified groups of highly correlated features (r > 0.80), some of which were highly correlated with MV [26]. In our study, only features selected for their robustness with respect to the segmentation and resampling methods were used for the processing.
Volumes of VOI-T were highly variable (39.8 ± 30.8 mL). As explained by Orlhac et al., the absolute resampling of SUV values avoids the volume dependence of textural features, as opposed to relative resampling [28]. Another advantage of absolute resampling is the better distinction between tissues, and an intuitive variation in textural feature values [29]. Another team highlighted the impact of SUV discretization in such radiomic studies [30]. The authors showed that a fixed bin size (i.e. absolute resampling between two fixed boundaries) yields more comparable feature values between patients, although many teams used relative resampling with adaptive boundaries depending on the minimum and maximum values in each tumor [23,31]. In our study, eight features were identified as predictive of local tumor recurrence in G1 but none in G2, possibly because of the low number of patients, as suggested by the results of the univariate analysis in a subset of G1 including the same number of patients as G2. Still, the highest AUC and the lowest p-value in G2 occurred for Entropy. The AUCs corresponding to Entropy were slightly higher than those found by Yang et al. [24] in both groups (0.70 vs. 0.66), but the difference was not significant. In a previous study, Entropy calculated on PET images was significantly different between regions presenting different patterns of cells identified on pathological slices, suggesting that PET texture analysis captures the cellular heterogeneity of the tumor and might provide additional information on tumor aggressiveness [32]. Unlike [24], our results showed a statistically significant correlation between SUV values and response to treatment, as previously reported [6,33]. However, patient demographics were different between our study and [24], with their study having a higher number of advanced stages. Also, PET acquisition parameters and gray-level value resampling were different.
Regarding the identification of a signature for predicting recurrence, we chose to limit the number of parameters in the model by using AIC, a user-independent criterion for automatic model selection. A model containing too few parameters results in a low variance at the expense of high statistical bias for the fit parameters, whereas a model containing too many parameters may overfit the data, resulting in a low bias at the expense of high variance. We showed that a combination of a few radiomic features selected by an algorithm that limits overfitting is predictive of treatment outcome. To our knowledge, this is the first study evaluating a combination of textural and SUV-based features computed on PET images to predict local recurrence of LACC with a validation set of images acquired on a separate device. Our results showed a difference between groups (Figure 1): the highest AUC was achieved using Entropy, SUVmean, SUVmax and SRE in G1 (AUC = 0.77), and using SUVpeak, Homogeneity, LGZE and HGZE in G2 (AUC = 0.86). The highest AUCs were reached using G2 as training set and G1 as validation set (AUC = 0.86 in G2, AUC = 0.76 in G1), and the G2 signature predicted tumor recurrence significantly better than SUVmax in both sets. These results suggest that devices characterized by higher resolution and sensitivity are more sensitive to tissue heterogeneity and lead to a more precise multivariate signature of recurrence. Other parameters such as injected activity, time per bed position and field of view could likewise have influenced the radiomic signature in favor of the G2 PET device. An independent validation group will be necessary to strengthen the validity of this signature, but the modernization of PET scanners and the standardization of acquisition protocols through accreditations such as the EARL (European Association of Nuclear Medicine Research Ltd) FDG-PET/CT accreditation proposed by the European Association of Nuclear Medicine (EANM) are encouraging for performing multi-center PET radiomic studies, as suggested by Lasnon et al. [34].

Device-dependence of radiomic features
In our study, we proposed a method to evaluate the variability of radiomic features between PET scanners, using patient data only. Both Entropy and SUVmax showed high differences between G1 and G2 in VOI-L with p ≤ 0.001. A recent article focused on the robustness of 68Ga-DOTANOC PET-based textural features when using various reconstruction settings on a single PET device to simulate a multicenter study [35]. Although the use of one PET device only may not reflect the full heterogeneity that may exist between multicenter scanners, the authors identified six parameters presenting less variability than SUVmax as a function of the reconstruction settings: Homogeneity, Entropy, Dissimilarity, HGRE (High Gray-level Run Emphasis), HGZE and ZP (Zone Percentage). Among these features, Homogeneity was highly robust with respect to the number of iterations, post-filtering level and reconstruction algorithm compared to SUV.
The impact of reconstruction settings was also investigated in retrospective studies [36,37]. Yan et al. focused on the impact of Point Spread Function (PSF) modeling within the reconstruction, use of Time Of Flight (TOF) information, iteration number, grid size and Full Width at Half Maximum (FWHM) of the Gaussian filter on textural and first-order features [36]. Most features had a coefficient of variation (COV) lower than 20% across different reconstruction algorithms. A high COV was found for Homogeneity and SRE when varying grid size. This result is consistent with our study in VOI-L showing a high variability across the two devices. According to Yan et al., the grid size had the largest impact on feature values, whereas the FWHM and the iteration number had a lower impact. In our study, the analysis of VOI-L after resampling on the G1 or G2 grid showed that the stratification effect was not only due to voxel size. The same results were found using an isotropic grid of 2 mm × 2 mm × 2 mm voxel size. The technological differences between devices in terms of spatial resolution and sensitivity appear to be a limiting factor for applying a model derived using data from a PET scanner to data measured on a different PET scanner, or for pooling data from different scanners for model identification.
Figure 3: Radiomic feature extraction pipeline.
www.impactjournals.com/oncotarget
The device-dependence of radiomic features and their variability according to injected activity and acquisition parameters may explain the decrease in AUC between the training and validation sets. In the AIC stepwise algorithm, all 11 features were initially used to elaborate the multivariate signature, including those identified to be device-dependent according to the liver study (G1: SUVmax, Entropy; G2: SUVpeak, Homogeneity). In the literature, unlike CT studies including multicenter data [38], validation of PET radiomic signatures was mostly performed on subsets of patients from the initial cohort, acquired on the same device with the same acquisition parameters.
Therefore, it is necessary to investigate the reproducibility of radiomic features between devices. This comparison can be performed in a uniform 18F-FDG-filled phantom or in the healthy liver of patients. This latter method, used in our study, is particularly useful for retrospective studies where phantom images are no longer available.
There are several limitations in this study, especially its retrospective design and single center nature, which are sources of biases. Moreover, a higher local recurrence rate was observed in this cohort compared to previous cohorts from our institution [3]. This difference might be a consequence of scheduling considerations in local PET acquisitions, as the priority for nuclear imaging in our institution depends on the evolution and aggressiveness of the disease.

Patient cohort
118 patients treated between 2005 and 2014 in our institution were retrospectively included in this study (Table 1). This project was reviewed and approved by the Institutional Review Board.
The inclusion criteria were as follows: (i) histologically confirmed LACC, (ii) no surgery performed except for para-aortic lymph node dissection as surgical staging, (iii) no cervical conization performed before baseline PET-CT acquisition (due to the risk of inflammation), (iv) squamous carcinoma or adenocarcinoma histological subtypes, (v) minimum follow-up period of 15 months after external beam radiation therapy in patients without recurrence. Treatment consisted of concurrent chemoradiation followed by brachytherapy. 3D conformal external beam radiotherapy was delivered in 25 daily fractions of 1.8 Gy each to reach a total dose of 45 Gy to the pelvis ± the para-aortic area, depending on the results of the primary para-aortic surgical staging. This was followed for all patients by a pulsed-dose-rate image-guided adaptive uterovaginal brachytherapy boost, delivering 15 Gy to 95% of the intermediate-risk clinical target volume [3,39]. Concomitant chemotherapy was systematically administered and the standard regimen was cisplatin 40 mg/m² weekly, five times during external radiotherapy delivery, with a sixth cycle administered during brachytherapy. After treatment, patients were evaluated at 6 weeks using a pelvic MRI and a clinical examination. In case of complete response, they were then followed every 3 months for 3 years, then every 6 months for the following 2 years, and yearly thereafter. Biopsies were performed in case of non-metastatic MRI-based relapse suspicion.
No PSF modeling was introduced in the reconstructions for either scanner. The voxel size was 5.3 mm × 5.3 mm × 3.4 mm (matrix size: 128 × 128, 4 min/bed position) for G1 and 2.7 mm × 2.7 mm × 3.4 mm (matrix size: 256 × 256, 2 min/bed position) for G2. PET images were converted to SUV units by normalization using the patient body weight.

Radiomic pipeline
The entire radiomic feature extraction was performed using the LIFEx software (Local Image Feature Extraction, www.lifexsoft.org) [41]. The main steps of the radiomic pipeline are summarized in Figure 3.
The primary tumor was delineated on the PET images by a single observer (physicist, 3 years of experience) using a 40%-threshold of SUV max (maximum SUV in the lesion) within a manually drawn volume (LIFEx software). The resulting volume from this semiautomatic segmentation was thereafter termed VOI-T (volume of interest-tumor). Special attention was paid to tumors located near the bladder wall due to the intense urinary uptake. VOI-T was systematically reviewed by a nuclear medicine physician (5 years of experience) and sometimes manually adjusted to exclude any biases due to bladder proximity and resulting partial volume effect.
For each VOI-T, five first-order features were extracted: SUVmean (mean SUV in the VOI), SUVmax, SUVpeak (mean SUV in a 1 mL sphere within the VOI such that the mean SUV in that sphere was maximum), MV, and TLG (product of SUVmean and MV).
SUV values in VOI-T were then resampled into 128 discrete values using an absolute method in order to avoid the correlation between textural features and MV, and to reduce the impact of noise and the size of the matrices. The minimum and maximum bounds of the resampling interval were set to 0 and 40 SUV, leading to a bin size of 0.3 SUV (Equation 3). The upper bound was chosen to include all tumor SUV values [28].
Three gray-level matrices were calculated in each VOI-T: the Gray-Level Co-occurrence Matrix (GLCM), the Gray-Level Run Length Matrix (GLRLM) and the Gray-Level Zone Length Matrix (GLZLM). Two methods were described in the literature to compute gray-level matrices in three dimensions [42]. In this study, GLCM and GLRLM were first computed in 13 directions in order to consider all independent directions between one voxel and its 26 neighbors [30]. Each textural feature extracted from these matrices then corresponds to the average value over the 13 directions. Six textural indices (Homogeneity, Entropy from GLCM; Short-Run Emphasis (SRE), Long-Run Emphasis (LRE) from GLRLM; Low Gray-level Zone Emphasis (LGZE), High Gray-level Zone Emphasis (HGZE) from GLZLM) were analyzed as proposed by Orlhac et al. [26].
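As a minimal illustration of the absolute resampling described above, the following Python sketch maps SUV values onto 128 fixed-width bins between 0 and 40 SUV. The function name, the floor-based binning rule and the clipping of out-of-range values are assumptions made for illustration; the exact formula (Equation 3) and the LIFEx implementation may differ.

```python
import math


def discretize_suv_absolute(suv_values, lower=0.0, upper=40.0, n_bins=128):
    """Map SUV values to gray levels 0..n_bins-1 using fixed bounds.

    Hypothetical helper: floor-based binning with clipping is an assumption,
    not necessarily the rule used by the authors or by LIFEx.
    """
    bin_width = (upper - lower) / n_bins  # 40 / 128 = 0.3125 SUV per bin
    levels = []
    for suv in suv_values:
        index = math.floor((suv - lower) / bin_width)
        # Clip values at or beyond the bounds into the first or last bin.
        index = max(0, min(n_bins - 1, index))
        levels.append(index)
    return levels, bin_width


# Example with made-up SUV values.
levels, width = discretize_suv_absolute([0.0, 0.3, 0.32, 12.5, 40.0])
print(width, levels)
```

Because the bounds are fixed, a given SUV always maps to the same gray level in every patient, which is the property that makes feature values comparable across tumors of different uptake ranges.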
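The direction averaging described above can be sketched as follows in Python. This is an illustrative sketch only: the function names, the dictionary-based volume representation, the symmetric counting of voxel pairs, the base-2 logarithm in Entropy and the skipping of directions without any voxel pair are all assumptions, and the LIFEx implementation may differ in these details.

```python
import itertools
import math
from collections import Counter


def unique_directions_3d():
    """Return the 13 offsets that cover the 26 neighbours of a voxel up to sign."""
    # Lexicographic comparison keeps exactly one offset of each (d, -d) pair.
    return [d for d in itertools.product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]


def glcm_probabilities(voxels, direction):
    """Symmetric, normalised co-occurrence matrix for one direction.

    `voxels` maps (x, y, z) coordinates inside the VOI to a discretised gray level.
    """
    dx, dy, dz = direction
    counts = Counter()
    for (x, y, z), level in voxels.items():
        neighbour = voxels.get((x + dx, y + dy, z + dz))
        if neighbour is not None:
            counts[(level, neighbour)] += 1
            counts[(neighbour, level)] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def homogeneity(p):
    """Sum of p(i, j) / (1 + |i - j|)."""
    return sum(v / (1 + abs(i - j)) for (i, j), v in p.items())


def entropy(p):
    """Shannon entropy of the matrix (base-2 logarithm assumed)."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def averaged_glcm_features(voxels):
    """Average Homogeneity and Entropy over the directions containing voxel pairs."""
    matrices = [glcm_probabilities(voxels, d) for d in unique_directions_3d()]
    matrices = [p for p in matrices if p]
    if not matrices:
        raise ValueError("the VOI contains no pair of neighbouring voxels")
    mean_h = sum(homogeneity(p) for p in matrices) / len(matrices)
    mean_e = sum(entropy(p) for p in matrices) / len(matrices)
    return mean_h, mean_e


# Toy 2 x 2 x 2 VOI with made-up gray levels.
toy_voi = {(x, y, z): (x + y + z) % 3
           for x in range(2) for y in range(2) for z in range(2)}
print(averaged_glcm_features(toy_voi))
```

A perfectly uniform VOI gives Homogeneity equal to 1 and Entropy equal to 0 in this sketch, which matches the intuition that both indices reflect how often neighbouring voxels share similar gray levels.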

Statistical analysis
All statistical analyses were performed using R software version 3.3.2.
First, a univariate analysis was performed to assess the ability of each individual feature to predict local recurrence in each group separately. P-values of Wilcoxon's tests were computed between the features of non-relapsing and relapsing patients calculated in VOI-T for G1 and G2 separately. ROC analyses including AUC calculations were also performed to evaluate the performance of each feature using the pROC library. 95% Confidence Intervals (C.I.) were computed using 2000 stratified bootstrap replicates [43]. To evaluate the influence of the number of patients in the univariate analysis, 100 random subsets of 39 patients from G1 were drawn so that the number of patients was identical to that in G2, and the mean p-value of each index was computed over all draws.
Second, a multivariate analysis was performed using the original datasets to evaluate the added value of a combination of features for predicting local recurrence, and to develop a signature applicable in both groups. A stepwise model selection using the Akaike Information Criterion (AIC, library MASS) was applied to determine the best 4-feature multivariate signature in each group [44,45], with the groups used successively for training and validation of the model: first, G1 as training set and G2 as validation set, and secondly G2 as training set and G1 as validation set. The AIC is a measure of the relative quality of statistical models based on information theory. It allows comparison of the least-squares fits of a given dataset obtained using several models of varying complexity. The model with the smallest AIC value is a compromise between goodness of fit on a given dataset and number of parameters.
DeLong's test was used to compare the AUC of the 4-feature signature with the AUC of SUVmax alone in both groups.
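The ROC part of the univariate analysis can be illustrated with a short Python sketch. It is not the pROC implementation: the function names are hypothetical, the AUC is computed as the proportion of (non-relapsing, relapsing) pairs in which the relapsing patient has the higher feature value, the confidence interval is a simple percentile interval, and the feature values in the usage example are made up.

```python
import random


def auc(non_relapsing, relapsing):
    """AUC as the probability that a relapsing patient has the higher value.

    Ties count for 0.5. Assumes higher feature values indicate recurrence.
    """
    wins = 0.0
    for neg in non_relapsing:
        for pos in relapsing:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(non_relapsing) * len(relapsing))


def stratified_bootstrap_ci(non_relapsing, relapsing, n_boot=2000, level=0.95, seed=0):
    """Percentile confidence interval from stratified bootstrap replicates.

    Each class is resampled with replacement separately, so every replicate
    keeps the original numbers of relapsing and non-relapsing patients.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        neg = rng.choices(non_relapsing, k=len(non_relapsing))
        pos = rng.choices(relapsing, k=len(relapsing))
        replicates.append(auc(neg, pos))
    replicates.sort()
    alpha = (1.0 - level) / 2.0
    lower = replicates[int(alpha * n_boot)]
    upper = replicates[min(n_boot - 1, int((1.0 - alpha) * n_boot))]
    return lower, upper


# Made-up Entropy values for illustration only.
entropy_no_relapse = [5.1, 5.4, 5.6, 5.9, 6.0, 6.2]
entropy_relapse = [5.8, 6.1, 6.4, 6.7]
print(auc(entropy_no_relapse, entropy_relapse))
print(stratified_bootstrap_ci(entropy_no_relapse, entropy_relapse))
```

The subsampling experiment follows the same pattern: draw a random subset of G1 of the required size, recompute the statistic, and average over the draws.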
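For reference, the standard definition of the criterion can be written as below. The symbols are notation introduced here and do not appear in the original text: k is the number of estimated parameters, \hat{L} the maximized likelihood of the model, n the number of observations, RSS the residual sum of squares and C a constant that is identical for all models fitted to the same data. The second line is the usual special case for least-squares fits with Gaussian errors.

```latex
\begin{align}
\mathrm{AIC} &= 2k - 2\ln\hat{L} \\
\mathrm{AIC}_{\text{least squares}} &= n\,\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k + C
\end{align}
```

In both forms the term 2k penalizes each additional parameter, so adding a feature lowers the AIC only if it improves the fit enough to offset that penalty.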

Influence of PET characteristics and voxel size on texture feature values
To evaluate the influence of the PET scanner on texture index values, a spherical volume of 75.5 mL for G1 and 75.7 mL for G2 was drawn in the liver (VOI-L). This region was assumed to be a homogeneous reference region after systematic verification of normal liver function [28]. Another parameter influencing textural feature values is the voxel size [29,36]. G1 images were resampled onto the G2 grid (2.7 mm × 2.7 mm × 3.4 mm) and G2 images onto the G1 grid (5.3 mm × 5.3 mm × 3.4 mm) using bicubic interpolation. G1 and G2 images were also resampled to a common grid with a voxel size of 2 mm × 2 mm × 2 mm. The same radiomic pipeline as for VOI-T (Figure 3) was applied to VOI-L. Wilcoxon's tests were performed between G1 and G2 in VOI-L on the native images and on the three sets of resampled images to determine the extent to which technological differences can influence radiomic feature values and whether spatial resampling is sufficient to remove the device dependence.
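The between-scanner comparison of a liver feature can be illustrated with a two-sided Wilcoxon rank-sum test. The Python sketch below uses the normal approximation with tie and continuity corrections, which is an assumption: the R function used by the authors may compute an exact p-value for small samples without ties, so results can differ slightly. The function name and the feature values are hypothetical.

```python
import math


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns the Mann-Whitney U statistic of group_a and the p-value.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both groups must contain at least one value")
    n = n_a + n_b
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])
    rank_sum_a = 0.0
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end < n and pooled[end][0] == pooled[start][0]:
            end += 1
        # Tied values share the average of ranks start+1 .. end.
        midrank = (start + 1 + end) / 2.0
        size = end - start
        tie_term += size ** 3 - size
        rank_sum_a += midrank * sum(1 for k in range(start, end) if pooled[k][1] == 0)
        start = end
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    mean_u = n_a * n_b / 2.0
    var_u = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0.0:
        return u_a, 1.0  # all values identical: no evidence of a difference
    z = max(abs(u_a - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p_value = math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))
    return u_a, p_value


# Made-up liver Homogeneity values for two scanners.
liver_g1 = [0.61, 0.63, 0.64, 0.66, 0.67, 0.69]
liver_g2 = [0.48, 0.50, 0.52, 0.53, 0.55, 0.57]
print(rank_sum_test(liver_g1, liver_g2))
```

Repeating the test on the native images and on each set of resampled images shows whether a feature remains scanner-dependent once the voxel size has been harmonized.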

CONCLUSIONS
In this study, we defined a 4-feature signature predicting local recurrence in LACC in two cohorts, and we validated the signature derived from the images acquired on the most recent scanner (G2) on the G1 group with AUC > 0.75 using radiomic features only. For both PET scanners, we showed that this signature predicted tumor recurrence better than SUVmax.
We also demonstrated that it is challenging to merge images from two different PET scanners with different acquisition parameters without introducing bias due to differences between acquisition protocols. Multi-center or multi-device studies must thus be performed with caution, ensuring that biases are taken into account in the analyses.