Gene expression information improves reliability of receptor status in breast cancer patients

Immunohistochemical (IHC) determination of receptor status in breast cancer patients is frequently inaccurate. Since it directs the choice of systemic therapy, it is essential to increase its reliability. We increase the validity of IHC receptor expression by additionally considering gene expression (GE) measurements. Crisp therapeutic decisions are based on IHC estimates, even if they are borderline reliable. We further improve decision quality by a responsibility function, defining a critical domain for gene expression. Refined normalization is devised to file any newly diagnosed patient into existing data bases. Our approach renders receptor estimates more reliable by identifying patients with questionable receptor status. The approach is also more efficient since the rate of conclusive samples is increased. We have curated and evaluated gene expression data, together with clinical information, from 2880 breast cancer patients. Combining IHC with gene expression information yields a method more reliable and also more efficient as compared to common practice up to now. Several types of possibly suboptimal treatment allocations, based on IHC receptor status alone, are enumerated. A ‘therapy allocation check’ identifies patients possibly miss-classified. Estrogen: false negative 8%, false positive 6%. Progesterone: false negative 14%, false positive 11%. HER2: false negative 2%, false positive 50%. Possible implications are discussed. We propose an ‘expression look-up-plot’, allowing for a significant potential to improve the quality of precision medicine. Methods are developed and exemplified here for breast cancer patients, but they may readily be transferred to diagnostic data relevant for therapeutic decisions in other fields of oncology.


INTRODUCTION
The selection of an optimum breast cancer therapy has to include the expression of estrogen receptors (ER), progesterone (PGR) and human epidermal growth factor 2 (HER2) receptor proteins in an individual patient. The reliable assessment of these 3 receptor status is hence mandatory for optimum individualized therapy.
Although immunohistochemistry (IHC) is considered the gold standard for status determination, doubts have been raised [1-3] by reported differences between readers repeatedly examining the very same data.

Research Paper
While almost complete concordance was found for ER -status, up to 20% of patients assigned ER + -status may be erroneously classified, according to reports in the literature [4,5]. Of note, we also found a considerable number of ERthat might be misclassified.
First, correct assignment of ER-status is most important for the individualized choice of treatment [6]. As gene expression measurements can be achieved by different techniques, we investigate if the assessment of receptor status could be improved by additionally exploiting gene expression data.
Second, gene expression signatures for the prediction of individual therapeutic outcome have been established [7][8][9][10][11] and biomarker discovery methods developed [12][13][14], as recently reviewed [4,15]. Each of these signature-algorithms includes receptor status as decisive variables upon which calculated prognosis crucially depends. Correct prognostic algorithms can thus be developed only on the basis of reliable receptor status [16].
Hence, it is of paramount value for both, patient treatment and research, to increase the reliability or even impute receptor status by the use of additional information, e.g., gene expression measurements [17][18][19][20]. In the present work we restrict ourselves to a single measurement platform for gene expression (Affymetrix U133A+2.0), in order to avoid inter-platform batch effects.
The present work deals with three receptors, ER, PGR and HER2.

RESULTS
For the present work we used and curated the datasets listed in Table 1.

Receptor status obtained from gene expression by conventional methods
Given the IHC-estimate of a receptor-status, e.g. ER IHC − or ER IHC + , two distributions of expression values ( x i , i = 1,…, N sample , number of patients/samples) of the corresponding gene (e.g. ESR1. see Table 2) were estimated, see Figure 1. These plots show results for the estrogen receptor and also illustrate the computational procedure, see section 'Methods and models for expression of receptor genes'.

Obtaining receptor status ROC curves
As a prerequisite, we assumed IHC-estimates (e.g.: ER IHC + , ER IHC − ) to be 'true' and computed a ROC-curve (blue curve in Figure 2) by classifying a patient as receptor positive/negative if the expression value, x i , of gene ESR1 was above/below a running threshold, x. The area under the curve resulted as AUC = 0.94.

Cut-points resulting from models of receptor gene expression
In order to evaluate gene expression we assumed expression values, x i , to be approximately normally distributed (N ( , ) µ σ + + for ER IHC + and N ( , ) µ σ − − for ER IHC − ), see Figure 1, lower panel. Adding both normal distributions with proper weights (λ λ + − + =1) yielded a bimodal distribution, see the stacked bars in Figure  1. Likewise, we obtained bimodal distributions for progesterone (gene PGR) and human epidermal growth factor receptor 2 (HER2, gene ERBB2), see Figure 3.
To obtain thresholds discriminating receptor positive versus negative we evaluated 5 different methods: MaxLike, ParEst, LogReg, Youden and ExMax, for maths see the 'Methods and models for expression of receptor genes' in 'Material and methods'. We decided to finally adopt method ExMax, see section 'Why to adopt ExMax' in 'Material and methods', and obtained the cut-points given in Table 3.
Similar results were obtained for PGR and HER2, see the bottom rows in Figure 3. Computationally, we proceeded along the very same lines as for ER. Note that the majority of patients was HER2-. As a consequence, thresholds shifted towards larger expression values (as compared to ER) and specificity of discrimination resulted fairly large.
Although HER2+ has much lower prevalence then HER2-, the ExMax-estimate performs outstandingly well, as can be seen from Figure 3.

Agreement between receptor status obtained by IHC versus gene expression cut-points
Using the cut-points (CP) shown in Table 3, we  obtained receptor status from gene expression as compared  to IHC-estimates, see Table 4.
Association between IHC and gene expression was quantified by Kruscal's gamma, Cohen's Kappa (cf. Table 4) and also by the diagnostic odds ratio, DOR, and its 95% confidence intervals [21].
Disagreement between IHC and GE, as quantified by the rate of discordant samples, is highly problematic in the clinical setting, and a clinical decision should not be made on the basis of IHC data only.
PGR showed less agreement than ER due to its lower expression values, entailing a broad distribution of PGR expression values for PGR IHC + . Regarding HER2, the precision of determination suffered from an imbalance in the data: Much more patients are HER2 negative than positive. Note that missing IHC-estimates, as considered in the next chapter, were not included here.
In order to improve this situation we enhanced the decision process by furnishing GE with a responsibility function and a critical domain.

Improved decision quality due to responsibility function and critical domain
Instead of using a single cut-point for each receptor we introduced a 'critical domain' X X lower upper ,     for a probability level α = 0.05, see Figure 4 and, for mathematical details, the chapter 'Material and methods'. For expression values x X < lower we decided for receptor negative, for x X upper > receptor positive: In both cases we classified the expression value as 'informative'. As opposed, expression values within the critical domain X lower ≤ x ≤ X upper were classified as 'undecidable', and we refrained from a decision. Critical domains thus constructed are given in Table 5 and their practical application is described in section, Expression lookup plot' in 'Material and methods'.    The information conveyed by gene expression was thus refined, yielding the results for GE 0, shown in Table  6 and the yellow sub-bars in Figure 5.
A critical domain regarding GE indicates whether or not to trust in GE expression alone. This is especially useful in cases of lacking IHC estimates. For ER in particular, 81 cases revealed as untrustworthy, since missing IHC was accompanied by GE within the critical domain.
Introducing a critical domain is in particular useful in the case of PGR, since GE estimates of PGR offer only limited power of discrimination (leaving as many as 782 samples in column GE 0). But also for ER, 338 samples were revealed untrustworthy (column GE 0).
HER2 estimates revealed a special problem: Many HER2 IHC + samples had low (i.e. contradicting) GE, suggesting they might be false positives from IHC measurement, see the red part of the rightmost bar in left panel of Figure 5.

Therapy allocation check and consequences
In order to scrutinize if the additional information conveyed by GE might change therapeutic decisions based on IHC only, we compared four groups: Figure 6, blue curves, left column, 329 Pat.): Receptor positive patients according to IHC, had received hormone treatment. Since GE confirmed IHC, the choice of systemic therapy deems correct in these cases.
(see Figure 6, blue curves, right column, 137 Pat.): Receptor negative patients according to IHC, had not received hormone treatment, accordingly. Since GE confirmed IHC, the choice of systemic therapy deems correct in these cases.
3. ER PGR ER PGR Figure 6, violet, upper left, lower right, 18 Pat.): Patients deemed receptor positive according to IHC, had received hormone treatment, accordingly. Since GE contradicted IHC, the choice of systemic therapy might have been incorrect in two possible ways: Those patients who had received neoadjuvant hormone therapy (in addition to adjuvant chemotherapy) would not have benefitted from it (negligible harm). Those patients receiving only hormone therapy without chemotherapy would have been given an ineffective therapy while at the same time being deprived of the adequate, live-saving therapy. Figure   6, violet, upper right, lower left, 9 pat.): Patients deemed receptor negative according to IHC, not having received hormone treatment, accordingly. Since GE contradicted IHC, the choice of systemic therapy might have been suboptimal in these cases.  For each group, we calculated the Kaplan Meier product limit estimator for time free from relapse (see Figure 6) and obtained 95% confidence limits via the Greenwood formula [22].
In group 1 (blue, left column), a patient's positive hormone-receptor status according to IHC was not contradicted by GE. Accordingly, these patients are likely to have been allocated the 'correct' i.e. anti-hormone treatment and have actually benefitted therefrom.
Patients in group 2 (blue, right column) had been assessed hormone receptor negative via IHC, and this was confirmed by GE: They are likely to have received optimum systemic therapy.
In contrast, patients in group 3 (violet, upper left and lower right) had been assessed hormone receptor positive .177 n/a n/a n/a n/a n/a n/a n/a Youden 8.832 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a Youden 3.929 n/a n/a n/a n/a n/a n/a n/a .101 n/a n/a n/a n/a n/a n/a n/a Youden 10.633 n/a n/a n/a n/a n/a n/a n/a ExMax 12.304 9.492 13.209 1.251 0.538 n/a n/a 0.124 Cut-points (CP) as obtained by 5 different methods (MaxLike, ParEst, LogReg, Youden and ExMax), cf. the colored dots in Figure 2 and Figure 3. Cut-points for gene expression were those derived by the ExMax-algorithm, see Table 3. Four measures of agreement are given: log(DOR) (natural logarithm of the Diagnostic Odds Ratio, DOR), Cohen's Kappa (inter-rater agreement), Kruscal's Gamma (strength of association) and the rate of discordant samples (conjugate to the rate of conclusive samples). www.impactjournals.com/oncotarget via IHC, however, this was contradicted by GE: they might have received a hormone therapy but may not have benefitted from it (since they are likely to lack receptors, according to GE). It might even be the case that adjuvant chemotherapy was spared although being mandatory, relying on just hormone therapy (which did not work in these patients).
Patients in group 4 (violet, upper right and lower left) had been assessed hormone receptor negative via IHC, however, this was contradicted by GE: They might have been deprived of a hormone therapy they could have benefitted from.
The above grouping is weakly formulated by intention, in the sense that IHC estimates (used as basis for therapeutic decision) were questioned if and only if GE actually contradicted, not if GE was just within its critical domain ([GE+, GE0] for receptor positive, [GE-, GE0] for receptor negative). Tightening the definition of error-prone groups (not tolerating '0' anymore), might even aggravate the contrasts, concomitantly reducing the number of patients in each group, however. We decided to stick to the broader (and weaker) definition so as not to withhold possibly decisive warnings to patients with borderline receptor status.

Achievements due to responsibility function and critical domain
For the majority of patients (89%), the IHCassessment of estrogen receptors (ER IHC ) and gene expression (ER GE , gene ESR1) data yielded compatible results and we may consider such an estimate true. This percentage was similar to the portion of IHC-estimates in the literature suspected to be true [23]. In this work we paved the way to consistently deal with missing as well as contradicting estimates as follows, see Table 6.
In cases of missing IHC-estimates one cannot simply trust GE. Instead, in order to strengthen decision performance, we rather have to introduce a critical region for GE. It effectively flags those cases where GE alone deems untrustworthy -81, 282 and 88 for ER, PGR and HER2, respectively -and thereby provides a 'quality filter'.
Available IHC estimates which are contradicted by GE when based on simple cut-points (crisp decision), represent a large number of discordant cases: (a) For ER, about (11% of patients in our cohort, ER IHC and ER GE disagreed, with about equal numbers (130/112) being found in each of the two possible modes of disagreement see Table 4.
We have to consider, however, that samples with formerly 'contradicting' GE (according to a crisp decision based on a simple cut-point for GE) may switch to noncontradicting, i.e. 'compliant', if GE lied just within (and not beyond the limits of) its critical domain.
Thus, considering a critical domain for GE, the rate of discordant samples decreased while the rate of conclusive samples increased, as can be seen by comparing Table 4 and Table 6.

Results for patients on top of those for statistics
If the goal were just to build a cohort of patients for developing a gene-expression signature statistically, one could, of course, retreat to the save side and consider only cases in agreement. However, even this might bias the results and call for an improvement of receptor estimate quality.
However, if the goal are decisions upon therapeutic interventions, these have to be made for all patients and in utmost quality -even in cases of disagreement between IHC and gene expression. We therefore tried to reduce the risk of wrong decisions by improving the cut-pointmethod by the additional concept of a 'critical domain'.

Improving receptor security for a newly diagnosed patient
For a newly diagnosed patient, we proposed a procedure to improve receptor security, see section The critical domain for expression values x is given by X lower ≤ x ≤ X upper , based on the ExMax-algorithm and probe sets listed in Table 2. The section 'Expression lookup plot' in 'Material and methods' describes how to actually apply the criterion X lower ≤ x ≤ X upper .   'Expression lookup plot' in 'Material and methods'. The 'critical domain' denotes non-decidable cases with responsibilities between 0.05 and 0.95. Note that such a direct procedure is only possible due to the nature of RMA-normalization of expression data, see section 'Receptor status obtained from gene expression' in 'Results'.
This straight forward procedure for the newly diagnosed patient is easily applicable (and renders any renormalization on the fly, etc., unnecessary).
Note however, that the 'expression look-up-plot' may critically depend on expression chip processing, despite RMA-normalization. We hence recommend building the 'expression look-up-plot' on the basis of gene expression measurements performed in the very same environment as data from newly diagnosed patients are analysed.
As a consequence and achievement, treatment quality of newly diagnosed patients will benefit, thus conveying a significant step towards precision medicine.

Balancing the risks of erroneous receptor status
We have shown that additional information from GE-data including a critical domain may improve receptor security in a considerable portion of cases, see Table 6. were not contradicted by gene expression. Without our proposed re-assessment, the remaining 8% may receive chemotherapy although hormone therapy might suffice. Thus, applying our proposed method, unnecessarily severe side-effects could possibly be saved in approximately 8% of that patient cohort. b) ER IHC + Out of 1196 patients classified as ER IHC + only 94% were not contradicted by gene expression. Without our proposed re-assessment, the remaining 6% may erroneously receive non-effective endocrine therapy while chemotherapy is withheld, which may be lethal. Thus, identifying these patients, may save lives in approximately 6% of that patient cohort. were not contradicted by gene expression. Without our proposed re-assessment, the remaining 11% would be falsely misclassified into endocrine sensitive and would thus be exposed to the side effects of endocrine therapy without a therapeutic benefit. e) HER2 IHC − Out of 1587 patients classified HER2 IHC − , 99.6% were not contradicted by gene expression. Without our proposed re-assessment, the remaining 0.4% would be exposed to Herceptin-associated side effects without deriving a benefit from this expensive and side-effect-related therapy. f) HER2 IHC + Out of 538 patients classified HER2 IHC + only 50% were not contradicted by gene expression. Without our proposed re-assessment, the remaining 50% would not be able to receive HER2-targeted treatment such as Herceptin, and would be under-treated.
Balancing the consequences of above listed decisions in therapy allocation we adhere to the general consensus that withholding endocrine therapy when it should be delivered is more harmful than applying endocrine therapy when not needed. Hence, • the therapy allocation error in c) is more serious than it is in d).
• the therapy allocation error in f) is more serious than it is in e).
• the most serious allocation error is listed in b), concerning patients falsely diagnosed as receptor positive.

Characteristics of selected procedures
This section critically enumerates assumptions underlying the results presented here.

Dichotomous IHC-estimates
IHC assays initially produce raw values which are normally dichotomized into ER IHC + and ER IHC − according to some thresholds. Using more than one threshold yields results such as 'low', 'moderate', 'high', etc.
A straight forward improvement of the current methodology could draw on raw values of IHC-assays www.impactjournals.com/oncotarget   and file them into the analysis as real values similar to gene expression values. Such an approach could even outperform the one presented here and may be the scope of future investigations. A drawback of this proposal is, however, that in most cases we only have dichotomized IHC-estimates at hand. No large datasets are available on the net to pursue the issue if gradual IHC estimates could improve precision.

Goals achieved and clinical impact
Receptor status estimated via IHC is a key ingredient of precision medicine for breast cancer patients. However, significant doubts have been raised regarding reliability, when comparing disparate receptor estimates of different readers evaluating the same data [5].
On the other hand, receptor estimates from gene expression, which are usually not used in clinical settings, have been given even more credit than IHC estimates [24].
Taking GE in account in addition to IHC clearly boosts credibility of concordant cases but, at first glance, reduced the number of decidable cases, since contradicting cases also emerged. However, in a second step, concordance could be re-improved by considering a critical domain of GE, thereby again reducing the number of contradictory cases.
In this work we scrutinized possible mathematical ways to enrich receptor information from IHC by gene expression data and thus put decisions on therapies on more solid grounds. To these ends we developed a procedure to incorporate both, IHC estimates and gene expression data.
This work used Affymetrix data together with clinical parameters downloaded from GEO in search for an improved receptor status determination. Translated into clinical reality, the expression status of those few genes involved will most probably be determined via RT-qPCR. Gene expression for single, newly diagnosed patients can be made interpretable and meaningful by our proposed 'expression-lookup-table' and lends itself as a valuable tool for improved receptor status assessment.

Data for gene expression and receptor status
Scanning the gene expression omnibus, GEO [25][26][27], for studies on breast cancer, using Affymetrix U133A+2.0 arrays and also containing receptor data resulted in 28 studies, two of which are redundant. The 26 non-redundant studies comprise a total number of N sample = 2880 individual breast cancer samples, after excluding samples completely lacking clinical information, control samples as well as duplicate samples.
Affymetrix-probe sets for receptors are given in Table 2.
For the analyses we restricted ourselves to the subset of the 22215 probe sets U133A+2.0 has in common with U133A, in order to be downwards compatible.

Methods and models for expression of receptor genes
To exploit information from the expression of receptor genes we considered five methods: MaxLike, ParEst, LogReg, Youden and ExMax.

Method 1: Youden point
The conventional, most straight forward approach is to compute the Youden point [31] along the ROC-curve by maximizing The threshold is set at equal Bayesian probabilities Pr | Pr | . + ( )= − ( )= x x 0 5 [33], where the Bayesian probability is given by: Here Pr Computing the cut-point is achieved again via eq. (4) and eq. (5).

Method 5: expectation maximization (ExMax)
The ExMax-algorithm is a general and very potent mathematical concept, see p. 275 in [35], which can fruitfully be broken down and applied to receptor estimation, as follows.
ExMax not only estimates the parameters of both normal distributions, µ σ , , but also their relative weight as an additional parameter, π + , yielding π π − + = − ( ) 1 due to normalization. Hence, ExMax is able to work without any IHC-estimates.
In short the process is as follows: We start from an initial guess for parameters of the two normal distributions ((μ 1 , σ 1 ), (μ 2 , σ 2 ) and π + ∆ ( ) π + It is only this initial guess -not the data used later for the actual estimation -which may be obtained from IHC-data. Instead of maximizing the likelihood straight away (as in method 4), it is expanded by additional N sample latent parameters γ i , i N = 1.... sample (8) γ i represents a guess (probability) for sample i to be receptor-positive. a) These guesses are computed (via a special formula) from the guesses for the 'real' parameters of the two normal distributions (μ 1 , σ 1 ), (μ 2 , σ 2 ) and π + ∆ ( ) π + + + − = + N N N . b) From this expanded maximum likelihood function, new estimates for the 'real' parameters are obtained via maximization.
Steps a) and b) are repeated in a loop until convergence. Probability density functions based on these fitted parameters are shown as blue lines in Figure 1, Figure 2 and Figure 3.
Note that IHC-results need to be known in advance and individually for each patient to apply methods Youden, LogReg and ParEst (i.e. supervised methods). As opposed, MaxLike draws on the relative abundances λ + ) only (semi-supervised method). ExMax does not need IHC data at all (unsupervised method), and hence lends itself for analyzing cohorts totally lacking IHC information.

Why to adopt ExMax
Screening the five methods resulted in the following valuation: • LogReg draws on IHC estimates which may partly be wrong, possibly blurring the decision.
• Youden yields only a cut-point without a confidence interval.
• MaxLike and ExMax yield almost identical results. However, according to our tests, MaxLike seems to perform well only if the abundances of false positive and false negative estimates are about equal, which is incidentally the case in our data. As opposed, ExMax works unaffected by relative abundances. Hence, we preferred ExMax over MaxLike.
• ParEst yields larger confidence intervals than ExMax, see section 'Obtaining confidence intervals for expression distribution parameters'.
• LogReg and Youden do not draw on normal distributions. Finally, each method yields a specific cutpoint for classifying the expression value, x i , see the marks on the ROC-curve (Figure 2), corresponding to the vertical lines in Figure 1 and Figure 3, lower panels.
All proposed methods were found to yield compatible cut-points, see Table 3, and ExMax was finally selected. It yields a PDF ( f − ) for receptor negative and a PDF ( f + ) for receptor positive.

Obtaining confidence intervals for expression distribution parameters
For the methods ParEst, MaxLike and ExMax, outlined to model receptor gene expression, the accuracy of parameters can be estimated by several methods, such as bootstrap, Leave-One-Out (LOO), cross-validation (CV, via training and validation set) or else by Fisher's information criterion, FIC, [35]. We will stick to FIC, and it works as follows: We start from the likelihood function, plug in all parameters (means and standard deviations) as a 'vector of parameters',  θ , and get: The so-called Fisher Information matrix is obtained as second partial derivative: Next, we would like to evaluate the expectation value of I ( )  θ by integrating over all measured values: i For a bimodal normal distribution this is not feasible.
As an approximation, we instead evaluate I ( )  θ for the maximum likelihood estimate of the parameters, θ → , as arguments: I (θ → ‫|‬ x → ). It can be shown mathematically [35] that this converges towards the expectation value.
From statistics one knows that the parameter estimates are normally distributed We take the main diagonal of this matrix and therefrom construct 95% confidence limits through multiplication by 1.96   θ θ ( ) For results see Table 7.

Responsibility functions define lower and upper bounds for gene expression
Each of the four estimation methods (LogReg, ParEst, MaxLike, ExMax) lends itself to construct a 'responsibility function', r x − ( ), for the respective sample being receptor-negative (likewise, r x + ( ) can be constructed for a sample being receptor positive). Each method yields a separate responsibility function, see Figure 7.
For the methods ParEst, MaxLike and in particular for ExMax, which we finally adopted, responsibility functions are constructed as follows: Given the two PDFs f − and f + (see Figure 1 and The 'critical domain' denotes non-decidable cases with responsibilities between 0.05 and 0.95.

Expression lookup plot
As a prearrangement take a gene expression database of a similar patient cohort and some expression platform, e.g. like the one we have used in this work, as a basis. Perform RMA-normalization as described and generate an 'expression look-up-plot' similar to Figure 8. It serves as a reference in the following procedure for the newly diagnosed patient: a) Take the raw expression values of all probe sets, order them by size and obtain ranks 1 -22.215 (This number depends on the platform used, here a subset of Affymetrix U133A+2.0 as mentioned in section 'Normalization'). b) Locate the rank r of the receptor probe set (e.g. ESR1) among the others e.g. rank 18.515 (within that sample), see Figure 8. c) Within the 'expression look-up-plot' locate the rank of the receptor gene on the rank-axis (x-axis), project it onto the curve and read off the value from the expression axis (y-axis). In math terms, one computes the x = r / N probes -quantile. d) This value, x, comes already in correct dimensions and may directly be compared with the limits of the 'critical domain' X X lower upper ,     .ˆ