Abstract
Reliable performance of PET segmentation algorithms on clinically relevant tasks is required for their clinical translation. However, these algorithms are typically evaluated using figures of merit (FoMs) that are not explicitly designed to correlate with clinical task performance. Such FoMs include the Dice similarity coefficient (DSC), the Jaccard similarity coefficient (JSC), and the Hausdorff distance (HD). The objective of this study was to investigate whether evaluating PET segmentation algorithms using these task-agnostic FoMs yields interpretations consistent with evaluation on clinically relevant quantitative tasks. Methods: We conducted a retrospective study to assess the concordance in the evaluation of segmentation algorithms using the DSC, JSC, and HD versus on the tasks of estimating the metabolic tumor volume (MTV) and total lesion glycolysis (TLG) of primary tumors from PET images of patients with non–small cell lung cancer. The PET images were collected from the American College of Radiology Imaging Network 6668/Radiation Therapy Oncology Group 0235 multicenter clinical trial data. The study was conducted in 2 contexts: (1) evaluating conventional segmentation algorithms, namely those based on thresholding (SUVmax40% and SUVmax50%), boundary detection (Snakes), and stochastic modeling (Markov random field–Gaussian mixture model); (2) evaluating the impact of network depth and loss function on the performance of a state-of-the-art U-net–based segmentation algorithm. Results: Evaluation of conventional segmentation algorithms based on the DSC, JSC, and HD showed that SUVmax40% significantly outperformed SUVmax50%. However, SUVmax40% yielded lower accuracy on the tasks of estimating MTV and TLG, with a 51% and 54% increase, respectively, in the ensemble normalized bias. Similarly, the Markov random field–Gaussian mixture model significantly outperformed Snakes on the basis of the task-agnostic FoMs but yielded a 24% increased bias in estimated MTV. For the U-net–based algorithm, our evaluation showed that although the network depth did not significantly alter the DSC, JSC, and HD values, a deeper network yielded substantially higher accuracy in the estimated MTV and TLG, with a decrease in bias of 91% and 87%, respectively. Similarly, whereas there was no significant difference in the DSC, JSC, and HD values for different loss functions, up to a 73% and 58% difference in the bias of the estimated MTV and TLG, respectively, was observed. Conclusion: Evaluation of PET segmentation algorithms using task-agnostic FoMs could yield findings discordant with evaluation on clinically relevant quantitative tasks. This study emphasizes the need for objective task-based evaluation of image segmentation algorithms for quantitative PET.
- task-based evaluation
- multicenter clinical trial
- segmentation
- quantitative imaging
- deep learning
- artificial intelligence
PET-derived quantitative metrics, such as tumor volumetric and radiomic features, are showing strong promise in multiple oncologic applications (1–3). Reliable quantification of these features requires accurate segmentation of tumors on the PET images. To address this need, multiple computer-aided image segmentation algorithms have been developed (4), including those based on deep learning (DL) (5–8). Clinical translation of these image segmentation algorithms requires objectively evaluating them with patient data.
Medical images are acquired for specific clinical tasks; thus, it is critical that the performance of imaging and image-analysis algorithms be objectively assessed on those tasks. In this context, strategies have been proposed for task-based assessment of image quality (9–12). However, imaging algorithms, including those based on DL, are often evaluated using figures of merit (FoMs) that are not explicitly designed to measure clinical task performance (11). Recent studies conducted specifically in the context of evaluating image-denoising algorithms showed that task-agnostic FoMs may yield interpretations that are inconsistent with evaluation on clinical tasks (13–17). For example, in Yu et al. (17), a DL-based denoising algorithm for myocardial perfusion SPECT indicated significantly improved performance based on the structural similarity index measure and mean squared error but did not yield any improved performance on the clinical task of detecting myocardial perfusion defects.
Similar to image denoising, algorithms for image segmentation are almost always evaluated using FoMs that are not explicitly designed to quantify clinical task performance (5,18–21). These FoMs, including the Dice similarity coefficient (DSC), the Jaccard similarity coefficient (JSC), and the Hausdorff distance (HD) (4), quantify some measure of similarity between the predicted segmentation and a reference standard such as manual delineation. For example, the DSC measures spatial overlap between the predicted segmentation and the reference standard. A higher value of the DSC is typically used to infer more accurate performance. However, it is unclear how these task-agnostic FoMs correlate with performance on clinically relevant tasks.
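For concreteness, these three FoMs can be computed from binary masks as in the following minimal sketch (our own illustration, not code from the study; NumPy and SciPy are assumed):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient between two boolean masks."""
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

def jaccard(pred: np.ndarray, truth: np.ndarray) -> float:
    """Jaccard similarity coefficient (intersection over union)."""
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return intersection / union

def hausdorff(pred: np.ndarray, truth: np.ndarray) -> float:
    """Symmetric Hausdorff distance between the voxel coordinates of
    two nonempty masks (in voxel units; scale by voxel size for mm)."""
    p = np.argwhere(pred)
    t = np.argwhere(truth)
    return max(directed_hausdorff(p, t)[0], directed_hausdorff(t, p)[0])
```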
Our objective was to examine whether evaluating PET segmentation algorithms using task-agnostic FoMs leads to interpretations that are consistent with evaluation based on clinical task performance. Performing this investigation with patient data in a multicenter setting is highly desirable because such a study offers the ability to model variabilities in the patient population and clinical scanner configurations. Toward this goal, we conducted a retrospective study using data from the American College of Radiology Imaging Network (ACRIN) 6668/Radiation Therapy Oncology Group (RTOG) 0235 multicenter clinical trial (22,23). In this trial, patients with stage IIB/III non–small cell lung cancer were imaged with 18F-FDG PET/CT scans. In the study of non–small cell lung cancer, there is a strong interest in investigating whether early changes in tumor metabolism can help predict treatment response (24). Although most studies have focused on SUV-based metrics, the findings have been inconsistent (24,25), motivating the need for new and improved metrics. In this context, metabolic tumor volume (MTV) and total lesion glycolysis (TLG) are showing strong promise as prognostic biomarkers in multiple studies (3,26,27). As introduced above, computing these features requires tumor segmentation. Thus, our study was designed to assess the concordance in evaluating several image segmentation algorithms using task-agnostic metrics (DSC, JSC, and HD) versus on the clinically relevant tasks of estimating the MTV and TLG. Initial results of this research were presented previously (28); here, we provide a detailed description of the methods and study design, provide new results, and conduct extensive analyses of the results.
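MTV and TLG follow directly from a segmentation by their standard definitions: MTV is the segmented volume, and TLG is the product of MTV and the mean SUV within the segmentation. A minimal sketch of this computation (our illustration; variable names are hypothetical):

```python
import numpy as np

def mtv_and_tlg(suv_image: np.ndarray, mask: np.ndarray,
                voxel_volume_ml: float) -> tuple[float, float]:
    """MTV (mL) and TLG (SUV*mL) from an SUV image and a binary tumor mask."""
    mtv = mask.sum() * voxel_volume_ml               # metabolic tumor volume
    mean_suv = suv_image[mask.astype(bool)].mean()   # SUVmean inside the tumor
    tlg = mtv * mean_suv                             # total lesion glycolysis
    return mtv, tlg
```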
MATERIALS AND METHODS
Study Population
This retrospective study of existing data was approved by the institutional review board, which waived the requirement to obtain informed consent. Deidentified 18F-FDG PET/CT images of 225 patients with inoperable stage IIB/III locally advanced non–small cell lung cancer were collected from the ACRIN 6668/RTOG 0235 multicenter clinical trial (22,23). The images were obtained from The Cancer Imaging Archive database (29). Baseline PET/CT scans were acquired before curative-intent chemoradiotherapy for each patient. Demographics and clinical characteristics of the patient population are summarized in Supplemental Table 1 (supplemental materials are available at http://aaa161.com). A standardized imaging protocol was detailed by Machtay et al. (23). Briefly, an 18F-FDG dose ranging from 370 to 740 MBq was administered, with image acquisition beginning 50–70 min later and covering the body from the upper–mid neck to the proximal femurs. The PET images were acquired from 12 ACRIN-qualified clinical scanners (30), including GE Healthcare Discovery LS/ST/STE/RX, GE Healthcare Advance, Philips Allegro/Gemini, and CTI PET Systems (marketed as Siemens scanners) models 1023/1024/1062/1080/1094. The image reconstruction procedure compensated for attenuation, scatter, randoms, normalization, decay, and dead time. Details of the reconstruction protocol for each PET scanner are provided in Supplemental Table 2.
Data Curation
Evaluation of PET segmentation algorithms requires knowledge of true tumor boundaries or a surrogate for ground truth, such as tumor delineations performed by an expert human reader. For this purpose, a board-certified nuclear medicine physician with more than 10 y of experience reading PET scans was tasked with defining the boundary of the primary tumor for each patient (Fig. 1). The physician was instructed to locate the primary tumor by carefully reviewing the coregistered PET/CT images along the coronal, sagittal, and transverse planes and then using an edge-detection tool (MIM Encore 6.9.3; MIM Software Inc.) to obtain an initial boundary of the primary tumor. The physician was informed explicitly about potential errors in this initial boundary and was thus advised to review the boundary carefully and make any modifications as needed. The task of segmenting the tumors in the whole dataset was divided into multiple sessions to mitigate reader fatigue. At the end of this process, we had expert-defined segmentations for the primary tumors in the 225 PET scans in our dataset.
Evaluation of Conventional Computer-Aided Image Segmentation Algorithms
Conventional computer-aided PET segmentation algorithms are commonly categorized into those based on thresholding, boundary detection, and stochastic modeling (4). We selected the algorithms of SUVmax thresholding (SUVmax40% and SUVmax50%) (31), Snakes (32), and Markov random field–Gaussian mixture model (MRF-GMM) (33) from each of these categories, respectively. A detailed description of these algorithms is provided in the supplemental materials (31–33).
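As an illustration of the thresholding category, a sketch of fixed-fraction SUVmax thresholding, assuming a region of interest has already been placed around the primary tumor (our simplification; the exact implementation is described in the supplemental materials):

```python
import numpy as np

def suvmax_threshold_segmentation(suv_roi: np.ndarray,
                                  fraction: float = 0.40) -> np.ndarray:
    """Threshold-based segmentation: keep voxels whose SUV exceeds a
    fixed fraction (e.g., 40% or 50%) of the maximum SUV in the ROI."""
    threshold = fraction * suv_roi.max()
    return suv_roi > threshold
```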
Evaluation of the DL-Based Image Segmentation Algorithm
We next considered the evaluation of a state-of-the-art U-net–based algorithm (5,8,34,35). A detailed description of the network architecture is provided in Supplemental Figure 1. When DL-based algorithms are developed and evaluated, common factors known to impact the output include the choice of network depth (36), network width (37), loss function (38), and data preprocessing and augmentation strategies. In this study, we focused on investigating whether evaluating the impact of network depth and loss function using the task-agnostic FoMs yields inferences that are consistent with evaluation on the tasks of estimating MTV and TLG.
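For orientation, a minimal 2D U-net with a configurable number of paired convolutional blocks in the encoder and decoder is sketched below. This is our own simplified sketch, not the study's exact architecture (which is detailed in Supplemental Figure 1); channel counts and activation choices are assumptions:

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """A paired block of two 3x3 convolutional layers with ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    """Minimal 2D U-net whose depth (number of paired blocks in the
    encoder and decoder) is a constructor argument."""
    def __init__(self, depth: int = 3, base_ch: int = 32):
        super().__init__()
        chs = [base_ch * 2 ** i for i in range(depth)]
        self.encoders = nn.ModuleList(
            [conv_block(1 if i == 0 else chs[i - 1], chs[i]) for i in range(depth)])
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(chs[-1], chs[-1] * 2)
        self.upconvs = nn.ModuleList(
            [nn.ConvTranspose2d(chs[i] * 2, chs[i], kernel_size=2, stride=2)
             for i in reversed(range(depth))])
        self.decoders = nn.ModuleList(
            [conv_block(chs[i] * 2, chs[i]) for i in reversed(range(depth))])
        self.head = nn.Conv2d(chs[0], 1, kernel_size=1)  # per-voxel logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for enc in self.encoders:                     # contracting path
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = up(x)                                 # expanding path
            x = dec(torch.cat([x, skip], dim=1))      # skip connection
        return torch.sigmoid(self.head(x))            # segmentation probability
```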
Network Training
The U-net–based algorithm was implemented to segment the primary tumor on 3-dimensional PET images on a per-slice basis. During training, 2-dimensional PET slices of 180 patients with the corresponding surrogate ground truth (tumor delineations performed by the physician) were input into the U-net–based algorithm. The network was trained to minimize a loss function between the true and predicted segmentations using the Adam optimization method (39). The loss function used is given in each experiment described below. Network hyperparameters, including parameters of the loss function and the dropout probability, were optimized with 5-fold cross-validation on the training dataset. The final optimized U-net–based algorithm was then evaluated on the remaining independent 45 patients from the same cohort. There was no overlap between the training and test sets.
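A schematic of this training procedure (a sketch under assumptions: `model`, `train_loader`, and `loss_fn` are placeholders, and the learning rate is illustrative; the study's hyperparameters were chosen by 5-fold cross-validation):

```python
import torch

def train(model, train_loader, loss_fn, epochs: int, lr: float = 1e-3):
    """Minimize the segmentation loss with the Adam optimizer."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for pet_slices, true_masks in train_loader:  # 2D slices + expert masks
            optimizer.zero_grad()
            pred_masks = model(pet_slices)           # predicted segmentation
            loss = loss_fn(pred_masks, true_masks)
            loss.backward()
            optimizer.step()
    return model
```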
Configuring the U-Net–Based Algorithm with Different Network Depths
We varied the network depth by setting the number of paired blocks of convolutional layers (supplemental materials) in the encoder and decoder to 2, 3, 4, and 5. The detailed network architecture consisting of 2 paired blocks is provided in Supplemental Figure 3. For each choice of depth, the network was trained to minimize a binary cross-entropy (BCE) loss between the true and predicted segmentations, denoted by $\mathbf{s}$ and $\hat{\mathbf{s}}$, respectively. The number of voxels in the PET image is denoted by $N$. The BCE loss is given by Eq. 1:

$$\mathcal{L}_{\mathrm{BCE}}(\hat{\mathbf{s}},\mathbf{s}) = -\frac{1}{N}\sum_{n=1}^{N}\bigl[s_n \log \hat{s}_n + (1-s_n)\log(1-\hat{s}_n)\bigr]. \tag{1}$$
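Equation 1 maps directly to code; a sketch in PyTorch (equivalent, up to the numerical stabilization we add, to the library's built-in binary cross-entropy):

```python
import torch

def bce_loss(pred: torch.Tensor, true: torch.Tensor,
             eps: float = 1e-7) -> torch.Tensor:
    """Binary cross-entropy of Eq. 1, averaged over the N voxels.
    `pred` holds predicted probabilities in (0, 1); `true` is binary."""
    pred = pred.clamp(eps, 1.0 - eps)  # avoid log(0)
    return -(true * pred.log() + (1 - true) * (1 - pred).log()).mean()
```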
The network with each depth choice was independently trained and cross-validated on the training dataset. After training, each network was evaluated on the 45 test patients.
Configuring the U-Net–Based Algorithm with Different Loss Functions
A commonly used loss function in DL-based segmentation algorithms is the combined Dice and BCE loss, which leverages the versatility of the Dice loss for handling class-imbalance problems and the use of the BCE loss for curve smoothing (36). In this loss function, the weight of the BCE loss is controlled by a hyperparameter, denoted by λ. We investigated whether evaluating the impact of different values of λ on the performance of the U-net–based algorithm using the task-agnostic and task-based FoMs yields consistent interpretations.
The Dice loss is denoted by $\mathcal{L}_{\mathrm{Dice}}$, such that

$$\mathcal{L}_{\mathrm{Dice}}(\hat{\mathbf{s}},\mathbf{s}) = 1 - \frac{2\sum_{n=1}^{N} s_n \hat{s}_n}{\sum_{n=1}^{N} s_n + \sum_{n=1}^{N} \hat{s}_n}. \tag{2}$$
The combined Dice and BCE loss is defined as

$$\mathcal{L}(\hat{\mathbf{s}},\mathbf{s}) = \mathcal{L}_{\mathrm{Dice}}(\hat{\mathbf{s}},\mathbf{s}) + \lambda\,\mathcal{L}_{\mathrm{BCE}}(\hat{\mathbf{s}},\mathbf{s}), \tag{3}$$

where the term $\mathcal{L}_{\mathrm{BCE}}$ is defined in Eq. 1. In this experiment, we considered 6 different values of λ ranging from 0 to 1. We fixed the depth of the network by considering 3 paired blocks of convolutional layers in the encoder and decoder. For each value of λ, the network was independently trained and cross-validated on the same training dataset. Each trained network was then evaluated on the 45 test patients.
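Equations 2 and 3 can be sketched as follows, reusing `bce_loss` from the earlier sketch. The smoothing constant is a common implementation convenience we add to avoid division by zero; it is not part of Eq. 2:

```python
import torch

def dice_loss(pred: torch.Tensor, true: torch.Tensor,
              smooth: float = 1e-7) -> torch.Tensor:
    """Dice loss of Eq. 2: one minus the (soft) Dice coefficient."""
    intersection = (pred * true).sum()
    return 1.0 - (2.0 * intersection + smooth) / (pred.sum() + true.sum() + smooth)

def combined_loss(pred: torch.Tensor, true: torch.Tensor,
                  lam: float) -> torch.Tensor:
    """Combined Dice and BCE loss of Eq. 3; `lam` is the hyperparameter λ."""
    return dice_loss(pred, true) + lam * bce_loss(pred, true)
```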
Evaluation FoMs
Task-Agnostic FoMs
The widely used task-agnostic FoMs of DSC, JSC, and HD were used in this study. The DSC and JSC, as defined in Taha and Hanbury (40), measure the spatial overlap between the true and predicted segmentations. The values of both DSC and JSC lie between 0 and 1, and a higher value implies more accurate performance. The HD quantifies the shape similarity between the true and predicted segmentations, and a lower value implies more accurate performance. The values of DSC, JSC, and HD are reported as mean and 95% CI. Paired-sample t tests were performed to assess whether significant differences exist.
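The statistical comparison described here uses standard tools; a sketch assuming arrays of per-patient FoM values for two algorithms (the normal-approximation CI is our choice for brevity):

```python
import numpy as np
from scipy import stats

def compare_algorithms(fom_a: np.ndarray, fom_b: np.ndarray, alpha: float = 0.05):
    """Mean, 95% CI (normal approximation), and paired-sample t test
    for per-patient FoM values of two segmentation algorithms."""
    def mean_ci(x):
        half_width = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
        return x.mean(), (x.mean() - half_width, x.mean() + half_width)
    t_stat, p_value = stats.ttest_rel(fom_a, fom_b)
    return mean_ci(fom_a), mean_ci(fom_b), p_value < alpha
```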
Task-Based FoMs
An essential criterion in evaluating algorithms that extract quantitative imaging metrics such as MTV and TLG is that the measurements obtained with the algorithm are accurate (41,42), because an algorithm that yields biased measurements would not accurately reflect the underlying pathophysiology. Within a population, the bias could often vary on the basis of the true value and thus should be quantified over the entire measurable range of values to provide a more complete measure of accuracy (43). Ensemble normalized bias, defined as the bias averaged over the distribution of true values, helps address this issue and provides a summarized FoM for accuracy (44,45). This FoM was thus used in this study. Precise definitions of the ensemble normalized bias are provided in the supplemental materials (41,42,44,45).
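The exact definition is given in the supplemental materials; the sketch below assumes one plausible form, the normalized error averaged over the patient population, and should be read as our assumption rather than the study's precise formula:

```python
import numpy as np

def ensemble_normalized_bias(estimates: np.ndarray,
                             true_values: np.ndarray) -> float:
    """Normalized bias averaged over the distribution of true values
    (our assumed form; see the supplemental materials for the exact
    definition used in the study)."""
    return np.mean((estimates - true_values) / true_values)
```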
RESULTS
Evaluation of Conventional Computer-Aided Algorithms
Figures 2A and 2B present the quantitative evaluation of the conventional computer-aided segmentation algorithms over the 225 patients using the task-agnostic and task-based FoMs. On the basis of the DSC and JSC, SUVmax40% significantly outperformed SUVmax50% (P < 0.05). However, we observed that SUVmax40% yielded increased ensemble normalized bias in the estimated MTV and TLG of 51% and 54%, respectively, indicating a much less accurate performance on the clinically relevant quantitative tasks. Similarly, the MRF-GMM significantly outperformed Snakes on the basis of the DSC, JSC, and HD (P < 0.05) but yielded a 24% increased ensemble normalized bias in the estimated MTV.
Figure 2C shows the visual comparison of segmentations yielded by SUVmax40% versus SUVmax50% for a representative patient. We observed that both algorithms yielded very similar DSC, JSC, and HD values. However, SUVmax40% yielded a substantially higher absolute normalized error (aNE) in the estimated MTV and TLG. For another representative patient, illustrated in Figure 2D, the MRF-GMM yielded higher DSC and JSC and lower HD values. However, this algorithm yielded less accurate estimates of MTV and TLG, as indicated by the higher aNEs.
Evaluating the U-Net–Based Algorithm
Impact of Network Depth Choice
Figure 3A shows the impact of varying network depth on the performance of the U-net–based algorithm, as evaluated using both the task-agnostic and the task-based FoMs for the 45 test patients. No significant difference was detected among any of the considered network depths on the basis of the DSC, JSC, and HD (P < 0.05). However, deeper networks yielded more accurate performance on the tasks of estimating MTV and TLG. Specifically, compared with the shallower network with 2 paired blocks of convolutional layers, the deeper network with 4 paired blocks yielded a substantially lower absolute ensemble normalized bias in the estimated MTV and TLG, with a decrease of 91% and 87%, respectively. Segmentations from the shallower and deeper networks are shown for 1 representative test patient in Figure 3B. We observed that the deeper network yielded lower DSC and JSC and higher HD values but actually outperformed the shallower network on the tasks of estimating the MTV and TLG.
Impact of Loss Function Choice
Figure 4A shows the assessment of concordance between task-agnostic versus task-based FoMs in evaluating the impact of varying loss functions on the performance of the U-net–based algorithm. On the basis of the DSC, JSC, and HD, there was no significant difference among any values of the hyperparameter λ. However, we observed substantial variations on the tasks of estimating MTV and TLG, with up to a 73% and 58% difference between the highest and lowest ensemble normalized bias in the estimated MTV and TLG, respectively. Figure 4B compares the segmentations obtained with a λ of 0 versus a λ of 0.8 for a representative test patient. For this patient, whereas the values of DSC, JSC, and HD were similar, a λ of 0 yielded lower aNEs in the estimated MTV and TLG.
DISCUSSION
Reliable performance on clinically relevant tasks is crucial for clinical translation of image segmentation algorithms. A key task for which image segmentation is often performed in oncologic PET is quantifying features such as MTV and TLG. However, these segmentation algorithms are almost always evaluated using FoMs that are not explicitly designed to measure clinical task performance. In this study, we investigated whether evaluating PET segmentation algorithms with the widely used task-agnostic FoMs leads to interpretations that are consistent with evaluation on clinically relevant quantitative tasks.
Results from Figure 2 indicate that evaluation of conventional computer-aided PET segmentation algorithms based on the task-agnostic FoMs of DSC, JSC, and HD could yield discordant interpretations relative to evaluation on the tasks of estimating MTV and TLG of the primary tumor. When evaluating the SUVmax thresholding algorithm, initial inspection based on the task-agnostic FoMs implied that the lower threshold of 40% SUVmax yielded a significantly superior performance. However, further investigation showed that SUVmax50% yielded substantially more accurate performance on estimating MTV and TLG. This discordance was also observed when comparing the MRF-GMM and Snakes algorithms. Thus, these results demonstrate the limited ability of the DSC, JSC, and HD to evaluate image segmentation algorithms on clinically relevant tasks.
The limitation of task-agnostic FoMs was again observed when evaluating the impact of network depth and loss function on the performance of a state-of-the-art U-net–based image segmentation algorithm. In Figure 3, we observed initially that the deeper networks yielded DSC, JSC, and HD values statistically similar to those of the shallower networks. Considering the requirement for computational resources when training DL-based algorithms, this may motivate the deployment of shallower networks in clinical studies. However, our task-based evaluation showed that the deeper network yielded substantially higher accuracy in the estimated MTV and TLG. Similarly, we observed from Figure 4 that, based on the task-agnostic FoMs, the performance of the U-net–based algorithm was insensitive to the choice of λ (the hyperparameter controlling the weight of the BCE loss in the cost function). However, differences up to 73% and 58% could occur between the highest and lowest ensemble normalized bias in the estimated MTV and TLG, respectively.
To gain further insights into the observed discordance between task-agnostic and task-based FoMs, we conducted secondary analyses on a per-patient basis. In Figure 5A, for each of the 225 patients, we first calculated the difference (Δ) in DSC, JSC, and HD between SUVmax50% and SUVmax40% (e.g., ΔDSC = DSC for SUVmax50% minus DSC for SUVmax40%). Next, we obtained the difference in the aNE (supplemental materials; Eq. 2) of the estimated MTV and TLG (e.g., ΔaNE_MTV). We then studied the relationship between ΔDSC (and ΔJSC and ΔHD) versus ΔaNE_MTV (and ΔaNE_TLG) via scatter plots. For 36 patients, a negative value of ΔDSC was observed, signifying that SUVmax50% was inferior to SUVmax40%. However, for these patients, SUVmax50% actually yielded better estimates of MTV, as indicated by the lower aNEs. Similarly, it was observed that interpretations obtained with ΔHD could be discordant with those based on ΔaNE. Additionally, even for minor changes in DSC, JSC, and HD (i.e., Δ values close to 0, near the vertical dashed line in the scatter plots), we observed substantial variations in the ΔaNE values. This indicates that these task-agnostic FoMs could be insensitive to even dramatic changes in quantitative task performance. This trend was again observed when comparing MRF-GMM versus Snakes (Fig. 5B) and when evaluating the impact of network depth and loss function on the performance of the U-net–based algorithm (Fig. 6).
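A sketch of this per-patient secondary analysis, with hypothetical arrays of per-patient FoM and aNE values standing in for the study's data:

```python
import numpy as np
import matplotlib.pyplot as plt

def delta_scatter(dsc_50, dsc_40, ane_mtv_50, ane_mtv_40):
    """Per-patient differences in DSC and in aNE of the estimated MTV
    between SUVmax50% and SUVmax40%, shown as a scatter plot."""
    delta_dsc = np.asarray(dsc_50) - np.asarray(dsc_40)
    delta_ane = np.asarray(ane_mtv_50) - np.asarray(ane_mtv_40)
    plt.scatter(delta_dsc, delta_ane)
    plt.axvline(0.0, linestyle="--")  # the vertical dashed line at ΔDSC = 0
    plt.xlabel("ΔDSC (SUVmax50% minus SUVmax40%)")
    plt.ylabel("ΔaNE in estimated MTV")
    plt.show()
```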
The findings of this study are not meant to suggest that the task-agnostic metrics, including the DSC, JSC, and HD, are not helpful. In fact, initial development of segmentation algorithms may not be associated with a specific task, and thus, task-agnostic FoMs are valuable for evaluating the promise of these algorithms. Nonetheless, for clinical application, it is important to further assess the performance of these algorithms on the clinical tasks for which imaging is performed, as also emphasized in the best practices for evaluation of artificial intelligence algorithms for nuclear medicine (RELAINCE guidelines) (44). Results from our study further confirm the need for this task-based evaluation.
Our task-based evaluation focused on assessing the accuracy of image segmentation algorithms in quantifying features from PET images. In clinical studies, other criteria to evaluate quantification performance could include precision, when repeatability or reproducibility are required for clinical decision-making. When the segmentation is required for radiotherapy planning, the relevant criterion is therapeutic efficacy: for example, the task of improving the probability of tumor control while minimizing the chances of normal-tissue complications. For this task, Barrett et al. proposed the application of the area under the therapy operating characteristic curve (46) for evaluating segmentation algorithms. In all of these evaluation studies, clinicians (radiologists, nuclear medicine physicians, and disease specialists) have a crucial role in defining the clinically most relevant task and corresponding FoMs for the evaluation of image segmentation algorithms (11).
Evaluating PET segmentation algorithms on quantification tasks requires knowledge of the true quantitative values of interest. However, such ground truth is often unavailable in clinical studies. To circumvent this challenge, we considered quantitative values obtained using expert human-reader–defined manual delineations as surrogate ground truth. However, we recognize that this surrogate may be erroneous. To address the issue of a lack of ground truth in task-based evaluation of quantitative imaging algorithms, no-gold-standard evaluation techniques have been developed (47–50). These techniques have demonstrated promise in evaluating PET segmentation algorithms on clinically relevant quantitative tasks (51–53). As these techniques are validated further, they could provide a mechanism to perform objective task-based evaluation of segmentation algorithms with patient data. The findings from this study motivate further development and validation of these no-gold-standard evaluation techniques.
Other limitations of this study include the fact that the PET scanners used in the ACRIN 6668/RTOG 0235 multicenter clinical trial were fairly old and did not have time-of-flight capability. Thus, these scanners could yield markedly lower effective sensitivity compared with modern PET scanners. Conducting the proposed study with newer-generation scanners could provide further insight into the potential discordance between task-agnostic and task-based FoMs with more modern technologies. Additionally, the U-net–based algorithm was trained to segment tumors on a per-slice basis. As shown by Leung et al. (5), this strategy helps relax the requirement for large amounts of training data and the demand for computational resources. Results from this study motivate expanding the evaluation to 3-dimensional fully automated DL-based algorithms.
As a final remark, the purpose of this study was not to compare DL-based algorithms with conventional computer-aided algorithms. Although we observed that the considered U-net–based algorithm yielded substantially improved performance compared with the conventional algorithms on both the task-agnostic and the task-based metrics, this study does not intend to suggest that DL-based algorithms are preferable over conventional algorithms.
CONCLUSION
Our retrospective analysis with the ACRIN 6668/RTOG 0235 multicenter clinical trial data shows that evaluation of PET segmentation algorithms based on widely used task-agnostic FoMs might lead to findings that are discordant with evaluation on clinically relevant quantitative tasks. These findings emphasize the important need for objective task-based evaluation of image segmentation algorithms for quantitative PET.
DISCLOSURE
This work was supported by the National Institute of Biomedical Imaging and Bioengineering through R01-EB031051, R01-EB031962, R56-EB028287, and R21-EB024647 (Trailblazer Award). No other potential conflict of interest relevant to this article was reported.
KEY POINTS
QUESTION: Are widely used metrics such as the DSC, JSC, and HD sufficient to evaluate image segmentation algorithms for their clinical applications?
PERTINENT FINDINGS: Our retrospective analysis of the ACRIN 6668/RTOG 0235 multicenter clinical trial data shows that evaluating PET segmentation algorithms on the basis of the DSC, JSC, and HD FoMs could lead to interpretations that are discordant with evaluation on the clinically relevant quantitative tasks of estimating the MTV and TLG of primary tumors in patients with non–small cell lung cancer.
IMPLICATIONS FOR PATIENT CARE: Objective task-based evaluation of new and existing image segmentation algorithms is important for their clinical application.
Footnotes
Published online Feb. 15, 2024.
- © 2024 by the Society of Nuclear Medicine and Molecular Imaging.
Immediate Open Access: Creative Commons Attribution 4.0 International License (CC BY) allows users to share and adapt with attribution, excluding materials credited to previous publications. License: https://creativecommons.org/licenses/by/4.0/. Details: http://aaa161.com/site/misc/permission.xhtml.
REFERENCES
- Received for publication May 12, 2023.
- Revision received December 19, 2023.