Abstract
Reliable performance of PET segmentation algorithms on clinically relevant tasks is required for their clinical translation. However, these algorithms are typically evaluated using figures of merit (FoMs) that are not explicitly designed to correlate with clinical task performance. Such FoMs include the Dice similarity coefficient (DSC), the Jaccard similarity coefficient (JSC), and the Hausdorff distance (HD). The objective of this study was to investigate whether evaluating PET segmentation algorithms using these task-agnostic FoMs yields interpretations consistent with evaluation on clinically relevant quantitative tasks. Methods: We conducted a retrospective study to assess the concordance in the evaluation of segmentation algorithms using the DSC, JSC, and HD versus on the tasks of estimating the metabolic tumor volume (MTV) and total lesion glycolysis (TLG) of primary tumors from PET images of patients with non–small cell lung cancer. The PET images were collected from the American College of Radiology Imaging Network 6668/Radiation Therapy Oncology Group 0235 multicenter clinical trial data. The study was conducted in 2 contexts: (1) evaluating conventional segmentation algorithms, namely those based on thresholding (SUVmax40% and SUVmax50%), boundary detection (Snakes), and stochastic modeling (Markov random field–Gaussian mixture model); (2) evaluating the impact of network depth and loss function on the performance of a state-of-the-art U-net–based segmentation algorithm. Results: Evaluation of conventional segmentation algorithms based on the DSC, JSC, and HD showed that SUVmax40% significantly outperformed SUVmax50%. However, SUVmax40% yielded lower accuracy on the tasks of estimating MTV and TLG, with a 51% and 54% increase, respectively, in the ensemble normalized bias. Similarly, the Markov random field–Gaussian mixture model significantly outperformed Snakes on the basis of the task-agnostic FoMs but yielded a 24% increased bias in estimated MTV. For the U-net–based algorithm, our evaluation showed that although the network depth did not significantly alter the DSC, JSC, and HD values, a deeper network yielded substantially higher accuracy in the estimated MTV and TLG, with a decrease in bias of 91% and 87%, respectively. Similarly, whereas there was no significant difference in the DSC, JSC, and HD values for different loss functions, up to a 73% and 58% difference in the bias of the estimated MTV and TLG, respectively, was observed. Conclusion: Evaluation of PET segmentation algorithms using task-agnostic FoMs could yield findings discordant with evaluation on clinically relevant quantitative tasks. This study emphasizes the need for objective task-based evaluation of image segmentation algorithms for quantitative PET.
- task-based evaluation
- multicenter clinical trial
- segmentation
- quantitative imaging
- deep learning
- artificial intelligence
PET-derived quantitative metrics, such as tumor volumetric and radiomic features, are showing strong promise in multiple oncologic applications (1–3). Reliable quantification of these features requires accurate segmentation of tumors on the PET images. To address this need, multiple computer-aided image segmentation algorithms have been developed (4), including those based on deep learning (DL) (5–8). Clinical translation of these image segmentation algorithms requires objectively evaluating them with patient data.
Medical images are acquired for specific clinical tasks; thus, it is critical that the performance of imaging and image-analysis algorithms be objectively assessed on those tasks. In this context, strategies have been proposed for task-based assessment of image quality (9–12). However, imaging algorithms, including those based on DL, are often evaluated using figures of merit (FoMs) that are not explicitly designed to measure clinical task performance (11). Recent studies conducted specifically in the context of evaluating image-denoising algorithms showed that task-agnostic FoMs may yield interpretations that are inconsistent with evaluation on clinical tasks (13–17). For example, in Yu et al. (17), a DL-based denoising algorithm for myocardial perfusion SPECT indicated significantly improved performance based on the structural similarity index measure and mean squared error but did not yield any improved performance on the clinical task of detecting myocardial perfusion defects.
Similar to image denoising, algorithms for image segmentation are almost always evaluated using FoMs that are not explicitly designed to quantify clinical task performance (5,18–21). These FoMs, including the Dice similarity coefficient (DSC), the Jaccard similarity coefficient (JSC), and the Hausdorff distance (HD) (4), quantify some measure of similarity between the predicted segmentation and a reference standard such as manual delineation. For example, the DSC measures spatial overlap between the predicted segmentation and the reference standard. A higher value of the DSC is typically used to infer more accurate performance. However, it is unclear how these task-agnostic FoMs correlate with performance on clinically relevant tasks.
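For concreteness, these three FoMs can be computed from binary masks as in the following minimal sketch (our own illustration, not code from the study; NumPy and SciPy are assumed):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient between two boolean masks."""
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

def jaccard(pred: np.ndarray, truth: np.ndarray) -> float:
    """Jaccard similarity coefficient (intersection over union)."""
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return intersection / union

def hausdorff(pred: np.ndarray, truth: np.ndarray) -> float:
    """Symmetric Hausdorff distance between the voxel coordinates of
    two nonempty masks (in voxel units; scale by voxel size for mm)."""
    p = np.argwhere(pred)
    t = np.argwhere(truth)
    return max(directed_hausdorff(p, t)[0], directed_hausdorff(t, p)[0])
```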
Our objective was to examine whether evaluating PET segmentation algorithms using task-agnostic FoMs leads to interpretations that are consistent with evaluation based on clinical task performance. Performing this investigation with patient data in a multicenter setting is highly desirable because such a study offers the ability to model variabilities in the patient population and clinical scanner configurations. Toward this goal, we conducted a retrospective study using data from the American College of Radiology Imaging Network (ACRIN) 6668/Radiation Therapy Oncology Group (RTOG) 0235 multicenter clinical trial (22,23). In this trial, patients with stage IIB/III non–small cell lung cancer were imaged with 18F-FDG PET/CT scans. In the study of non–small cell lung cancer, there is a strong interest in investigating whether early changes in tumor metabolism can help predict treatment response (24). Although most studies have focused on SUV-based metrics, the findings have been inconsistent (24,25), motivating the need for new and improved metrics. In this context, metabolic tumor volume (MTV) and total lesion glycolysis (TLG) are showing strong promise as prognostic biomarkers in multiple studies (3,26,27). As introduced above, computing these features requires tumor segmentation. Thus, our study was designed to assess the concordance in evaluating several image segmentation algorithms using task-agnostic metrics (DSC, JSC, and HD) versus on the clinically relevant tasks of estimating the MTV and TLG. Initial results of this research were presented previously (28); here, we provide a detailed description of the methods and study design, provide new results, and conduct extensive analyses of the results.
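MTV and TLG follow directly from a segmentation by their standard definitions: MTV is the segmented volume, and TLG is the product of MTV and the mean SUV within the segmentation. A minimal sketch of this computation (our illustration; variable names are hypothetical):

```python
import numpy as np

def mtv_and_tlg(suv_image: np.ndarray, mask: np.ndarray,
                voxel_volume_ml: float) -> tuple[float, float]:
    """MTV (mL) and TLG (SUV*mL) from an SUV image and a binary tumor mask."""
    mtv = mask.sum() * voxel_volume_ml               # metabolic tumor volume
    mean_suv = suv_image[mask.astype(bool)].mean()   # SUVmean inside the tumor
    tlg = mtv * mean_suv                             # total lesion glycolysis
    return mtv, tlg
```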
MATERIALS AND METHODS
Study Population
This retrospective study of existing data was approved by the institutional review board, which waived the requirement to obtain informed consent. Deidentified 18F-FDG PET/CT images of 225 patients with inoperable stage IIB/III locally advanced non–small cell lung cancer were collected from the ACRIN 6668/RTOG 0235 multicenter clinical trial (22,23). The images were obtained from The Cancer Imaging Archive database (29). Baseline PET/CT scans were acquired before curative-intent chemoradiotherapy for each patient. Demographics and clinical characteristics of the patient population are summarized in Supplemental Table 1 (supplemental materials are available at http://aaa161.com). A standardized imaging protocol was detailed by Machtay et al. (23). Briefly, an 18F-FDG dose ranging from 370 to 740 MBq was administered, with image acquisition beginning 50–70 min later and covering the body from the upper–mid neck to the proximal femurs. The PET images were acquired from 12 ACRIN-qualified clinical scanners (30), including GE Healthcare Discovery LS/ST/STE/RX, GE Healthcare Advance, Philips Allegro/Gemini, and CTI PET Systems (marketed as Siemens scanners) models 1023/1024/1062/1080/1094. The image reconstruction procedure compensated for attenuation, scatter, randoms, normalization, decay, and dead time. Details of the reconstruction protocol for each PET scanner are provided in Supplemental Table 2.
Data Curation
Evaluation of PET segmentation algorithms requires knowledge of true tumor boundaries or a surrogate for ground truth, such as tumor delineations performed by an expert human reader. For this purpose, a board-certified nuclear medicine physician with more than 10 y of experience reading PET scans was tasked with defining the boundary of the primary tumor for each patient (Fig. 1). The physician was instructed to locate the primary tumor by carefully reviewing the coregistered PET/CT images along the coronal, sagittal, and transverse planes and then using an edge-detection tool (MIM Encore 6.9.3; MIM Software Inc.) to obtain an initial boundary of the primary tumor. The physician was informed explicitly about potential errors in this initial boundary and was thus advised to review the boundary carefully and make any modifications as needed. The task of segmenting the tumors in the whole dataset was divided into multiple sessions to mitigate reader fatigue. At the end of this process, we had expert-defined segmentations for the primary tumors in the 225 PET scans in our dataset.
Evaluation of Conventional Computer-Aided Image Segmentation Algorithms
Conventional computer-aided PET segmentation algorithms are commonly categorized into those based on thresholding, boundary detection, and stochastic modeling (4). We selected the algorithms of SUVmax thresholding (SUVmax40% and SUVmax50%) (31), Snakes (32), and Markov random field–Gaussian mixture model (MRF-GMM) (33) from each of these categories, respectively. A detailed description of these algorithms is provided in the supplemental materials (31–33).
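As an illustration of the thresholding category, a sketch of fixed-fraction SUVmax thresholding, assuming a region of interest has already been placed around the primary tumor (our simplification; the exact implementation is described in the supplemental materials):

```python
import numpy as np

def suvmax_threshold_segmentation(suv_roi: np.ndarray,
                                  fraction: float = 0.40) -> np.ndarray:
    """Threshold-based segmentation: keep voxels whose SUV exceeds a
    fixed fraction (e.g., 40% or 50%) of the maximum SUV in the ROI."""
    threshold = fraction * suv_roi.max()
    return suv_roi > threshold
```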
Evaluation of the DL-Based Image Segmentation Algorithm
We next considered the evaluation of a state-of-the-art U-net–based algorithm (5,8,34,35). A detailed description of the network architecture is provided in Supplemental Figure 1. When DL-based algorithms are developed and evaluated, common factors known to impact the output include the choice of network depth (36), network width (37), loss function (38), and data preprocessing and augmentation strategies. In this study, we focused on investigating whether evaluating the impact of network depth and loss function using the task-agnostic FoMs yields inferences that are consistent with evaluation on the tasks of estimating MTV and TLG.
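For orientation, a minimal 2D U-net with a configurable number of paired convolutional blocks in the encoder and decoder is sketched below. This is our own simplified sketch, not the study's exact architecture (which is detailed in Supplemental Figure 1); channel counts and activation choices are assumptions:

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """A paired block of two 3x3 convolutional layers with ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    """Minimal 2D U-net whose depth (number of paired blocks in the
    encoder and decoder) is a constructor argument."""
    def __init__(self, depth: int = 3, base_ch: int = 32):
        super().__init__()
        chs = [base_ch * 2 ** i for i in range(depth)]
        self.encoders = nn.ModuleList(
            [conv_block(1 if i == 0 else chs[i - 1], chs[i]) for i in range(depth)])
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(chs[-1], chs[-1] * 2)
        self.upconvs = nn.ModuleList(
            [nn.ConvTranspose2d(chs[i] * 2, chs[i], kernel_size=2, stride=2)
             for i in reversed(range(depth))])
        self.decoders = nn.ModuleList(
            [conv_block(chs[i] * 2, chs[i]) for i in reversed(range(depth))])
        self.head = nn.Conv2d(chs[0], 1, kernel_size=1)  # per-voxel logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for enc in self.encoders:                     # contracting path
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = up(x)                                 # expanding path
            x = dec(torch.cat([x, skip], dim=1))      # skip connection
        return torch.sigmoid(self.head(x))            # segmentation probability
```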
Network Training
The U-net–based algorithm was implemented to segment the primary tumor on 3-dimensional PET images on a per-slice basis. During training, 2-dimensional PET slices of 180 patients with the corresponding surrogate ground truth (tumor delineations performed by the physician) were input into the U-net–based algorithm. The network was trained to minimize a loss function between the true and predicted segmentations using the Adam optimization method (39). The loss function used is given in each experiment described below. Network hyperparameters, including parameters of the loss function and the dropout probability, were optimized with 5-fold cross-validation on the training dataset. The final optimized U-net–based algorithm was then evaluated on the remaining independent 45 patients from the same cohort. There was no overlap between the training and test sets.
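A schematic of this training procedure (a sketch under assumptions: `model`, `train_loader`, and `loss_fn` are placeholders, and the learning rate is illustrative; the study's hyperparameters were chosen by 5-fold cross-validation):

```python
import torch

def train(model, train_loader, loss_fn, epochs: int, lr: float = 1e-3):
    """Minimize the segmentation loss with the Adam optimizer."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for pet_slices, true_masks in train_loader:  # 2D slices + expert masks
            optimizer.zero_grad()
            pred_masks = model(pet_slices)           # predicted segmentation
            loss = loss_fn(pred_masks, true_masks)
            loss.backward()
            optimizer.step()
    return model
```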
Configuring the U-Net–Based Algorithm with Different Network Depths
We varied the network depth by setting the number of paired blocks of convolutional layers (supplemental materials) in the encoder and decoder to 2, 3, 4, and 5. The detailed network architecture consisting of 2 paired blocks is provided in Supplemental Figure 3. For each choice of depth, the network was trained to minimize a binary cross-entropy (BCE) loss between the true and predicted segmentations, denoted by $\mathbf{s}$ and $\hat{\mathbf{s}}$, respectively. The number of voxels in the PET image is denoted by $N$. The BCE loss is given by Eq. 1:

$$\mathcal{L}_{\mathrm{BCE}}(\hat{\mathbf{s}},\mathbf{s}) = -\frac{1}{N}\sum_{n=1}^{N}\bigl[s_n \log \hat{s}_n + (1-s_n)\log(1-\hat{s}_n)\bigr]. \tag{1}$$
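Equation 1 maps directly to code; a sketch in PyTorch (equivalent, up to the numerical stabilization we add, to the library's built-in binary cross-entropy):

```python
import torch

def bce_loss(pred: torch.Tensor, true: torch.Tensor,
             eps: float = 1e-7) -> torch.Tensor:
    """Binary cross-entropy of Eq. 1, averaged over the N voxels.
    `pred` holds predicted probabilities in (0, 1); `true` is binary."""
    pred = pred.clamp(eps, 1.0 - eps)  # avoid log(0)
    return -(true * pred.log() + (1 - true) * (1 - pred).log()).mean()
```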
The network with each depth choice was independently trained and cross-validated on the training dataset. After training, each network was evaluated on the 45 test patients.
Configuring the U-Net–Based Algorithm with Different Loss Functions
A commonly used loss function in DL-based segmentation algorithms is the combined Dice and BCE loss, which leverages the versatility of the Dice loss for handling class-imbalance problems and the use of the BCE loss for curve smoothing (36). In this loss function, the weight of the BCE loss is controlled by a hyperparameter, denoted by λ. We investigated whether evaluating the impact of different values of λ on the performance of the U-net–based algorithm using the task-agnostic and task-based FoMs yields consistent interpretations.
The Dice loss is denoted by $\mathcal{L}_{\mathrm{Dice}}$, such that

$$\mathcal{L}_{\mathrm{Dice}}(\hat{\mathbf{s}},\mathbf{s}) = 1 - \frac{2\sum_{n=1}^{N} s_n \hat{s}_n}{\sum_{n=1}^{N} s_n + \sum_{n=1}^{N} \hat{s}_n}. \tag{2}$$
The combined Dice and BCE loss is defined as

$$\mathcal{L}(\hat{\mathbf{s}},\mathbf{s}) = \mathcal{L}_{\mathrm{Dice}}(\hat{\mathbf{s}},\mathbf{s}) + \lambda\,\mathcal{L}_{\mathrm{BCE}}(\hat{\mathbf{s}},\mathbf{s}), \tag{3}$$

where the term $\mathcal{L}_{\mathrm{BCE}}$ is defined in Eq. 1. In this experiment, we considered 6 different values of λ ranging from 0 to 1. We fixed the depth of the network by considering 3 paired blocks of convolutional layers in the encoder and decoder. For each value of λ, the network was independently trained and cross-validated on the same training dataset. Each trained network was then evaluated on the 45 test patients.
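Equations 2 and 3 can be sketched as follows, reusing `bce_loss` from the earlier sketch. The smoothing constant is a common implementation convenience we add to avoid division by zero; it is not part of Eq. 2:

```python
import torch

def dice_loss(pred: torch.Tensor, true: torch.Tensor,
              smooth: float = 1e-7) -> torch.Tensor:
    """Dice loss of Eq. 2: one minus the (soft) Dice coefficient."""
    intersection = (pred * true).sum()
    return 1.0 - (2.0 * intersection + smooth) / (pred.sum() + true.sum() + smooth)

def combined_loss(pred: torch.Tensor, true: torch.Tensor,
                  lam: float) -> torch.Tensor:
    """Combined Dice and BCE loss of Eq. 3; `lam` is the hyperparameter λ."""
    return dice_loss(pred, true) + lam * bce_loss(pred, true)
```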
Evaluation FoMs
Task-Agnostic FoMs
The widely used task-agnostic FoMs of DSC, JSC, and HD were used in this study. The DSC and JSC, as defined in Taha and Hanbury (40), measure the spatial overlap between the true and predicted segmentations. The values of both DSC and JSC lie between 0 and 1, and a higher value implies more accurate performance. The HD quantifies the shape similarity between the true and predicted segmentations, and a lower value implies more accurate performance. The values of DSC, JSC, and HD are reported as mean and 95% CI. Paired-sample t tests were performed to assess whether significant differences exist.
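The statistical comparison described here uses standard tools; a sketch assuming arrays of per-patient FoM values for two algorithms (the normal-approximation CI is our choice for brevity):

```python
import numpy as np
from scipy import stats

def compare_algorithms(fom_a: np.ndarray, fom_b: np.ndarray, alpha: float = 0.05):
    """Mean, 95% CI (normal approximation), and paired-sample t test
    for per-patient FoM values of two segmentation algorithms."""
    def mean_ci(x):
        half_width = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
        return x.mean(), (x.mean() - half_width, x.mean() + half_width)
    t_stat, p_value = stats.ttest_rel(fom_a, fom_b)
    return mean_ci(fom_a), mean_ci(fom_b), p_value < alpha
```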
Task-Based FoMs
An essential criterion in evaluating algorithms that extract quantitative imaging metrics such as MTV and TLG is that the measurements obtained with the algorithm are accurate (41,42), because an algorithm that yields biased measurements would not accurately reflect the underlying pathophysiology. Within a population, the bias could often vary on the basis of the true value and thus should be quantified over the entire measurable range of values to provide a more complete measure of accuracy (43). Ensemble normalized bias, defined as the bias averaged over the distribution of true values, helps address this issue and provides a summarized FoM for accuracy (44,45). This FoM was thus used in this study. Precise definitions of the ensemble normalized bias are provided in the supplemental materials (41,42,44,45).
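The exact definition is given in the supplemental materials; the sketch below assumes one plausible form, the normalized error averaged over the patient population, and should be read as our assumption rather than the study's precise formula:

```python
import numpy as np

def ensemble_normalized_bias(estimates: np.ndarray,
                             true_values: np.ndarray) -> float:
    """Normalized bias averaged over the distribution of true values
    (our assumed form; see the supplemental materials for the exact
    definition used in the study)."""
    return np.mean((estimates - true_values) / true_values)
```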
RESULTS
Evaluation of Conventional Computer-Aided Algorithms
Figures 2A and 2B present the quantitative evaluation of the conventional computer-aided segmentation algorithms over the 225 patients using the task-agnostic and task-based FoMs. On the basis of the DSC and JSC, SUVmax40% significantly outperformed SUVmax50% (P < 0.05). However, we observed that SUVmax40% yielded increased ensemble normalized bias in the estimated MTV and TLG of 51% and 54%, respectively, indicating a much less accurate performance on the clinically relevant quantitative tasks. Similarly, the MRF-GMM significantly outperformed Snakes on the basis of the DSC, JSC, and HD (P < 0.05) but yielded a 24% increased ensemble normalized bias in the estimated MTV.
Figure 2C shows the visual comparison of segmentations yielded by SUVmax40% versus SUVmax50% for a representative patient. We observed that both algorithms yielded very similar DSC, JSC, and HD values. However, SUVmax40% yielded a substantially higher absolute normalized error (aNE) in the estimated MTV and TLG. For another representative patient, illustrated in Figure 2D, the MRF-GMM yielded higher DSC and JSC and lower HD values. However, this algorithm yielded less accurate estimates of MTV and TLG, as indicated by the higher aNEs.
Evaluating the U-Net–Based Algorithm
Impact of Network Depth Choice
Figure 3A shows the impact of varying network depth on the performance of the U-net–based algorithm, as evaluated using both the task-agnostic and the task-based FoMs for the 45 test patients. No significant difference was detected among any of the considered network depths on the basis of the DSC, JSC, and HD (P < 0.05). However, deeper networks yielded more accurate performance on the tasks of estimating MTV and TLG. Specifically, compared with the shallower network with 2 paired blocks of convolutional layers, the deeper network with 4 paired blocks yielded a substantially lower absolute ensemble normalized bias in the estimated MTV and TLG, with a decrease of 91% and 87%, respectively. Segmentations from the shallower and deeper networks are shown for 1 representative test patient in Figure 3B. We observed that the deeper network yielded lower DSC and JSC and higher HD values but actually outperformed the shallower network on the tasks of estimating the MTV and TLG.
Impact of Loss Function Choice
Figure 4A shows the assessment of concordance between task-agnostic versus task-based FoMs in evaluating the impact of varying loss functions on the performance of the U-net–based algorithm. On the basis of the DSC, JSC, and HD, there was no significant difference among any values of the hyperparameter λ. However, we observed substantial variations on the tasks of estimating MTV and TLG, with up to a 73% and 58% difference between the highest and lowest ensemble normalized bias in the estimated MTV and TLG, respectively. Figure 4B compares the segmentations obtained with a λ of 0 versus a λ of 0.8 for a representative test patient. For this patient, whereas the values of DSC, JSC, and HD were similar, a λ of 0 yielded lower aNEs in the estimated MTV and TLG.
DISCUSSION
Reliable performance on clinically relevant tasks is crucial for clinical translation of image segmentation algorithms. A key task for which image segmentation is often performed in oncologic PET is quantifying features such as MTV and TLG. However, these segmentation algorithms are almost always evaluated using FoMs that are not explicitly designed to measure clinical task performance. In this study, we investigated whether evaluating PET segmentation algorithms with the widely used task-agnostic FoMs leads to interpretations that are consistent with evaluation on clinically relevant quantitative tasks.
Results from Figure 2 indicate that evaluation of conventional computer-aided PET segmentation algorithms based on the task-agnostic FoMs of DSC, JSC, and HD could yield discordant interpretations relative to evaluation on the tasks of estimating MTV and TLG of the primary tumor. When evaluating the SUVmax thresholding algorithm, initial inspection based on the task-agnostic FoMs implied that the lower threshold of 40% SUVmax yielded a significantly superior performance. However, further investigation showed that SUVmax50% yielded substantially more accurate performance on estimating MTV and TLG. This discordance was also observed when comparing the MRF-GMM and Snakes algorithms. Thus, these results demonstrate the limited ability of the DSC, JSC, and HD to evaluate image segmentation algorithms on clinically relevant tasks.
The limitation of task-agnostic FoMs was again observed when evaluating the impact of network depth and loss function on the performance of a state-of-the-art U-net–based image segmentation algorithm. In Figure 3, we observed initially that the deeper networks yielded DSC, JSC, and HD values statistically similar to those of the shallower networks. Considering the requirement for computational resources when training DL-based algorithms, this may motivate the deployment of shallower networks in clinical studies. However, our task-based evaluation showed that the deeper network yielded substantially higher accuracy in the estimated MTV and TLG. Similarly, we observed from Figure 4 that, based on the task-agnostic FoMs, the performance of the U-net–based algorithm was insensitive to the choice of λ (the hyperparameter controlling the weight of the BCE loss in the cost function). However, differences up to 73% and 58% could occur between the highest and lowest ensemble normalized bias in the estimated MTV and TLG, respectively.
To gain further insights into the observed discordance between task-agnostic and task-based FoMs, we conducted secondary analyses on a per-patient basis. In Figure 5A, for each of the 225 patients, we first calculated the difference (Δ) in DSC, JSC, and HD between SUVmax50% and SUVmax40% (e.g., ΔDSC = DSC for SUVmax50% minus DSC for SUVmax40%). Next, we obtained the difference in the aNE (supplemental materials; Eq. 2) of the estimated MTV and TLG (e.g., ΔaNE_MTV). We then studied the relationship between ΔDSC (and ΔJSC and ΔHD) versus ΔaNE_MTV (and ΔaNE_TLG) via scatter plots. For 36 patients, a negative value of ΔDSC was observed, signifying that SUVmax50% was inferior to SUVmax40%. However, for these patients, SUVmax50% actually yielded better estimates of MTV, as indicated by the lower aNEs. Similarly, it was observed that interpretations obtained with ΔHD could be discordant with those based on ΔaNE. Additionally, even for minor changes in DSC, JSC, and HD (i.e., Δ values close to 0, near the vertical dashed line in the scatter plots), we observed substantial variations in the ΔaNE values. This indicates that these task-agnostic FoMs could be insensitive to even dramatic changes in quantitative task performance. This trend was again observed when comparing MRF-GMM versus Snakes (Fig. 5B) and when evaluating the impact of network depth and loss function on the performance of the U-net–based algorithm (Fig. 6).
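A sketch of this per-patient secondary analysis, with hypothetical arrays of per-patient FoM and aNE values standing in for the study's data:

```python
import numpy as np
import matplotlib.pyplot as plt

def delta_scatter(dsc_50, dsc_40, ane_mtv_50, ane_mtv_40):
    """Per-patient differences in DSC and in aNE of the estimated MTV
    between SUVmax50% and SUVmax40%, shown as a scatter plot."""
    delta_dsc = np.asarray(dsc_50) - np.asarray(dsc_40)
    delta_ane = np.asarray(ane_mtv_50) - np.asarray(ane_mtv_40)
    plt.scatter(delta_dsc, delta_ane)
    plt.axvline(0.0, linestyle="--")  # the vertical dashed line at ΔDSC = 0
    plt.xlabel("ΔDSC (SUVmax50% minus SUVmax40%)")
    plt.ylabel("ΔaNE in estimated MTV")
    plt.show()
```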
The findings of this study are not meant to suggest that the task-agnostic metrics, including the DSC, JSC, and HD, are not helpful. In fact, initial development of segmentation algorithms may not be associated with a specific task, and thus, task-agnostic FoMs are valuable for evaluating the promise of these algorithms. Nonetheless, for clinical application, it is important to further assess the performance of these algorithms on the clinical tasks for which imaging is performed, as also emphasized in the best practices for evaluation of artificial intelligence algorithms for nuclear medicine (RELAINCE guidelines) (44). Results from our study further confirm the need for this task-based evaluation.
Our task-based evaluation focused on assessing the accuracy of image segmentation algorithms in quantifying features from PET images. In clinical studies, other criteria to evaluate quantification performance could include precision, when repeatability or reproducibility are required for clinical decision-making. When the segmentation is required for radiotherapy planning, the relevant criterion is therapeutic efficacy: for example, the task of improving the probability of tumor control while minimizing the chances of normal-tissue complications. For this task, Barrett et al. proposed the application of the area under the therapy operating characteristic curve (46) for evaluating segmentation algorithms. In all of these evaluation studies, clinicians (radiologists, nuclear medicine physicians, and disease specialists) have a crucial role in defining the clinically most relevant task and corresponding FoMs for the evaluation of image segmentation algorithms (11).
Evaluating PET segmentation algorithms on quantification tasks requires knowledge of the true quantitative values of interest. However, such ground truth is often unavailable in clinical studies. To circumvent this challenge, we considered quantitative values obtained using expert human-reader–defined manual delineations as surrogate ground truth. However, we recognize that this surrogate may be erroneous. To address the issue of a lack of ground truth in task-based evaluation of quantitative imaging algorithms, no-gold-standard evaluation techniques have been developed (47–50). These techniques have demonstrated promise in evaluating PET segmentation algorithms on clinically relevant quantitative tasks (51–53). As these techniques are validated further, they could provide a mechanism to perform objective task-based evaluation of segmentation algorithms with patient data. The findings from this study motivate further development and validation of these no-gold-standard evaluation techniques.
Other limitations of this study include the fact that the PET scanners used in the ACRIN 6668/RTOG 0235 multicenter clinical trial were fairly old and did not have time-of-flight capability. Thus, these scanners could yield markedly lower effective sensitivity compared with modern PET scanners. Conducting the proposed study with newer-generation scanners could provide further insight into the potential discordance between task-agnostic and task-based FoMs with more modern technologies. Additionally, the U-net–based algorithm was trained to segment tumors on a per-slice basis. As shown by Leung et al. (5), this strategy helps relax the requirement for large amounts of training data and the demand for computational resources. Results from this study motivate expanding the evaluation to 3-dimensional fully automated DL-based algorithms.
As a final remark, the purpose of this study was not to compare DL-based algorithms with conventional computer-aided algorithms. Although we observed that the considered U-net–based algorithm yielded substantially improved performance compared with the conventional algorithms on both the task-agnostic and the task-based metrics, this study does not intend to suggest that DL-based algorithms are preferable over conventional algorithms.
CONCLUSION
Our retrospective analysis with the ACRIN 6668/RTOG 0235 multicenter clinical trial data shows that evaluation of PET segmentation algorithms based on widely used task-agnostic FoMs might lead to findings that are discordant with evaluation on clinically relevant quantitative tasks. These findings emphasize the important need for objective task-based evaluation of image segmentation algorithms for quantitative PET.
DISCLOSURE
This work was supported by the National Institute of Biomedical Imaging and Bioengineering through R01-EB031051, R01-EB031962, R56-EB028287, and R21-EB024647 (Trailblazer Award). No other potential conflict of interest relevant to this article was reported.
KEY POINTS
QUESTION: Are widely used metrics such as the DSC, JSC, and HD sufficient to evaluate image segmentation algorithms for their clinical applications?
PERTINENT FINDINGS: Our retrospective analysis of the ACRIN 6668/RTOG 0235 multicenter clinical trial data shows that evaluating PET segmentation algorithms on the basis of the DSC, JSC, and HD FoMs could lead to interpretations that are discordant with evaluation on the clinically relevant quantitative tasks of estimating the MTV and TLG of primary tumors in patients with non–small cell lung cancer.
IMPLICATIONS FOR PATIENT CARE: Objective task-based evaluation of new and existing image segmentation algorithms is important for their clinical application.
Footnotes
Published online Feb. 15, 2024.
- © 2024 by the Society of Nuclear Medicine and Molecular Imaging.
Immediate Open Access: Creative Commons Attribution 4.0 International License (CC BY) allows users to share and adapt with attribution, excluding materials credited to previous publications. License: https://creativecommons.org/licenses/by/4.0/. Details: http://aaa161.com/site/misc/permission.xhtml.
REFERENCES
- Received for publication May 12, 2023.
- Revision received December 19, 2023.