Publication

Assessing the quality and reliability of ChatGPT’s responses to radiotherapy-related patient queries: GPT-3.5 versus GPT-4

Abstract(s)

Background: Patients frequently turn to the Internet to access cancer information, yet these websites often lack sufficient accuracy and readability. ChatGPT, an artificial intelligence-powered chatbot, represents a potential paradigm shift in how cancer patients can access vast amounts of medical information. However, because ChatGPT was not explicitly trained for oncology-related inquiries, the quality of the information it provides must still be verified. Evaluating the quality of its responses is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and delay appropriate treatment.

Objective: This study aims to evaluate the quality and reliability of ChatGPT’s responses to standard patient queries about radiotherapy, comparing the performance of GPT-3.5 and GPT-4.

Methods: Forty commonly asked radiotherapy questions were selected and submitted to both versions. Responses were evaluated by six radiotherapy experts using a General Quality Score (GQS), assessed for consistency and similarity using the cosine similarity score, and analyzed for readability using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL). Statistical analysis was performed using the Mann-Whitney test.

Results: GPT-4 demonstrated superior performance, with higher GQS values and, unlike GPT-3.5, no low scores. The Mann-Whitney test revealed statistically significant differences for some questions, with GPT-4 generally receiving higher ratings. The cosine similarity score indicated substantial similarity and consistency between the responses of the two versions. Readability for both versions was at college level, with GPT-4 scoring slightly better on FRES (35.55) and FKGL (12.71) than GPT-3.5 (30.68 and 13.53, respectively); responses from both versions were deemed difficult for the general public to read.

Conclusions: While GPT-4 generates more accurate and reliable responses than GPT-3.5, both models present readability challenges for the public. ChatGPT shows potential as a valuable resource for addressing common patient queries related to radiotherapy, but its limitations, including the risk of misinformation and readability issues, must be acknowledged.
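The metrics named in the Methods (cosine similarity between paired responses, FRES/FKGL readability, and a Mann-Whitney test on expert GQS ratings) can be computed along the lines of the minimal sketch below. This is an illustration, not the authors' analysis code: the use of TF-IDF vectors for cosine similarity and of the scikit-learn, textstat, and scipy libraries is an assumption, and all responses and scores shown are placeholders.

# Minimal sketch of the evaluation metrics described in the abstract (not the authors' code).
# Assumptions: TF-IDF vectors for cosine similarity; scikit-learn, textstat, and scipy
# as third-party libraries; all inputs below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import mannwhitneyu
import textstat

# Hypothetical paired responses to the same radiotherapy question.
gpt35_answer = "Radiotherapy uses high-energy radiation to destroy cancer cells ..."
gpt4_answer = "Radiotherapy treats cancer by delivering precisely targeted doses of radiation ..."

# Consistency/similarity: cosine similarity between TF-IDF vectors of the two responses.
tfidf = TfidfVectorizer().fit_transform([gpt35_answer, gpt4_answer])
similarity = cosine_similarity(tfidf)[0, 1]

# Readability: Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL).
fres_35 = textstat.flesch_reading_ease(gpt35_answer)
fkgl_35 = textstat.flesch_kincaid_grade(gpt35_answer)
fres_4 = textstat.flesch_reading_ease(gpt4_answer)
fkgl_4 = textstat.flesch_kincaid_grade(gpt4_answer)

# Quality: Mann-Whitney U test on expert GQS ratings (illustrative values only).
gqs_gpt35 = [3, 4, 3, 4, 2, 3]
gqs_gpt4 = [5, 4, 5, 4, 5, 4]
u_stat, p_value = mannwhitneyu(gqs_gpt35, gqs_gpt4, alternative="two-sided")

print(f"cosine similarity = {similarity:.2f}")
print(f"GPT-3.5: FRES = {fres_35:.1f}, FKGL = {fkgl_35:.1f}")
print(f"GPT-4:   FRES = {fres_4:.1f}, FKGL = {fkgl_4:.1f}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")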

Keywords

Radiotherapy; Artificial intelligence; ChatGPT; Large language model; Patient information

Citation

Grilo A, Marques C, Corte-Real M, Carolino E, Caetano M. Assessing the quality and reliability of ChatGPT’s responses to radiotherapy-related patient queries: GPT-3.5 versus GPT-4. JMIR Preprints. 2024 Jun 27:63677.
