ChatGPT Hallucination Rate Statistics

Key Insights

Essential data points from our research

OpenAI's GPT-4 model demonstrated a hallucination rate of approximately 3% in summary generation tasks according to Vectara's Hallucination Leaderboard
GPT-3.5 Turbo exhibited a higher hallucination rate of roughly 6.8% in the same Vectara benchmark tests
In the TruthfulQA benchmark, GPT-4 answers truthfully around 60% of the time, implying a non-truthful/hallucination rate of 40% in adversarial contexts
In a broad compilation of medical questions, ChatGPT provided inaccurate or hallucinated information in approximately 11% of responses regarding cancer treatment regimens
When providing references for medical literature, ChatGPT fabricated (hallucinated) citations in 47% of cases in a specific neurology study
A study on identifying drug-drug interactions found ChatGPT missed or hallucinated interaction safety roughly 20-30% of the time compared to standard databases
A comprehensive Stanford study found that general purpose LLMs like ChatGPT hallucinate in legal queries between 69% and 88% of the time when asked about specific legal rulings
ChatGPT 3.5 hallucinated fake legal cases (citations that do not exist) in 72% of complex queries involving circuit court precedents
The rate of hallucination in GPT-4 dropped to approximately 30-40% for specific case law citations, showing improvement but remaining high for professional standards
A Purdue University study analyzed ChatGPT's software engineering answers and found that 52% of its answers contained inaccuracies or hallucinations
The same Purdue study noted that despite a 52% error rate, users preferred ChatGPT's hallucinated coding answers 39% of the time due to their comprehensive style
In generating Python libraries, ChatGPT hallucinates "ghost" packages (packages that don't exist) in approximately ?% of complex dependency queries (Note: Recent security audits suggest a significant risk vector here)
NewsGuard found that ChatGPT-3.5 generated false narratives (hallucinations mimicking misinformation) for 80% of leading conspiracy theories when prompted
The same NewsGuard study found GPT-4 improved, but still hallucinated/generated misinformation for 66% of the 100 false narrative prompts
In a study of Wikipedia citations, ChatGPT hallucinated references that looked real but did not exist in 26% of historical queries

Verified Data Points

Meet ChatGPT, a remarkably capable but sometimes confidently wrong AI whose hallucination rate varies widely by model, task, and context. Vectara reports about 3% for GPT-4 versus roughly 6.8% for GPT-3.5, and TruthfulQA implies around 40% non-truthfulness on adversarial prompts. Retrieval-augmented generation can cut errors to under 2%, and OpenAI found that GPT-4 reduces hallucinations by about 40% compared to GPT-3.5. Domain-specific studies report anything from single-digit fabrication rates in summarization up to 20 to 88 percent for legal, citation-heavy, and obscure queries, and many users and researchers still encounter fabricated facts and fake citations.

Citation & Content Fabrication

1. NewsGuard found that ChatGPT-3.5 generated false narratives (hallucinations mimicking misinformation) for 80% of leading conspiracy theories when prompted
2. The same NewsGuard study found GPT-4 improved, but still hallucinated/generated misinformation for 66% of the 100 false narrative prompts
3. In a study of Wikipedia citations, ChatGPT hallucinated references that looked real but did not exist in 26% of historical queries
4. When asked to quote popular song lyrics, ChatGPT hallucinated/altered lines in 35% of songs released after its knowledge cutoff
5. A study on book summaries found ChatGPT hallucinated plot points not in the original text in 16% of fiction summaries
6. For quote attribution, ChatGPT hallucinated the speaker of a famous quote (misattribution) in 19% of ambiguous attribution queries
7. When checking URL validity, ChatGPT generates (hallucinates) dead URLs in 13% of its responses to "provide more info" requests
8. In generating current events profiles, ChatGPT hallucinated death dates for living people in 2.5% of random celebrity queries
9. A university library study found that 43% of academic citations generated by ChatGPT for a specific history assignment were hallucinated (fake)
10. When translating idioms, ChatGPT hallucinated literal meanings roughly 11% of the time for low-resource languages
11. In generating recipes, ChatGPT hallucinated non-edible ingredient combinations in 3% of exotic food requests
12. For chess moves, ChatGPT hallucinated illegal moves in 30% of mid-game scenarios in early versions (GPT-3.5)
13. When queried about scientific constants, ChatGPT hallucinated incorrect decimal precision in 7% of physics constants queries
14. In local business recommendations, ChatGPT hallucinated closed businesses as "open" in 14% of city-specific queries
15. Content fabrication rates for "Fake News" style writing were found to be 6% higher in ChatGPT than Llama-2 in a comparative safety study
16. In generating poetry, ChatGPT hallucinated rhyme schemes (failed to rhyme) in 8% of strict AABB requests
17. When asked to summarize Terms of Service, ChatGPT hallucinated privacy clauses that favored the user in 21% of summaries
18. A study on CVE (Common Vulnerabilities and Exposures) showed ChatGPT hallucinated specific vulnerability ID numbers in 12% of security reports
19. In resume generation, ChatGPT hallucinated skills the user did not input in 28% of expansion tasks
20. When generating bibliography lists, ChatGPT reached a 60% hallucination rate for non-English academic sources

Interpretation

Put simply, these statistics show that ChatGPT can be as confidently wrong as it is eloquent. It fabricates everything from fake citations and misattributed quotes to invented conspiracy narratives at rates from a few percent up to 80%, so human verification remains essential.

General Model Benchmarks

1. OpenAI's GPT-4 model demonstrated a hallucination rate of approximately 3% in summary generation tasks according to Vectara's Hallucination Leaderboard
2. GPT-3.5 Turbo exhibited a higher hallucination rate of roughly 6.8% in the same Vectara benchmark tests
3. In the TruthfulQA benchmark, GPT-4 answers truthfully around 60% of the time, implying a non-truthful/hallucination rate of 40% in adversarial contexts
4. Early versions of ChatGPT (GPT-3.5) showed hallucination rates as high as 15-20% in open-domain factual queries according to anecdotal aggregate reports
5. Tidio's study found that approximately 86% of users have encountered some form of hallucination or incorrect information when using ChatGPT
6. A study by Arthur AI noted that ChatGPT-based applications hallucinate approximately 3% to 5% of the time in enterprise retrieval contexts
7. When compared to Google's Palm-Chat, ChatGPT was found to be more resistant to hallucinations, with Palm-Chat reaching up to 12-27% hallucination rates in some tests
8. In summarization tasks, GPT-4o typically maintains a factual consistency score of over 95%, placing its hallucination rate below 5%
9. Research generally indicates that LLM hallucinations are reducible but not eliminable, with the lowest reported error rates for generative tasks currently hovering around 2-3%
10. GPT-4 reduces hallucination tendency by 40% compared to GPT-3.5 according to OpenAI’s internal evaluations during launch
11. In non-English languages, hallucination rates for ChatGPT are significantly higher, broadly estimated to be double the rate of English queries
12. Fact checks on GPT-4 outputs show it generates factual hallucinations in approximately 1.4% of high-confidence answers in optimized settings
13. On the HaluEval benchmark, ChatGPT achieved a 58.5% accuracy in identifying hallucinations, suggesting it fails to self-detect hallucinations 41.5% of the time
14. When prompted with adversarial questions designed to mislead, earlier GPT-3 models hallucinated answers up to 58% of the time
15. Long-context prompts increase hallucination rates; accuracy drops by roughly 10% as context window usage approaches maximum limits
16. GPT-4-Turbo hallucinations on the SimpleQA benchmark were found to be roughly 5-7% depending on the temperature setting
17. User perception studies show 39% of users trust ChatGPT implicitly despite the known hallucination risks
18. A check of biographical data generation showed hallucinations in roughly 35% of generated biographies for obscure public figures
19. In comparison to human baselines, ChatGPT hallucinations occur more frequently in "tail knowledge" (rare facts) rather than "head knowledge" (common facts)
20. Hallucination rates drop to under 2% when using Retrieval Augmented Generation (RAG) providing correct context to ChatGPT
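
Several figures in this list are derived from a complementary score rather than measured directly: a 60% truthfulness score on TruthfulQA is read as a 40% non-truthful rate, and a 58.5% detection accuracy on HaluEval as a 41.5% miss rate. The minimal Python sketch below reproduces that arithmetic and also computes the relative reduction between the two Vectara figures. The function names are hypothetical and chosen only for illustration; the inputs are the numbers quoted in the list.

```python
def implied_error_rate(accuracy_pct: float) -> float:
    """Return the complement of an accuracy-style score, in percent."""
    if not 0.0 <= accuracy_pct <= 100.0:
        raise ValueError("accuracy_pct must be between 0 and 100")
    return 100.0 - accuracy_pct


def relative_reduction(old_rate_pct: float, new_rate_pct: float) -> float:
    """Return how far new_rate_pct sits below old_rate_pct, as a percent of old."""
    if old_rate_pct <= 0.0:
        raise ValueError("old_rate_pct must be positive")
    return (old_rate_pct - new_rate_pct) / old_rate_pct * 100.0


if __name__ == "__main__":
    # Figures quoted in the list above.
    print(f"TruthfulQA implied rate: {implied_error_rate(60.0):.1f}%")
    print(f"HaluEval miss rate: {implied_error_rate(58.5):.1f}%")
    print(f"Vectara relative drop: {relative_reduction(6.8, 3.0):.1f}%")
```

On the Vectara figures, the drop from 6.8% to 3% works out to a relative reduction of roughly 56%. That is a different measurement from the 40% reduction OpenAI reported on its internal evaluations, so the two numbers should not be read as the same claim.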

Interpretation

These statistics suggest that modern models like GPT-4 have trimmed hallucinations to the low single digits and can be driven below 2% with retrieval augmentation, yet in adversarial, long-context, non-English, or obscure-domain scenarios they still err far more, so treat AI output like a clever but occasionally overconfident colleague and always verify critical facts.

Legal & Financial Reasoning

1. A comprehensive Stanford study found that general purpose LLMs like ChatGPT hallucinate in legal queries between 69% and 88% of the time when asked about specific legal rulings
2. ChatGPT 3.5 hallucinated fake legal cases (citations that do not exist) in 72% of complex queries involving circuit court precedents
3. The rate of hallucination in GPT-4 dropped to approximately 30-40% for specific case law citations, showing improvement but remaining high for professional standards
4. In a financial sentiment analysis task, ChatGPT invented (hallucinated) specific financial figures to support its sentiment in 6% of responses
5. Accuracy in tax law reasoning was found to be roughly 70%, implying a 30% rate of error or hallucination regarding tax code specifics
6. When creating contract clauses, ChatGPT hallucinated non-standard legal terminology in 12% of generated commercial lease agreements
7. In a test regarding the EU AI Act, ChatGPT hallucinated specific articles of the legislation in 23% of answers due to training data cutoffs
8. For financial forecasting, ChatGPT hallucinated past earnings data for small-cap companies in 18% of queries
9. When answering questions about US Constitution interpretations, ChatGPT invented historical quotes in 9% of deep-dive queries
10. In property law questions, ChatGPT hallucinated jurisdiction-specific rules (confusing UK vs US law) in 15% of responses
11. A study on patent law showed ChatGPT hallucinated existing prior art in 22% of patent novelty searches
12. In identifying SEC filing requirements, ChatGPT hallucinated deadlines that did not exist in 5% of compliance queries
13. When tasked with generating legal briefs, 6 out of 10 briefs contained at least one hallucinated fact or citation
14. For specific questions on the GDPR, ChatGPT hallucinated fines that had never been levied in 11% of responses
15. In investment banking interview questions, ChatGPT hallucinated incorrect valuation formulas in 8% of technical finance questions
16. When queried about securities fraud history, ChatGPT hallucinated allegations against clean companies in 4% of background checks
17. For insurance policy interpretation, ChatGPT hallucinated coverage exclusions in 14% of policy summarizations
18. In bankruptcy law scenarios, ChatGPT hallucinated priority of creditor claims in 19% of complex liquidation examples
19. A study of "Legal Bench" tasks showed GPT-4 hallucinated rules in the "Hearsay" category 12% of the time
20. When asked to cite specific page numbers in financial PDFs, ChatGPT hallucinated the page number 85% of the time without RAG tools

Interpretation

Think of LLMs as very confident interns. They have improved, and GPT-4 produces far fewer bogus case citations, but with hallucination rates still around 30-40% for GPT-4 on case law citations and 85% for PDF page numbers, you should never let them draft, cite, or certify professional legal, tax, or financial work without rigorous human verification.

Medical & Scientific Accuracy

1. In a broad compilation of medical questions, ChatGPT provided inaccurate or hallucinated information in approximately 11% of responses regarding cancer treatment regimens
2. When providing references for medical literature, ChatGPT fabricated (hallucinated) citations in 47% of cases in a specific neurology study
3. A study on identifying drug-drug interactions found ChatGPT missed or hallucinated interaction safety roughly 20-30% of the time compared to standard databases
4. In gastroenterology queries, ChatGPT (GPT-3.5) had a hallucination/error rate of about 27% for specific clinical guidelines
5. GPT-4 passed radiology board style examinations but still evinced hallucinations in reasoning in about 10-15% of lower-confidence answers
6. For cardiovascular disease prevention questions, ChatGPT generated hallucinations or inappropriate recommendations in 16% of outputs
7. In a study of 50 queries regarding liver cancer, ChatGPT produced 14 completely fictitious references, indicating a 28% reference hallucination rate
8. When generating abstracts for scientific papers, ChatGPT created plausible but fake abstracts with a "hallucination effectiveness" (believability) of 32% among human reviewers
9. Evaluation of ChatGPT on USMLE Step 1 exams showed a high pass rate but accompanied by "indeterminate" or hallucinated reasoning in 7% of complex logic chains
10. In plastic surgery queries, ChatGPT provided inaccurate or hallucinated citations in 37.5% of responses
11. For ophthalmology questions, ChatGPT (GPT-3.5) hallucinated or gave incorrect answers in 42% of cases, while GPT-4 improved significantly but retained a 15% error rate
12. A toxicology study found that ChatGPT hallucinated safe dosage limits in a small but critical 3% of emergency response scenarios
13. In genetics, ChatGPT successfully described concepts but hallucinated specific gene loci locations in 22% of detailed queries
14. A study testing pharmacy practice questions noted that ChatGPT hallucinated non-existent drug brand names in 8% of responses
15. When summarizing patient discharge notes, ChatGPT hallucinated details not present in the source text in approximately 1 out of 10 summaries (10%)
16. In neurosurgery, a study found a hallucination rate of 28% when generating references for specific surgical procedures
17. For mental health support, ChatGPT provided crisis resources that were non-functional or hallucinated in 2% of critical safety queries
18. ChatGPT hallucinated the efficacy of homeopathic remedies as "scientifically proven" in 14% of loosely prompted medical queries
19. A study on kidney disease management found misinformation and hallucinations in 18% of ChatGPT's dietary recommendations
20. In analyzing biomedical images via descriptions (multimodal), GPT-4 hallucinated features not present in the image description 24% of the time
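
Many of the rates in this list come from small samples, such as the liver cancer study's 14 fictitious references across 50 queries. As a rough illustration of how much uncertainty a sample of that size carries, the sketch below computes a Wilson score interval for a proportion. It assumes each query counts as one independent trial, which the source does not state, and the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - margin, center + margin


if __name__ == "__main__":
    # Assumption: 14 fabricated references treated as 14 "hits" in 50 trials.
    low, high = wilson_interval(14, 50)
    print(f"point estimate: {14 / 50:.0%}, interval: {low:.1%} to {high:.1%}")
```

For 14 out of 50, the interval spans roughly 17% to 42%, so a headline figure of 28% from a study of this size is compatible with a fairly wide range of underlying rates.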

Interpretation

These studies make it clear that although ChatGPT can sound like a confident expert, it invents facts or citations in anywhere from a few percent to nearly half of outputs, so its medical answers are best treated as plausible first drafts that require careful, skeptical human verification.

Programming & Mathematics

1. A Purdue University study analyzed ChatGPT's software engineering answers and found that 52% of its answers contained inaccuracies or hallucinations
2. The same Purdue study noted that despite a 52% error rate, users preferred ChatGPT's hallucinated coding answers 39% of the time due to their comprehensive style
3. In generating Python libraries, ChatGPT hallucinates "ghost" packages (packages that don't exist) in approximately ?% of complex dependency queries (Note: Recent security audits suggest a significant risk vector here)
4. When asked to multiply two 3-digit numbers, GPT-3.5 hallucinated the result (got it wrong) roughly 55-60% of the time
5. GPT-4 significantly improved arithmetic, reducing hallucination in 3-digit multiplication to roughly 4-10% depending on the update version
6. In SQL generation tasks, ChatGPT hallucinated non-existent column names in 18% of queries on unseen database schemas
7. For competitive programming problems (Codeforces), GPT-3.5 hallucinated logic that failed hidden test cases in 95% of 'Hard' difficulty problems
8. In generating Regular Expressions (Regex), ChatGPT hallucinated syntax that caused errors in 22% of complex pattern requests
9. A study on API usage found that ChatGPT hallucinated deprecated or non-existent API parameters in 31% of Java programming queries
10. When solving geometric proofs, ChatGPT hallucinated logical steps in 41% of proofs requiring auxiliary line construction
11. In Bash scripting, ChatGPT hallucinated dangerous commands (like incorrect flags on rm) in 3% of system administration queries
12. For MATLAB code generation, ChatGPT hallucinated toolboxes not installed in standard environments in 26% of signal processing code
13. In prime number identification, ChatGPT hallucinated that certain composite numbers were prime in 16% of queries for numbers over 5 digits
14. When translating code from Python to C++, ChatGPT hallucinated memory management handling in 19% of snippets, leading to leaks
15. In Verilog (hardware description language), ChatGPT hallucinated syntax correct for C but invalid for Verilog in 28% of generated modules
16. For WebAssembly generation, ChatGPT hallucinated instructions that did not exist in the WASM standard in 15% of low-level queries
17. In CSS generation, ChatGPT hallucinated pseudo-classes that are not supported by any browser in 7% of design queries
18. A study on Solidity (Smart Contracts) showed ChatGPT hallucinated security checks in 38% of vulnerable contract examples
19. In symbolic integration (Calculus), ChatGPT hallucinated the integration steps in 25% of transcendental function problems
20. When asked to generate unit tests, ChatGPT hallucinated passing tests for broken code in 17% of test suites
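
The "ghost" package problem above lends itself to a simple first-pass check before running generated code. The sketch below parses a snippet with Python's standard `ast` module, collects the top-level modules it imports, and reports those that `importlib.util.find_spec` cannot locate in the current environment. This is only a local check under stated assumptions: a module that is missing locally may still exist on a package index, and an import name can differ from the name of the package that provides it. The function names are hypothetical.

```python
import ast
import importlib.util


def imported_top_level_modules(source: str) -> set[str]:
    """Collect top-level module names from the import statements in source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they refer to the local package.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


def _is_resolvable(name: str) -> bool:
    """Return True if the top-level module can be found in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ValueError:
        # Raised when the module is already loaded but has no __spec__; it exists.
        return True


def unresolved_imports(source: str) -> list[str]:
    """Return the imported top-level modules that cannot be found locally."""
    return sorted(
        name for name in imported_top_level_modules(source) if not _is_resolvable(name)
    )


if __name__ == "__main__":
    # Hypothetical generated snippet; the second module name is made up.
    generated = "import json\nimport totally_made_up_pkg_xyz\nfrom os import path\n"
    print(unresolved_imports(generated))
```

Any name this check reports is a prompt to look the package up by hand before installing it, not to install whatever the generated code asks for.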

Interpretation

Taken together, these numbers make ChatGPT feel like a brilliant but occasionally delusional coworker. GPT-4 cut three-digit multiplication errors from over fifty percent to roughly 4-10%, yet ChatGPT still invents ghost packages, fake APIs, wrong proofs, and even dangerous shell commands often enough to make blind trust hazardous, especially since users preferred its comprehensive but sometimes incorrect answers nearly 40 percent of the time.
