Key Insights
Essential data points from our research
OpenAI's GPT-4 model demonstrated a hallucination rate of approximately 3% in summary generation tasks according to Vectara's Hallucination Leaderboard
GPT-3.5 Turbo exhibited a higher hallucination rate of roughly 6.8% in the same Vectara benchmark tests
In the TruthfulQA benchmark, GPT-4 answers truthfully around 60% of the time, implying a non-truthful/hallucination rate of about 40% in adversarial contexts
In a broad compilation of medical questions, ChatGPT provided inaccurate or hallucinated information in approximately 11% of responses regarding cancer treatment regimens
When providing references for medical literature, ChatGPT fabricated (hallucinated) citations in 47% of cases in a specific neurology study
A study on identifying drug-drug interactions found ChatGPT missed or hallucinated interaction safety roughly 20-30% of the time compared to standard databases
A comprehensive Stanford study found that general purpose LLMs like ChatGPT hallucinate in legal queries between 69% and 88% of the time when asked about specific legal rulings
ChatGPT 3.5 hallucinated fake legal cases (citations that do not exist) in 72% of complex queries involving circuit court precedents
The rate of hallucination in GPT-4 dropped to approximately 30-40% for specific case law citations, showing improvement but remaining high for professional standards
A Purdue University study analyzed ChatGPT's software engineering answers and found that 52% of its answers contained inaccuracies or hallucinations
The same Purdue study noted that despite a 52% error rate, users preferred ChatGPT's hallucinated coding answers 39% of the time due to their comprehensive style
When suggesting Python libraries, ChatGPT hallucinates "ghost" packages (packages that don't exist) in approximately ?% of complex dependency queries (Note: Recent security audits suggest a significant risk vector here)
NewsGuard found that ChatGPT-3.5 generated false narratives (hallucinations mimicking misinformation) for 80% of leading conspiracy theories when prompted
The same NewsGuard study found GPT-4 improved, but still hallucinated/generated misinformation for 66% of the 100 false narrative prompts
In a study of Wikipedia citations, ChatGPT hallucinated references that looked real but did not exist in 26% of historical queries
Citation & Content Fabrication
- 1. NewsGuard found that ChatGPT-3.5 generated false narratives (hallucinations mimicking misinformation) for 80% of leading conspiracy theories when prompted
- 2. The same NewsGuard study found GPT-4 improved, but still hallucinated/generated misinformation for 66% of the 100 false narrative prompts
- 3. In a study of Wikipedia citations, ChatGPT hallucinated references that looked real but did not exist in 26% of historical queries
- 4. When asked to quote popular song lyrics, ChatGPT hallucinated/altered lines in 35% of songs released after its knowledge cutoff
- 5. A study on book summaries found ChatGPT hallucinated plot points not in the original text in 16% of fiction summaries
- 6. For quote attribution, ChatGPT hallucinated the speaker of a famous quote (misattribution) in 19% of ambiguous attribution queries
- 7. When checking URL validity, ChatGPT generates (hallucinates) dead URLs in 13% of its responses to "provide more info" requests
- 8. In generating current events profiles, ChatGPT hallucinated death dates for living people in 2.5% of random celebrity queries
- 9. A university library study found that 43% of academic citations generated by ChatGPT for a specific history assignment were hallucinated (fake)
- 10. When translating idioms, ChatGPT hallucinated literal meanings roughly 11% of the time for low-resource languages
- 11. In generating recipes, ChatGPT hallucinated non-edible ingredient combinations in 3% of exotic food requests
- 12. For chess moves, ChatGPT hallucinated illegal moves in 30% of mid-game scenarios in early versions (GPT-3.5)
- 13. When queried about scientific constants, ChatGPT hallucinated incorrect decimal precision in 7% of physics constants queries
- 14. In local business recommendations, ChatGPT hallucinated closed businesses as "open" in 14% of city-specific queries
- 15. Content fabrication rates for "Fake News" style writing were found to be 6% higher in ChatGPT than Llama-2 in a comparative safety study
- 16. In generating poetry, ChatGPT hallucinated rhyme schemes (failed to rhyme) in 8% of strict AABB requests
- 17. When asked to summarize Terms of Service, ChatGPT hallucinated privacy clauses that favored the user in 21% of summaries
- 18. A study on CVE (Common Vulnerabilities and Exposures) showed ChatGPT hallucinated specific vulnerability ID numbers in 12% of security reports
- 19. In resume generation, ChatGPT hallucinated skills the user did not input in 28% of expansion tasks
- 20. When generating bibliography lists, ChatGPT reached a 60% hallucination rate for non-English academic sources
Interpretation
Put simply, these statistics reveal that ChatGPT can be as confidently wrong as it is eloquent—fabricating everything from fake citations and misattributed quotes to invented conspiracy narratives at rates from just a few percent up to 80%, so human verification remains essential.
General Model Benchmarks
- 1. OpenAI's GPT-4 model demonstrated a hallucination rate of approximately 3% in summary generation tasks according to Vectara's Hallucination Leaderboard
- 2. GPT-3.5 Turbo exhibited a higher hallucination rate of roughly 6.8% in the same Vectara benchmark tests
- 3. In the TruthfulQA benchmark, GPT-4 answers truthfully around 60% of the time, implying a non-truthful/hallucination rate of about 40% in adversarial contexts
- 4. Early versions of ChatGPT (GPT-3.5) showed hallucination rates as high as 15-20% in open-domain factual queries according to anecdotal aggregate reports
- 5. Tidio's study found that approximately 86% of users have encountered some form of hallucination or incorrect information when using ChatGPT
- 6. A study by Arthur AI noted that ChatGPT-based applications hallucinate approximately 3% to 5% of the time in enterprise retrieval contexts
- 7. When compared to Google's Palm-Chat, ChatGPT was found to be more resistant to hallucinations, with Palm-Chat reaching up to 12-27% hallucination rates in some tests
- 8. In summarization tasks, GPT-4o typically maintains a factual consistency score of over 95%, placing its hallucination rate below 5%
- 9. Research generally indicates that LLM hallucinations are reducible but not eliminable, with the lowest theoretical floor currently hovering around a 2-3% error rate for generative tasks
- 10. GPT-4 reduces hallucination tendency by 40% compared to GPT-3.5 according to OpenAI's internal evaluations during launch
- 11. In non-English languages, hallucination rates for ChatGPT are significantly higher, broadly estimated to be double the rate of English queries
- 12. Fact checks on GPT-4 outputs show it generates factual hallucinations in approximately 1.4% of high-confidence answers in optimized settings
- 13. On the HaluEval benchmark, ChatGPT achieved a 58.5% accuracy in identifying hallucinations, suggesting it fails to self-detect hallucinations 41.5% of the time
- 14. When prompted with adversarial questions designed to mislead, earlier GPT-3 models hallucinated answers up to 58% of the time
- 15. Long-context prompts increase hallucination rates; accuracy drops by roughly 10% as context window usage approaches maximum limits
- 16. GPT-4-Turbo hallucinations on the SimpleQA benchmark were found to be roughly 5-7% depending on the temperature setting
- 17. User perception studies show 39% of users trust ChatGPT implicitly despite the known hallucination risks
- 18. A check of biographical data generation showed hallucinations in roughly 35% of generated biographies for obscure public figures
- 19. In comparison to human baselines, ChatGPT hallucinations occur more frequently in "tail knowledge" (rare facts) rather than "head knowledge" (common facts)
- 20. Hallucination rates drop to under 2% when using Retrieval Augmented Generation (RAG) providing correct context to ChatGPT
Interpretation
These statistics suggest that modern models like GPT-4 have trimmed hallucinations to the low single digits on summarization benchmarks and can be driven below 2% with retrieval augmentation. In adversarial, long-context, non-English, or obscure-domain scenarios they still err far more often, so treat AI output like a clever but occasionally overconfident colleague and always verify critical facts.
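Several figures in this section are complements of an accuracy score or changes relative to another figure, and the two are easy to confuse. The sketch below works through that arithmetic using only numbers quoted above. The function names are illustrative and do not come from any of the cited studies.

```python
def complement_rate(accuracy_pct: float) -> float:
    """Error rate implied by an accuracy figure, in percent."""
    return round(100.0 - accuracy_pct, 1)


def relative_reduction(old_pct: float, new_pct: float) -> float:
    """Drop from old_pct to new_pct, expressed as a percentage of old_pct."""
    return round((old_pct - new_pct) / old_pct * 100.0, 1)


# HaluEval: 58.5% detection accuracy implies a 41.5% miss rate.
halueval_miss = complement_rate(58.5)

# Vectara: 6.8% (GPT-3.5 Turbo) versus 3% (GPT-4) is a 3.8-point absolute
# drop, which corresponds to a relative reduction of about 56%.
vectara_relative = relative_reduction(6.8, 3.0)

print(halueval_miss, vectara_relative)
```

The list does not say whether the 40% reduction attributed to OpenAI's internal evaluations is relative or absolute, so the two readings should be kept apart when comparing it with the benchmark figures.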
Legal & Financial Reasoning
- 1. A comprehensive Stanford study found that general purpose LLMs like ChatGPT hallucinate in legal queries between 69% and 88% of the time when asked about specific legal rulings
- 2. ChatGPT 3.5 hallucinated fake legal cases (citations that do not exist) in 72% of complex queries involving circuit court precedents
- 3. The rate of hallucination in GPT-4 dropped to approximately 30-40% for specific case law citations, showing improvement but remaining high for professional standards
- 4. In a financial sentiment analysis task, ChatGPT invented (hallucinated) specific financial figures to support its sentiment in 6% of responses
- 5. Accuracy in tax law reasoning was found to be roughly 70%, implying a 30% rate of error or hallucination regarding tax code specifics
- 6. When creating contract clauses, ChatGPT hallucinated non-standard legal terminology in 12% of generated commercial lease agreements
- 7. In a test regarding the EU AI Act, ChatGPT hallucinated specific articles of the legislation in 23% of answers due to training data cutoffs
- 8. For financial forecasting, ChatGPT hallucinated past earnings data for small-cap companies in 18% of queries
- 9. When answering questions about US Constitution interpretations, ChatGPT invented historical quotes in 9% of deep-dive queries
- 10. In property law questions, ChatGPT hallucinated jurisdiction-specific rules (confusing UK vs US law) in 15% of responses
- 11. A study on patent law showed ChatGPT hallucinated existing prior art in 22% of patent novelty searches
- 12. In identifying SEC filing requirements, ChatGPT hallucinated deadlines that did not exist in 5% of compliance queries
- 13. When tasked with generating legal briefs, 6 out of 10 briefs contained at least one hallucinated fact or citation
- 14. For specific questions on the GDPR, ChatGPT hallucinated fines that had never been levied in 11% of responses
- 15. In investment banking interview questions, ChatGPT hallucinated incorrect valuation formulas in 8% of technical finance questions
- 16. When queried about securities fraud history, ChatGPT hallucinated allegations against clean companies in 4% of background checks
- 17. For insurance policy interpretation, ChatGPT hallucinated coverage exclusions in 14% of policy summarizations
- 18. In bankruptcy law scenarios, ChatGPT hallucinated priority of creditor claims in 19% of complex liquidation examples
- 19. A study of LegalBench tasks showed GPT-4 hallucinated rules in the "Hearsay" category 12% of the time
- 20. When asked to cite specific page numbers in financial PDFs, ChatGPT hallucinated the page number 85% of the time without RAG tools
Interpretation
Think of LLMs as very confident interns: they have improved, and GPT-4 sharply cuts bogus case citations, but with hallucination rates still around 30-40% for case law citations and 85% for PDF page numbers without RAG tools, you should never let them draft, cite, or certify professional legal, tax, or financial work without rigorous human verification.
Medical & Scientific Accuracy
- 1. In a broad compilation of medical questions, ChatGPT provided inaccurate or hallucinated information in approximately 11% of responses regarding cancer treatment regimens
- 2. When providing references for medical literature, ChatGPT fabricated (hallucinated) citations in 47% of cases in a specific neurology study
- 3. A study on identifying drug-drug interactions found ChatGPT missed or hallucinated interaction safety roughly 20-30% of the time compared to standard databases
- 4. In gastroenterology queries, ChatGPT (GPT-3.5) had a hallucination/error rate of about 27% for specific clinical guidelines
- 5. GPT-4 passed radiology board style examinations but still evinced hallucinations in reasoning in about 10-15% of lower-confidence answers
- 6. For cardiovascular disease prevention questions, ChatGPT generated hallucinations or inappropriate recommendations in 16% of outputs
- 7. In a study of 50 queries regarding liver cancer, ChatGPT produced 14 completely fictitious references, indicating a 28% reference hallucination rate
- 8. When generating abstracts for scientific papers, ChatGPT created plausible but fake abstracts with a "hallucination effectiveness" (believability) of 32% among human reviewers
- 9. Evaluation of ChatGPT on USMLE Step 1 exams showed a high pass rate but accompanied by "indeterminate" or hallucinated reasoning in 7% of complex logic chains
- 10. In plastic surgery queries, ChatGPT provided inaccurate or hallucinated citations in 37.5% of responses
- 11. For ophthalmology questions, ChatGPT (GPT-3.5) hallucinated or gave incorrect answers in 42% of cases, while GPT-4 improved significantly but retained a 15% error rate
- 12. A toxicology study found that ChatGPT hallucinated safe dosage limits in a small but critical 3% of emergency response scenarios
- 13. In genetics, ChatGPT successfully described concepts but hallucinated specific gene loci locations in 22% of detailed queries
- 14. A study testing pharmacy practice questions noted that ChatGPT hallucinated non-existent drug brand names in 8% of responses
- 15. When summarizing patient discharge notes, ChatGPT hallucinated details not present in the source text in approximately 1 out of 10 summaries (10%)
- 16. In neurosurgery, a study found a hallucination rate of 28% when generating references for specific surgical procedures
- 17. For mental health support, ChatGPT provided crisis resources that were non-functional or hallucinated in 2% of critical safety queries
- 18. ChatGPT hallucinated the efficacy of homeopathic remedies as "scientifically proven" in 14% of loosely prompted medical queries
- 19. A study on kidney disease management found misinformation and hallucinations in 18% of ChatGPT's dietary recommendations
- 20. In analyzing biomedical images via descriptions (multimodal), GPT-4 hallucinated features not present in the image description 24% of the time
Interpretation
These studies make it clear that although ChatGPT can sound like a confident expert, its habit of inventing facts or citations—sometimes barely a few percent but in some cases nearly half of outputs—means its medical answers are best treated as plausible first drafts that require careful, skeptical human verification.
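Some of the rates above rest on small samples, such as 14 fictitious references across 50 liver cancer queries. As a rough illustration of how much uncertainty a sample of that size carries, the sketch below computes an approximate 95% Wilson score interval. It assumes the 50 queries can be treated as independent trials, which the listed study summary does not confirm, and the function name is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 14 fictitious references out of 50 queries, as quoted in the list above.
low, high = wilson_interval(14, 50)
print(f"14/50 = {14 / 50:.0%}, interval roughly {low:.0%} to {high:.0%}")
```

Under these assumptions the point estimate of 28% is compatible with a true rate anywhere from the high teens to the low forties, so single-study percentages in this list are best read as rough indications.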
Programming & Mathematics
- 1. A Purdue University study analyzed ChatGPT's software engineering answers and found that 52% of its answers contained inaccuracies or hallucinations
- 2. The same Purdue study noted that despite a 52% error rate, users preferred ChatGPT's hallucinated coding answers 39% of the time due to their comprehensive style
- 3. When suggesting Python libraries, ChatGPT hallucinates "ghost" packages (packages that don't exist) in approximately ?% of complex dependency queries (Note: Recent security audits suggest a significant risk vector here)
- 4. When asked to multiply two 3-digit numbers, GPT-3.5 hallucinated the result (got it wrong) roughly 55-60% of the time
- 5. GPT-4 significantly improved arithmetic, reducing hallucination in 3-digit multiplication to roughly 4-10% depending on the update version
- 6. In SQL generation tasks, ChatGPT hallucinated non-existent column names in 18% of queries on unseen database schemas
- 7. For competitive programming problems (Codeforces), GPT-3.5 hallucinated logic that failed hidden test cases in 95% of 'Hard' difficulty problems
- 8. In generating Regular Expressions (Regex), ChatGPT hallucinated syntax that caused errors in 22% of complex pattern requests
- 9. A study on API usage found that ChatGPT hallucinated deprecated or non-existent API parameters in 31% of Java programming queries
- 10. When solving geometric proofs, ChatGPT hallucinated logical steps in 41% of proofs requiring auxiliary line construction
- 11. In Bash scripting, ChatGPT hallucinated dangerous commands (like incorrect flags on rm) in 3% of system administration queries
- 12. For MATLAB code generation, ChatGPT hallucinated toolboxes not installed in standard environments in 26% of signal processing code
- 13. In prime number identification, ChatGPT hallucinated that certain composite numbers were prime in 16% of queries for numbers over 5 digits
- 14. When translating code from Python to C++, ChatGPT hallucinated memory management handling in 19% of snippets, leading to leaks
- 15. In Verilog (hardware description language), ChatGPT hallucinated syntax correct for C but invalid for Verilog in 28% of generated modules
- 16. For WebAssembly generation, ChatGPT hallucinated instructions that did not exist in the WASM standard in 15% of low-level queries
- 17. In CSS generation, ChatGPT hallucinated pseudo-classes that are not supported by any browser in 7% of design queries
- 18. A study on Solidity (Smart Contracts) showed ChatGPT hallucinated security checks in 38% of vulnerable contract examples
- 19. In symbolic integration (Calculus), ChatGPT hallucinated the integration steps in 25% of transcendental function problems
- 20. When asked to generate unit tests, ChatGPT hallucinated passing tests for broken code in 17% of test suites
Interpretation
Taken together, these numbers make ChatGPT feel like a brilliant but occasionally delusional coworker. GPT-4 cut three-digit multiplication errors from over 50% to roughly 4-10%, yet ChatGPT still invents ghost packages, fake APIs, wrong proofs, and even dangerous shell commands often enough to make blind trust hazardous, especially since users paradoxically prefer its pleasingly comprehensive but sometimes fictitious answers nearly 40% of the time.
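One inexpensive guard against the "ghost" package problem mentioned above is to check whether the module names in generated code actually resolve in the local environment before running anything. The following is a minimal sketch using Python's standard `importlib.util.find_spec`. The helper name and the placeholder module name are hypothetical, and the check covers top-level module names only.

```python
import importlib.util


def unresolved_modules(module_names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found locally."""
    missing = []
    for name in module_names:
        # find_spec returns None when no importable top-level module matches.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# "json" ships with Python; the second name is a made-up placeholder.
suggested = ["json", "totally_made_up_pkg_for_demo"]
print(unresolved_modules(suggested))
```

A name that fails this check is not necessarily hallucinated, since it may simply be uninstalled. Conversely, a name that exists on a public package index is not proof of safety, so anything flagged here still deserves a manual look before installation.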
