The Risk of Hallucinations in AI-Generated Calculations
Abstract
This paper examines the technical limitations of large language models (LLMs) such as Gemini and GPT when deployed in business-critical applications requiring numerical calculations. We identify two fundamental risks: hallucinations (plausible but false outputs occurring 17–33% of the time even with retrieval augmentation) and non-determinism (accuracy variations of up to 15% across identical inputs). Google's official documentation confirms that deterministic outputs cannot be guaranteed from prompts, even with optimal parameter settings. Research further shows that, in some scenarios, 80% of repeated identical requests produce unique outputs, making LLMs unsuitable for precise financial or mathematical calculations where consistency and accuracy are essential. These limitations create potential legal liabilities and business risks, and they argue for human oversight rather than autonomous operation in high-stakes scenarios.
As companies incorporate generative AI into customer-facing tools that perform calculations, the question of reliability becomes critical. Large language models (LLMs) such as Gemini or GPT offer rapid customization and natural language interfaces, but their tendency to "**hallucinate**" (generate plausible yet false outputs) presents a serious challenge in business contexts where numerical accuracy is essential. Google CEO Sundar Pichai acknowledged this limitation directly, stating that "no one in the field has yet solved the hallucination problems. All models do have this as an issue" (CBS News, 2023). This is not a minor flaw but a structural limitation: a false answer from an LLM can appear "perfectly credible," making it "challenging for users" to distinguish fact from fiction; some experts have called this a "critical limitation" when accuracy is paramount (BizTech Magazine, 2025).
A recent study by Stanford University on legal AI tools emphasized this risk. The authors found that even when such systems are augmented with retrieval mechanisms designed to anchor answers in trusted sources, they still hallucinated 17 to 33% of the time. The researchers concluded bluntly that current AI legal tools remain "prone to 'hallucinate'... making their use risky in high-stakes domains" (Zhang et al., 2025). These hallucinations present more than technical inconveniences; they represent significant "**business risks** for enterprises and their customers," including reputational damage, erosion of trust, and legal liability if customers act on incorrect information (BizTech Magazine, 2025).
For example, imagine an AI system that performs financial calculations and overstates projected savings by a factor of ten, claiming a project will save $500,000 when in reality it would only save $50,000. If a client makes a decision based on this error, the vendor could be accused of misrepresentation. As legal analysts at Tech and Media Law point out, hallucinations should no longer be treated as "harmless model quirks" but instead as "**potential liabilities**" that must be proactively managed (Tech and Media Law, 2025). Venture capital investors now ask AI startups how they are mitigating hallucination risk, recognizing that an uncorrected hallucination problem renders the business model fragile (Tech and Media Law, 2025). Moreover, once users "begin to rely on AI-generated content to make decisions, the legal landscape changes," and companies must be prepared to assume responsibility for what their AI systems produce (Tech and Media Law, 2025).
Beyond hallucinations, LLMs face another fundamental limitation: the inability to deliver deterministic outputs. Google's official documentation explicitly states that even with temperature set to zero and fixed seed values, Gemini produces only "**mostly deterministic**" responses with "a small amount of variation still possible" (Google Cloud, 2025). Recent peer-reviewed research quantified this problem empirically, testing multiple LLMs including GPT-4 and finding accuracy variations up to 15% across identical inputs at temperature=0 (Kapoor et al., 2024). The researchers discovered that 80% of repeated identical requests produced unique outputs in some scenarios. This non-determinism is not a bug but an intentional design choice: performance optimizations like dynamic batching and continuous batching introduce variability while improving throughput and reducing costs.
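The practical consequence is straightforward to probe. The following minimal sketch assumes the OpenAI Python SDK and an illustrative model name (the same pattern applies to any provider's API): it sends an identical calculation prompt repeatedly at temperature 0 with a fixed seed and counts how many distinct answers come back.

```python
# Minimal consistency probe: identical prompt, temperature 0, fixed seed.
# Assumes the OpenAI Python SDK; model name and prompt are illustrative only.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def measure_consistency(prompt: str, trials: int = 20) -> None:
    outputs = []
    for _ in range(trials):
        response = client.chat.completions.create(
            model="gpt-4o",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0,   # suppress sampling randomness as far as the API allows
            seed=42,         # best-effort reproducibility; not a guarantee
        )
        outputs.append(response.choices[0].message.content)
    counts = Counter(outputs)
    print(f"{trials} identical requests produced {len(counts)} distinct outputs")
    for text, n in counts.most_common(3):
        print(f"  {n:2d}x  {text[:60]!r}")


measure_consistency("A project costs $1.2M and saves $50,000 per year. "
                    "What is the payback period in years?")
```

If responses were truly deterministic, the probe would report a single distinct output; observing several across runs is consistent with the variability quantified in the research above.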
For AI-generated calculations, this non-determinism compounds the hallucination risk. A user could input identical parameters multiple times and receive significantly different numerical results, not because of randomness settings, but due to floating-point arithmetic variations, dynamic server load affecting batch sizes, and GPU kernel implementation details. This means the same customer could submit the same calculation request and see their results change from $300,000 to $345,000 to $285,000 without any input modifications. Such inconsistency would be immediately visible to users and could undermine trust in the entire platform. As researchers testing this phenomenon concluded, "determinism in production LLM systems is **impossible to guarantee** rather than difficult to achieve" (Kapoor et al., 2024).
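One contributing factor can be illustrated in a few lines: floating-point addition is not associative, so when server-side batching or GPU kernel choices change the order in which partial sums are reduced, the final numbers can differ even though every individual operation is correct. This toy example shows only the arithmetic property, not the batching machinery itself.

```python
# Floating-point addition is not associative: summing the same values in a
# different order can change the result, which is one reason identical
# requests can yield slightly different numbers under varying batch sizes.
a, b, c = 1e16, -1e16, 1.0

print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0  (the 1.0 is absorbed when added to -1e16 first)
```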
For this reason, organizations like the State Bar of Wisconsin advise that generative AI should be treated like a junior assistant, helpful for producing a first draft, but never to be trusted without human oversight (Wisconsin State Bar, 2025). In short, while generative AI may enhance speed and personalization, it is not yet reliable enough to operate independently in high-stakes scenarios involving numerical calculations or financial projections. As Pichai reminds us, we must approach this technology "with humility" (CBS News, 2023).
Conclusions
The evidence presented in this paper demonstrates that current large language models possess fundamental limitations that make them unsuitable for autonomous deployment in business-critical calculation tasks. The combination of hallucination rates reaching 33% and the impossibility of guaranteeing deterministic outputs creates an unacceptable risk profile for applications where numerical accuracy is paramount.
Three key findings emerge from our analysis. First, hallucinations are not occasional errors but structural limitations inherent to LLM architecture, as acknowledged by leading technology executives and confirmed by peer-reviewed research. Second, non-determinism in LLM outputs is not a configuration issue but a consequence of intentional design choices that prioritize performance over consistency. Third, these limitations translate directly into legal and business risks that extend beyond technical considerations into the realm of corporate liability and regulatory compliance.
Organizations deploying AI systems for calculations must implement robust safeguards. At minimum, this requires human verification of all AI-generated numerical outputs, clear disclosure to end-users about the probabilistic nature of LLM-generated results, and comprehensive error-detection mechanisms that can identify when outputs deviate from acceptable parameters. The alternative—deploying these systems without oversight—exposes organizations to reputational damage, loss of customer trust, and potential legal liability for misrepresentation.
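One way to make the error-detection requirement concrete is to never display a model-generated figure directly: the calculation is performed deterministically in ordinary code, and any model draft whose number deviates beyond a tolerance is escalated to human review. The sketch below illustrates the idea with hypothetical function names and a deliberately simplified savings formula; it is not a specific product's API.

```python
# Sketch of a numeric guardrail (hypothetical names, simplified formula):
# the figure shown to users always comes from deterministic code, and an
# LLM draft that disagrees with it is escalated instead of displayed.

def deterministic_savings(annual_savings: float, years: int) -> float:
    """Ground-truth calculation performed in ordinary code, not by the model."""
    return annual_savings * years


def validate_llm_figure(llm_figure: float, annual_savings: float, years: int,
                        tolerance: float = 0.01) -> float:
    expected = deterministic_savings(annual_savings, years)
    if expected == 0 or abs(llm_figure - expected) / abs(expected) > tolerance:
        # Deviation beyond tolerance: route to human review rather than the user.
        raise ValueError(
            f"LLM figure ${llm_figure:,.0f} deviates from computed ${expected:,.0f}"
        )
    return expected  # display the deterministically computed value


try:
    validate_llm_figure(500_000, annual_savings=10_000, years=5)
except ValueError as err:
    print(err)  # flags the tenfold overstatement from the earlier example
```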
The path forward requires acknowledging these limitations while leveraging AI's strengths. LLMs excel at natural language understanding, rapid prototyping, and generating initial drafts that human experts can refine. They fail at tasks requiring absolute precision, consistency, and verifiable accuracy. Until fundamental advances address the hallucination and non-determinism problems, the appropriate role for generative AI in calculation-intensive applications remains as an assistive tool operating under human supervision, not as an autonomous decision-making system.