
ChatGPT:
Chatbots as “Stochastic Parrots”: A Critical Exploration by Emily Bender
🌐 Introduction: The Rising Debate on Large Language Models
As AI-powered conversational agents like ChatGPT become embedded in daily life, questions about their ethical, practical, and theoretical implications have intensified. Linguist Emily M. Bender, a professor at the University of Washington and director of its Computational Linguistics Laboratory, is a prominent voice challenging the current trajectory of AI. In the landmark 2021 paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?", co-authored with Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell, Bender argues that large language models (LLMs) mimic language without true understanding, making them "stochastic parrots." In recent discussions, she critiques the technological, ethical, and social implications of large-scale LLMs, which she argues lack real comprehension, reproduce systemic biases, and raise labor and environmental concerns. Bender's insights question whether larger AI models necessarily mean better models and advocate a stakeholder-centered approach to AI development.
🔍 The Foundation of “Stochastic Parrots” and Its Warnings
The Original Warning in Stochastic Parrots
In 2021, Bender and her co-authors warned of the risks associated with the rapid expansion of LLMs. Her primary argument was not a prediction of technological inevitabilities but a warning against continuing down a path that prioritizes ever larger and more resource-intensive models. Bender emphasizes that these models lack any fundamental understanding of language and meaning. As the technology has evolved, these warnings have proven increasingly relevant: developers continue to chase scale and realism in AI systems without addressing foundational shortcomings or ethical issues.
The Exploitative Labor Behind LLMs
One dimension Bender had not fully anticipated in her original paper was the degree to which human labor would be exploited to create and refine these models. As LLMs grew, companies relied on human contractors to annotate data, rate responses, and moderate content—a process essential to training and maintaining these systems but often performed by workers in low-wage positions, sometimes with poor working conditions.
The Hype and Misplaced Enthusiasm
Bender also underestimated the extent of the public’s enthusiasm for AI-generated synthetic text, which in her view often obscures the true nature of these technologies. While the fluency of synthetic text may impress, she argues that it leads to a misplaced belief in the technology’s capabilities, contributing to what she views as an unjustified “hype” around these systems.
🚀 Scaling Models: Does Bigger Mean Better?
The Problem with Scale
Bender questions the prevailing belief that bigger LLMs inherently mean better results, arguing that scale improves a model's ability to mimic human language but does not enhance actual understanding or task-specific utility. She references Claude Shannon's work in the 1940s, which laid the groundwork for probabilistic language modeling used in practical tasks like transcription and spell-checking, but argues that today's LLMs have gone far beyond what such applications require.
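To make the contrast concrete, here is a minimal sketch (in Python, purely for illustration) of the kind of probabilistic language modeling Bender traces back to Shannon: a bigram model that predicts each next word from raw co-occurrence counts. The toy corpus and function names are invented for this example; modern LLMs use neural networks trained on vastly more data, but the underlying objective of predicting a likely next token from form alone is the same.

```python
from collections import defaultdict, Counter
import random

# A tiny bigram model: predict the next word purely from counts of which
# word followed which in the training text (toy corpus, for illustration).
corpus = "the cat sat on the mat and the cat ate the fish".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word(prev):
    """Sample the next word in proportion to how often it followed `prev`."""
    counts = bigram_counts[prev]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

# Generate a short continuation: fluent-looking form, but the model has no
# notion of what a cat or a mat is, only co-occurrence statistics.
word, output = "the", ["the"]
for _ in range(8):
    if word not in bigram_counts:  # dead end: no observed continuation
        break
    word = next_word(word)
    output.append(word)
print(" ".join(output))
```

The output can read as fluent English even though the model has no representation of what any word refers to, which is exactly the form-without-meaning point Bender makes about far larger systems.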
No Clear Evaluations for Specific Tasks
According to Bender, there is a lack of rigorous evaluations that demonstrate significant performance improvements on specific, practical tasks due to LLM scaling. Instead, the primary result of larger models is better mimicry of human-like responses, which she argues serves limited practical purpose without genuine understanding.
🧠 “Stochastic Parrots”: Mimicry Without Comprehension
Linguistic Foundation of “Stochastic Parrots”
Bender draws on linguistic theory to argue that language inherently pairs form and meaning, a concept dating back to Ferdinand de Saussure in the early 20th century. In contrast, LLMs operate only on “form,” producing language patterns without any grasp of meaning or communicative intent. This lack of understanding, she asserts, makes LLMs akin to “stochastic parrots”—repeating patterns without comprehension.
Risks of Mimicking Language
Bender explains that these systems are fundamentally designed to generate “plausible” text sequences rather than accurate or contextually meaningful information. This distinction becomes especially problematic when LLMs are deployed in roles that require reliable information or understanding, as the systems may confidently produce incorrect or misleading statements.
📉 Persistent Inaccuracies and Misinformation
The Design Limitations of LLMs
The statistical basis of LLMs predisposes them to generate errors, which Bender argues are not merely accidental but systemic and inevitable. Since LLMs produce text based on likelihood rather than factual accuracy, they are bound to produce misinformation, especially when asked to go beyond their training data with unique or niche queries.
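A toy illustration of this design limitation, with probabilities invented purely for the sketch rather than measured from any real model: under greedy likelihood-based decoding, the continuation a model prefers is simply whatever its training distribution scores highest, which need not be the true answer.

```python
# Invented next-word probabilities, purely for illustration: they are not
# taken from any real model or corpus.
next_token_probs = {
    "The capital of Australia is": {
        "Sydney": 0.55,    # frequent association in text, factually wrong
        "Canberra": 0.35,  # correct answer
        "Melbourne": 0.10,
    },
}

def most_likely_continuation(prompt: str) -> str:
    """Greedy decoding: return whichever continuation has the highest score."""
    probs = next_token_probs[prompt]
    return max(probs, key=probs.get)

prompt = "The capital of Australia is"
print(prompt, most_likely_continuation(prompt))
# Prints the statistically favoured word, with no check against the facts.
```

Real systems are far more sophisticated, but the decoding objective remains "most plausible continuation," not "verified fact," which is the limitation Bender points to.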
Impact on AI-Powered Tools
AI tools embedded in workplace software or consumer applications are particularly susceptible to spreading misinformation. Bender notes that while such tools might produce text that appears accurate, the actual factual integrity of the content requires verification by human users, effectively shifting the burden back to the user and negating the supposed convenience of the tool.
🌐 Data Bias: A Challenge in Large-Scale Models
Inherent Bias in Training Data
Bender emphasizes that there is no truly unbiased training dataset, especially when LLMs are trained on vast amounts of internet data. The internet includes a multitude of harmful biases, from racism and sexism to ideological distortions, which LLMs absorb and reproduce. Bender argues that the lack of transparency around LLM training data further compounds this problem, as users and even developers often lack detailed knowledge of the model’s sources.
Problems with Filtering and Curation
Attempts to filter harmful content from LLM datasets often fall short. Bender notes that early efforts included a list of banned words, many related to sexuality and LGBTQ topics, which unintentionally eliminated valid data while failing to remove problematic content. In her view, LLM development must move toward more thoughtful and curated data selection to minimize harmful biases, but this is challenging under the current trend of indiscriminately expanding data scope.
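A minimal sketch of the kind of block-list filtering Bender criticizes; the word list below is an illustrative stand-in, not the actual list used for any real dataset. Dropping every document that contains a listed word removes legitimate content, such as LGBTQ support resources, while leaving harmful text that avoids the listed words untouched.

```python
import re

# Hypothetical block list, a stand-in for the kind of banned-word filtering
# described above; real lists used in dataset curation are much longer.
BANNED_WORDS = {"sex", "lesbian", "gay"}

def passes_filter(document: str) -> bool:
    """Keep a document only if it contains none of the banned words."""
    tokens = set(re.findall(r"[a-z']+", document.lower()))
    return tokens.isdisjoint(BANNED_WORDS)

docs = [
    "Support resources for lesbian and gay teenagers",   # legitimate, dropped
    "Lecture notes on sex determination in reptiles",    # legitimate, dropped
    "Coded hate speech that avoids every listed word",   # harmful, kept
]
for doc in docs:
    print("kept" if passes_filter(doc) else "dropped", "-", doc)
```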
🔍 LLMs and Search Engines: A Mismatch for Information Retrieval
Contextual Limitations in LLM Search Applications
Bender argues that using LLMs as search tools fundamentally misunderstands how users interact with information. A traditional search engine allows users to see the source and context of results, such as expert articles or user forums. In contrast, LLMs provide synthesized responses detached from their original context, preventing users from assessing credibility or exploring the variety of perspectives that often enrich search results.
Environmental and Social Concerns
Beyond accuracy, Bender raises environmental concerns, noting that the computational demands of LLMs require significant electricity, leading to a substantial carbon footprint. She suggests that the industry must weigh these impacts against the limited benefits that LLMs bring to search and information retrieval.
🧩 Challenges of Machine Translation
Benefits and Risks of AI Translation
While Bender acknowledges that machine translation can break down language barriers, she also cautions against overreliance. Fluent translations can lead users to assume accuracy, but linguistic nuances may be lost, particularly with less common languages. Bender calls for greater transparency about the uncertainty in translation results, emphasizing that machine translation should be used with an awareness of its limitations.
👥 Anthropomorphism and Its Pitfalls
Misleading Terminology
Bender resists terms like “hallucinations” for AI errors, as such words anthropomorphize LLMs by suggesting they have human-like thoughts or intentions. She argues for language that reflects these systems’ actual limitations, urging the use of terms that emphasize functionality over capability, like “error” instead of “hallucination.”
The Problem with “Artificial Intelligence”
Bender critiques the term “artificial intelligence” itself, arguing that it sets up unrealistic expectations. The notion of “intelligence” invoked by AI, she notes, has roots in eugenics; the history of treating intelligence as a measurable quality is controversial and ethically fraught. Bender stresses that these systems are computational tools, not intelligent beings, and should be discussed in terms that clarify rather than mystify their functions.
🌐 Homogenization of Language and Ideas
Cultural and Creative Impact
Bender warns that LLMs could have a homogenizing effect on language and ideas, as they inherently average responses based on patterns in data, potentially stifling creativity. While LLMs may limit exposure to diverse thought, Bender hopes that users will seek out genuine human creativity in response to the prevalence of synthetic text.
🔄 A Purpose-Driven AI Development Model
Building AI Based on Stakeholders’ Needs
Bender proposes an approach to AI development that begins with a clear understanding of end-user needs. She argues that if AI systems were developed with a focus on practical utility and accountability, as opposed to scaling LLMs, more effective and ethical solutions would emerge. For example, in cases like the New York City information access system, AI tools should be developed with input from stakeholders, such as renters, landlords, and government officials, to ensure accurate and lawful information dissemination.
🔍 Research Directions Beyond LLM Scaling
Alternative Pathways in AI Research
Bender encourages researchers to pursue approaches beyond LLM scaling. She highlights the importance of models that emphasize semantic understanding, focusing on how words connect to meaning and intention. These alternative research pathways, she argues, are being overlooked in the rush to scale LLMs, yet they may ultimately lead to more reliable and ethically sound AI applications.
🌌 Rejecting the “Superintelligence” Narrative
Critique of AI Doomsayers and Accelerationists
Finally, Bender addresses a recent split in the AI community between those advocating for rapid AI advancement and those fearing its existential risks. She rejects the underlying assumption of both camps that LLMs could lead to artificial general intelligence. Instead, she views the current fascination with superintelligence as misguided, with both sides exaggerating the technology’s potential. Bender warns that focusing on AI “doom” or “acceleration” detracts attention from the concrete, present-day harms these systems already cause.
FAQs
Q: What does Emily Bender mean by calling language models “stochastic parrots”?
A: Bender uses the term “stochastic parrots” to highlight that language models like ChatGPT generate text based on statistical patterns in data rather than true understanding. They can mimic human-like language but lack the ability to comprehend meaning, context, or intent, thus “parroting” without any genuine awareness.
Q: Why does Bender criticize the scaling of large language models (LLMs)?
A: Bender argues that scaling LLMs only makes them better at generating human-like text, not at understanding it. She believes that larger models don’t necessarily lead to more useful or accurate AI and that scaling comes with risks like misinformation, bias, environmental impact, and exploitative labor practices.
Q: How does Bender view the impact of LLMs on labor practices?
A: Bender highlights that the development of LLMs relies on exploitative labor practices. Human contractors are often employed in low-wage positions to annotate data, filter content, and rate answers—essential tasks for building and maintaining these models.
Q: What are Bender’s concerns regarding LLMs and data bias?
A: Bender argues that LLMs inherently carry biases from their training data, which often includes vast amounts of internet text containing stereotypes, racism, sexism, and other biases. Filtering or curating data can help, but she notes that these efforts are often insufficient, leaving LLMs vulnerable to reproducing harmful content.
Q: Why does Bender argue that LLMs are unsuitable for search engines?
A: Bender believes that LLMs, when used in search engines, distort how users interact with information. They create synthesized answers detached from original sources, removing context and reliability. This prevents users from understanding where information comes from and assessing its credibility.
Q: Does Bender believe LLMs could become “superintelligent”?
A: No, Bender rejects the notion that LLMs or similar AI could lead to “artificial general intelligence” (AGI) or superintelligence. She argues that LLMs are fundamentally limited by their design, which focuses on pattern-matching rather than genuine intelligence or comprehension.
Q: What does Bender suggest as a better approach to developing AI systems?
A: Bender advocates a purpose-driven development model that centers on understanding the needs of stakeholders, such as end-users and affected communities, rather than scaling LLMs. She believes that AI should be tailored to specific, practical tasks with clear accountability rather than aiming for general-purpose language mimicry.
Q: How does Bender view machine translation and its potential risks?
A: While Bender acknowledges the benefits of machine translation in breaking down language barriers, she warns that users might mistakenly trust it to always be accurate. She encourages greater transparency around machine translation’s limitations, especially in languages that are less well-represented in training data.
Q: What role does Bender believe language plays in AI misconceptions?
A: Bender argues that using terms like “hallucinations” or “intelligence” anthropomorphizes AI, leading people to attribute human-like qualities to LLMs that they do not possess. She advocates for language that accurately reflects AI’s limitations and functions to avoid misleading assumptions.
Q: How does Bender think LLMs might affect language and creativity?
A: Bender is concerned that LLMs could homogenize language, as they average out responses to mimic general patterns, potentially stifling creativity and diversity in thought. However, she also believes people may push back and seek authentic, non-synthetic language and ideas in response to these trends.

Emily Bender raises several critical points about large language models (LLMs) like ChatGPT, focusing on their lack of true understanding, propensity for factual inaccuracies, inherent biases, environmental impact, and the dangers of anthropomorphizing AI. While her concerns are valid and highlight important issues in AI development, some criticisms can be made regarding her comments:
1. Underestimating Practical Utility:
• Criticism: Bender questions the utility of LLMs by stating, “I don’t know what that’s for,” when referring to their ability to generate human-like text. This may overlook the practical applications where LLMs have proven beneficial, such as assisting in drafting emails, coding assistance, language translation, and content generation.
• Counterargument: While LLMs may not “understand” content as humans do, they can still perform tasks that save time and resources, enhancing productivity in various industries.
2. Dismissal of Improvement Possibilities:
• Criticism: She asserts that the industry cannot reduce factual errors in LLM outputs to an acceptable level, given their design. This perspective might be seen as pessimistic, not accounting for ongoing research aimed at improving model accuracy through techniques like reinforcement learning from human feedback and better training methodologies.
• Counterargument: Advances in AI safety research are continually addressing these issues, and while perfection may be unattainable, significant improvements can make LLMs more reliable for certain applications.
3. Overemphasis on Negative Aspects:
• Criticism: Bender focuses heavily on the potential harms and downplays the benefits LLMs can offer. For example, in fields like education, healthcare, and customer service, LLMs can provide valuable support when used responsibly.
• Counterargument: A balanced view might acknowledge both the risks and the opportunities, promoting a more nuanced discussion about how to mitigate downsides while harnessing benefits.
4. Anthropomorphism and Language Use:
• Criticism: While she cautions against anthropomorphizing AI, the use of terms like “stochastic parrots” might contribute to misconceptions by oversimplifying complex technologies.
• Counterargument: Such metaphors, while illustrative, may not fully capture the capabilities and potential of LLMs, possibly leading to underestimation of their usefulness and impact.
5. Environmental Impact Perspective:
• Criticism: Bender highlights the environmental costs of training and running LLMs but may not equally consider the environmental benefits of AI applications, such as optimizing energy use in other domains or reducing the need for physical resources through digital solutions.
• Counterargument: A comprehensive assessment would weigh both the environmental costs and the potential for AI to contribute to sustainability efforts elsewhere.
6. Lack of Faith in Mitigation Strategies:
• Criticism: She seems skeptical about the effectiveness of strategies to reduce bias and misinformation in AI models. This might discount the efforts and progress made in AI ethics, bias mitigation, and transparency.
• Counterargument: Continued interdisciplinary research is making strides in addressing these issues, and completely dismissing their potential may hinder constructive developments.
7. Rejection of AI’s Potential for Understanding:
• Criticism: Bender asserts that LLMs cannot and will not achieve understanding or meaning. Some may argue this stance is too rigid, ignoring the possibility that future models could develop more advanced forms of machine understanding through novel architectures or integration with other AI systems.
• Counterargument: The field of AI is rapidly evolving, and what is not possible today might become feasible with new innovations.
8. Alternative Approaches Not Fully Explored:
• Criticism: While advocating for stakeholder-driven AI development, Bender does not delve deeply into how scalable and practical these alternatives are compared to LLMs in meeting diverse needs across industries.
• Counterargument: Large-scale models offer solutions that are difficult to replicate with smaller, purpose-built systems, especially when dealing with the complexity and variability of human language.
9. Dismissal of AGI Discussions:
• Criticism: Bender dismisses the possibility and desirability of artificial general intelligence (AGI), potentially overlooking valuable discourse on the future of AI and its long-term implications.
• Counterargument: Engaging with AGI concepts can help prepare for and shape future technological trajectories, ensuring ethical considerations are integrated from the outset.
10. Impact on Innovation:
• Criticism: Her critical stance might discourage exploration and investment in AI research, potentially slowing down beneficial innovations that could arise from continued work with LLMs.
• Counterargument: Responsible innovation requires balancing caution with openness to new possibilities, fostering an environment where ethical AI development can thrive.
In summary, while Emily Bender brings important issues to light regarding the development and use of large language models, some may find her perspective too cautious or dismissive of the potential benefits and ongoing efforts to address the challenges she mentions. A balanced approach might involve acknowledging both the limitations and the valuable contributions of LLMs, promoting responsible development that leverages their strengths while actively mitigating risks.
Definition of “Stakeholder” in Large Language Models (LLMs)
In the context of Large Language Models (LLMs), a stakeholder refers to any individual, group, organization, or entity that has an interest in, is affected by, or can influence the development, deployment, regulation, or use of these AI systems. Stakeholders are integral to the AI ecosystem, as their needs, concerns, and actions can significantly impact the direction and impact of LLM technologies.
Categories of Stakeholders in LLMs:
1. Developers and Researchers:
• Role: They design, build, train, and refine LLMs.
• Interest: Advancing technology, improving performance, and pushing the boundaries of what’s possible with AI.
• Influence: Directly shape the capabilities and limitations of LLMs through their technical decisions.
2. Users:
• Role: Individuals or organizations that interact with LLMs or applications powered by them.
• Interest: Utilizing LLMs for various purposes such as information retrieval, content creation, education, or entertainment.
• Influence: Their feedback and usage patterns can guide future improvements and features.
3. Data Providers:
• Role: Entities or individuals whose data is used to train LLMs.
• Interest: Ensuring their data is used ethically and possibly receiving compensation or recognition.
• Influence: The quality and diversity of the data they provide affect the performance and biases of LLMs.
4. Regulators and Policymakers:
• Role: Government agencies and bodies that create laws and guidelines governing AI use.
• Interest: Protecting public interest, ensuring safety, privacy, and ethical standards.
• Influence: Can impose regulations that shape how LLMs are developed and deployed.
5. Businesses and Organizations:
• Role: Companies that develop LLMs or integrate them into their products and services.
• Interest: Commercial success, competitive advantage, and meeting customer needs.
• Influence: Drive innovation and set industry standards based on market demands.
6. Ethicists and Advocacy Groups:
• Role: Organizations and individuals focused on the ethical implications of AI.
• Interest: Promoting fairness, transparency, accountability, and the prevention of harm.
• Influence: Raise awareness of potential issues and advocate for responsible practices.
7. Impacted Communities:
• Role: Groups that may be affected by the biases or decisions made by LLMs.
• Interest: Ensuring fair treatment and avoiding negative consequences such as discrimination or misrepresentation.
• Influence: Can provide valuable insights into the societal impacts of LLMs and push for necessary changes.
8. Legal and Compliance Professionals:
• Role: Experts who ensure that the development and use of LLMs comply with laws and regulations.
• Interest: Minimizing legal risks and upholding ethical and legal standards.
• Influence: Guide organizations in implementing compliant and ethical AI practices.
9. Investors and Funding Bodies:
• Role: Entities that provide financial resources for LLM research and development.
• Interest: Return on investment, fostering innovation, and market growth.
• Influence: Their funding decisions can prioritize certain research directions or applications.
10. The General Public:
• Role: Society at large, which may be indirectly affected by LLMs.
• Interest: Impacts on employment, privacy, culture, and overall well-being.
• Influence: Public opinion can shape policies and the acceptance of LLM technologies.
Importance of Stakeholders in LLMs:
• Ethical Development: Stakeholder engagement ensures that ethical considerations are integrated into the development process, addressing issues like bias, fairness, and transparency.
• Balanced Perspectives: Involving diverse stakeholders leads to a more comprehensive understanding of potential impacts and benefits.
• Regulatory Compliance: Collaboration with regulators and legal experts helps in navigating the complex legal landscape surrounding AI.
• Social Responsibility: Acknowledging the concerns of impacted communities and the public fosters trust and social acceptance.
• Innovation and Improvement: Feedback from users and other stakeholders drives continuous improvement and innovation in LLMs.
Conclusion:
Stakeholders in the realm of Large Language Models are vital to the responsible advancement of AI technology. By considering the interests and influences of all stakeholders, developers and organizations can create LLMs that are not only technologically advanced but also socially beneficial, ethical, and aligned with the values and needs of society.