Bugs (Software Engineering)

From semantic roots to AI-generated code: the empirical science of software defects

Lead Summary

A software bug is any defect that causes a program to behave differently from its specification. Bugs span the full development lifecycle: from requirements ambiguities that propagate into design, through coding mistakes and configuration errors, to race conditions in concurrent execution. They are not rare accidents but a structural property of complex systems. No large language model has been found to generate consistently defect-free code, no large production codebase has been shown to be free of defects, and empirical studies of major open-source projects find that semantic bugs dominate all other defect categories while also being the hardest to detect automatically. Understanding bugs as a domain requires engaging with taxonomy (what kinds exist), economics (when they cost the most to fix), human factors (why programmers introduce them), and tooling (how detection works in practice). Each of these angles has a substantial empirical literature, and each reveals surprising nuance.

Definition & Scope

The word "bug" is informal. In engineering practice the preferred term is defect, and its definition involves two elements: a deviation from some specification, and a resulting failure to behave as intended in the context of use. Root-cause classification distinguishes between inadequate process (the implementation fails to meet a valid spec) and inadequate specification (the implementation satisfies the spec but the spec is wrong). Both manifest as bugs to end users but require entirely different remediation strategies.

Severity and priority are separate dimensions in bug triage. Severity reflects impact on system functionality: how bad the failure is for users. Priority reflects business urgency: how soon it needs to be fixed given organizational context. These two attributes frequently diverge. A typo in the company name on a prominent page may be low severity but high priority because it affects the company's brand; a crash in an edge-case path may be high severity but low priority if no user has ever triggered it. Severity is typically assigned by whoever discovers the bug; priority is assigned during triage by someone with business context.
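One way to keep the two dimensions separate in a tracker is to store them as independent fields. The sketch below is illustrative only: the `Severity` and `Priority` scales, the `BugReport` record, and the sample reports are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Impact on system functionality (hypothetical four-level scale)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


class Priority(IntEnum):
    """Business urgency; a lower number means fix sooner (hypothetical scale)."""
    P1 = 1
    P2 = 2
    P3 = 3


@dataclass(frozen=True)
class BugReport:
    title: str
    severity: Severity  # set by whoever finds the bug
    priority: Priority  # set later, during triage


def triage_order(reports):
    """Sort by urgency first, breaking ties with the more severe bug first."""
    return sorted(reports, key=lambda r: (r.priority, -r.severity))


reports = [
    BugReport("Crash in untriggered edge-case path", Severity.HIGH, Priority.P3),
    BugReport("Typo in company name", Severity.LOW, Priority.P1),
]
ordered = [r.title for r in triage_order(reports)]
```

Here the low-severity typo is scheduled ahead of the high-severity crash, which is the divergence described above.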

Classification & Taxonomy

The three canonical families

Large-scale empirical studies of production open-source code converge on three broad bug families:

Semantic bugs are the dominant category. In an analysis of 2,060 bugs sampled from the Linux kernel, Mozilla, and Apache, semantic bugs were the largest category overall and also accounted for the majority of security-related defects. They are application-specific: they reflect the gap between the programmer's understanding of what the program should do and what it actually does, which makes them resistant to generic automated detection tools. Unlike memory bugs, there is no language-level mechanism that prevents a semantic bug.

Memory bugs (null pointer dereferences, buffer overflows, use-after-free errors, double frees) have shown a declining trend in open-source software as effective detection tools became widely adopted. But they have not disappeared. NULL pointer dereference (CWE-476) remains on the MITRE CWE Top 25 Most Dangerous Software Weaknesses and continues to cause crashes in PHP-FPM, the Windows kernel, and the macOS kernel. For specific bug classes in large C/C++ codebases, static analysis tools like CodeQL and Infer produce false alarm rates above 95%: the detection capability exists, but at that noise level its practical utility approaches zero.

Concurrency bugs are especially prevalent in systems software. A comprehensive study of 105 real-world concurrency bugs across MySQL, Apache, Mozilla, and OpenOffice documents their prevalence and characteristics. The Linux kernel exhibits a higher proportion of concurrency bugs than application-layer software, reflecting the complexity of interrupt handling and hardware interaction. Approximately one-third of non-deadlock concurrency bugs are caused by violations of programmer order intentions: scenarios where actual execution order differs from what the programmer assumed. About 34% of non-deadlock bugs involve multiple variables, a gap that existing single-variable analysis tools do not address. And approximately 73% of non-deadlock bugs were not fixed by simply adding or modifying locks. The first attempted fix is often incorrect, because reasoning about concurrent execution is fundamentally difficult.
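To see why locks alone do not address an order violation, it can help to enumerate the possible schedules explicitly. The following is a minimal sketch, not real threading code: the two steps `init_step` and `use_step` and the shared `connection` slot are hypothetical, and each step runs atomically, as if it already held a lock.

```python
from itertools import permutations


def run_schedule(schedule):
    """Execute the named steps in order and report what happened."""
    shared = {"connection": None}

    def init_step():
        # "Thread A": creates the shared resource.
        shared["connection"] = "open"

    def use_step():
        # "Thread B": written on the assumption that init_step already ran.
        if shared["connection"] is None:
            raise RuntimeError("used before init")

    steps = {"init": init_step, "use": use_step}
    try:
        for name in schedule:
            steps[name]()
    except RuntimeError as exc:
        return str(exc)
    return "ok"


# Try both possible orderings of the two atomic steps.
outcomes = {order: run_schedule(order) for order in permutations(("init", "use"))}
```

Even though neither step is ever interrupted, the schedule that runs `use` before `init` fails. Mutual exclusion does not enforce ordering, so a fix of this kind of bug typically needs an explicit ordering mechanism (for example, waiting on an event or joining the initializing thread) rather than another lock.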

Defect origin phases

Defects are not uniformly distributed in time. They originate across development phases: requirements phase (specification errors), design phase (architectural violations), coding phase (implementation mistakes), integration phase (interface failures), and deployment phase (configuration or environmental mismatches). The origin phase matters because it determines detection cost — a defect born in requirements and discovered in production carries a rework burden spanning every artifact built on top of the original error.

In aerospace software systems, empirical breakdown shows 58% of errors originate in code logic, 25% from input/sensor/command sources, and 16% from configuration data. Coding mistakes are the majority, but by no means the complete picture in safety-critical domains.

Type-based classification

A practical type taxonomy for defects includes at least five categories: functional, security, performance, compatibility, and usability. These categories are not mutually exclusive — a race condition in authentication might simultaneously be a functional bug, a security vulnerability, and a performance issue. The OWASP threat landscape illustrates a shifting distribution: injection vulnerabilities dropped from #3 in 2021 to #5 in 2025, not because injection became less dangerous, but because new attack surfaces in cloud and API services elevated other categories.

Orthogonal Defect Classification

The most rigorous industrial taxonomy framework is Orthogonal Defect Classification (ODC), developed at IBM Research in the late 1980s and early 1990s. ODC captures defects across dimensions including activity, trigger, severity, origin, content, and type, enabling teams to extract measurable signals about the development process from the defects themselves. Applying ODC reportedly reduced the time needed for defect analysis by a factor of 10 or more. It has been successfully applied across waterfall, spiral, and agile processes.

Why taxonomies matter

Structured defect taxonomies do more than organize data. Empirical industrial case studies demonstrate that taxonomy-based testing reduces the number of required test cases while increasing failures found per test case. Taxonomies function as organizational knowledge capture mechanisms — codifying domain expertise into reusable frameworks for test design, resource allocation, and quality measurement.

Core Concepts

The cost-of-defect curve

Barry Boehm's foundational empirical work in the late 1970s established that fixing a defect becomes exponentially more expensive the later it is discovered. Derived from TRW and IBM project data in Waterfall development contexts, his studies showed multipliers ranging from 1x at requirements to 100x after release. The IBM Systems Sciences Institute later reported specific multipliers: 1x during design, 6.5x during implementation, 15x during testing, and 60–100x after release.
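The multipliers quoted above can be turned into a rough rework estimate. This is a back-of-the-envelope sketch only: the base cost of 100 units and the defect counts per phase are made-up inputs, and the post-release multiplier is carried as a (low, high) range because the source gives 60–100x.

```python
# Multipliers as quoted in the text; post-release is a (low, high) range.
COST_MULTIPLIER = {
    "design": (1.0, 1.0),
    "implementation": (6.5, 6.5),
    "testing": (15.0, 15.0),
    "release": (60.0, 100.0),
}


def rework_cost(base_cost, defects_found):
    """Return (low, high) total fix cost for defects found in each phase."""
    low = sum(base_cost * COST_MULTIPLIER[phase][0] * count
              for phase, count in defects_found.items())
    high = sum(base_cost * COST_MULTIPLIER[phase][1] * count
               for phase, count in defects_found.items())
    return low, high


# Hypothetical project: most defects are caught early, one escapes to release.
example = rework_cost(
    100.0,
    {"design": 10, "implementation": 10, "testing": 5, "release": 1},
)
```

Under these assumed inputs, the single escaped defect costs between 6,000 and 10,000 units, which is comparable to all ten defects caught during implementation (6,500 units).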

The mechanism is defect propagation and cascading rework. When a requirements error goes undetected, it propagates into design (multiple components built on the flawed requirement), then into implementation (code written to satisfy the flawed design), then into testing (tests written to validate the wrong behavior). Late discovery requires rework across every downstream artifact — requirements, designs, code, and tests that validated the wrong behavior must all be revised.

The Boehm curve was derived from large, sequential, government-contracted projects. Modern CI/CD practices compress this curve significantly. Continuous integration enables defect detection within minutes of code change, while the codebase change is still recent and isolated. The shift-left testing philosophy operationalizes this: move quality activities earlier, catching defects while dependent changes have not yet accumulated. In continuous deployment environments, the detection-to-fix cycle shrinks from weeks to hours, substantially reducing the practical cost multiplier.

"Shipping first time code is like going into debt. A little debt speeds development so long as it is paid back promptly with a rewrite." — Ward Cunningham, OOPSLA 1992

Technical debt as a defect driver

Ward Cunningham coined the technical debt metaphor in 1992, framing expedient code shortcuts as financial obligations with compounding interest. Codebases with unaddressed technical debt exhibit higher bug densities across multiple releases: files with self-admitted technical debt comments tend to have significantly more defects than clean files. Code smells increase both the time to change code and the probability of introducing bugs during that change.

Structural code metrics predict this. Code complexity measures — cyclomatic complexity, lines of code, nesting depth — show strong, stable correlations with defect density. Machine learning models combining coupling, complexity, and cohesion metrics achieve over 80% accuracy in predicting defect-prone classes. Highly coupled modules exhibit disproportionately high bug densities, because changes to coupled dependencies are more likely to be incomplete or incorrect.
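As an illustration of what such a metric measures, cyclomatic complexity can be approximated by counting decision points in a syntax tree. The sketch below uses Python's standard `ast` module; the choice of which node types count as decisions is a simplifying assumption, and real metric tools differ in the details.

```python
import ast

# Node types treated as decision points in this approximation.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds one extra path per additional operand.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

sample_score = cyclomatic_complexity(SAMPLE)
```

The sample function scores 5: one base path, two `if` statements, one `for` loop, and one `and` operator.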

Mechanism & Process

How programmers introduce bugs

Reason's cognitive error taxonomy maps directly onto programming defects. Three distinct failure modes exist:

Slips are inattentional failures in routine execution — typos, forgotten code elements, using <= instead of < in a loop bound. Off-by-one errors (OBOE) are canonical slips: misunderstanding boundary conditions in loops, array indexing, and API conventions. They occur in experienced programmers under high attentional demand.

Lapses are memory failures in planned steps — omitted boundary checks, copy-paste reuse where identifiers are not consistently updated. Copy-paste bug propagation is a particularly damaging lapse mechanism: a single malformed pattern duplicated across a codebase replicates the same error at scale.

Mistakes are planning or design errors from incorrect mental models. When API documentation fails to make the design model clear, programmers construct incorrect mental models and implement solutions based on flawed understanding. Outdated mental models persist and do not automatically update when systems change.
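The first two failure modes are easy to show in miniature. The functions below are invented for illustration: one contains an off-by-one slip in a loop bound, the other a copy-paste lapse where a pasted line was not fully updated.

```python
def sum_first_n_buggy(values, n):
    """Intended to sum the first n items."""
    total = 0
    i = 0
    while i <= n:  # slip: `<=` should be `<`, so one extra item is read
        total += values[i]
        i += 1
    return total


def sum_first_n(values, n):
    """Corrected version: indices 0 .. n-1 only."""
    return sum(values[:n])


def bounding_box_buggy(points):
    """Intended to return (min_x, max_x, min_y, max_y)."""
    min_x = min(p[0] for p in points)
    max_x = max(p[0] for p in points)
    min_y = min(p[1] for p in points)
    max_y = max(p[0] for p in points)  # lapse: pasted line, index not updated
    return min_x, max_x, min_y, max_y


def bounding_box(points):
    """Corrected version: y values use index 1 throughout."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), max(xs), min(ys), max(ys)
```

With `values = [1, 2, 3, 4]` and `n = 3`, the buggy sum silently returns 10 instead of 6; with `n = 4` it raises `IndexError`. Neither bug requires a wrong mental model of the problem, which is what separates slips and lapses from mistakes.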

Confirmation bias is a pervasive cognitive force in testing and debugging. Programmers tend to design test cases that confirm expected correct behavior rather than searching for refuting cases. This bias persists through code review, where reviewers under time pressure default to confirmation-seeking. Deadline pressure amplifies the effect: in controlled experiments, the measured increase in defect rates outweighs the apparent productivity gain. Logical reasoning ability reduces but does not eliminate confirmation bias.

Debugging itself depletes cognitive resources. Prolonged troubleshooting requires constructing and maintaining approximations of system behavior across combinatorial state transitions. When cognitive fatigue sets in, mental models become less accurate and hypothesis generation becomes more error-prone — this is a cognitive depletion effect, not a skill deficit.

Configuration as a bug source

Configuration defects have emerged as a dominant cause of system failures in production deployments. An empirical study of 546 real-world misconfigurations found that between 70% and 85.5% result from mistakes in setting configuration parameters rather than from fundamental design flaws. Configuration errors cause severe service outages and downtime, and represent a distinct class of human error requiring different prevention strategies than coding mistakes.
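Because most of these failures come from wrongly set parameters, one common defensive measure is to validate configuration before it is used. The sketch below is a minimal example under stated assumptions: the three parameters (`port`, `timeout_seconds`, `log_level`) and their allowed values are hypothetical, not drawn from the study.

```python
ALLOWED_LOG_LEVELS = {"debug", "info", "warning", "error"}
KNOWN_KEYS = {"port", "timeout_seconds", "log_level"}


def validate_config(raw):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []

    port = raw.get("port")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
        errors.append(f"port must be an integer in 1..65535, got {port!r}")

    timeout = raw.get("timeout_seconds")
    if (isinstance(timeout, bool) or not isinstance(timeout, (int, float))
            or timeout <= 0):
        errors.append(f"timeout_seconds must be a positive number, got {timeout!r}")

    level = raw.get("log_level")
    if not isinstance(level, str) or level not in ALLOWED_LOG_LEVELS:
        errors.append(f"log_level must be one of {sorted(ALLOWED_LOG_LEVELS)}, got {level!r}")

    # Misspelled keys are otherwise ignored silently, which hides the mistake.
    for key in sorted(set(raw) - KNOWN_KEYS):
        errors.append(f"unknown parameter {key!r}")

    return errors


good = validate_config({"port": 8080, "timeout_seconds": 30, "log_level": "info"})
bad = validate_config({"port": "8080", "timeout_seconds": 0, "log_levle": "info"})
```

The second call reports four problems: a port given as a string, a zero timeout, a missing `log_level`, and the misspelled key `log_levle`. Rejecting unknown keys is a design choice; it trades flexibility for catching typos early.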

Variants & Subtypes

Machine learning system bugs

Machine learning systems require domain-specific bug taxonomies distinct from traditional software defects. Research on deep learning frameworks identifies 34 bug symptoms, 28 root causes, and 6 fix patterns, with 45.1% of symptoms unique to distributed ML frameworks. Hyperparameter-related defects form a significant ML-specific category: the most commonly reported incorrect hyperparameters are learning rate, batch size, and epoch count. Suboptimal hyperparameter values do not necessarily cause crashes but significantly degrade training performance, making them a distinct category that does not map to traditional functional/security/performance dimensions.
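The claim that a bad hyperparameter degrades training without crashing can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w^2; the function, starting point, step count, and the two learning rates are arbitrary choices made for illustration.

```python
def train(learning_rate, steps=50, start=1.0):
    """Minimize f(w) = w**2 by gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w  # derivative of w**2
        w -= learning_rate * gradient
    return w


converged = train(0.1)   # each step multiplies w by 0.8
diverged = train(1.1)    # each step multiplies w by -1.2
```

Both calls finish without any exception. With a learning rate of 0.1 the parameter shrinks toward the minimum at 0; with 1.1 it oscillates with growing magnitude and ends far from it. Nothing in the program signals that the second run is broken, which is why such defects do not fit the usual crash-oriented categories.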

Current Status

Bugs in AI-generated code

The widespread adoption of AI code generation tools has introduced a new defect profile that differs measurably from human-written code. Comprehensive evaluations across GPT-4, Codex, DeepSeek, and other models show that no large language model generates consistently defect-free code. AI-co-authored pull requests contain approximately 1.7x as many issues overall as human-generated ones: about 10.83 issues each, compared to 6.45 in human-generated PRs.

The defect profile of AI-generated code has a characteristic shape: roughly 90–93% code smells, 5–8% bugs, and around 2% security vulnerabilities. Compared to human-written code, AI code shows approximately 75% more logic and correctness errors (missing null checks, inadequate exception handling, "almost-right" code that fails edge cases), 2x more error handling gaps, twice the rate of concurrency and dependency correctness mistakes, 2.66x more formatting issues, and 3x more readability issues.

Security vulnerabilities appear in AI-generated code at 1.5–2.74x greater frequency than in human-written code, with 29–45% of AI-generated code containing security vulnerabilities. When prompted for cybersecurity scenarios, GitHub Copilot produced security-related bugs in 40% of its suggestions. The most prevalent vulnerability types concentrate in specific CWE categories: OS Command Injection (CWE-78), Use of Insufficiently Random Values (CWE-330), and Improper Handling of Exceptional Conditions (CWE-703).

Hallucinations as bugs

LLMs frequently generate code that references non-existent APIs, functions, and library methods. High-frequency APIs yield fewer hallucinations and low-frequency APIs significantly more, establishing a clear correlation between how often an API appears in training data and how often it is hallucinated. Nearly 20% of package recommendations in AI-generated code point to libraries that do not exist. Detection frameworks achieve 90% detection accuracy and 77% fix accuracy for discovered hallucinations, but only when applied explicitly.
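A very simple explicit check for hallucinated dependencies is to parse generated code and test whether each imported top-level module can be located. The sketch below is not one of the detection frameworks mentioned above; it uses only the standard `ast` and `importlib.util` modules, and it checks the local Python environment rather than any package index, so a missing module may simply be uninstalled rather than non-existent. The module name in the example is deliberately fictitious.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list:
    """Return top-level imported module names that cannot be found locally."""
    tree = ast.parse(source)
    top_level = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                top_level.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports, which refer to the enclosing package.
            if node.level == 0 and node.module:
                top_level.add(node.module.split(".")[0])
    return sorted(name for name in top_level
                  if importlib.util.find_spec(name) is None)


GENERATED = """
import json
import nonexistent_pkg_for_demo_12345
from os import path
"""

missing = unresolved_imports(GENERATED)
```

A check like this catches only missing modules. Hallucinated functions or methods on real libraries would need attribute-level inspection or type checking.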

A large-scale empirical analysis of 211 million changed lines of code found an 8-fold increase in duplicated code blocks associated with AI-assisted development, with projections that defect rates will double as AI-assisted development scales. The root cause is fundamental: LLMs lack access to complete project context, relying instead on next-token prediction trained on large swaths of open-source data. This limitation drives systematic errors in non-standalone functions where local project conventions and constraints are unavailable to the model.

Static analysis: the adoption gap

Static analysis tools occupy a central role in defect detection but face persistent adoption challenges. The economics are unfavorable: state-of-the-art tools miss between 47% and 80% of vulnerabilities in real-world C programs — even combining multiple tools reduces but does not close this gap (30–69% false negative range). Simultaneously, false positive rates of 50% or higher are common, with organizations receiving an average of 960 security alerts daily and over 80% of analysts reporting false positives as their primary source of alert fatigue.

The behavioral consequence is suppression accumulation. A 2025 empirical study of 7,357 suppressions across 6.69 million lines of code found that developers suppress warnings for three primary reasons: eliminating false positives, imprecise warning messages, and postponing fixes. Approximately 50.8% of suppressions are useless — they provide no actual value and can unintentionally hide future warnings. In a user study with 20 experienced developers, 95% felt that static analysis tools do not present results with sufficient information to assess what the problem is, why it is a problem, and what corrective action should be taken.

When false positive rates exceed 50%, developer trust erodes irreversibly. Teams learn to distrust all alerts, delaying investigation of genuine issues that resemble previous false positives. The erosion is self-reinforcing: declining confidence leads to disabled rules and accumulated suppressions, further reducing tool effectiveness.

Adoption improves when tools integrate into familiar developer workflows. Incremental static analysis, which analyzes only changed code, reduces analysis time from hours to minutes for large repositories, enabling sustainable adoption. LLM-augmented approaches show promise in reducing false positives: a Tencent study reported that LLM filtering of SAST tool output removed false positives without pruning true positives, though precision and recall vary significantly by model and prompting strategy.
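The core idea of incremental analysis, re-examining only what changed, can be sketched with content hashes. This is a simplified assumption-laden example: files are represented as an in-memory mapping from path to text, the cache is a plain dictionary, and dependency tracking between files (which real incremental analyzers need) is left out.

```python
import hashlib


def changed_files(files, cache):
    """Return paths whose content differs from the cached digest.

    `files` maps path -> source text; `cache` maps path -> SHA-256 hex digest
    and is updated in place so the next run sees the new state.
    """
    changed = []
    for path, content in files.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if cache.get(path) != digest:
            changed.append(path)
            cache[path] = digest
    return sorted(changed)


cache = {}
first_run = changed_files({"a.py": "x = 1", "b.py": "y = 2"}, cache)
second_run = changed_files({"a.py": "x = 1", "b.py": "y = 3"}, cache)
```

On the first run every file is new, so both are returned. On the second run only `b.py` is selected, and only that file would be handed to the analyzer.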

LLM-driven fuzzing

Fuzzing — automated generation of inputs to trigger crashes and assertion failures — has been transformed by large language models. LLMs function as zero-shot fuzzers without task-specific training, generating effective test cases for diverse domains based on implicit knowledge from pretraining. They can learn intricate API constraints and dependencies from training data, eliminating the need for expensive hand-crafted grammar specifications. They can also solve complex constraint satisfaction problems that prevent traditional coverage-guided fuzzing from reaching deep code regions.
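For readers unfamiliar with the baseline technique, the sketch below shows a plain random-input fuzzer, not an LLM-driven one. The target `parse_record` is a hypothetical parser with a deliberately planted bounds bug, and the input format (a length byte, a payload, a zero terminator) is invented for the example.

```python
import random


def parse_record(data: bytes) -> bytes:
    """Parse <length byte><payload><0x00>. Contains a planted bug."""
    length = data[0]  # bug: no check for empty input
    payload = data[1:1 + length]
    if data[1 + length] != 0:  # bug: index may be past the end of the buffer
        raise ValueError("missing terminator")
    return payload


def fuzz(target, iterations=200, max_len=8, seed=0):
    """Feed random byte strings to `target` and collect unexpected exceptions."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    crashes = []
    for _ in range(iterations):
        size = rng.randint(0, max_len)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            pass  # documented rejection of malformed input, not a crash
        except Exception as exc:
            crashes.append((data, type(exc).__name__))
    return crashes


crashes = fuzz(parse_record)
```

Random inputs find this shallow bug almost immediately, because nearly any length byte points past the end of a short buffer. The harder cases are inputs that must satisfy structural constraints before reaching deeper code, which is where grammar-based and LLM-generated inputs are reported to help.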

Empirical results demonstrate the scale of discovery: Fuzz4All discovered 98 bugs across GCC, Clang, Z3, CVC5, OpenJDK, and Qiskit with 64 confirmed as previously unknown. TitanFuzz found 65 bugs in TensorFlow and PyTorch with 44 confirmed as new, achieving 30–50% higher code coverage than prior state-of-the-art. CovRL identified 58 security-related bugs in JavaScript interpreters including 15 CVEs. HLPFuzz discovered 52 bugs across 9 language processors with 37 confirmed.

Controversies & Debates

TDD: what the evidence actually says

Test-Driven Development occupies contested empirical ground. Industrial case studies from Microsoft and IBM reported pre-release defect density reductions of 40–90%, with an upfront development time cost of 15–35%. Yet a systematic meta-analysis of 27 TDD studies finds only a small positive effect on code quality and little to no discernible effect on productivity. Controlled experiments with professional developers often find statistically insignificant differences in acceptance test cases passed, cyclomatic complexity, branch coverage, and lines of code.

The gap between industrial and academic findings is attributable to multiple confounds. Studies define TDD differently and participants often lack a shared understanding of the methodology. Research participants are often students or TDD newcomers rather than experienced practitioners. Studies typically focus on greenfield code, not the legacy codebases where most industrial teams actually work. And claimed quality gains are concentrated in low-rigor studies — meta-analytic subgroup analysis shows TDD effects are much larger in industrial studies than in academic controlled experiments. The most defensible claim is that TDD leads developers to write more unit tests with higher fault-detection capability, but the mechanism (test-first ordering vs. increased test volume) remains unresolved.

A related claim — that test code coverage predicts bug detection effectiveness — has also been challenged. Coverage is not strongly correlated with test suite effectiveness in detecting bugs; high coverage numbers are easy to achieve with low-quality tests.
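A small example shows how full coverage and zero fault detection can coexist. The function and both tests below are invented for illustration; `apply_discount` contains a deliberate bug.

```python
def apply_discount(price, percent):
    """Intended: reduce price by `percent` percent."""
    # Bug: subtracts the percentage as an absolute amount.
    # Correct would be: price * (1 - percent / 100)
    return price - percent


def weak_test():
    """Executes every line of apply_discount but checks nothing."""
    apply_discount(100, 10)
    return True


def strong_test():
    """Checks a case where the buggy and correct formulas disagree."""
    return apply_discount(200, 10) == 180
```

`weak_test` yields 100% line coverage of `apply_discount` and always passes. `strong_test` covers exactly the same lines and fails, exposing the bug. Note that even an assertion on the input `(100, 10)` would pass, since the buggy and correct formulas both give 90 there, so the choice of test input matters as much as the presence of an assertion.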

Key Takeaways

  1. Bugs span the full development lifecycle, from requirements ambiguities to race conditions, and are not rare accidents but structural properties of complex systems. No large language model has been found to generate consistently defect-free code, and no large production codebase has been shown to be free of defects. Semantic bugs dominate all other defect categories while being hardest to detect automatically. Understanding bugs requires engaging with taxonomy, economics, human factors, and tooling.
  2. The cost of fixing a bug grows steeply the later it is found: 1x during design, 6.5x during implementation, 15x during testing, 60–100x after release. Defects propagate downstream into dependent artifacts. Modern CI/CD practices compress this curve by enabling detection within minutes, while shift-left testing operationalizes early quality activities.
  3. Three canonical bug families emerge from large-scale empirical studies: semantic bugs (dominant, application-specific logic flaws), memory bugs (declining but persistent), and concurrency bugs (especially prevalent in systems software). Semantic bugs resist generic automated detection. Memory bugs have declined with tool adoption. Concurrency bugs are fundamentally difficult to reason about; about 73% of non-deadlock concurrency bugs were not fixed by simply adding or modifying locks.
  4. Programmers introduce bugs through slips (inattentional failures), lapses (memory failures), and mistakes (planning errors from incorrect mental models). Confirmation bias in testing, time pressure under deadlines, and debugging-induced cognitive fatigue all amplify defect introduction. These are human factors, not skill deficits.
  5. AI-co-authored pull requests contain approximately 1.7x as many issues as human-written ones, and the defect profile of AI-generated code is roughly 90–93% code smells, 5–8% bugs, and around 2% security vulnerabilities. AI code shows 75% more logic errors, 2x more error handling gaps, 2x more concurrency mistakes, 2.66x more formatting issues, and 3x more readability issues than human code. Security vulnerabilities appear 1.5–2.74x more frequently.
  6. Static analysis tools suffer from high false positive rates (50%+) and false negatives (47–80%), creating alert fatigue that erodes developer trust irreversibly. When false positive rates exceed 50%, teams learn to distrust all alerts. Incremental static analysis and LLM-augmented filtering show promise in improving practical adoption.
  7. In meta-analysis, Test-Driven Development shows only a small positive effect on code quality, with the larger gains concentrated in industrial studies rather than in controlled experiments with students. The methodological gap arises from inconsistent TDD definitions, novice practitioners, greenfield code focus, and low-rigor studies. TDD's most defensible effect is that developers write more unit tests with better fault-detection capability.
