Python

A gradual, protocol-driven language navigating the path from simplicity to performance

Lead Summary

Python is a high-level, interpreted programming language built around a protocol-driven data model and a philosophy of readability and flexibility. Its data model defines a rich vocabulary of special methods — commonly called "dunder methods" — that allow user-defined types to participate in the full range of language operations, from arithmetic to context management to iteration. This extensibility, combined with a gradual type system added incrementally since 2015, makes Python simultaneously one of the most dynamic and most widely typed languages in use today.

Two major long-running efforts are reshaping Python's capabilities: the Faster CPython project targeting a 5x performance improvement over several releases, and free-threaded CPython (PEP 703) making optional GIL removal a production-supported feature as of Python 3.14. Neither effort has resolved Python's fundamental tensions — between dynamic power and static verification, between single-threaded performance and parallelism — but both represent the most ambitious engineering investments in the language's history.


Core Concepts

The Data Model and Dunder Methods

Python's extensibility rests on its data model: a system of special methods prefixed and suffixed with double underscores that Python invokes automatically in response to language-level operations. The informal name "dunder" (for "double underscore") is widely used in the Python community to refer to these methods. Any class that defines __add__ participates in the + operator; any class defining __iter__ and __next__ becomes a valid iterator.

This protocol-based design is the foundation of Python's duck typing philosophy: an object's fitness for a role is determined by the presence of required methods, not by its inheritance hierarchy. PEP 544 formalized this concept as typing.Protocol, enabling static duck typing — type checkers can now verify structural subtyping without requiring explicit inheritance.

Several protocols are central to the data model:

  • Iterator protocol (__iter__, __next__): formalized in PEP 234, it underlies all for loops and iterable constructs.
  • Container protocol (__len__, __getitem__, __setitem__, __delitem__): defines how custom objects behave like sequences and mappings.
  • Context manager protocol (__enter__, __exit__): powers the with statement and guarantees deterministic resource cleanup even in the presence of exceptions.
  • Numeric protocol (__add__, __radd__, __mul__, etc.): enables operator overloading for arithmetic, with a reflection fallback mechanism when the left operand returns NotImplemented.
  • Descriptor protocol (__get__, __set__, __delete__): the most fundamental layer, underpinning properties, methods, class methods, static methods, super(), and __slots__.
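
As an illustration of two of the protocols listed above, here is a minimal sketch. The class names `Countdown` and `Vec2` are hypothetical and exist only for this example; the dunder methods themselves are the standard ones.

```python
class Countdown:
    """Iterator protocol: __iter__ returns the iterator, __next__ produces values."""

    def __init__(self, start: int) -> None:
        self.current = start

    def __iter__(self):
        return self

    def __next__(self) -> int:
        if self.current <= 0:
            raise StopIteration  # tells for-loops and list() that iteration is over
        self.current -= 1
        return self.current + 1


class Vec2:
    """Numeric protocol: supports v + w, plus sum() through the reflected __radd__."""

    def __init__(self, x: float, y: float) -> None:
        self.x, self.y = x, y

    def __add__(self, other):
        if not isinstance(other, Vec2):
            return NotImplemented  # lets Python try the right operand's __radd__
        return Vec2(self.x + other.x, self.y + other.y)

    def __radd__(self, other):
        # sum() starts from the int 0, so the first step is 0 + Vec2.
        if other == 0:
            return self
        return NotImplemented

    def __eq__(self, other):
        return isinstance(other, Vec2) and (self.x, self.y) == (other.x, other.y)


countdown_values = list(Countdown(3))
total = sum([Vec2(1, 2), Vec2(3, 4)])
```

Neither class inherits from anything special: `list()` and `sum()` work purely because the expected methods are present.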

Dunder lookup bypasses the instance dictionary

Dunder methods are resolved through a slot-based lookup that walks the MRO directly at the class level. An instance cannot override __add__ by setting an instance attribute — only class-level definitions count for language operations. This is a deliberate design choice to keep language semantics stable and predictable.
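
The difference between explicit attribute access and operator dispatch can be seen in a small sketch. `Money` is a hypothetical class used only for illustration.

```python
class Money:
    def __init__(self, amount: int) -> None:
        self.amount = amount

    def __add__(self, other: "Money") -> "Money":
        return Money(self.amount + other.amount)


a, b = Money(1), Money(2)

# This assignment lands in a.__dict__; it does not touch the class.
a.__add__ = lambda other: Money(999)

# Explicit attribute access finds the instance attribute first.
explicit = a.__add__(b).amount

# The + operator looks up __add__ on type(a), ignoring the instance attribute.
implicit = (a + b).amount
```

Here `explicit` is 999 while `implicit` is 3, because only the class-level definition participates in the `+` operation.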

The Descriptor Protocol in Depth

Descriptors that define __set__() or __delete__() are data descriptors and take precedence over instance dictionary entries of the same name. Descriptors that define only __get__() are non-data descriptors and are overridden by instance dictionary entries. This asymmetry is what makes properties (data descriptors) able to enforce constraints while allowing methods (non-data descriptors) to be overridden per-instance.

Python's property() builtin is itself implemented as a data descriptor. Read-only properties define __set__() to raise AttributeError, which is sufficient to make the descriptor take precedence over any instance dictionary entry.
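
The precedence rules can be checked directly by writing into an instance's `__dict__`. This is a minimal sketch; `DataDescriptor`, `NonDataDescriptor`, `Demo`, and `Circle` are hypothetical names.

```python
class DataDescriptor:
    """Defines __get__ and __set__, so it is a data descriptor."""

    def __get__(self, obj, objtype=None):
        return "descriptor"

    def __set__(self, obj, value):
        raise AttributeError("read-only attribute")


class NonDataDescriptor:
    """Defines only __get__, so it is a non-data descriptor."""

    def __get__(self, obj, objtype=None):
        return "descriptor"


class Demo:
    guarded = DataDescriptor()
    overridable = NonDataDescriptor()


demo = Demo()
# Write straight into the instance dictionary, bypassing __set__.
demo.__dict__["guarded"] = "instance"
demo.__dict__["overridable"] = "instance"

guarded_value = demo.guarded          # the data descriptor wins
overridable_value = demo.overridable  # the instance dictionary wins


class Circle:
    def __init__(self, radius: float) -> None:
        self._radius = radius

    @property
    def radius(self) -> float:  # no setter defined, so the property is read-only
        return self._radius


try:
    Circle(1.0).radius = 2.0
    assignment_blocked = False
except AttributeError:
    assignment_blocked = True
```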

The __slots__ mechanism builds on the same protocol: when defined, Python creates member_descriptor objects for each slot name, replacing dynamic __dict__-based storage with direct memory slots. The result is reportedly a 40–50% reduction in per-instance memory and faster attribute access due to eliminated dictionary lookups.
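
A short sketch of the observable effects of `__slots__` in CPython follows. `SlottedPoint` is a hypothetical class, and the example deliberately makes no claim about memory figures.

```python
class SlottedPoint:
    __slots__ = ("x", "y")

    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


point = SlottedPoint(1, 2)

# Instances of a slotted class have no per-instance __dict__.
has_instance_dict = hasattr(point, "__dict__")

# Each slot name is exposed on the class as a descriptor object.
descriptor_type = type(SlottedPoint.__dict__["x"]).__name__

# Attributes outside __slots__ cannot be added.
try:
    point.z = 3
    rejected = False
except AttributeError:
    rejected = True
```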

Abstract Base Classes and Formal Protocols

Python offers two approaches to formalizing protocols:

  1. Abstract Base Classes (ABCs), introduced in PEP 3119. The collections.abc module defines ABCs for common protocols (Iterator, Iterable, Sized, Container, Sequence, MutableSequence, etc.). ABCs can be verified via isinstance() and provide mixin implementations of derived methods. They convert informal duck typing into a verifiable nominal system.

  2. Structural protocols (typing.Protocol), introduced in PEP 544. Protocol compliance is structural and implicit — objects need not inherit from the Protocol class; they simply need to implement the required methods. Type checkers verify this statically.

The two mechanisms are complementary: ABCs provide runtime verification and mixin logic; structural protocols provide static analysis without requiring inheritance.
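
To make the contrast concrete, the sketch below uses one ABC and one structural protocol. `Deck`, `SupportsClose`, `Resource`, and `shutdown` are hypothetical names; `Sequence`, `Protocol`, and `runtime_checkable` come from the standard library.

```python
from collections.abc import Sequence
from typing import Protocol, runtime_checkable


class Deck(Sequence):
    """Nominal approach: inherit from the ABC and implement two methods."""

    def __init__(self, cards):
        self._cards = list(cards)

    def __getitem__(self, index):
        return self._cards[index]

    def __len__(self):
        return len(self._cards)


deck = Deck(["A", "K", "Q"])
contains_king = "K" in deck           # __contains__ is a mixin from Sequence
queen_index = deck.index("Q")         # so is index()
reversed_cards = list(reversed(deck))  # and __reversed__


@runtime_checkable
class SupportsClose(Protocol):
    """Structural approach: anything with a close() method matches."""

    def close(self) -> None: ...


class Resource:  # note: no inheritance from SupportsClose
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


def shutdown(thing: SupportsClose) -> None:
    thing.close()


resource = Resource()
shutdown(resource)  # a static checker accepts this structurally
matches_protocol = isinstance(resource, SupportsClose)
```

The `@runtime_checkable` decorator is optional; it is used here only so that the structural match can also be observed with `isinstance()`, which checks for the presence of the method rather than its signature.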


Metaprogramming Patterns

Python's metaprogramming toolkit spans a range of tools with different scopes and tradeoffs.

Decorators

Python decorators are executed at runtime when the decorated function or class is defined, not in a separate compilation phase. This is the critical distinction from annotations in languages like Java, where annotations are metadata with limited runtime execution. Python decorators can perform arbitrary computations: register functions in global registries, modify class hierarchies, intercept behavior at definition time. Frameworks like Flask and FastAPI rely heavily on this capability.

The @dataclass decorator (introduced in PEP 557) is a canonical example of this pattern used conservatively: it examines a class's type annotations to identify fields, then generates __init__, __repr__, __eq__, and other dunder methods at decoration time. Crucially, by default it returns the same class object rather than creating a new one; the exception is slots=True (Python 3.10+), which has to build a new class.
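
The in-place behavior is easy to observe by applying the decorator as an ordinary function call. This is a small sketch with a hypothetical class named `Plain`.

```python
from dataclasses import dataclass, fields


class Plain:
    x: int
    y: int = 0


# Calling dataclass() directly is equivalent to using the @dataclass syntax.
decorated = dataclass(Plain)

same_object = decorated is Plain            # the class was modified in place
field_names = [f.name for f in fields(Plain)]
equal_by_value = Plain(1) == Plain(1, 0)    # uses the generated __init__ and __eq__
```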

Metaclasses and Their Limitations

Metaclasses give full control over class creation, enabling arbitrary transformation of the class namespace before the class object is finalized. However, they have a fundamental composability problem: if two unrelated base classes use different custom metaclasses, there is no automatic way to combine them. A new metaclass must be explicitly created that inherits from both. This makes metaclasses toxic in public library code.
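
The composability problem can be reproduced in a few lines. All class names below are hypothetical.

```python
class MetaA(type):
    pass


class MetaB(type):
    pass


class A(metaclass=MetaA):
    pass


class B(metaclass=MetaB):
    pass


# Combining bases with unrelated metaclasses fails at class creation time.
try:
    class C(A, B):
        pass
    conflict = ""
except TypeError as exc:
    conflict = str(exc)


# The manual fix: a metaclass that inherits from both.
class MetaAB(MetaA, MetaB):
    pass


class D(A, B, metaclass=MetaAB):
    pass
```

The author of `D` has to know about, and be able to reconcile, the metaclasses of both bases, which is the burden the paragraph above describes.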

In practice, the vast majority of metaclass use cases fall into just three categories according to PEP 487: running initialization code after class creation, initializing descriptors, and preserving attribute definition order. All three can now be addressed with simpler mechanisms.

Simpler Alternatives (PEP 487)

PEP 487 introduced two metaclass-free hooks:

  • __init_subclass__(): defined on a parent class, it is automatically called whenever a subclass is created, with keyword arguments from the class definition forwarded to it. Unlike class decorators, it applies to all future subclasses — not just the immediate class.
  • __set_name__(): called during class creation to automatically inform descriptors of their owner class and the name they were assigned to, eliminating a major metaclass use case.

Class decorators and __init_subclass__ serve complementary roles: decorators are one-off transformations; __init_subclass__ enforces invariants across an entire hierarchy.
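
Both hooks can be sketched together. The `Plugin` registry, the `Positive` descriptor, and the `Resizer` subclass are hypothetical names chosen for illustration.

```python
class Plugin:
    registry: dict = {}

    def __init_subclass__(cls, name=None, **kwargs):
        # Runs once for every subclass; `name` comes from the class statement.
        super().__init_subclass__(**kwargs)
        Plugin.registry[name or cls.__name__] = cls


class Positive:
    """Descriptor that learns its attribute name through __set_name__."""

    def __set_name__(self, owner, name):
        self.storage_name = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.storage_name)

    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError("value must be positive")
        setattr(obj, self.storage_name, value)


class Resizer(Plugin, name="resize"):
    width = Positive()  # __set_name__ receives owner=Resizer, name="width"

    def __init__(self, width: int) -> None:
        self.width = width


registered = Plugin.registry["resize"] is Resizer
stored_width = Resizer(10).width

try:
    Resizer(0)
    invalid_rejected = False
except ValueError:
    invalid_rejected = True
```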

MRO and Multiple Inheritance

Python has used the C3 linearization algorithm for Method Resolution Order (MRO) since Python 2.3, producing a consistent and predictable ordering of base classes. When two classes list the same base classes in conflicting orders, Python raises a TypeError ("Cannot create a consistent method resolution order (MRO) for bases ...") rather than silently resolving the ambiguity. This design ensures that the MRO is always unambiguous, even in complex multiple-inheritance hierarchies.
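
A short sketch shows both outcomes: a diamond hierarchy that linearizes cleanly, and a pair of bases whose orderings cannot be reconciled. All class names are hypothetical.

```python
class Base:
    pass


class Left(Base):
    pass


class Right(Base):
    pass


class Bottom(Left, Right):
    pass


# C3 keeps each class ahead of its parents and preserves the listed base order.
diamond_mro = [cls.__name__ for cls in Bottom.__mro__]


class A:
    pass


class B:
    pass


class X(A, B):  # says A comes before B
    pass


class Y(B, A):  # says B comes before A
    pass


try:
    class Z(X, Y):  # no ordering satisfies both X and Y
        pass
    mro_error = ""
except TypeError as exc:
    mro_error = str(exc)
```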


The Type System: Gradual Typing

History and Foundation

PEP 484 introduced type hints to Python, establishing the foundational specification for Python's gradual type system. The theoretical framework, described in PEP 483, rests on the consistency relation — a replacement for traditional type equality that allows interaction between typed and untyped code. The special type Any is consistent with all other types, enabling code to transition between strictly typed and dynamically typed regions without type errors.

Gradual typing allows developers to annotate code incrementally. A fully unannotated codebase is valid; annotations can be added one file or one function at a time without requiring full coverage.
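
The mix of typed and untyped code might look like the following sketch. The function names are hypothetical, and the comments describe how a type checker would be expected to treat each function rather than anything enforced at runtime.

```python
from typing import Any


def legacy_parse(raw):
    # Unannotated: a checker treats the parameter and return value as dynamic.
    return raw.strip().split(",")


def total_length(parts: list[str]) -> int:
    # Fully annotated: a checker verifies callers and the body.
    return sum(len(part) for part in parts)


def bridge(value: Any) -> int:
    # Any is consistent with every type, so passing it to a list[str]
    # parameter is accepted statically without a cast.
    return total_length(value)


# Typed and untyped code interoperate; annotations are not enforced at runtime.
result = total_length(legacy_parse(" a,bb,ccc "))
bridged = bridge(["xy", "z"])
```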

Adoption

Type hints have achieved substantial adoption. According to a survey reported by Engineering at Meta, 88% of respondents use type hints "Always" or "Often" in their Python code.

The ecosystem built around type hints now includes:

  • Typeshed: a centralized repository of external type stubs for the standard library, builtins, and third-party packages, used by all major type checkers.
  • PEP 561: the standard for distributing type information in packages, with stub packages versioned to match their corresponding runtime package.
  • Runtime validators: frameworks like Pydantic and FastAPI use type annotations at runtime for data validation and API documentation generation — a distinct usage category beyond static checking.

Advanced Typing Constructs

The type system has grown considerably:

  • ParamSpec (PEP 612): enables forwarding parameter signatures through decorators and higher-order functions.
  • TypeVarTuple (PEP 646): variadic generics, particularly useful for array shape typing.
  • TypeGuard: type narrowing predicates that allow type checkers to understand when a runtime check validates a more specific type.
  • isinstance() narrowing: type checkers perform automatic type narrowing based on isinstance() checks.

Python's type hint system is Turing complete: the type system can express computations of arbitrary complexity, and certain programs cause mypy to enter infinite loops.

This theoretical result, proven in a 2023 ECOOP paper, has practical consequences: no type checker can guarantee termination for all valid Python type annotations.

The Type Checker Landscape

Multiple type checkers exist with differing behaviors:

  • mypy: the original, developed at Dropbox, which historically served as the reference implementation. Whatever mypy implemented became the de facto specification in areas not formally covered by a PEP. However, mypy now passes only 57% of test cases in the official Python typing specification conformance suite, indicating significant gaps.
  • Pyright: Microsoft's type checker written in TypeScript. Pyright achieves 3–5x performance improvement over mypy through lazy evaluation (demand-driven type computation vs. multi-pass architecture). Pyright and mypy diverge on reachability analysis, handling of unannotated code, and class decorator semantics.
  • Pyrefly: Meta's newer entry, which published the conformance comparisons cited above.

Metaprogramming and Static Analysis Limits

Static type checkers can only reason about information that exists before runtime. Python's dynamic metaprogramming — reflection, code generation, runtime method manipulation — fundamentally resists static analysis. The @dataclass_transform() decorator in the typing module is the current workaround: it annotates custom decorators, classes, or metaclasses that perform dataclass-like code generation, telling type checkers to treat the target as performing "magic" transformations. It is a pragmatic approximation, not a general solution.

An emerging research approach, gradual metaprogramming, proposes type-checking code fragments as they are generated, with incremental runtime checks as they are spliced together — enabling detection of code generation errors at the source location rather than at runtime.


Concurrency

Python's concurrency story is fragmented across three orthogonal models, each with different tradeoffs.

The Global Interpreter Lock

The GIL is a deliberate design choice in CPython, not an oversight. It exists to protect CPython's reference-counting memory management: creating or dropping a reference to an object increments or decrements that object's reference count, so unsynchronized concurrent updates would corrupt memory. The GIL provides a single lock that guards all reference-counting operations, which removes the need for fine-grained per-object locks and keeps single-threaded execution fast.

The consequence is that multiple threads cannot execute Python bytecode simultaneously. Threading remains useful for I/O-bound workloads because the GIL is released during blocking I/O operations (network, file, database). For CPU-bound work done in pure Python, threading provides no parallelism benefit.
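
The I/O-bound case can be sketched with a thread pool. The example uses `time.sleep` as a stand-in for a blocking network or file call (an assumption made so that the sketch is self-contained); like real blocking I/O, sleeping releases the GIL. `fake_io` is a hypothetical name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay: float) -> float:
    time.sleep(delay)  # the GIL is released while this thread is blocked
    return delay


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Four 0.2-second waits overlap instead of running back to back.
    results = list(pool.map(fake_io, [0.2] * 4))
elapsed = time.perf_counter() - start
```

Run serially, the four calls would take about 0.8 seconds; with four threads the waits overlap and the total is close to a single wait. Replacing `fake_io` with a pure-Python computation on a GIL-enabled build would remove that benefit.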

Multiprocessing

The classical workaround for CPU-bound parallelism is multiprocessing: spawning independent processes, each with its own Python interpreter and its own GIL. True parallelism is achieved, but at a cost — all data shared between processes must be serialized (pickled) and deserialized, which introduces significant overhead for large objects like NumPy arrays or Pandas DataFrames.

Python 3.8 introduced multiprocessing.shared_memory as a partial solution, allowing processes to access the same memory block directly without pickling. The choice among the three IPC primitives (Queue, Pipe, Manager) involves real tradeoffs: Pipes offer the best performance for point-to-point communication at the cost of a more limited API; Queues are safe to share across threads and processes but have higher overhead; Managers are flexible but slow.
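
The attach-by-name mechanism of `multiprocessing.shared_memory` can be sketched as follows. To keep the example self-contained, both handles live in one process; in real use the second handle would be opened in a different process that was given the block's name.

```python
from multiprocessing import shared_memory

# Create a named block of shared memory and write raw bytes into it.
owner = shared_memory.SharedMemory(create=True, size=16)
try:
    owner.buf[:5] = b"hello"

    # A second handle attaches to the same block by name. No pickling occurs;
    # both handles view the same underlying bytes.
    reader = shared_memory.SharedMemory(name=owner.name)
    try:
        seen = bytes(reader.buf[:5])
    finally:
        reader.close()  # every handle closes its own mapping
finally:
    owner.close()
    owner.unlink()      # the creator releases the block exactly once
```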

Asyncio

Asyncio implements cooperative multitasking through an event loop that manages a single thread running multiple coroutines. Explicit await keywords serve as yield points where control returns to the event loop, allowing other coroutines to proceed. Because coroutine scheduling operates at the Python level rather than through OS context switches, thousands of coroutines can run within a single thread with minimal overhead.

The event loop uses platform-specific I/O multiplexing (epoll on Linux, kqueue on macOS, IOCP on Windows) to monitor multiple I/O operations simultaneously.
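
The cooperative hand-off at each await can be observed by logging when coroutines start and finish. This is a minimal sketch; `fetch` is a hypothetical coroutine and `asyncio.sleep` stands in for real I/O.

```python
import asyncio


async def fetch(name: str, delay: float, log: list[str]) -> str:
    log.append(f"start {name}")
    await asyncio.sleep(delay)  # yield point: control returns to the event loop
    log.append(f"done {name}")
    return name


async def main() -> tuple[list[str], list[str]]:
    log: list[str] = []
    # Both coroutines run on one thread, interleaved at their await points.
    results = await asyncio.gather(
        fetch("slow", 0.1, log),
        fetch("fast", 0.01, log),
    )
    return results, log


results, log = asyncio.run(main())
```

The log shows both coroutines starting before either finishes, and "fast" completing first, while `gather` still returns results in the order the coroutines were passed.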

The colored function problem

Python's async/await syntax propagates like a virus: once a function is declared async def, every caller that needs its result must await it and therefore be async as well, all the way up to the event loop's entry point. This contrasts with Go's goroutines, which handle scheduling transparently without requiring explicit annotations throughout the codebase. Structured concurrency tools like TaskGroup (Python 3.11+) improve safety but do not address the coloring problem.

TaskGroup, introduced in Python 3.11, provides structured concurrency with stronger safety guarantees than asyncio.gather(). When any task raises an exception, the TaskGroup automatically cancels all remaining scheduled tasks and waits for them to finish before propagating the exceptions in an ExceptionGroup. This prevents orphaned tasks.

Subinterpreters (PEP 554 and 684)

PEP 554 proposes an API for running multiple independent Python interpreters within a single process. PEP 684 extends this with a per-interpreter GIL, enabling true parallelism because individual interpreters no longer share the same lock. This is a fourth architectural approach to concurrency, distinct from threading, multiprocessing, and asyncio.


Free-Threaded Python (PEP 703)

The most structurally significant change to Python's concurrency model comes from PEP 703, which makes the GIL optional. The effort to remove the GIL was reignited by Sam Gross in 2021 and formalized as PEP 703 in 2023, driven primarily by contributors at Meta including Ken Jin, Donghee Na, and Itamar Oren.

Technical Mechanisms

Free-threaded CPython replaces the GIL with a combination of three mechanisms:

  1. Biased reference counting: leverages the observation that most objects are accessed by only a single thread, enabling non-atomic reference count updates in the common case and atomic updates only when objects are accessed from multiple threads.
  2. Deferred reference counting: postpones atomic reference count updates for frequently-accessed objects (top-level functions, code objects, modules, methods) by deferring their destructors until the next cyclic garbage collection cycle.
  3. Immortalization: marks certain long-lived immutable objects with a special flag, preventing deallocation entirely and eliminating all reference count updates for these objects.

Per-object locks replace the GIL for protecting shared mutable state within built-in types like dict, list, and set. CPython's default allocator (pymalloc) is replaced with mimalloc, a thread-safe allocator designed for multi-threaded workloads.

Performance Profile

Free-threaded Python carries measurable overhead on single-threaded workloads:

  • Python 3.13 incurred approximately 40% single-threaded overhead (primarily because PEP 659 specialization was disabled).
  • Python 3.14 reduced this to 5–10% after re-enabling specialization, at the cost of approximately 15–20% memory overhead.
  • For comparison, the Steering Council budgets for a 10% single-threaded performance regression and a 15% memory increase.

For multi-threaded CPU-bound workloads, the gains are substantial:

  • Free-threaded Python 3.13 achieved approximately 2.2x faster execution on multi-threaded benchmarks.
  • Python 3.14 improved this to approximately 3.1x speedup.

Ecosystem Status

Free-threaded CPython is not ABI-compatible with the standard GIL-enabled build. The Python object header structure was changed to support biased reference counting, requiring C extensions to be explicitly recompiled. C extensions that do not declare free-threading support trigger a warning and cause the free-threaded interpreter to automatically re-enable the GIL — preventing native parallelism until the ecosystem catches up.
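
Because the GIL can be switched back on at runtime, a program may want to check both how the interpreter was built and what state it is currently in. The sketch below uses `sysconfig.get_config_var("Py_GIL_DISABLED")` and `sys._is_gil_enabled()`; the latter exists only on Python 3.13 and later, so the code falls back to assuming the GIL is enabled on older versions.

```python
import sys
import sysconfig

# Truthy only on builds compiled with free-threading support.
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# sys._is_gil_enabled() reports the current runtime state (Python 3.13+).
# On older interpreters the function is absent and the GIL is always on.
gil_check = getattr(sys, "_is_gil_enabled", None)
gil_enabled = gil_check() if gil_check is not None else True

if free_threaded_build and gil_enabled:
    status = "free-threaded build, but the GIL is currently enabled"
elif free_threaded_build:
    status = "free-threaded build running without the GIL"
else:
    status = "standard GIL-enabled build"
```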

As of Python 3.14, free-threaded Python has moved from experimental (3.13) to officially supported but non-default. PEP 779 defines the criteria for this supported status. Whether free-threaded becomes the default build remains undecided, contingent on ecosystem adoption and further performance improvements.


Performance Engineering

The Faster CPython Project

The Faster CPython project, led by Mark Shannon and funded by Microsoft, established a target of achieving 5x speedup over several releases starting from Python 3.10 as the baseline. Microsoft's funding was discontinued in May 2025, transitioning the project to community stewardship, though development continues.

Progress by version:

Version   Speedup vs. 3.10             Key mechanism
3.11      ~1.25x (10–60%)              PEP 659 specializing adaptive interpreter with inline caching; zero-cost exceptions
3.12      incremental                  Interpreter generated from a DSL; multi-opcode optimization
3.13      incremental (JIT disabled)   Micro-op translation infrastructure (disabled by default due to overhead)
3.14      ~1.2–1.4x cumulative         Experimental micro-op JIT included in official Windows and macOS binaries (opt-in)

As of Python 3.14, cumulative improvements from 3.11–3.14 amount to approximately 20–40% over the 3.10 baseline.

PEP 659: The Specializing Adaptive Interpreter

PEP 659 is the architectural foundation of Faster CPython. It exploits type stability — the observation that while Python is dynamically typed, most code regions have types that rarely change at runtime. The interpreter rewrites bytecode instructions in-place with type-specialized versions during execution, caching type information in CACHE instructions stored directly in the instruction stream.

This inline caching approach improves cache locality and reduces indirection during bytecode dispatch. When types change, the specialized instruction is replaced with the generic version, and the process starts over.

Python 3.11: Zero-Cost Exceptions

The Faster CPython project implemented zero-cost exceptions in Python 3.11, removing nearly all runtime overhead of try statements when no exception is raised. This allows Python code to use exception handling idiomatically, including exception-based control flow such as StopIteration in generators, without incurring performance penalties on the happy path.

Alternative Implementations

A comprehensive empirical study (arxiv.org/abs/2505.02346) compared eight Python compilation tools across performance, energy consumption, memory usage, and cache behavior:

[Fig 1. Python compilation tool tradeoffs: performance gain vs. compatibility cost. Axes: code modification required, performance gain. Tools plotted: PyPy, Numba, Codon, Cython, Nuitka, Mypyc, CPython JIT.]

Key findings:

  • PyPy (tracing JIT via RPython meta-tracing): combines CPython compatibility with minimal code changes and achieves large speedups, but faces adoption barriers from C extension incompatibility. PyPy's garbage collector breaks fundamental C API assumptions about object lifetime, requiring an expensive emulation layer (cpyext).
  • Numba (LLVM-based JIT): achieves performance approaching C or Fortran for numerical code, with 1–2 orders of magnitude speedup. Requires @jit decorator annotations and has significant restrictions in nopython mode (no arbitrary standard library calls).
  • Codon: achieves over 90% speed and energy improvements but requires substantial code modifications and has compatibility issues with standard libraries like NumPy.
  • Cython (AOT compilation to C extensions): generally 20–60% faster for general code, 60–90% for control structures, but Cython 3.0 introduced a 3x compilation time regression compared to version 0.29.
  • Deployment restrictions: platforms prohibiting runtime code generation (e.g., iOS) cannot use JIT-based optimizations, limiting achievable speedup to roughly 2x through interpretive and AOT approaches.

Mojo

Mojo is a Python superset built on MLIR infrastructure, targeting systems-level programming and accelerator hardware (GPUs, TPUs). Unlike other Python optimization projects, its goal is not merely faster Python execution but direct access to accelerated hardware. It integrates CPython for Python interoperability and claims up to 68,000x the performance of Python on specific workloads, though this figure reflects specialized hardware acceleration rather than a general Python speedup.


Controversies & Debates

Mypy Conformance and Checker Fragmentation

Despite mypy's historical role as the reference implementation, it now passes only 57% of test cases in the Python typing specification's conformance test suite, according to Pyrefly's analysis. The existence of multiple type checkers with diverging behavior — pyright, mypy, pyrefly, pytype — creates a fragmented ecosystem where code that passes one checker may fail another.

The divergences are substantive, not cosmetic: reachability analysis, handling of unannotated code, class decorator semantics, and overload resolution all differ between checkers. This fragmentation was recognized as a problem leading to the formalization of a typing specification separate from any single implementation.

The GIL Removal Decision

PEP 703 chose to make the GIL optional rather than removing it. The standard CPython build continues to include and use the GIL. This decision prioritizes backward compatibility: existing code and C extensions continue to function without modification on the standard GIL-enabled build. The cost is maintaining two ABI-incompatible builds indefinitely.

Whether free-threaded builds will eventually become the default is undecided. The Steering Council has set specific criteria (PEP 779) but made no commitment on timeline.

# type: ignore and Technical Debt

The # type: ignore directive can suppress type warnings without identifying which specific error is being suppressed. Used as a blanket suppression, it creates technical debt that obscures real issues. Best practices — encoding specific error codes (# type: ignore[import-not-found]), documenting the reason, tracking ignores for eventual refactoring — are not enforced by default. Tools like mypy's warn_unused_ignores and Ruff's PGH003 rule exist to enforce discipline, but adoption is optional.
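
To illustrate the difference between blanket and targeted suppressions, here is a rough sketch of a scanner that flags `# type: ignore` comments lacking an error code. It is a hypothetical helper for illustration, not how mypy or Ruff implement their checks, and the regular expression is a simplification.

```python
import re

# Matches "# type: ignore" only when it is NOT followed by "[code]".
BLANKET_IGNORE = re.compile(r"#\s*type:\s*ignore(?!\[)")


def find_blanket_ignores(source: str) -> list[int]:
    """Return 1-based line numbers that carry a bare '# type: ignore'."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if BLANKET_IGNORE.search(line)
    ]


sample = (
    "import foo  # type: ignore\n"
    "import bar  # type: ignore[import-not-found]\n"
    "x = 1\n"
)

flagged = find_blanket_ignores(sample)
```

Only the first line of the sample is flagged; the second names the specific error it suppresses, so any new, unrelated error on that line would still be reported by the checker.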

