The money shot (wish this were included in the blog post):
https://github.com/python/cpython/pull/143068/files#diff-45b...

Apparently(?) this also needs to be attached to the function declarator and does not work as a function specifier: `static void *__preserve_none slowpath();` and not `__preserve_none static void *slowpath();` (unlike GCC attribute syntax, which tends to be fairly gung-ho about this sort of thing, sometimes with confusing results).
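Spelled out as a sketch (with `slowpath` as a placeholder name, and assuming the keyword behaves like other MSVC declarator modifiers such as `__cdecl`):

```c
/* Sketch only: `slowpath` is a placeholder; placement mirrors how MSVC
 * treats calling-convention keywords like __cdecl. */
static void *__preserve_none slowpath(void);      /* OK: attached to the declarator */
/* __preserve_none static void *slowpath(void);      error: not a declaration specifier */
```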
Yay to getting undocumented MSVC features disclosed if Microsoft thinks you’re important enough :/
Important enough, or because it benefits them directly? I have no good guess as to how improving Python's performance would benefit them, but I suspect that's the real reason.
I wonder if this is related to Python in Excel. You'll have lots of people running numerical stuff written in Python, running on Microsoft servers.
This seems like very low-hanging fruit. How is the core loop not already hyper-optimized?
I'd have expected it to be hand-rolled assembly for the major ISAs, with a C fallback for less common ones.
How much energy has been wasted worldwide because of a relatively unoptimized interpreter?
Python’s goal was never really to be fast. If that were the goal, it would’ve had a JIT long ago instead of toying with optimizing the interpreter. Guido prioritized code simplicity over speed. A lot of the speed improvements, including the JIT (PEP 744 – JIT Compilation), came about after he stepped down.
> This has caused many issues for compilers in the past, too many to list in fact. I have a EuroPython 2025 talk about this.
Looks like it refers to this:
https://youtu.be/pUj32SF94Zw
(wish it were linked in the article)
Really nice results on MSVC. The idea that tail calls effectively reset compiler heuristics and unblock inlining is pretty convincing. One thing that worries me though is the reliance on undocumented MSVC behavior — if this becomes widely shipped, CPython could end up depending on optimizer guarantees that aren’t actually stable. Curious how you’re thinking about long-term maintainability and the impact on debugging/profiling.
Thanks for reading! For now, we maintain all three interpreters in CPython, and we don't plan to remove the others anytime soon, probably never. If MSVC breaks the tail-calling interpreter, we'll just go back to building and distributing the switch-case interpreter. Windows binaries will be slower again, but such is life :(.
Also, the interpreter loop's dispatch is autogenerated and can be selected via configure flags, so there's almost no additional maintenance overhead. The main burden was the MSVC-specific changes we needed to get this working (amounting to a few hundred lines of code).
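Roughly, the configure-time selection boils down to something like this; the macro names here are illustrative, not the exact spellings in the generated code:

```c
/* Sketch with made-up macro names: the real dispatch is autogenerated,
 * and the strategy is picked at configure time. */
#if defined(USE_TAIL_CALLS)
  /* each opcode handler is its own function; dispatch is a tail call */
#  define DISPATCH() return instruction_table[*next_instr](frame, stack_pointer)
#elif defined(USE_COMPUTED_GOTOS)
  /* GCC/Clang labels-as-values: jump straight to the next opcode's label */
#  define DISPATCH() goto *opcode_targets[*next_instr]
#else
  /* portable fallback: jump back to the top of one big switch */
#  define DISPATCH() goto dispatch_opcode
#endif
```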
> Impact on debugging/profiling
I don't think there should be any, at least for Windows. Though I can't say for certain.
That makes sense, thanks for the detailed clarification. Having the switch-case interpreter as a fallback and keeping the dispatch autogenerated definitely reduces the long-term risk.
TLDR: The tail-calling interpreter is slightly faster than computed goto.
> I used to believe the tail-calling interpreters get their speedup from better register use. While I still believe that now, I suspect that is not the main reason for speedups in CPython.
> My main guess now is that tail calling resets compiler heuristics to sane levels, so that compilers can do their jobs.
> Let me show an example, at the time of writing, CPython 3.15’s interpreter loop is around 12k lines of C code. That’s 12k lines in a single function for the switch-case and computed goto interpreter.
> […] In short, this overly large function breaks a lot of compiler heuristics.
> One of the most beneficial optimisations is inlining. In the past, we’ve found that compilers sometimes straight up refuse to inline even the simplest of functions in that 12k loc eval loop.
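For anyone who hasn't seen the technique, a toy tail-call dispatcher looks roughly like this. This is not CPython's code: the opcodes, the VM struct, and the handler names are invented, and `__attribute__((musttail))` (Clang, GCC 15+) stands in for whatever guaranteed-tail-call mechanism a given compiler provides. The point is that each handler becomes a small standalone function the optimizer can reason about, instead of one label inside a 12k-line body.

```c
#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

typedef struct {
    const unsigned char *ip;  /* instruction pointer into bytecode */
    long stack[64];
    int sp;                   /* next free stack slot */
} VM;

typedef void handler_fn(VM *vm);
static handler_fn op_push1, op_add, op_print, op_halt;

/* one handler per opcode, indexed by opcode value */
static handler_fn *const table[] = { op_push1, op_add, op_print, op_halt };

/* the tail call is the whole trick: control passes handler-to-handler
 * without growing the C stack, and each handler stays small */
#define DISPATCH(vm) __attribute__((musttail)) return table[*(vm)->ip](vm)

static void op_push1(VM *vm) { vm->stack[vm->sp++] = 1; vm->ip++; DISPATCH(vm); }
static void op_add(VM *vm) {
    vm->sp--;
    vm->stack[vm->sp - 1] += vm->stack[vm->sp];
    vm->ip++;
    DISPATCH(vm);
}
static void op_print(VM *vm) { printf("%ld\n", vm->stack[vm->sp - 1]); vm->ip++; DISPATCH(vm); }
static void op_halt(VM *vm) { (void)vm; }  /* returning unwinds the whole chain */

int main(void) {
    static const unsigned char code[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    VM vm = { .ip = code };
    table[*vm.ip](&vm);  /* kick off dispatch; prints 2 */
    return 0;
}
```

Since every handler ends in a guaranteed tail call, the chain runs in constant stack space just like the goto-based loops, but with per-opcode functions that inlining heuristics actually handle well.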
if the author of this blog reads this: can we get an RSS feed, please?
Got it. I'll try to set one up this weekend.
The Python interpreter core loop sounds like the perfect problem for AlphaEvolve. Or its open-source equivalent, OpenEvolve, if DeepMind doesn't want to speed up Python for the competition.