PhilipRoman 16 hours ago

IME CPU caches can be observed in almost all languages, regardless of how detached they are from the machine. What I'm really wondering is whether branch prediction can be observed from interpreted code.

  • exyi 15 hours ago

    It is! Although my test case is probably an unrealistically bad scenario:

    It's the classic question: why is processing a sorted array faster than processing an unsorted one?

        import random

        def f(arr):
            # flips r on a data-dependent condition; the work is trivial,
            # so the branch dominates the loop body
            r = True
            for x in arr:
                if x > 128: r = not r
            return r

        arr = [random.randint(0, 255) for _ in range(1_000_000)]
        arr_sorted = sorted(arr)

        %timeit f(arr)
        %timeit f(arr_sorted)
    
    
    Results (on my machine): 17.5 ms for unsorted, 13.5 ms for sorted. For comparison, in a compiled language the difference is 4x.
    • ammar2 14 hours ago

      Edit: Analyzed the wrong thing earlier.

      This depends on the Python version, but if it has the specializing interpreter changes, the `COMPARE_OP` comparing the integers there is probably hitting a specialized `_COMPARE_OP_INT` [1].

      This specialization has a ternary that does `res = (sign_ish & oparg) ? PyStackRef_True : PyStackRef_False;`. This might be the branch that ends up getting predicted correctly?

      Older versions of Python go through a bunch of dynamic dispatch first and then end up with a similar sort of int comparison in `long_richcompare`. [2]
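
      You can watch this from Python itself (a sketch; `dis`'s `adaptive` flag needs Python 3.11+, and the exact specialized name shown varies by version):

          import dis
          import random

          def f(arr):
              r = True
              for x in arr:
                  if x > 128: r = not r
              return r

          # warm the function up so the adaptive interpreter specializes it
          f([random.randint(0, 255) for _ in range(10_000)])

          # COMPARE_OP should now show up in its specialized int-comparison form
          dis.dis(f, adaptive=True)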

      [1] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...

      [2] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...

      • dzaima 14 hours ago

        This isn't actually timing the sorting, but just the (dumb) function f.

        • ammar2 14 hours ago

          Oh whoops, that's right. I totally missed that.

    • LPisGood 12 hours ago

      This is a really good example. It is more like branch prediction than standard data/instruction caching.

      I wonder if you could do Spectre-type vulnerabilities in Python. You would need some way to leak micro-architectural state, so without being particularly clever, Python code could maybe only be used as a gadget or something.

    • cma 14 hours ago

      The Python speed-up is probably from small-integer caching: a sorted array will have runs of pointers to the same integer objects adjacent to one another. The compiled-language one is probably branch prediction, right?
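
      A quick way to see the small-int cache (a sketch, assuming CPython, which preallocates ints in the range -5..256):

          a = 256
          b = int("256")   # constructed at runtime, still the cached singleton
          print(a is b)    # True on CPython
          c = 257
          d = int("257")   # just past the cached range: a fresh heap object
          print(c is d)    # False on CPython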

      • exyi 12 hours ago

        I intentionally stayed in the small integer range to avoid benchmarking the cache. 256 distinct values should fit into L1 just fine in both cases.

        I'm now thinking that the difference might be even larger if we instead avoid small integers and let the CPU get stuck chasing pointers. The idea is that a stalled memory access forces it to speculate much further ahead, which in turn makes it backtrack a longer path when a branch was mispredicted. I'm obviously no expert on this, so feel free to correct me.

        The results for a 1B range instead of 255 are 17.6 ms for unsorted / 68.2 ms for sorted! We are back to what the original article observed, and it's a far stronger effect than what branch prediction can produce. So don't sort your arrays, keep them in the order the boxed values were allocated ;)
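
        For reference, the variant looks like this (a sketch; same f and %timeit setup as above):

            arr = [random.randint(0, 1_000_000_000) for _ in range(1_000_000)]
            arr_sorted = sorted(arr)  # sorted by value == shuffled by address

            %timeit f(arr)         # ~17.6 ms: iteration order matches allocation order
            %timeit f(arr_sorted)  # ~68.2 ms: pointer chasing in shuffled order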

        • cma 12 hours ago

          How big is the pointed-to small integer object? With alignment etc., I'm seeing some sources say 256 of them would fill an 8KB L1. Plus other interpreter state might overfill it. Sorted, that would be less of an issue.
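
          (You can ask CPython directly; a sketch, and sizes vary by build:)

              import sys
              print(sys.getsizeof(1))        # 28 bytes on a typical 64-bit CPython
              print(256 * sys.getsizeof(1))  # ~7KB before allocator overhead/padding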

          The larger-range one being slower when sorted makes sense, because the allocation order no longer matches the iteration order.

          • exyi 11 hours ago

            I don't know how large those boxes are, but a normal CPU L1 data cache is 32 or 48KB, which should be plenty for this. The Python opcodes for this program are tiny, and the interpreter itself sits in the instruction L1 cache (which is another 32-48KB). I hope the sequential scan of the big array won't flush the L1 cache (it should be 12-way associative with LRU, so I don't see how it could).

            Anyway, there is no need for 256 distinct integers; just 2 is enough. When I try that, the results are similar: 17.5 ms (unsorted) / 12.5 ms (sorted).
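
            The 2-value version, for reference (a sketch, same setup as the original benchmark):

                arr = [random.choice((0, 255)) for _ in range(1_000_000)]
                arr_sorted = sorted(arr)

                %timeit f(arr)         # ~17.5 ms: the x > 128 branch is a coin flip
                %timeit f(arr_sorted)  # ~12.5 ms: the branch becomes predictable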

      • bgwalter 12 hours ago

        That seems very likely. The benchmark should probably use a range that is guaranteed to be outside the cached small-int range.

        • exyi 11 hours ago

          Then you are back to what the article discusses: each integer is in a separate box, those boxes are allocated in one order, and sorting the array by value shuffles it by address, which makes it much slower. I tested this as well; see the other comment.

  • vlovich123 15 hours ago

    Why would you think that branch prediction wouldn't be? Do you think the interpreter adds so much false sharing in the branch predictor that it's rendered worthless?

    • almostgotcaught 14 hours ago

      You do realize that the Python

        if x > 128:
            r = not r

      doesn't necessarily correspond to one branch at the ASM instruction level, right? In fact, a priori, it doesn't have to correspond to any branches at the ASM instruction level at all.
      • vlovich123 13 hours ago

        Yes, I do, and that's neither here nor there. There are almost certainly ASM-level branches implementing the conditional, and the branch predictor isn't tied to a single instruction: the rough high-level mental model is a cache mapping a few of the least significant bits of the branch instruction's address to a prediction, although in practice it's far more complicated. Since the predictor is right something like 80% of the time, even with a lot of false sharing between branches the CPU does a good job predicting. Its performance is primarily impacted when execution takes both sides of a branch closer to 50/50 than to 100/0.

        That’s why I asked about false sharing specifically as that’s the main way I can think of that Python code wouldn’t be able to observe the branch predictor performance - because the interpreter has so many branches internally that it dominates any branches that might be caused by the Python code itself.

      • exyi 12 hours ago

        It corresponds to way more than one branch at the instruction level. AFAIK branch prediction does not care what you are branching on; it just assumes branches will go in similar sequences as they did last time. If the Python 'if' is never taken, the instruction-level predictor will learn that after the comparison operation there is an 'if' operation and then another array-access operation. If the Python 'if' is unpredictable, the CPU's predictor can't be sure which opcode we are processing after the 'if', so there will be a penalty.
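
        A way to poke at this from Python (a sketch reusing the f above; the array names and values are mine): compare a never-taken branch, an always-taken branch, and an unpredictable one:

            import random

            never  = [0] * 1_000_000    # 'if' never taken: trivially predictable
            always = [255] * 1_000_000  # 'if' always taken: trivially predictable
            coin   = [random.choice((0, 255)) for _ in range(1_000_000)]  # 50/50

            %timeit f(never)
            %timeit f(always)
            %timeit f(coin)   # expect this one to be measurably slower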

        • mardifoufs 12 hours ago

          Is there any public documentation on modern branch prediction algorithms? I know branch prediction is very important to modern CPUs, so SOTA techniques are probably not public... But it's really amazing what it can do, especially considering the "limited" cache sizes that branch predictors have.

          • exyi 11 hours ago

            I guess it depends on how deep you want to go; I think the real predictors are based on publicly known algorithms such as TAGE. This seems to be a nice overview: https://comparch.net/2013/06/30/why-tage-is-the-best/ (it's from 2013, so definitely not SOTA, but advanced enough for my taste :))

  • saagarjha 8 hours ago

    Yep, this is why Spectre mitigations are applied to browsers.

  • stingraycharles 15 hours ago

    It should be, as most interpreted code runs through a native JIT compiler at this point, which implies the branch prediction applies to native code as well.

    I would be happy to stand corrected if someone knows more about this, though!

    • pansa2 15 hours ago

      > most interpreted code uses a JIT native compiler at this point

      I suspect most JavaScript code is JIT-compiled - but most Python, Ruby etc is interpreted (via bytecode).

    • tarruda 15 hours ago

      AFAIK CPython doesn't JIT compile.

      • pansa2 15 hours ago

        Recent versions of CPython do have an experimental JIT compiler. I'm not sure how widely-used it is, though.

        https://peps.python.org/pep-0744/

        • Qem 15 hours ago

          It's disabled by default and not production-ready. You need to compile CPython yourself if you want to try it.

  • eulgro 13 hours ago

    My guess would be that branch misprediction does have an impact on interpreted languages, but a much smaller one. If bytecode instructions take on average 20 CPU cycles to execute, and the branch misprediction penalty is 50 CPU cycles, the relative cost of a misprediction is much smaller than in compiled code.
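
    A back-of-envelope version of that argument (using the hypothetical numbers above):

        penalty = 50      # assumed cycles lost per misprediction
        native_op = 1     # assumed ~1 cycle per instruction in compiled code
        bytecode_op = 20  # assumed ~20 cycles per bytecode instruction

        print(penalty / native_op)    # 50.0x the surrounding work
        print(penalty / bytecode_op)  # 2.5x: a much smaller relative cost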

    • adgjlsfhk1 12 hours ago

      The counterpoint is that a lot of those ~20 cycles per bytecode instruction are themselves spent on branches.

Delk 7 hours ago

I really don't see why you wouldn't expect to find cache-sensitivity in Python, or in some other similarly high-level language.

Python sequences are defined as "support[ing] efficient element access using integer indices" [1]. Python lists are sequences and thus must support random access. In practice that means the implementation is a (dynamically resizing) array allocated contiguously. That means spatial locality is relevant.

If the list type were defined simply as an iterable collection, with no requirement of constant-time random access, the definition would be abstract enough that an implementation might end up being something else than a contiguous array. But if you define the type so that it supports constant-time random access [2], you pretty much end up with cache-friendly sequential access as well.

If you don't define the list type as supporting random access, you also sacrifice asymptotic algorithmic efficiency for lookups by index. Any language that cares about efficiency at all separates collections that support efficient random access from those that don't. (For example, Python has lists and dicts/sets for different access patterns. The Java standard library has separate types for contiguous arrays/lists, hashtable-backed collections and tree-backed collections because the three have different algorithmic efficiencies for different access patterns. In practice it leads to different properties in terms of cache-friendliness as well.)
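
A quick way to see the random-access contrast (a sketch; `collections.deque` is a sequence whose middle-element indexing is documented as O(n)):

    import timeit
    from collections import deque

    n = 1_000_000
    lst = list(range(n))
    dq = deque(lst)

    # list: contiguous array of pointers, constant-time indexing
    print(timeit.timeit(lambda: lst[n // 2], number=10_000))
    # deque: linked blocks; indexing near the middle is O(n)
    print(timeit.timeit(lambda: dq[n // 2], number=10_000))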

[1] https://docs.python.org/3/glossary.html#term-sequence

[2] As usual, the Python documentation is a bit vague on the details. It doesn't really say random access has to be constant-time, only that it has to be "efficient". So you might be able to have a non-constant time implementation such as an indexable skiplist while arguing that it's efficient, but you'd have to go out of your way to do that.

  • zahlman 7 hours ago

    >I really don't see why you wouldn't expect to find cache-sensitivity in Python, or in some other similarly high-level language... Python lists are sequences and thus must support random access. In practice that means the implementation is a (dynamically resizing) array allocated contiguously.

    The contents of the dynamically allocated array, in the C implementation, are pointers (PyObject*), since the lists are heterogeneous and must be able to store any Python object. That entails indirection, which defeats spatial locality. Even if you iterate over cached data, you have to jump around to retrieve the actual objects to do anything with them.
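
    You can glimpse that indirection from Python (a sketch; on CPython, id() happens to be the object's address):

        import random

        lst = [random.randint(0, 10**9) for _ in range(8)]
        print([id(x) for x in lst])          # typically reflects allocation order
        print([id(x) for x in sorted(lst)])  # shuffled: ordered by value, not address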

    • Delk 6 hours ago

      Sure. But that's one less level of indirection than if you also had to jump around to get the reference to the object in the first place.

PaulHoule 15 hours ago

"Mechanical sympathy" effects are so strong on modern CPUs you feel them even in languages like Python that are far from the metal.

SteelByte 10 hours ago

Great overview of CPU caching and its impact on Python performance. The benchmark really drives home the point. It would be interesting to see how other dynamic languages like JavaScript compare, and whether JIT compilers are able to optimize away some of these caching inefficiencies.

  • saagarjha 8 hours ago

    No, they can’t. The “caching inefficiencies” (?!) are part of the hardware and can’t be optimized away by a JIT. I think you have a very confused understanding of the post.

davvid 11 hours ago

The article didn't mention Python's atomic refcounts. They're very bad for cache usage, as they're constantly invalidating cache lines.

  • senderista 9 hours ago

    Non-atomic refcounts invalidate cache as well; they're just not as expensive.

  • quotemstr 10 hours ago

    The only version of Python that uses atomic reference counting is the very new free-threaded build.