
JIT: Restore vhaddps-based lowering for Vector256/512 floating-point Sum() #126255

Open
EgorBo wants to merge 2 commits into main from hadd-fix

Conversation

@EgorBo (Member) commented Mar 28, 2026

Fixes #126250

Summary

Restore vhaddps/vhaddpd-based lowering for Vector256<float>.Sum() and Vector256<double>.Sum() (and transitively Vector512 floating-point Sum), replacing the more expensive vpermilps + vaddps decomposition that was introduced in #95568 and later expanded in e012fd4.

Root Cause

Two changes caused the regression:

  1. PR #95568, "Updating Sum() implementation for Vector128 and Vector256 + adding lowering for Vector512" (Dec 2023), replaced vhaddps with a vpermilps + vaddps (shuffle + vertical add) sequence, claiming hadd was "not the most efficient."
  2. Commit e012fd4 (Jun 2024) added lane-independent splitting for floating-point to "ensure deterministic results matching the software fallback." This doubled the instruction count for Vector256<float>.Sum() from 6 to 11 instructions.

The determinism concern was valid in principle — the software fallback processes each 128-bit lane independently. However, vhaddps already operates within 128-bit lanes (it is defined as two independent 128-bit horizontal adds), so it naturally preserves the same lane-by-lane addition order. The lane-splitting was unnecessary.

Fix

For 256-bit floating-point Sum when AVX is available, use NI_AVX_HorizontalAdd (vhaddps/vhaddpd) followed by a lane-combine step. This produces 4 instructions for float and 3 for double, vs the previous 11 and 5.

Vector512 floating-point Sum also benefits because it recursively calls the 256-bit Sum path.

Codegen — float Test(Vector256<float> vec) => Vector256.Sum(vec);

Before (vpermilps + vaddps, 62 bytes, 14 instructions)

       vmovups  ymm0, ymmword ptr [rcx]
       vmovaps  ymm1, ymm0
       vpermilps xmm2, xmm1, -79
       vaddps   xmm1, xmm2, xmm1
       vpermilps xmm2, xmm1, 78
       vaddps   xmm1, xmm2, xmm1
       vextractf128 xmm0, ymm0, 1
       vpermilps xmm2, xmm0, -79
       vaddps   xmm0, xmm2, xmm0
       vpermilps xmm2, xmm0, 78
       vaddps   xmm0, xmm2, xmm0
       vaddss   xmm0, xmm1, xmm0
       vzeroupper
       ret

After (vhaddps, 26 bytes, 7 instructions)

       vmovups  ymm0, ymmword ptr [rcx]
       vhaddps  ymm0, ymm0, ymm0
       vhaddps  ymm0, ymm0, ymm0
       vextractf128 xmm1, ymm0, 1
       vaddps   xmm0, xmm0, xmm1
       vzeroupper
       ret

58% less code, 50% fewer instructions. On AVX-512 hardware, the improvement is even larger because:

  • vhaddps has HW_Flag_NoEvexSemantics, constraining registers to ymm0–15 and avoiding 4-byte EVEX prefixes
  • The reduced loop body size avoids micro-op cache (DSB) pressure

Validation

  • Numerical output is identical before and after (deterministic FP addition order preserved)
  • SPMI replay clean: 50,228 ASP.NET contexts, zero failures

Note

This comment was generated by Copilot.

Benchmark results with PR #126255

Benchmarked using the reporter's code on two AVX-512 server CPUs (cloud, via EgorBot):

Intel Emerald Rapids (Xeon 8573C)

| Method | Toolchain | NumElevations | Mean | Ratio |
|---|---|---|---|---|
| Vector<float> portable SIMD | PR | 1024 | 907.4 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 1024 | 564.4 ns | 0.62 |
| Vector<float> portable SIMD | main | 1024 | 884.1 ns | 0.97 |
| Vector256<float> explicit SIMD | main | 1024 | 647.2 ns | 0.71 |
| Vector<float> portable SIMD | PR | 4096 | 3,549.4 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 4096 | 2,280.3 ns | 0.64 |
| Vector<float> portable SIMD | main | 4096 | 3,583.4 ns | 1.01 |
| Vector256<float> explicit SIMD | main | 4096 | 2,543.1 ns | 0.72 |

AMD Turin (EPYC 9V45, Zen 5)

| Method | Toolchain | NumElevations | Mean | Ratio |
|---|---|---|---|---|
| Vector<float> portable SIMD | PR | 1024 | 487.2 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 1024 | 338.0 ns | 0.69 |
| Vector<float> portable SIMD | main | 1024 | 486.5 ns | 1.00 |
| Vector256<float> explicit SIMD | main | 1024 | 380.8 ns | 0.78 |
| Vector<float> portable SIMD | PR | 4096 | 1,959.6 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 4096 | 1,365.9 ns | 0.70 |
| Vector<float> portable SIMD | main | 4096 | 1,959.1 ns | 1.00 |
| Vector256<float> explicit SIMD | main | 4096 | 1,541.5 ns | 0.79 |

Summary

| CPU | Size | Vector256 improvement |
|---|---|---|
| Emerald Rapids | 1024 | −12.8% |
| Emerald Rapids | 4096 | −10.3% |
| Turin (Zen 5) | 1024 | −11.3% |
| Turin (Zen 5) | 4096 | −11.4% |

Consistent ~11% improvement on both Intel and AMD AVX-512 server CPUs. Tiger Lake (the reporter's CPU) is not available in the cloud; the improvement there would likely be larger due to the smaller µop cache being more sensitive to the 28% loop body size reduction.

Copilot AI review requested due to automatic review settings March 28, 2026 13:55
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 28, 2026
@dotnet-policy-service (Contributor) commented:

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copilot AI (Contributor) left a comment

Pull request overview

Restores the AVX vhaddps / vhaddpd-based lowering for Vector256<float/double>.Sum() in the JIT to address a performance regression introduced by prior permute+add decomposition and lane-splitting, while maintaining deterministic lane-by-lane FP addition order.

Changes:

  • Add an AVX-only fast path for 256-bit floating-point Sum() using NI_AVX_HorizontalAdd and a final lane-combine step.
  • Keep the existing non-AVX floating-point path that sums 128-bit halves independently for determinism (software-fallback matching), but clarify the comment.


@EgorBo (Member, Author) commented Mar 28, 2026

@EgorBot -amd

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class Vector256RegressionBenchmark
{
    private const float NullHeight = -3.4E+38f;
    private const float FillTolerance = 0.01f;
    private const float NegCutTolerance = -0.01f;

    private float[] _baseElevations;
    private float[] _topElevations;

    [Params(1024, 4096)]
    public int NumElevations { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _baseElevations = new float[NumElevations];
        _topElevations = new float[NumElevations];
        for (var i = 0; i < NumElevations; i++)
        {
            _baseElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100);
            _topElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100 + 10);
        }
    }

    [Benchmark(Description = "Vector<float> portable SIMD", Baseline = true)]
    public unsafe (double cut, double fill) PortableVector()
    {
        double cutVol = 0, fillVol = 0;
        var nullVec = new Vector<float>(NullHeight);
        var fillTolVec = new Vector<float>(FillTolerance);
        var negCutTolVec = new Vector<float>(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector<float>*)bp;
            var tv = (Vector<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector.Equals(*bv, nullVec)
                           | Vector.Equals(*tv, nullVec));
                if (Vector.Sum(mask) == 0) continue;

                var delta = Vector.ConditionalSelect(mask, *tv, Vector<float>.Zero)
                          - Vector.ConditionalSelect(mask, *bv, Vector<float>.Zero);

                var fillMask = Vector.GreaterThan(delta, fillTolVec);
                var usedFill = -Vector.Sum(fillMask);
                if (usedFill > 0)
                    fillVol -= Vector.Dot(delta, Vector.ConvertToSingle(fillMask));

                if (usedFill < Vector<float>.Count)
                {
                    var cutMask = Vector.LessThan(delta, negCutTolVec);
                    var usedCut = -Vector.Sum(cutMask);
                    if (usedCut > 0)
                        cutVol -= Vector.Dot(delta, Vector.ConvertToSingle(cutMask));
                }
            }
        }
        return (cutVol, fillVol);
    }

    [Benchmark(Description = "Vector256<float> explicit SIMD")]
    public unsafe (double cut, double fill) ExplicitVector256()
    {
        if (!Vector256.IsHardwareAccelerated) return (0, 0);

        double cutVol = 0, fillVol = 0;
        var nullVec = Vector256.Create(NullHeight);
        var fillTolVec = Vector256.Create(FillTolerance);
        var negCutTolVec = Vector256.Create(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector256<float>*)bp;
            var tv = (Vector256<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector256<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector256.Equals(*bv, nullVec)
                           | Vector256.Equals(*tv, nullVec));
                if (Vector256.EqualsAll(mask, Vector256<float>.Zero)) continue;

                var delta = Vector256.ConditionalSelect(mask, *tv, Vector256<float>.Zero)
                          - Vector256.ConditionalSelect(mask, *bv, Vector256<float>.Zero);

                var fillMask = Vector256.GreaterThan(delta, fillTolVec);
                if (Vector256.ExtractMostSignificantBits(fillMask) != 0)
                    fillVol += Vector256.Sum(
                        Vector256.ConditionalSelect(fillMask, delta, Vector256<float>.Zero));

                var cutMask = Vector256.LessThan(delta, negCutTolVec);
                if (Vector256.ExtractMostSignificantBits(cutMask) != 0)
                    cutVol += Vector256.Sum(
                        Vector256.ConditionalSelect(cutMask, delta, Vector256<float>.Zero));
            }
        }
        return (cutVol, fillVol);
    }
}


if (simdSize == 32)
{
if (varTypeIsFloating(simdBaseType) && compOpportunisticallyDependsOn(InstructionSet_AVX))
@saucecontrol (Member) commented:
simdSize == 32 implies AVX support, so this check and the fallback code below can be removed.

Copilot AI review requested due to automatic review settings March 29, 2026 00:03
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines 27232 to 27236
if (varTypeIsFloating(simdBaseType))
{
// We need to ensure deterministic results which requires
// consistently adding values together. Since many operations
// end up operating on 128-bit lanes, we break sum the same way.
// Fallback for non-AVX: process lanes independently to ensure
// deterministic results matching the software fallback.

Copilot AI commented Mar 29, 2026

In the simdSize == 32 block, the if (varTypeIsFloating(simdBaseType)) at lines 27232–27240 is now unreachable because the floating-point case returns earlier (lines 27203–27225). This also makes the new comment about a "Fallback for non-AVX" misleading (simdSize==32 already implies AVX on xarch). Consider removing this dead branch or restructuring the logic as an else so the intent is clear.

@github-actions (Contributor) commented:

Note

This review was generated by Copilot.

🤖 Copilot Code Review — PR #126255

Holistic Assessment

Motivation: Justified. The linked issue #126250 documents a 71% regression for Vector256<float>.Sum() on .NET 10 vs .NET 8 on AVX-512 hardware, traced to the permute+add decomposition introduced in #95568 and expanded in e012fd4. The regression is well-characterized with BenchmarkDotNet data and disassembly comparisons.

Approach: Sound. The fix restores vhaddps/vhaddpd-based reduction for 256-bit floating-point Sum. This is the right approach because vhaddps/vhaddpd naturally operate within 128-bit lanes — the determinism concern that motivated the original lane-splitting was valid in principle, but hadd already satisfies that constraint by construction. The PR includes benchmark data on both Intel (Emerald Rapids) and AMD (Turin/Zen 5) showing a consistent ~11% improvement, and SPMI replay confirms zero regressions across 50K+ contexts.

Summary: ⚠️ Needs Human Review. The core codegen change is correct and well-motivated — the hadd lane semantics match the software fallback's addition order, and the benchmarks confirm a real improvement. However, the PR leaves dead code in the simdSize == 32 block (already flagged by @saucecontrol and the automated Copilot reviewer) that should be cleaned up before merge. A JIT maintainer should confirm the determinism analysis and sign off.


Detailed Findings

✅ Correctness — Floating-point addition order is preserved

Traced the addition order for both element types to verify determinism:

float (8 elements in 256-bit): getSIMDVectorLength(16, TYP_FLOAT) = 4, genLog2(4) = 2 → 2 hadd iterations.

  • After two vhaddps(x, x), low lane holds (a0+a1)+(a2+a3), high lane holds (a4+a5)+(a6+a7).
  • Extract upper+lower and add: ((a0+a1)+(a2+a3)) + ((a4+a5)+(a6+a7)).
  • This matches the old path which split to two 128-bit halves and summed each with permute+add, then added the scalars: same grouping, same order.

double (4 elements in 256-bit): getSIMDVectorLength(16, TYP_DOUBLE) = 2, genLog2(2) = 1 → 1 hadd iteration.

  • After vhaddpd(x, x), low lane holds d0+d1, high lane holds d2+d3.
  • Final add: (d0+d1) + (d2+d3) — matches old path.

Vector512 float/double recursively splits to two 256-bit halves and sums each via this path — the transitive improvement is correct.

✅ ISA guard — Assert is sufficient

The assert(compOpportunisticallyDependsOn(InstructionSet_AVX)) at line 27202 is correct: simdSize == 32 on x86/x64 always implies AVX. The new hadd path uses NI_AVX_HorizontalAdd which requires AVX — this is safe.

⚠️ Dead code — Floating-point branch at lines 27232–27240 is now unreachable

After the new early return at line 27224, the if (varTypeIsFloating(simdBaseType)) block at lines 27232–27240 is dead code within the simdSize == 32 block. The flow is:

  1. Line 27203: if (varTypeIsFloating(simdBaseType)) → returns at line 27224
  2. Lines 27227–27230: extract lower/upper (only reached for integer types)
  3. Line 27232: if (varTypeIsFloating(simdBaseType)) → always false, dead code

The updated comment "Fallback for non-AVX" is also misleading since simdSize == 32 always implies AVX (as the assert at 27202 confirms — there is no non-AVX path for 256-bit vectors on xarch).

Recommendation: Remove lines 27232–27240 entirely. This simplifies the control flow and eliminates confusion. Both @saucecontrol and the automated Copilot review already flagged this.

✅ Performance — Well-evidenced improvement

The PR includes:

  • Codegen comparison: 7 instructions / 26 bytes (after) vs 14 instructions / 62 bytes (before) for Vector256<float>.Sum()
  • BenchmarkDotNet results on two architectures (Intel Emerald Rapids, AMD Turin Zen 5) showing consistent ~11% wall-clock improvement
  • SPMI replay across 50,228 ASP.NET contexts with zero diffs
  • The HW_Flag_NoEvexSemantics flag on NI_AVX_HorizontalAdd (verified in hwintrinsiclistxarch.h) correctly constrains to VEX encoding on AVX-512 hardware, avoiding 4-byte EVEX prefixes

💡 Naming — vectorLength could be clearer

At line 27210, getSIMDVectorLength(16, simdBaseType) computes elements per 128-bit lane, not the full 256-bit vector length. The variable name vectorLength is slightly misleading. A name like laneElementCount or elementsPerLane would make the intent clearer, though this is minor and the correctness is not affected.

Generated by Code Review for issue #126255


Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI


Development

Successfully merging this pull request may close these issues.

Vector256 explicit intrinsics 71% slower on .NET 10 vs .NET 8 on AVX-512 hardware

3 participants