
JIT: Restore vhaddps-based lowering for Vector256/512 floating-point Sum() #126255

Open
EgorBo wants to merge 2 commits into main from hadd-fix

Conversation

@EgorBo (Member) commented Mar 28, 2026

Fixes #126250

Summary

Restore vhaddps/vhaddpd-based lowering for Vector256<float>.Sum() and Vector256<double>.Sum() (and transitively Vector512 floating-point Sum), replacing the more expensive vpermilps + vaddps decomposition that was introduced in #95568 and later expanded in e012fd4.

Root Cause

Two changes caused the regression:

  1. PR #95568, "Updating Sum() implementation for Vector128 and Vector256 + adding lowering for Vector512" (Dec 2023), replaced vhaddps with a vpermilps + vaddps (shuffle + vertical add) sequence, claiming hadd was "not the most efficient."
  2. Commit e012fd4 (Jun 2024) added lane-independent splitting for floating-point to "ensure deterministic results matching the software fallback." This doubled the instruction count for Vector256<float>.Sum() from 6 to 11 instructions.

The determinism concern was valid in principle — the software fallback processes each 128-bit lane independently. However, vhaddps already operates within 128-bit lanes (it is defined as two independent 128-bit horizontal adds), so it naturally preserves the same lane-by-lane addition order. The lane-splitting was unnecessary.

Fix

For 256-bit floating-point Sum when AVX is available, use NI_AVX_HorizontalAdd (vhaddps/vhaddpd) followed by a lane-combine step. This produces 4 instructions for float and 3 for double, vs the previous 11 and 5.

Vector512 floating-point Sum also benefits because it recursively calls the 256-bit Sum path.

Codegen — float Test(Vector256<float> vec) => Vector256.Sum(vec);

Before (vpermilps + vaddps, 62 bytes, 14 instructions)

       vmovups  ymm0, ymmword ptr [rcx]
       vmovaps  ymm1, ymm0
       vpermilps xmm2, xmm1, -79
       vaddps   xmm1, xmm2, xmm1
       vpermilps xmm2, xmm1, 78
       vaddps   xmm1, xmm2, xmm1
       vextractf128 xmm0, ymm0, 1
       vpermilps xmm2, xmm0, -79
       vaddps   xmm0, xmm2, xmm0
       vpermilps xmm2, xmm0, 78
       vaddps   xmm0, xmm2, xmm0
       vaddss   xmm0, xmm1, xmm0
       vzeroupper
       ret

After (vhaddps, 26 bytes, 7 instructions)

       vmovups  ymm0, ymmword ptr [rcx]
       vhaddps  ymm0, ymm0, ymm0
       vhaddps  ymm0, ymm0, ymm0
       vextractf128 xmm1, ymm0, 1
       vaddps   xmm0, xmm0, xmm1
       vzeroupper
       ret

58% less code, 50% fewer instructions. On AVX-512 hardware, the improvement is even larger because:

  • vhaddps has HW_Flag_NoEvexSemantics, constraining registers to ymm0–15 and avoiding 4-byte EVEX prefixes
  • The reduced loop body size avoids micro-op cache (DSB) pressure

Validation

  • Numerical output is identical before and after (deterministic FP addition order preserved)
  • SPMI replay clean: 50,228 ASP.NET contexts, zero failures

Note

This comment was generated by Copilot.

Benchmark results with PR #126255

Benchmarked using the reporter's code on two AVX-512 server CPUs (cloud, via EgorBot):

Intel Emerald Rapids (Xeon 8573C)

| Method | Toolchain | NumElevations | Mean | Ratio |
|---|---|---|---|---|
| Vector<float> portable SIMD | PR | 1024 | 907.4 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 1024 | 564.4 ns | 0.62 |
| Vector<float> portable SIMD | main | 1024 | 884.1 ns | 0.97 |
| Vector256<float> explicit SIMD | main | 1024 | 647.2 ns | 0.71 |
| Vector<float> portable SIMD | PR | 4096 | 3,549.4 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 4096 | 2,280.3 ns | 0.64 |
| Vector<float> portable SIMD | main | 4096 | 3,583.4 ns | 1.01 |
| Vector256<float> explicit SIMD | main | 4096 | 2,543.1 ns | 0.72 |

AMD Turin (EPYC 9V45, Zen 5)

| Method | Toolchain | NumElevations | Mean | Ratio |
|---|---|---|---|---|
| Vector<float> portable SIMD | PR | 1024 | 487.2 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 1024 | 338.0 ns | 0.69 |
| Vector<float> portable SIMD | main | 1024 | 486.5 ns | 1.00 |
| Vector256<float> explicit SIMD | main | 1024 | 380.8 ns | 0.78 |
| Vector<float> portable SIMD | PR | 4096 | 1,959.6 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 4096 | 1,365.9 ns | 0.70 |
| Vector<float> portable SIMD | main | 4096 | 1,959.1 ns | 1.00 |
| Vector256<float> explicit SIMD | main | 4096 | 1,541.5 ns | 0.79 |

Summary

| CPU | Size | Vector256 improvement |
|---|---|---|
| Emerald Rapids | 1024 | −12.8% |
| Emerald Rapids | 4096 | −10.3% |
| Turin (Zen 5) | 1024 | −11.3% |
| Turin (Zen 5) | 4096 | −11.4% |

Consistent ~11% improvement on both Intel and AMD AVX-512 server CPUs. Tiger Lake (the reporter's CPU) is not available in the cloud; the improvement there would likely be larger due to the smaller µop cache being more sensitive to the 28% loop body size reduction.

Copilot AI review requested due to automatic review settings March 28, 2026 13:55
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 28, 2026
@dotnet-policy-service (Contributor) commented:

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copilot AI (Contributor) left a comment

Pull request overview

Restores the AVX vhaddps / vhaddpd-based lowering for Vector256<float/double>.Sum() in the JIT to address a performance regression introduced by prior permute+add decomposition and lane-splitting, while maintaining deterministic lane-by-lane FP addition order.

Changes:

  • Add an AVX-only fast path for 256-bit floating-point Sum() using NI_AVX_HorizontalAdd and a final lane-combine step.
  • Keep the existing non-AVX floating-point path that sums 128-bit halves independently for determinism (software-fallback matching), but clarify the comment.


@EgorBo (Member, Author) commented Mar 28, 2026

@EgorBot -amd

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class Vector256RegressionBenchmark
{
    private const float NullHeight = -3.4E+38f;
    private const float FillTolerance = 0.01f;
    private const float NegCutTolerance = -0.01f;

    private float[] _baseElevations;
    private float[] _topElevations;

    [Params(1024, 4096)]
    public int NumElevations { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _baseElevations = new float[NumElevations];
        _topElevations = new float[NumElevations];
        for (var i = 0; i < NumElevations; i++)
        {
            _baseElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100);
            _topElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100 + 10);
        }
    }

    [Benchmark(Description = "Vector<float> portable SIMD", Baseline = true)]
    public unsafe (double cut, double fill) PortableVector()
    {
        double cutVol = 0, fillVol = 0;
        var nullVec = new Vector<float>(NullHeight);
        var fillTolVec = new Vector<float>(FillTolerance);
        var negCutTolVec = new Vector<float>(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector<float>*)bp;
            var tv = (Vector<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector.Equals(*bv, nullVec)
                           | Vector.Equals(*tv, nullVec));
                if (Vector.Sum(mask) == 0) continue;

                var delta = Vector.ConditionalSelect(mask, *tv, Vector<float>.Zero)
                          - Vector.ConditionalSelect(mask, *bv, Vector<float>.Zero);

                var fillMask = Vector.GreaterThan(delta, fillTolVec);
                var usedFill = -Vector.Sum(fillMask);
                if (usedFill > 0)
                    fillVol -= Vector.Dot(delta, Vector.ConvertToSingle(fillMask));

                if (usedFill < Vector<float>.Count)
                {
                    var cutMask = Vector.LessThan(delta, negCutTolVec);
                    var usedCut = -Vector.Sum(cutMask);
                    if (usedCut > 0)
                        cutVol -= Vector.Dot(delta, Vector.ConvertToSingle(cutMask));
                }
            }
        }
        return (cutVol, fillVol);
    }

    [Benchmark(Description = "Vector256<float> explicit SIMD")]
    public unsafe (double cut, double fill) ExplicitVector256()
    {
        if (!Vector256.IsHardwareAccelerated) return (0, 0);

        double cutVol = 0, fillVol = 0;
        var nullVec = Vector256.Create(NullHeight);
        var fillTolVec = Vector256.Create(FillTolerance);
        var negCutTolVec = Vector256.Create(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector256<float>*)bp;
            var tv = (Vector256<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector256<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector256.Equals(*bv, nullVec)
                           | Vector256.Equals(*tv, nullVec));
                if (Vector256.EqualsAll(mask, Vector256<float>.Zero)) continue;

                var delta = Vector256.ConditionalSelect(mask, *tv, Vector256<float>.Zero)
                          - Vector256.ConditionalSelect(mask, *bv, Vector256<float>.Zero);

                var fillMask = Vector256.GreaterThan(delta, fillTolVec);
                if (Vector256.ExtractMostSignificantBits(fillMask) != 0)
                    fillVol += Vector256.Sum(
                        Vector256.ConditionalSelect(fillMask, delta, Vector256<float>.Zero));

                var cutMask = Vector256.LessThan(delta, negCutTolVec);
                if (Vector256.ExtractMostSignificantBits(cutMask) != 0)
                    cutVol += Vector256.Sum(
                        Vector256.ConditionalSelect(cutMask, delta, Vector256<float>.Zero));
            }
        }
        return (cutVol, fillVol);
    }
}


if (simdSize == 32)
{
if (varTypeIsFloating(simdBaseType) && compOpportunisticallyDependsOn(InstructionSet_AVX))
@saucecontrol (Member) commented:
simdSize == 32 implies AVX support, so this check and the fallback code below can be removed.

Copilot AI review requested due to automatic review settings March 29, 2026 00:03
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines 27232 to 27236
if (varTypeIsFloating(simdBaseType))
{
// We need to ensure deterministic results which requires
// consistently adding values together. Since many operations
// end up operating on 128-bit lanes, we break sum the same way.
// Fallback for non-AVX: process lanes independently to ensure
// deterministic results matching the software fallback.

Copilot AI commented Mar 29, 2026

In the simdSize == 32 block, the if (varTypeIsFloating(simdBaseType)) at lines 27232–27240 is now unreachable because the floating-point case returns earlier (lines 27203–27225). This also makes the new comment about a "Fallback for non-AVX" misleading (simdSize==32 already implies AVX on xarch). Consider removing this dead branch or restructuring the logic as an else so the intent is clear.

@github-actions (Contributor) commented:

Note

This review was generated by Copilot.

🤖 Copilot Code Review — PR #126255

Holistic Assessment

Motivation: Justified. The linked issue #126250 documents a 71% regression for Vector256<float>.Sum() on .NET 10 vs .NET 8 on AVX-512 hardware, traced to the permute+add decomposition introduced in #95568 and expanded in e012fd4. The regression is well-characterized with BenchmarkDotNet data and disassembly comparisons.

Approach: Sound. The fix restores vhaddps/vhaddpd-based reduction for 256-bit floating-point Sum. This is the right approach because vhaddps/vhaddpd naturally operate within 128-bit lanes — the determinism concern that motivated the original lane-splitting was valid in principle, but hadd already satisfies that constraint by construction. The PR includes benchmark data on both Intel (Emerald Rapids) and AMD (Turin/Zen 5) showing a consistent ~11% improvement, and SPMI replay confirms zero regressions across 50K+ contexts.

Summary: ⚠️ Needs Human Review. The core codegen change is correct and well-motivated — the hadd lane semantics match the software fallback's addition order, and the benchmarks confirm a real improvement. However, the PR leaves dead code in the simdSize == 32 block (already flagged by @saucecontrol and the automated Copilot reviewer) that should be cleaned up before merge. A JIT maintainer should confirm the determinism analysis and sign off.


Detailed Findings

✅ Correctness — Floating-point addition order is preserved

Traced the addition order for both element types to verify determinism:

float (8 elements in 256-bit): getSIMDVectorLength(16, TYP_FLOAT) = 4, genLog2(4) = 2 → 2 hadd iterations.

  • After two vhaddps(x, x), low lane holds (a0+a1)+(a2+a3), high lane holds (a4+a5)+(a6+a7).
  • Extract upper+lower and add: ((a0+a1)+(a2+a3)) + ((a4+a5)+(a6+a7)).
  • This matches the old path which split to two 128-bit halves and summed each with permute+add, then added the scalars: same grouping, same order.

double (4 elements in 256-bit): getSIMDVectorLength(16, TYP_DOUBLE) = 2, genLog2(2) = 1 → 1 hadd iteration.

  • After vhaddpd(x, x), low lane holds d0+d1, high lane holds d2+d3.
  • Final add: (d0+d1) + (d2+d3) — matches old path.

Vector512 float/double recursively splits to two 256-bit halves and sums each via this path — the transitive improvement is correct.

✅ ISA guard — Assert is sufficient

The assert(compOpportunisticallyDependsOn(InstructionSet_AVX)) at line 27202 is correct: simdSize == 32 on x86/x64 always implies AVX. The new hadd path uses NI_AVX_HorizontalAdd which requires AVX — this is safe.

⚠️ Dead code — Floating-point branch at lines 27232–27240 is now unreachable

After the new early return at line 27224, the if (varTypeIsFloating(simdBaseType)) block at lines 27232–27240 is dead code within the simdSize == 32 block. The flow is:

  1. Line 27203: if (varTypeIsFloating(simdBaseType)) → returns at line 27224
  2. Lines 27227–27230: extract lower/upper (only reached for integer types)
  3. Line 27232: if (varTypeIsFloating(simdBaseType)) → always false, dead code

The updated comment "Fallback for non-AVX" is also misleading since simdSize == 32 always implies AVX (as the assert at 27202 confirms — there is no non-AVX path for 256-bit vectors on xarch).

Recommendation: Remove lines 27232–27240 entirely. This simplifies the control flow and eliminates confusion. Both @saucecontrol and the automated Copilot review already flagged this.

✅ Performance — Well-evidenced improvement

The PR includes:

  • Codegen comparison: 7 instructions / 26 bytes (after) vs 14 instructions / 62 bytes (before) for Vector256<float>.Sum()
  • BenchmarkDotNet results on two architectures (Intel Emerald Rapids, AMD Turin Zen 5) showing consistent ~11% wall-clock improvement
  • SPMI replay across 50,228 ASP.NET contexts with zero diffs
  • The HW_Flag_NoEvexSemantics flag on NI_AVX_HorizontalAdd (verified in hwintrinsiclistxarch.h) correctly constrains to VEX encoding on AVX-512 hardware, avoiding 4-byte EVEX prefixes

💡 Naming — vectorLength could be clearer

At line 27210, getSIMDVectorLength(16, simdBaseType) computes elements per 128-bit lane, not the full 256-bit vector length. The variable name vectorLength is slightly misleading. A name like laneElementCount or elementsPerLane would make the intent clearer, though this is minor and the correctness is not affected.

Generated by Code Review for issue #126255


Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI


Development

Successfully merging this pull request may close these issues.

Vector256 explicit intrinsics 71% slower on .NET 10 vs .NET 8 on AVX-512 hardware

3 participants