[new model] Add Zyphra/ZAYA1-8B #45862

Open
JJJYmmm wants to merge 29 commits into huggingface:main from JJJYmmm:add_zaya1

Conversation

@JJJYmmm
Contributor

@JJJYmmm JJJYmmm commented May 9, 2026

Zyphra recently released ZAYA1-VL-8B, which has a small number of active parameters and looks like a nice fit here.

Since ZAYA1-VL depends on the text-only ZAYA1-8B backbone, which has not been merged yet, this PR adds support for the text-only ZAYA1 backbone first. I can follow up with the VL model in a separate PR if preferred. 😃

I also noticed that #42669 worked on a similar integration. This PR updates the implementation to better fit the current v5 codebase, including a cleaner CCA cache design and other code cleanups. I also checked the numerical outputs against Zyphra's implementation: https://github.com/Zyphra/transformers/tree/zaya1 cc @nanduruganesh

Tests:

RUN_SLOW=1 python -m pytest tests/models/zaya/test_modeling_zaya.py -q

Contributor

@vasqu vasqu left a comment

The model is a bit more complicated, so please let me know if anything is unclear; I tried to nudge towards a better structure. The main points are:

  1. The CCA module is way too complicated for what it essentially does; I tried to simplify it a bit.
  2. The split into separate layers is unnecessary; attention and MLP should form one decoder layer, which also fixes the residual paths.
  3. Modular can be used a lot more; the current code relies on a lot of v4-specific things / remote code, but we actually don't need most of those.
  4. The cache is somewhat natively integrated within the hybrid layer type.
  5. RoPE can have its own layer types (see dsv4); it seems to me that we actually don't use SWA at all, and it was only used as a workaround, which is bad.

It is super detailed this time, lmk if you would like a less detailed one next time. I'm usually inclined to go full force but some don't like that :D

Comment thread docs/source/en/model_doc/zaya.md
Comment thread docs/source/en/model_doc/zaya.md Outdated
Comment thread docs/source/en/model_doc/zaya.md Outdated
Comment thread src/transformers/models/zaya/__init__.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
if output_attentions:
    all_self_attns += (layer_outputs[1],)

hidden_states, residual = _apply_residual_scaling(hidden_states, residual, self.res_scale, self.final_norm)
Contributor

Seeing the order of those residuals, I feel like the order is just messed up because we split the layer types.

You want attn -> residual -> mlp -> residual, but because of the implementation you get first residual -> attn -> residual -> mlp -> residual. This could be fixed if we just fuse properly into the one decoder layer with 2 residuals each time.
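
For illustration, a minimal pre-norm sketch of such a fused decoder layer with the two residual connections in the intended order (module and argument names are illustrative, not the actual ZAYA code):

import torch.nn as nn

class FusedDecoderLayerSketch(nn.Module):
    # Hedged sketch only: attn -> residual -> mlp -> residual inside one layer.
    def __init__(self, attention: nn.Module, mlp: nn.Module, hidden_size: int):
        super().__init__()
        self.attention = attention
        self.mlp = mlp
        self.input_norm = nn.LayerNorm(hidden_size)
        self.post_attention_norm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states):
        residual = hidden_states
        hidden_states = self.attention(self.input_norm(hidden_states))
        hidden_states = residual + hidden_states  # first residual, right after attention

        residual = hidden_states
        hidden_states = self.mlp(self.post_attention_norm(hidden_states))
        hidden_states = residual + hidden_states  # second residual, right after the MLP
        return hidden_states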

Collaborator

still not resolved 😉

Contributor Author

Oops, I forgot about this. Now I shift the res_scale one layer earlier, so we can avoid the residual between layers! 🫡

@nanduruganesh

Thank you very much for the rebase and cleanups! About the interleaved sliding window / rope theta confusion: these configs are in place for the ZAYA1-74B-preview model, which also uses this branch but has a 4k SWA / 10k rope base every other layer. All of @vasqu's suggestions sound good to me, and another user has found additional fixes to support the GRPO trainer on this branch (PR). @JJJYmmm would you be able to integrate all the changes into your branch? Thanks again for this PR.
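
As a rough illustration of that layout, the "every other layer" pattern can be expressed as per-layer types; which parity carries the sliding window and the exact field names are assumptions here, not values read from the real 74B-preview config:

num_hidden_layers = 40
layer_types = [
    "sliding_attention" if layer_idx % 2 == 0 else "full_attention"
    for layer_idx in range(num_hidden_layers)
]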

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@JJJYmmm
Contributor Author

JJJYmmm commented May 12, 2026

@vasqu thank you for the detailed review! It was really helpful for me to catch up with the latest changes and learn the unified code style, so that's totally ok. Thanks a lot for your time 😃

In the latest code, I fixed most of the inheritance issues. Since the original checkpoint has some uncommon kwargs / weight layouts, which would require a lot of custom code, I wrote a conversion script and uploaded the converted 8B checkpoint here: https://huggingface.co/JJJYmmm/ZAYA1-8B-HF. I also tested it with a fake 74B checkpoint with SWA.

The conversion mainly does the following (a rough sketch of the layer-fusion remapping is below):

  1. Use more common names in the config, e.g. intermediate_size, and update some fields like rope_parameters.
  2. Remove nn.Sequential and use explicit module names.
  3. Combine the separate attention and mlp layers into a single ZayaDecoderLayer, with the corresponding config fixes, e.g. num_hidden_layers: 80 -> 40.
  4. Stack the per-expert weights into 3D expert tensors.

What do you think about this conversion? 🫡 @nanduruganesh
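
For readers less familiar with such conversions, here is a rough, hypothetical sketch of the layer-fusion part of the remapping (the key names are assumptions for illustration, not the ones used by the actual script):

import re

def remap_layer_indices(old_state_dict):
    # Hypothetical sketch: the original checkpoint alternates attention-only and
    # mlp-only layers, so old layers 2*i and 2*i + 1 both land in fused layer i
    # (matching the num_hidden_layers: 80 -> 40 change mentioned above).
    new_state_dict = {}
    for key, value in old_state_dict.items():
        match = re.match(r"model\.layers\.(\d+)\.(.+)", key)
        if match:
            old_idx, rest = int(match.group(1)), match.group(2)
            key = f"model.layers.{old_idx // 2}.{rest}"
        new_state_dict[key] = value
    return new_state_dict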

@JJJYmmm
Contributor Author

JJJYmmm commented May 12, 2026

another user has found additional fixes to support GRPO trainer on this branch (Zyphra#2)

@nanduruganesh I also checked this PR, and most of the fixes are already covered in the current branch. The only exception seems to be item 8, router_aux_loss_coef, but I think ZAYA does not use an auxiliary loss, right?

Besides the conversion mentioned above, I also noticed a small detail in the official code about the SWA mask calculation. In the official branch, the SWA mask is:

if window_size > 0:
    causal_mask = (
        torch.ones((seq_length, seq_length), dtype=torch.bool, device=query_states.device)
        .tril_(diagonal=0)
        .triu_(diagonal=-window_size)
    )
    attn_weights.masked_fill_(~causal_mask, -1e4)
elif attention_mask is not None:  # no matter the length, we just slice it
    causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
    attn_weights = attn_weights + causal_mask

This means that in the SWA branch, attention_mask is discarded. For now, I kept the same behavior to preserve numerical consistency with the original implementation. Is this expected?

EDIT: another reminder: in the conversion script, I increase the SWA window_size by one (4096 -> 4097).
This is because in the original logic, .tril_(diagonal=0).triu_(diagonal=-window_size) means the current query can attend to window_size keys before it plus itself, whereas in the current branch window_size means the total window size directly.
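
A quick check of the off-by-one (a standalone snippet, not taken from the PR):

import torch

seq_length, window_size = 8, 4
causal_mask = (
    torch.ones((seq_length, seq_length), dtype=torch.bool)
    .tril_(diagonal=0)
    .triu_(diagonal=-window_size)
)
# The last query can attend to window_size + 1 positions in total (itself plus the
# window_size keys before it), hence 4096 in the original config becomes 4097 here.
print(causal_mask[-1].sum().item())  # prints 5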

Collaborator

@ArthurZucker ArthurZucker left a comment

Super small review, exciting! 🔥

Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
if use_cache and (past_key_values is None or not _is_zaya_cache(past_key_values)):
    if past_key_values is not None and past_key_values.get_seq_length() > 0:
        raise ValueError("ZAYA requires a native hybrid cache created from `make_zaya_cache`.")
    past_key_values = make_zaya_cache(self.config)
Collaborator

Let's prevent having to add this and use a simple dynamic cache, registering the layer in:

LAYER_TYPE_CACHE_MAPPING.update(
    {
        "full_attention": DynamicLayer,
        # From a cache point of view, sliding and chunked are the same in how they should behave;
        # only the mask differs.
        "sliding_attention": DynamicSlidingWindowLayer,
        "chunked_attention": DynamicSlidingWindowLayer,
        # Linear-attention-shaped layers (mamba / conv / pure linear-attention / moe placeholders)
        # don't grow per-token KV; they're tracked just so position bookkeeping stays consistent.
        "mamba": LinearAttentionLayer,
        "conv": LinearAttentionLayer,
        "linear_attention": LinearAttentionLayer,
        "moe": LinearAttentionLayer,
        # Hybrid layers (e.g. zamba / zamba2) carry both a linear-attention state and a dynamic-attention state.
        "hybrid": LinearAttentionAndFullAttentionLayer,
    }
)

Contributor Author

Yes, I already reused the hybrid mapping. The current issue is this one: #45862 (comment)

To solve it in a simple way, I changed the layer_types logic from:

layer_types = getattr(decoder_config, "layer_types", None)

to:

getattr(decoder_config, "cache_layer_types", None) or getattr(decoder_config, "layer_types", None)

so models like zaya can keep layer_types for attention variants, while using cache_layer_types to describe the cache layout.
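
For illustration, a minimal sketch of that resolution order; cache_layer_types is the field proposed in this thread, not an existing transformers config attribute:

from types import SimpleNamespace

decoder_config = SimpleNamespace(
    layer_types=["sliding_attention", "full_attention"],  # drives the attention masks
    cache_layer_types=["hybrid_sliding", "hybrid"],       # drives the cache layout
)

layer_types = (
    getattr(decoder_config, "cache_layer_types", None)
    or getattr(decoder_config, "layer_types", None)
)
print(layer_types)  # ['hybrid_sliding', 'hybrid'] -- the cache layout wins when present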

Contributor

Let's use the hybrid and hybrid sliding for all in the end 🫡 see my earlier/first comments

Contributor Author

add a new mapping for zaya 🫡

"hybrid_sliding": LinearAttentionAndSlidingWindowAttentionLayer

Contributor

@vasqu vasqu left a comment

Heya, another round. The focus is on really avoiding passing too many args and letting these values live in the config.

One of my main ideas tbh is to introduce a sliding hybrid type that would be hybrid but with SWA --> then we can use the same layer types across cache and masks (instead of the current split).

Other than that, mostly details as we try to keep naming conventions the same where we can.

Comment thread src/transformers/models/zaya/__init__.py Outdated
Comment thread src/transformers/models/zaya/convert_zaya_weights_to_hf.py
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Contributor

@vasqu vasqu left a comment

Looking pretty good now, I think we are getting close to merge. Just a few more details here and there 🤗 thanks a lot for all the iterating

Comment thread src/transformers/models/auto/modeling_auto.py
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread src/transformers/models/zaya/modular_zaya.py Outdated
Comment thread tests/models/zaya/test_modeling_zaya.py Outdated
Comment thread tests/models/zaya/test_modeling_zaya.py Outdated
Comment thread tests/models/zaya/test_modeling_zaya.py Outdated
Comment thread tests/models/zaya/test_modeling_zaya.py Outdated
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, zaya

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45862&sha=d362c9
