
Conversation

EIFY commented Apr 17, 2025

Includes the necessary modules (LN, bias, scale, constant posemb) and a notebook showing it working on MNIST. Initial tuning shows that momentum with dualize is not quite as performant as Adam, but

  1. That was a 10-minute effort
  2. Not sure how seriously we should take MNIST performance after 1000 steps anyway

More of a demonstration, but it could be merged.

includes necessary modules (LN, bias, scale, constant posemb)
EIFY mentioned this pull request Apr 17, 2025
EIFY added 3 commits April 16, 2025 20:36
In most cases image data has a channel dimension (usually 3), so making
the input shape NHWC makes the ViT more applicable externally.
Unexpectedly, bias alone stabilizes the model, though it is not nearly as
performant.
Just like not caching RoPE, it's more in the spirit of JAX to defer
constant materialization to the forward pass; this also allows distributed
init without triggering a disallowed host-to-device transfer error.
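
As an illustration of that pattern (a minimal sketch, not the PR's code; the module and helper names are made up), the constant sincos posemb can be built with jnp inside __call__, so it only materializes inside the traced forward pass:

import jax.numpy as jnp
import flax.linen as nn

def sincos_posemb(seq_len, dim):
    # Fixed sin/cos positional embedding of shape [seq_len, dim], built on the fly.
    pos = jnp.arange(seq_len)[:, None]
    i = jnp.arange(dim // 2)[None, :]
    angles = pos / (10_000 ** (2 * i / dim))
    return jnp.concatenate([jnp.sin(angles), jnp.cos(angles)], axis=-1)

class AddConstantPosemb(nn.Module):
    @nn.compact
    def __call__(self, x):  # x: [batch, seq_len, dim]
        # Nothing is stored at init time; the constant is traced along with
        # the rest of the forward pass, so distributed init works without an
        # explicit host-to-device transfer.
        return x + sincos_posemb(x.shape[1], x.shape[2])[None]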

EIFY commented May 4, 2025

I have ported this "dualized ViT" to Big Vision and started training a dualized ViT-S/16 on ImageNet-1k.
TL;DR: So far it severely underperforms the baseline, but there are still a few knobs to turn.

Implementation: I made a custom branch of modula that dry-runs dualize recursively and caches the target_norm for each atom. I then export these, along with the class name of each atom module, and graft the behavior of dualize onto the Big Vision implementation using optax.partition from the optax nightly (see the Big Vision branch that does this); a rough sketch of the grafting is included below the plot. So far the training loss looks like this:

[Screenshot: training loss, dualized ViT-S/16 vs. baseline]

wandb report
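
For reference, here is a rough sketch of the grafting mechanism, assuming the modula dry run exports a pytree of atom class names matching the param tree; shown with optax.multi_transform, which I believe is the stable-API counterpart of the nightly's partition. The label names and the per-group chains below are illustrative placeholders, not the actual branch:

import optax

def dualized_optimizer(lr, atom_labels):
    # atom_labels: a pytree with the same structure as the params, whose
    # leaves are the atom class names exported from the modula dry run.
    transforms = {
        # Each group gets its own update rule (see the "Optimizer
        # differences" list below); plain heavy-ball momentum stands in here.
        "Linear": optax.chain(optax.trace(decay=0.95), optax.scale(-lr)),
        "Embed": optax.chain(optax.trace(decay=0.95), optax.scale(-lr)),
        "Bias": optax.chain(optax.trace(decay=0.95), optax.scale(-lr)),
        "Scale": optax.chain(optax.trace(decay=0.95), optax.scale(-lr)),
    }
    return optax.multi_transform(transforms, atom_labels)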

LR=0.05, WD=0.005 or 0.0001 (decoupled from LR), momentum w/ beta=0.95
No warm-up, cosine learning rate decay.
Notably, l2_params starts out not too different from the baseline but quickly grows to 50x of it. My knee-jerk reaction of raising WD from 0.0001 to 0.005, however, seems to make the model even worse:
[Screenshot: training loss after raising WD to 0.005]

I don't have great intuition here. Perhaps it still needs LR warm-up even though the model training is stable without it?
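
If I do add warm-up, it's just a schedule swap, e.g. (all numbers below are placeholders, not a tuned setting):

import optax

total_steps = 100_000   # placeholder for the actual number of training steps
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=0.05,      # current peak LR
    warmup_steps=10_000,  # placeholder warm-up length
    decay_steps=total_steps,
    end_value=0.0,
)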

Notable architecture (or rather, just scaling) differences from the baseline (see the sketch after this list):

  1. GELU is scaled by 1/1.1289 to keep the derivative <= 1.
  2. The dot product of the dot-product attention is scaled by 1/d instead of 1/sqrt(d), where d is the dimension of the attention head (same as μP). The final attention output is then scaled by 1/3.
  3. The residual branch is scaled by s = 1/(2 * depth) and the residual connection is normalized as x = (1 - s) * x + s * y. This is more aggressive than the 1/sqrt(depth) residual branch scaling for ViT here.
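
In code, these three scaling changes look roughly like this (illustrative shapes and names, not the exact Big Vision code):

import jax
import jax.numpy as jnp

def scaled_gelu(x):
    # max |gelu'(x)| ≈ 1.1289, so this keeps the derivative <= 1.
    return jax.nn.gelu(x) / 1.1289

def attention(q, k, v):
    # q, k, v: [batch, heads, seq, d]; 1/d scaling instead of 1/sqrt(d),
    # and the combined output is scaled by 1/3.
    d = q.shape[-1]
    logits = jnp.einsum("bhqd,bhkd->bhqk", q, k) / d
    out = jnp.einsum("bhqk,bhkd->bhqd", jax.nn.softmax(logits, axis=-1), v)
    return out / 3

def residual(x, block, depth):
    # Normalized residual connection with s = 1/(2 * depth).
    s = 1.0 / (2 * depth)
    return (1 - s) * x + s * block(x)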

Optimizer differences from the "conventional" Muon (see the sketch after this list):

  1. Due to the different target_norm values we effectively have a different LR for each layer.
  2. Instead of an AdamW fallback, we use momentum + an L2-normalized update for bias (same as Embed) and Lion-like momentum + a sign update for scale (following the derivation of Muon and treating the scale of LayerNorm as the linear operator diag(scale)).
  3. We don't clip the sqrt(fanout / fanin) factor from below as sqrt(max(1, fanout / fanin)). It is what it is.
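
To make these concrete, here is a plain-JAX sketch of the raw (pre-LR) per-group updates, assuming the matrix dualization is the usual Muon-style orthogonalization via Newton-Schulz iterations (the iteration count, coefficients, and exact placement of target_norm are illustrative, not the branch's exact values):

import jax.numpy as jnp

def orthogonalize(m, steps=5):
    # Cubic Newton-Schulz iteration approximating the orthogonal polar factor of m.
    x = m / (jnp.linalg.norm(m) + 1e-8)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def matrix_update(momentum, target_norm):
    fanout, fanin = momentum.shape
    # Note: sqrt(fanout / fanin) is NOT clipped from below (point 3 above).
    return target_norm * jnp.sqrt(fanout / fanin) * orthogonalize(momentum)

def bias_update(momentum, target_norm):   # same rule for Embed
    return target_norm * momentum / (jnp.linalg.norm(momentum) + 1e-8)

def scale_update(momentum, target_norm):  # Lion-like sign update
    return target_norm * jnp.sign(momentum)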

If necessary, it may be interesting to bisect these differences.

x1 = x[..., self.rope_dim:] # shape [batch, n_heads, seq_len, rope_dim]
x2 = x[..., :self.rope_dim] # shape [batch, n_heads, seq_len, rope_dim]

# Why is the order reversed!?

I noticed this too and corrected it in #13. I guess it doesn't really matter for performance?
