5 points | by ipnon 3 hours ago
1 comment
So if I get this right, all transformers until today have had the same residual design: one stream carrying information between layers. DeepSeek figured out how to widen it without training collapse. Wow, incredible work DeepSeek!
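For intuition, here's a minimal numpy sketch of the single residual stream the comment describes, plus a hypothetical multi-stream "widened" variant. The widened version is my own illustration of the general idea (parallel streams remixed between layers), not DeepSeek's actual published method:

```python
import numpy as np

def block(x, w):
    # toy sublayer: a nonlinearity standing in for attention/MLP
    return np.tanh(x @ w)

def standard_stack(x, weights):
    # classic design: ONE residual stream; every layer reads it and adds back
    for w in weights:
        x = x + block(x, w)
    return x

def widened_stack(x, weights, k=2):
    # hypothetical widened design (my guess at the concept, NOT DeepSeek's):
    # k parallel residual streams, remixed between layers
    streams = [x.copy() for _ in range(k)]
    mix = np.full((k, k), 1.0 / k)  # would be learned in a real model
    for w in weights:
        merged = sum(streams) / k          # combine streams for the sublayer
        update = block(merged, w)
        streams = [sum(mix[i][j] * streams[j] for j in range(k)) + update
                   for i in range(k)]      # remix streams, add the update
    return sum(streams) / k

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
print(standard_stack(x, weights).shape)  # (4, 8)
print(widened_stack(x, weights).shape)   # (4, 8)
```

With k=1 and an identity mix, the widened version reduces to the standard single stream, which is one way to see it as a strict generalization.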