5 points | by ipnon 3 hours ago
1 comment
So if I get this right, all transformers until today have had the same residual design: one stream carrying information between layers. DeepSeek figured out how to widen it without training collapse. Wow, incredible work DeepSeek!
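For intuition, here's a minimal numpy sketch of the single residual stream the comment describes, plus a hypothetical multi-stream "widened" variant. The widened version is my own illustration of the general idea (parallel streams remixed between layers), not DeepSeek's actual published method:

```python
import numpy as np

def block(x, w):
    # toy sublayer: a nonlinearity standing in for attention/MLP
    return np.tanh(x @ w)

def standard_stack(x, weights):
    # classic design: ONE residual stream; every layer reads it and adds back
    for w in weights:
        x = x + block(x, w)
    return x

def widened_stack(x, weights, k=2):
    # hypothetical widened design (my guess at the concept, NOT DeepSeek's):
    # k parallel residual streams, remixed between layers
    streams = [x.copy() for _ in range(k)]
    mix = np.full((k, k), 1.0 / k)  # would be learned in a real model
    for w in weights:
        merged = sum(streams) / k          # combine streams for the sublayer
        update = block(merged, w)
        streams = [sum(mix[i][j] * streams[j] for j in range(k)) + update
                   for i in range(k)]      # remix streams, add the update
    return sum(streams) / k

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
print(standard_stack(x, weights).shape)  # (4, 8)
print(widened_stack(x, weights).shape)   # (4, 8)
```

With k=1 and an identity mix, the widened version reduces to the standard single stream, which is one way to see it as a strict generalization.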