8 comments

  • ashirviskas 28 minutes ago

    I wonder what would happen if we just crammed more into the "tokens"? I'm running an experiment replacing discrete tokens with embeddings plus a small byte encoder/decoder. That way you can use the embedding space much more efficiently and have it carry much more nuance.

    Experiments I want to build on top of it:

    1. Adding LSP context to the embeddings - that way the model could _see_ the syntax better, closer to how we use IDEs, and would not need to read/grep 25k lines just to find where something is used.

    2. Experiments with different "compression" ratios. Each embedding could encode a different number of bytes, so we would not rely on a huge static token dictionary.

    I'm aware that papers exist that explore these ideas, but so far no popular/good open source models employ this. Unless someone can prove me wrong.
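
    Roughly what I mean, as a minimal sketch (all names and sizes here are made up, not from any existing codebase): chunk the byte stream, encode each chunk into one "soft token" embedding, and decode embeddings back into per-byte logits.

        import torch
        import torch.nn as nn

        class ByteChunkEncoder(nn.Module):
            # Groups every `chunk_bytes` raw bytes into a single embedding.
            def __init__(self, chunk_bytes=8, d_model=512):
                super().__init__()
                self.chunk_bytes = chunk_bytes
                self.byte_emb = nn.Embedding(256, 64)              # one vector per byte value
                self.proj = nn.Linear(chunk_bytes * 64, d_model)   # chunk -> one "soft token"

            def forward(self, byte_ids):                           # (batch, n_bytes)
                b, n = byte_ids.shape
                n = n - n % self.chunk_bytes                       # drop the ragged tail for simplicity
                x = self.byte_emb(byte_ids[:, :n])                 # (b, n, 64)
                x = x.reshape(b, n // self.chunk_bytes, -1)
                return self.proj(x)                                # (b, n_chunks, d_model)

        class ByteChunkDecoder(nn.Module):
            # Maps each embedding back to logits over its chunk's bytes.
            def __init__(self, chunk_bytes=8, d_model=512):
                super().__init__()
                self.chunk_bytes = chunk_bytes
                self.out = nn.Linear(d_model, chunk_bytes * 256)

            def forward(self, h):                                  # (b, n_chunks, d_model)
                b, n, _ = h.shape
                return self.out(h).reshape(b, n * self.chunk_bytes, 256)

        enc, dec = ByteChunkEncoder(), ByteChunkDecoder()
        byte_ids = torch.randint(0, 256, (1, 64))
        embeddings = enc(byte_ids)     # 64 bytes -> 8 embeddings, i.e. 8x "compression"
        byte_logits = dec(embeddings)  # train with cross-entropy against the original bytes

    Chunk size is the knob for experiment 2 - make it variable and each embedding carries a different number of bytes.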

  • lostmsu an hour ago

    Comparison with a vanilla transformer of the same size/FLOPs budget?

      Lerc an hour ago

      I'm not sure if that is the right calculation.

      Provided the FLOPs are not prohibitive, output quality per model byte might be better. In general, people run the largest model they can.

      I certainly think trading speed for quality at the same size is worth looking at, especially if it uses methods that can benefit from broader efforts to improve speed in general.

      That said, the performance difference at 30M parameters may not be representative of the difference at 30B.

      There are probably a lot of really good ideas out there waiting for someone to drop a few million in training to reveal how good they are at large scale.

        lostmsu an hour ago

        So no comparison?

  • keyle an hour ago

    Does this make any sense to anyone?

      kannanvijayan an hour ago

      I think this is an attempt to enrich the locality model in transformers.

      One of the weird things you do in transformers is add a position vector which captures the distance between the token being attended to and some other token.

      This is obviously not powerful enough to express non-linear relationships - like graph relationships.

      This person seems to be experimenting with pre-processing the input token set to linearly reorder it by some heuristic that might map more closely to the actual underlying relationships between tokens.
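
      A toy version of that reading (my own guess at the mechanism, not taken from the repo): score each token with some structural heuristic, sort the sequence by that score before running plain attention, then undo the permutation.

          import numpy as np

          def reorder_by_heuristic(token_embs, scores):
              # token_embs: (seq, d); scores: (seq,), e.g. depth in a syntax tree
              order = np.argsort(scores)
              return token_embs[order], order          # keep the permutation to invert later

          def vanilla_attention(x):
              s = x @ x.T / np.sqrt(x.shape[-1])
              w = np.exp(s - s.max(axis=-1, keepdims=True))
              w /= w.sum(axis=-1, keepdims=True)
              return w @ x

          x = np.random.randn(10, 16)                  # 10 tokens, 16-dim embeddings
          depth = np.random.randint(0, 4, size=10)     # stand-in for a structural signal
          x_sorted, perm = reorder_by_heuristic(x, depth)
          out = vanilla_attention(x_sorted)[np.argsort(perm)]   # back to original order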

      liteclient an hour ago

      it makes sense architecturally

      they replace dot-product attention with topology-based scalar distances derived from a Laplacian embedding - that effectively reduces attention scoring to a 1D energy comparison, which can save memory and compute

      that said, i’d treat the results with a grain of salt given there is no peer review, and benchmarks are only on a 30M-parameter model so far
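
      rough sketch of what i mean by the 1D energy comparison (my own reconstruction from the description, not the author's code): take the Fiedler vector of a token graph's Laplacian as a scalar coordinate per token, then score attention as negative distance between those scalars instead of a full dot product

          import numpy as np

          def fiedler_coords(adj):
              # adj: (n, n) symmetric adjacency over tokens (e.g. a syntax/co-occurrence graph)
              lap = np.diag(adj.sum(axis=1)) - adj     # graph Laplacian
              _, vecs = np.linalg.eigh(lap)
              return vecs[:, 1]                        # second-smallest eigenvector: one scalar per token

          def topo_attention(values, coords):
              # score_ij = -|coord_i - coord_j|: a scalar comparison, no QK^T over d dimensions
              scores = -np.abs(coords[:, None] - coords[None, :])
              w = np.exp(scores - scores.max(axis=-1, keepdims=True))
              w /= w.sum(axis=-1, keepdims=True)
              return w @ values

          n, d = 6, 8
          adj = np.triu((np.random.rand(n, n) > 0.5).astype(float), 1)
          adj = adj + adj.T                            # symmetric, zero diagonal
          values = np.random.randn(n, d)
          out = topo_attention(values, fiedler_coords(adj))

      the scalar scores skip the per-head QK projections and d-dimensional dot products, which is where the memory/compute saving would come from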

  • geoffbp 20 minutes ago

    I dug into this a bit (with AI ofc) and it spat this out. I found it an easy way to visualise and start to understand:

    > Standard AI models (like GPT-4) treat data using Global Geometry. They imagine every word as a point floating in a massive, flat, high-dimensional room. To see how two words relate, they draw a straight line between them.

    > Local Topology changes the "room" into a landscape (a manifold). Instead of a flat void, the data exists on a curved surface that has hills, valleys, and paths.