Better Models: Worse Tools

13 points | by leemoore an hour ago

5 comments

lukasco 5 minutes ago
It sounds like harnesses might have to start to have model by model system prompts, though retrying works, I guess. It reminds me of the ancient times when browsers all read HTML and CSS differently, and differently on different devices. In that sense, this is nothing new. I was going to say, at least we don't have different device types, but then, the model still has to output the right variant of `grep` as well.
mappu 13 minutes ago
In my harness i implemented apply_patch just taking unified diffs for patch -p1. I was shocked to see how bad models are at generating them. I started logging diff failures to analyse -
- All models are terrible at generating line numbers for a proper diff, give up on them
- Some models (Owl-alpha) must have been post-trained on Codex transcripts, because they occasionally push its V4A patch format into any diff tool available
- Codex puts a lot of info in its system prompt about the desired patch style, making larger hunks instead of granular ones, etc
ares623 a minute ago
Open source developer surprised and concerned by the trajectory their favorite proprietary software is taking.
dofm 16 minutes ago
As critical as I am about articles endlessly concerned with the weaknesses of closed-source cloud products, this one is pretty great, and not just because it concerns interactions with Pi, which looks to me like it's going to end up a sort of quasi-reference implementation of an open source harness, and because it has so much technical detail.
But:
"Now I’m somewhat worried about the track we’re on here. Alternative tool schemas might not just be unfamiliar. They might be implicitly punished by post-training that optimizes for one particular, forgiving tool ecology."
Only implicitly?
--
Many decades ago when I was working on research related to using MOOs as a learning environment, you would add "tool calls" into the stream of text that a MOO object might generate, so your rich client would e.g. show a picture, load a web page in a frame, move you on a map, trigger a change in an on-screen representation of an object.
Everyone who tried this ran into more or less the same problems that LLM clients do: any attempt to shoehorn control sequences into in-band content was riddled with security risks, objects accidentally triggering the wrong interface etc.; you could never truly communicate out-of-band.
The more I read about how agentic harnesses work, the less embarrassed I feel about the code twenty-something-year-old me wrote in a MOO client.
cyanydeez 13 minutes ago
building deterministic tools on non-determinism is hard enough; try adding another layer where your cloud provider decides to massage the context, realigns it's permitted output, arbitrarily downgrades context to cheaper models, or they hire an MBA who determines your plan value can be tied to a degraded model under a new shrinkfied.
It's amazing anyone watched the last 2 decades of tech's enshitification and wants to hook their wagon to this shitshow.