i was thinking about the same problem, of course. agent round-trips are too slow, even with context caching. suppose each agent turn spits out a multi-step automation to drive the bot, in whatever language - by the time it finishes executing, the world has moved: you ran into an obstacle, a biter ate your face, or you got run over by a train.
it seems like you'd need to design a non-LLM part of the bot that can respond to things in real time based on fixed principles, and then the LLM acts like the executive function that changes the state / goals / parameters of that twitchier part of the bot. it could be fun to figure out what the split is, how the two parts interact, and how to structure the interfaces between them.
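
a rough sketch of one way that split could look, just to make it concrete. everything here is made up for illustration: `Directive`, `Observation`, `Reflex`, the `flee_radius` parameter, and the action names are all hypothetical, not from any real bot framework. the idea is that the reflex layer runs every tick on plain rules, and the LLM only ever swaps out the directive it reads from.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Tile = Tuple[int, int]


@dataclass(frozen=True)
class Directive:
    """what the slow executive (the LLM) sets. all fields are hypothetical."""
    goal: Tile = (0, 0)     # tile the bot should head for
    flee_radius: int = 3    # how close a threat may get before we run


@dataclass(frozen=True)
class Observation:
    """what the reflex layer sees each tick."""
    pos: Tile
    threat: Optional[Tile] = None   # nearest threat, if any


class Reflex:
    """fast, non-LLM layer: called every tick, never waits on the executive."""

    def __init__(self) -> None:
        self.directive = Directive()

    def set_directive(self, directive: Directive) -> None:
        # the executive replaces the whole immutable directive at once,
        # so tick() never reads a half-updated set of goals/parameters
        self.directive = directive

    def tick(self, obs: Observation) -> str:
        d = self.directive
        # rule 1: survival beats everything else
        if obs.threat is not None:
            dist = abs(obs.pos[0] - obs.threat[0]) + abs(obs.pos[1] - obs.threat[1])
            if dist <= d.flee_radius:
                return "flee"
        # rule 2: otherwise pursue whatever goal the executive last set
        if obs.pos == d.goal:
            return "idle"
        return "move_toward_goal"
```

usage would be something like: the game loop calls `reflex.tick(obs)` every frame, while a separate slow loop asks the LLM for a new `Directive` whenever it has one and calls `reflex.set_directive(...)`. the interface between the two halves is then just the shape of `Directive`, which is probably where most of the interesting design work would be.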