> That seems like it's the secret sauce - the ginormous hint that got ignored by data/statistics-centric ML researchers. When you learn how to map token streams with significant internal structure, the function your neural net is being trained to approximate will inevitably come to implement at least some of the processing that generated your token streams.

I think this is too strong a claim. The LLM will simulate features similar to those of whatever generated the token streams, but I would argue this is not an implementation of those features. If you will, it is more like a ghost image or echo of those features. The mapping is very much not the territory.

