This thing is impressive, and I would never have thought of trying it.
But it doesn't strike me as incredible or impossible ... just hard.
Analogous problems crop up all the time in speech recognition and
optical character recognition.
In its simplest form, you can do analysis-by-synthesis. Treat it as
a hidden Markov model (HMM). You keep a running model of what you
/think/ is there. If you see a discrepancy, you make an appropriate
transition in the HMM, choosing the transition that best explains
what you're seeing.
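A minimal sketch of that "switch state when a discrepancy shows up" idea, greedy rather than optimal: keep the current hypothesis and transition only when another state, weighted by the transition probability, better explains the observation. The two "note" states and all the probabilities here are invented toy values, not anything from the actual problem.

```python
def track_state(obs, states, trans_p, emit_p, current):
    """Greedy online tracking: at each step, pick the state (possibly
    the current one) whose transition-weighted emission probability
    best explains the new observation."""
    path = []
    for o in obs:
        current = max(states,
                      key=lambda s: trans_p[current][s] * emit_p[s][o])
        path.append(current)
    return path

# hypothetical two-note model emitting coarse spectral labels
states = ["C", "G"]
trans_p = {"C": {"C": 0.7, "G": 0.3}, "G": {"C": 0.4, "G": 0.6}}
emit_p = {"C": {"low": 0.9, "high": 0.1}, "G": {"low": 0.2, "high": 0.8}}
print(track_state(["low", "low", "high"], states, trans_p, emit_p, "C"))
# -> ['C', 'C', 'G']
```

This is the cheap, decide-as-you-go version; it can get stuck on a locally plausible but globally wrong explanation, which is exactly what the Viterbi decoder mentioned below avoids.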
If you're modeling a single instrument, it's like doing OCR on
printed characters in a known font (which is much, much easier
than doing something like OCR on handwritten digits, where you
don't know anything about the "font"). You might be able to
get by with a fancy preprocessor followed by a Viterbi decoder.
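For concreteness, here is a textbook Viterbi decoder over the same kind of toy HMM: it finds the single most probable state sequence given the whole observation sequence, rather than committing greedily at each step. The states, observations, and probabilities are made-up illustrative values, and a real system would run this over preprocessor features, not string labels.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observation sequence,
    using log-probabilities to avoid underflow."""
    # V[t][s]: log-prob of the best path ending in state s at time t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
          for s in states}]
    back = [{}]  # back[t][s]: best predecessor of s at time t
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # backtrack from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# hypothetical two-note model emitting coarse spectral labels
states = ["C", "G"]
start_p = {"C": 0.6, "G": 0.4}
trans_p = {"C": {"C": 0.7, "G": 0.3}, "G": {"C": 0.4, "G": 0.6}}
emit_p = {"C": {"low": 0.9, "high": 0.1}, "G": {"low": 0.2, "high": 0.8}}
print(viterbi(["low", "low", "high"], states, start_p, trans_p, emit_p))
# -> ['C', 'C', 'G']
```

The dynamic program is O(T * N^2) for T observations and N states, which is why a known "font" (a small, fixed state space for one instrument) makes the problem so much more tractable.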