Music Information Retrieval in a Post-Bedtime Environment
Oh dear, instead of being asleep I/we/me are perusing this guy's thesis (PDF warning). His PhD project centered on writing software that can automatically identify a singer by analyzing a recording of their performance. The really neat thing is that it works for recordings which have other instruments in them, so he had to come up with a way to determine whether certain sounds were voice-like or instrument-like.
This thesis reading came about due to me taking another poke at my previous attempts at music visualization (or more fancy-schmancily: MIR - Music Information Retrieval). Like so many research-related items, although the general concept is fairly well-defined and straightforward, the devil is definitely in the heapings of details.
There are all kinds of issues that rapidly start cropping up after the basics have been sort of sorted out.
For example, humans are pretty good at picking out a melody. It's hard to tell a computer to do that. As soon as you have more than one note being played at a time, it's a very non-trivial problem to work out which notes belong to the melody and which to the accompaniment. This is particularly true if you have, for example, a melody which switches from high to low notes, or passes between instruments, or has sharp changes in its dynamics.
In short, this means it is pretty easy to extract the melody from a monophonic performance of “Mary Had a Little Lamb”, but nearly impossible to extract it from anything else. At least automatically.
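To make the "pretty easy" half of that concrete, here is a minimal sketch of monophonic pitch tracking using plain autocorrelation. This is my own illustration, not the method from the thesis: the function names (sine_tone, estimate_pitch, note_name, transcribe), the 8 kHz sample rate, and the synthetic sine-wave "performance" are all assumptions made up for the example.

```python
import math

SAMPLE_RATE = 8000  # Hz; an assumed rate, low enough to keep the loops fast

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def sine_tone(freq_hz, duration_s=0.1, sample_rate=SAMPLE_RATE):
    """Synthesize a pure sine tone as a stand-in for one recorded note."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def estimate_pitch(samples, sample_rate=SAMPLE_RATE, fmin=100.0, fmax=1000.0):
    """Estimate the fundamental by finding the lag with the largest autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation: how well the signal matches itself
        # when shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def note_name(freq_hz):
    """Round a frequency to the nearest equal-tempered note (A4 = 440 Hz)."""
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)


def transcribe(note_freqs):
    """Turn a sequence of one-at-a-time tones into note names."""
    return [note_name(estimate_pitch(sine_tone(f))) for f in note_freqs]


# Opening of "Mary Had a Little Lamb": E D C D E E E (frequencies in Hz).
E4, D4, C4 = 329.63, 293.66, 261.63
MARY_OPENING = [E4, D4, C4, D4, E4, E4, E4]

if __name__ == "__main__":
    print(transcribe(MARY_OPENING))
```

The whole trick rests on there being exactly one periodic thing in the signal at a time. Add a second simultaneous note and the autocorrelation has several competing peaks, with nothing in it to say which one is "the melody".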
However, this brings up an interesting point. Can people even accurately identify which notes form the melody in a piece of music? Here's a quote from the above thesis which alludes to what I'm getting at, although in this case it is talking about automatically identifying musical instruments:
Martin [41] examines the classification of isolated samples from 37 instruments using hand-picked features as inputs to a quadratic classifier … best-case accuracy is reported at 71.6%.
Martin [41] and Brown [51] also cite human performance for instrument identification tasks. Brown notes that human performance on isolated tones for her two-instrument identification task is comparable to her system’s performance. Martin finds that humans score much worse than his system (i.e., 50-67% versus 71.6%) on the instrument identification task and with comparable accuracy for instrument family identification.
Some MIR goals might not actually be possible. This is possibly true for 100% accurate two-instrument identification, but I bet it's definitely true for melody extraction. In fact, the more I consider this problem the more I realize that it would be frickin' impossible to say which of the notes in a piece form part of the melody.
It's like when you try whistling a famous piece. Whistling should be a perfect example of melody recall, right? We whistle the melodies we have extracted from music; it's our best attempt at melody identification. Well, anyway, when I try to do this the first part usually goes pretty well, and I can sometimes make it all the way through the exposition without getting too confused. However, as soon as you hit the development section you're basically screwed. You try to whistle three parts at once and it falls apart like a… a… buttercup.
At least, that’s what happens when I try. Now I’ll probably get smarmy eMails from music majors who can whistle every orchestral part of Beethoven’s Ninth all the way through. At once.
Tags: classical music, music information retrieval