Score Processing, Part III
Here is the latest product of my epic battle to nicely animate a score to music:
(widescreen here)
The motivation here was to get a program to work out when notes were being hit, just by looking at the volume. The idea is that when the volume suddenly increases, a note is being played. If I can work out exactly when all the notes are being hit in a recording, then I can map the analyzed score (that is, the raw notes from the sheet music) onto the timings, and make a neat animation.
The video shows a moving plot of the volume. Well, sort of. It’s the absolute value (i.e., all negative values are made positive) of the amplitude of the waveform. You know how when you look at a speaker cone real close you can see it vibrating? The amplitude tells you how far it is moving. If the amplitude is large, it is compressing a lot of air, which sounds loud. If the amplitude is very small, the speaker is barely moving, and we can hardly hear anything. This doesn’t exactly correspond to what we hear as volume, because there are lots of psychological effects which affect our perception (for example, we hear low and high pitches differently).
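To make the idea concrete, here is a minimal sketch of volume-based note detection in plain Python. It is not the program that produced the video: the function names, the window size, and the `jump` threshold are all assumptions chosen for illustration, and the "recording" is a made-up list of samples.

```python
def envelope(samples, window):
    """Mean absolute amplitude over consecutive, non-overlapping windows."""
    return [
        sum(abs(s) for s in samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]


def detect_onsets(env, jump):
    """Indices of windows whose level rises by more than `jump`
    compared with the previous window (hypothetical threshold)."""
    return [i for i in range(1, len(env)) if env[i] - env[i - 1] > jump]


# Toy signal: silence, a loud note that decays, then a second note.
samples = (
    [0.0] * 8
    + [0.9, -0.8, 0.7, -0.6, 0.5, -0.4, 0.3, -0.2]
    + [1.0, -0.9, 0.8, -0.7]
)

env = envelope(samples, window=4)      # roughly [0.0, 0.0, 0.75, 0.35, 0.85]
onsets = detect_onsets(env, jump=0.3)  # [2, 4]
print(onsets)
```

A real recording would need a much longer window (thousands of samples) and a threshold tuned to the piece. A quiet note landing on top of a loud sustain would raise the envelope by less than `jump` and be missed.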
As you can see, it works pretty well when not much is happening, like during the first few minutes. It is interesting to see how the sound slowly drops off after each note is struck. When things get more hectic it gets way harder to separate the notes, since the sound level is continuously fairly high. Quieter notes get lost in the sustains from previous ones.
Imagine hitting a low C on the piano and then immediately hitting a high one, much more quietly. We could probably hear both because they have such different frequencies, despite the volumes. But if instead of hitting the high C you quietly hit the low one again, it would be really hard to hear. That’s pretty much what is happening here. We are not using any of the pitch information.
There is a way to do that, but it is much trickier to program.
Score Processing, Part II
No-one exists here. We are still within the midst (or midsts? Which is it? Well, given that it is from the 15th century I guess I can get away with either. That’s how they used to roll.) of Thanksgiving. No-one is here. Except me. I think my supervisor would have murdered me if I left this weekend, given that I was skipping around NYC for most of last week.
I did have some time to play around a bit more with the score processing. Here’s my standard guinea-pig type piece, Beethoven Op. 111:
Do you love the glorious widescreen? Oh wait… the embedded player doesn’t work with widescreen yet. Well, if you (like me) are hot for 16:9 you can watch the full thing here. HOWEVER. It still won’t have any audio. Why not? It turns out that synchronizing the audio is actually the real crux of this score analysis malarkey.
You see, the notes in the video are a literal transcription of the score (extracted from MusicXML versions of MIDI files), but no-one ever plays a literal transcription of the score. The tempos vary all the time. So making the audio match the notes is a much more difficult problem than getting the notes themselves. I have about five different ideas for getting this to work, but each of them is a several-day-long programming session.
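To illustrate what "matching" would involve (this is not necessarily one of the five ideas, just the simplest thing that could work), suppose a handful of notes have already been pinned to moments in the recording. Everything in between can then be stretched linearly. The function name `warp_time` and the anchor values below are hypothetical.

```python
def warp_time(score_time, anchors):
    """Map a score time (in beats) to a recording time (in seconds).

    `anchors` is a sorted list of (score_beat, recording_second) pairs.
    Between two anchors the tempo is assumed constant, so we interpolate
    linearly; outside the anchored range we clamp to the nearest anchor.
    """
    if score_time <= anchors[0][0]:
        return anchors[0][1]
    for (b0, t0), (b1, t1) in zip(anchors, anchors[1:]):
        if b0 <= score_time <= b1:
            frac = (score_time - b0) / (b1 - b0)
            return t0 + frac * (t1 - t0)
    return anchors[-1][1]


# Made-up anchors: the first bar takes 2 seconds, the second bar slows to 3.
anchors = [(0.0, 0.0), (4.0, 2.0), (8.0, 5.0)]

print(warp_time(2.0, anchors))  # 1.0
print(warp_time(6.0, anchors))  # 3.5
```

The hard part, of course, is finding the anchors in the first place.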
Still, the video looks kind of pretty, right?
Score Processing, Part I
Dear regular readers, you know that fantastic idea I had last week about automatically analyzing scores? Well, it’s a bit more complicated than (the royal) we had hoped. Of course. Stuff like that always is. It’s the same with lab research: when I look back at the sum of the previous year’s work, it is frickin’ astounding how much effort I have put in, in order to advance such a tiny distance. Oh woe, woe is we.
But enough of the whining. Here is why it doesn’t work:
The top line is the first violin part from… well… any guesses? The bottom, in red, is the average amount of “stuff” happening at each point in time. Specifically, it’s a measure of the average pixel intensity — which is why it dips down when notes are being played, because there are more black pixels, which have zero intensity. If this system worked as well as I would like then every note would be associated with a dip in the red line. That does in fact happen, but the problem is that all the other junk also makes it dip, like those f’s and accidentals.
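For the curious, the red line can be sketched in a few lines of plain Python. This is a toy reconstruction under assumptions, not the actual code: the image is a tiny hand-made grid (255 is white, 0 is black), and the 90% cut-off for calling something a "dip" is arbitrary.

```python
def column_profile(image):
    """Average pixel intensity of each column (0 = black, 255 = white)."""
    height = len(image)
    return [sum(row[x] for row in image) / height for x in range(len(image[0]))]


W, B = 255, 0
image = [
    [W, W, W, W, W, W],
    [W, B, W, W, B, W],  # column 1: a "note"; column 4: some other mark
    [W, B, W, W, W, W],
    [W, W, W, W, W, W],
]

profile = column_profile(image)  # [255.0, 127.5, 255.0, 255.0, 191.25, 255.0]
dips = [x for x, v in enumerate(profile) if v < 0.9 * max(profile)]
print(dips)  # [1, 4]
```

Both columns 1 and 4 show up as dips, which is exactly the problem: the profile only knows how much ink there is, not what shape the ink makes.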
So I’m going to have to implement something a bit more sneaky, like a normalized cross-correlation. Instead of just looking at blackness, an NCC would search through the score for stuff that “looks like” a note. Unfortunately it’s a lot slower, and more difficult to program.
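A bare-bones version of the idea, again in plain Python and again only a sketch: the template is slid over every position in the image and scored by how well its shape matches the patch underneath, after subtracting the mean brightness of each. The names `ncc` and `match_template`, and the tiny image and template, are invented for illustration.

```python
from math import sqrt


def ncc(patch, template):
    """Normalized cross-correlation of two equal-length lists, in [-1, 1].
    Returns 0.0 when either input is perfectly flat (zero variance)."""
    n = len(template)
    mean_p, mean_t = sum(patch) / n, sum(template) / n
    dp = [p - mean_p for p in patch]
    dt = [t - mean_t for t in template]
    denom = sqrt(sum(d * d for d in dp) * sum(d * d for d in dt))
    return sum(a * b for a, b in zip(dp, dt)) / denom if denom else 0.0


def match_template(image, template):
    """Score every position where the template fits inside the image."""
    th, tw = len(template), len(template[0])
    flat_t = [v for row in template for v in row]
    scores = {}
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [image[y + j][x + i] for j in range(th) for i in range(tw)]
            scores[(y, x)] = ncc(patch, flat_t)
    return scores


W, B = 255, 0
template = [[B, W],
            [B, B]]
image = [
    [W, W, W, W],
    [W, B, W, W],
    [W, B, B, W],
]

scores = match_template(image, template)
best = max(scores, key=scores.get)
print(best)  # (1, 1)
```

The slowness is visible in the nested loops: every candidate position costs one pass over the whole template, so the work grows with image size times template size, compared with a single pass over the image for the plain intensity profile.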


