
How to visualize music using animated spectrograms with open-source everything

April 29th, 2008 | Posted in music

A while ago I was playing with music visualizations that actually corresponded to the notes being played (in contrast to most of the default visualizations in music players, which are just meant to look pretty). This was what I came up with before:

[Embedded YouTube video]

In the video, the vertical axis corresponds to the frequency of the note being played and the horizontal motion is time. I’ve been poking around with this again recently, and I’ll demonstrate the new, improved version when I don’t have acres of aluminum to machine and essays about nuclear RNA export to write, which should be around Friday. In the meantime, various people have asked for information about how this early example was put together, so for posterity…

First I obtained music in WAV form. I chose to make this completely open source, so I used music from Wikimedia. The WAV data was then converted into a text-based data file giving the value of the waveform on the left and right channels at each sample time, using a piece of software called SoX (short for SOund eXchange):

sox filename.wav filename.dat

Which gives you a file that, after two header lines giving the sample rate and the number of channels, looks like:

0.089342404   -0.00051879883  -0.00079345703
0.089365079   -0.00030517578   -0.0009765625
0.089387755   -0.00030517578  -0.00061035156

...

Here the first column is the time in seconds and the second and third are the left and right channels. Then I used a Python script (using PIL and numpy) to load these numbers and perform a running Fourier transform to extract the frequencies present at each timepoint, producing a spectrogram. The script is pretty messy as I was simultaneously learning Python, but I’ll try to explain.

First we load the required modules (numpy for FFTs, Image for rendering frames) and read the sample rate from the data file, as well as calculating the number of samples, etc:

import sys
import numpy
from numpy.fft import fft
# PIL (or its successor Pillow) provides the imaging modules
from PIL import Image, ImageDraw, ImageOps

filename=sys.argv[1]

# Read the whole .dat file; the first line is the header "; Sample Rate <n>"
f=open(filename, "r")
data=f.readlines()
print(str(len(data))+" samples")
samplerate=data[0].split()[3]
print(samplerate+"Hz => "+str(1/float(samplerate))+" seconds per sample")
print(str(len(data)/float(samplerate))+"s")
lengthins=len(data)/float(samplerate)
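As a standalone illustration of the header handling above, here is a minimal sketch of a parser for this kind of file. The helper name `read_dat` is hypothetical, and the exact header text in the example is an assumption inferred from what the script reads with `data[0].split()[3]`; it is not the code from the original script.

```python
def read_dat(lines):
    """Parse SoX-style .dat text lines into (sample_rate, rows of floats)."""
    sample_rate = None
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(";"):
            fields = line.split()
            # Assumed header form: "; Sample Rate 44100"
            if fields[1:3] == ["Sample", "Rate"]:
                sample_rate = float(fields[3])
            continue
        # Data line: time, left channel, right channel
        rows.append([float(v) for v in line.split()])
    return sample_rate, rows

# Example input: assumed header lines plus two data lines from the post
example = """; Sample Rate 44100
; Channels 2
0.089342404   -0.00051879883  -0.00079345703
0.089365079   -0.00030517578   -0.0009765625
""".splitlines()

rate, rows = read_dat(example)
```

Treating every line that starts with `;` as a header means the data rows no longer need to be skipped by a fixed line count.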

The next section sets, calculates and displays various pieces of information (for example, the last time point to process given the width of the transform and the desired length of time to process). The width indicates the amount of time to process with each transform, whereas the spacing is the separation in time between each of them:

length_to_process=100          # seconds of audio to process

fourierspersecond=24           # transforms per second of audio
fourierwidth=0.3               # seconds of audio covered by each transform
fourierspread=1.0/fourierspersecond

# Counts and sample indices must be integers to be used with range() and slices
totaltransforms=int(round(length_to_process*fourierspersecond))
fourierspacing=int(round(fourierspread*float(samplerate)))

fourierwidthindex=int(round(fourierwidth*float(samplerate)))
print("For Fourier width of "+str(fourierwidth)+" need "+str(fourierwidthindex)+" samples each FFT")
print("Doing "+str(fourierspersecond)+" Fouriers per second")
print("Total " + str(totaltransforms*fourierspread))
print("Spacing: "+str(fourierspacing))
print("Total transforms "+str(totaltransforms))

lastpoint=int(round(length_to_process*float(samplerate)+fourierwidthindex))
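To make these numbers concrete, the following sketch works through the arithmetic for one case. The 44100 Hz sample rate is an assumption for illustration (the script reads the real value from the data file), and the derived names are mine.

```python
samplerate = 44100.0        # assumed; the script reads this from the .dat header
fourierspersecond = 24
fourierwidth = 0.3

# Samples fed to each FFT, and samples between the starts of successive FFTs
samples_per_fft = int(round(fourierwidth * samplerate))       # 13230
spacing = int(round(samplerate / fourierspersecond))          # 1838

# Fraction of each window shared with the next one
overlap = 1.0 - float(spacing) / samples_per_fft

# Each FFT bin is 1/fourierwidth Hz wide, so 300 bins reach about 1000 Hz
bin_width_hz = samplerate / samples_per_fft
top_of_image_hz = 300 * bin_width_hz
```

With these settings successive windows overlap by roughly 86%, and the 300 frequency bins drawn later span roughly 0 to 1000 Hz at about 3.3 Hz per bin.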

The following initializes several arrays with zeros:

fourierarray=numpy.zeros(fourierwidthindex)
time=numpy.zeros(lastpoint)
sound=numpy.zeros(lastpoint)

This next bit averages the two channels and stores the result into the sound array:

# Skip the two header lines, then average left and right into one mono signal
for line in range(2,lastpoint):
  row=data[line].split()
  time[line]=float(row[0])
  sound[line]=(float(row[2])+float(row[1]))/2
f.close()
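The same averaging can be done without an explicit loop. This is a sketch of an alternative, not the approach used in the script: it assumes the header lines begin with `;` and lets `numpy.loadtxt` skip them as comments. Unlike the loop above, the resulting arrays start at index 0 rather than index 2.

```python
import io
import numpy

# Assumed header lines followed by the three data lines shown in the post
dat_text = u"""; Sample Rate 44100
; Channels 2
0.089342404   -0.00051879883  -0.00079345703
0.089365079   -0.00030517578   -0.0009765625
0.089387755   -0.00030517578  -0.00061035156
"""

# Lines starting with ";" are treated as comments and skipped
table = numpy.loadtxt(io.StringIO(dat_text), comments=";")

time = table[:, 0]                    # first column: time in seconds
sound = table[:, 1:3].mean(axis=1)    # mean of left and right channels
```

For a real file, the filename can be passed to `numpy.loadtxt` in place of the `StringIO` object.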

Now the real meat of the program. This first allocates an image large enough to store every time point to be calculated. Then we iterate through the data, extracting an array of the desired size at each point and loading it into fourierarray, which then has its Fourier transform taken. Finally, the data in the Fourier transform output array outfft is iterated through and the values are scaled to pixel values:


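# offset is half the movie width (400); it is set again with the movie settings
# below, but has to exist before the image is allocated
offset=200
im=Image.new("RGB",(totaltransforms+offset,300))
imd=ImageDraw.Draw(im)

for position in range(0,totaltransforms):
  print("FFT: "+str(position).zfill(3))
  # Take the samples for this time window and transform them
  fourierarray=sound[((position*fourierspacing)):((position*fourierspacing)+(fourierwidthindex))]
  outfft=fft(fourierarray)
  for x in range(300):
    # Square of the real part of bin x, scaled and clipped to the 0-255 red channel
    value=int(min(255,255*((outfft[x].real)**2)/160))
    imd.point((position+offset,x),(value,0,0))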
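The running transform can be tried out on a synthetic tone without any image code. The sketch below is a simplified variant rather than the script itself: the function name `spectrogram` and the 8000 Hz test signal are assumptions, and it uses the squared magnitude of each bin (`abs(...)**2`) instead of the squared real part, which makes the result independent of the phase of the tone.

```python
import numpy

def spectrogram(sound, samplerate, width_s=0.3, per_second=24, bins=300):
    """Return an array of shape (bins, n_transforms) of spectral power."""
    width = int(round(width_s * samplerate))
    spacing = int(round(samplerate / float(per_second)))
    n = (len(sound) - width) // spacing + 1
    columns = []
    for position in range(n):
        start = position * spacing
        window = sound[start:start + width]
        # Keep only the lowest `bins` frequency bins of this window
        power = numpy.abs(numpy.fft.fft(window)[:bins]) ** 2
        columns.append(power)
    return numpy.array(columns).T

# Two seconds of a 440 Hz sine wave at an assumed 8000 Hz sample rate
samplerate = 8000
t = numpy.arange(2 * samplerate) / float(samplerate)
tone = numpy.sin(2 * numpy.pi * 440.0 * t)

spec = spectrogram(tone, samplerate)
peak_bin = int(spec[:, 0].argmax())
# Convert the bin index back to a frequency: bin width is samplerate / window length
peak_hz = peak_bin * samplerate / float(int(round(0.3 * samplerate)))
```

Each column of `spec` plays the role of one vertical line of pixels in the image, and the brightest row in every column should sit at the bin corresponding to 440 Hz.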

This section renders movie frames by extracting parts of the complete spectrogram, recoloring them and adding a line. Each frame is output as a numbered JPG:

moviescanrange=100*24          # number of frames to render
moviewidth=400
movieheight=300
offset=int(round(moviewidth/2.0))

lowfrequency=30
highfrequency=1000

frame=Image.new("RGB",(moviewidth,movieheight))
# x position of the marker line; pixel coordinates must be integers
linepos=int(round((moviewidth/2.0)-fourierspersecond*fourierwidth))

for xp in range(0,moviescanrange):
  print("Rendering frame "+str(xp).zfill(4))
  # Crop boxes are (left, upper, right, lower); right and lower are exclusive
  leftpart=im.crop((xp,0,xp+linepos,movieheight)).point(lambda i:i*0.4)
  rightpart=im.crop((xp+linepos,0,xp+moviewidth,movieheight))
  frame.paste(leftpart,(0,0))
  frame.paste(rightpart,(linepos,0))
  frame=ImageOps.flip(frame)
  framed=ImageDraw.Draw(frame)
  framed.line([(linepos,0),(linepos,500)],fill=(0,255,0))
  frame.save("frame_"+str(xp).zfill(4)+".jpg")
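The windowing logic of the frame loop can be seen in isolation using a plain numpy array in place of the PIL image. This is a sketch only: `render_frame` is a hypothetical helper, it leaves out the green marker line and the JPG output, and the tiny array stands in for a real spectrogram.

```python
import numpy

def render_frame(spec_img, xp, moviewidth, linepos, dim=0.4):
    """Return the moviewidth-wide slice starting at column xp, with the
    columns left of the marker position dimmed."""
    # astype() makes a copy, so the full spectrogram is left untouched
    frame = spec_img[:, xp:xp + moviewidth].astype(float)
    frame[:, :linepos] *= dim      # part to the left of the marker: darker
    return frame[::-1, :]          # flip vertically, like ImageOps.flip

# A 2 x 6 stand-in for the spectrogram image
spec_img = numpy.arange(12).reshape(2, 6)

# Frame 1 of a 4-pixel-wide movie with the marker at x = 2
frame = render_frame(spec_img, 1, 4, 2)
```

Increasing `xp` by one per frame slides the window one column to the right, which is what makes the spectrogram appear to scroll past the marker.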

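Finally, the JPGs are animated and combined with the original music into a movie using ffmpeg. The frames were rendered at 24 per second, so the input frame rate is set to match; otherwise ffmpeg assumes 25 frames per second and the picture drifts out of sync with the audio:

ffmpeg -r 24 -i frame_%04d.jpg -i music.wav output.avi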

And there you have it. You can also download all the source in one file soxtoframes.py.
