Tag optimization (2)

Optimizing iTunesAnalysis: faster database access
Mood: tired
Posted on 2009-07-09 09:44:00
Tags: optimization essay projects programming
Words: 280

The second in an occasional series
Idea: Now I'm focusing on improving the time it takes to insert the iTunes data into the database. Where we left off last time, our script took 71 seconds to run, ~50 seconds of which was database operations. The idea I had to speed this up was to batch a bunch of queries together and thus make fewer calls to the database. It turns out this actually slowed things down.
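
The sort of batching I mean is something like this - a minimal sketch against a generic DB-API cursor, with made-up table and column names rather than the real schema:

    # One multi-row INSERT per batch instead of one INSERT per track.
    def insert_tracks_batched(cursor, tracks, batch_size=100):
        for start in range(0, len(tracks), batch_size):
            batch = tracks[start:start + batch_size]
            placeholders = ", ".join(["(%s, %s, %s)"] * len(batch))
            values = []
            for track in batch:
                values.extend([track["id"], track["name"], track["playcount"]])
            cursor.execute(
                "INSERT INTO tracks (id, name, playcount) VALUES " + placeholders,
                values)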

So I did a little research, and it turns out that if you insert data with the same query structure over and over again (but with different bind variables), the database doesn't have to reparse the query, which speeds things up a lot. I tried doing this with pyPgSQL but couldn't find any documentation on how it was supposed to work, so I switched to psycopg2 and changed the query for inserting the playlist data. Just switching to psycopg2 sped things up a lot, it seems. I tried switching to a similar sort of query for inserting track data, but that actually slowed things down.
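
The pattern looks something like this - a minimal sketch with psycopg2, using made-up table and column names and placeholder connection details, not the real schema:

    import psycopg2

    # The SQL string stays identical on every call; only the bound parameters change.
    def insert_playlists(conn, playlists):
        insert_sql = "INSERT INTO playlists (id, name) VALUES (%s, %s)"
        cur = conn.cursor()
        for playlist in playlists:
            cur.execute(insert_sql, (playlist["id"], playlist["name"]))
        conn.commit()

    # conn = psycopg2.connect("dbname=itunes user=me")
    # insert_playlists(conn, [{"id": 1, "name": "Favorites"}])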

Anyway, the new script runs in 25 seconds, and it looks like only around 9 seconds of that is database operations. That's a ~400% speedup in the database time! Overall, this step improved performance by ~180%, and since we started at 114 seconds we've improved ~350%.

Conclusion: Another big success, and I'm not sure how much more I can squeeze out of the iTunesInfo.py script. Next time I'll focus on the analyzedb.py script, which does the analysis on the database - right now it's taking between 5 and 12 minutes to run on my library of 6400 tracks.

Source files:
- old script
- new script

0 comments

Optimizing iTunesAnalysis through smarter parsing
Mood: geeky
Posted on 2009-07-08 10:31:00
Tags: optimization essay projects programming
Words: 631

The first in an occasional series
Intro: A while back I wrote a script to analyze an iTunes library and find your favorite artists, albums, etc. It works pretty well and I regularly use it to update my own analysis. Unfortunately, it generally takes a long time to run, which is sort of OK for me (because I just start it running and go do something else) but less good for people who are running the analysis through the web site.

So I'd like to make it run faster, and I have a number of ideas to do so.

Idea: There are two main parts to the system - parsing the iTunes Music Library.xml file into a database, and running the analysis on the database. First I'm focusing on the parsing part.

The first version of the parsing script uses Python's xml.dom.minidom package to completely parse the library file.

After profiling the first version by running python -m cProfile -o profiledata.oldway iTunesInfo.py "iTunes Music Library.xml", I see that the whole parsing process takes 114 seconds. The major parts of this are 60 seconds for the xml.dom.minidom.parse method and 46 seconds for the database operations. Note that this only leaves ~8 seconds for figuring out the track information - clearly this is not the bottleneck!
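
If you want to poke at the profile data yourself, the standard pstats module reads the file that the -o option writes out:

    import pstats

    # Load the saved profile and show the ten most expensive calls by cumulative time.
    stats = pstats.Stats("profiledata.oldway")
    stats.sort_stats("cumulative").print_stats(10)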

So I'd like to improve parsing speed. There are two basic kinds of XML parsers - what we're using now is a DOM (Document Object Model) style parser, which means that the parser reads the entire file in and returns a parsed structure containing all the data. (I remember writing a simple XML parser that did this as a project in COMP 314. Ah, memories...) The advantage to this method is that after the parsing is done, it's easy to traverse the DOM tree and find the data you're interested in. The downside is that, well, it's slow. Also, the entire document has to be read into memory, which means that your memory usage is proportional to the size of the file you're processing; this adds to the slowness and can lead to out-of-memory problems on huge files (although we weren't seeing that here).
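
To make that concrete, DOM-style parsing looks something like this (a tiny standalone example, not the actual iTunesInfo.py logic):

    from xml.dom import minidom

    # Parse the entire file into an in-memory tree, then walk the tree.
    doc = minidom.parse("iTunes Music Library.xml")
    keys = doc.getElementsByTagName("key")
    print("found %d <key> elements" % len(keys))
    # Once the tree is built, any element's text is easy to get at, e.g.:
    # first_key_text = keys[0].firstChild.data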

The other basic kind of XML parser is known as SAX, or Simple API for XML. You provide callback functions that are called whenever the parser runs across the start of a tag, the end of a tag, character data, and...that's it. Whatever processing you want to do, you have to do in those callback functions. So if you're just, say, counting the number of <key> tags in a document, this works really well. It's also much faster than a DOM-style parser, since it doesn't have to generate a giant tree structure. But doing the sort of processing we're doing on the library file seems a bit trickier.
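
Here's that <key>-counting example written as a SAX handler, just to show the shape of the callback approach (a standalone sketch, not code from the real script):

    import xml.sax

    # No tree is built; we just get a callback for every start tag.
    class KeyCounter(xml.sax.ContentHandler):
        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self.count = 0

        def startElement(self, name, attrs):
            if name == "key":
                self.count += 1

    handler = KeyCounter()
    xml.sax.parse("iTunes Music Library.xml", handler)
    print("found %d <key> elements" % handler.count)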

Anyway, I take a stab at it, and after a bit end up with version 2 of the script. Notice that the logic in the Handler class is a bit twisted - we have to keep track of where we are in the document (so if things get out of order we'll have problems) and use a state-based system, which is a bit brittle and unclear.
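
To give a flavor of what I mean by state-based, the handler has to do something along these lines - a rough, simplified sketch of the idea, not the actual version 2 code:

    import xml.sax

    # Track how deep we are in nested <dict>s, remember the last <key> seen,
    # and collect key/value pairs while inside the top-level "Tracks" dict.
    class TrackHandler(xml.sax.ContentHandler):
        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self.in_tracks = False   # inside the "Tracks" dict?
            self.dict_depth = 0      # current <dict> nesting depth
            self.current_key = None  # text of the most recent <key>
            self.chars = []          # character data for the current element
            self.current_track = {}
            self.tracks = []

        def startElement(self, name, attrs):
            self.chars = []
            if name == "dict":
                self.dict_depth += 1

        def characters(self, content):
            self.chars.append(content)

        def endElement(self, name):
            text = "".join(self.chars)
            if name == "key":
                if text == "Tracks" and self.dict_depth == 1:
                    self.in_tracks = True
                self.current_key = text
            elif name in ("string", "integer", "date") and self.in_tracks:
                self.current_track[self.current_key] = text
            elif name == "dict":
                self.dict_depth -= 1
                if self.in_tracks and self.dict_depth == 2:
                    # Just closed one per-track dict.
                    self.tracks.append(self.current_track)
                    self.current_track = {}
                elif self.dict_depth == 1:
                    self.in_tracks = False  # left the "Tracks" dict

    handler = TrackHandler()
    xml.sax.parse("iTunes Music Library.xml", handler)

The real script has more of the file and more value types to deal with than this, which is where the brittleness comes from.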

But how does it perform? The old version of the script ran in 114 seconds, and this version runs in 71 seconds, for a ~60% increase in speed. But really, it's better than that, because the database operations still take around 50 seconds - if we subtract that from both versions we get 64 seconds versus 21 seconds, which is a ~200% increase in the speed of the parsing.

Conclusion: This was a big success! Most of the time is now in the database layer, which I have some ideas for speeding up next time.

Source files:
- old script
- new script

0 comments

This backup was done by LJBackup.