I love data
Mood: giddy
Posted on 2007-04-25 16:05:00
Tags: programming ngrams
Words: 80

Thanks to Peter Norvig's (he wrote my AI textbook! and is director of research at Google!!) article about writing a spelling corrector, I was reminded that Google released a giiiiant list of n-grams found on the web. Unfortunately, it's only licensed for noncommercial use unless you join and pay thousands of dollars, and it costs $180(!) to buy and ship. On the other hand, it's 6 DVDs of compressed data (24 GB of gzipped files). This is soooo tempting.


4 comments

Comment from wonderjess:
2007-04-25T17:19:43+00:00

too bad it's past your birthday. :)

Comment from quijax:
2007-04-25T21:36:01+00:00

I have that book! I didn't realize you dabble in NLP. Just out of curiosity, do you have any projects in mind?

Comment from gregstoll:
2007-04-25T22:24:49+00:00

I dabble in random AI stuff (although not in a while), and NLP seems neat!

I'm making a list of projects I would do if I got the DVDs. If they still seem convincing in a week (to avoid the "this sounds super super cool" effect where I then forget all about it, which happens a lot with me), I'll probably get them. Maybe we could share them if you're interested :-)

Anyway:
- showing simple letter frequency
- some sort of solver of cryptograms (knowing the prior probabilities of words would be helpful, and it would be a super huge dictionary)
- simple interface to see which words are more common. You could do it with people's names for extra motivation!
- Google Suggest kinda thing where you type in a few letters and you get the most common words starting with those letters.
- one of those Markov chain things that spit out English text based on 1-grams, 2-grams, etc.

This is just off the top of my head - I'm open to other cool ideas!
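
A few of the ideas above (letter frequency, the Suggest-style lookup, and the Markov chain babbler) are small enough to sketch in Python. This is only a rough sketch: UNIGRAMS and BIGRAMS are made-up toy counts standing in for the real data, and the function names are hypothetical. It doesn't assume anything about how the files on the DVDs are actually laid out.

```python
import random
from collections import Counter, defaultdict

# Made-up toy counts (assumption); the real data would be loaded from files.
UNIGRAMS = {"the": 50, "then": 8, "there": 10, "data": 12, "dvd": 3}
BIGRAMS = {
    ("i", "love"): 5,
    ("love", "data"): 4,
    ("love", "dvds"): 1,
    ("data", "is"): 3,
    ("is", "tempting"): 2,
}


def letter_frequency(unigrams):
    """Relative frequency of each letter, weighted by word count."""
    counts = Counter()
    for word, n in unigrams.items():
        for ch in word:
            counts[ch] += n
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}


def suggest(prefix, unigrams, k=3):
    """The k most common words starting with prefix (ties broken A-Z)."""
    matches = [(w, n) for w, n in unigrams.items() if w.startswith(prefix)]
    matches.sort(key=lambda wn: (-wn[1], wn[0]))
    return [w for w, _ in matches[:k]]


def babble(start, bigrams, length=5, rng=None):
    """Random walk over bigrams, picking each next word by its count."""
    rng = rng or random.Random()
    followers = defaultdict(list)
    for (first, second), n in bigrams.items():
        followers[first].append((second, n))
    words = [start]
    while len(words) < length:
        options = followers.get(words[-1])
        if not options:
            break  # dead end: nothing ever follows this word
        candidates, weights = zip(*options)
        words.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(words)


if __name__ == "__main__":
    print(suggest("th", UNIGRAMS))
    print(babble("i", BIGRAMS))
```

The real data would be far too big for a plain dict scan per lookup, so a sorted list or a trie would probably be the next step for the Suggest idea.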

Comment from quijax:
2007-04-26T17:38:56+00:00

wow, cryptograms. That one sounds hard enough to be interesting.

This backup was done by LJBackup.