I recently came across an interesting post on the Powerset Blog about garden path sentences. Garden path sentences lead you down the wrong path through a string of words with multiple meanings. For example:
The complex houses married and single students and their families
In this case, most readers would probably read complex as an adjective modifying the plural noun houses, when in fact complex is the noun and houses is the verb. The post ended with a challenge: how easy would it be to create a program to automatically generate these sentences? Since school is out and I have some free time, I tried it myself. I found a decent free XML dictionary and wrote a Ruby script to parse the important bits (the type of word and alternate forms) into an SQL database. I cross-checked all the words against a word frequency table to make sure there were no obscure words. I then wrote a Python script to put the words together into a (hopefully meaningful, but often not) sentence. I put the Python script onto my server so you could play with it.
April 2009 update: I removed the live demo as part of a server move.
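The lookup step described above (dictionary entries in an SQL database, cross-checked against a frequency table) might look roughly like the sketch below. This is not the original script: the table names, column names, sample rows, and frequency ranks are all made up for illustration, since the post does not give the real schema or data.

```python
import sqlite3


def build_db(entries, frequency_ranks):
    """Create an in-memory database from (word, part_of_speech) pairs.

    The schema here is hypothetical; the original database layout is unknown.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT, pos TEXT)")
    conn.execute("CREATE TABLE freq (word TEXT PRIMARY KEY, rank INTEGER)")
    conn.executemany("INSERT INTO words VALUES (?, ?)", entries)
    conn.executemany("INSERT INTO freq VALUES (?, ?)", frequency_ranks.items())
    return conn


def ambiguous_common_words(conn, max_rank):
    """Return {word: set of parts of speech} for words that are common
    (frequency rank <= max_rank) and have more than one part of speech."""
    rows = conn.execute(
        "SELECT w.word, w.pos FROM words AS w "
        "JOIN freq AS f ON f.word = w.word "
        "WHERE f.rank <= ?",
        (max_rank,),
    )
    by_word = {}
    for word, pos in rows:
        by_word.setdefault(word, set()).add(pos)
    # Keep only the words that can be read as more than one part of speech.
    return {word: tags for word, tags in by_word.items() if len(tags) > 1}


# Illustrative data only: the ranks below are invented, not real frequencies.
entries = [
    ("complex", "adjective"), ("complex", "noun"),
    ("houses", "noun"), ("houses", "verb"),
    ("spheres", "noun"), ("spheres", "verb"),
    ("students", "noun"),
]
ranks = {"complex": 900, "houses": 400, "spheres": 9000, "students": 600}

conn = build_db(entries, ranks)
candidates = ambiguous_common_words(conn, max_rank=5000)
```

With this sample data, a cutoff of 5000 keeps complex and houses as candidates, drops spheres as too rare, and drops students because it has only one part of speech.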
As you can see, the sentences that it comes up with are far from meaningful. However, in most cases you can at least see how a reader could be taken down the wrong path (at least in the cases where there is a right path). In the above example, concrete could be an adjective or a noun, and spheres could be a noun or a verb (to form a sphere). Foster could be an adjective or a noun depending on the context, but I can't imagine a reader taking it as an adjective here. Certainly the sentence generator leaves a lot to be desired (especially considering that this was one of the better sentences), but I got about as far with it as I expected to. I think it could be improved further with a few modifications:
- Words in the database are already cross-checked to make sure they aren’t obscure, but often a word will be common as a noun and uncommon as a verb, or vice versa. I didn’t have a dataset that allowed me to determine if this was the case for a particular word.
- The valency of verbs is ignored. All verbs are assumed to be transitive, even though valency information is available in the database.
- I underestimated the difficulty of having a computer generate a meaningful sentence. It is difficult to determine which verbs are compatible with which nouns; I guess you would need to parse a large amount of English text (perhaps some of Project Gutenberg. I think Wikipedia would not be varied enough, but I could be wrong).
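As a rough illustration of the valency point in the list above, here is a minimal sketch of a template-based generator that only puts a word in the verb slot if its verb reading is transitive. The lexicon, the `Entry` class, the `transitive` flags, and the single three-slot template are assumptions made for this example; the original gardenpath.py is not shown in the post and may have worked quite differently.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    word: str
    pos: frozenset            # parts of speech the word can take
    transitive: bool = False  # does the verb reading take a direct object?


# Hand-built toy lexicon; the transitivity flags are assigned by hand here.
LEXICON = [
    Entry("complex", frozenset({"adjective", "noun"})),
    Entry("concrete", frozenset({"adjective", "noun"})),
    Entry("houses", frozenset({"noun", "verb"}), transitive=True),
    Entry("spheres", frozenset({"noun", "verb"}), transitive=True),
    Entry("falls", frozenset({"noun", "verb"}), transitive=False),
    Entry("students", frozenset({"noun"})),
    Entry("families", frozenset({"noun"})),
]


def generate(lexicon, rng):
    """Build 'The <adj-or-noun> <noun-or-verb> <noun>'.

    The first slot looks like an adjective but is really the subject noun;
    the second looks like a plural noun but is really the verb. Because an
    object follows, only transitive verbs are allowed in the second slot.
    """
    subjects = [e for e in lexicon if {"adjective", "noun"} <= e.pos]
    verbs = [e for e in lexicon if {"noun", "verb"} <= e.pos and e.transitive]
    objects = [e for e in lexicon if e.pos == frozenset({"noun"})]
    subject, verb, obj = rng.choice(subjects), rng.choice(verbs), rng.choice(objects)
    return f"The {subject.word} {verb.word} {obj.word}"


sentence = generate(LEXICON, random.Random(0))
```

Here falls is never chosen for the verb slot because it is marked intransitive. This handles valency only; it does nothing about whether the chosen verb and nouns make sense together, which is the harder problem described in the last bullet.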
I noticed later that Ero Carrera had taken a similar approach to what I did, but with his linguistics experience he better anticipated the problems I ran into. He has some good ideas, and his post is an interesting read.


This is a great first attempt at this problem! Getting the meanings to really line up is the hard part.
That first sentence — “The complex houses married and single students and their families” — was amazing. It took me about 5 attempts to parse it. And good work on gardenpath.py — that sort of thing’s always fun. I quite liked this one, “Her correspondent buffers shot frequencies”. (BTW, “sentances” is spelt “sentences”. :-)
Ben, thanks for the heads up on the spelling. I just assumed that Firefox would go ahead and underline the misspelled words in the WordPress dashboard like it does on other text fields and WYSIWYG inputs. Embarrassingly, this is not the case.
Btw, I can’t wait to see micropledge launch.
Yeah, Firefox does not do spell check on fields, only on ‘s. I really don’t know why. They should at least do it for style text boxes. Bad idea to do it for text boxes since I could tell if someone’s password is in the dictionary if it doesn’t show a squiggly line as I’m watching over him, and then try to brute force it later.
Yikes! My previous comment makes no sense since WordPress stripped all the HTML tags! WordPress should be doing an htmlspecialchars(), not a strip_tags(), on comments!
Rajesh, yeah, but the weird thing is, Firefox spellcheck works fine with Gmail’s styled text input.
And I agree, WordPress’s tag stripping is a bit stupid. I haven’t been too happy with WordPress in general over the last little while; if I had the time I would port my blog over to another system that is better for my needs. Something with syntax highlighting and LaTeX rendering would be cool. I might end up putting one together myself if I have to.
I’m pretty sure there’s a plugin for WordPress that can do LaTeX. Try this one: http://sixthform.info/steve/wordpress/, which implements a nice wrapper over http://www.forkosh.com/mimetex.html
Similarly, if you look carefully enough through http://wp-plugins.net/, I’m sure you’d find one for syntax highlighting as well.
My very own blog, meetrajesh.com, does LaTeX rendering, PHP syntax highlighting using PHP’s built-in highlight_string(), AND chemical reaction and equation notation, which is just LaTeX at the end of the day. See http://www.meetrajesh.com/archive_2005_06.html#testing.txt for an example.
Sorry that my comments are getting way off-topic from the original blog post. I do like the garden path sentences generated by your Python script though. Many hours of free linguistic entertainment!
Hmm, cool, I’ll look into that. I should have thought of checking wp-plugins. I guess I was just looking for an excuse to move to another blogging platform :). Leonardo (http://jtauber.com/leonardo/) seems to be almost exactly what I’m looking for, actually (it even supports TeX and syntax highlighting).