Garden Path Sentences

I recently came across an interesting post on the Powerset Blog about garden path sentences. Garden path sentences are sentences that lead you down the wrong path through a string of words with multiple meanings. For example:

The complex houses married and single students and their families

In this case, most readers would probably think complex was an adjective that modified the plural noun houses, when in fact complex is the noun and houses is the verb. The post ended with a challenge: how easy would it be to create a program to automatically generate these sentences? Since school is out and I have some free time, I tried it myself. I found a decent free XML dictionary, and wrote a Ruby script to parse the important bits (the type of word and alternate forms) into an SQL database. I cross-checked all the words against a word frequency table to make sure there were no obscure words. I then wrote a Python script to put the words together into a (hopefully meaningful, but not often) sentence. I put the Python script onto my server so you can play with it here. (April 2009 Update: I removed the live demo as part of a server move.)

His concrete spheres foster complexities

As you can see, the sentences that it comes up with are far from meaningful. However, in most cases you can at least see how a reader could be taken down the wrong path (in the cases where there is a right path, anyway). In the above example, concrete could be an adjective or a noun, and spheres could be a noun or a verb (to form a sphere). Foster could be an adjective or a verb depending on the context, but I couldn’t see the reader taking it as an adjective here. Certainly the sentence generator leaves a lot to be desired (especially considering that this was one of the better sentences), but I got about as far with it as I expected to. I think it could be improved further with a few modifications:

  • Words in the database are already cross-checked to make sure they aren’t obscure, but often a word will be common as a noun and uncommon as a verb, or vice versa. I didn’t have a dataset that allowed me to determine if this was the case for a particular word.
  • The valency of verbs is ignored. All verbs are assumed to be transitive, even though valency information is available in the database.
  • I underestimated the difficulty of having a computer generate a meaningful sentence. It is difficult to determine which verbs are compatible with which nouns. I guess you would need to parse a large amount of English text (perhaps some of Project Gutenberg; I think Wikipedia would not be varied enough, but I could be wrong).
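
The general idea of filling a sentence template with words that can take more than one part of speech can be sketched roughly as follows. This is a minimal illustration, not the original script: the tiny in-memory LEXICON, the TEMPLATE, and the function names are all assumptions made for the example, and none of the improvements listed above (per-sense frequency, valency, noun and verb compatibility) are attempted.

```python
import random

# Hypothetical toy lexicon: word -> parts of speech it can take.
# (The script described above read this from an SQL database instead.)
LEXICON = {
    "his": {"det"},
    "the": {"det"},
    "concrete": {"adj", "noun"},
    "complex": {"adj", "noun"},
    "spheres": {"noun", "verb"},
    "houses": {"noun", "verb"},
    "foster": {"adj", "verb"},
    "complexities": {"noun"},
    "students": {"noun"},
}

# Each slot lists the parts of speech a word must be able to take.
# Slots with two tags are the ambiguous ones that can mislead a reader.
TEMPLATE = [{"det"}, {"adj", "noun"}, {"noun", "verb"}, {"verb"}, {"noun"}]


def candidates(required):
    """Return the words whose possible parts of speech include all of `required`."""
    return sorted(word for word, tags in LEXICON.items() if required <= tags)


def generate(rng):
    """Fill every template slot with a randomly chosen matching word."""
    return " ".join(rng.choice(candidates(slot)) for slot in TEMPLATE)


if __name__ == "__main__":
    print(generate(random.Random()).capitalize())
```

Like the real generator, this picks words without any regard for meaning, so most of its output is grammatical-looking nonsense.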

I noticed later that Ero Carrera had taken a similar approach to what I did, but with his linguistics experience he better anticipated the problems I ran into. He has some good ideas, and his post is an interesting read.

Endless Google Search

April 2009 Update: Originally, I had a live example of this running. However, the Google API doesn’t seem to work any more (it was discontinued over two years ago). In any case, there are better examples online now. Try Live Search Images or Terrel Dent’s blog. I would make the source available, but it was a weekend hack and there isn’t much to it.

I felt like coding today, so I put together a little hack from an idea I have had for a while. What I came up with is a web search (powered by Google), that loads new search results as you scroll the page down. Try it, it’s actually pretty cool.

Here is how it works: there is a large div element at the bottom of the page just to take up space. When it scrolls onto the screen, an Ajax request is made to the server to get the next 10 results from Google. The requests are made through Google’s SOAP API, which is no longer available to new users, but I had an old API key so I was able to get it to work. I had all the client stuff working within an hour, but Google’s API took a while to figure out: SOAP is powerful but hard to code for compared to a simple GET API. It took me a couple of hours to get the server-side stuff working, but it is still a hack, so don’t be surprised if you get an error or some unexpected behaviour.

It was designed for Firefox/Mozilla browsers. The only other browser I have tried it with is IE, which it does not work with. So if you are using Internet Explorer, you won’t see anything interesting.

Try it here

JSSpamBlock Modifications

Update: Due to lack of time and interest (on my part), I am no longer maintaining JSSpamBlock or ImageScaler.

The way JSSpamBlock has evolved since I first released it has reminded me why I love open source. From day one, I had users pointing out bugs and features they would like added, sometimes even submitting a fix for the bug or adding a new feature themselves. Here are some modifications I have come across on other blogs:

After Georg Kaindl and I had a discussion on whether a database was really necessary (he made some excellent points on why this is not the case, though I still maintain that the extra protection is worth the small cost of time), he released a JSSpamBlock modification as a new plugin called simpleAntiSpam. He also came up with a clever way to require that the form be parsed once by the bot for each post (although the bot can make unlimited comments to a post once it has parsed the form). I have considered making this functionality the default in an upcoming version of JSSpamBlock, since it will be more than enough protection for the average user.

More recently, I got a comment from Brandon Checketts, who had modified JSSpamBlock so that the comment field names were different than the defaults. The reason was that even if spam bots adapt to JSSpamBlock, modified field names will throw them off. Although I can’t see anyone modifying their spam bots to specifically get around my plugin, I have always tried to design it as if they eventually would, so this will likely be a feature in future versions as well.

Kevin Pendleton, another user, has ported JSSpamBlock to Perl. His version is a bit simpler; it uses a hard-coded value instead of a randomly generated one. In my experience with blocking bots, this should be enough to block out the vast majority of spam bots.

A simple diff algorithm in PHP

A diff algorithm in its most basic form takes two strings and returns the changes needed to turn the old string into the new one. Diffs are useful for comparing different versions of a document or file, to see at a glance what the differences are between the two. Wikipedia, for example, uses diffs to compare the changes between two revisions of the same article.

Solving the problem is not as simple as it seems, and the problem bothered me for about a year before I figured it out. I managed to write my algorithm in PHP, in 18 lines of code. It is not the most efficient way to do a diff, but it is probably the easiest to understand.

It works by finding the longest sequence of words common to both strings, and recursively finding the longest sequences of the remainders of the string until the substrings have no words in common. At this point it adds the remaining new words as an insertion and the remaining old words as a deletion.
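
To make the recursion concrete, here is a rough Python sketch of the approach described above. It is not the PHP source linked below: the function names and the output format (a list of chunks tagged "=", "-" and "+") are choices made for this illustration.

```python
def longest_common_run(old, new):
    """Return (old_start, new_start, length) of the longest run of words
    that appears contiguously in both lists; length is 0 if there is none."""
    best = (0, 0, 0)
    prev = {}  # index in `new` -> length of the run ending at the previous old word
    for i, word in enumerate(old):
        cur = {}
        for j, other in enumerate(new):
            if word == other:
                length = prev.get(j - 1, 0) + 1
                cur[j] = length
                if length > best[2]:
                    best = (i - length + 1, j - length + 1, length)
        prev = cur
    return best


def diff(old, new):
    """Return a list of ('=', words), ('-', words) and ('+', words) chunks."""
    o, n, length = longest_common_run(old, new)
    if length == 0:
        # No words in common: the old words are a deletion,
        # the new words are an insertion.
        chunks = []
        if old:
            chunks.append(("-", old))
        if new:
            chunks.append(("+", new))
        return chunks
    # Keep the common run and recurse on what lies before and after it.
    return (
        diff(old[:o], new[:n])
        + [("=", old[o:o + length])]
        + diff(old[o + length:], new[n + length:])
    )


if __name__ == "__main__":
    for tag, words in diff("the quick brown fox".split(),
                           "the slow brown fox jumps".split()):
        print(tag, " ".join(words))
```

Each recursive call works on strictly shorter lists, so the recursion always terminates. Finding the longest run compares every old word with every new word, which is where the inefficiency mentioned above comes from.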

You can download the source here: PHP SimpleDiff

JSSpamBlock 1.4

Update: Due to lack of time and interest (on my part), I am no longer maintaining JSSpamBlock or ImageScaler.

It must look like JSSpamBlock is all I have been working on these days, which is the opposite of true. I have a couple cool projects coming along that I hope to post soon, but I fixed another oversight in JSSpamBlock. Basically, if you installed JSSpamBlock in a folder called /jsspamblock/ in the plugins directory (rather than putting the file directly in the plugins directory), the activate hook was not called, so the database tables were not created. This is now fixed. Thanks to david_kw of exfer network for discovering the problem and the solution. You can find the new JSSpamBlock 1.4 in the WordPress plugin directory.

JSSpamBlock 1.3

Update: Due to lack of time and interest (on my part), I am no longer maintaining JSSpamBlock or ImageScaler.

A user of JSSpamBlock found a bug which is rather undesirable; it incorrectly assumes that comments are spam if a new comment hash has since been generated. Versions up to 1.2 have this bug. The new version 1.3 does not, and can be found here: http://wordpress.org/extend/plugins/jsspamblock/ . Sorry for any inconvenience. This will be the last JSSpamBlock for a while, I promise ;).

Thanks to Stephen Darlington for finding this bug.

JSSpamBlock 1.2

Update: Due to lack of time and interest (on my part), I am no longer maintaining JSSpamBlock or ImageScaler.

I have made a few small changes to JSSpamBlock, my WordPress spam detection plugin. I found that the plugin had some problems with custom WordPress themes, since some theme developers apparently don’t include the comment form hook. I have added instructions on how to call JSSpamBlock manually from the template file. I have also fixed the plugin for older versions of WordPress which did not have the wp_die() function.

The plugin is now hosted at the WordPress Plugin Directory. You can find its page here: JSSpamBlock 1.2. If you have a working installation, there is no reason to upgrade.

Preventing Comment Spam with JavaScript bot detection

Update: Due to lack of time and interest (on my part), I am no longer maintaining JSSpamBlock or ImageScaler.

I got my first comment spam on this blog the other day. It inspired me to try an idea I got a few months back. My theory was that these bots aren’t very smart – they are programmed to post as many comments as possible on as many sites as possible, hoping that a handful of these comments would get past whatever system the blogger was using to prevent spam. I hypothesized that these bots did not understand JavaScript, and that by requiring some JavaScript to run in the browser I would be able to check with reasonable accuracy whether the comment was submitted by a human or a bot.

I wrote up a simple plugin to test the theory. I checked the logs to find that I was right. In fact, most of the bots that were spamming my blog did not even include the hidden element, which indicates that they were posting to the wp-comments-post.php file directly rather than accessing the form first. The bots that did access the form did not execute the JavaScript and therefore their comments were blocked. Since the trick only involves JavaScript, most users will not even notice the difference. Users without JavaScript simply need to follow the given instructions to copy a number to a text box in order to prove they are human. This is what users without JavaScript will see:

JSSpamBlock Screenshot

If you want to use JSSpamBlock on your blog, check out the JSSpamBlock project page.
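
For readers curious about the mechanics, the kind of check described above might look roughly like this on the server side. This is a hedged sketch in Python rather than the plugin's actual PHP: the field names challenge_js and challenge_typed, the in-memory `issued` set (standing in for a database table), and all function names are assumptions made for the example.

```python
import secrets

# Hypothetical in-memory store of issued values; a real plugin
# would keep these somewhere persistent, such as a database table.
issued = set()


def issue_challenge():
    """Create a random number to embed in the comment form."""
    value = str(secrets.randbelow(10**6))
    issued.add(value)
    return value


def render_form_fields(value):
    """Build the extra form fields: a script fills in the hidden field,
    and visitors without JavaScript are asked to type the number instead."""
    return (
        '<input type="hidden" name="challenge_js" id="challenge_js" value="">\n'
        f'<script>document.getElementById("challenge_js").value = "{value}";</script>\n'
        f"<noscript>Please type {value} here: "
        '<input type="text" name="challenge_typed"></noscript>'
    )


def is_probably_human(form):
    """Accept a comment only if one of the fields carries an issued value.
    Bots that post directly, or that ignore the script, send neither."""
    answer = form.get("challenge_js") or form.get("challenge_typed")
    return answer in issued
```

A bot that posts straight to the comment handler never sees the form, so it submits no value at all; a bot that fetches the form but does not run the script submits an empty hidden field. Both are rejected by is_probably_human.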

webFractal: Web-based Fractal Explorer

Last weekend, I won a nice new Toshiba laptop in a local software competition. My entry was a web-based fractal explorer. I had a lot of fun making it, and it is fun to play with as well. I have decided to release it under an open-source license so that other people can play around with it (see the download link at the bottom of this post).

Unfortunately, I do not have access to a powerful Tomcat server with a lot of bandwidth, so I can’t host an online demo. If anyone has the resources and is interested in hosting it, please let me know.

Here are some screenshots of the application in action:

Since it is a web-based application, any supported web browser can be the client (see the documentation for a list of supported browsers; any modern Gecko-based browser is supported, as well as IE and Opera). The client interface is loosely based on Google Maps. The server is a Java Servlet run through Tomcat. You can read more about how it works in the documentation.

Downloads: