Data Structures for Range-Sum Queries (slides)

This week I attended the Canadian Undergraduate Mathematics Conference. I enjoyed talks from a number of branches of mathematics, and gave a talk of my own on range-sum queries. Essentially, range-aggregate queries are a class of database queries which involve taking an aggregate (in SQL terms, SUM, AVG, COUNT, MIN, etc.) over a set of data where the elements are filtered by simple inequality operators (in SQL terms, WHERE colname {<, <=, =, >=, >} value AND …). Range-sum queries are the subset of those queries where SUM is the aggregation function.

Due to the nature of the conference, I did my best to make things as accessible to someone with a general mathematics background rather than assuming familiarity with databases or order notation.

I’ve put the slides (pdf link, embedded below also) online. They may be hard to follow as slides, but I hope they pique your interest enough to check out the papers referenced at the end if that’s the sort of thing that interests you. I may turn them into a blog post at some point. The presentation begins with tabular data and shows some of the insights that led to the Dynamic Data Cube, which is a clever data structure for answering range-sum queries.

Posted in Computer Science, Math | Leave a comment

An experiment in A/B Testing my Résumé

Objective

I’ll admit it: my résumé doesn’t stand out. I’ve had some great internships, but also a tendency to work for companies that aren’t (yet!) household names. And though I’m doing fine academically, it’s not well enough to stand out on my marks alone.

On the other hand, my blog lets me stand out. I’ve had a few opportunities to meet and interview with some great people and companies because they read my blog. Naturally, then, the primary goal of my résumé is to get people to visit my blog. Since I don’t quite have the audacity to make my résumé a note telling people to visit my blog, I’m faced with the problem of how to optimize my résumé to ensure people see my blog. That’s where this experiment comes in.

Methodology

I started thinking about variables in my résumé that could affect the rate at which people viewed my blog. I narrowed it down to three that I could easily test.

The first is the length of the résumé. My friends Rajesh and Shams are adamant about keeping their résumés down to a single page. Their arguments are sound, but I wanted to see if the data would back up their beliefs. I created a “short” version of my résumé which I squeezed into one page by omitting the Awards section and removing some skills.

Second, I wanted to know how my grades affected the résumé. Obviously I couldn’t start making things up, but since my major average differs from my overall average by a good margin, I had two numbers that I could use truthfully with a subtle change in wording.

Résumé variations with different grades


Finally, I wanted to test whether it pays to include social media links on the résumé. I chose GitHub, LinkedIn, and twitter as the links to test. GitHub was an obvious choice because it emphasizes my free-time projects. LinkedIn seemed like a good one to test, given that it is for professional networking. I chose twitter as another variation because I was curious to see what the reaction to a more personal social networking site would be. All résumés linked back to my blog as well. Finally, I had another resume which linked only to my blog, as a control group.

Résumé variation with a link to GitHub


In all, these three variations resulted in 16 unique résumés. Fortunately I didn’t have to create them all by hand. I was already using LaTeX for my résumé, using one of the elegantly typeset templates from The CV Inn as a base. I simply threw my latest résumé into a Mako Template and wrote some python code to spit out the 16 possible variations of the LaTeX code. Then I used pdflatex to create pdf files. Since I was putting the résumés online, I made a landing page. To keep things simple, the landing page was just an image version of the résumé with a link to download the pdf, and just enough CSS to look presentable.

I wanted to track three things: how many people downloaded the résumé, how many people scrolled to the bottom of the landing page, and how many people visited my blog. The first I accomplished by logging downloads. The second I accomplished with jQuery and an Ajax callback. The third I accomplished with a tracking image, just like hit counters in the 90s. I used IP address and cookies to match up actions with the associated résumé.

The only remaining problem was how to get hundreds of people to see my résumé in a short period of time. Fortunately Google offered me $110 in AdWords credits as a Google Analytics user, so I took advantage of that and ran ads on Google searches. Here is one of the half-dozen variations I ran:

One of the Google ads I ran

Results

After less than a week, I managed to exhaust my AdWords budget and gather a fair bit of data. I wrote a few hadoop jobs with my new toy and then brought the data into R for analysis and visualization.

Length

As you might expect, the people who encountered the short resume were much more likely to scroll to the bottom. Just over half did, verses just under a third of those presented with the long resume. This makes sense because there is less to scroll through, but it was nice to have the data confirm my suspicions. Note that in the following graph, and all others in this post, the grey lines indicate the 90% confidence interval.


The short résumé also resulted in more downloads and blog views, but not enough to be statistically significant with the amount of data I collected.

Grades

The grades shown on the résumé didn’t affect any of the metrics I was measuring in a statistically significant way.

Links

I was surprised to find that the non-blog link shown on my résumé affected the frequency of click-throughs to my blog. Even adding a link to my GitHub profile more than halved the frequency of a clickthrough to my blog. LinkedIn and twitter were even worse.


I created a heatmap-like visualization from the relative significances of each link to each other. For example, the upper leftmost cell means that it is 97.2% likely that if a sufficiently large group of people were exposed to each of the LinkedIn and blog-link-only versions of my résumé, the group that saw the blog-link-only version would visit my blog more. Jesse E. Farmer has written more about the details of how this is calculated.


Oddly, the effect was reversed when you consider downloads rather than blog views. The résumés without any social media links were far less likely to be downloaded than those with. Even a résumé with a twitter profile did better than one without, though not by enough to be statistically significant.



The additional links also reduced the frequency of readers scrolling to the bottom of the page.


Conclusion

There are two main things I learned from this experiment. First, I’m going to keep social network links off of my résumé. Although they increased the download rate, they decreased visits to my blog. Since the latter is my priority, I’m not going to start adding social networks to my résumé.

Second, the short résumé did better in every way. However, the improvement in blog views was not statistically significant. For now, I’m keeping my online résumé at two pages, but I will use the one-page version in print.

There’s a number of disclaimers I should make here. For one, even if my findings are true of my résumé, they might not be true of other résumés. Maybe a change in layout would diminish the effect of linking to social media profiles, or make the longer résumé convert better. I should also point out that I have no way of knowing who my audience was. They probably weren’t all in a position to hire a programmer or data scientist, so the factors that make them visit my blog may or may not have the same effect on those who are.

And finally, a shameless plug. I’m looking for an interesting data science internship this fall (September to December 2010). If you’re doing cool things with data, I’d be glad to hear from you. My contact information is in the sidebar.

Posted in Data Mining, Math, R, Statistics | 4 Comments

Why R doesn't suck

I first encountered the R programming language a few years ago when I needed to make some plots. Although I’ve used it occasionally since, I always considered it a sort of “Perl for statisticians” — a useful swiss-army knife with ugly syntax and inconsistent semantics. My workflow generally involved manipulating the data in Python and using R to make a simple plot, minimizing the amount of R code I wrote as much as possible.

When I recently decided to sit down and properly learn the language, I was pleasantly surprised that underneath the line noise was an interesting and unique language. R is a descendant of LISP and, deep down, maintains some of the beauty its ancestor. It also borrows some unique and interesting features from other functional and dynamic languages.

Code is Data

R is true to its LISP roots in that you can create, modify, and evaluate parse trees from the code itself. One way to do so is with the quote() special-function, which returns its argument, unevaluated, as an expression object that can be traversed, modified and evaluated.

A fun (though not especially useful) consequence of this is that you can write an expression which returns itself as a quote:

> (function(x) substitute((x)(x)))(function(x) substitute((x)(x)))
(function(x) substitute((x)(x)))(function(x) substitute((x)(x)))
> expression <- (function(x) substitute((x)(x)))(function(x) substitute((x)(x)))
> expression == eval(expression)
[1] TRUE

Optional Laziness

By default, R uses eager evaluation, so expressions are evaluated as soon as they are assigned. However, R takes after functional languages like Haskell and OCaml in that it allows lazy evaluation, where expressions are only evaluated at the time they are first used.

For example, consider the Haskell code:

m = sum [1..]

Where sum returns the sum of a list and [1..] is the (infinite) list of all natural numbers. In most languages, the assignment would cause the program to loop forever trying to sum all the natural numbers so it can assign that value to m. In Haskell, the assignment does complete; it simply assigns the expression sum [1..] to m so that it can be evaluated when the value of m is first used.

In R we can accomplish something similar with the delayedAssign() function:

delayedAssign("m", sum(1:Inf))

Note that in R, unlike OCaml, the variables may be explicitly made lazy with delayedAssign, but are evaluated automatically when they are used.

Unfortunately, R evaluates lazy variables when they are pointed to by a data structure, even if their value is not needed at the time. This means that infinite data structures, one common application of laziness in Haskell, are not possible in R.

Operators are functions

When using higher-order functions, it’s sometimes useful to be able to treat operators as functions. Python accomplishes this in a clunky way: there is an operator module which redefines the built-in operators as functions. R takes a more functional approach. As in Haskell and O’Caml, operators are just syntactic sugar for ordinary functions. Enclosing any operator in backticks lets you use it as if it were an ordinary function. For example, calling `+`(2, 3) returns 5.

In fact, both the infix and prefix form are indistinguishable once they are parsed.

> quote(3 + 4) == quote(`+`(3, 4))
[1] TRUE

One surprising fact in R is that the assignment operators (<-, <<- and =) are functions like any other. As a result, they can be overwritten or passed around as desired, though neither strikes me as a particularly good idea.

Continuations

Continuations in R are a way of “breaking out” of a computation and jumping down the call stack to return early. The R function callCC() (call with current continuation) takes one argument, a function. It then evaluates that function, passing in a special function as an argument. callCC() then returns the first value that the special function is called with, or the return value of evaluating its argument if the special function is not called before the function returns.

To give you a better idea of what that looks like, consider this example:

> callCC(function(m) {return(4)})
[1] 4
> callCC(function(m) {m(2); return(4)})
[1] 2

Calling the function m(2) essentially cuts the computation short, drops down in the call stack to callCC, and returns 2.

If you’ve used continuations in another language, note that in R the exit function can only be called before callCC() returns. This makes R’s continuation semantics less powerful than those of languages like Scheme, Smalltalk, and Ruby.

R is not without its flaws and legacy baggage (you can trace its roots back to the S programming language 35 years ago), but once you learn to use it right, it’s a very powerful and indispensable language.

Posted in Data Mining, Haskell, Math, Programming, Python, R, Ruby | 1 Comment

Groupon Math: Data Scraping to Estimate Revenue

There’s been a lot of talk recently about the Chicago startup Groupon. Groupon brands itself as a group-buying site, but it’s really more of a localized version of what woot.com does. They post a new deal (which they call a Groupon) every day, available only on that day. If enough people want to buy it, everyone gets it for a substantial discount. Otherwise, nobody gets anything, but this rarely happens from what I can tell.

According to TechCrunch, the company is in the process of raising money at a $1.2 billion dollar valuation. There was lots of speculation about the future worth of the company, but little information about current revenue, even though there is a lot of raw data readily available in the site’s archives. I put together a scraper (in just a few lines of Python, thanks to BeautifulSoup) and gathered a total of 1065 past Groupons.

It isn’t clear how Groupon decides which Groupons to display in its archives. Presumably they are the better selling ones, so my sample is not a random sample, which would affect the numbers. Everything that follows should be taken with a grain of salt, but they should be reasonable as ballpark figures.

According to the data I collected, the average Groupon costs $30 and entitles the buyer to 57% off. On average 1155 people purchase it, resulting in $28,130 of revenue to Groupon ($28,130 is less than 1155 * $30 = $34,650 because, apparently, people are more willing to buy the cheaper Groupons.)

Averages are nice, but what I really wanted was totals. I was able to approximate what fraction of the data I had because Groupon advertises the “Total dollars saved” and “Total Groupons bought” on every page. By dividing my numbers by those, I determined that I had a little over a third of the data. Specifically, my data covered 31.2% of Groupons sold, and 37.4% of total savings.

Extrapolating the data I had (again, with the disclaimer that my sample may not be random), I calculated the total revenue since the beginning to be $80,188,176. If Groupon takes a 35% cut (to take a wild guess), $28 million of that is left after Groupon pays the company offering the deal. According to CrunchBase Groupon employs 90 people. I won’t speculate as to the operating costs of Groupon over the last year and a bit of operation, but once you subtract that number the rest is profit to date.

Looking on a monthly basis, the recent growth of the company is clear. A third of the total savings — in over a year of business — happened last month. This works out to $26,706,059 in revenue last month alone, or about $9.3 million (less the operating costs) profit if you assume they take a 35% cut. The below graph shows the growth by month.


Whether or not it’s a $1.2 billion company (BusinessInsider says that’s actually low, though without any quantitative justification), they’re clearly doing well for a company just over a year after launch.

Here are a couple more graphs constructed from the data (click to enlarge).

(Graphs were created with Google Fusion Tables.)

Update May 26: a few more graphs with more recent data follow.

Posted in Data Mining, Math | 11 Comments

Do we really need another programming language?

Yesterday, a group inside Google released a new programming language called go. Among the many comments on the new language, I noticed a number of people rhetorically asking if there is really room for more programming languages. I’ve seen the same sentiment expressed before, seemingly whenever someone releases a pet language (Arc, Factor) or rediscovers an older programming language (Smalltalk, Erlang, LISP).

I’m convinced that there is, in fact, room for more programming languages.

I recently read The Mythical Man Month, the 1975 classic about software engineering. Two things stuck out at me. First, that compared to programmers in 1975, programmers today are spoiled by hardware. Many software constraints, like memory and CPU speed — let alone the size of the program on disk — are not issues worth much thought for many software projects. There are of course exceptions: low-level systems code, graphics, data-intensive calculations (in other words, “the fun stuff”) — but in general, application programmers face far fewer constraints than they did 35 years ago. The second thing that struck me is how comparatively little programming has changed since then. Despite the loosening of constraints, we still write applications that, say, directly use pointer arithmetic and manually allocate and deallocate memory.

It’s not just C++, either. Here’s an experiment. Consider the last program you wrote in any language. Imagine you had to describe, in plain English, your program. The description must be complete enough that another programmer could read it and produce your program. Not your program exactly, of course, but one that satisfies the same requirements your program does. (If you actually try this, you may get something that looks like it was written by a lawyer who charges by the comma. That’s fine.)

Now compare your description to the program’s source code. The source code is almost certainly several times longer, less intuitive, and less descriptive. And yet, if your description is precise enough, all the source code does is communicate the same idea to the compiler.

There are two main things that differentiate the source from your description. One is that the source is designed so that a computer can parse it. This amounts to having some regular structure and a well-defined meaning to all the symbols. The other difference is that the description defines what the program should do, and the source describes how. These differences both contribute to the verbosity of the source code, but I suspect the latter contributes more (consider how concise formal math notation can be, despite having structure and being well-defined).

The obvious way to make a language more expressive, then, is to make it more declarative — to describe what the program does, not how it does it. Ideally our programs could be completely declarative. We could translate the description we wrote for the experiments a few paragraphs back into a more formal language, without adding any implementation details. Then we would have a compiler that created a program which satisfied our description and produced an executable. This type of programming is actually possible in certain, constrained domains (SQL and Prolog, for example), but unfortunately there isn’t a good general algorithm to translate any program we describe declaratively into machine instructions.

The best we can do, then, is to give imperative languages more declarative features. In fact, many high-level language features fall under this category. Take continuations, which let you say “resume code at this point” instead of how “store these pieces of state there, then restore the state and return here when called”. Or list comprehensions, which let you say “give me a list satisfying these properties”, rather than “loop through these elements and do that to them, discarding them if this is not true”. Or iteration constructs, which let you say “do this stuff for each element in that”, without specifying that space must be allocated for a counter, how to look up each element, or that the counter is incremented after each element.

All of those concepts have existed for decades. When available, they make code easier to write, understand, and modify. But even today, if you want to develop a native, cross-platform application, you have to give them all up or resort to using a relatively obscure language like Haskell.

It’s unfair to expect C++ to evolve to support these features, because it has decades of legacy code that it must be backwards compatible with. The language must be improved by either breaking legacy code or adding language features in an awkward sort of tacked-on way. Neither works very well, so the language evolves slowly or not at all.

I don’t mean to knock C++; sometimes it’s exactly what the problem needs. But I think a lot of programmers’ time is wasted solving high level problems with low level tools.

I’m doubtful that any one language will ever be the right tool for every job. Certain abstractions work well with certain domains (continuations with web programming, the actor model with concurrency, etc.), and abstractions that are baked into the language tend to have advantages to those that aren’t.

That’s why I welcome new programming languages. And as long as programming languages are less expressive than written English, I’ll keep welcoming them.

Posted in Programming | 19 Comments

N-Queens in a Tweet

Most people who like puzzles or study computer science have probably encountered the famous N-Queens problem. If you haven’t, before reading any further, try this online version of the most popular form, the 8-Queens problem.

The 8-Queens problem is to find positions on a chess board for eight queen chess pieces so that none of them “threaten” the others. Since a queen in chess can move horizontally, vertically, and diagonally, this means placing each queen on her own horizontal, vertical, and diagonal lines.

The N-Queens problem is a generalization of the 8-Queens problem with (surprise) N queens instead of 8, on an N \times N board instead of a standard 8 \times 8 chess board.

In the spirit of Perl Golf, I wondered what the minimal amount of code would be to solve the N-Queens problem. As an arbitrary target, I picked 140 characters, also the limit on message length imposed by Twitter. I picked Haskell, knowing its list functions and generally terse syntax would come in handy. Naturally, I broke all the conventions that would cost extra characters, like using whitespace and explicit function signatures.

Since Haskell modules have a bit of overhead that would use up my valuable characters, I had to cheat a bit by leaving the module declaration out. The only way to run the program unmodified is to name it .ghci to trick GHCi into using it as a start script.

The result follows. I used 140 characters exactly, including the newline character.

import Data.List
let nqueens n=[zipWith(\x y->x:show y)['a'..]x|x<-permutations[1..n],(length$nub$zipWith(+)(x++x)$[0,-1..1-n]++[n..])==n*2]

Pretty, eh? Fine, maybe not, but it works:

> nqueens 4
[["a2","b4","c1","d3"],["a3","b1","c4","d2"]]
> nqueens 5
[["a2","b4","c1","d3","e5"],["a3","b1","c4","d2","e5"],
["a3","b5","c2","d4","e1"],["a4","b2","c5","d3","e1"],
["a2","b5","c3","d1","e4"],["a1","b3","c5","d2","e4"],
["a4","b1","c3","d5","e2"],["a5","b3","c1","d4","e2"],
["a1","b4","c2","d5","e3"],["a5","b2","c4","d1","e3"]]

The result is a list of queen positions given as coordinates on a chess board (generalizing them in the obvious way).

As an example, the result ["a1","b5","c8","d6","e3","f7","g2","h4"] would look like this:

Eight Queens Solution

Eight Queens Solution (created with ChessUp)

Before I explain how it works, I'll put the whitespace back in.

import Data.List
let nqueens n = [zipWith (\x y -> x : show y) ['a'..] x | x <- permutations [1..n], (length $ nub $ zipWith (+) (x ++ x) $ [0,-1..1-n] ++ [n..]) == n*2]

Ignore diagonal movements for now — we could call it the N-Rooks problem. If we want to place N rooks on a chess board so none of them are threatened, each needs to be in its own column and row. If you number the rows 1, 2 \ldots N, each solution corresponds to a permutation of these numbers. The board above, for example, would be [1, 5, 8, 6, 3, 7, 2, 4]. So to solve the N-Rooks problem in Haskell, all we need is permutations [1..n].

That's a good start, but we have to consider the diagonal lines as well. The simplest way to do this is to assign the columns numbers from 1 to N. Then add those to the number of the row the queen in that column is in (remember, since we are using permutations, there is exactly one queen in each column).

For example, if we add [1, 5, 8, 6, 3, 7, 2, 4] and [1, 2, 3, 4, 5, 6, 7, 8], we get [2, 7, 11, 10, 8, 13, 9, 12]. The resulting numbers are all different, so the queens are all on different diagonal lines — at least for downward diagonals. For upward diagonals, we could do the same thing but use [N..1] instead of [1..N].

Since every character counts here, it's better if we can do both diagonal directions at once. This complicates things a bit because without caution, the numbers for upward diagonal lines will overlap with those for downward diagonal lines. Fortunately, we can just add a constant factor of at least 2N to either set of lines. This is essentially what the code above does, although the arithmetic is a bit more cryptic to keep the character count down.

The resulting list will have 2N elements — N for the upward diagonal lines and N more for the downward ones. Now we can use nub to rid the list of duplicates. If the list had no duplicates — that is, each queen is on her own diagonal lines — the length of that list will still equal 2N after duplicates are removed. That makes it a valid solution, so it is included in the result.

I should note that this algorithm isn't (nearly) as efficient as a standard backtracking approach as described on Wikipedia's eight queens puzzle solutions article. But good luck getting one of those solutions into a tweet.

Posted in Haskell | 8 Comments

Start-ups from UWaterloo Class of 2009

I was curious to know how many companies were founded by UWaterloo’s class of 2009, so I put together a list. It’s probably incomplete (let me know in the comments), but it may be of interest to anyone who follows the entrepreneurial community in Waterloo.

NeverBored Studios is a Waterloo-based game company founded by Velocity “alumni” Jimmy Ho, Thomas Ang, Orin Bishop, and Steve Truong. They are working on an iPhone game called ThreadBound, which they demo on YouTube. They were recently covered on TechVibes.

EightyTwenty Group, founded by Ray Cao and Aditya Shah in Toronto. They are working on enterprise software for law firms. Ray is a former president of Impact. I had the pleasure of working with Ray at Polar Mobile (another UWaterloo start-up) last summer. He and Aditya are smart and capable guys, so I’m looking forward to seeing EightyTwenty Group progress. EightyTwenty Group was featured in a National Post article in March.

Giftah originated as a Velocity project. Although it is only a part-time project for founders Rezart Bajraktari, Nick Belyaev and Henry Finn, they’ve managed to create a functional site that has been featured on CTV and The Montreal Gazette. They also won $2,000 at the Velocity Project Exhibition.

Unsynced is a Toronto software start-up founded by Ted Livingston, Vassili Skarine, and Vick Yao. They are developing BlackBerry software to allow users to sync their music collection and listen from their BlackBerry. Unsynced won free patent filing services at the (Winter 2009) Velocity Project Exhibition. They were also featured on TechVibes.

Mississauga-based Soeie is developing a tool for organizing research called Thinkpanda. I can’t find names of the entire founding team, but I do know that Fahd Butt has the title “CTO” (“Chief Thinker Officer”). Technically I think these guys are class of 2008, but since Thinkpanda is a new product and looks promising, I’ve included them anyway. They also have a cool logo.

Update: I found another (thanks, Peter)

Allerta started as the Velocity project of Eric Migicovsky. They are working on a wristwatch which, among other things, displays the caller ID from your BlackBerry when you get a call. Eric was profiled in The Record last October after winning a $1,000 pitch competition. Allerta is based in Waterloo.

A few observations:

  • Most start-ups left Waterloo to go to Toronto.
  • Three of five were part of Velocity. To me, this is some validation for Velocity after just two terms.
  • Two consumer web apps, two mobile apps, and one enterprise software company.
  • Most of the founders were engineering students.

Update 2: corrected Fahd Butt’s title (thanks, Rajesh)

Posted in Waterloo | 4 Comments

Python Debugging with Decorators

I’ve written a little python function which I have found to be very helpful for debugging. It takes a function, and returns a function which is identical to the original except that it prints a message to the console with useful information every time the function is called or returns.

Here is the function:

# Number of times to indent output
# A list is used to force access by reference
__report_indent = [0]

def report(fn):
    """Decorator to print information about a function
    call for use while debugging.
    Prints function name, arguments, and call number
    when the function is called. Prints this information
    again along with the return value when the function
    returns.
    """

    def wrap(*params,**kwargs):
        call = wrap.callcount = wrap.callcount + 1

        indent = ' ' * __report_indent[0]
        fc = "%s(%s)" % (fn.__name__, ', '.join(
            [a.__repr__() for a in params] +
            ["%s = %s" % (a, repr(b)) for a,b in kwargs.items()]
        ))

        print "%s%s called [#%s]"
            % (indent, fc, call)
        __report_indent[0] += 1
        ret = fn(*params,**kwargs)
        __report_indent[0] -= 1
        print "%s%s returned %s [#%s]"
            % (indent, fc, repr(ret), call)

        return ret
    wrap.callcount = 0
    return wrap

The function can be used as a decorator. For example, in this simple (and inefficient) recursive Fibonacci sequence function:

@report
def fibonacci(n):
    if n in [0,1]:
        return n
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)

The result:

>>> fibonacci(4)
fibonacci(4) called [#1]
 fibonacci(3) called [#2]
  fibonacci(2) called [#3]
   fibonacci(1) called [#4]
   fibonacci(1) returned 1 [#4]
   fibonacci(0) called [#5]
   fibonacci(0) returned 0 [#5]
  fibonacci(2) returned 1 [#3]
  fibonacci(1) called [#6]
  fibonacci(1) returned 1 [#6]
 fibonacci(3) returned 2 [#2]
 fibonacci(2) called [#7]
  fibonacci(1) called [#8]
  fibonacci(1) returned 1 [#8]
  fibonacci(0) called [#9]
  fibonacci(0) returned 0 [#9]
 fibonacci(2) returned 1 [#7]
fibonacci(4) returned 3 [#1]
3

The level of indent reflects the level of recursion, and the [#...] at the end of each line is the number of times the function has been called.

The level of indent is independent of the function being called, so it is helpful with mutual recursion as well. For example, when used with the functions even and odd from my earlier post on tail recursion, the result looks like this:

>>> even(5)
even(5) called [#1]
 odd(4) called [#1]
  even(3) called [#2]
   odd(2) called [#2]
    even(1) called [#3]
     odd(0) called [#3]
     odd(0) returned False [#3]
    even(1) returned False [#3]
   odd(2) returned False [#2]
  even(3) returned False [#2]
 odd(4) returned False [#1]
even(5) returned False [#1]
False

I find it useful to stick @report before the function I am having trouble with, and use comments to turn it on and off while I’m debugging that function. It can also be used at times other than function declaration, for example: report(base64.encodestring)(‘test’).

Update (July 6, 2008): Fixed so that keyword arguments are printed as well.

Update (August 16, 2008): Changed .__repr__() to the more proper repr().

Posted in Python | Leave a comment

SimpleDiff in Python

A while ago I posted a PHP implementation of a diff algorithm I came up with1. Since it was well received, and it’s a useful little algorithm to have, I created a Python version as well.

There are a few performance improvements as well. The PHP version creates an array in memory proportional to the square of the size of the input, while the Python version’s array is directly proportional to the size of the input. I also sped up how the algorithm finds the indexes of the “new” elements in the “old” array.

Download simplediff.py

1 It is probably the same algorithm that others use, but I haven’t gotten around to getting an ACM membership to access the related papers

Posted in Python | Leave a comment

Tail recursion in Python

After spending a lot of time in Scheme, it’s hard not to think in recursion from time to time. When I recently started to improve my Python skills, I missed having Scheme optimize my tail recursive calls.

For example, consider the mutually recursive functions even and odd. You know a number, n, is even if it is 0, or if n – 1 is odd. Similarly, you know a number is not odd if it is 0, and that it is odd if n – 1 is even. This translates to the python code:

def even(x):
  if x == 0:
    return True
  else:
    return odd(x - 1)

def odd(x):
  if x == 0:
    return False
  else:
    return even(x - 1)

This code works, but only for x < 1000, because Python limits the recursion depth to 1000. As it turns out, it is easy to get around this limitation. Included below is a generic tail_rec function that could be used for most cases where you need tail recursion, and an example of it used for the odd/even problem.

def tail_rec(fun):
   def tail(fun):
      a = fun
      while callable(a):
         a = a()
      return a
   return (lambda x: tail(fun(x)))

def tail_even(x):
  if x == 0:
    return True
  else:
    return (lambda: tail_odd(x - 1))

def tail_odd(x):
  if x == 0:
    return False
  else:
    return (lambda: tail_even(x - 1))

even = tail_rec(tail_even)
odd = tail_rec(tail_odd)

It’s not as pretty as the Scheme version, but it does the trick. Of course, the odd/even functions are just for the sake of a simple example and have no real-world use, but the tail_rec function could be used in practice.

April 2009 Update: this article has recently had some popularity. One of the more common comments is that tail_rec could be used as a decorator. In fact, this isn’t true, because even and odd need access to the raw, undecorated versions of each other in the creation of the lambda.

Posted in Python | 16 Comments