<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paul Butler &#187; Math</title>
	<atom:link href="http://paulbutler.org/archives/category/math/feed/" rel="self" type="application/rss+xml" />
	<link>http://paulbutler.org</link>
	<description></description>
	<lastBuildDate>Tue, 28 Feb 2012 14:45:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Groupon Visualization in HBR</title>
		<link>http://paulbutler.org/archives/groupon-visualization-in-hbr/</link>
		<comments>http://paulbutler.org/archives/groupon-visualization-in-hbr/#comments</comments>
		<pubDate>Mon, 25 Jul 2011 10:41:06 +0000</pubDate>
		<dc:creator>Paul Butler</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://paulbutler.org/?p=501</guid>
		<description><![CDATA[A version of my Groupon visualization appears in the latest Harvard Business Review. At the suggestion of their editorial and design staff, I re-did the visualization to focus on a smaller subset of the data: deals in San Francisco. I &#8230; <a href="http://paulbutler.org/archives/groupon-visualization-in-hbr/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://hbr.org/2011/07/vision-statement-deconstructing-the-groupon-phenomenon/ar/1">A version of my Groupon visualization</a> appears in the latest Harvard Business Review. At the suggestion of their editorial and design staff, I re-did the visualization to focus on a smaller subset of the data: deals in San Francisco.</p>
<p><a href="http://hbr.org/2011/07/vision-statement-deconstructing-the-groupon-phenomenon/ar/1"><img src="http://paulbutler.org/wp-content/uploads/2011/07/groupon_small.png" alt="" title="Groupon spread from HBR" width="480" height="315" class="alignnone size-full wp-image-502" /></a></p>
<p>I must give the HBR staff a lot of credit for the final product &mdash; they made a visualization designed to be interactive work in print. I also created an <a href="http://hbr.org/tablet/0711/vision-statement">interactive version</a> for their website.</p>
]]></content:encoded>
			<wfw:commentRss>http://paulbutler.org/archives/groupon-visualization-in-hbr/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>What $480M of Gross Revenue Looks Like to Groupon</title>
		<link>http://paulbutler.org/archives/what-480m-of-gross-revenue-looks-like-to-groupon/</link>
		<comments>http://paulbutler.org/archives/what-480m-of-gross-revenue-looks-like-to-groupon/#comments</comments>
		<pubDate>Mon, 28 Feb 2011 11:36:36 +0000</pubDate>
		<dc:creator>Paul Butler</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://paulbutler.org/?p=491</guid>
		<description><![CDATA[On Saturday, the Wall St. Journal posted details of an internal Groupon memo that reported $760 million in revenue last year. The WSJ article came just as I was finishing up a visualization of some data I had collected on &#8230; <a href="http://paulbutler.org/archives/what-480m-of-gross-revenue-looks-like-to-groupon/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>On Saturday, the Wall St. Journal <a href="http://online.wsj.com/article/SB10001424052748703408604576164641411042376.html">posted details of an internal Groupon memo</a> that reported $760 million in revenue last year.</p>
<p>The WSJ article came just as I was finishing up a visualization of some data I had collected on Groupon deals, which gives perspective on that massive number in terms of the individual deals.</p>
<p>Each box is a deal. I used height to represent number sold, and width to represent the price. Area is therefore gross revenue, and colour is city for the top 20 cities.</p>
<p><a href="http://s3.amazonaws.com/gpvis/gpvis.svg"><img src="http://paulbutler.org/wp-content/uploads/2011/02/gpvis.png" alt="" title="gpvis" width="400" height="400" class="alignnone size-full wp-image-492" /></a></p>
<p><a href="http://s3.amazonaws.com/gpvis/gpvis.svg">(Click for a larger, interactive version. Only works in browsers that support SVG, i.e. not IE)</a></p>
<p>The 2D-bin-packing was implemented in R and C++, based on code by <a href="https://github.com/mackstann/binpack">mackstann</a>. Thanks to my friends <a href="http://www.getinpulse.com/">Eric</a> and <a href="http://www.lisazhang.ca/">Lisa</a> for feedback on a draft of the visualization.</p>
]]></content:encoded>
			<wfw:commentRss>http://paulbutler.org/archives/what-480m-of-gross-revenue-looks-like-to-groupon/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>Visualizing Facebook Friends: Eye Candy in R</title>
		<link>http://paulbutler.org/archives/visualizing-facebook-friends/</link>
		<comments>http://paulbutler.org/archives/visualizing-facebook-friends/#comments</comments>
		<pubDate>Sat, 18 Dec 2010 20:31:54 +0000</pubDate>
		<dc:creator>Paul Butler</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Math]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://paulbutler.org/?p=467</guid>
		<description><![CDATA[Earlier this week I published a data visualization on the Facebook Engineering blog which, to my surprise, has received a lot of media coverage. I&#8217;ve received a lot of comments about the image, many asking for more details on how I &#8230; <a href="http://paulbutler.org/archives/visualizing-facebook-friends/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Earlier this week I <a href="http://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919">published a data visualization on the Facebook Engineering blog</a> which, to my surprise, has <a href="http://www.economist.com/blogs/dailychart/2010/12/data_visualisation_1">received</a> <a href="http://blogs.forbes.com/mikeisaac/2010/12/13/what-10-million-facebook-friendships-looks-like-a-data-visualization/?boxes=Homepagechannels">a lot of</a> <a href="http://www.nbcbayarea.com/news/tech/Facebook-Map-Reveals-Web-of-Connections-111883594.html">media</a> <a href="http://newsfeed.time.com/2010/12/14/tracking-facebook-friendships-creates-a-stunning-global-map/">coverage.</a></p>
<p><a href="http://paulbutler.org/wp-content/uploads/2010/12/163413_479288597199_9445547199_5658562_14158417_n.png"><img src="http://paulbutler.org/wp-content/uploads/2010/12/163413_479288597199_9445547199_5658562_14158417_n-1024x509.png" alt="" title="Facebook Friends Visualization" width="640" height="318" class="alignnone size-large wp-image-468" /></a></p>
<p>I&#8217;ve received a lot of comments about the image, many <a href="http://www.quora.com/What-data-visualization-software-did-Paul-Butler-use-to-create-the-Facebook-friend-visualization-map-published-on-12-14-10">asking for more details on how I created it</a>. When I tell people I used <a href="http://www.r-project.org/">R</a>, the reaction I get is roughly what I would expect if I told them I made it with <a href="http://en.wikipedia.org/wiki/Paint_(software)">Microsoft Paint</a> and a bottle of <a href="http://en.wikipedia.org/wiki/J%C3%A4germeister">Jägermeister</a>. Some people even <a href="http://news.ycombinator.com/item?id=2002859">questioned whether it was actually done in R</a>. The truth is, aside from the addition of the logo and date text, the image was produced entirely with about 150 lines of R code with no external dependencies. In the process I learned a few things about creating nice-looking graphs in R.</p>
<p><strong>Transparency and Faking It</strong></p>
<p>My first attempt at plotting the data involved plotting very transparent lines. Unfortunately there was just too much data to get a meaningful plot &mdash; even at very low opacity, there were enough lines to make the entire image just a bright blob. When I increased the transparency more, the opacity was rounded down to zero by my graphics device and the result was that nothing was drawn.</p>
<p>The solution was to manipulate the drawing order of the lines. I used a simple loop over my data to draw the lines, so it was easy to control which lines were drawn first using <samp>order()</samp>. I created an ordering based on the length of the lines, so that longer lines were drawn &#8220;behind&#8221; the shorter, more local lines. Then I used <samp>colorRampPalette()</samp> to generate a color palette from black to blue to white, and colored the lines according to the order in which they were drawn.</p>
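<p>The ordering and coloring steps can be sketched in a few lines. This is a rough Python illustration rather than the original R code: the helper names <samp>ramp()</samp> and <samp>draw_plan()</samp> are made up for the example, lines are assumed to be pairs of already-projected (x, y) points, and the interpolation only approximates what <samp>colorRampPalette()</samp> does.</p>

```python
import math

def lerp_color(c0, c1, t):
    """Linearly interpolate between two RGB tuples (components in 0..1)."""
    return tuple(a + (b - a) * t for a, b in zip(c0, c1))

def ramp(stops, n):
    """Return n colors evenly spaced along a piecewise-linear ramp through stops."""
    colors = []
    for i in range(n):
        # Position along the whole ramp, in 0..1
        t = i / (n - 1) if n > 1 else 0.0
        # Find the ramp segment this position falls in
        scaled = t * (len(stops) - 1)
        seg = min(int(scaled), len(stops) - 2)
        colors.append(lerp_color(stops[seg], stops[seg + 1], scaled - seg))
    return colors

def draw_plan(lines):
    """Order lines longest-first and pair each one with a color from the ramp."""
    def length(line):
        (x0, y0), (x1, y1) = line
        return math.hypot(x1 - x0, y1 - y0)

    by_length = sorted(lines, key=length, reverse=True)
    # Black -> blue -> white, so the last (shortest) lines are the brightest
    palette = ramp([(0, 0, 0), (0, 0, 1), (1, 1, 1)], len(by_length))
    return list(zip(by_length, palette))
```

<p>Drawing the pairs in the returned order puts the long, dark lines underneath the short, bright ones.</p>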
<p><strong>Great Circles</strong></p>
<p>I wrote my own code to draw the <a href="http://en.wikipedia.org/wiki/Great_circle">great circle</a> arcs, although I later found a <a href="http://cran.r-project.org/">CRAN</a> package called <a href="http://cran.r-project.org/web/packages/geosphere/index.html">geosphere</a> that would have done it for me (albeit with rougher lines near the poles). I drew the great circle arcs in a way that was easy to derive but slow to compute. I bisected the lines recursively, finding their great circle midpoint, until they were short enough to resemble an arc. To find the great circle midpoint, I converted from <a href="http://en.wikipedia.org/wiki/Spherical_coordinate_system">spherical coordinates</a> to <a href="http://en.wikipedia.org/wiki/Cartesian_coordinate_system">Cartesian</a>, found the midpoint, then converted back to spherical coordinates and extended the radius.</p>
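<p>The midpoint trick described above can be sketched as follows. This is a Python illustration under a few assumptions, not the original R code: points are (latitude, longitude) pairs in degrees, the function names are invented for the example, and the recursion stops at a fixed depth instead of checking whether each segment is short enough.</p>

```python
import math

def to_cartesian(lat, lon):
    """Convert latitude/longitude in degrees to a point on the unit sphere."""
    phi, lam = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lam),
            math.cos(phi) * math.sin(lam),
            math.sin(phi))

def to_spherical(x, y, z):
    """Convert a 3D point back to latitude/longitude in degrees.

    Only the direction of the vector is used, which has the same effect
    as extending the radius back out to the surface of the sphere.
    """
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return (lat, lon)

def midpoint(a, b):
    """Great circle midpoint of two (lat, lon) points that are not antipodal."""
    ax, ay, az = to_cartesian(*a)
    bx, by, bz = to_cartesian(*b)
    return to_spherical((ax + bx) / 2, (ay + by) / 2, (az + bz) / 2)

def arc(a, b, depth=4):
    """Approximate the great circle arc from a to b with 2**depth segments."""
    if depth == 0:
        return [a, b]
    m = midpoint(a, b)
    # Drop the duplicated midpoint when joining the two halves
    return arc(a, m, depth - 1)[:-1] + arc(m, b, depth - 1)
```

<p>The straight-line midpoint of two points on a sphere lies inside the sphere, directly below the great circle midpoint, which is why pushing it back out to the surface gives the right answer.</p>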
<p><strong>Euclidean Distance</strong></p>
<p>Several observant commenters called me out on using <a href="http://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> on the projection for the ordering function. Having the ordering function depend on the distance on the projection seems counterintuitive, as Euclidean distance is wildly distorted near the poles. I accepted this drawback because the exact drawing order wasn&#8217;t important, as long as very long lines were drawn below very short ones.</p>
]]></content:encoded>
			<wfw:commentRss>http://paulbutler.org/archives/visualizing-facebook-friends/feed/</wfw:commentRss>
		<slash:comments>48</slash:comments>
		</item>
		<item>
		<title>Data Structures for Range-Sum Queries (slides)</title>
		<link>http://paulbutler.org/archives/data-structures-for-range-sum-queries-slides/</link>
		<comments>http://paulbutler.org/archives/data-structures-for-range-sum-queries-slides/#comments</comments>
		<pubDate>Sat, 10 Jul 2010 20:18:17 +0000</pubDate>
		<dc:creator>Paul Butler</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Math]]></category>

		<guid isPermaLink="false">http://test.paulbutler.org/?p=443</guid>
		<description><![CDATA[This week I attended the Canadian Undergraduate Mathematics Conference. I enjoyed talks from a number of branches of mathematics, and gave a talk of my own on range-sum queries. Essentially, range-aggregate queries are a class of database queries which involve &#8230; <a href="http://paulbutler.org/archives/data-structures-for-range-sum-queries-slides/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This week I attended the <a href="http://cumc.math.ca/2010/en/">Canadian Undergraduate Mathematics Conference</a>. I enjoyed talks from a number of branches of mathematics, and gave a talk of my own on range-sum queries. Essentially, range-aggregate queries are a class of database queries which involve taking an aggregate (in SQL terms, <samp>SUM</samp>, <samp>AVG</samp>, <samp>COUNT</samp>, <samp>MIN</samp>, etc.) over a set of data where the elements are filtered by simple inequality operators (in SQL terms, <samp>WHERE colname {&lt;, &lt;=, =, &gt;=, &gt;} value AND &#8230;</samp>). Range-sum queries are the subset of those queries where <samp>SUM</samp> is the aggregation function.</p>
<p>Due to the nature of the conference, I did my best to make things accessible to someone with a general mathematics background rather than assuming familiarity with databases or order notation.</p>
<p>I&#8217;ve put <a href="http://github.com/paulgb/cumc2010/blob/master/slides.pdf">the slides</a> (pdf link, embedded below also) online. They may be hard to follow as slides, but I hope they pique your interest enough to check out the papers referenced at the end if that&#8217;s the sort of thing that interests you. I may turn them into a blog post at some point. The presentation begins with tabular data and shows some of the insights that led to the Dynamic Data Cube, which is a clever data structure for answering range-sum queries.</p>
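<p>To give a flavour of the problem, here is a small Python sketch of a one-dimensional range-sum query answered two ways: by scanning, and from a table of running totals. It is only a baseline illustration with invented names, not the Dynamic Data Cube and not taken from the slides.</p>

```python
from itertools import accumulate

def naive_range_sum(values, lo, hi):
    """Sum values[lo..hi] inclusive by scanning every element in the range."""
    return sum(values[lo:hi + 1])

class PrefixSums:
    """Precomputed running totals that answer range-sum queries in constant time."""

    def __init__(self, values):
        # totals[i] holds the sum of the first i values, so totals[0] is 0
        self.totals = [0] + list(accumulate(values))

    def range_sum(self, lo, hi):
        """Sum of values[lo..hi] inclusive, as a difference of two running totals."""
        return self.totals[hi + 1] - self.totals[lo]
```

<p>The running totals make queries cheap, but changing a single value means recomputing every total after it. That tension between query cost and update cost is the kind of trade-off that more elaborate range-sum structures try to balance.</p>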
<p><iframe src="http://docs.google.com/viewer?url=http%3A%2F%2Fgithub.com%2Fpaulgb%2Fcumc2010%2Fraw%2Fmaster%2Fslides.pdf&#038;embedded=true" width="600" height="480" style="border: none;"></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://paulbutler.org/archives/data-structures-for-range-sum-queries-slides/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>An experiment in A/B Testing my Résumé</title>
		<link>http://paulbutler.org/archives/experiment-in-testing-my-resume/</link>
		<comments>http://paulbutler.org/archives/experiment-in-testing-my-resume/#comments</comments>
		<pubDate>Fri, 02 Jul 2010 01:08:54 +0000</pubDate>
		<dc:creator>Paul Butler</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Math]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://paulbutler.org/?p=372</guid>
		<description><![CDATA[Objective I&#8217;ll admit it: my résumé doesn&#8217;t stand out. I&#8217;ve had some great internships, but also a tendency to work for companies that aren&#8217;t (yet!) household names. And though I&#8217;m doing fine academically, it&#8217;s not well enough to stand out &#8230; <a href="http://paulbutler.org/archives/experiment-in-testing-my-resume/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<h3>Objective</h3>
<p>I&#8217;ll admit it: my résumé doesn&#8217;t stand out. I&#8217;ve had some great internships, but also a tendency to work for companies that aren&#8217;t (yet!) household names. And though I&#8217;m doing fine academically, it&#8217;s not well enough to stand out on my marks alone.</p>
<p>On the other hand, my blog lets me stand out. I&#8217;ve had a few opportunities to meet and interview with some great people and companies because they read my blog. Naturally, then, the primary goal of my résumé is to get people to visit my blog. Since I don&#8217;t <em>quite</em> have the audacity to make my résumé a note telling people to visit my blog, I&#8217;m faced with the problem of how to optimize my résumé to ensure people see my blog. That&#8217;s where this experiment comes in.</p>
<h3>Methodology</h3>
<p>I started thinking about variables in my résumé that could affect the rate at which people viewed my blog. I narrowed it down to three that I could easily test.</p>
<p>The first is the <strong>length</strong> of the résumé. My friends <a href="http://www.meetrajesh.com">Rajesh</a> and <a href="http://www.eng.uwaterloo.ca/~smsshafi/index.php?newsfile=080921">Shams</a> are adamant about keeping their résumés down to a single page. Their arguments are sound, but I wanted to see if the data would back up their beliefs. I created a &#8220;short&#8221; version of my résumé which I squeezed into one page by omitting the <em>Awards</em> section and removing some skills.</p>
<p>Second, I wanted to know how my <strong>grades</strong> affected the résumé. Obviously I couldn&#8217;t start making things up, but since my major average differs from my overall average by a good margin, I had two numbers that I could use truthfully with a subtle change in wording.</p>
<div id="attachment_377" class="wp-caption alignnone" style="width: 496px"><a href="http://test.paulbutler.org/wp-content/uploads/2010/07/gpa.png"><img src="http://test.paulbutler.org/wp-content/uploads/2010/07/gpa_t.png" alt="" title="gpa_t" width="486" height="66" class="size-full wp-image-377" /></a><p class="wp-caption-text">Résumé variations with different grades</p></div><br />
Finally, I wanted to test whether it pays to include <strong>social media links</strong> on the résumé. I chose <a href="http://github.com/paulgb">GitHub</a>, <a href="http://www.linkedin.com/in/paulgb">LinkedIn</a>, and <a href="http://twitter.com/paulgb">Twitter</a> as the links to test. GitHub was an obvious choice because it emphasizes my free-time projects. LinkedIn seemed like a good one to test, given that it is for professional networking. I chose Twitter as another variation because I was curious to see what the reaction to a more personal social networking site would be. All résumés linked back to my blog as well. I also had another résumé which linked <strong>only</strong> to my blog, as a control group.</p>
<p><div id="attachment_381" class="wp-caption alignnone" style="width: 249px"><a href="http://test.paulbutler.org/wp-content/uploads/2010/07/link.png"><img src="http://test.paulbutler.org/wp-content/uploads/2010/07/link_t.png" alt="" title="link_t" width="239" height="60" class="size-full wp-image-381" /></a><p class="wp-caption-text">Résumé variation with a link to GitHub</p></div><br />
In all, these three variables resulted in 16 unique résumés (two lengths, two grade wordings, and four link options). Fortunately I didn&#8217;t have to create them all by hand. I was already using LaTeX for my résumé, using one of the elegantly typeset templates from <a href="http://www.cv-templates.info/">The CV Inn</a> as a base. I simply threw my latest résumé into a <a href="http://www.makotemplates.org/">Mako Template</a> and wrote some Python code to spit out the 16 possible variations of the LaTeX code. Then I used <em>pdflatex</em> to create PDF files. Since I was putting the résumés online, I made a landing page. To keep things simple, the landing page was just an image version of the résumé with a link to download the PDF, and just enough CSS to look presentable.</p>
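<p>Enumerating the combinations is the easy part. A minimal sketch of that step might look like the following; the labels and the <samp>variants()</samp> helper are invented for illustration, and the template rendering itself is left out.</p>

```python
from itertools import product

# Values for each variable; the labels are made up for this sketch
LENGTHS = ["short", "long"]
GRADES = ["major_average", "overall_average"]
LINKS = ["none", "github", "linkedin", "twitter"]

def variants():
    """Return one dict per résumé variant, covering every combination."""
    return [
        {"length": length, "grades": grades, "link": link}
        for length, grades, link in product(LENGTHS, GRADES, LINKS)
    ]
```

<p>Each dict could then be passed to the template as its set of variables, giving 2 &times; 2 &times; 4 = 16 documents.</p>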
<p>I wanted to track three things: how many people <strong>downloaded</strong> the résumé, how many people <strong>scrolled to the bottom</strong> of the landing page, and how many people <strong>visited my blog</strong>. The first I accomplished by logging downloads. The second I accomplished with <a href="http://jquery.com/">jQuery</a> and an Ajax callback. The third I accomplished with a tracking image, just like hit counters in the 90s. I used IP address and cookies to match up actions with the associated résumé.</p>
<p>The only remaining problem was how to get hundreds of people to see my résumé in a short period of time. Fortunately Google offered me $110 in AdWords credits as a Google Analytics user, so I took advantage of that and ran ads on Google searches. Here is one of the half-dozen variations I ran:</p>
<p><div id="attachment_383" class="wp-caption alignnone" style="width: 197px"><img src="http://test.paulbutler.org/wp-content/uploads/2010/07/dsad.png" alt="" title="dsad" width="187" height="62" class="size-full wp-image-383" /><p class="wp-caption-text">One of the Google ads I ran</p></div>
<h3>Results</h3>
<p>After less than a week, I managed to exhaust my AdWords budget and gather a fair bit of data. I wrote a few Hadoop jobs with my <a href="http://github.com/paulgb/haskell_hadoop">new toy</a> and then brought the data into R for analysis and visualization.</p>
<h4>Length</h4>
<p>As you might expect, the people who encountered the short résumé were much more likely to scroll to the bottom. Just over half did, versus just under a third of those presented with the long résumé. This makes sense because there is less to scroll through, but it was nice to have the data confirm my suspicions. Note that in the following graph, and all others in this post, the grey lines indicate the 90% confidence interval.</p>
<p><a href="http://test.paulbutler.org/wp-content/uploads/2010/07/hitbottom_length.png"><img src="http://test.paulbutler.org/wp-content/uploads/2010/07/hitbottom_length.png" alt="" title="hitbottom_length" width="400" height="400" class="alignnone size-full wp-image-386" /></a><br />
The short résumé also resulted in more downloads and blog views, but not enough to be statistically significant with the amount of data I collected.</p>
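<p>For readers unfamiliar with confidence intervals on a rate, here is one common way to compute them, sketched in Python. It uses the normal approximation, which is an assumption on my part for the sake of illustration and not necessarily the method behind the graphs; the function name is invented.</p>

```python
import math

def proportion_interval(successes, trials, z=1.645):
    """Normal-approximation confidence interval for a proportion.

    z=1.645 corresponds to a two-sided 90% interval.
    """
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    # Clamp to the valid range for a proportion
    return (max(0.0, p - half_width), min(1.0, p + half_width))
```

<p>The interval shrinks with the square root of the number of trials, which is why small differences need a lot of data before they become statistically significant.</p>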
<h4>Grades</h4>
<p>The grades shown on the résumé didn&#8217;t affect any of the metrics I was measuring in a statistically significant way.</p>
<h4>Links</h4>
<p>I was surprised to find that the non-blog link shown on my résumé affected the frequency of click-throughs to my blog. Even adding a link to my GitHub profile more than halved the frequency of a click-through to my blog. LinkedIn and Twitter were even worse.</p>
<p><a href="http://test.paulbutler.org/wp-content/uploads/2010/07/blogview_link.png"><img src="http://test.paulbutler.org/wp-content/uploads/2010/07/blogview_link.png" alt="" title="blogview_link" width="400" height="400" class="alignnone size-full wp-image-396" /></a><br />
I created a heatmap-like visualization from the relative significances of each link to each other. For example, the upper leftmost cell means that it is 97.2% likely that if a sufficiently large group of people were exposed to each of the LinkedIn and blog-link-only versions of my résumé, the group that saw the blog-link-only version would visit my blog more. <a href="http://20bits.com/">Jesse E. Farmer</a> has written more about the details of <a href="http://20bits.com/articles/statistical-analysis-and-ab-testing/">how this is calculated</a>.</p>
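<p>As a rough idea of where a number like that can come from, here is a Python sketch of a two-proportion z-score turned into a probability. It is a generic textbook approach with an invented function name, offered as an assumption about the kind of calculation involved and not as a reproduction of the linked article or of my analysis code.</p>

```python
import math

def prob_a_beats_b(conv_a, n_a, conv_b, n_b):
    """Approximate confidence that variant A's true rate exceeds variant B's.

    Raises ZeroDivisionError when both observed rates are exactly 0 or 1,
    because the standard error is then zero.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Standard error of the difference between two independent proportions
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_a - p_b) / se
    # Standard normal CDF expressed with the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))
```

<p>Each cell of a heatmap like the one above would correspond to one call with the counts for a pair of variants.</p>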
<p><a href="http://test.paulbutler.org/wp-content/uploads/2010/07/hm_blogview_link.png"><img src="http://test.paulbutler.org/wp-content/uploads/2010/07/hm_blogview_link.png" alt="" title="hm_blogview_link" width="400" height="400" class="alignnone size-full wp-image-397" /></a><br />
Oddly, the effect was reversed when you consider downloads rather than blog views. The résumés without any social media links were far less likely to be downloaded than those with. Even a résumé with a Twitter profile did better than one without, though not by enough to be statistically significant.</p>
<p><a href="http://test.paulbutler.org/wp-content/uploads/2010/07/download_link.png"><img src="http://test.paulbutler.org/wp-content/uploads/2010/07/download_link.png" alt="" title="download_link" width="400" height="400" class="alignnone size-full wp-image-400" /></a><br />
<a href="http://test.paulbutler.org/wp-content/uploads/2010/07/hm_download_link.png"><img src="http://test.paulbutler.org/wp-content/uploads/2010/07/hm_download_link.png" alt="" title="hm_download_link" width="400" height="400" class="alignnone size-full wp-image-401" /></a><br />
The additional links also reduced the frequency of readers scrolling to the bottom of the page.</p>
<p><a href="http://test.paulbutler.org/wp-content/uploads/2010/07/hitbottom_link.png"><img src="http://test.paulbutler.org/wp-content/uploads/2010/07/hitbottom_link.png" alt="" title="hitbottom_link" width="400" height="400" class="alignnone size-full wp-image-435" /></a><br />
<a href="http://test.paulbutler.org/wp-content/uploads/2010/07/hm_hitbottom_link.png"><img src="http://test.paulbutler.org/wp-content/uploads/2010/07/hm_hitbottom_link.png" alt="" title="hm_hitbottom_link" width="400" height="400" class="alignnone size-full wp-image-436" /></a></p>
<h3>Conclusion</h3>
<p>There are two main things I learned from this experiment. First, I&#8217;m going to keep social network links off of my résumé. Although they increased the download rate, they decreased visits to my blog. Since the latter is my priority, I&#8217;m not going to start adding social networks to my résumé.</p>
<p>Second, the short résumé did better in every way. However, the improvement in blog views was not statistically significant. For now, I&#8217;m keeping my online résumé at two pages, but I will use the one-page version in print.</p>
<p>There are a number of disclaimers I should make here. For one, even if my findings are true of my résumé, they might not be true of other résumés. Maybe a change in layout would diminish the effect of linking to social media profiles, or make the longer résumé convert better. I should also point out that I have no way of knowing who my audience was. They probably weren&#8217;t all in a position to hire a programmer or data scientist, so the factors that make them visit my blog may or may not have the same effect on those who are.</p>
<p>And finally, a shameless plug. I&#8217;m looking for an interesting <strong>data science</strong> internship this fall (September to December 2010). If you&#8217;re doing cool things with data, I&#8217;d be glad to hear from you. My contact information is in the sidebar.</p>
]]></content:encoded>
			<wfw:commentRss>http://paulbutler.org/archives/experiment-in-testing-my-resume/feed/</wfw:commentRss>
		<slash:comments>22</slash:comments>
		</item>
		<item>
		<title>Why R doesn&#8217;t suck</title>
		<link>http://paulbutler.org/archives/why-r-doesnt-suck/</link>
		<comments>http://paulbutler.org/archives/why-r-doesnt-suck/#comments</comments>
		<pubDate>Sat, 19 Jun 2010 13:45:01 +0000</pubDate>
		<dc:creator>Paul Butler</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Haskell]]></category>
		<category><![CDATA[Math]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://paulbutler.org/?p=336</guid>
		<description><![CDATA[I first encountered the R programming language a few years ago when I needed to make some plots. Although I&#8217;ve used it occasionally since, I always considered it a sort of &#8220;Perl for statisticians&#8221; &#8212; a useful swiss-army knife with &#8230; <a href="http://paulbutler.org/archives/why-r-doesnt-suck/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I first encountered the <a href="http://www.r-project.org/">R programming language</a> a few years ago when I needed to make some plots. Although I&#8217;ve used it occasionally since, I always considered it a sort of &#8220;Perl for statisticians&#8221; &mdash; a useful swiss-army knife with ugly syntax and inconsistent semantics. My workflow generally involved manipulating the data in Python and using R to make a simple plot, minimizing the amount of R code I wrote as much as possible.</p>
<p>When I recently decided to sit down and properly learn the language, I was pleasantly surprised that underneath the line noise was an interesting and unique language. R is a descendant of LISP and, deep down, maintains some of the beauty of its ancestor. It also borrows some unique and interesting features from other functional and dynamic languages.</p>
<h3>Code is Data</h3>
<p>R is true to its LISP roots in that you can create, modify, and evaluate parse trees from the code itself. One way to do so is with the <samp>quote()</samp> special-function, which returns its argument, unevaluated, as an expression object that can be traversed, modified and evaluated.</p>
<p>A fun (though not especially useful) consequence of this is that you can write an <a href="http://en.wikipedia.org/wiki/Quine_(computing)">expression which returns itself</a> as a quote:<br />
<code><br />
> (function(x) substitute((x)(x)))(function(x) substitute((x)(x)))<br />
(function(x) substitute((x)(x)))(function(x) substitute((x)(x)))<br />
> expression <- (function(x) substitute((x)(x)))(function(x) substitute((x)(x)))<br />
> expression == eval(expression)<br />
[1] TRUE<br />
</code></p>
<h3>Optional Laziness</h3>
<p>By default, R uses <a href="http://en.wikipedia.org/wiki/Eager_evaluation">eager evaluation</a>, so expressions are evaluated as soon as they are assigned. However, R takes after functional languages like Haskell and OCaml in that it allows lazy evaluation, where expressions are only evaluated at the time they are first used.</p>
<p>For example, consider the Haskell code:<br />
<code><br />
m = sum [1..]<br />
</code></p>
<p>Where <samp>sum</samp> returns the sum of a list and <samp>[1..]</samp> is the (infinite) list of all natural numbers. In most languages, the assignment would cause the program to loop forever trying to sum all the natural numbers so it can assign that value to <samp>m</samp>. In Haskell, the assignment does complete; it simply assigns the expression <samp>sum [1..]</samp> to <samp>m</samp> so that it can be evaluated when the value of <samp>m</samp> is first used.</p>
<p>In R we can accomplish something similar with the <samp>delayedAssign()</samp> function:<br />
<code><br />
delayedAssign("m", sum(1:Inf))<br />
</code></p>
<p>Note that in R, unlike OCaml, the variables may be explicitly made lazy with <samp>delayedAssign</samp>, but are evaluated automatically when they are used.</p>
<p>Unfortunately, R evaluates lazy variables when they are pointed to by a data structure, even if their value is not needed at the time. This means that infinite data structures, one common application of laziness in Haskell, are not possible in R.</p>
<h3>Operators are functions</h3>
<p>When using higher-order functions, it&#8217;s sometimes useful to be able to treat operators as functions. Python accomplishes this in a clunky way: there is an <samp>operator</samp> module which redefines the built-in operators as functions. R takes a more functional approach. As in Haskell and OCaml, operators are just syntactic sugar for ordinary functions. Enclosing any operator in backticks lets you use it as if it were an ordinary function. For example, calling <samp>`+`(2, 3)</samp> returns <samp>5</samp>.</p>
<p>In fact, the infix and prefix forms are indistinguishable once they are parsed.<br />
<code><br />
> quote(3 + 4) == quote(`+`(3, 4))<br />
[1] TRUE<br />
</code></p>
<p>One surprising fact in R is that the assignment operators (<samp>&lt;-</samp>, <samp>&lt;&lt;-</samp> and <samp>=</samp>) are functions like any other. As a result, they can be overwritten or passed around as desired, though neither strikes me as a particularly good idea.</p>
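<p>For comparison, this is roughly what the Python approach mentioned earlier looks like. The snippet is a small illustration using the standard <samp>operator</samp> and <samp>functools</samp> modules; the variable names are arbitrary.</p>

```python
import operator
from functools import reduce

# operator.add is the function form of the + operator
total = reduce(operator.add, [1, 2, 3, 4])

# The same functions can be passed to any higher-order function
products = list(map(operator.mul, [1, 2, 3], [4, 5, 6]))
```

<p>It works, but the operator has to be looked up under a separate name, whereas in R the backtick form refers to the operator itself.</p>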
<h3>Continuations</h3>
<p><a href="http://en.wikipedia.org/wiki/Continuation">Continuations</a> in R are a way of &#8220;breaking out&#8221; of a computation and jumping down the call stack to return early. The R function <samp>callCC()</samp> (<strong>call</strong> with <strong>c</strong>urrent <strong>c</strong>ontinuation) takes one argument, a function. It then evaluates that function, passing in a special function as an argument. <samp>callCC()</samp> then returns the first value that the special function is called with, or the return value of evaluating its argument if the special function is not called before the function returns.</p>
<p>To give you a better idea of what that looks like, consider this example:<br />
<code><br />
> callCC(function(m) {return(4)})<br />
[1] 4<br />
> callCC(function(m) {m(2); return(4)})<br />
[1] 2<br />
</code></p>
<p>Calling the function <samp>m(2)</samp> essentially cuts the computation short, drops down in the call stack to <samp>callCC</samp>, and returns <samp>2</samp>.</p>
<p>If you&#8217;ve used continuations in another language, note that in R the exit function can only be called before <samp>callCC()</samp> returns. This makes R&#8217;s continuation semantics less powerful than those of languages like Scheme, Smalltalk, and Ruby.</p>
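<p>If it helps to see the same escape-only behaviour outside R, here is a rough Python analogue built on exceptions. This is a sketch of the idea, not how R implements <samp>callCC()</samp>; the names <samp>call_cc</samp>, <samp>Escape</samp>, <samp>exit_fn</samp> and <samp>early</samp> are hypothetical.</p>

```python
def call_cc(f):
    """Escape-only analogue of R's callCC(); an assumption-laden sketch."""

    # A fresh exception class per call, so nested call_cc calls
    # only catch their own exit function.
    class Escape(Exception):
        def __init__(self, value):
            super().__init__(value)
            self.value = value

    def exit_fn(value):
        # Unwind the stack back to the enclosing call_cc.
        raise Escape(value)

    try:
        return f(exit_fn)
    except Escape as esc:
        return esc.value


def early(m):
    m(2)       # cuts the computation short
    return 4   # never reached


print(call_cc(lambda m: 4))
print(call_cc(early))
```

<p>As in R, the exit function is only useful while <samp>call_cc</samp> is still on the stack. Calling it afterwards just raises an exception that nothing catches.</p>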
<p>R is not without its flaws and legacy baggage (you can trace its roots back to the <a href="http://en.wikipedia.org/wiki/S_(programming_language)">S programming language</a> 35 years ago), but once you learn to use it right, it&#8217;s a very powerful and indispensable language.</p>
]]></content:encoded>
			<wfw:commentRss>http://paulbutler.org/archives/why-r-doesnt-suck/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Groupon Math: Data Scraping to Estimate Revenue</title>
		<link>http://paulbutler.org/archives/groupon-math-data-scraping-to-estimate-revenue/</link>
		<comments>http://paulbutler.org/archives/groupon-math-data-scraping-to-estimate-revenue/#comments</comments>
		<pubDate>Thu, 15 Apr 2010 15:18:12 +0000</pubDate>
		<dc:creator>Paul Butler</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Math]]></category>

		<guid isPermaLink="false">http://paulbutler.org/?p=268</guid>
		<description><![CDATA[There&#8217;s been a lot of talk recently about the Chicago startup Groupon. Groupon brands itself as a group-buying site, but it&#8217;s really more of a localized version of what woot.com does. They post a new deal (which they call a &#8230; <a href="http://paulbutler.org/archives/groupon-math-data-scraping-to-estimate-revenue/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>There&#8217;s been a lot of talk recently about the Chicago startup <a href="http://www.groupon.com/">Groupon</a>. Groupon brands itself as a group-buying site, but it&#8217;s really more of a localized version of what <a href="http://www.woot.com/">woot.com</a> does. They post a new deal (which they call a <em>Groupon</em>) every day, available only on that day. If enough people want to buy it, everyone gets it for a substantial discount. Otherwise, nobody gets anything, but this rarely happens from what I can tell.</p>
<p><img src="http://test.paulbutler.org/wp-content/uploads/2010/04/groupon.png" alt="" title="groupon" width="200" height="365" class="alignright size-full wp-image-271" /><a href="http://techcrunch.com/2010/04/13/groupon-raises-huge-new-round-at-1-2-billion-valuation/">According to TechCrunch</a>, the company is in the process of raising money at a $1.2 billion valuation. There has been lots of speculation about the future worth of the company, but little information about current revenue, even though a lot of raw data is readily available in the site&#8217;s archives. I put together a scraper (in just a few lines of Python, thanks to <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>) and gathered a total of 1065 past Groupons.</p>
<p>It isn&#8217;t clear how Groupon decides which Groupons to display in its archives. Presumably they are the better-selling ones, so my sample is not random, which would skew the numbers. The figures that follow should be taken with a grain of salt, but they should be reasonable as ballpark estimates.</p>
<p>According to the data I collected, the average Groupon costs $30 and entitles the buyer to 57% off. On average 1155 people purchase it, resulting in $28,130 of revenue to Groupon. ($28,130 is less than 1155 * $30 = $34,650 because, apparently, people are more willing to buy the cheaper Groupons.)</p>
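<p>To see why the average revenue can fall below the average price times the average number of buyers, here is a small sketch. The three deals are made-up toy values chosen so that cheaper deals sell more; they are not taken from the scraped data.</p>

```python
# Hypothetical toy deals as (price in dollars, number of buyers).
# These are illustrative values, not scraped Groupon data.
deals = [(10, 2000), (20, 1500), (60, 400)]

n = len(deals)
avg_price = sum(price for price, _ in deals) / n
avg_buyers = sum(buyers for _, buyers in deals) / n

# Average of per-deal revenue, which is what the $28,130 figure corresponds to.
avg_revenue = sum(price * buyers for price, buyers in deals) / n

print(avg_price * avg_buyers)  # product of the two averages
print(avg_revenue)             # average of the products
```

<p>When price and sales volume are negatively correlated, the average of the products comes out lower than the product of the averages.</p>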
<p>Averages are nice, but what I really wanted was totals. I was able to approximate what fraction of the data I had because Groupon advertises the &#8220;Total dollars saved&#8221; and &#8220;Total Groupons bought&#8221; on every page. By dividing my numbers by those, I determined that I had roughly a third of the data. Specifically, my data covered 31.2% of Groupons sold, and 37.4% of total savings.</p>
<p>Extrapolating from the data I had (again, with the disclaimer that my sample may not be random), I calculated the total revenue since launch to be $80,188,176. If Groupon takes a 35% cut (to take a wild guess), $28 million of that is left after Groupon pays the company offering the deal. According to <a href="http://www.crunchbase.com/company/groupon">CrunchBase</a>, Groupon employs 90 people. I won&#8217;t speculate as to the operating costs of Groupon over the last year and a bit of operation, but once you subtract that number the rest is profit to date.</p>
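<p>The arithmetic behind these totals can be sketched as follows, using only numbers quoted in this post. The post does not state which ratio the sample was scaled by, so the comparison with the 37.4% savings share is an inference from the numbers, and the 35% cut is the same wild guess as above.</p>

```python
# Figures quoted in the post.
sample_size = 1065            # Groupons scraped
avg_revenue = 28_130          # average revenue per Groupon, in dollars
total_estimate = 80_188_176   # extrapolated total revenue, in dollars

# Revenue covered by the scraped sample.
sample_revenue = sample_size * avg_revenue

# Share of the estimated total that the sample accounts for.
# This lands close to the 37.4% savings share (an inference, not stated).
implied_share = sample_revenue / total_estimate

# Assumed cut: the post's guess, not a known figure.
assumed_cut = 0.35
groupon_share = assumed_cut * total_estimate

print(sample_revenue)
print(round(implied_share, 3))
print(round(groupon_share))
```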
<p>Looking at the data month by month, the recent growth of the company is clear. A third of the total savings, in over a year of business, happened last month. This works out to $26,706,059 in revenue last month alone, or about $9.3 million (less the operating costs) in profit if you assume they take a 35% cut. The graph below shows the growth by month.</p>
<p><img src="http://test.paulbutler.org/wp-content/uploads/2010/04/groupon_growth1.png" alt="" title="groupon_growth" width="427" height="345" class="alignnone size-full wp-image-288" /><br />
Whether or not it&#8217;s a $1.2 billion company (<a href="http://www.businessinsider.com/groupon-is-cheap-at-12-billion-2010-4">BusinessInsider says that&#8217;s actually low</a>, though without any quantitative justification), they&#8217;re clearly doing well for a company just over a year after launch.</p>
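<p>As a quick sanity check on the monthly figures, here is a short sketch using the two revenue numbers quoted in this post. The 35% cut is still an assumption, and the variable names are only illustrative.</p>

```python
# Figures quoted in the post, in dollars.
total_revenue = 80_188_176
last_month_revenue = 26_706_059

# Fraction of all estimated revenue that came from the most recent month.
last_month_share = last_month_revenue / total_revenue

# Assumed 35% cut (a guess, not a known figure), before operating costs.
assumed_cut = 0.35
last_month_cut = assumed_cut * last_month_revenue

print(round(last_month_share, 3))
print(round(last_month_cut / 1e6, 1))
```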
<p>Here are a couple more graphs constructed from the data (click to enlarge).<br />
<a href="http://test.paulbutler.org/wp-content/uploads/2010/04/groupon_cities.png"><img src="http://paulbutler.org/wp-content/uploads/2010/04/groupon_cities-150x150.png" alt="" title="Estimated To-Date Revenue Per City" width="150" height="150" class="alignnone size-thumbnail wp-image-291" /></a><br />
<a href="http://test.paulbutler.org/wp-content/uploads/2010/04/groupon_cities_average.png"><img src="http://paulbutler.org/wp-content/uploads/2010/04/groupon_cities_average-150x150.png" alt="" title="Average Revenue Per Groupon" width="150" height="150" class="alignnone size-thumbnail wp-image-292" /></a></p>
<p>(Graphs were created with <a href="http://tables.googlelabs.com/Home">Google Fusion Tables</a>.)</p>
<p><strong>Update May 26</strong>: a few more graphs with more recent data follow.</p>
<p><a href="http://test.paulbutler.org/wp-content/uploads/2010/04/sales_by_date_top.png"><img src="http://paulbutler.org/wp-content/uploads/2010/04/sales_by_date_top-300x225.png" alt="" title="sales_by_date_top" width="300" height="225" class="alignnone size-medium wp-image-334" /></a></p>
<p><a href="http://test.paulbutler.org/wp-content/uploads/2010/05/prices_by_date_zoom_top.png"><img src="http://paulbutler.org/wp-content/uploads/2010/05/prices_by_date_zoom_top-300x225.png" alt="" title="prices_by_date_zoom_top" width="300" height="225" class="alignnone size-medium wp-image-328" /></a></p>
<p><a href="http://test.paulbutler.org/wp-content/uploads/2010/05/revenue_by_date_top.png"><img src="http://paulbutler.org/wp-content/uploads/2010/05/revenue_by_date_top-300x225.png" alt="" title="revenue_by_date_top" width="300" height="225" class="alignnone size-medium wp-image-332" /></a></p>
<p><a href="http://test.paulbutler.org/wp-content/uploads/2010/05/price_trend_firstcities.png"><img src="http://paulbutler.org/wp-content/uploads/2010/05/price_trend_firstcities-300x225.png" alt="" title="price_trend_firstcities" width="300" height="225" class="alignnone size-medium wp-image-330" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://paulbutler.org/archives/groupon-math-data-scraping-to-estimate-revenue/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>webFractal: Web-based Fractal Explorer</title>
		<link>http://paulbutler.org/archives/webfractal-web-based-fractal-explorer/</link>
		<comments>http://paulbutler.org/archives/webfractal-web-based-fractal-explorer/#comments</comments>
		<pubDate>Sat, 17 Feb 2007 16:12:28 +0000</pubDate>
		<dc:creator>Paul Butler</dc:creator>
				<category><![CDATA[Fractals]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Math]]></category>
		<category><![CDATA[Web Apps]]></category>

		<guid isPermaLink="false">http://www.paulbutler.org/blog/archives/webfractal-web-based-fractal-explorer/</guid>
		<description><![CDATA[Last weekend, I won a nice new Toshiba laptop in a local software competition. My entry was a web-based fractal explorer. I had a lot of fun making it, and it is fun to play with as well. I have &#8230; <a href="http://paulbutler.org/archives/webfractal-web-based-fractal-explorer/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Last weekend, I won a nice new Toshiba laptop in a local software competition. My entry was a web-based fractal explorer. I had a lot of fun making it, and it is fun to play with as well. I have decided to release it under an open-source license so that other people can play around with it (see the download link at the bottom of this post).</p>
<p>Unfortunately, I do not have access to a powerful Tomcat server with a lot of bandwidth, so I can&#8217;t host an online demo. If anyone has the resources and is interested in hosting it, please let me know.</p>
<p>Here are some screenshots of the application in action:</p>

<a href='http://paulbutler.org/archives/webfractal-web-based-fractal-explorer/fractal1/' title='fractal1'><img src="http://paulbutler.org/wp-content/uploads/2009/04/fractal1.jpg" class="attachment-thumbnail" alt="fractal1" title="fractal1" /></a>
<a href='http://paulbutler.org/archives/webfractal-web-based-fractal-explorer/fractal2/' title='fractal2'><img width="150" height="150" src="http://paulbutler.org/wp-content/uploads/2009/04/fractal2-150x150.jpg" class="attachment-thumbnail" alt="fractal2" title="fractal2" /></a>
<a href='http://paulbutler.org/archives/webfractal-web-based-fractal-explorer/fractal3/' title='fractal3'><img width="150" height="150" src="http://paulbutler.org/wp-content/uploads/2009/04/fractal3-150x150.jpg" class="attachment-thumbnail" alt="fractal3" title="fractal3" /></a>
<a href='http://paulbutler.org/archives/webfractal-web-based-fractal-explorer/fractal4/' title='fractal4'><img width="150" height="150" src="http://paulbutler.org/wp-content/uploads/2009/04/fractal4-150x150.jpg" class="attachment-thumbnail" alt="fractal4" title="fractal4" /></a>
<a href='http://paulbutler.org/archives/webfractal-web-based-fractal-explorer/fractal5/' title='fractal5'><img width="150" height="150" src="http://paulbutler.org/wp-content/uploads/2009/04/fractal5-150x150.jpg" class="attachment-thumbnail" alt="fractal5" title="fractal5" /></a>
<a href='http://paulbutler.org/archives/webfractal-web-based-fractal-explorer/fractal6/' title='fractal6'><img width="150" height="150" src="http://paulbutler.org/wp-content/uploads/2009/04/fractal6-150x150.jpg" class="attachment-thumbnail" alt="fractal6" title="fractal6" /></a>

<p>Since it is a web-based application, any supported web browser can be the client. (See the documentation for a list of supported browsers; any modern Gecko-based browser is supported, as well as IE and Opera.) The client interface is loosely based on Google Maps. The server is a Java Servlet run through Tomcat. You can read more about how it works in the documentation.</p>
<p>Downloads:</p>
<ul>
<li><a href="http://github.com/paulgb/webFractal/tarball/master" title="webFractal 1.0">webFractal 1.0 (zip file)</a></li>
<li><a href="http://github.com/paulgb/webFractal/raw/2f09d69e63088879ca7ad86f25a480cb8882b731/webFractal_Documentation.pdf" title="webFractal 1.0 Documentation PDF">webFractal 1.0 Documentation (pdf)</a></li>
<li><a href="http://github.com/paulgb/webFractal/blob/2f09d69e63088879ca7ad86f25a480cb8882b731/webFractal_Documentation.txt" title="webFractal 1.0 Documentation TXT">webFractal 1.0 Documentation (text)</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://paulbutler.org/archives/webfractal-web-based-fractal-explorer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

