d3 language graphs a la marceau

About a year ago I came across this blog post from Guillaume Marceau graphing benchmark speed and size data from the Computer Language Benchmarks Game.

Given my interests in data visualization and programming languages, it is no surprise that these graphs tickled me. They are, however, a little old, and Marceau’s post leaves a few outstanding questions, so I thought I’d try to recreate his findings. And we may as well use D3 for it.

Like many D3 tutorials, we’ll be using CoffeeScript for its simplicity. To translate to JavaScript, just add a ton of parentheses, curly braces and semicolons, and throw in the words `var` and `function` liberally.
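For instance, the CSV-parsing block below compiles to roughly this JavaScript (a sketch of the compiler’s output, nothing you need to write by hand):

d3.csv("data.csv", function(data) {
  var i, d;
  for (i = 0; i < data.length; i++) {
    d = data[i];
    d["cpu(s)"]  = parseFloat(d["cpu(s)"]);
    d["size(B)"] = parseFloat(d["size(B)"]);
  }
});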

In this post I’ll walk through building only one of Marceau’s graphs, but if you’re interested, check out this page for more of them. Also feel free to poke around the git repo for this project. This blog post was generated from the literate CoffeeScript file found there.

width = 250
height = 250
margin = 100

Let’s navigate over to the Benchmarks Game website and download the summary data. You can find it here. Load up the data with D3’s csv function.

d3.csv "data.csv", (data) ->
    for d in data
        d["cpu(s)"]  = parseFloat d["cpu(s)"]
        d["size(B)"] = parseFloat d["size(B)"]

the scatter plot

The first thing we notice is that the CSV reports absolute numbers, but Marceau graphs relative ones: how many times larger or slower a program is than the best solution. It’s easy enough to ask D3 to do this conversion for us.

    best = d3.nest()
        .key (d) -> d.name
        .rollup (v) ->
            "cpu(s)":  d3.min v, (d) -> d["cpu(s)"]
            "size(B)": d3.min v, (d) -> d["size(B)"]

    mins = best.map data

Calculate the minimum size and time for each benchmark, then scale the data accordingly.

    for d in data
        d["cpu(s)"]  = d["cpu(s)"]  / mins[d.name]["cpu(s)"]
        d["size(B)"] = d["size(B)"] / mins[d.name]["size(B)"]

Great. Let’s draw a basic scatter plot. We’ll set the domains manually to exclude the outliers. Framing our plot this way highlights an interesting fact about the data: aside from those outliers, the largest programs take only a few times the space of the optimal solution, while the slowest programs run thousands of times longer than the fastest.

    scaleX = d3.scale.sqrt()
        .domain [1, 5000]
        .rangeRound [0,  width]
    scaleY = d3.scale.sqrt()
        .domain [1, 6]
        .rangeRound [height, 0]

    x = (d) -> scaleX d["cpu(s)"]
    y = (d) -> scaleY d["size(B)"]

Now add a dot for each benchmark data point. The x-coordinate is the relative speed, and the y-coordinate the relative size.

    focus = createCanvas()

    focus.selectAll ".benchmark"
        .data data 
        .enter().append "circle"
        .attr "class", "benchmark"
        .attr "r", 2
        .attr "transform", (d) ->
            "translate(#{x d},#{y d})"

performance stars

On top of the scatter plot we’ll draw a star showing the performance of a particular language. The center of the star is the average of the benchmarks, so first we’ll roll up the data points by language.

    average = d3.nest()
        .key (d) -> d.lang
        .rollup (v) ->
            "cpu(s)":  d3.mean v, (d) -> d["cpu(s)"]
            "size(B)": d3.mean v, (d) -> d["size(B)"]

    averages = average.map data

We’ll only show the star for the currently selected language, so let’s also map the benchmark results by language to make finding them easy.

    benchmarks = d3.nest()
        .key (d) -> d.lang
        .map data

Now we have everything we need to draw the star. We’ll create an SVG group to work with in a moment.

    star = focus.append "g"
        .attr "class", "star"

We’ll declare a local function so we can easily update the star when the language changes.

    showLanguageStar = (lang) ->

First we get the average performance for this language and move the star group to that position.

        avg = averages[lang]

        star.transition()
            .attr "transform", "translate(#{x avg},#{y avg})"

Then we append a spoke for each benchmark data point of the selected language, update existing ones, and remove spokes if there are some we don’t need. Read more about this process in Mike Bostock’s General Update Pattern tutorials.

        lines = star.selectAll "line"
            .data benchmarks[lang]

        lines.enter().append "line"

        lines.transition()
            .attr "x2", (d) -> x(d) - x(avg)
            .attr "y2", (d) -> y(d) - y(avg)

        lines.exit().remove()

Set the default language on page load.

    showLanguageStar "JavaScript V8"

legend

Finally we’ll create the UI. We want the user to be able to choose a language, so we’ll need a list of languages to select from.

    languageNames = (name for name of averages)
    languageNames.sort()

    languages = d3.select "body"
        .append "ul"
        .selectAll "li" 
        .data languageNames
        .enter().append "li"
        .text (d) -> d

Whenever we mouseover the name of a language, call showLanguageStar to redraw the star.

    languages.on "mouseover", showLanguageStar

conclusions

Now we can answer a few questions Marceau left open. Would JavaScript V8 maintain its position? Yes; in fact, it has improved, becoming one of the fastest languages in the rankings while remaining expressive. I’d be intrigued to see CoffeeScript on here.

There have been a few major language movements. Java 7 seems to have lost the edge that Java 6 had. Haskell, Fortran and Ada have all moved into the fastest column. I can understand developers working to improve the Haskell programs, but I’m curious who’s hacking away at the Fortran and Ada benchmark programs. Otherwise, things are largely how they were four years ago.

Check out https://couchand.github.io/language-viz/ for more of Marceau’s visualizations recreated.

boilerplate

Create an svg canvas with a little frame and clip path.

createCanvas = ->

    svg = d3.select "body"
        .append "svg"
        .attr "width", width + 2*margin
        .attr "height", height + 2*margin
        .append "g"
        .attr "transform", "translate(#{margin},#{margin})"

    svg.append "defs"
        .append "clipPath"
        .attr "id", "clip"
        .append "rect"
        .attr "width", width
        .attr "height", height

    svg.append "g"
        .attr "clip-path", "url(#clip)"

a walk through the new google analytics code

Google is holding a public beta for their new Universal Analytics product. The tracking code is an interesting read, and in any case it’s probably a good idea to know what’s happening with a script you include on a page.

Here it is in its entirety.

(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-XXXX-Y');
ga('send', 'pageview');

That’s a little much, so let’s break it down.

The immediately invoked function expression (pronounced “iffy”) is a common JavaScript idiom. By taking advantage of function-level scope, the code inside is protected from the rest of the page. Here the parameters to the function are single characters to minimize the code, aliasing longer values like window, document, or the string 'script'.

(function(i,s,o,g,r,a,m) {
  …
})(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');

I find it amusing that the developers chose to spell out the word “isogram” (a word with no repeating letters) with the function parameters, since their very use is an application of the principle Don’t Repeat Yourself.

If you counted, you noticed there are seven parameters named but only five arguments passed in. This little hack declares a and m to be local variables as well, in four characters fewer than an explicit var statement.

  i['GoogleAnalyticsObject'] = r;    // r = 'ga'

Remember that i means window, so the first thing to do is set a global variable named GoogleAnalyticsObject holding the name of the real analytics object, which we’ll create next.

This fragment features another JavaScript idiom, the conditional assignment. The phrase a = a || b initializes a with b only if a is not already defined (more precisely, if a is falsy).

  i[r] = i[r] || function() {
    …
  },

Keeping in mind that r is the name of the analytics object ('ga'), we define it as a function in the global scope (a property on window). When the function ga is called, the arguments array is added to the queue (q), which is itself initialized if necessary.

// window.ga = function() {
    ( i[r].q = i[r].q || [] ).push( arguments )
// }

Now that we’ve initialized the analytics object, let’s set the load time of the page to the current time. The multiplication by one coerces the date into a regular number, a timestamp.

  i[r].l = 1 * new Date();

Now we’ll make use of the local variables a and m. They’ll both be script tags: a is a new one, and m is one already on the page.

  a = s.createElement(o),         // o = 'script'
  m = s.getElementsByTagName(o)[0];

Set the new script tag to load Google’s analytics script (asynchronously, if supported) and then append the tag to the page.

  a.async = 1;
  a.src = g;    // g = '//www.google-analytics.com/analytics.js'
  m.parentNode.insertBefore(a,m)

Finally we make use of the global ga function defined above to add two items to the event queue: the creation of the tracker and the initial pageview.

ga('create', 'UA-XXXXXXXX-X', 'example.com');
ga('send', 'pageview');

At this point we can inspect ga and see the following structure.

window.ga.q = [
  ['create', 'UA-XXXXXXXX-X', 'example.com'],
  ['send', 'pageview']
]

The browser is now fetching Google’s script asynchronously. When it loads, the events in the queue will finally be sent to the server.
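To give a sense of what happens then, here’s a conceptual sketch of the command-queue pattern (not Google’s actual code): the loaded script looks up the stub via GoogleAnalyticsObject, replays whatever is queued, and takes over.

// A stand-in for the real implementation inside analytics.js.
function realAnalytics() { /* build and send the measurement hit */ }

var name = window['GoogleAnalyticsObject'];   // 'ga'
var stub = window[name];
var queue = (stub && stub.q) || [];

queue.forEach(function(args) {
  realAnalytics.apply(null, args);            // replay 'create', 'send', ...
});
window[name] = realAnalytics;                 // later calls bypass the queue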

the increasing polarization of the u.s. house

The Washington Post made an admirable attempt to illustrate the forces leading to the shutdown. However, their graphic fails on two important levels: the chart doesn’t support their thesis and their thesis is wrong.

Washington Post vote share chart

Let’s address the technical issues first.

The vast majority of the graphic is a cartogram of the U.S. House districts. Maps may be a good tool to draw readers in, but they’re inappropriate for anything but a geographic thesis. If their argument were that the Northeast and Pacific coast are Democratic strongholds and the South and West Republican ones, it would be clearly demonstrated by the Post’s map (though it’s pretty much common knowledge these days). But they sought to prove the polarization of the parties, not their geographical distribution. It’s a mistake to devote so much ink to an inappropriate tool.

My next concern centers on the legend. The Post has categorized each district by the winner’s party and as safe or competitive based on the share of the vote obtained by the winner. Notwithstanding binning concerns (conclusions can be very sensitive to the choice of the bin cutoffs), this two-axis categorization poses a problem when graphed on a single axis. The problem rears its ugly head with the legend’s center point label: 47%.

Washington Post chart legend

Due to our first-past-the-post voting system, candidates can be elected with less than 50% of the vote, so it’s unwise to situate the Republican and Democratic vote share on two ends of a diverging scale. If the leftmost point is 100% Democrat (0% Republican) and the rightmost point is 100% Republican (0% Democrat), then the center point must be 50/50. Anything else is nonsense. Fortunately fixing this issue is easy.

Washington Post legend remixed

My biggest concern with the Post’s visualization is that it represents only a single moment in time. Attempting to explain an unprecedented situation without any historical context is just lunacy. If this situation is unique, the chart must show that.

So I went and grabbed the Federal Elections Commission data. The first thing I did was build a chart of the distribution of the House over time, binned into the same ranges as the Washington Post’s graphic (c02e9fc).
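The binning itself is a quick job for d3.nest. Here’s a sketch, assuming rows shaped like { year, party, share } parsed from the FEC files (hypothetical field names), with the article’s “54 percent or less” as a placeholder cutoff rather than the Post’s exact bins:

// Count districts per category for each election year.
function category(d) {
  var close = d.share <= 54;
  return (d.party === 'R' ? 'Republican' : 'Democratic') +
         (close ? ', competitive' : ', safe');
}

var countsByYear = d3.nest()
  .key(function(d) { return d.year; })
  .key(category)
  .rollup(function(v) { return v.length; })
  .map(results);   // results: the array of rows described above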

Distribution of U.S. House by share of votes won

Immediately I noticed something: the trend the Post claimed simply does not exist. We can see that over the past decade, the share of representatives in “safe” districts has varied considerably, but with no distinct trend. The Post argued that since “just 31 Republicans and 31 Democrats won their seats with 54 percent of the vote or less” that Congress is more partisan than usual. But compare that to election year 2002, when a mere 36 representatives total were elected in “close” races. Something is missing in the Post’s analysis – their thesis cannot fully explain the current political quagmire.

I decided a chart with more nuance would be needed. Since we want to examine the distribution of vote share, a natural chart type is a cumulative distribution chart. These can convey a significant amount of information in very little space. However, they are a bit unintuitive and can be hard for many to read. Such a chart is rarely the right option for a final graphic, but it can be a good tool for exploring a dataset.

The first cumulative distribution chart I built was simply the overall distribution broken down by years (a663246ab5). This is a bit hard to read, but shows a clear trend. Look just at the horizontal 50% line. This represents the median vote share, and the leftward year-on-year movement shows that it has been steadily decreasing over the past decade. This pretty directly refutes the Post’s thesis that gerrymandering has created safe districts.

Cumulative vote share distribution
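Computing the cumulative distribution is straightforward. A sketch, reusing the same hypothetical rows:

// For each year, sort the winners' vote shares and pair each value with
// the fraction of districts at or below it.
var cdfByYear = d3.nest()
  .key(function(d) { return d.year; })
  .rollup(function(v) {
    var shares = v.map(function(d) { return d.share; }).sort(d3.ascending);
    return shares.map(function(s, i) {
      return { share: s, fraction: (i + 1) / shares.length };
    });
  })
  .map(results);

Each year’s curve is then just a d3.svg.line() over those points.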

But gerrymandering is a complex operation, and we’ll need even more nuance to see the full effect. The most common strategy is known as pack-and-crack. Opposition strongholds are cracked: their voters are distributed among favorable districts so they cannot form a strong, coordinated opposition anywhere. The remaining opposition voters are packed into a single district, wasting their votes on a lopsided win and eliminating their effect on elections elsewhere. I’m not a political scientist, but my guess is that the net result would be to decrease the vote share of incumbents (who are more likely to win anyway) and increase the vote share of new representatives (who are carried into office on the back of a gerrymander).

To get a clearer idea of this I split the graphic into two charts, one limited to incumbents and the other showing freshmen representatives (e586a87690). I also only plot the years before and after redistricting, since these should most effectively show the gerrymander.

Vote share cumulative distribution by incumbency

Look at the second chart, showing non-incumbent representatives. Note that in elections before redistricting, they tend to be elected with 55% of the vote. After redistricting, the median freshman representative is elected with 58% of the vote. Gerrymandering is alive and well in the United States.

I also split it up by party but there wasn’t a discernible trend. I think it would be interesting to split based on control of the state government, since the states are ultimately responsible for redistricting. Perhaps I’ll look into that in the future, but for now I want to focus on the shutdown.

If gerrymandering isn’t enough to explain the political mess, we need to expand our search. One of my influences while working on this project was an xkcd cartoon showing the history of Congress. Go take a look at it, it’s awesome.

xkcd: Congress

In the fine print they note that their data comes from a group of political scientists: Poole, Rosenthal et al. These folks have a two-dimensional coordinate system for mapping congressional representatives called DW-NOMINATE. In the modern era, the first dimension roughly corresponds to the liberal-conservative spectrum of politics. One advantage of their model is that it is based solely on voting records, and thus this liberal-conservative axis is an emergent property, rather than being specified in advance. This lends a bit of support to the idea that there is some merit to their analysis. I’d recommend looking into their work, it’s very interesting.

Fortunately they make available datasets of their supercomputer-crunched coordinate system for every Congress throughout history. Since senators and representatives serve overlapping terms, they can put successive congresses on the same scale (allowing for individuals to evolve their views, i.e., move their position). So I built a chart showing the evolution of this liberal-conservative dimension in the modern era. My first take simply plotted every individual member of the House (9def35ddee). As you can see this is pretty tough to meaningfully read, and it’s slow to render, too.

DW-NOMINATE first dimension score

The natural thing is to do a little statistical mumbo-jumbo. It makes sense in particular to smooth this dataset since we’re not all that convinced that the individual scores are exactly right. A standard box plot shows the 25th and 75th percentiles and the median, and whiskers show the 1st and 99th or 5th and 95th or some other percentiles. A natural extension of the box plot over many time periods is an area chart like the one here. This is somewhat inspired by the ideas of Stephen Few.
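Here’s a sketch of how such percentile bands might be computed, assuming rows shaped like { congress, party, score } (hypothetical field names; quartiles plus the 5th and 95th percentiles here, though the real chart may slice differently):

var bands = d3.nest()
  .key(function(d) { return d.party; })
  .key(function(d) { return d.congress; })
  .rollup(function(v) {
    var scores = v.map(function(d) { return d.score; }).sort(d3.ascending);
    var q = function(p) { return d3.quantile(scores, p); };
    return { p5: q(0.05), p25: q(0.25), median: q(0.5), p75: q(0.75), p95: q(0.95) };
  })
  .map(data);   // data: the array of rows described above

Each pair of percentiles then becomes one band of a d3.svg.area() across the congresses.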

Distribution of DW-NOMINATE first dimension scores by party

Here we see quite plainly a growing divide between the parties. For three decades, between 1952 and 1982, some Democrats were as conservative as the bulk of Republicans, and on several occasions more conservative than the median Republican. For sixty years, from 1930 to 1990, there was significant overlap in the ideologies of the members of the two parties. Since 1976, as the Democrats have grown only modestly more liberal, the Republicans have seen a steady, significant rightward drift.

And so today the difference between the median position of the parties is larger than it’s ever been in the modern era.

Check out the interactive version here: http://couchand.github.io/polarization.

essential federal employees, 1996 vs 2013

Amidst the federal government shutdown coverage I saw an interesting chart on Slate comparing the percentage of essential employees between the 1995-96 and current shutdowns across a number of federal departments. Here is the original chart by Emma Roller.

Original Bar Chart

Right off the bat I had a few thoughts about this. The bars seem pretty heavy, in particular considering they represent a proportion (the whole area above the bar is “filled” with non-essential employees). Using color to double-encode the year seems excessive: the bright blue and red stripes are hard to look at. I’m unable to discern any particular order that the departments are listed in. And why on earth do the percentage labels have two digits after the decimal point?

Here’s my first attempt at recreating this chart (e8140655d5), addressing the points above.

Small Multiples Lines

I like this a lot but there are still some problems with it. It’s hard to tell which department is which, especially for the ones with a high level of essential personnel. It’s also tough to compare departments to one another. I haven’t addressed the ordering issue. Oh, and the colors are still too busy.

I was thinking about how the data don’t seem to be arranged in any particular way, considering my options for ordering, when I remembered a neat little graph that can solve several of these problems. It’s called a bump chart, and here one would show the two years as the two ends of a line chart, plotting all the categories on top of one another. It’s basically your standard line chart with only two data points.
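A minimal sketch of the idea, assuming an array of departments carrying both percentages, and x and y scales already in place (the names here are hypothetical):

// One two-point line per department; each datum is assumed to look like
// { name: 'Commerce', pct1996: ..., pct2013: ... }.
var line = d3.svg.line()
  .x(function(p) { return x(p.year); })
  .y(function(p) { return y(p.percent); });

svg.selectAll('.department')
    .data(departments)
  .enter().append('path')
    .attr('class', 'department')
    .attr('d', function(d) {
      return line([
        { year: 1996, percent: d.pct1996 },
        { year: 2013, percent: d.pct2013 }
      ]);
    });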

To produce my first bump chart attempt (force-layout-labels) I spent too long tweaking a d3 force layout to make nice labels. The basic concept is to let d3’s force calculations space the labels out. They carry a little charge, preventing them from landing on top of each other. They have a link to the line they’re labeling to ensure they stay close.
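The gist of that layout, as a sketch (the names are hypothetical; the real code lives on the force-layout-labels branch):

// Each label is a charged node linked to a fixed anchor sitting at the end
// of the line it labels. labels, anchors and labelSelection are assumed to
// exist already; anchors have fixed set to true so only the labels move.
var nodes = labels.concat(anchors);
var links = labels.map(function(label, i) {
  return { source: label, target: anchors[i] };
});

d3.layout.force()
  .size([width, height])
  .nodes(nodes)
  .links(links)
  .charge(-60)         // labels push each other apart
  .linkDistance(10)    // but stay close to their lines
  .on('tick', function() {
    labelSelection.attr('transform', function(d) {
      return 'translate(' + d.x + ',' + d.y + ')';
    });
  })
  .start();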

Bump Chart

Even after all of my effort the labels are still way off in critical sections of the diagram. I was about to declare defeat when I realized that a better strategy was to redefine it as victory. I shouldn’t be cramming all those labels in there in the first place.

The first rule is to highlight what you want the reader to pay attention to. So for my next iteration (34f9788755), I finally get rid of those extra colors. The source article makes the point that all but three federal departments had a reduced level of essential employees, and that the Department of Commerce had a drastic decline. It seems natural to highlight those four departments, leaving the others in the background.

Highlighting Important Data

There are still a few things I’d like to wrap up. The dots aren’t really necessary, and they make it hard to see some of the finer distinctions. There’s also a little thing that bothers me a whole lot here.

If you look closely, each of the highlighted lines has one background line running in front of it. This is a consequence of laziness and SVG draw order: the last thing you add to an SVG is drawn on top, so the line for NASA is drawn on top of the EPA’s.
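One way to control the order (a sketch, not necessarily the kludge I ended up with) is to sort the selection so the highlighted lines come last in document order:

// selection.sort() reorders the DOM elements to match the comparator, so
// putting the highlighted departments last paints them on top.
// d.highlight is a hypothetical flag marking the four featured departments.
svg.selectAll('.department')
  .sort(function(a, b) {
    return (a.highlight ? 1 : 0) - (b.highlight ? 1 : 0);
  });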

Here is the latest (master), incorporating an easy change to remove the dots and an annoying kludge to ensure the highlighted lines are drawn last.

Essential Federal Employees

Simple and effective.

As always the current version of this visualization can be found live on my GitHub page.

a tough nut to crack

With the Spring ‘13 release of Salesforce.com comes an exciting new way of working with development metadata — the Tooling API.  This modern web service API, though still in its infancy, is expected eventually to replace the existing file-based API.  It will give third parties more flexibility building tools to support development on the Salesforce.com platform.  As is often the case with the release of new Salesforce features, the official documentation is rather scant.  Here are some pointers to help guide your experiments with the Tooling API.

The new Tooling API has a variety of interesting capabilities.  This includes the ability to overlay Apex or SOQL code on top of the system and view detailed debug log and heap dump information, which will facilitate rapid debugging.  Salesforce also provides detailed data on the structure of code, opening the door for tools to diagram your org, navigate your codebase, and highlight and autocomplete syntax. Utilizing such developer tools has the potential to significantly increase your team’s productivity, so it’s well worth making an investment in tooling.

The current Metadata API is primarily intended for deployments.  The Tooling API has been designed from the ground up to support the entire development lifecycle, including design, implementation, deployment, and maintenance.  This means many development tasks are much easier — things like incrementally modifying classes and debugging.

As with all of the REST-based APIs, the Tooling API requires that you authenticate before making requests.  While matters of authentication are outside the scope of this post, I found a tutorial from The Gazler to be quite helpful.  Once you’ve authenticated, it’s pretty straightforward to access and interact with the REST resources.

The examples below assume some familiarity with the command-line, but the ideas should be clear to anyone with experience developing on Force.com.  You can find the helper scripts we’ll use to make the low-level API calls on GitHub.

The Tooling API, as of the Spring ‘13 release, has support for only Apex classes, components, pages and triggers.  It is my assumption that future releases will support more types of metadata.  There are three categories of resources available from the Tooling API which represent the code files, deployment operations and support functions.

# list the Tooling API resources
./get.sh tooling/sobjects

The result of this call is the list of resources as JSON.

{
  "encoding": "UTF-8",
  "maxBatchSize": 200,
  "sobjects":
  [
    {
      "name": "ApexClassMember",
      "label": "Metadata Container Member",
      "keyPrefix":"400"
      …

Look at the field information for an sObject.

# describe a resource type
./get.sh tooling/sobjects/ApexClassMember/describe

Getting into the workflow of the Tooling API takes a bit of getting used to.  The objects provided do not directly represent the underlying Apex components, but rather theoretical deployment items.  Creating an ApexClassMember, for instance, does not update the corresponding ApexClass immediately; the change waits for a deployment to be executed.

# find a class and create a deployment item for it
./get.sh tooling/query?q=SELECT+Id,+Name,+Body+FROM+ApexClass
vim updatedClassInfo.json
./post.sh tooling/sobjects/ApexClassMember updatedClassInfo.json POST
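For reference, the JSON body for that POST might look something like this. Treat it as a sketch: the field names are my reading of the describe output above, the ids are placeholders, and the MetadataContainerId points at the container we’ll create in the next step.

{
  "MetadataContainerId": "<id of the MetadataContainer>",
  "ContentEntityId": "<id of the ApexClass being updated>",
  "Body": "public class MyClass { /* updated source */ }"
}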

That deployment execution is taken care of with two more resource types.  First is the MetadataContainer object, which simply holds all the components that will be deployed together.  These containers must have a unique name, but have no other properties.  Once you have created a container, you can begin to add various deployment components to it, and after adding the components create a ContainerAsyncRequest, which notifies the system that you are ready to deploy.

# create the container
echo '{ "name": "My New Metadata Container" }' > data.json
./post.sh tooling/sobjects/MetadataContainer data.json POST

The deployment request object has several important fields.  You may set the IsCheckOnly flag to indicate that the package should be validated but not deployed, and a reserved field named IsRunTests may be available in the future for ensuring that tests are run.
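A plausible deploymentRequest.json, again as a sketch with a placeholder id, requesting a validation-only run against the container created above:

{
  "MetadataContainerId": "<id of the MetadataContainer>",
  "IsCheckOnly": true
}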

# create the deployment request
vim deploymentRequest.json
./post.sh tooling/sobjects/ContainerAsyncRequest deploymentRequest.json POST

The status of the request is obtained by querying for the newly inserted request object: the State field will change from ‘Queued’ to ‘Completed’ once the request has been successfully processed.  If there are any compilation failures or other errors, they will be recorded in the CompilerErrors and ErrorMsg fields.

# query for the results
./get.sh tooling/sobjects/ContainerAsyncRequest/REQUEST_ID

Once you’ve successfully validated or deployed a metadata container, the container contents are updated with some useful information.  On a successful deployment, the MetadataContainerId field will be updated with the id of the deployment request.  This means you cannot simply make further updates to the ApexClassMember and redeploy; you must insert a new one.

In addition, an exciting field called SymbolTable will be populated, which holds a JSON representation of all variable, function, and class references in the file.  Access to the symbol table enables developers to reliably determine line and column numbers of each symbol, which opens the door for syntax highlighters, code navigators, and tools that analyze the cohesion and coupling of Apex classes.

# query for the symbol table
./get.sh tooling/query?q=SELECT+Id,+ApexClass.Name,+SymbolTable+FROM+ApexClassMember

Using the Force.com Tooling API is not easy.  The API is still rough, and the documentation is terse.  However, the possibilities afforded by this new API may make the effort of investigation well worth it.

false economies

I’m a lazy person. I consider it one of my best qualities, and it’s because I’m lazy that I’m a good developer. This may seem odd at first, but consider that much of software development is abstraction, which is really just a nice way of saying “finding ways to reuse things you’ve done before to avoid having to do as much now.”

If I spend a coffee break today automating a common task and I’m able to shave even a few minutes off it, I may earn myself an extra coffee break tomorrow and every day after. This is just the concept of investment, which is already widely understood in business.

Or is it? False economies are all around us. One of the fastest ways to improve the work that you do is to be conscious in eliminating these false economies. Let’s look at a few examples.

Perhaps the most common false economy in development is copy-and-paste programming, also known as snarf and barf (I like to call it paste-driven development).  We all do this from time to time: a method above has a few lines of code that are almost what we need, so we copy them to where we’re working, make a few tweaks, and away we go. This false economy is very seductive (“why type something all over again if it’s mostly the same?”).  But it has three fatal flaws.

First, there’s often not as much in common as you initially think. If the passage is short there is probably less you will have to change, but there is also less reason not to just type it out. As the pasted section grows longer, the amount that must change increases, as does the chance of error.

In my experience, developers spend far more time dealing with trouble caused by copy-pasta errors than they would have spent simply typing the code in the first place. If you find that isn’t true, spend some time improving your typing speed.

Second, duplicating the code is a replacement for the real solution, abstraction. It is all too common for classes to accrete by the gradual addition of many repetitive elements.  Fighting these forces requires constant refactoring and a disavowal of PDD.

A common programming principle is Don’t Repeat Yourself.  Frequently developers will apply the Rule of Three: you can do something twice, but once you do it a third time you must abstract it. When a passage is pasted once, it is almost inevitable that it will be pasted again. This is what leads other developers to the stricter Rule of Two, and to the adage that there are only three numbers: zero, one, and infinity.

Most importantly, pasting code numbs the mind, and a developer’s mind is her greatest asset. Pasting is a substitute for thinking. Part of this has to do with abstraction as mentioned above, but it goes far beyond that. When you paste code you actively prevent yourself from considering the problem you are trying to solve.  That is, after all, the very point of the clipboard.

A special case of paste-driven development is Google-and-paste programming. This unfortunate style removes the developer entirely from the development process.

Another common false economy in development is something I refer to as haste-driven development.  The conditions are all too prevalent: deadlines are imminent and a partial solution is in place. It seems like if we do things the down-and-dirty way “just this time” we can put a band-aid on the problem. Other times I see developers practicing haste-driven development as a matter of course, probably because they are so used to the process.

HDD manifests in many ways: writing lots of code and leaving the tests for later, growing a single method rather than composing several smaller ones, and failing to refactor code to reflect new information.

This last one is the worst for the long-term maintainability of your system. Software development is a learning process, and the software itself must come to embody the collective knowledge of the team. This means that as the team learns new information, it must be incorporated into the working software or inevitably it will be lost to the frailties of human memory or organizational change.

This essay was inspired by this page on Ward’s Wiki.

two tickets in two days

All I looked at during the first-ever jQuery Developer Summit were UI tickets #8644 and #8646, two annoying little bugs in the Tooltip widget. By the end, my work amounted to two pull requests (the method by which contributions make it into the jQuery projects). These two pull requests contain four commits, which, not counting the unit tests, amount to less than a dozen lines of code. Still, I’m calling it a success.

These are my first contributions to that library I so adore. Now that I’ve got my feet wet, I can’t wait to dive in.  From what I’ve heard, the same is true for most of the participants—the community of contributors just exploded. I’d bet the core team is also calling it a success.

The summit saw about two hundred folks with widely varying skill sets collaborating on every aspect of the project. jQuery is a library that makes writing web applications a breeze. It abstracts away cross-browser issues, and it makes for clean code. But don’t take my word for it: almost 60% of the top 10,000 sites on the web make use of it.

There were, of course, tables like mine, devoted to finding and fixing bugs. There were those working on new features and new widgets. There were plenty of lively discussions about particular issues and the design of the library in general. There was a table devoted to testing (darn right!).

But in true cross-functional form, teams also worked on documentation, something vital, yet often overlooked.  Several tables focused on the new look and feel of the jQuery project websites. There was even a table hacking big data, looking for trends in the tickets and commits, page views and downloads, and even IRC chat logs.

The most inspiring part of the summit was rubbing shoulders with the jQuery team—people I’ve seen on Twitter and GitHub, or from afar at the jQuery Conference in June. There’s nothing like getting a lesson in Grunt straight from “Cowboy” Ben Alman, or having Jörn Zaefferer (tech lead on QUnit) tell me (quite matter-of-factly) that I really need to let the post-commit hook close tickets on my behalf. Or arguing about filesystem security at two a.m. with Timo Tijhof and Daryl Koopersmith, or about widget architecture with Scott Gonzalez (tech lead for jQuery UI).

Open source projects take note: a dev summit like this can be a great way to make real progress while simultaneously building your contributor base. And if you’ve never contributed to open source before, now might be the time to ask yourself, “why not?”. If you’re worried you lack the technical chops, remember that every project needs documentation and a website.

And even if you don’t close any tickets yourself, you can do a lot just by opening some. When you notice an issue in a piece of software, don’t ignore it, rather file a ticket. All you have to do is find the bug tracker for the project (the jQuery projects’ are at bugs.jquery.com, bugs.jqueryui.com, or on GitHub). Clearly describe steps to reproduce the issue and the expected behavior. If you have example code or data, all the better—anything you can do to help the developers find and fix the bug will be much appreciated.

And so, to that end, I’d like to thank the reporters of UI tickets #8644 and #8646, who I know only by their handles, shnitz and josepsanzcamp. Thanks to you, I had a challenge on my plate at the jQuery Developer Summit, and together we’ve improved the library.