BLAST Graphic Viewer

While working at the Carnegie with Stanford Genomics I developed a new visualizer for genomic data. The standard bioinformatics search tool BLAST generates what is effectively ASCII-art output. An example of one small part of the report follows:

Query: 55     TTGTTTTCTGGTCATGACCT-CACTGATCTGGTTTGAAAGGATATGGTTG 7
              ||  |||||| | ||  ||| |||||   || |||||||||||||| |||
Sbjct: 343509 TTTATTTCTG-TAATCTCCTTCACTGCG-TGATTTGAAAGGATATGATTG 343556

The ASCII report is full of important details, but suffers from not giving the end-user a 5000-foot overview of the results.

Over the years, various tools have been developed to convert the ASCII output into something more succinctly graphical. (Such graphical outputs generally complement rather than replace the ASCII output since it is venerable and often contains detailed information left out of the graphical version.) These graphical tools were incorporated into web front-ends to BLAST.
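Any such conversion starts by reading coordinates back out of the alignment rows. Here is a minimal sketch of that step, written in Python for brevity; it is not code from any of the tools mentioned. The names ROW and parse_row are hypothetical, and it assumes plus-strand rows laid out like the Sbjct line in the sample above.

```python
import re

# One alignment row: label, start coordinate, gapped sequence, end coordinate.
ROW = re.compile(r"^(Query|Sbjct):\s+(\d+)\s+([A-Za-z-]+)\s+(\d+)\s*$")

def parse_row(line):
    """Return the fields of one alignment row, or None if it is not a row."""
    m = ROW.match(line)
    if m is None:
        return None
    label, start, seq, end = m.groups()
    return {
        "label": label,
        "start": int(start),
        "end": int(end),
        "seq": seq,
        # Gap characters ("-") do not consume a coordinate.
        "residues": len(seq.replace("-", "")),
    }

row = parse_row(
    "Sbjct: 343509 TTTATTTCTG-TAATCTCCTTCACTGCG-TGATTTGAAAGGATATGATTG 343556"
)
# On the plus strand, the ungapped length matches the coordinate span.
print(row["residues"], row["end"] - row["start"] + 1)
```

A graphical view only needs the start and end of each hit, which is why the per-base detail stays in the ASCII report.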

Our refinements to the display included small but impactful graphic design changes, a new layout algorithm, and support for lightweight JavaScript-based annotations. Among those contributions, I'd like to call out the layout algorithm as significant. It was developed through iterative refinement with the Stanford Genomics group (the credit for the idea goes to them) and is designed to offer the user a condensed, purposefully limited, quick graphical overview of the ASCII data. It also lets the user navigate directly from the graphic to the details in the ASCII report.

Prior to this algorithm, the graphical displays would show only the "top" hits, which supports just one use case. Researchers looking for distant yet related hits would not see much in those older graphical layouts. By giving the user an overview of the entire result set, the newer viewer serves more tasks.

Below is a static example of the layout algorithm's output. The important thing is that it has taken a large result set and extracted an overview by "bucketing" the results and then taking the top hits from each bucket. It keeps the image within certain size constraints, filling as much of that space as it can.
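The bucketing idea can be sketched roughly as follows. This is an illustration in Python, not the viewer's actual Perl code, and the details are assumptions: it buckets hits into equal-width score bands and fills a fixed row budget round-robin, and the names Hit, bucket_overview, n_buckets and max_rows are all hypothetical. The real algorithm may bucket and fill differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    name: str
    score: float  # assumed: higher is better, e.g. a bit score

def bucket_overview(hits, n_buckets=3, max_rows=4):
    """Pick up to max_rows hits that span the whole score range."""
    if not hits or max_rows <= 0:
        return []
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    lo, hi = ranked[-1].score, ranked[0].score
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width if all scores tie
    buckets = [[] for _ in range(n_buckets)]
    for h in ranked:
        idx = min(int((hi - h.score) / width), n_buckets - 1)
        buckets[idx].append(h)  # each bucket stays sorted best-first
    # Take the best hit of every bucket, then the second best, and so on,
    # until the row budget (the image's size constraint) is used up.
    chosen, depth = [], 0
    while len(chosen) < max_rows and any(depth < len(b) for b in buckets):
        for b in buckets:
            if depth < len(b) and len(chosen) < max_rows:
                chosen.append(b[depth])
        depth += 1
    return sorted(chosen, key=lambda h: h.score, reverse=True)

scores = [100, 98, 96, 94, 92, 90, 50, 20, 10]
hits = [Hit(f"hit{i}", float(s)) for i, s in enumerate(scores)]
overview = bucket_overview(hits)
print([h.score for h in overview])  # [100.0, 98.0, 50.0, 20.0]
```

A plain "top four" display would show only the hits scoring 100 to 94. The bucketed selection still leads with the best hits but also surfaces the distant ones at 50 and 20.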

The original version, currently used live on various sites including Stanford and TAIR, is freely available through GMOD (which contains a link to the downloads). The code is written in Perl in an object-oriented style. Perl was a requirement because it is a common language in both the web and bioinformatics worlds.

Of course, web technologies are rich enough, and progressing at such a pace, that the version I developed is itself far behind the times. I have tinkered with a new version to address some concerns (although the topic clearly still offers a bevy of rich, untapped opportunities).

A web-based slide set describing the new (alpha) project is available online. (It requires JavaScript and worked well enough in Firefox for me to have presented it publicly at times. There are JavaScript-based controls at the bottom right of each page, or you can use the keyboard arrow keys to go forward and back. Some of the slides are larger than the page if your web browser window is small or your web browser text is large. That's what I get for fighting the PowerPoint power.) Also, the ever-in-progress source code for the new project is available online, either as an archive or as browsable files.

Please note: As with just about any 'real' code, there are some real humdinger issues with both versions, most of which I probably know about (like the very silly ScientificNotation.pm file - really an abomination, which could simply be replaced with things like printf). Sorry for the cruft! It's the journey, n'est-ce pas?
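To illustrate the printf point: printf-style format strings already produce scientific notation, so a dedicated module is rarely needed for it. The sketch below is in Python rather than the project's Perl, and format_evalue is a hypothetical name, not something from the viewer's source. Perl's printf and sprintf accept the same %e conversion.

```python
def format_evalue(value, digits=1):
    """Render a number in scientific notation with a printf-style spec."""
    # "%.*e" reads the precision from the argument tuple, as in C's printf.
    return "%.*e" % (digits, value)

print(format_evalue(3.2e-45))      # 3.2e-45
print(format_evalue(0.000123, 2))  # 1.23e-04
```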