TSP+k Code for Rearrangement Clustering

Software: TSP+k Code for Rearrangement Clustering of Gene Expression Data

Sharlee Climer and Weixiong Zhang
Department of Computer Science and Engineering
Washington University in St. Louis, St. Louis, MO 63130, USA
email: sharlee at climer.us, zhang at cse.wustl.edu

Given a matrix of values, rearrangement clustering involves rearranging the rows of the matrix and identifying cluster boundaries within the linear ordering of the rows. The rearrangement and cluster boundaries are selected so as to maximize the sum of similarities between adjacent rows within the clusters. The TSP+k algorithm for rearrangement clustering is presented in Rearrangement Clustering: Pitfalls and Remedies and used to cluster a 2,467 yeast gene dataset. The implementation is summarized in A Traveling Salesman's Approach to Clustering Gene Expression Data. Tables listing functional groups found within the resultant clusters can be downloaded here: k=50, k=100, k=150, k=200, k=250, k=300, k=350 (where k is the number of clusters). Tables listing functional groups found using restricted partitioning (RP) can be downloaded here: k=44, k=77, and k=110. A list of functional groups in the clusters found by Eisen, Spellman, Brown, and Botstein [PNAS, 1998] are here.

There are three steps to clustering using TSP+k. First, the expression data is converted to a Traveling Salesman Problem (TSP), based on the desired number of clusters. Second, the TSP is solved. Third, the expression data is rearranged according to the TSP solution and cluster boundaries are indicated. Our code is used for the first and last steps and any TSP solver can be used for the second step. For citations to papers comparing TSP solvers and listings of publicly available software, see the references listed in this paper.

For all of our experiments, we used Concorde, an award winning TSP solver that has successfully solved a record 15,112-city TSP instance to optimality. The Concorde code is publicly available at the Concorde home page.

Our code uses the Pearson correlation coefficient for the similarity measure and can be easily revised for an alternate measure.

This code is written in C++ and we have run it on Linux. The package is a gzipped tar file. Use "gunzip code06.tar.gz" followed by "tar -xvf code06.tar". See the README file for details on installation and usage.

The package can be downloaded here.

To learn more about rearrangement clustering and the TSP+k algorithm, take a look at this paper.

This page has been accessed times since April 2006.