
Rearrangement Clustering code

Sharlee Climer, 2004
corrections and updates, April 2011

Rearrangement clustering is described in our paper "Rearrangement
Clustering: Pitfalls and Remedies" available at: www.climer.us.

This code has been developed and used on whitebox Linux.  

Follow these commands to create the executables:

gunzip rearrangement.tar.gz
tar -xvf rearrangement.tar
cd rearrangement
make

These commands should create the following executables in the
rearrangement folder:
conv
order1
order2
orderNames

This code is composed of two different parts.  The first part, 
"conv", converts a data matrix into a Traveling Salesman Problem 
(TSP).  This TSP is in the same format as those in TSPLIB, and 
can be used by many solvers, such as Concorde.  For usage, type
"conv".  

IMPORTANT:  The conversion code in this package uses the Pearson
correlation coefficient similarity measure.  The similarity metric 
that is used has a profound effect on the clustering results
and should be carefully chosen to suit your application.    
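
For reference, the Pearson correlation coefficient between two rows
of the data matrix can be sketched as below.  This is a minimal
illustration and not the code in "convert.cpp": the function name
"pearson" is hypothetical, missing values (the '1000' entries) are not
handled, and returning 0 for a constant row is an assumption of this
sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation coefficient between two feature vectors of equal
// length.  Returns 0.0 when either vector has zero variance (an assumption
// of this sketch; convert.cpp may treat that case differently).
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && !x.empty());
    const double n = static_cast<double>(x.size());

    // Means of both vectors.
    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= n;
    meanY /= n;

    // Sums of products of deviations from the means.
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if (sxx == 0.0 || syy == 0.0) return 0.0;
    return sxy / std::sqrt(sxx * syy);
}
```

If a different similarity measure suits your application better, a
function with the same shape could be substituted at this step.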

The input file for "conv" should only contain the data values for 
the matrix.  Each row of the matrix corresponds to an object, and 
each column to a feature.  Missing values should be replaced by 
'1000'.  It is assumed that the input data are real values (negative 
values are OK) that are less than 1000.  A sample file for 5 objects
with 5 features is included as the file "gene5.txt".  To convert 
"gene5.txt" to a TSP for 3 clusters, type 
"conv gene5.txt gene5.tsp 3 5 5".  The file "gene5.tsp" will be
produced in TSPLIB format.   

The file "convert.cpp" is the source for "conv".  It is a simple
program that can be modified or easily rewritten to suit your data.  The 
product of the code is a matrix of the distances between every pair
of objects, including the dummy objects.  "convert.cpp" uses the 
Pearson correlation coefficient to find the similarity between each 
pair of objects.  This similarity is inverted to yield the distances 
between the objects.  The Pearson correlation coefficient values are
real numbers between -1 and 1, so the inverted values also are real
numbers between -1 and 1.  Since most TSP solvers require positive 
integral values, the inverted values are increased by an adequately
large constant, multiplied by an adequately large constant, and the
fractional portion is truncated.  This process produces positive  
integral distances between each pair.  
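
The conversion from a similarity to an integral distance can be
sketched as follows.  This is an illustration only: the name
"toDistance" and the constants kShift and kScale are hypothetical (the
values used in "convert.cpp" may differ), and "inverted" is assumed
here to mean negated, which keeps the values between -1 and 1.

```cpp
#include <cassert>

// Hypothetical constants chosen for illustration only.
const double kShift = 2.0;     // -r >= -1, so -r + 2 is always positive
const double kScale = 1000.0;  // keeps three decimal digits before truncation

// Converts a Pearson correlation r in [-1, 1] to a positive integer distance.
int toDistance(double r) {
    assert(r >= -1.0 && r <= 1.0);
    const double inverted = -r;  // assumption: "inverted" means negated
    // Shift to a positive range, scale up, then truncate the fraction.
    return static_cast<int>((inverted + kShift) * kScale);
}
```

With these constants, highly correlated objects (r near 1) get the
smallest distances and anti-correlated objects (r near -1) get the
largest.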

We used the TSPLIB format for the file that is output by "conv".  We
have included a sample file "gene5.tsp" which has 5 objects and k=3.
The first six lines of the file are the heading, and the distances 
follow in an upper-triangular format (the diagonal is omitted).  The
DIMENSION value in the third line is the number of cities.  Since
there are 5 objects and 3 dummies 
(k=3), the number of cities is 8.  The first row of distance values 
contains the distances from object 0 to objects 1, 2, 3, and 4, 
respectively, followed by three 0s, which are the distances to the
dummies.  The second row of distance values corresponds to the
distances from object 1 to objects 2, 3, 4, and the three dummies.
There are seven rows in this distance matrix, with the last containing
a single value -- the distance from the second-to-last dummy to the
last dummy.  This file is ready to be used as input to a TSP solver
that recognizes the TSPLIB format.
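
To make the layout concrete, the sketch below expands the distance
values, read row by row, into a full symmetric matrix.  The function
name "expandUpperRow" is hypothetical and is not part of this package;
it assumes the six heading lines have already been skipped and the
values have been read into a vector.  For the 8 cities in "gene5.tsp"
there are 7 + 6 + ... + 1 = 28 values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expands n*(n-1)/2 distances, listed row by row above the diagonal,
// into a full symmetric n x n matrix with zeros on the diagonal.
std::vector<std::vector<int>> expandUpperRow(const std::vector<int>& values,
                                             std::size_t n) {
    assert(values.size() == n * (n - 1) / 2);
    std::vector<std::vector<int>> d(n, std::vector<int>(n, 0));
    std::size_t k = 0;  // position in the flat list of values
    for (std::size_t i = 0; i + 1 < n; ++i) {
        // Row i holds the distances from city i to cities i+1 .. n-1.
        for (std::size_t j = i + 1; j < n; ++j) {
            d[i][j] = values[k];
            d[j][i] = values[k];
            ++k;
        }
    }
    return d;
}
```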

If you don't already have a TSP solver, NEOS is a current option
for this purpose.  Go to www.neos-server.org/neos/solvers/index.html
and select "concorde [TSP Input]" under "Combinatorial Optimization
and Integer Programming".  Then select "TSPLIB format file" and upload
the file output by the "conv" program (above).

After converting the data and solving the TSP, the solution for the
TSP is used to rearrange the data.  This is the second part of the
program.  Concorde numbers the cities from 0 to n-1, where n is the
number of cities.  However, Helsgaun numbers from 1 to n.  IMPORTANT: 
adjust the value of startCity in convert.h for the solver you use! 
 
There are three different options for this output.  For a rearranged 
data matrix, use order1.  For the same data with the clusters 
identified, use order2.  To rearrange the names of genes, use 
orderNames.  Type "order1", "order2", or "orderNames" for usage.  
The "tour.sol" file should contain the number of cities followed by
the tour ordering, as is done by Concorde.  For "orderNames", the 
"data.tex" file should only contain the names of the objects.  These 
names should be in the same order as the original data, as is
demonstrated in the sample file "gene5names.txt".
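
The way a tour identifies clusters can be sketched as follows: the
tour is a cycle, and cutting it at each dummy city leaves one run of
objects per cluster.  This is an illustration and not the code in
order2.  The function name "splitTour" is hypothetical, the tour is
assumed to hold only the ordering (without the leading city count),
and the dummies are assumed to be numbered after the objects, as in
the "gene5.tsp" description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits a cyclic tour into clusters by cutting it at the dummy cities.
// startCity is the label of the first city used by the solver (0 or 1).
// Labels at or beyond startCity + numObjects are treated as dummies.
// Returned clusters hold 0-based object indices in tour order.
std::vector<std::vector<int>> splitTour(const std::vector<int>& tour,
                                        int startCity, int numObjects) {
    const std::size_t n = tour.size();
    std::vector<std::vector<int>> clusters;

    // Find any dummy so that no cluster wraps around the end of the tour.
    std::size_t first = n;
    for (std::size_t i = 0; i < n; ++i) {
        if (tour[i] - startCity >= numObjects) { first = i; break; }
    }
    if (first == n) {  // no dummy found: the whole tour is one cluster
        std::vector<int> all;
        for (int c : tour) all.push_back(c - startCity);
        clusters.push_back(all);
        return clusters;
    }

    // Walk once around the cycle, starting just after that dummy.
    std::vector<int> current;
    for (std::size_t step = 1; step <= n; ++step) {
        const int city = tour[(first + step) % n] - startCity;
        assert(city >= 0);
        if (city >= numObjects) {  // a dummy closes the current cluster
            if (!current.empty()) clusters.push_back(current);
            current.clear();
        } else {
            current.push_back(city);
        }
    }
    return clusters;
}
```

For example, with 5 objects and the 0-based tour 0 1 5 2 6 3 4 7, the
dummies 5, 6, and 7 cut the cycle into the clusters {2}, {3, 4}, and
{0, 1}.  If two dummies are adjacent in the tour, this sketch simply
returns fewer clusters.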

Please contact me if you have any questions or suggestions, or if you
find any bugs.

sharlee@climer.us

