SGTlight

Spectral Graph Transducer

Author: Thorsten Joachims <thorsten@joachims.org>
Cornell University
Department of Computer Science

Version: 1.00
Date: 20.08.2003

Description

SGTlight is an implementation of a Spectral Graph Transducer (SGT) [Joachims, 2003] in C using Matlab libraries. The SGT is a method for transductive learning. It solves a normalized-cut (or ratio-cut) problem with additional constraints for the labeled examples using spectral methods. The approach is efficient enough to handle datasets with several ten-thousands of examples.

Source Code and Binaries

The program is free for scientific use. Please contact me, if you are planning to use the software for commercial purposes. The software must not be distributed without prior permission of the author. If you use SGTlight in your scientific work, please cite as

I would also appreciate, if you sent me (a link to) your papers so that I can learn about your research. The implementation was developed on Windows 2000 with mcc and Matlab R12, but compiles also on Linux and Solaris. The source code is available at the following location:

http://download.joachims.org/sgt_light/current/sgt_light.tar.gz

If you just want the binaries, you can download them for the following systems:

Please send me email <tj@cs.cornell.edu> and let me know that you got SGTlight. I will put you on my mailing list to inform you about new versions and bug-fixes.

Installation (Source Code)

To compile and install the source code of SGTlight you need to download sgt_light.tar.gz. Unpack it with

gunzip -c sgt_light.tar.gz | tar xvf -

You compile the source code with

make

which compiles all executables (type make info for additional options). Note that you have to have Matlab installed with the mcc compiler. You will have five executables:

sgt_buildknngraph  (creates a k-NN graph from feature vectors)
sgt_decompgraph    (compute first d eigenvectors of graph Laplacian)
sgt_predict        (predict labels of test examples)
sgt_traintestsplit (make random test/training split)
sgt_addgraphs      (adds adjacency matrices of two graphs)

To run the executables, you need the Matlab runtime library, which is available as the compressed archive mglinstaller. The runtime library is delivered with Matlab (see here). Copy mglinstaller into your sgt-light directory and call

mglinstaller

which creates two subdirectories bin and toolbox. See here for more details on how to obtain and install this library. Finally, you must extend your search path by executing

Solaris: setenv LD_LIBRARY_PATH ./bin/sol2:$PATH
Windows: set PATH=.\bin\win32;%PATH%
Linux: setenv LD_LIBRARY_PATH ./bin/glnx86:$PATH

to make sure the library is accessible. For Matlab R13 on Linux and Solaris, there seems to be a bug in the math library and sgt_decompgraph might not find some mex files. To fix the problem, execute

cp toolbox/matlab/datatypes/cellfun.* .
cp toolbox/matlab/sparfun/arpackc.* .

Installation (Binaries)

To install one of the binary archives of SGTlight (e.g. sgt_light_solaris.tar.gz) you need to download it and unpack it with

gunzip -c sgt_light_solaris.tar.gz | tar xvf -
gunzip -c sgt_light_windows.tar.gz | tar xvf -
gunzip -c sgt_light_linux.tar.gz | tar xvf -

You will have five executables:

sgt_buildknngraph  (creates a k-NN graph from feature vectors)
sgt_decompgraph    (compute first d eigenvectors of graph Laplacian)
sgt_predict        (predict labels of test examples)
sgt_traintestsplit (make random test/training split)
sgt_addgraphs      (adds adjacency matrices of two graphs)

To run the executables, you need the Matlab runtime libraries, which are included in the binary archive (also, see here for details on the Matlab libraries). Finally, you must extend your search path by executing

Solaris: setenv LD_LIBRARY_PATH ./bin/sol2:$PATH
Windows: set PATH=.\bin\win32;%PATH%
Linux: setenv LD_LIBRARY_PATH ./bin/glnx86:$PATH

to make sure the library is accessible.

How to use

This section explains how to use the SGTlight software. The background is given in [Joachims, 2003].

SGTlight consists of three main modules and two additional programs for convenience. The main modules are, first, a program (sgt_buildknngraph) for constructing the k nearest neighbor graph from data in the form of feature vectors given in SVMlight format (see here). Second, a program (sgt_decompgraph) for computing the d smallest eigenvectors of the Laplacian of a graph, and third a module (sgt_predict) for computing the acutal predictions for a given training and test set. Furthermore, there is a program (sgt_traintestsplit) for computing a random test/training split and a program (sgt_addgraphs) for adding two adjacency matrices (used for co-training). 

You can get a description of the parameters of each module by calling it with the option -?. The following examples illustrate how to use the different programs.

Example: Text Classification

You will find an example text classification problem at

http://download.joachims.org/sgt_light/examples/example1.tar.gz

Download this file into your sgt_light directory and unpack it with

gunzip -c example1.tar.gz | tar xvf -

This will create a subdirectory example1. Documents are represented as feature vectors. Each feature corresponds to a word stem (9947 features). The task is to learn which Reuters articles are about "corporate acquisitions". There are 300 positive and 300 negative examples in the file examples.dat. The feature numbers correspond to the line numbers in the file words. To run the example, execute the commands:

sgt_buildknngraph -k 50 example1/examples.dat example1/graph.gra
sgt_decompgraph -d 80 example1/graph.gra example1/graph.dcg

The first command takes the input points in example1/examples.dat and computes the weighted 50 nearest neighbor graph (using cosine similarity) which is stored in example1/graph.gra. See (svm_buildknngraph -?) for additional options. The second command computes the 80 first eigenvectors of the (normalized) Laplacian of example1/graph.gra and stores them in example1/graph.dcg. Both sgt_buildknngraph and sgt_decompgraph need to be called only once for each dataset, since their result is independent of the particular test/training split and the labels in the training set.

To generate a random test/training split with 10 examples in the training set, call the command

sgt_traintestsplit -n 10 example1/examples.dat example1/train.dat example1/test.dat

The files example1/train.dat and example1/test.dat contain the train and test labels in the same order as in the file example1/examples.dat. +1 stands for a positive label, -1 stands for a negative label, and 0 for unlabeled. To compute the predictions for this training/test split, call

sgt_predict -c 1000 -l example1/test.dat example1/train.dat example1/graph.dcg example1/pred

Since the true test labels are given (-l example1/test.dat), the prediction accuracy on the test set is computed and printed to stdout. The trade-off between training error and complexity is set via -c 1000 similar to an SVM. The individual predictions on the training and test examples are output to example1/pred, in the same order as in the file example1/train.dat. A positive value indicates prediction as a positive example, a negative value indicates a negative prediction.

Example: Co-Training

Sorry, not yet ready.

Questions and Bug Reports

If you find bugs or you have problems with the code you cannot solve by yourself, please contact me via email <tj@cs.cornell.edu>.

Disclaimer

This software is free only for non-commercial use. It must not be distributed without prior permission of the author. The author is not responsible for implications from the use of this software.

History

V1.00 

References

[Joachims, 2003a]

Thorsten Joachims, Transductive Learning via Spectral Graph Partitioning, Proceedings of the International Conference on Machine Learning (ICML), 2003.
[PDF] [Postscript (gz)]

Last modified August 22nd, 2003 by Thorsten Joachims <thorsten@joachims.org>