SVMlight: Support Vector Machine
Author: Thorsten Joachims
<thorsten@ls8.informatik.uni-dortmund.de>
Version: 0.91
Date: 08.12.97
Overview
SVMlight is an implementation of Support Vector Machines
(SVMs) in C. The main features of the program are:
- fast optimization algorithm
- handles many thousands of support vectors
- handles several ten-thousands of training examples
- supports polynomial, radial basis function, and sigmoid kernel functions
- uses sparse vector representation
- it's lighter than ever
Description
SVMlight is a fully functional implementation of Vapnik's
Support Vector Machine [Vapnik, 1995]. The optimization algorithm used is
a refined version of the decomposition algorithm proposed in [Osuna, et
al., 1997]. It will be described in detail in a forthcoming paper. The
algorithm has modest memory requirements and can handle problems with many
thousands of support vectors efficiently. So far the code has mainly been
used for learning text classifiers [Joachims, 1997]. Text classification
tasks typically have sparse instance vectors. This implementation exploits
that property, which leads to a very compact and efficient representation.
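The payoff from sparsity comes mainly from inner products: the dot product of two sparse vectors only has to touch their nonzero entries. The following C sketch illustrates the idea; the struct and function names are illustrative and are not the actual types used inside SVMlight.

```c
/* One nonzero entry of a sparse vector, in the spirit of the
   feature:value pairs described below. Illustrative only. */
typedef struct {
    int    feature;  /* feature number, strictly increasing within a vector */
    double value;    /* nonzero feature value */
} SparseEntry;

/* Dot product of two sparse vectors whose entries are sorted by
   feature number. Runs in O(len_a + len_b), visiting only the
   nonzero entries of either vector. */
double sparse_dot(const SparseEntry *a, int len_a,
                  const SparseEntry *b, int len_b) {
    double sum = 0.0;
    int i = 0, j = 0;
    while (i < len_a && j < len_b) {
        if (a[i].feature == b[j].feature) {
            sum += a[i].value * b[j].value;
            i++; j++;
        } else if (a[i].feature < b[j].feature) {
            i++;
        } else {
            j++;
        }
    }
    return sum;
}
```

With 9947 features per document but only a few dozen nonzero entries each (as in the text classification example below), this merge-style loop does a tiny fraction of the work of a dense dot product.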
Getting SVMlight
The source code and binaries are available here. Both are limited to scientific
use! If you get the binaries, please be aware that they include code from
the DONLP2
optimization package written by P.
Spellucci. Make sure you read his copyright information. For now binaries
are available for the following platforms:
- SunOS 4.1.3 (unpack with: gunzip -c svm_light_sunos413.tar.gz | tar xvf -)
- Solaris 2 (unpack with: gunzip -c svm_light_solaris.tar.gz | tar xvf -)
The SunOS 4.1.3 version is less memory efficient, since it uses the f2c-converted
version of DONLP2.
Please send me email and let me know that you got svm-light. I will put you on my mailing list to inform you about new versions and bug-fixes.
Source Code
The source code is available for scientific use only. It must not be modified
and distributed without prior permission of the author. The implementation
was developed on Solaris 2, but also compiles on SunOS 4.1.3. Although
I have not tried it yet, I do not see why it should not run on other platforms,
too. The source code is available at the following location:
SVMlight uses the DONLP2
optimization package written by P.
Spellucci for solving intermediate quadratic programming problems.
It can be downloaded from
Please read the copyright information. DONLP2 is written in Fortran, but
compiles and links just fine into C code using gcc. If this does not work
for you, you might want to download the f2c converted version. It is less
flexible and uses more memory, though:
Installation
To install SVMlight you need to download svm_light.tar.gz
and donlp2.tar.gz (or donlp2_c.tar.gz). Create a new
directory:
and move both files in there. Unzip and untar the files with the following
commands:
gunzip -c svm_light.tar.gz | tar xvf -
(mkdir donlp2; cd donlp2; gunzip -c ../donlp2.tar.gz | tar xvf -)
or
gunzip -c donlp2_c.tar.gz | tar xvf -
Now run the sh-batch file
which compiles the system and creates the two executables:
svm_learn (learning module)
svm_classify (classification module)
If the system does not compile because you do not have the f2c library,
check here.
How to use?
svm_learn is called with the following parameters:
svm_learn [options] example_file model_file
Available options are:
-h           -> Help.
-v [0..3]    -> Verbosity level (default 2).
-i [0,1]     -> Remove training examples from the training set for which
                the upper bound C on the Lagrange multiplier is active
                (default 1). After removal the SVM is trained from scratch
                on the remaining examples.
-q [50..200] -> Maximum size of QP-subproblems (default 100). Different
                values may improve training speed, but do not influence
                the resulting classifier.
-c float     -> C: Trade-off between training error and margin
                (default 1000).
-e float     -> eps: Allow that error when fitting the constraints
                |y [w*x+b] - 1| <= eps at the KT-point (default 0.001).
                Larger values speed up training, but may result in a
                worse classifier.
-t int       -> Type of kernel function:
                  0: linear                 K(a,b) = a*b
                  1: polynomial             K(a,b) = (a*b+1)^d
                  2: radial basis function  K(a,b) = exp(-gamma ||a-b||^2)
                  3: sigmoid                K(a,b) = tanh(s a*b + c)
-d int       -> Parameter d in the polynomial kernel.
-g float     -> Parameter gamma in the rbf kernel.
-s float     -> Parameter s in the sigmoid kernel.
-r float     -> Parameter c in the sigmoid kernel.
The input file example_file contains the training examples. The
first line contains comments and is ignored. Each of the following
lines represents one training example and is of the following format:
<class> .=. +1 | -1
<feature> .=. integer
<value> .=. real
<line> .=. <class> <feature>:<value> <feature>:<value> ... <feature>:<value>
The class label and each of the feature/value pairs are separated by a
space character. Feature/value pairs MUST be ordered by increasing feature
number. Features with value zero can be skipped.
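As an illustration, a tiny example_file in this format might look as follows (the feature numbers and values here are made up):

```
sample training file: this first line is ignored
+1 3:0.5 7:1.2 100:0.1
-1 1:2.0 3:0.8 9:0.3
```

Note that in both example lines the feature numbers increase from left to right, as required, and that all features not listed are taken to be zero.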
The result of svm_learn is the model which is learned from
the training data in example_file. The model is written to model_file.
To classify test examples, svm_classify reads this file. svm_classify
is called with the following parameters:
svm_classify [options] example_file model_file output_file
Available options are:
-h -> Help.
-v [0..3] -> Verbosity level (default 2).
All test examples in example_file are classified and the predicted
classes are written to output_file. The example file has
the same format as the one for svm_learn. Additionally, <class>
can have the value zero, indicating that the class is unknown.
Getting started: an Example Problem
You will find an example text classification problem at
Download this file into your svm_light directory and unpack it with
gunzip -c example1.tar.gz | tar xvf -
This will create a subdirectory example1. Documents are represented
as feature vectors. Each feature corresponds to a word stem (9947 features).
The task is to learn which Reuters articles are about "corporate acquisitions".
There are 1000 positive and 1000 negative examples in the file train.dat.
The file test.dat contains 600 test examples. The feature numbers
correspond to the line numbers in the file words. To run the example,
execute the commands:
svm_learn example1/train.dat example1/model
svm_classify example1/test.dat example1/model example1/predictions
The accuracy on the test set is printed to stdout.
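The accuracy reported is simply the fraction of test examples whose predicted class matches the true class. Assuming the predictions are signed values whose sign gives the predicted class (an assumption about the output format, not something stated above), the computation can be sketched as:

```c
/* Sketch: classification accuracy from predicted values and true
   labels (+1/-1). Assumes the sign of each prediction gives the
   predicted class; illustrative, not code from svm_classify. */
double accuracy(const double *pred, const int *label, int n) {
    int correct = 0;
    for (int i = 0; i < n; i++) {
        int predicted_class = (pred[i] >= 0.0) ? +1 : -1;
        if (predicted_class == label[i]) correct++;
    }
    return (double)correct / n;
}
```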
Questions and Bug Reports
If you find bugs or you have problems with the code you cannot solve by
yourself, please contact me via email <thorsten@ls8.informatik.uni-dortmund.de>.
History
V0.9 -> V0.91
- Fixed a small bug which appeared for very small C: optimization did not converge.
References
[Joachims, 1997] T. Joachims,
Text Categorization with Support Vector
Machines, to be published,
http://www.cs.cmu.edu/~thorsten/tcatsvm.ps,
1997.
[Osuna, et al., 1997] E. Osuna, R. Freund, and F. Girosi,
An Improved Training
Algorithm for Support Vector Machines, IEEE NNSP, 1997.
[Vapnik, 1995]
V. Vapnik, The Nature of Statistical Learning Theory,
Springer, New York, 1995.
Last modified December 8th, 1997 by Thorsten
Joachims <thorsten@ls8.informatik.uni-dortmund.de>