Xref: utzoo sci.bio:2069 bionet.software:38 sci.research:982
Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!phri!roy
From: roy@phri.UUCP (Roy Smith)
Newsgroups: sci.bio,bionet.software,sci.research
Subject: Fast restriction analysis program available
Keywords: finite state machine
Message-ID: <3782@phri.UUCP>
Date: 27 May 89 01:35:50 GMT
Organization: Public Health Research Institute, NYC, NY
Lines: 54


	A program to perform restriction site analysis on dna sequences,
called renzyme, is now being made available.  The algorithm was described in
the paper cited below.  In a nutshell, it's about 20 times faster than other
common programs.  If you ever want to find restriction sites in sequences of
non-trivial length, you want this program.

	The source code is copyrighted, but is being made available at no
charge to non-profit, academic institutions for research purposes.  You can
get it by anonymous ftp from goober.phri.nyu.edu (128.122.136.10) in the
directory ~ftp/pub/seq.  You want to get the following files:

-rw-r--r--  1 roy          1146 May 22 17:52 Acknowledgements
-rw-r--r--  1 roy          1349 May 25 13:55 Copyright
-rw-r--r--  1 roy          5594 May 26 20:58 Setup
-rw-r--r--  1 roy        204645 May 26 20:53 renzyme.tar.Z
-rw-r--r--  1 roy         29321 May 22 17:52 seqlib.tar.Z

	The first 3 files are pretty obvious.  The two compressed tar files
are the sources for the renzyme program and some related utilities, and a
library of routines which you need to compile the other programs.  The source
trees are set up to maintain multiple binaries from a single set of sources
so the Makefiles are a tad complex; the Setup file should explain everything.
The code is all in C and is known to compile and run under SunOS-3.5 and
MtXinu 4.3BSD/NFS.  I don't expect any major problems porting to other
systems.  If you ftp this stuff, drop me a note so I know who has it.  For
details of the algorithm, read:

%A R. Smith
%T A finite state machine algorithm for finding restriction sites
and other pattern matching applications
%J CABIOS
%V 4
%N 4
%D 1988
%X Existing algorithms for finding restriction endonuclease recognition sites
use brute-force algorithms which run in time O(NM) where N is the number of
nucleotides in the sequence under analysis and M is the total number of
nucleotides in all the different sites being searched for.  This paper
presents a deterministic finite state machine algorithm which runs in time
O(N).  Memory use can be as high as O(M^4) but a slight modification to the
basic algorithm can impose a theoretical upper bound of O(M) at the cost of
some added complexity in the execution of the state machine.  The algorithm
can operate with a single pass through the sequence under analysis, with no
need to ever back up or (for non-circular sequences) store more than a single
input character at a time.  This type of algorithm can be adapted to many
pattern matching tasks and is simple enough to implement in hardware that it
could, for example, be built into a disk controller as part of a specialized
database machine.
-- 
Roy Smith, System Administrator
Public Health Research Institute
{allegra,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@phri.nyu.edu
"The connector is the network"