Xref: utzoo sci.bio:2069 bionet.software:38 sci.research:982 Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!phri!roy From: roy@phri.UUCP (Roy Smith) Newsgroups: sci.bio,bionet.software,sci.research Subject: Fast restriction analysis program available Keywords: finite state machine Message-ID: <3782@phri.UUCP> Date: 27 May 89 01:35:50 GMT Organization: Public Health Research Institute, NYC, NY Lines: 54 A program to perform restriction site analysis on dna sequences, called renzyme, is now being made available. The algorithm was described in the paper cited below. In a nutshell, it's about 20 times faster than other common programs. If you ever want to find restriction sites in sequences of non-trivial length, you want this program. The source code is copyrighted, but is being made available at no charge to non-profit, academic institutions for research purposes. You can get it by anonymous ftp from goober.phri.nyu.edu (128.122.136.10) in the directory ~ftp/pub/seq. You want to get the following files: -rw-r--r-- 1 roy 1146 May 22 17:52 Acknowledgements -rw-r--r-- 1 roy 1349 May 25 13:55 Copyright -rw-r--r-- 1 roy 5594 May 26 20:58 Setup -rw-r--r-- 1 roy 204645 May 26 20:53 renzyme.tar.Z -rw-r--r-- 1 roy 29321 May 22 17:52 seqlib.tar.Z The first 3 files are pretty obvious. The two compressed tar files are the sources for the renzyme program and some related utilities, and a library of routines which you need to compile the other programs. The source trees are set up to maintain multiple binaries from a single set of sources so the Makefiles are a tad complex; the Setup file should explain everything. The code is all in C and is known to compile and run under SunOS-3.5 and MtXinu 4.3BSD/NFS. I don't expect any major problems porting to other systems. If you ftp this stuff, drop me a note so I know who has it. For details of the algorithm, read: %A R. Smith %T A finite state machine algorithm for finding restriction sites and other pattern matching applications %J CABIOS %V 4 %N 4 %D 1988 %X Existing algorithms for finding restriction endonuclease recognition sites use brute-force algorithms which run in time O(NM) where N is the number of nucleotides in the sequence under analysis and M is the total number of nucleotides in all the different sites being searched for. This paper presents a deterministic finite state machine algorithm which runs in time O(N). Memory use can be as high as O(M^4) but a slight modification to the basic algorithm can impose a theoretical upper bound of O(M) at the cost of some added complexity in the execution of the state machine. The algorithm can operate with a single pass through the sequence under analysis, with no need to ever back up or (for non-circular sequences) store more than a single input character at a time. This type of algorithm can be adapted to many pattern matching tasks and is simple enough to implement in hardware that it could, for example, be built into a disk controller as part of a specialized database machine. -- Roy Smith, System Administrator Public Health Research Institute {allegra,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@phri.nyu.edu "The connector is the network"