Xref: utzoo comp.sys.ibm.pc:38725 comp.lang.pascal:2732 Path: utzoo!attcan!uunet!zephyr.ens.tek.com!tektronix!psueea!parsely!bucket!leonard From: leonard@bucket.UUCP (Leonard Erickson) Newsgroups: comp.sys.ibm.pc,comp.lang.pascal Subject: Re: ansi data compression in turbo pascal Keywords: tp 5.5 Message-ID: <1771@bucket.UUCP> Date: 24 Nov 89 06:00:00 GMT References: <2172@leah.Albany.Edu> Organization: Rick's Home-Grown UNIX; Portland, OR. Lines: 36 ppd491@leah.Albany.Edu (Peter P. Donohue) writes: > I am writing a program in turbo pascal 5.5 on a PC running MS-Dos >that does a reasonable amount of reading and writing of ascii data. I >was wondering if anyone could recommend a good alogarithm for >compressing ascii data so that the data files take up less disk space? >It doesn't necessarily have to be the best or most efficient; I'm >willing to give up some for programming ease. Thanks. This probably won't help much in your situation, but if you are doing a lot of work reading *sorted* strings, you can get quite an improvement by using the following trick: Save each string on a sperate "line" (ie use cr or cr/lf as a record seperator). Now "encode" each string as where count is the number of characters it has in common with the previous record, and substring is the portion of the string that differs. Example: if the previos record was "previous" and the string for the current record is "previously", store the current record as <8>. To avoid problems you'll need to do something to keep the count from "colliding" with the seperator, such as adding 32 or 128 to it. This would the record into: @ly (assuming +32 coding) The above trick is very useful for things like spell checkers where the data is sorted. And the compression ratio *improves* as the list gets longer! -- Leonard Erickson ...!tektronix!reed!percival!bucket!leonard CIS: [70465,203] "I'm all in favor of keeping dangerous weapons out of the hands of fools. Let's start with typewriters." -- Solomon Short