Path: utzoo!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ames!pioneer!eugene
From: eugene@pioneer.arpa (Eugene N. Miya)
Newsgroups: comp.graphics
Subject: Re: Optical Character Recognition software?
Message-ID: <10160@ames.arc.nasa.gov>
Date: 11 Jun 88 02:44:31 GMT
References: <367@msn006.misemi> <11390012@hpldola.HP.COM>
Sender: usenet@ames.arc.nasa.gov
Reply-To: eugene@pioneer.UUCP (Eugene N. Miya)
Organization: NASA Ames Research Center, Moffett Field, Calif.
Lines: 78

This is a brief summary of my experiences in the OCR field up to today,
when I had yet another demo of OCR systems.  I've been asked to look at
one more and will report on it some time.  But first:

My background: I've worked off and on for about 10 years in digital image
processing for remote sensing and computer graphics.  No, no significant
papers, but I understand the problems of IP, and I know how hard OCR is.

So far I have tried the following systems:

DEST/DataCopy (first recommended by one of our Cray people, in fact,
one of those friend-of-a-friend things)
    DataCopy
    1215 Terra Bella
    Mountain View, CA 94043

Kurzweil (a Xerox company, best known for all kinds of interesting equipment)
    185 Albany St.
    Cambridge, MA 02139

and today:

Transimage
    910 Benicia Ave
    Sunnyvale, CA 94086
    (408)-733-411

You normally can't just contact these companies; they refer you to places
where you can demo units (like Western Office Supply).
They range in price from $3,000 for H/W and S/W to $20,000.
The first two are sheet readers and the Transimage is hand held.
The first two are 300 dots per inch resolution.

The DEST/DataCopy uses fixed font information, which keeps it cheap.
The recognition miss rate on, say, variable-spaced laser printer output
is too high to be helpful.  It's designed for reading fixed-size
typewriter faces.

The Kurzweil is a system which requires training.  It takes several passes.
It can handle variable-spaced text.

Both, I felt, were not worth the cost; I think the Kurzweil was something
like $7,000 and getting cheaper.  Neither is super fast, but it was the
error rate which turned me off.  You have to deal with false positives as
well as false negatives (i.e., is that a blot or a period?).  I already
mentioned the ells and 1s, zeroes and ohs.  Context is a very difficult
thing to deal with.

Oh, I should say something about CPUs: all run on IBM PC clones (ATs or
XTs), with no PS/2 or OS/2 support, and the DEST is the only one that
connects to a Mac.

Well, as of today I was most impressed with the Transimage, but I suspect
this is because it has 1,000 dots per inch resolution.  Hand held is
initially novel, but sucks in the long run.  (PC based, BTW.)  It did my
test sheet fairly well (including special characters).

Now the reason why I posted this detailed explanation.  So I can read
text in with some rate of errors; yes, it is something of a pain, but it's
only $3K.  The question is how to quantify the recognition rate.
If I have to change (or edit) 1 character for every 100, is this
too high (about one per line)?  Well, the DEST/Kur* were 4-7 errors per
line on my sample text, much too high.  Why, my typo rate is one per line,
and this is about what the TI1000 is.  Now a page is about 60 lines.
Should the document just be handed to the secretary ("here, type this"),
or should we go OCR?  I have to figure out the error rate per 1000 or so
characters, and in the case of the TI1000 factor in time to scan as well.
Comments on error rate measurements?

Note: only a bureaucratic agency or company really needs one of these
things.  I think WE can blow away $3K of taxpayers' money to try it.
Like 90 page documents, etc.

Next: Pallentair(sp).  $30K  Tak-eye!  [sorry, can't spell Japanese
phonetically well].

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
{uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene "Send mail, avoid follow-ups. If enough, I'll summarize."
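
On the error rate question in the post, one possible way to put a number
on it is to type a reference copy of the test sheet by hand, run the same
sheet through the OCR unit, and count the single-character edits needed to
turn the OCR output back into the reference.  The sketch below is only an
illustration under stated assumptions, not a method any of the vendors use:
it uses plain Levenshtein edit distance as the error count, the function
names are made up for this example, the 80 characters per line default is
an assumption (the post only says a page is about 60 lines), and scan time
is not modeled at all.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # char missing from OCR output
                           cur[j - 1] + 1,       # extra char in OCR output
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def errors_per_1000(ref: str, hyp: str) -> float:
    """Corrections needed per 1000 characters of the reference text."""
    if not ref:
        raise ValueError("reference text is empty")
    return 1000.0 * edit_distance(ref, hyp) / len(ref)


def errors_per_page(rate_per_1000: float,
                    chars_per_line: int = 80,    # assumption, not from the post
                    lines_per_page: int = 60) -> float:  # "about 60 lines"
    """Scale a per-1000-character rate up to a full page."""
    return rate_per_1000 * chars_per_line * lines_per_page / 1000.0


if __name__ == "__main__":
    # Made-up sample strings showing the ell/1 and zero/oh confusions.
    reference = "The quick brown fox, 1988: 10 ells and 100 ohs."
    ocr_output = "The quick brovvn fox. l988: lO ells and 1OO ohs"
    rate = errors_per_1000(reference, ocr_output)
    print(f"{rate:.1f} errors per 1000 characters")
    print(f"{errors_per_page(rate):.0f} corrections per page")
```

The same per-1000 figure can be computed for a hand-typed copy against the
original, which gives a like-for-like number to compare the "hand it to the
secretary" option against the scanner.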