Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!wuarchive!cs.utexas.edu!tut.cis.ohio-state.edu!snorkelwacker!mintaka!spdcc!esegue!compilers-sender
From: meissner@osf.org
Newsgroups: comp.compilers
Subject: Disassembly
Keywords: assembler, debug
Message-ID: <9009121606.AA26236@curley.osf.org>
Date: 12 Sep 90 16:06:53 GMT
Sender: compilers-sender@esegue.segue.boston.ma.us
Reply-To: meissner@osf.org
Organization: Compilers Central
Lines: 49
Approved: compilers@esegue.segue.boston.ma.us
In-Reply-To: phorgan@cup.portal.com's message of 9 Sep 90 17:32:55 GMT

| The problem with disassembling arbitrary object code is that data bears a
| disturbing resemblance to code at times:) Even when running through code
| disassembling starting at known code, it's not always possible to
| determine when code stops and data begins.  Then it's not possible to tell
| when object code starts up again.

I discovered that the MIPS assembler has a 'solution' to this problem.
It doesn't prohibit you from putting constants in the text section,
but if you do, the line numbers for debugging are messed up.  I found
this when I had GCC putting the switch label array in .text.  The MIPS
people I talked to about this said it was a feature, and not a bug....

Thus on a MIPS system, you don't have to worry about data being in the
text section.... (And of course each instruction is exactly 32 bits,
so you don't have to worry about starting in the middle of an
instruction, as you do on CISC machines.)

| 				     This is easy to see using most
| disassemblers; when you hit the data, the unknown op-code indicator
| appears (typically ???), then random sequences of ??? and op-codes, then
| when the code starts, often the disassembler has just guessed wrong and
| includes the first byte or two of the 'real' op-code in a previous 'false'
| one.  It might take a while to 're-synchronize' and start showing 'real'
| op-codes.  The only time this isn't a problem would be with fixed single
| length op-codes with an alignment requirement.  It is possible to reduce
| the problem with an algorithm that looks ahead starting byte-by-byte and
| sees which one generates a most successful string of instructions.  From a
| 'good starting byte', you could disassemble in reverse to find a previous
| starting location.
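
The look-ahead idea quoted above might be sketched like this.  It is
only an illustration: the opcode table is a made-up toy instruction
set, and the names OPCODES, run_length, best_start and the window
parameter are hypothetical, not taken from any real disassembler.

```python
# Hypothetical toy ISA: opcode byte -> instruction length in bytes.
OPCODES = {0x00: 1, 0x01: 1, 0x02: 2, 0x03: 2, 0x04: 2}

def run_length(text, start):
    """Count how many instructions decode back to back from 'start'."""
    count, pc = 0, start
    while pc < len(text):
        length = OPCODES.get(text[pc])
        # Stop at an unknown opcode or an instruction cut off by the end.
        if length is None or pc + length > len(text):
            break
        count += 1
        pc += length
    return count

def best_start(text, start, window=4):
    """Try each byte offset in a small window and return the one that
    yields the longest run of valid instructions (first one on ties).
    Returns None if 'start' is past the end of the text."""
    candidates = range(start, min(start + window, len(text)))
    return max(candidates, key=lambda off: run_length(text, off),
               default=None)
```

For example, best_start(bytes([0xFF, 0x09, 0x04, 0x00, 0x01, 0x00]), 0)
skips the two undecodable bytes and picks offset 2.  A long run of
valid op-codes is only evidence, not proof, that the offset is a real
instruction boundary.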

You could always do a complete scan of the text, using a bitmap or
some such to record every address at which an instruction starts.  It
would have to be a backtracking scan, so that you can mark both the
fall-through case and the branch-target case of each conditional
branch.  IMHO though, it would be too slow and consume too much memory
to be useful.
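
A rough sketch of such a scan, under loud assumptions: the instruction
set below is a made-up toy (one-byte op-codes, absolute one-byte
branch targets), the names OPCODES and mark_instructions are
hypothetical, and an explicit worklist stands in for the backtracking.
Indirect jumps, such as those through a switch table, are not followed
at all.

```python
# Hypothetical toy ISA: opcode -> (mnemonic, length, control-flow kind).
OPCODES = {
    0x00: ("nop", 1, "fall"),    # falls through to the next instruction
    0x01: ("ret", 1, "stop"),    # ends the current path
    0x02: ("jmp", 2, "jump"),    # unconditional, operand = absolute target
    0x03: ("beq", 2, "branch"),  # conditional, operand = absolute target
    0x04: ("ldi", 2, "fall"),    # load immediate, falls through
}

def mark_instructions(text, entry=0):
    """Return a list of booleans, True where an instruction starts,
    following control flow from 'entry' instead of decoding linearly."""
    is_insn = [False] * len(text)
    worklist = [entry]                  # addresses still to be explored
    while worklist:
        pc = worklist.pop()
        # Follow one path until it ends, leaves the text, or rejoins
        # something already marked.
        while 0 <= pc < len(text) and not is_insn[pc]:
            info = OPCODES.get(text[pc])
            if info is None or pc + info[1] > len(text):
                break                   # not a decodable instruction
            _, length, kind = info
            is_insn[pc] = True
            if kind in ("jump", "branch"):
                worklist.append(text[pc + 1])   # explore the target later
            if kind in ("stop", "jump"):
                break                   # no fall-through on this path
            pc += length
    return is_insn
```

With text = bytes([0x03, 0x06, 0x02, 0x06, 0x04, 0x01, 0x00, 0x01]),
the marked offsets are 0, 2, 6 and 7; the two bytes at offsets 4 and 5
are never reached, so they are treated as data even though they would
decode as a valid "ldi".  The bitmap here costs one entry per byte of
text.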

|		      Even this fails in many cases of self modifying code
| or in cases where strange things are done like overlapped code.  ...

Fortunately for this case, self modifying code seems to be mostly on
the decline, and only used where needed (or because you have a macho
hacker type that thinks self modifying code is neat).

--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.