Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!yale!mintaka!spdcc!esegue!compilers-sender
From: aglew@dwarfs.crhc.uiuc.edu (Andy Glew)
Newsgroups: comp.compilers
Subject: Re: Disassembly
Keywords: disassemble
Message-ID: <9009190548.AA08710@dwarfs.crhc.uiuc.edu.>
Date: 19 Sep 90 05:48:03 GMT
Sender: compilers-sender@esegue.segue.boston.ma.us
Reply-To: aglew@dwarfs.crhc.uiuc.edu (Andy Glew)
Organization: Compilers Central
Lines: 66
Approved: compilers@esegue.segue.boston.ma.us

Many people have mentioned following branches, etc., to guide
disassembly.
    Doing this statically is the obvious approach.  You can also do it
dynamically, using the techniques used to generate profiling feedback
for a compiler.  More branches can be followed that way - e.g. those
out of a jump table.  Moreover, several simplifying assumptions may be
useful:
    (1) Code is never executed "out of phase" - i.e. if a code sequence
begins with the 4 byte instruction at address A, there is no code
sequence beginning at address A+1.
    (2) Code and data may be emulsified, but they aren't miscible -
i.e. addresses that are executed are not data; similarly, addresses
that are fetched as data are not code (there may be some boundary
effects here).
    Hackers may break these assumptions, but if all you are trying to
do is run binaries from machine A on machine B, they may be enough for
you.
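
Whether a given binary actually respects these two assumptions can be
checked against a dynamic trace.  What follows is only a minimal
sketch: the trace format (pairs of address and length for executed
instructions, pairs of address and size for data accesses) and the
name check_assumptions are made up for illustration, not taken from
any real profiling tool.

```python
def check_assumptions(executed, data_accesses):
    """Check assumptions (1) and (2) against a dynamic trace.

    executed:      iterable of (address, length), one per executed instruction
    data_accesses: iterable of (address, size), one per load or store
    Returns (out_of_phase, mixed): two sets of offending byte addresses.
    """
    # Hypothetical trace format; each instruction start maps to its length.
    starts = dict(executed)

    code_bytes = set()
    out_of_phase = set()
    for addr, length in starts.items():
        # Assumption (1): no instruction may start inside another one.
        for inner in range(addr + 1, addr + length):
            if inner in starts:
                out_of_phase.add(inner)
        code_bytes.update(range(addr, addr + length))

    mixed = set()
    for addr, size in data_accesses:
        # Assumption (2): bytes that were executed are never fetched as data.
        for byte_addr in range(addr, addr + size):
            if byte_addr in code_bytes:
                mixed.add(byte_addr)

    return out_of_phase, mixed
```

If both returned sets are empty for every run you care about, the
assumptions held for those runs; that says nothing about paths the
runs never exercised.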

Conceptually, given a mixed code/data address space A, you can create
multiple code/data spaces c1(A), c2(A), etc., one for every possible
placement of instruction boundaries.  Or, rather, you can start off
your disassembly in the following manner:

    code_0x0000f561:	/* 4 byte instruction */  ADD ...
    	    	    	goto code_0x0000f565;

    code_0x0000f562:	/* 2 byte instruction */  MOV ...
    	    	    	goto code_0x0000f564;

    code_0x0000f563:    /* 2 byte instruction */ ...
    	    	    	goto code_0x0000f565;

    code_0x0000f564:    /* 1 byte instruction */ ...
    	    	    	goto code_0x0000f565;

    code_0x0000f565:    /* lots of possible paths converge here */
    	    	    	goto code_0x0000f565;

    data_0x0000f561:    ....
    data_0x0000f562:    ....
    data_0x0000f563:    ....
    data_0x0000f564:    ....
    data_0x0000f565:    ....

Static branch following eliminates some code and data entries; dynamic
profiling eliminates a few more.  Not all of the ambiguities may be
resolved, but the amount of replication will quickly fall to tolerable
levels.
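
The static pruning step can be sketched as a plain reachability walk
over the candidate entries.  This is an illustration under stated
assumptions, not a real disassembler: the candidates table (address ->
(length, successor addresses)) stands in for whatever your decoder
produces, the five-byte region and its instruction lengths are made
up, and the name prune is hypothetical.  Data labels are dropped using
assumption (2) above.

```python
def prune(candidates, entry_points):
    """Keep only candidate instructions reachable from known entry points.

    candidates:   dict mapping address -> (length, [successor addresses])
    entry_points: addresses known to be executed (program entry, trace hits)
    Returns (live, data_labels): surviving code entries and the addresses
    that may still need a data label.
    """
    live = set()
    work = list(entry_points)
    while work:
        addr = work.pop()
        if addr in live or addr not in candidates:
            continue
        live.add(addr)
        _length, successors = candidates[addr]
        work.extend(successors)

    # By assumption (2), bytes covered by live code are not data.
    code_bytes = set()
    for addr in live:
        length, _successors = candidates[addr]
        code_bytes.update(range(addr, addr + length))

    data_labels = set(candidates) - code_bytes
    return live, data_labels


# Made-up region: one candidate decoding at each of five byte offsets.
candidates = {
    0: (4, [4]),   # 4 byte instruction, falls through to 4
    1: (2, [3]),   # 2 byte instruction, falls through to 3
    2: (2, [4]),
    3: (1, [4]),
    4: (1, []),    # end of the region
}
live, data_labels = prune(candidates, [0])
```

Starting from entry point 0, only the entries at 0 and 4 survive and
no data labels remain; entry points found by dynamic profiling can
simply be appended to the entry_points list.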

The same approach might be used for data representation - e.g. have
separate spaces for data addressed as bytes, words, etc. - except that
the same data is much more frequently accessed at different widths.
It's almost easier to have your disassembly/reassembly support library
convert "load word at data_0x0000f562" into the required sequence of
loads and shifts (to handle byte ordering) than it is to attempt to
lay the data out "naturally" for the target machine.  This imposes a
run-time penalty on the translated binary, but hopefully the new
machine is that much faster anyway, and all you are trying to do is
gain access to the wonderful world of IBM PC/VAX/IBM 360 software.
(Or, for startup UNIX companies, maybe you're just trying to get
MIPS/SUN/Ultrix binaries running on your new hardware.)
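
A support-library routine of that kind might look like the sketch
below.  The name load_word, its parameters, and the flat byte-array
model of the source machine's memory are all assumptions made for
illustration.

```python
def load_word(memory, address, size=2, big_endian=False):
    """Assemble a `size`-byte value from individual byte loads and shifts.

    memory:     bytes-like image of the source machine's data space
    address:    offset of the first byte to load
    big_endian: byte order of the SOURCE machine, independent of the host
    """
    value = 0
    for i in range(size):
        byte = memory[address + i]
        # Little-endian: byte i is the i-th least significant byte.
        # Big-endian: byte 0 is the most significant byte.
        shift = 8 * (size - 1 - i) if big_endian else 8 * i
        value |= byte << shift
    return value


memory = bytes([0x34, 0x12, 0x78, 0x56])
low_word = load_word(memory, 0)                   # little-endian: 0x1234
swapped = load_word(memory, 0, big_endian=True)   # big-endian:    0x3412
unaligned = load_word(memory, 1)                  # odd address:   0x7812
```

Because every access goes byte by byte, alignment and host byte order
stop mattering; that is the run-time penalty mentioned above.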

In any case, the hardest thing about binary translation is handling
the stuff that isn't being disassembled - namely OS calls.
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.