Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!yale!mintaka!spdcc!esegue!compilers-sender
From: aglew@dwarfs.crhc.uiuc.edu (Andy Glew)
Newsgroups: comp.compilers
Subject: Re: Disassembly
Keywords: disassemble
Message-ID: <9009190548.AA08710@dwarfs.crhc.uiuc.edu.>
Date: 19 Sep 90 05:48:03 GMT
Sender: compilers-sender@esegue.segue.boston.ma.us
Reply-To: aglew@dwarfs.crhc.uiuc.edu (Andy Glew)
Organization: Compilers Central
Lines: 66
Approved: compilers@esegue.segue.boston.ma.us

Many people have mentioned following branches, etc., to guide
disassembly. Static branch following is obvious. You can also do it
dynamically, using the techniques used in generating profiling feedback
for a compiler. More branches can be followed that way - e.g. out of a
jump table.

Moreover, several simplifying assumptions may be useful:

(1) Code is never executed "out of phase" - i.e. if a code sequence
    begins with the 4 byte instruction at address A, there is no code
    sequence beginning at address A+1.

(2) Code and data may be emulsified, but they aren't miscible - i.e.
    addresses that are executed are not data; similarly, addresses
    that are fetched as data are not code (there may be some boundary
    effects here).

Hackers may break these assumptions, but if all you are trying to do is
run binaries from machine A on machine B, they may be enough for you.

Conceptually, given a mixed code/data address space A, you can create
multiple code and data spaces c1(A), c2(A), etc., one for every possible
placement of instruction boundaries. Or, rather, you can start off your
disassembly in the following manner:

    code_0x0000f561:        /* 4 byte instruction */
        ADD ...
        goto code_0x0000f565;
    code_0x0000f562:        /* 2 byte instruction */
        MOV ...
        goto code_0x0000f564;
    code_0x0000f563:        /* 2 byte instruction */
        ...
        goto code_0x0000f561;
    code_0x0000f564:        /* 1 byte instruction */
        ...
        goto code_0x0000f565;
    code_0x0000f565:        /* lots of possible paths converge here */
        goto code_0x0000f565;

    data_0x0000f561: ....
    data_0x0000f562: ....
    data_0x0000f563: ....
    data_0x0000f564: ....
    data_0x0000f565: ....

Static branch following eliminates some code and data entries; dynamic
profiling eliminates a few more. Not all of the ambiguities may be
resolved, but the amount of replication will quickly fall to tolerable
levels.

The same approach might be used for data representation - e.g. have
separate spaces for data addressed as bytes, words, etc. - except that
data is much more frequently accessed by different packet sizes. It's
almost easier to have your disassembly/reassembly support library
convert "load word at data_0x0000f562" into the required sequence of
loads and shifts (to handle byte ordering) than it is to attempt to
order the data "naturally" for the target machine.

This produces a run-time penalty for the translated binary, but
hopefully the new machine is that much faster anyway, and all you are
trying to do is gain access to the wonderful world of IBM PC/VAX/IBM 360
software. (Or, for startup UNIX companies, maybe you're just trying to
get MIPS/SUN/Ultrix binaries running on your new hardware.)

In any case, the hardest thing about binary translation is handling the
stuff that isn't being disassembled - namely OS calls.
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.
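
A rough sketch of the static branch following mentioned in the post, for illustration only. The instruction set here (OPCODES, with 1 and 2 byte instructions and absolute one-byte targets) is entirely made up, as are the names follow_branches and unreached_bytes; none of it comes from the original article. The sketch walks the image from an entry point, records where instructions start, and treats a violation of assumption (1) as an error rather than trying to resolve it.

```python
# Hypothetical toy ISA, invented for this sketch: opcode -> (name, length, kind)
OPCODES = {
    0x01: ("NOP", 1, "fall"),     # falls through to the next instruction
    0x02: ("MOV", 2, "fall"),     # opcode + immediate byte
    0x10: ("JMP", 2, "jump"),     # opcode + absolute target, no fall-through
    0x11: ("BEQ", 2, "branch"),   # opcode + absolute target, may fall through
    0xFF: ("HALT", 1, "stop"),    # no successors
}


def follow_branches(mem, entry):
    """Return {address: length} for every instruction reachable from entry."""
    starts = {}   # instruction start address -> instruction length
    owner = {}    # byte address -> start address of the instruction covering it
    work = [entry]
    while work:
        addr = work.pop()
        if addr in starts:
            continue  # already decoded from this address
        if not 0 <= addr < len(mem) or mem[addr] not in OPCODES:
            continue  # off the image or undecodable: give up on this path
        _name, length, kind = OPCODES[mem[addr]]
        if addr + length > len(mem):
            continue  # truncated instruction at the end of the image
        # Assumption (1): no instruction may overlap one already decoded.
        for b in range(addr, addr + length):
            if b in owner:
                raise ValueError(
                    "out-of-phase code: 0x%x overlaps instruction at 0x%x"
                    % (addr, owner[b]))
        starts[addr] = length
        for b in range(addr, addr + length):
            owner[b] = addr
        if kind in ("jump", "branch"):
            work.append(mem[addr + 1])   # statically known branch target
        if kind in ("fall", "branch"):
            work.append(addr + length)   # fall-through successor
    return starts


def unreached_bytes(mem, starts):
    """Addresses not covered by any decoded instruction (data, or code never reached)."""
    covered = set()
    for addr, length in starts.items():
        covered.update(range(addr, addr + length))
    return [a for a in range(len(mem)) if a not in covered]


if __name__ == "__main__":
    image = bytes([0x02, 0x07, 0x11, 0x06, 0x01, 0xFF, 0x10, 0x04, 0xAA, 0xBB])
    found = follow_branches(image, 0)
    for addr in sorted(found):
        print("code_0x%04x: %s" % (addr, OPCODES[image[addr]][0]))
    for addr in unreached_bytes(image, found):
        print("data_0x%04x: 0x%02x" % (addr, image[addr]))
```

Bytes that the walk never reaches stay ambiguous: they would keep both their code_ and data_ entries until dynamic profiling or some other evidence settles them. A real translator would also need indirect branch handling (jump tables), which this sketch leaves out.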