Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!wuarchive!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!snorkelwacker!mintaka!spdcc!esegue!compilers-sender
From: markh@csd4.csd.uwm.edu (Mark William Hopkins)
Newsgroups: comp.compilers
Subject: Re: Help on disassembler/decompilers
Keywords: code, assembler, debug
Message-ID: <9009110328.AA02254@csd4.csd.uwm.edu>
Date: 11 Sep 90 03:28:07 GMT
Sender: compilers-sender@esegue.segue.boston.ma.us
Reply-To: Mark William Hopkins <markh@csd4.csd.uwm.edu>
Organization: Compilers Central
Lines: 127
Approved: compilers@esegue.segue.boston.ma.us

A while back I wrote a disassembler for the Intel 8052 family.  This is a
fairly straightforward task, so it only took a few hours.

However, there are some non-trivial aspects to even THIS process that brought
about complications which were primarily responsible for making the project
so long-winded (a few hours is too long ;) ).

(1) Recognizing data embedded within code segments.
To accomplish this, I basically 'told' the disassembler to treat everything
not accessible from the set of entry points as data.  This disassembler will
recursively search out all the code segments deriveable from the entry points
by "jumps"'s and "call"'s.   But ...

(2) Treating indirect jumps and calls.
... It could not locate the destinations of indirect jumps, since this
basically required a run-time analysis of the code.  The particular program I
was analysing, however, was of a special form where all the jumps were stored
in a jump table and could potentially be recognized as such by the
disassembler.  But I simply wasn't interested in making such an investment in
the effort at the time.  Consequentally, the code segment contained numerous
'gaps'.  One consolation, though, was that I could add these destinations by
hand to the initial set of entry points after doing my own analysis of the
code disassembled up to that point.

This particular program happened to be an interpreter, and thus was filled
with jump tables...

(3) Symbol table generation.
This is an obvious problem with any object code lacking a symbol table.
Ultimately, you're talking about an ideal application of 'natural language
generation' which is one of the most central issues of AI.  Your labels and
variable names and routine names should be reabable and should relate to
their purpose and use.

Of course, you can resort to the unimaginative alternative of having the
program give your decompiled code unimaginative names.  Or, as a compromise,
you could have it query the user for names...

------------------------------------------------------------

Recently, I've also developed a manual algorithm to allow me to disassemble
substantial portions of library routines embedded in compiled C code run on
this machine (whose processor is similar to the VAX).  The same kind of
disassembler can be written here as well.

In this case, however, (1) and (3) can often be resolved in part.  The format
of object files follows the a.out format typical of UNIX, and thus has neatly
separated data and code segments, and an embedded symbol table (for globals).

Problem (1) cannot be resolved completely though: there were often cases where
data was embedded in the code segment itself (for instance in our
implementation of the "as" assembler -- which is where I got the opcodes for
our machine from :) ).  Also, (3) cannot be resolved for 'hidden' local
variables that remain anonymous to files on the outside.

Similar issues, I assume, result in trying to disassemble object code output
to, say, the MicroSoft series of Quick compilers, or the Turbo compilers...

------------------------------------------------------------

In the process of doing the above, I also developed a partial algorithm for
'compiling' assembly into C.  When you talk about producing a 'decompiler'
you are actually confronting the task of writing a compiler: an unusual one,
though, that takes an unstructured language and creates from it a structured
language.

There is a standard UNIX utility (at least for the 4.3 bsd we're running) that
does something just like this: "struct".  This utility takes standard
Fortran-77 programs and generated from it Ratfor code.  Ratfor is a
'rationalized' Fortran that includes all the Algol-derived control structured.

Structuring code is not a really substantial issue, if you're talking about
generating loops from jumps.  Basically, you have to recognize combinations
of "segments" and "jumps" as being treatable by loops.  I've developed a set
of rules while generating structured BASIC from BASIC on a few occasions, or
while generating C from assembly.  All these rules will ultimately be
based on equivalences of the form:

	      while (Exp) Sta   <--->   x:
					   if (Exp) {
					      Sta; goto x;
                                           }

or
	       *: goto x;       <--->    *: goto y;
	       ...
	       x: goto y
or
	       <CODE SEGMENT>   <--->    <CODE SEGMENT>
	       goto x;                   x:
	       x:
and so on.

To generate control structures, you're teaching your decompiler an algebra of
jumps and segments.  I've often used the last rule in reverse to split off
segments in order to move them around in other parts of the code, for
instance.

There's also the fact that compilers will often produce code in very regular
ways, since they are often written around the syntax of the language.  So you
can spot out 'signatures' in the object code (if it was unoptimized) that
represent the output of certain kinds of structures.  While loops ALWAYS get
translated this way, functions ALWAYS get translated this way, etc.

However, there is the really substantial issue of generating *data*
structures, especially those dynamic structure involving pointers.  For
example, would your decompiler translate a jump table into a static array of
function pointers in C?  Would it create structured types if it finds enough
"instances" of it floating around in the unstructured code to justify this
abstraction?

Now you're getting into the issue of guessing what the programmer's intent was
from the code he or she wrote.  It's doable, I do it by hand all the time,
but very non-algorithmic.  This is AI pure and simple.

I can see the possibility of setting up a neural net which can be trained by
actual examples to 'recognize' data structures that may potentially lie in
the object code.  A decompiler with this kind of tool would not just be a
decompiler, it would potentially be a program that created improved versions
of the source code based on its analysis of the object code: it could, for
instance, pick up many data structures that eluded the programmer.

My claim is that you can't really have a decompiler that structures data
without having this extra feature.
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.