Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!wuarchive!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!snorkelwacker!mintaka!spdcc!esegue!compilers-sender From: markh@csd4.csd.uwm.edu (Mark William Hopkins) Newsgroups: comp.compilers Subject: Re: Help on disassembler/decompilers Keywords: code, assembler, debug Message-ID: <9009110328.AA02254@csd4.csd.uwm.edu> Date: 11 Sep 90 03:28:07 GMT Sender: compilers-sender@esegue.segue.boston.ma.us Reply-To: Mark William Hopkins Organization: Compilers Central Lines: 127 Approved: compilers@esegue.segue.boston.ma.us A while back I wrote a disassembler for the Intel 8052 family. This is a fairly straightforward task, so it only took a few hours. However, there are some non-trivial aspects to even THIS process that brought about complications which were primarily responsible for making the project so long-winded (a few hours is too long ;) ). (1) Recognizing data embedded within code segments. To accomplish this, I basically 'told' the disassembler to treat everything not accessible from the set of entry points as data. This disassembler will recursively search out all the code segments deriveable from the entry points by "jumps"'s and "call"'s. But ... (2) Treating indirect jumps and calls. ... It could not locate the destinations of indirect jumps, since this basically required a run-time analysis of the code. The particular program I was analysing, however, was of a special form where all the jumps were stored in a jump table and could potentially be recognized as such by the disassembler. But I simply wasn't interested in making such an investment in the effort at the time. Consequentally, the code segment contained numerous 'gaps'. One consolation, though, was that I could add these destinations by hand to the initial set of entry points after doing my own analysis of the code disassembled up to that point. This particular program happened to be an interpreter, and thus was filled with jump tables... (3) Symbol table generation. This is an obvious problem with any object code lacking a symbol table. Ultimately, you're talking about an ideal application of 'natural language generation' which is one of the most central issues of AI. Your labels and variable names and routine names should be reabable and should relate to their purpose and use. Of course, you can resort to the unimaginative alternative of having the program give your decompiled code unimaginative names. Or, as a compromise, you could have it query the user for names... ------------------------------------------------------------ Recently, I've also developed a manual algorithm to allow me to disassemble substantial portions of library routines embedded in compiled C code run on this machine (whose processor is similar to the VAX). The same kind of disassembler can be written here as well. In this case, however, (1) and (3) can often be resolved in part. The format of object files follows the a.out format typical of UNIX, and thus has neatly separated data and code segments, and an embedded symbol table (for globals). Problem (1) cannot be resolved completely though: there were often cases where data was embedded in the code segment itself (for instance in our implementation of the "as" assembler -- which is where I got the opcodes for our machine from :) ). Also, (3) cannot be resolved for 'hidden' local variables that remain anonymous to files on the outside. Similar issues, I assume, result in trying to disassemble object code output to, say, the MicroSoft series of Quick compilers, or the Turbo compilers... ------------------------------------------------------------ In the process of doing the above, I also developed a partial algorithm for 'compiling' assembly into C. When you talk about producing a 'decompiler' you are actually confronting the task of writing a compiler: an unusual one, though, that takes an unstructured language and creates from it a structured language. There is a standard UNIX utility (at least for the 4.3 bsd we're running) that does something just like this: "struct". This utility takes standard Fortran-77 programs and generated from it Ratfor code. Ratfor is a 'rationalized' Fortran that includes all the Algol-derived control structured. Structuring code is not a really substantial issue, if you're talking about generating loops from jumps. Basically, you have to recognize combinations of "segments" and "jumps" as being treatable by loops. I've developed a set of rules while generating structured BASIC from BASIC on a few occasions, or while generating C from assembly. All these rules will ultimately be based on equivalences of the form: while (Exp) Sta <---> x: if (Exp) { Sta; goto x; } or *: goto x; <---> *: goto y; ... x: goto y or <---> goto x; x: x: and so on. To generate control structures, you're teaching your decompiler an algebra of jumps and segments. I've often used the last rule in reverse to split off segments in order to move them around in other parts of the code, for instance. There's also the fact that compilers will often produce code in very regular ways, since they are often written around the syntax of the language. So you can spot out 'signatures' in the object code (if it was unoptimized) that represent the output of certain kinds of structures. While loops ALWAYS get translated this way, functions ALWAYS get translated this way, etc. However, there is the really substantial issue of generating *data* structures, especially those dynamic structure involving pointers. For example, would your decompiler translate a jump table into a static array of function pointers in C? Would it create structured types if it finds enough "instances" of it floating around in the unstructured code to justify this abstraction? Now you're getting into the issue of guessing what the programmer's intent was from the code he or she wrote. It's doable, I do it by hand all the time, but very non-algorithmic. This is AI pure and simple. I can see the possibility of setting up a neural net which can be trained by actual examples to 'recognize' data structures that may potentially lie in the object code. A decompiler with this kind of tool would not just be a decompiler, it would potentially be a program that created improved versions of the source code based on its analysis of the object code: it could, for instance, pick up many data structures that eluded the programmer. My claim is that you can't really have a decompiler that structures data without having this extra feature. -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.