Path: utzoo!mnetor!uunet!husc6!cca!g-rh From: g-rh@cca.CCA.COM (Richard Harter) Newsgroups: comp.lang.c Subject: Re: Tools for Reading Readable Code Message-ID: <22849@cca.CCA.COM> Date: 1 Jan 88 20:54:56 GMT References: <11067@brl-adm.ARPA> Reply-To: g-rh@CCA.CCA.COM.UUCP (Richard Harter) Organization: Computer Corp. of America, Cambridge, MA Lines: 124 [... long article about closure of calls and levels of calls ...] ... [... requesting information about possible implementations ...] My company (SMDS Inc.) sells software that includes this sort of functionality. This won't help if you want a public domain set of tools; however you can build this sort of functionality with PD tools without too much trouble (:-)). What you need are two sets of tools -- one to analyze source programs, and the second to manipulate sets. I will describe what each set of tools has to do and then mention fundamental problems with doing this. Source Analysis Tools Needed: (1) A source language scanner that determines, for each file, the functions contained in the file, and the functions called by each function. (2) A source language scanner that determines, for each file, the include files included in the file. One can put together (2) using standard unix tools; we have a C program to do this. It is a simple program to write. The source language scanner is a bit trickier: We have programs to do this. One can also kludge it together, using standard tools. However we don't do it "right" and it is a rather difficult proposition to "do it right". Some of the problems are: (2.a) The scanner should be able to distinguish between macro calls and function calls. (2.b) The structure of the software in a file can be changed by the arguments to the preprocesser, i.e. you can pass in "defines" through the cc command. (2.c) Pointers to functions can be passed as data, either through globals or through calling sequences. For example, suppose that an array of function pointers is global. Then any routine which can see this array can alter a pointer in the array. If routine X executes function n in the array we cannot tell, via static analysis, what routine it will be executing nor can we tell, by examining the file containing routine X, what functions it might possibly be executing. There are a lot of wrinkles on this problem and I won't go into it any further. (2.d) One must also take into accunt files included, since include files can introduce new function references that are not apparent in the original source. Again, the actual references induced depend upon the "ifdefs" used. (2.e) There are subtle issues involved with name scoping and duplication. Function name foo may occur in several different files and be different functions in different places. Given the descriptive relationship data one also needs tools to manipulate sets. This breaks down into two parts. The first is the mechanics of reading the lists of relationships from the scanners and handling the representation of them. This is a nontrivial problem for large programs. The set tools needed are (a) Given a relationship and a set of functions (files) you need to be able to get the set bearing that relationship to your original set. In our jargon you need to be able to say things like P2 = calls P1 where P1 is a list of functions and P2 is the list of all functions calling any function in list P1. You also need to be able to do P2 = P1 calls where P2 is the list of all functions called by any function in list P1. You also need to be able to find unions and intersections of sets. A function call closure algorithm runs as follows: P1 = main P2 = main calls while (P2 is not empty) begin P2 = P2 - P1 P1 = P1 + P2 P2 = P2 calls end In this loop P1 is the closure list (the list of all routines directly or indirectly called by main) and P2 is the list of all new routines introduced in the next level down. Note that this closure algorithm only produces the list of routines directly or indirectly called by main. It does not necessarily produce the load list (the mimimum list of files needed for a complete link and load.) I.e. we can't blithely say F1 = contains P1 where F1 is the list of files containing all of the entry points in list P1 because a particular file can contain definitions of functions that are not in list P1. To produce a minimum closure load list we need an outer loop which adds the functions which are referenced but not called, add their closure to the list, add the files needed for them, until no new files are introduced. ---- Note that this entire approach does not take globals which are not functions into account. An alternative approach in the data analysis phase which does take globals into account is to work with the symbol tables of the relocatable object files. This also works (the set manipulation tools are the same) but is subject to the objection that it doesn't tell you where in the source files the globals are referenced. To do the whole works really right is a non-trivial problem. The real problem, as I have indicated, is to extract all of the relationship data, taking into account the vagaries of the preprocessor, the possible search paths, the effect on the preprocessor of arguments to cc, and the effects of the use of pointers to globals. Hope this is of some help. -- In the fields of Hell where the grass grows high Are the graves of dreams allowed to die. Richard Harter, SMDS Inc.