Xref: utzoo comp.editors:1209 gnu.emacs:2067 comp.unix.wizards:19969 Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!think!zaphod.mps.ohio-state.edu!usc!cs.utexas.edu!sun-barr!newstop!sun!flam!jpayne From: jpayne%flam@Sun.COM (Jonathan Payne) Newsgroups: comp.editors,gnu.emacs,comp.unix.wizards Subject: Re: GNU Emacs, memory usage, releasing Keywords: GNU emacs malloc memory working set gap editor Message-ID: <129799@sun.Eng.Sun.COM> Date: 2 Jan 90 21:45:25 GMT References: <1558@aber-cs.UUCP> Sender: news@sun.Eng.Sun.COM Reply-To: jpayne@sun.UUCP (Jonathan Payne) Organization: Sun Microsystems, Mountain View Lines: 132 All this talk about emacs and buffer gap vs. linked list is interesting, and I just have to put in my two cents worth. I've already implemented my own version of emacs using linked lists, so that's my experience. I read the emacs cookbook for the redisplay algorithm before I started JOVE. I picked linked list because I knew I could not store the files in memory simply because I wrote JOVE for the pdp11/70. But since I moved to bigger machines I have also put off using buffer gap because I could not quite see how to make the redisplay algorithm be as smart and cheap as JOVE's. My first exposure to buffer gap was checking out Gosling's emacs (that thing was so full of good ideas!) and the first thing I noticed was how amazingly smart the redisplay algorithm was, but also how amazingly expensive it was. I couldn't see how to make a smart redisplay algorithm be fast with buffer gap. I have since then learned, and now I am writing another version of JOVE using buffer gap. I will never use a linked list of lines approach again. Buffer gap is SO much better in lots ways from an implementor's point of view and often from the user's point of view. Here are the reasons that come to mind: o File i/o is blindingly fast. It's standard to implement read-file the following way: - make sure the buffer gap is big enough for the whole file - make one call to read to read the whole file into the buffer That's it. Simple as anything, FAST as anything. On my Sun 3/60 I can read in a megabyte in a second. Writing files is done in two chunks, the data to the left of the gap and then the data to the right. o A buffer position is represented as an integer. That's very convenient in lots of ways. Forward character is simply b->point += 1; with some bounds checking. No special casing being at the end or beginning of a line. You can then define marks, which are easy to update for insertions. There are hacks you can use so that marks don't need updating after every insertion, but rather after every time the gap is moved, which is relatively rare. You can define marks with lengths associated with them, and mark them as dirty when changes are made within the span of the mark. All are possible with linked list, but the code is harder to write, less reliable, and the implementation will be slower. o Insert and delete operations of ANY kind are trivial to implement. Move the gap to the point of insertion, and either make the gap smaller (for insertion) or make the gap bigger for deletion. No splicing together lines in a linked list and garbage like that. o No line length hassles. o Undo/redo is trivial to implement in this scheme. I just implemented undo/redo in a prototype I am working on, very easy to understand, very nice semantics. I am not crazy about undo in VI, Gosling's emacs, or GNUemacs - I'd be curious to see what you think. o Regular expression search is MUCH cleaner. I ripped the code out of JOVE and converted it to buffer gap, and it's fast and much smaller, and definitely much easier to understand, and is more functional (it searches across newline bounderies). Lessee. You complained about the memory usage and the slowness of buffer gap. First of all, I think the average file in UNIX is much less than 100k. Actually I KNOW the average file is much less than that, but for the hacker types, I bet the average size of a source files is also less than 100k, more like 50k and below. The point is, moving the gap around is not that big a deal. The amount of copying done is directly related to how far the gap is moving, not how big the file is, and most of the time the gap does not move very far! It just doesn't. I thought paging algorithms were geared towards sequential access, and that some versions of UNIX provided a system call to tell the pager not to try and be smart about paging in adjacent pages for certain processes like LISP processes during garbage collection. But ... I could be wrong. >7) Most interestingly when the gap closes, because we have inserted that >much worth of data, the *entire* buffer contents is realloc()ated. If the >buffer is not the top one in virtual memory, or it is but you have a a >realloc() that will not attempt just extending a block, but simply allocates >a new block and copies the old into the new, you are in for extra overheads. >They are no longer proportional to the amount of work you do, but to the >total size of the file as well, which may be bad news. >8) most interestingly, realloc()ating the buffer will leave large holes in >your virtual address space, that will expand forever, taking up if anything >swap space. The holes are also likely to get fragmented as your session >progresses (remember, GNU emacs has such high overheads that you are >supposed to start it up once as a server and then use emacsclient as the >"real" editor), and therefore the virtual memory profile of your session >will worsen steadily. These are the main problems with buffer gap. >9) GNU emacs also has some other interesting overheads. The screen display >procedure is no speed daemon either... The redisplay algorithm can be kept smart and cheap. >The better solution, made relatively easy by the reasonably modular and >layered structure of GNU emacs, would be to accept The Emacs Cookbook >recommendation (adopted by Jove and the MicroEmacs/Gnu family of editors) and >implement a linked list system. I would suggest just reading (or map, on the >operating systems that allow it) the file to be edited as a large lump of >virtual address space, then building a linked list of pointers to lines >therein, and then to allocate modified lines from ageneral purpose arena. >Informing the OS of the peculiar access patterns would also help, if >possible. Again, I think this would muck up the rest of the code in the editor. How would you access consecutive pieces of the buffer? In buffer gap, you do CharAt(pos). What would that look like in your new design? At the least you would need accessor functions (not macros, either) to deal with boundery conditions. OR, you have to move the knowledge of how the buffer is stored into the places that need good performance. I had to do that for the redisplay algorithm in JOVE and for the regular expression searches. I'm not saying all this isn't possible. It's just my belief that it's not clearly a win. In fact, in my eyes, for 99% of the editing that I do, a buffer gap solution makes the most sense. Just my two cents... Jonathan Payne