Xref: utzoo comp.editors:1209 gnu.emacs:2067 comp.unix.wizards:19969
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!think!zaphod.mps.ohio-state.edu!usc!cs.utexas.edu!sun-barr!newstop!sun!flam!jpayne
From: jpayne%flam@Sun.COM (Jonathan Payne)
Newsgroups: comp.editors,gnu.emacs,comp.unix.wizards
Subject: Re: GNU Emacs, memory usage, releasing
Keywords: GNU emacs malloc memory working set gap editor
Message-ID: <129799@sun.Eng.Sun.COM>
Date: 2 Jan 90 21:45:25 GMT
References: <1558@aber-cs.UUCP>
Sender: news@sun.Eng.Sun.COM
Reply-To: jpayne@sun.UUCP (Jonathan Payne)
Organization: Sun Microsystems, Mountain View
Lines: 132


All this talk about emacs and buffer gap vs. linked list is interesting,
and I just have to put in my two cents worth.  I've already implemented
my own version of emacs using linked lists, so that's my experience.  I
read the emacs cookbook for the redisplay algorithm before I started
JOVE.  I picked linked list because I knew I could not store the files in
memory simply because I wrote JOVE for the pdp11/70.  But since I moved
to bigger machines I have also put off using buffer gap because I could
not quite see how to make the redisplay algorithm be as smart and cheap
as JOVE's.  My first exposure to buffer gap was checking out Gosling's
emacs (that thing was so full of good ideas!) and the first thing I
noticed was how amazingly smart the redisplay algorithm was, but also how
amazingly expensive it was.  I couldn't see how to make a smart redisplay
algorithm be fast with buffer gap.  I have since then learned, and now I
am writing another version of JOVE using buffer gap.

I will never use a linked list of lines approach again.  Buffer gap is SO
much better in lots ways from an implementor's point of view and often
from the user's point of view.  Here are the reasons that come to mind:

	o File i/o is blindingly fast.  It's standard to implement
	  read-file the following way:
	  	- make sure the buffer gap is big enough for the whole
	 	  file
		- make one call to read to read the whole file into the
		  buffer
	  That's it.  Simple as anything, FAST as anything.  On my Sun
	  3/60 I can read in a megabyte in a second.

	  Writing files is done in two chunks, the data to the left of
	  the gap and then the data to the right.

	o A buffer position is represented as an integer.  That's very
	  convenient in lots of ways.  Forward character is simply
	  	b->point += 1;
	  with some bounds checking.  No special casing being at the end
	  or beginning of a line.

	  You can then define marks, which are easy to update for
	  insertions.  There are hacks you can use so that marks don't
	  need updating after every insertion, but rather after every
	  time the gap is moved, which is relatively rare.

	  You can define marks with lengths associated with them, and
	  mark them as dirty when changes are made within the span of the
	  mark.

	  All are possible with linked list, but the code is harder to
	  write, less reliable, and the implementation will be slower.

	o Insert and delete operations of ANY kind are trivial to
	  implement.  Move the gap to the point of insertion, and either
	  make the gap smaller (for insertion) or make the gap bigger for
	  deletion.  No splicing together lines in a linked list and
	  garbage like that.

	o No line length hassles.

	o Undo/redo is trivial to implement in this scheme.  I just
	  implemented undo/redo in a prototype I am working on, very easy
	  to understand, very nice semantics.  I am not crazy about undo
	  in VI, Gosling's emacs, or GNUemacs - I'd be curious to see
	  what you think.

	o Regular expression search is MUCH cleaner.  I ripped the code
	  out of JOVE and converted it to buffer gap, and it's fast and
	  much smaller, and definitely much easier to understand, and is
	  more functional (it searches across newline bounderies).

Lessee.  You complained about the memory usage and the slowness of buffer
gap.  First of all, I think the average file in UNIX is much less than
100k.  Actually I KNOW the average file is much less than that, but for
the hacker types, I bet the average size of a source files is also less
than 100k, more like 50k and below.  The point is, moving the gap around
is not that big a deal.  The amount of copying done is directly related
to how far the gap is moving, not how big the file is, and most of the
time the gap does not move very far!  It just doesn't.

I thought paging algorithms were geared towards sequential access, and
that some versions of UNIX provided a system call to tell the pager not
to try and be smart about paging in adjacent pages for certain processes
like LISP processes during garbage collection.  But ... I could be wrong.

>7) Most interestingly when the gap closes, because we have inserted that
>much worth of data, the *entire* buffer contents is realloc()ated. If the
>buffer is not the top one in virtual memory, or it is but you have a a
>realloc() that will not attempt just extending a block, but simply allocates
>a new block and copies the old into the new, you are in for extra overheads.
>They are no longer proportional to the amount of work you do, but to the
>total size of the file as well, which may be bad news.

>8) most interestingly, realloc()ating the buffer will leave large holes in
>your virtual address space, that will expand forever, taking up if anything
>swap space. The holes are also likely to get fragmented as your session
>progresses (remember, GNU emacs has such high overheads that you are
>supposed to start it up once as a server and then use emacsclient as the
>"real" editor), and therefore the virtual memory profile of your session
>will worsen steadily.

These are the main problems with buffer gap.

>9) GNU emacs also has some other interesting overheads. The screen display
>procedure is no speed daemon either...

The redisplay algorithm can be kept smart and cheap.

>The better solution, made relatively easy by the reasonably modular and
>layered structure of GNU emacs, would be to accept The Emacs Cookbook
>recommendation (adopted by Jove and the MicroEmacs/Gnu family of editors) and
>implement a linked list system. I would suggest just reading (or map, on the
>operating systems that allow it) the file to be edited as a large lump of
>virtual address space, then building a linked list of pointers to lines
>therein, and then to allocate modified lines from ageneral purpose arena.
>Informing the OS of the peculiar access patterns would also help, if
>possible.

Again, I think this would muck up the rest of the code in the editor.
How would you access consecutive pieces of the buffer?  In buffer gap,
you do CharAt(pos).  What would that look like in your new design?  At
the least you would need accessor functions (not macros, either) to deal
with boundery conditions.  OR, you have to move the knowledge of how the
buffer is stored into the places that need good performance.  I had to do
that for the redisplay algorithm in JOVE and for the regular expression
searches.

I'm not saying all this isn't possible.  It's just my belief that it's
not clearly a win.  In fact, in my eyes, for 99% of the editing that I
do, a buffer gap solution makes the most sense.

Just my two cents...

Jonathan Payne