Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!rutgers!uwvax!oddjob!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.unix.wizards
Subject: Is Unix stdio slow?
Keywords: RMS, file system, stream files, stdio
Message-ID: <13733@mimsy.UUCP>
Date: 25 Sep 88 16:18:59 GMT
References: <411@marob.MASA.COM> <178@arnold.UUCP> <3442@crash.cts.com> <3951@psuvax1.cs.psu.edu>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 143

[moved from comp.os.vms]
In article <3951@psuvax1.cs.psu.edu> schwartz@shire (Scott Schwartz) writes:
>I've seen unix programs (things like a grep replacement) that got
>similar speedups by replacing stdio calls with read/write and a large
>buffer.  I wonder how much of that 2-6x is from overhead in stdio,
>rather than in the filesystem.  

Quite a bit.  The overhead is not all in stdio itself, but in inefficient
expansions of stdio macros:

>	while ((c = getchar()) != EOF)
>		putchar(c);

If you are not going to inspect each character, it seems a bit silly
to bring them in and put them out one at a time, but let us take a
look at what the traditional dumb Unix C compiler generates for this:

		jmp	while_test
	loop:
		<<code for putchar>>
	while_test:	| while ((c = getchar()) != EOF)
		sub	$1,stdin_cnt	| --cnt >= 0 ?
		jls	refill
		mov	stdin_ptr,r1	|	*(unsigned char *)p++
		clr	r0
		movb	@r1+,r0
		mov	r1,stdin_ptr
		jmp	merge
	refill:	push	stdin		| : _filbuf(&stdin)
		call	_filbuf
		tst	@sp+		| (some sort of pop)
	merge:	mov	r0,C(fp)	| c = ...
		cmp	r0,$-1		| c == EOF?
		jne	loop		| no, loop
		<<code to exit>>

Now, stdio reads in blocks, typically 512 to 65536 characters at
a time, so the most common path through this code is

	sub $1,stdin_cnt; no_jmp; mov stdin_ptr,r1	| 3
	clr r0; movb @r1+,r0; mov r1,stdin_ptr		| 3
	jmp merge; merge:				| 1
	mov r0,C(fp); cmp r0,$-1; jmp loop		| 3 = 10 instrs

Most of this is unnecessary!  If we put stdin_ptr in a register
in advance, and note that r0 (and hence C) can never be -1 since it
is zero-extended from a byte, we should see something like

	sub $1,stdin_cnt; no_jmp			| 2
	clr r0; movb @p+,r0				| 2
	mov r0,C(fp); jmp loop				| 2 = 6 instrs

The code to call `filbuf' has to save this extra pointer, but that is
no big deal.  If BUFSIZ is 1024, we saved 4*1023=4092 instructions at
the expense of one instruction when calling filbuf, for a total savings
of 4091 instructions per read!  We might save `only' ~2000 instructions
if stdin_ptr must be left global (which is true if the loop is
nontrivial).

What about putchar?  It is even worse:

	| putchar(C(fp))
		sub	$1,stdout_cnt	| --cnt >= 0 ?
		jls	flush
		mov	stdout_ptr,r0	| *p = c,
		mov	C(fp),@r0
		tstbit	$7,stdout_flg	| flag&IS_LINE_BUFFERED
		jz	no_nl_flush
		cmp	@r0,$10		| && *p == '\n' ?
		jne	no_nl_flush
		push	$10		|	_flsbuf(stdout, '\n')
		jmp	fls0		| (merge)
	no_nl_flush:
		clr	r1		| : *(unsigned char *)p++
		movb	@r0+,r1
		mov	r0,stout_ptr
		mov	r1,r0
		jmp	merge
	flush:	push	C(fp)		| : _flsbuf(stdout, c)
	fls0:	push	stdout
		call	_flsbuf
		cmp	@sp+,@sp+
	merge:

The most common case is, again, that the buffer does *not* fill.
Unfortunately, this case still has to test for line buffered
output, and the flag can (and in fact does) change after calls
to _flsbuf, so the test cannot easily be hoisted out of the
loop (it can be done, by splitting and cross-branching in the
flush code, but this is not often a good optimisation: this is
a peculiar case).

Eliminating line-buffered output descriptors shrinks putchar()
considerably:

	| p is `unsigned char *'
	| --cnt >= 0 ? *p++ = c : _flsbuf(stdout, c)
		sub	$1,stdout_cnt
		jls	flush
		mov	stdout_ptr
		mov	C(fp),@r1
		clr	r0
		mov	@r1+,r0
		mov	r1,stdout_ptr
		jmp	merge
	flush:	push	C(fp)
		push	stdout
		call	_flsbuf
		cmp	@sp+,@sp+
	merge:

The copy-a-buffer-at-a-time loop looks like this:

		jmp	while_tst
	loop:
		push	n	| write(1, buf, n)
		push	buf
		push	1
		call	write
		add	$6,sp	| (pop)
	while_tst:
		push	size	| n = read(0, buf, size)
		push	buf
		push	0
		call	read
		add	$6,sp
		mov	r0,n
		tst	r0	| while (... > 0)
		jgt	loop

which has *no* per-character overhead, compared with a minimum 6+11
(getchar+putchar) instruction per-character overhead for stdio.  When
the buffer size is large (e.g., 1024), that is 17000 instructions of
overhead per read and write call.  read and write are each thousands of
instructions, but they no longer swamp the overhead in stdio.  In
programs that do real work with each character, at least some of the
stdio `overhead' is in fact necessary, so you lose less.

Using fread and fwrite can help, if they are intelligently written.
(They are NOT so written in old BSD releases.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris