Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!rutgers!uwvax!oddjob!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.unix.wizards
Subject: Is Unix stdio slow?
Keywords: RMS, file system, stream files, stdio
Message-ID: <13733@mimsy.UUCP>
Date: 25 Sep 88 16:18:59 GMT
References: <411@marob.MASA.COM> <178@arnold.UUCP> <3442@crash.cts.com> <3951@psuvax1.cs.psu.edu>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 143

[moved from comp.os.vms]

In article <3951@psuvax1.cs.psu.edu> schwartz@shire (Scott Schwartz) writes:
>I've seen unix programs (things like a grep replacement) that got
>similar speedups by replacing stdio calls with read/write and a large
>buffer.  I wonder how much of that 2-6x is from overhead in stdio,
>rather than in the filesystem.

Quite a bit.  The overhead is not all in stdio itself, but in inefficient
expansions of stdio macros:

>       while ((c = getchar()) != EOF)
>               putchar(c);

If you are not going to inspect each character, it seems a bit silly
to bring them in and put them out one at a time, but let us take a look
at what the traditional dumb Unix C compiler generates for this:

                jmp     while_test
        loop:
                <>
        while_test:                     | while ((c = getchar()) != EOF)
                sub     $1,stdin_cnt    | --cnt >= 0 ?
                jls     refill
                mov     stdin_ptr,r1    | *(unsigned char *)p++
                clr     r0
                movb    @r1+,r0
                mov     r1,stdin_ptr
                jmp     merge
        refill:
                push    stdin           | : _filbuf(&stdin)
                call    _filbuf
                tst     @sp+            | (some sort of pop)
        merge:
                mov     r0,C(fp)        | c = ...
                cmp     r0,$-1          | c == EOF?
                jne     loop            | no, loop
                <>

Now, stdio reads in blocks, typically 512 to 65536 characters at a
time, so the most common path through this code is

        sub $1,stdin_cnt; no_jmp; mov stdin_ptr,r1      | 3
        clr r0; movb @r1+,r0; mov r1,stdin_ptr          | 3
        jmp merge; merge:                               | 1
        mov r0,C(fp); cmp r0,$-1; jmp loop              | 3 = 10 instrs

Most of this is unnecessary!
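
For reference, the C that the getchar half of the listing above is
compiled from has roughly the following shape.  This is a sketch only:
`struct sbuf', `sb_fill', `SB_GETC' and `sb_count' are hypothetical
names standing in for FILE, _filbuf, the getc macro and a caller; they
are not taken from any real stdio source, and the refill routine here
simply reports end of input rather than calling read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of FILE that getc() touches. */
struct sbuf {
    int cnt;                    /* characters left in the buffer */
    const unsigned char *ptr;   /* next character to hand out */
    int refills;                /* how many times the slow path ran */
};

/* Slow path.  A real _filbuf would call read(); this sketch only
 * counts the call and reports end of input. */
static int sb_fill(struct sbuf *f)
{
    assert(f != NULL);
    f->refills++;
    f->cnt = 0;
    return -1;                  /* plays the role of EOF */
}

/* Fast path: decrement the count, test it, fetch one byte through an
 * unsigned char pointer (so the result is never negative), advance. */
#define SB_GETC(f) (--(f)->cnt >= 0 ? (int)*(f)->ptr++ : sb_fill(f))

/* Count characters using the macro.  As written, every iteration
 * updates f->cnt and f->ptr in place and compares the result with -1. */
static long sb_count(struct sbuf *f)
{
    long n = 0;

    while (SB_GETC(f) != -1)
        n++;
    return n;
}
```

Because the byte is fetched through an `unsigned char *', a data byte
of 0xFF comes back as 255 and cannot be mistaken for the -1 returned
by the slow path.
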
If we put stdin_ptr in a register in advance, and note that r0 (and
hence C) can never be -1 since it is zero-extended from a byte, we
should see something like

        sub $1,stdin_cnt; no_jmp        | 2
        clr r0; movb @p+,r0             | 2
        mov r0,C(fp); jmp loop          | 2 = 6 instrs

The code to call `_filbuf' has to save this extra pointer, but that is
no big deal.  If BUFSIZ is 1024, we saved 4*1023=4092 instructions at
the expense of one instruction when calling _filbuf, for a total
savings of 4091 instructions per read!  We might save `only' ~2000
instructions if stdin_ptr must be left global (which is true if the
loop is nontrivial).

What about putchar?  It is even worse:

                                        | putchar(C(fp))
                sub     $1,stdout_cnt   | --cnt >= 0 ?
                jls     flush
                mov     stdout_ptr,r0   | *p = c,
                mov     C(fp),@r0
                tstbit  $7,stdout_flg   | flag&IS_LINE_BUFFERED
                jz      no_nl_flush
                cmp     @r0,$10         | && *p == '\n' ?
                jne     no_nl_flush
                push    $10             | _flsbuf(stdout, '\n')
                jmp     fls0            | (merge)
        no_nl_flush:
                clr     r1              | : *(unsigned char *)p++
                movb    @r0+,r1
                mov     r0,stdout_ptr
                mov     r1,r0
                jmp     merge
        flush:
                push    C(fp)           | : _flsbuf(stdout, c)
        fls0:
                push    stdout
                call    _flsbuf
                cmp     @sp+,@sp+
        merge:

The most common case is, again, that the buffer does *not* fill.
Unfortunately, this case still has to test for line buffered output,
and the flag can (and in fact does) change after calls to _flsbuf, so
the test cannot easily be hoisted out of the loop (it can be done, by
splitting and cross-branching in the flush code, but this is not often
a good optimisation: this is a peculiar case).  Eliminating
line-buffered output descriptors shrinks putchar() considerably:

                                        | p is `unsigned char *'
                                        | --cnt >= 0 ?
                                        |   *p++ = c : _flsbuf(stdout, c)
                sub     $1,stdout_cnt
                jls     flush
                mov     stdout_ptr,r1
                mov     C(fp),@r1
                clr     r0
                movb    @r1+,r0
                mov     r1,stdout_ptr
                jmp     merge
        flush:
                push    C(fp)
                push    stdout
                call    _flsbuf
                cmp     @sp+,@sp+
        merge:

The copy-a-buffer-at-a-time loop looks like this:

                jmp     while_tst
        loop:
                push    n               | write(1, buf, n)
                push    buf
                push    1
                call    write
                add     $6,sp           | (pop)
        while_tst:
                push    size            | n = read(0, buf, size)
                push    buf
                push    0
                call    read
                add     $6,sp
                mov     r0,n
                tst     r0              | while (... > 0)
                jgt     loop

which has *no* per-character overhead, compared with a minimum of 6+11
(getchar+putchar) instructions of per-character overhead for stdio.
When the buffer size is large (e.g., 1024), that is about 17000
(17*1024) instructions of overhead per read and write call.  read and
write are each thousands of instructions, but they no longer swamp the
overhead in stdio.

In programs that do real work with each character, at least some of
the stdio `overhead' is in fact necessary, so you lose less.  Using
fread and fwrite can help, if they are intelligently written.  (They
are NOT so written in old BSD releases.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu     Path: uunet!mimsy!chris
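
The two copy loops compared in the article can be sketched in C as
follows.  This is a minimal illustration, assuming a POSIX system for
read and write; the function names `copy_stdio' and `copy_rw', the
return conventions and the partial-write handling are choices made for
this sketch and do not come from the article.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Per-character copy: each getc/putc goes through the stdio macro
 * (or function) path once per byte.  Returns bytes copied, or -1. */
static long copy_stdio(FILE *in, FILE *out)
{
    long n = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF)
            return -1;
        n++;
    }
    return n;
}

/* Buffer-at-a-time copy: user code does no per-character work; all
 * of the cost is in the read and write calls themselves.
 * Returns bytes copied, or -1 on error. */
static long copy_rw(int in_fd, int out_fd, char *buf, size_t size)
{
    long total = 0;
    ssize_t n;

    assert(buf != NULL && size > 0);
    while ((n = read(in_fd, buf, size)) > 0) {
        ssize_t off = 0;

        while (off < n) {       /* write may accept fewer bytes than asked */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

            if (w <= 0)
                return -1;
            off += w;
        }
        total += n;
    }
    return n < 0 ? -1 : total;
}
```

Calling copy_stdio(stdin, stdout) corresponds to the getchar/putchar
loop, and copy_rw(0, 1, buf, sizeof buf) to the read/write loop; timing
both on the same large file is one way to see how much of the
difference on a given system is stdio overhead.
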