Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site brl-tgr.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!mhuxt!houxm!whuxl!whuxlm!harpo!decvax!genrad!panda!talcott!harvard!seismo!brl-tgr!tgr!BillW@SU-SCORE.ARPA
From: BillW@SU-SCORE.ARPA (William Chops Westfield)
Newsgroups: net.micro
Subject: Re: Standard, What standard??? (size vs. speed)
Message-ID: <9614@brl-tgr.ARPA>
Date: Sat, 30-Mar-85 21:13:50 EST
Article-I.D.: brl-tgr.9614
Posted: Sat Mar 30 21:13:50 1985
Date-Received: Tue, 2-Apr-85 05:59:57 EST
Sender: news@brl-tgr.ARPA
Lines: 70

Al Filipski <al%mot.uucp@BRL-TGR.ARPA> says...

    Funny, I've always thought that size and speed were INVERSELY related.
    Take sort algorithms, f'rinstance.  One of the smallest sorts you can
    write is one of the worst-- the bubble....

From an abstract point of view, he is correct.  However, we are not
really talking about small algorithms vs large algorithms here.  Rather,
we are talking about the implementation of a given algorithm.  In this
latter case, it is true that a small implementation of an algorithm is
is faster than a large implementation.  You could call implementation a
sort of sub-algorithm.  Consider the following two pieces of code (part
of a TCP checksum routine for a 8088 processor.  Assume we want a 1's
complement 16 bit checksum of CX words pointed to by BX:)

---------------
Example 1:

lpchk:	add	ax,(bx)		| add value and carry	2 bytes, 18 cycles
	adc	ax,*0		|			3 bytes,  8 c
	inc	bx		| bump pointer		1 byte,   2 c
	inc	bx		|			1 byte,   2 c
	loop	lpchk		| do it again		2 bytes, 17 c

Example #2

	mov	si,bx
lpchk:	lodw			|get next word		1 byte, 16 cycles
	adc	bx,ax		|overlap adding last CY	2 bytes, 3 c
	loop	lpchk		|next word		2 bytes, 17 c
	mov	ax,bx		| put where results have to go
	adc	ax,*0		| add final carry bit.
---------------

Although both example use essentailly the same algorithm (there aren't a
whole lot of ways to calculate such a checksum!), the second example
is smaller and faster - It makes better use of the 8088 architecture.

Which brings up another point:  When trying to optimize code, I like
to divide optimization into three different levels: Algorithmic
optimization, architectural optimization, and system level optimization.

Algorithmic optimization is what is most frequently talked about. (for
exampe going from bubble to quick sort).  Unfortunately, timings in
this area are rather abstract.  Sorts tend to be rated by the number of
data comparisons that have to be done, for example.  Which is usually a
pretty good idea, but what if a compare is very fast compared to
instruction fetches (for example)?  Which brings us to:

Architectural optimization is where you take a look at your processor,
and try to make the fast parts of your code use fast instructions, and
such.  This is usually where us assembly language hackers gain speed
over HLL programmers.  In the examples given above, I might try to
ensure that the words being checksummed inside the loop were being
fetched froma word boundry if it were running on an 8086 proceessor,
since this would require half the bus cycles, for example...

Systems Level optimization is when you take into account nasty things
like the operating system and such.  For example, using the key as
an address may sound like a very fast way to sort things, but under
most operating systems, you are likely to spend most of the time
you might have saved down somewhere in the pager, moving pieces of
your address space on and off of disk, meanwhile letting other users
run as your console time climbs above that which a simple bubble sort
might have taken (run-time vs console time optimizations are a very
interesting instance of SLO problems...)


Sorry to have been so long winded...
Bill Westfield