Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!ogicse!schaefer
From: schaefer@ogicse.ogi.edu (Barton E. Schaefer)
Newsgroups: comp.lang.perl
Subject: Re: Ordering one array by another
Message-ID: <8274@ogicse.ogi.edu>
Date: 28 Mar 90 00:12:14 GMT
References: <8251@ogicse.ogi.edu> <7561@jpl-devvax.JPL.NASA.GOV>
Organization: Oregon Graduate Institute (formerly OGC), Beaverton, OR
Lines: 194

In article <7561@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
} In article <8251@ogicse.ogi.edu> schaefer@ogicse.ogi.edu (Barton E. Schaefer) writes:
} : I have a sorted list of keywords, and a file (well, stdin actually, but
} : no matter) consisting of lines each of which contains at most one of the
} : keywords.  There may be keywords not in the file, and lines with zero
} : keywords (which are ignored), but no pair of lines both contain the same
} : keyword.  What I want to do is reorder the input lines to match the order
} : in the list of keywords.
} 
} We really don't have enough info to decide the fastest algorithm, because
} we'd need to know the number of keywords, the relative number of lines not
} containing any keyword, and the usual location of the keyword in each line,
} and the frequency distribution of characters in the keys and lines.

The keyword is almost always the last word on the line.  Statistics on
a sample keys/lines pair, from wc:

	keys:         228     228    2646
	lines:        231    2076   15744

However, I the number of keywords and lines may vary a lot from trial to
trial, and I may need to do this many times for different keys/lines sets
in each run.  I do know that the number of keywords is almost always very
close to the number of lines, but there may be either a few more keys than
lines or a few more lines than keys.

} However, on the assumption that pattern compilation and matching is going
} to dominate the timing, I'd suspect the following of being QUITE fast:

[Algorithm deleted, it's included again below.]

It is indeed very fast (see below), but my tests revealed a possible bug
(see also below).

} On the other hand, the associative array costs you something in storage and
} efficiency.  Try it, and see if it's faster.

Here are some statistics for running Randal's suggested algorithm, my
original one, and Larry's over the keys/lines input sample wc'd above:

	Algorithm			Time in Seconds
	---------			---------------
	Randal's:			 45.85
	Randal's, using grep:		 45.75
	Original:			 40.60
	Original, as nested grep:	100.85
	Larry's:			 11.70
	Larry's, including build:	 11.85

Now, an important note here: these are all averaged over twenty runs,
all done by the program below, except for Larry's algorthm.  To test
Larry's, I had to take out the 1..20 loop and run the whole program
20 times, because any time I tried to clear the associative array used
by Larry's algorithm, successive runs wouldn't print anything!  And
if I didn't clear the array each time, every successive run would be
faster than the one before it, because it doesn't have to reallocate the
array each time.

Any idea why the "undef %found;" breaks this?  (Patchlevel 15.)

Here's the program:

#! /usr/bin/perl

# Whomp everything in ahead of time so tests aren't biased
# Then initialize locals from these values for each test

@keys = split("\n",`cat keys`);		# Read the keys
$init_lines = `cat lines`;		# Read the lines, in Randal's format
@init_lines = split("\n",$init_lines);	# Make an array of lines
$init_lines = "\n".$init_lines;		# Add the extra newline Randal needs

#
# Set up locals and time a single run of the current method.
# Sends the output to /dev/nul -- all we want are stats.
# Note that the time does not include initalizing the locals,
# nor any of the I/O manipulations.
#
sub Time_it {
    local($lines,@lines) = ($init_lines,@init_lines);
    open(HIDE, "> /dev/null") || die;
    select(HIDE);
    local($time) = time;
    do $method();
    $time = time - $time;
    close(HIDE);
    select(STDOUT);
    $time;
}

#
# Randal's search-and-replace algorithm
#
sub Randal {
    for $k (@keys) { $lines =~ s/(\n)(.*\b$k\b.*\n)/print($2),$1/e; }
}

#
# Randal's algorithm, grep replacing "for $k ..."
#
sub Randal_as_grep {
    grep(($lines =~ s/(\n)(.*\b$_\b.*\n)/print($2),$1/e), @keys);
}

#
# My original splicing algorithm
#
sub Original {
    for $k (@keys) {
	$i = $[;
	for $l (@lines) {
	    if ($l =~ /\b$k\b/) {
		print splice(@lines,$i,1)."\n";
		last;
	    }
	    $i++;
	}
    }
}

#
# The silly double-grep variant of the above, to show what NOT to do
#
sub Original_as_grep {
    grep(($k=$_,$i=0,$j=1,
	grep(($j && ++$i && /\b$k\b/ &&
		print(splice(@lines,$i-1,1)."\n"), --$j),
	    @lines)),
	@keys);
}

#
# Construct the on-the-fly program for Larry's algorithm.
# There's a little extra overhead for tacking back on the "\n" that
# was split() off to form @init_lines, but otherwise this is what
# Larry posted.  The idea is to construct an explicit test for each
# key (unroll the foreach $key loop), and then apply those tests to
# a line that has been prepared for search by the study command.
#
sub build_Larry {
$prog = <<EOF;
    while (\$_ = shift(\@lines)) { \$_ .= "\\n"; study;
EOF
foreach $key (@keys) {
    $prog .= <<EOF;
	\$found{'$key'} = \$_ if ?\\b$key\\b?;
EOF
}
$prog .= <<EOF;
    }
EOF
}

#
# Larry's algorithm -- eval the program and print the associative array
# (why the grep?  Isn't "print @found{@keys};" sufficient?).
# Note the problem with "undef %found" -- if that line is there, only
# the first pass through this algorithm prints anything, even though
# the eval'd program should be reconstructing $found{'key'} each time.
#
sub Larry {
    eval $prog; die $@ if $@;
    print grep($_,@found{@keys});
    # undef %found;			# This breaks 2nd+ passes
    # %found = ();			# This doesn't work either
}

#
# For fairness, we also check the time to rebuild the program each pass.
# 
sub Larry_with_build {
    &build_Larry;
    &Larry;
    undef $prog;
}

for $method (
		'Randal','Randal_as_grep',
		'Original','Original_as_grep',
		'Larry','Larry_with_build'
	    ) {
    $sum = 0;
    if ($method eq 'Larry') { &build_Larry; }
    else { undef $prog; }
    for (1..20) { $sum += &Time_it; }
    print "$method:  " . $sum / 20 . "\n";
}

# END
-- 
Bart Schaefer          "EARTH: Surrender IMMEDIATELY or we PICKLE DAN QUAYLE"

                                                                    "THPPFT!"
schaefer@cse.ogi.edu (used to be cse.ogc.edu)