Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!ogicse!schaefer From: schaefer@ogicse.ogi.edu (Barton E. Schaefer) Newsgroups: comp.lang.perl Subject: Re: Ordering one array by another Message-ID: <8274@ogicse.ogi.edu> Date: 28 Mar 90 00:12:14 GMT References: <8251@ogicse.ogi.edu> <7561@jpl-devvax.JPL.NASA.GOV> Organization: Oregon Graduate Institute (formerly OGC), Beaverton, OR Lines: 194 In article <7561@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes: } In article <8251@ogicse.ogi.edu> schaefer@ogicse.ogi.edu (Barton E. Schaefer) writes: } : I have a sorted list of keywords, and a file (well, stdin actually, but } : no matter) consisting of lines each of which contains at most one of the } : keywords. There may be keywords not in the file, and lines with zero } : keywords (which are ignored), but no pair of lines both contain the same } : keyword. What I want to do is reorder the input lines to match the order } : in the list of keywords. } } We really don't have enough info to decide the fastest algorithm, because } we'd need to know the number of keywords, the relative number of lines not } containing any keyword, and the usual location of the keyword in each line, } and the frequency distribution of characters in the keys and lines. The keyword is almost always the last word on the line. Statistics on a sample keys/lines pair, from wc: keys: 228 228 2646 lines: 231 2076 15744 However, I the number of keywords and lines may vary a lot from trial to trial, and I may need to do this many times for different keys/lines sets in each run. I do know that the number of keywords is almost always very close to the number of lines, but there may be either a few more keys than lines or a few more lines than keys. } However, on the assumption that pattern compilation and matching is going } to dominate the timing, I'd suspect the following of being QUITE fast: [Algorithm deleted, it's included again below.] It is indeed very fast (see below), but my tests revealed a possible bug (see also below). } On the other hand, the associative array costs you something in storage and } efficiency. Try it, and see if it's faster. Here are some statistics for running Randal's suggested algorithm, my original one, and Larry's over the keys/lines input sample wc'd above: Algorithm Time in Seconds --------- --------------- Randal's: 45.85 Randal's, using grep: 45.75 Original: 40.60 Original, as nested grep: 100.85 Larry's: 11.70 Larry's, including build: 11.85 Now, an important note here: these are all averaged over twenty runs, all done by the program below, except for Larry's algorthm. To test Larry's, I had to take out the 1..20 loop and run the whole program 20 times, because any time I tried to clear the associative array used by Larry's algorithm, successive runs wouldn't print anything! And if I didn't clear the array each time, every successive run would be faster than the one before it, because it doesn't have to reallocate the array each time. Any idea why the "undef %found;" breaks this? (Patchlevel 15.) Here's the program: #! /usr/bin/perl # Whomp everything in ahead of time so tests aren't biased # Then initialize locals from these values for each test @keys = split("\n",`cat keys`); # Read the keys $init_lines = `cat lines`; # Read the lines, in Randal's format @init_lines = split("\n",$init_lines); # Make an array of lines $init_lines = "\n".$init_lines; # Add the extra newline Randal needs # # Set up locals and time a single run of the current method. # Sends the output to /dev/nul -- all we want are stats. # Note that the time does not include initalizing the locals, # nor any of the I/O manipulations. # sub Time_it { local($lines,@lines) = ($init_lines,@init_lines); open(HIDE, "> /dev/null") || die; select(HIDE); local($time) = time; do $method(); $time = time - $time; close(HIDE); select(STDOUT); $time; } # # Randal's search-and-replace algorithm # sub Randal { for $k (@keys) { $lines =~ s/(\n)(.*\b$k\b.*\n)/print($2),$1/e; } } # # Randal's algorithm, grep replacing "for $k ..." # sub Randal_as_grep { grep(($lines =~ s/(\n)(.*\b$_\b.*\n)/print($2),$1/e), @keys); } # # My original splicing algorithm # sub Original { for $k (@keys) { $i = $[; for $l (@lines) { if ($l =~ /\b$k\b/) { print splice(@lines,$i,1)."\n"; last; } $i++; } } } # # The silly double-grep variant of the above, to show what NOT to do # sub Original_as_grep { grep(($k=$_,$i=0,$j=1, grep(($j && ++$i && /\b$k\b/ && print(splice(@lines,$i-1,1)."\n"), --$j), @lines)), @keys); } # # Construct the on-the-fly program for Larry's algorithm. # There's a little extra overhead for tacking back on the "\n" that # was split() off to form @init_lines, but otherwise this is what # Larry posted. The idea is to construct an explicit test for each # key (unroll the foreach $key loop), and then apply those tests to # a line that has been prepared for search by the study command. # sub build_Larry { $prog = <