Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!caip!elbereth!rutgers!husc6!seismo!umcp-cs!chris
From: chris@umcp-cs.UUCP (Chris Torek)
Newsgroups: net.emacs
Subject: Much ado about regular expressions
Message-ID: <3956@umcp-cs.UUCP>
Date: Tue, 21-Oct-86 17:01:48 EDT
Article-I.D.: umcp-cs.3956
Posted: Tue Oct 21 17:01:48 1986
Date-Received: Wed, 22-Oct-86 06:34:02 EDT
References: <1168@peregrine.UUCP>
Reply-To: chris@umcp-cs.UUCP (Chris Torek)
Organization: University of Maryland, Dept. of Computer Sci.
Lines: 117
Summary: vi's g/re/d is easy in Gosling and Gnu

(Warning: the following article will tell you more than you ever
wanted to know about playing with regular expressions.)

In article <1168@peregrine.UUCP> someone writes:
>Since I have switched from vi to EMACS, there is one thing that I missed
>more than anything else. The ability to perform an operation on all
>the lines that met a particular criteria(specified by a regular expression).
>For instance in vi, I could type in "/[A-Z][a-z]*/d" to delete all lines
>that met the specified criteria or I could type in
>"/\([A-Za-z][A-Za-z]*(\).*\()\)/s//\1\2". How would I do similar operations
>in EMACS?

(You left out the `g': `g/[A-Z][a-z]*/d'.)

Some of these operations are best done by writing MLisp or elisp
code, but note that a global delete operation is trivial due to the
way regular expressions work, with the addition that Emacs can match
newlines explicitly.  Simply add `.*' at the front of your R.E., and
add `.*<^J>' at the end:

        ESC-x : re-replace-string
        Old pattern: .*[A-Z][a-z]*.*<^Q><^J>
        New string:

(Note that this should be done after moving to the top of the buffer,
since Emacs's replace operations work from wherever you are now to
the end of the buffer.)  Since `.' matches any character but newline,
and `{class}*' matches the longest possible sequence of {class}, this
will always match full lines containing at least one [A-Z].

The pattern can be simplified as well.
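
As a quick check outside Emacs, the wrapping trick above can be
sketched in Python.  This is only an illustration: it assumes that
Python's `re` module treats `.' and `*' the way Emacs does here (`.'
never matches a newline, `*' is greedy), and the function name
delete_matching_lines is made up for the example.

```python
import re


def delete_matching_lines(text: str, line_regex: str) -> str:
    """Delete every whole line of `text` that contains a match for `line_regex`.

    Hypothetical helper, not an Emacs function: it wraps the pattern as
    `.*` + pattern + `.*\n`, the same shape as the Old pattern above.
    """
    # (?:...) only groups the caller's pattern so the wrapping stays correct
    # even if the pattern contains `|`; it is an addition for safety.
    whole_line = re.compile(".*(?:" + line_regex + ").*\n")
    return whole_line.sub("", text)


sample = "keep this\nDrop this\nalso keep\nand Drop\n"
print(delete_matching_lines(sample, "[A-Z][a-z]*"))
```

A last line with no trailing newline would be left alone, since the
wrapped pattern insists on matching the newline.
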
The [a-z]* part is unnecessary, as it matches zero or more `a's,
`b's, ..., `z's.  Yet the implied `.*' in vi's global, or the
explicit one in Emacs, subsumes this:

        Old pattern: .*[A-Z].*<^Q><^J>

There is one final possible optimisation that is very useful when
dealing with large files.  Emacs's search code runs faster when it
can do an `anchored search'.  (I am not using `anchored' in quite the
same sense as Snobol here.  There may be a better term, but I cannot
think of it offhand.)  By this I mean that a first character that is
considered `literal' speeds the matching operation.  For example,
searching for `[A-Z][A-Z]*' is slow, but searching for `A[A-Z]*' is
fast.  The reason is that a literal match (the first `A' here) is a
common case, and has been optimised by having the search code first
find one `A' before trying the full-blown regular expression match
operation.

But look at this: our original pattern is required to match a full
line!  It must start at the beginning of a line, find one character
in [A-Z], match the rest of the line, then pick up a newline.  So we
should be able to `anchor' it to the beginning of a line.  What
begins a line?  Well, `^' in a regular expression should do this.  We
could use the pattern

        ^.*[A-Z].*<^J>

Unfortunately, this does not run any faster.  Peeking at the innards
of the regular expression matcher shows why: `^' is not considered a
literal character.  Curses!  (No, not the library.)

But lo! there is another way to denote the beginning of a line.
Every line begins after the previous line ends, and every previous
line ends with a newline!  We can use instead the pattern

        <^J>.*[A-Z].*<^J>

But---oops!---we forgot something.  The very first line does not have
a previous line.  Now what can we do?  When all else fails, cheat:
Add a blank line at the top of the file.  Now we have a previous
line, and can use our modified pattern:

        Old pattern: <^Q><^J>.*[A-Z].*<^Q><^J>
        New string:

Whoops, that seems to have deleted all the newlines as well.
That anchor we added came from the previous line, so we must put it
back:

        New string: <^Q><^J>

But this is not necessary.  Since we know all about how .* matches
everything it can, we simply notice that that final newline on the
original pattern is not necessary.  If we leave it out, Emacs will
not match the newline between the line we wanted to delete and the
next.  But that is all right: If we have Emacs leave that newline
behind, it will make up for the newline we stole from the previous
line.  Thus the final pattern is:

        Old pattern: <^Q><^J>.*[A-Z].*
        New string:

Of course, when we are all done we have to clean up: we stuck an
extra blank line at the top of the buffer so that we could cheat.
The ultimate sequence of commands, then, is

        ESC-<                      (top of buffer)
        ^O                         (add that extra blank line)
        ESC-x re-replace-string    (do the replace)
        ^Q ^J .*[A-Z].* RET        (type in the old pattern)
        RET                        (specify a blank new string)
        ^D                         (delete that extra blank line)

And lo! Emacs deletes every line containing an uppercase letter.  Not
only that, it even does it faster than vi! :-)

(Actually, chances are that typing

        ESC-< ^@ ESC-> ESC-x filter-region egrep -v "[A-Z]" RET

is just as fast, and easier to remember.  We can use a wrench as a
hammer, but having the hammer too is nice.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:   seismo!umcp-cs!chris
CSNet:  chris@umcp-cs           ARPA:   chris@mimsy.umd.edu