Xref: utzoo comp.unix.questions:22849 comp.lang.perl:1442
Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!cs.utexas.edu!usc!elroy.jpl.nasa.gov!jpl-devvax!lwall
From: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall)
Newsgroups: comp.unix.questions,comp.lang.perl
Subject: Re: Regular Expression tool
Message-ID: <8353@jpl-devvax.JPL.NASA.GOV>
Date: 11 Jun 90 23:32:30 GMT
References: <1990Jun8.174056.15313@icc.com>
Reply-To: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall)
Organization: Jet Propulsion Laboratory, Pasadena, CA
Lines: 106

In article <1990Jun8.174056.15313@icc.com> wdm@icc.com (Bill Mulert) writes:
: Consider the following statements containing regular expressions:
:
:     echo "`expr \"$1\" : \"^[^=]*=\(.*\)\"`"
:
:     df_usr=`df | sed -n '/^\/usr[ ]/s/[^)]*):[ ]*\([^ ]*\).*/\1/p'`
:
:     sed -e 's/\([!:]\)\([0-9]\)/\1 \2/' \
:         -e '/!/s/^\([^ ][^ ]*\).*[,-][,-]*\([0-9][0-9]*\)$/\1 1-\2/' \
:         < .newsrc.old > .newsrc
:
:     sed 's/^\([^:! ]*\).*$/\1/' $ACTIVE | sort > $TMPFILE.1
:
: Do you have a headache, now? I do. I find any but the simplest regular
: expressions to be "write only". They are rather like C's declarations
: that so often cause even veteran programmers to look askance.
: Fortunately, we have cdecl to help create and decode the C declarations.
:
: I wish there were something similar for regular expressions. I would
: like to have a tool, call it regex, that would allow me to say:
:
:     regex ' "^[^=]*=\(.*\)\" '
: and have regex say, in plain language, what the expression means.
:
: Is there anything like that in existence? Any ideas on how large
: a project like that might be?

It's not likely to be too practical, for a couple of reasons. First,
there are a number of different standards out there. For instance, sed
and expr use \( ... \) to indicate grouping, while egrep and perl use
( ... ) for grouping, and \( and \) to indicate real parens.
(I'm of course prejudiced in favor of the latter, but I think it's more
readable on the whole, since you do grouping a lot more often than you
match real parens.) On top of that, when are ?, +, |, { and }
metacharacters? They are in some programs, and aren't in others. Are
you going to have a switch?

    regex -sed ' "^[^=]*=\(.*\)\" '
    regex -expr ' "^[^=]*=\(.*\)\" '
    regex -egrep ' "^[^=]*=\(.*\)\" '
    regex -perl ' "^[^=]*=\(.*\)\" '
    regex -ed ' "^[^=]*=\(.*\)\" '
    regex -emacs ' "^[^=]*=\(.*\)\" '
    regex -vi ' "^[^=]*=\(.*\)\" '

Second, your big problem is not so much the regular expressions
themselves as it is all the quoting you have to put around them because
of the paucity of quoting mechanisms. Take your first example:

    echo "`expr \"$1\" : \"^[^=]*=\(.*\)\"`"

If we blame the gobbledygookiness on the backslashes, we see that half
the problem is that we are quoting three deep, so we have to use \", and
the other half of the problem is that \( ... \) are the grouping
metacharacters. I think the following is more readable simply because of
the absence of \, which is simply too heavily overloaded in Unix:

    perl -e 'print shift =~ /^[^=]*=(.*)/' "$1"

Using /PATTERN/ to search filenames forces you to backslash all the
slashes in the pattern:

    df_usr=`df | sed -n '/^\/usr[ ]/s/[^)]*):[ ]*\([^ ]*\).*/\1/p'`
                           ^^

It helps to have an alternate pattern delimiting method. sed lets you
have an alternate delimiter on substitutions, but not on pattern
matches. (Perl gives you both.) Even in sed, you could write the above
as:

    df_usr=`df | sed -n 's#^/usr[ ][^)]*):[ ]*\([^ ]*\).*#\1#p'`

That gets rid of one backslash, anyway. Other filename patterns will
benefit more. Filename patterns are the primary reason I added
m#PATTERN# to perl, where # can be any delimiter.

Similarly, we see that a lot of cruft is there simply because of the
overly minimalistic implementations of some regexps.
Such as having to repeat character classes because there's no +, or
having to use uninterpretable whitespace because there's no alternate
way to specify spaces and tabs. Compare

:     sed -e 's/\([!:]\)\([0-9]\)/\1 \2/' \
:         -e '/!/s/^\([^ ][^ ]*\).*[,-][,-]*\([0-9][0-9]*\)$/\1 1-\2/' \
:         < .newsrc.old > .newsrc

to

    perl -p -e 's/([!:])([0-9])/$1 $2/' \
            -e '/!/ && s/^(\S+).*[,-]+([0-9]+)$/$1 1-$2/' \
            < .newsrc.old > .newsrc

Actually, I'd probably write that as

    perl -pe 's/:\s*/: /; s/!.*\D(\d+)$/! 1-$1/;' .newsrc.old >.newsrc

Whatever.

For the most part, I don't think the problem with understanding regular
expressions is the regular expressions themselves, but all the claptrap
surrounding them. And that will be very difficult to write a decoder
for. Unix is not a simple language.

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov