Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!unido!mikros!mwtech!martin
From: martin@mwtech.UUCP (Martin Weitzel)
Newsgroups: comp.unix.shell
Subject: Understanding the Bourne Shell (was Re: Finding the last arg)
Keywords: Bourne shell arguments
Message-ID: <1033@mwtech.UUCP>
Date: 7 Jan 91 20:11:20 GMT
References: <18476@shlump.nac.dec.com> <1990Dec27.060903.1604@onion.pdx.com> <1020@mwtech.UUCP> <443@minya.UUCP>
Reply-To: martin@mwtech.UUCP (Martin Weitzel)
Organization: MIKROS Systemware, Darmstadt/W-Germany
Lines: 308

In article <443@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
>> What ALLWAYS works in the Bourne-Shell is this:
>>
>>        for last do :; done
>
>Wow! A one-liner that works for more than 9 args! Of course, there's
>the question as to whether this loop is actually faster than starting
>a subprocess that just does puts(argv[artc-1]), but at least there's
>a way to do it that is portable.

I have compared the alternatives here on my 386 box and, as you might
guess, the differences in speed depend on the length of the argument
list. For up to ~25 arguments the for-loop is the fastest; above that,
up to ~100 arguments, there's little difference, but the for-loop uses
more usr-time and the sub-process more sys-time. There seem to be minor
differences depending on what is called as sub-process, i.e. a
specialized C program (as the poster suggested) or another shell-script
(as Maarten Litmaath posted earlier in this thread). For the rather
untypical size of 250 arguments there still isn't much difference, but
sometimes the sub-process is faster (the results vary over some range
and I didn't go to the effort of calculating the average). My general
experience with the 386 is that it starts sub-processes really fast, so
I think the for-do method will win even for more than 250 arguments on
a lot of systems.

(BTW: I've learned from my experiments that the shell internally limits
the number of arguments that can be passed to a sub-process to 254.
I always thought the only limit was the space supplied by the OS to
pass the stuff to the sub-process, which is typically several KByte for
the *contents* of arguments + environment. I never noticed the limit on
the *number* of arguments before.)

>That comment isn't worth wasting the bandwidth, of course; my motive
>for this followup is a bit of bizarreness that I discovered while
>testing this command. The usual format of a for loop is 3 lines:
>        for last
>        do :
>        done
>Usually when I want to collapse such vertical code into a horizontal
>format, I follow the rule "Replace the newlines with semicolons", and
>it works. For instance,
>        if [ ]
>        then
>        else
>        fi
>reduces to
>        if [ ];then ;else ;fi
>which I can do in vi via a series of "Jr;" commands. With the above
>for-loop, this gives
>        for last;do :;done
>which doesn't work. The shell gives a syntax error, complaining about
>an unexpected ';' in the line. Myself, I found this to be a somewhat
>unexpected error message. It appears my simple-minded algorithm for
>condensing code doesn't work in this case.
>
>So what's going on here? What the @#$^&#( is the shell's syntax that
>makes the semicolon not only unneeded, but illegal in this case?

Funny, I stumbled over the same thing when I "invented" my for-do
method for accessing the last argument some years ago. The explanation
is a bit longer, so all who aren't interested in the details should
leave at this point.
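
As a concrete reference point for the idiom under discussion, here is a
minimal sketch that wraps the for-do loop in a shell function. The
function name last_arg is made up for illustration and does not come
from the thread; only the loop itself is taken from the discussion, and
the sketch assumes a shell that supports functions and has printf.

```shell
# last_arg (hypothetical helper name): print the last of its arguments.
last_arg() {
    last=                   # start empty, so "no arguments" prints an empty line
    for last do :; done     # no "in" part: the loop walks the positional parameters
    printf '%s\n' "$last"   # after the loop, $last holds the final one
}

# Works for more than 9 arguments, which is what started the thread.
last_arg one two three four five six seven eight nine ten eleven
# prints: eleven
```

Note that "last" is an ordinary global variable here; a script that
already uses that name would have to pick another one.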

The syntax for the "for" statement is more or less the following (I
stick to the "yacc" style here, but enclose keywords in single quotes
even if they are longer than one character, which is not allowed with
"yacc"):

        for_stmt : 'for' NAME 'in' word_list SEP 'do' cmd_list 'done'
                 | 'for' NAME 'do' cmd_list 'done'
                 ;
        word_list: WORD
                 | word_list WORD
                 ;
        cmd_list : cmd arg_list SEP
                 | cmd_list cmd arg_list SEP
                 ;
        arg_list : /*empty*/
                 | arg_list WORD
                 ;
        SEP      : ';'
                 | '\n'
                 ;

(The meaning of NAME and WORD should be obvious - I don't want to go
too far into the syntactic details. I have also left out an
undocumented shell feature that allows you to replace "do" and "done"
with "{" and "}"; note that the latter is only true for for-do-done,
not for while-do-done and until-do-done!)

Note that white space is allowed everywhere between the tokens and
nonterminals. But SEP is a mandatory separator (which can be a newline
or a semicolon). The reason for requiring a separator in some cases is
simple: there is the possibility that some keywords of the shell might
also be used as regular arguments to commands or within a word_list -
we'll come back to this in a moment.

The shell detects the two forms of the "for" statement simply by
looking at what follows the loop variable. If it is an "in", a
word_list must follow, which in turn must be terminated by a mandatory
separator, as explained above. If a "do" follows, there is no
word_list. If a semicolon follows the loop variable, this is against
the syntax (this is what puzzled the poster).

Of course, Mr. Bourne could have made the syntax allow for it by
changing the RHS of the rule for the "for" statement without "in" into

        'for' NAME SEP 'do' cmd_list 'done'

but IMHO the difficulties of the poster (and many more, me included)
have some other reason, which has to do with the difference between

  - mandatory command separators or
    terminators and
  - optional white space before commands and keywords and
  - spaces as separators of command and argument list and
  - the semicolon being allowed only in the first case and
  - the newline being allowed in the first and second case,
  - space characters being allowed in the second and third.

In a simple command, i.e. a program name followed by some arguments,
there's not much of a problem, as it seems "natural" to most users to
type spaces to separate the arguments and newlines to terminate
commands, and it seems obvious that the two cannot be used
interchangeably: this would either terminate the argument list
prematurely (if you try to separate arguments with a newline) or fail
to end your command properly (if you don't type a newline).

Now let's consider the more complex shell statements. Some very stupid
users might in fact expect that the shell can read their mind, but all
the others will understand that the shell must either treat ALL
keywords (and maybe even all the commands) specially, not allowing them
as regular arguments, or needs a separator other than the one used
between arguments if a keyword is to follow a command (or if there are
to be two commands) on the same line. This logic can be applied to most
keywords, regardless of whether they introduce some complex command or
mark the beginning of the next part of the command (like "then" or
"else" in an "if" statement).

More puzzling is that the shell also ALLOWS newlines in place of spaces
where it's clear that a complex command isn't complete%. One place
where this occurs is when you have started a "for" statement and have
not yet supplied the matching "done". For example for var in foo bar do
cmd done is all allowed, though seldom used, except for exactly one
newline in the place marked (2).
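
A small sketch of newlines standing in for spaces inside an unfinished
"for" statement, assuming a POSIX-style sh. The variable name "seen"
and the loop body are arbitrary choices for illustration; the only
newline that acts as a mandatory separator here is the one ending the
word list "foo bar".

```shell
seen=
for var
in foo bar
do

    seen="$seen<$var>"

done
echo "$seen"
# prints: <foo><bar>
```

The blank lines around the assignment are accepted, as is the newline
between the loop variable and "in".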

Note that the newlines before and after "cmd" here cannot simply be
seen as "empty commands", because if they could, the following would be
legal:

        for var in foo bar
        do
        done

which it IS NOT, since at least ONE command is necessary between "do"
and "done" (please refer to the syntax given above). Note further that
a semicolon by itself is NOT an empty command, as

        for var in foo bar
        do ;
        done

does not work - you need at least the colon here:

        for var in foo bar
        do :
        done

------
%: More puzzling is that the shell allows it only in some places.
   E.g. "for " is a syntax error while "for i " patiently waits for
   the "in" or "do".
------

>One of the real hassles I keep finding with /bin/sh (and /bin/csh is
>even worse ;-) is that the actual syntax regarding things like white
>space, newlines, and semicolons seems to be a secret. It often takes
>a lot of experimenting to find a way to get these syntax characters
>right. Is there any actual documentation on sh's syntax? Is it truly
>as ad-hoc as the above example implies?

For all I know the C-shell is more or less "ad-hoc", but for the Bourne
shell (which, until now and for the rest of this article, I always mean
when I speak of "the shell") you can find a formal syntax already in a
very ancient document, the "Bell System Technical Journal" (BSTJ for
short) from July/August 1978, ISSN 0005-8580. The grammar starts on
page 1987 as Appendix A of an article written by S.R. Bourne himself.
Though it fails to mention some of the finer points (like the
space/newline problems just discussed), it may serve as a start for
you, and I found that it could even be fed to yacc without much trouble
(I never tried to fill in the actions to make it work as a "real" shell
...)

>Is there perhaps some logical
>structure underlying it all that would explain why
>        for last do :; done
>and
>        for last
>        do :
>        done
>both work but
>        for last;do :;done
>doesn't?

Well, "logic" is not so much an absolute value as many of us think, as
it often depends on what you expect. This is so because we may think we
have recognized something as a "rule" and tend to see all contradicting
observations as "illogical", when in fact the examples we studied were
just too limited for us to recognize that we had only seen a special
case (in this generality that may also be true for the things we
consider to be the "universal laws" or "laws of nature" - but this
takes us away from the topic).

Now, what you observed was that newline and semicolon were
interchangeable in all the examples you had looked at and tried before
you came to that "for" statement. (Remember I told you at the beginning
that I had the same problem with this - so it cannot be said that your
expectations were without reason.)

A bit more experimentation could also have shown that in general the
two are not really interchangeable. E.g. if you type a single newline,
nothing happens (except that the shell prompts again); if you type two
newlines, still nothing happens; but if you type a semicolon + a
newline, this is a syntax error. Hence semicolon and newline are not as
interchangeable as it seemed at first glance.

Now, having a little more experience, we can come up with some other
explanation:

  - commands cannot be empty (they consist at least of an external or
    builtin command; the ":" is the builtin command which does nothing
    but evaluate its arguments)
  - a semicolon or a newline% terminates a command
  - a command list is a non-empty sequence of commands, all of which
    must be properly terminated
  - a semicolon or a newline terminates the word list of the "in" part
    of the "for" statement
  - space characters and newlines are allowed before commands
  - nearly all the keywords of the shell are only recognized if they
    are found in the position of a command, i.e.
    if there is a previous command or a word list of a "for" statement,
    there MUST be a separator and there CAN be some space characters or
    newlines
  - the most important exceptions to the above are "in" (for the "for"
    statement as well as for the "case" statement) and "do". But as the
    word list in the "in" part of a "for" statement (or the command
    list after the "while" or "until" in such a statement) must be
    properly terminated, a "do" NOT in command position can only occur
    in an "in"-less "for" statement.

-----
%: There are other valid command separators/terminators that are
   recognized together with the semicolon, but this doesn't matter
   here.
-----

In some sense, these are the "laws of nature" as derived from observing
the shell's behaviour. As the shell is not really nature but the
outcome of the thoughts of some human being, we could of course
complain now that this is "illogical" (compared to our sense of logic!)
or that there are "too many exceptions" and that it could be simplified
with fewer, but more general rules. But when thinking about how to
smooth things out by using fewer rules, we often do not recognize all
the consequences that this would have.

Assume for a moment we would treat both newline and semicolon as
statement terminators. Have you really considered what this would mean?
Typing a newline (at your terminal or as an empty line in a shell
script) would be a syntax error (sic!), as a single semicolon is.

Quite simple, I hear you say, then we allow an empty statement to be
really empty, which would allow for single newlines as well as single
semicolons. But be careful! We then must think about the exit status of
such a statement. Should it always be true, like that of the colon
command? But then you must be very careful about inserting empty lines
into a script, because the following two would have different semantics

        if              |       if cmd
            cmd         |
        then            |       then

and you must never separate command execution and accessing $?
by a newline, since the empty command "newline" destroys the value of
any previous command's exit status.

Again I hear you say, we make the empty statement special - it shall
leave the status of the "real" command that was executed last. But now
the following will become dangerous

        while
        do
        done

as it depends on the last command BEFORE the loop when the loop is
entered the first time, and after that on the last command executed
WITHIN the loop. So, step by step, we may introduce more special casing
for something that looked like a trivial change in the first place!

I hope you have gained a little more understanding of the syntax of the
shell now. It isn't really as strange as it might seem at first glance,
though I admit a few things are not so obvious and it's easy to come to
some wrong conclusions if you have insufficient experience. (If this
article hadn't become so long I could write a little more on it - maybe
some other time.)
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83