Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!ucsd!ucbvax!tut.cis.ohio-state.edu!purdue!haven!aplcen!bink
From: bink@aplcen.apl.jhu.edu (Ubben Greg)
Newsgroups: comp.unix.questions
Subject: Re: sed script to remove cr/lf except at paragraph breaks
Summary: an explanation and 2 solutions
Keywords: sed msdos
Message-ID: <1292@aplcen.apl.jhu.edu>
Date: 21 May 89 18:52:01 GMT
References: <119@sherpa.UUCP>
Reply-To: bink@aplcen.apl.jhu.edu (Greg Ubben)
Distribution: na
Organization: The Johns Hopkins University, Baltimore MD
Lines: 62

In article <119@sherpa.UUCP> rac@sherpa.UUCP (Roger A. Cornelius) writes:
> I'm in need of a sed script to remove MSDOS cr/lf (actually replace each
> cr/lf combination with one space) except at the start of a paragraph.
> i.e. only the cr/lf preceding a paragraph break should remain. Paragraphs
> are marked only by four leading spaces and nothing else.
>
> Here's where I am now:
>
> N
> h
> /\n /{
> P
> D
> }
> s/^M\n/ /g

The h here is useless, because you never use G, g, or x to get the text
back. The problem with using N to gather an arbitrary number of lines in
the pattern space is that SED doesn't keep the pattern space between cycles
(unless you can make the D command work out), so you must code an explicit
loop:

    : loop
    $q
    N
    /\n /{ P; D; }
    s/^M\n/ /
    b loop

Also, the $q is needed because SED will stop dead without printing the
pattern space if an N (or n) is attempted on the last line of the input.

If you don't care for "gotos" (or correctness), here's an alternative
method that makes use of the hold space and SED's natural cycle for
looping:

    /^ /!{ H; $!d; }
    x
    1d
    s/^M\n/ /g

Since this algorithm is based on the transition BETWEEN two paragraphs,
the 1d and $! are necessary to handle the special cases of the first and
last lines (and even then it doesn't work right when the first line is not
the beginning of a paragraph or the last line IS the beginning of a
paragraph).
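
For anyone who wants to try the two scripts, here is a minimal sketch as a
POSIX shell file. The function names (join_loop, join_hold), the sample
text, and the CR variable are made up for this illustration and are not
from the original article. The paragraph pattern is spelled out with the
four leading spaces the question describes, and a sed that accepts \n in a
regular expression is assumed.

```shell
#!/bin/sh
# Illustration only: helper names and sample text are hypothetical.

CR=$(printf '\r')    # a literal carriage return, written ^M in the post

# Explicit-loop version: keep appending lines with N until the next
# line starts a paragraph, then print (P) and drop (D) the finished one.
join_loop() {
    sed -e ':loop' \
        -e '$q' \
        -e 'N' \
        -e '/\n    /{' -e 'P' -e 'D' -e '}' \
        -e "s/$CR\\n/ /" \
        -e 'b loop'
}

# Hold-space version: collect non-paragraph lines in the hold space and
# swap the finished paragraph back in when the next paragraph begins.
join_hold() {
    sed -e '/^    /!{' -e 'H' -e '$!d' -e '}' \
        -e 'x' \
        -e '1d' \
        -e "s/$CR\\n/ /g"
}

# Made-up MSDOS-style input: two paragraphs, every line ends in CR/LF.
sample() {
    printf '    One\r\ntwo\r\nthree\r\n    Four\r\nfive\r\n'
}

# Dump the result byte by byte so the surviving CRs are visible.
sample | join_loop | od -c
sample | join_hold | od -c
```

On this sample, where the first line starts a paragraph and the last line
does not, both functions print two lines, each still ending in CR/LF.
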
This problem requires a 1-line look-ahead, and in general, the x command
is a good way to implement this in SED.

> This works correctly for the first match, ie beginning of a paragraph,
> but for all other lines, the substitution of a space for cr/lf only
> works correctly for the first occurrence in the line (the g flag seems
> to have no effect). But there are two occurrences due to the N function.

That is because you never gather more than 2 lines in the pattern space at
once, since the cycle ends as explained above. Two lines have only one
newline between them, so there is only one cr/lf for the s command to
replace, with or without the g flag.

> How can I match (and substitute for) the terminating nl in the pattern
> space? The sed man pages concerning addresses say you can't. What am
> I missing or how can I get around this?

The terminating newline is not really in the pattern space -- it is always
tacked on when the line is output -- so the only way to refer to it is
with $, which matches the end of the pattern space.

-- Greg Ubben
   "A SED fanatic"
   bink@aplcen.apl.jhu.edu