Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!ncar!midway!quads.uchicago.edu!goer
From: goer@quads.uchicago.edu (Richard L. Goerwitz)
Newsgroups: comp.lang.icon
Subject: sgml stripping
Message-ID: <1990Dec6.224619.22998@midway.uchicago.edu>
Date: 6 Dec 90 22:46:19 GMT
Sender: news@midway.uchicago.edu (News Administrator)
Organization: University of Chicago
Lines: 489

Mad as this sounds, I occasionally find it expedient simply to strip
out <>-style tags from an SGML-encoded file.  Here's a very simple
program to do this.  I dunno, but I'd expect that it would work
with the perverted variant of SGML mandated in the U of Chicago
guide to electronic manuscripts.

I posted this elsewhere, but since it's Icon code, I figured someone
here might want to take a look at it.

-Richard

---- Cut Here and feed the following to sh ----
#!/bin/sh
# This is a shell archive (produced by shar 3.49)
# To extract the files from this archive, save it to a file, remove
# everything above the "!/bin/sh" line above, and type "sh file_name".
#
# made 12/06/1990 07:11 UTC by goer@sophist.uchicago.edu
# Source directory /u/richard/Stripsgml
#
# existing files will NOT be overwritten unless -c is specified
# This format requires very little intelligence at unshar time.
# "if test", "cat", "rm", "echo", "true", and "sed" may be needed.
#
# This shar contains:
# length  mode       name
# ------ ---------- ------------------------------------------
#   3170 -r--r--r-- stripsgml.icn
#   2615 -r--r--r-- stripunb.icn
#   1915 -r--r--r-- readtbl.icn
#   2084 -r--r--r-- slashbal.icn
#    981 -rw-r--r-- README
#    659 -rw-r--r-- Makefile.dist
#
if test -r _shar_seq_.tmp; then
  echo 'Must unpack archives in sequence!'
  echo Please unpack part `cat _shar_seq_.tmp` next
  exit 1
fi
# ============= stripsgml.icn ==============
if test -f 'stripsgml.icn' -a X"$1" != X"-c"; then
  echo 'x - skipping stripsgml.icn (File already exists)'
  rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting stripsgml.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'stripsgml.icn' &&
X############################################################################
X#
X#  Name:     stripsgml.icn
X#
X#  Title:    Strip (or translate) simple SGML tags from a file
X#
X#  Author:   Richard L. Goerwitz
X#
X#  Version:  1.7
X#
X############################################################################
X#
X#  This program, stripsgml, may be used either to strip SGML tags
X#  from a file, or to translate them into some other format (or perhaps
X#  some combination of the two).  Note that it only handles very
X#  simple SGML codes, either stripping or translating set strings.
X#  This is a VERY simple program, merely intended to satisfy a need
X#  many have expressed for being able to remove, or perform simple
X#  manipulations on, files containing <>-style tags.
X#
X#  In its basic mode, you would simply have stripsgml read the
X#  standard input (an SGML-marked file).  Stripsgml would then write
X#  an SGML-free text on the standard output.  Used in this way,
X#  stripsgml is just a simple stripping program.
X#
X#  If you want some or all of the SGML codes translated into another set
X#  of codes, simply create a file in which each line has 1) the name of
X#  the SGML code, and then 2) the way you want that code translated on
X#  both initialization and completion.  Put succinctly, the format is:
X#
X#      code    initialization    completion
X#
X#  A tab or colon separates the fields.  If you want to use a tab or colon
X#  as part of the text (and not as a separator), place a backslash before
X#  it.  The completion field is optional.  There is not currently any way
X#  of specifying a completion field without an initialization field.
X#
X#  In its translation mode, stripsgml is invoked with one argument (the
X#  name of the file containing the translation information).  As before,
X#  the standard input is expected to contain an SGML-encoded file:
X#
X#      stripsgml translation_file < SGML-file
X#
X#  An SGML-free text is written to the standard output.
X#
X#  Note that, if you are translating SGML code into font change or escape
X#  sequences, you may get unexpected results.  This isn't stripsgml's
X#  fault.  It's just a matter of how your terminal or WP operates.  Some
X#  need to be "reminded" at the beginning of each line what mode or font
X#  is being used.  Note also that stripsgml assumes < and > as delimiters.
X#  If you want to put a greater-than or less-than sign into your text,
X#  put a backslash before it.  This will effectively "escape" the special
X#  meaning of those symbols.  There is currently no way to change
X#  the default delimiters.
X#
X############################################################################
X#
X#  Links: slashbal.icn ./stripunb.icn ./readtbl.icn
X#
X############################################################################
X
X
Xprocedure main(a)
X
X    usage := "usage: stripsgml [map-file]"
X    *a > 1 & stop(usage)
X
X    map_file := open(a[1]) & t := readtbl(map_file)
X
X    every line := !&input do
X        write(stripunb('<','>',line,&null,&null,t))
X
X    # last_k is the stack used in stripunb.icn
X    if *\last_k ~= 0 then
X        stop("Unexpected EOF encountered. Expecting ", pop(last_k), ".")
X
Xend
SHAR_EOF
true || echo 'restore of stripsgml.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= stripunb.icn ==============
if test -f 'stripunb.icn' -a X"$1" != X"-c"; then
  echo 'x - skipping stripunb.icn (File already exists)'
  rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting stripunb.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'stripunb.icn' &&
X############################################################################
X#
X#  Name:     stripunb.icn
X#
X#  Title:    Strip unbalanced material
X#
X#  Author:   Richard L. Goerwitz
X#
X#  Version:  1.3
X#
X############################################################################
X#
X#  This routine strips material from a line which is unbalanced with
X#  respect to the characters defined in arguments 1 and 2 (unbalanced
X#  being defined as bal() defines it, except that characters preceded
X#  by a backslash are counted as regular characters, and are not taken
X#  into account by the balancing algorithm).
X#
X#  One little bit of weirdness I added in is a table argument.  Put
X#  simply, if you call stripunb() as follows,
X#
X#      stripunb('<','>',s,&null,&null,t)
X#
X#  and if t is a table having the form,
X#
X#      key:  "bold"        value: outstr("\e[2m", "\e[1m")
X#      key:  "underline"   value: outstr("\e[4m", "\e[1m")
X#      etc.
X#
X#  then every instance of "<bold>" in string s will be mapped to
X#  "\e[2m," and every instance of "</bold>" will be mapped to "\e[1m."
X#  Values in table t must be records of type outstr(on, off).  When
X#  "</>" is encountered, stripunb will output the .off value for the
X#  preceding .on string encountered.
X#
X############################################################################
X#
X#  Links: slashbal.icn
X#
X############################################################################
X
Xglobal last_k
Xrecord outstr(on, off)
X
X
Xprocedure stripunb(c1,c2,s,i,j,t)
X
X    # NB: Stripunb() returns a string - not an integer (like find,
X    # upto).
X
X    local lookinfor, bothcs, s2, k
X    #global last_k
X    initial last_k := list()
X
X    /c1 := '<'
X    /c2 := '>'
X    bothcs := c1 ++ c2
X    lookinfor := c1 ++ '\\'
X    c := &cset -- c1 -- c2
X
X    /s := \&subject | stop("stripunb: No string argument.")
X    /i := \&pos | 1
X    /j := *s + 1
X
X    s2 := ""
X    s ? {
X        tab(i) | fail
X        while s2 ||:= tab(upto(lookinfor)) do {
X            if ="\\" & any(bothcs) then {
X                &pos+1 > j & (return s2)
X                s2 ||:= move(1)
X                next
X            }
X            else {
X                &pos > j & (return s2)
X                any(c1) |
X                    stop("stripunb: Unbalanced string, pos(",&pos,").\n",s)
X                k := tab(slashbal(c,c1,c2,&null,&null,&null,1)) | tab(0)
X                if \t then {
X                    k ?:= 2(="<", tab(find(">")), =">", pos(0))
X                    if k ?:= (="/", tab(0)) then {
X                        compl := pop(last_k) | stop("Unclosed <>, ",&subject)
X                        if k == ""
X                        then k := compl
X                        else k == compl | stop("Incorrectly paired <>, </>.")
X                        s2 ||:= \(\t[k]).off
X                    }
X                    else {
X                        s2 ||:= \(\t[k]).on
X                        push(last_k, k)
X                    }
X                }
X            }
X        }
X        s2 ||:= tab(0)
X    }
X
X    return s2
X
Xend
SHAR_EOF
true || echo 'restore of stripunb.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= readtbl.icn ==============
if test -f 'readtbl.icn' -a X"$1" != X"-c"; then
  echo 'x - skipping readtbl.icn (File already exists)'
  rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting readtbl.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'readtbl.icn' &&
X############################################################################
X#
X#  Name:     readtbl.icn
X#
X#  Title:    Read user-created stripsgml table
X#
X#  Author:   Richard L. Goerwitz
X#
X#  Version:  1.1
X#
X############################################################################
X#
X#  This file is part of the stripsgml package.  It does the job of
X#  reading optional user-created mapping information from a file.  The
X#  purpose of this file is to specify how each code in a given input
X#  text should be translated.  Each line has the form:
X#
X#      SGML-designator    start_code    end_code
X#
X#  where the SGML designator is something like "quote" (without the
X#  quotation marks), and the start and end codes are the way in which you
X#  want the beginning and end of a <quote>...<\quote> sequence to be
X#  translated.  Presumably, in this instance, your codes would indicate
X#  some set level of indentation, and perhaps a font change.  If you
X#  don't have an end code for a particular SGML designator, just leave
X#  it blank.
X#
X############################################################################
X#
X#  Links: stripsgml.icn
X#
X############################################################################
X
X
Xprocedure readtbl(f)
X
X    local t, line, k, on_sequence, off_sequence
X
X    /f & stop("readtbl: Arg must be a valid open file.")
X
X    t := table()
X
X    every line := trim(!f,'\t ') do {
X        line ? {
X            k := tabslashupto('\t:') &
X            tab(many('\t:')) &
X            on_sequence := tabslashupto('\t:') | tab(0)
X            tab(many('\t:'))
X            off_sequence := tab(0)
X        } | stop("readtbl: Bad map file format.")
X        insert(t, k, outstr(on_sequence, off_sequence))
X    }
X
X    return t
X
Xend
X
X
X
Xprocedure tabslashupto(c,s)
X
X    POS := &pos
X
X    while tab(upto('\\' ++ c)) do {
X        if ="\\" then {
X            move(1)
X            next
X        }
X        else {
X            if any(c) then {
X                suspend &subject[POS:.&pos]
X            }
X        }
X    }
X
X    &pos := POS
X    fail
X
Xend
SHAR_EOF
true || echo 'restore of readtbl.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= slashbal.icn ==============
if test -f 'slashbal.icn' -a X"$1" != X"-c"; then
  echo 'x - skipping slashbal.icn (File already exists)'
  rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting slashbal.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'slashbal.icn' &&
X############################################################################
X#
X#  Name:     slashbal.icn
X#
X#  Title:    Bal() with backslash escaping
X#
X#  Author:   Richard L. Goerwitz
X#
X#  Version:  1.4
X#
X############################################################################
X#
X#  I am often frustrated at bal()'s inability to deal elegantly with
X#  the common \backslash escaping convention (a way of telling Unix
X#  Bourne and C shells, for instance, not to interpret a given
X#  character as a "metacharacter").  I recognize that bal()'s generic
X#  behavior is a must, and so I wrote slashbal() to fill the gap.
X#
X#  Slashbal behaves like bal, except that it ignores, for purposes of
X#  balancing, any c2/c3 char which is preceded by a backslash.  Note
X#  that we are talking about internally represented backslashes, and
X#  not necessarily the backslashes used in Icon string literals.  If
X#  you have "\(" in your source code, the string produced will have no
X#  backslash.  To get this effect, you would need to write "\\(."
X#
X#  BUGS:  Note that, like bal() (v8), slashbal() cannot correctly
X#  handle cases where c2 and c3 intersect.
X#
X############################################################################
X#
X#  Links: none
X#
X############################################################################
X
Xprocedure slashbal(c1, c2, c3, s, i, j)
X
X    local twocs, allcs, chr2, count
X
X    /c1 := &cset
X    /c2 := '('
X    /c3 := ')'
X    twocs := c2 ++ c3
X    allcs := c1 ++ c2 ++ c3 ++ '\\'
X
X    /s := \&subject | stop("slashbal: No string argument.")
X    /i := \&pos | 1
X    /j := *s + 1
X
X    count := 0
X    s ? {
X        tab(i) | fail
X        while tab(upto(allcs)) do {
X            chr := move(1)
X            if chr == "\\" & any(twocs) then {
X                chr2 := move(1)
X                &pos > j & fail
X                if any(c1, chr) & count = 0 then
X                    suspend .&pos - 2
X                if any(c1, chr2) & count = 0 then
X                    suspend .&pos - 1
X            }
X            else {
X                &pos > j & fail
X                if any(c1, chr) & count = 0 then
X                    suspend .&pos - 1
X                if any(c2, chr) then
X                    count +:= 1
X                else if any(c3, chr) then
X                    count -:= 1
X            }
X        }
X    }
X
Xend
SHAR_EOF
true || echo 'restore of slashbal.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= README ==============
if test -f 'README' -a X"$1" != X"-c"; then
  echo 'x - skipping README (File already exists)'
  rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting README (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'README' &&
XRe:  stripsgml.icn & associated files
X
XThis program is documented in the various source files, most notably
Xstripsgml.icn.  Please look them over, even if you are not an Icon
Xprogrammer.
X
XIn order to compile this program, you will need an Icon interpreter
X(or compiler).  If you do not have it, get it.  It is free, and can
Xbe obtained via ftp from cs.arizona.edu.  If you do not have access
Xto the internet, drop a line to icon-project@arizona.edu, and
Xthey will fill you in on what to do.
X
XIf you are working on a Unix system, you can simply mv Makefile.dist
Xto Makefile, and then make.  Users on other systems will need to
Xtype:
X
X    icont -o stripsgml readtbl.icn slashbal.icn stripsgml.icn stripunb.icn
X
XAs I said above, see the file stripsgml.icn for more information on how
Xto use this program.  This program is not fancy, and handles only the
Xsimplest <>-style markup.  It is in no way an attempt to handle the full
Xmetalanguage!
X
X-Richard (goer@sophist.uchicago.edu)
SHAR_EOF
true || echo 'restore of README failed'
rm -f _shar_wnt_.tmp
fi
# ============= Makefile.dist ==============
if test -f 'Makefile.dist' -a X"$1" != X"-c"; then
  echo 'x - skipping Makefile.dist (File already exists)'
  rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting Makefile.dist (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'Makefile.dist' &&
XPROGNAME = stripsgml
X
X# Please edit these to reflect your local file structure & conventions.
XDESTDIR = /usr/local/bin
XOWNER = bin
XGROUP = bin
X
XSRC = $(PROGNAME).icn stripunb.icn readtbl.icn slashbal.icn
X
X$(PROGNAME): $(SRC)
X	icont -o $(PROGNAME) $(SRC)
X
X# Pessimistic assumptions regarding the environment (in particular,
X# I don't assume you have the BSD "install" shell script).
Xinstall: $(PROGNAME)
X	@sh -c "test -d $(DESTDIR) || (mkdir $(DESTDIR) && chmod 755 $(DESTDIR))"
X	cp $(PROGNAME) $(DESTDIR)/
X	chgrp $(GROUP) $(DESTDIR)/$(PROGNAME)
X	chown $(OWNER) $(DESTDIR)/$(PROGNAME)
X	@echo "\nInstallation done.\n"
X
Xclean:
X	-rm -f *~ .u?
X	-rm -f $(PROGNAME)
SHAR_EOF
true || echo 'restore of Makefile.dist failed'
rm -f _shar_wnt_.tmp
fi
exit 0
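
For anyone who wants to try the translation mode described in the header of stripsgml.icn, here is a rough sketch of a map file and the two ways of invoking the program. It is only an illustration: the file names sgml.map and sample.sgml are made up, the asterisk and quotation-mark translations are arbitrary, and ./stripsgml is assumed to be the executable built from the archive above.

```shell
#!/bin/sh
# Sketch only: sgml.map and sample.sgml are hypothetical file names, and
# ./stripsgml is assumed to be the executable built from this archive.

# Map file: one code per line, in the form code:initialization:completion.
# The "quote" entry leaves off the optional completion field.
cat > sgml.map << 'MAP_EOF'
bold:*:*
quote:"
MAP_EOF

# Sample input: one paired tag and one backslash-escaped less-than sign.
cat > sample.sgml << 'SGML_EOF'
This is <bold>important</bold>, and 2 \< 3.
SGML_EOF

if test -x ./stripsgml; then
    ./stripsgml < sample.sgml            # basic mode: tags are stripped
    ./stripsgml sgml.map < sample.sgml   # translation mode: tags are mapped
else
    echo 'stripsgml not built yet; see the README in the archive' >&2
fi
```

Colons are used as the field separator here because they are easier to see than tabs; a tab should work the same way according to the header comments.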