[mdlug] List archive and long lines

David Favro mdlug at meta-dynamic.com
Thu Jan 11 07:01:19 EST 2007


Raymond McLaughlin wrote:
> Below is the code I used to process the archive. It relies on a
> semaphore, :+:+ as a stand in for blank spaces so that whole lines will
> be treated as single arguments. It will mess up any posts, ascii
> graphics for example, that may contain the string :+:+. If there's a
> better way I'd like to learn it.
>   
Perhaps you missed my post here?
http://mdlug.org/pipermail/mdlug/2007-January/000548.html

I think your script is still collapsing whitespace that is internal to
the line.  I wrote a much smaller, simpler gawk script (posted
previously) that (I think) does the same as your script but avoids
collapsing internal whitespace (for the most part).  ASCII art is
sometimes important! :-)

Mine is just a filter, input -> output: I didn't include the stuff to
iterate through each file and replace it, but this is trivial, I can add
it if you like.

I also only modified the HTML between the <PRE> and </PRE> tags because
I think that going all the way to the <!-- end article --> line will
distort how the page currently looks in pipermail in more ways than
just the line-wrapping.
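
A rough sketch of limiting a change to the <PRE> block, in plain awk.
This is not the gawk script I posted; limit_to_pre is a made-up name,
the upper-casing is only a placeholder for the real change, and it
assumes <PRE> and </PRE> each sit on a line of their own:

```shell
# limit_to_pre: change only the lines between <PRE> and </PRE>.
# Assumption: each tag is on its own line.
limit_to_pre() {
    awk '
        /<\/PRE>/ { in_pre = 0 }          # closing tag ends the region
        in_pre    { $0 = toupper($0) }    # placeholder change, whitespace untouched
        { print }                         # every line is passed through
        /<PRE>/   { in_pre = 1 }          # region starts on the line after the tag
    '
}

printf 'head\n<PRE>\nbody  text\n</PRE>\nfoot\n' | limit_to_pre
```

Assigning to $0 as a whole keeps the internal spacing of the line;
rebuilding a line from its fields would collapse it.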

Otherwise, your script should work, but it seems overly complicated and
obscure to me -- maybe I'm missing something, but I think it can be
replaced by a functionally equivalent script that is much smaller and
simpler.

For example, "$(awk '{ print }' $i | XXX)" seems to me to be the same
as "$(XXX < $i)".
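
A quick sketch of that equivalence.  Here tr stands in for XXX, and
the sample file and its contents are made up for the illustration:

```shell
# Sample input; the file and its contents are made up for this sketch.
sample=$(mktemp)
printf 'first  line\nsecond\tline\n' > "$sample"

# awk '{ print }' only copies the file into the pipe...
via_awk=$(awk '{ print }' "$sample" | tr 'a-z' 'A-Z')
# ...so redirecting the file to the command's stdin gives the same result.
via_redirect=$(tr 'a-z' 'A-Z' < "$sample")

[ "$via_awk" = "$via_redirect" ] && echo "same output"
rm -f "$sample"
```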

Actually, I think your script might have problems with very large
articles.  You are putting the whole article on one line for the shell
to interpret.  There used to be very short limits on command-line
length; I don't know whether they are larger now or completely gone.
You can redirect input to an entire loop in bash (after the 'done'),
but I think a better way is just to redirect input to sed from the file
itself, rather than using "echo".  This also solves the word-splitting
problem you're having.  It will also run *way* faster, because right
now you're firing up a separate 'sed' for each line and appending its
output to the output file.  But better yet, just run my gawk script on
each file.
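
Roughly what I mean by redirecting the file into sed, as a sketch.
The sed expression (stripping trailing blanks) is only a placeholder
for whatever your script really applies, and the file names are made
up:

```shell
# File names and the sed expression are placeholders for this sketch.
in_file=$(mktemp)
out_file=$(mktemp)
printf 'keep   internal   spaces   \n  leading indent\t\n' > "$in_file"

# One sed process reads the whole file on stdin: no word splitting,
# no huge command line, and no new process per line.
sed 's/[[:space:]]*$//' < "$in_file" > "$out_file"

result=$(cat "$out_file")
printf '%s\n' "$result"
rm -f "$in_file" "$out_file"
```

The spacing inside each line comes through untouched because the text
never passes through the shell's word splitting.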

Regarding the way to get bash to treat strings with spaces as single
arguments, use double quotes: you can use "${parm}" or "${@}" or put a
list of space-containing strings into an array and use "${parm[@]}",
whichever is appropriate for what you're trying to do.

Example:
<CODE> (oh, yeah, I'm not allowed, darn!)

    # IFS= keeps leading/trailing whitespace; -r keeps backslashes literal
    while IFS= read -r input_line
       do
       whatever "${input_line}"
       done < "file"

</CODE> (there I go again)
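
The array form works the same way.  A small bash sketch; count_args is
a made-up helper that just reports how many arguments it was given:

```shell
# count_args is a made-up helper: it prints the number of arguments it got.
count_args() { echo "$#"; }

# A bash array whose elements contain spaces.
parm=("first line with spaces" "  second   line  ")

quoted=$(count_args "${parm[@]}")    # each element stays one argument
unquoted=$(count_args ${parm[@]})    # elements get split into separate words

echo "quoted: $quoted, unquoted: $unquoted"
# prints "quoted: 2, unquoted: 6"
```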

Finally, I know this is way overkill, but just as a tip on scripting in
general, don't forget concurrency and reentrancy -- try mktemp(1) or
embed the shell's PID ($$) in temporary file names.  This script is fine
for the email archives, but you could lose valuable data if you
accidentally run two copies at the same time.
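
A sketch of the mktemp(1) approach.  The names my_filter and
filter_in_place are made up, and my_filter is only a stand-in for the
real filter:

```shell
# my_filter is a hypothetical stand-in for the real filter.
my_filter() { sed 's/[[:space:]]*$//'; }

# Filter one file in place through a uniquely named temporary file,
# so two copies running at once never write to the same temp file.
filter_in_place() {
    target=$1
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if my_filter < "$target" > "$tmp"; then
        mv "$tmp" "$target"      # replace the original only on success
    else
        rm -f "$tmp"             # leave the original alone on failure
        return 1
    fi
}

demo=$(mktemp)
printf 'some text   \n' > "$demo"
filter_in_place "$demo"
```

Note that mktemp creates its file readable only by the owner, so after
the mv a web-served archive file may need a chmod.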

-- David



