perl from procmail
NEW: I wrote a perl
script which is MIME-aware. Note that there are lots of other
scripts out there, written by other people, which do more
or less the same sort of thing.
It is possible to "bug" an e-mail without relying on the so-called "return receipt" features built into the e-mail protocols. All it takes is a (valid) assumption that the recipient uses an HTML-aware e-mail client with an always-on internet connection.
This one goes beyond the MIME parts of messages,
and actually rewrites the contents of the message. This rule assumes
that the only message part which would contain portions worth
rewriting would be text/html MIME types. The
rule doesn't check for the message type; it just blithely rewrites
the contents of any message part containing match strings of interest.
What it does, presuming that the string of interest is contained
in a text/html part (otherwise the results are undefined),
is to rewrite all self-loading graphics tags (which would tip
off the sender that you opened the message) so that they are hyperlinks you need to
click on. At least, that's the idea.
perl is not exactly resource-lite. I would imagine
that most ISPs would balk at your invoking perl from a procmail
script. (Actually, this is turning out not to be
the case. -- FWM, 03-Jun-2002)
Given the following input,
<HTML>
<HEAD>
<TITLE>Test</TITLE>
</HEAD>
<BODY>
<P>This body has an image embedded in it:
<IMG SRC="quizzical.jpg">
</BODY>
</HTML>
Joe's script looks like this:
s/\<((.+)?)IMG([ ]+)SRC.([^>]+)\>/\<removed image\>/gi
and produces this output:
<HTML>
<HEAD>
<TITLE>Test</TITLE>
</HEAD>
<BODY>
<P>This body has an image embedded in it:
<removed image>
</BODY>
</HTML>
Never content to leave well-enough alone, this is the script I actually have in my .procmailrc (at least until my indulgent but put-upon ISP complains about the system load):
s/\<((.+)?)IMG([ ]+)SRC([^"]+)"([^"]+)[^\>]+>/\<A HREF="\5"\>removed: \5\<\/A\>/gi
which produces this output:
<HTML>
<HEAD>
<TITLE>Test</TITLE>
</HEAD>
<BODY>
<P>This body has an image embedded in it:
<A HREF="quizzical.jpg">removed: quizzical.jpg</A>
</BODY>
</HTML>
I like this better, because the URL of the suspect tag is now displayed in-line as a hyperlink, and you can click on it if you still wish to view the original tag. Burley!
The actual rule in the .procmailrc looks like this:
:0 B h b
*IMG.*SRC
{
:0 f b w
|/usr/bin/perl -p -e 's/\<((.+)?)IMG([ ]+)SRC([^"]+)"([^"]+)[^\>]+>/\<A HREF="\5"\>removed: \5\<\/A\>/gi'
}
Spy vs Spy: Many years ago,
Mad Magazine ran a cartoon strip called Spy
vs Spy; the theme was measures and countermeasures. Indeed,
there is an "arms race" of sorts with regard to this
sort of filtering. For instance, the above rule assumes that the
SRC attribute directly follows the IMG
keyword in the tag, which ain't necessarily so: it's good practice
to put the SRC attribute first, before such things
as WIDTH, ALT, etc. keywords, but since
when have spammers engaged in good practices? Furthermore, an
increasing number of text/html parts are coming through
encoded as quoted-printable (and even base64), which evades the
search-and-replace string.
I still think it's about time for a processing agent which understands something of the logical structure of e-mails!
By the way, it's really handy to be
able to run perl on the command line to test your rules. Presuming
you have a file named test.html which you want to
test your rules on, the command line (and output) might look like
this:
inwa.net:~/test$ cat test.html | /usr/bin/perl -p -e 's/\<((.+)?)IMG([ ]+)SRC([^"]+)"([^"]+)[^\>]+>/\<A HREF="\5"\>removed: \5\<\/A\>/gi'
<HTML>
<HEAD>
<TITLE>Test</TITLE>
</HEAD>
<BODY>
<P>This body has an image embedded in it:
<A HREF="quizzical.jpg">removed: quizzical.jpg</A>
</BODY>
</HTML>