perl from procmail


NEW: I wrote a perl script which is MIME-aware. Note that plenty of other people have written scripts which do more or less the same sort of thing.

 

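It is possible to "bug" an e-mail without relying on the so-called "return receipt" features built into the e-mail protocols. All it takes is a (valid) assumption that the recipient uses an HTML-aware e-mail client which has an always-on internet connection: when the client fetches an image referenced in the message, the request itself tells the sender that the message was opened.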

This rule goes beyond the MIME parts of messages and actually rewrites the contents of the message. It assumes that the only message parts containing anything worth rewriting are text/html parts. The rule doesn't check the part's type, though; it just blithely rewrites the contents of any message part containing the match strings of interest.

What it does, presuming that the string of interest is contained in a text/html part (otherwise the results are undefined), is rewrite all self-loading graphics tags (which would tip off the sender that you opened the message) so that they become hyperlinks you need to click on. At least, that's the idea.

perl is not exactly resource-lite. I would imagine that most ISPs would balk at your invoking perl from a procmail script. (Actually, this is turning out not to be the case. -- FWM, 03-Jun-2002)

Given the following input,

<HTML>
<HEAD>
<TITLE>Test</TITLE>
</HEAD>
<BODY>
<P>This body has an image embedded in it:
<IMG SRC="quizzical.jpg">
</BODY>
</HTML>

Joe's script looks like this:

s/\<((.+)?)IMG([ ]+)SRC.([^>]+)\>/\<removed image\>/gi

and produces this output:

<HTML>
<HEAD>
<TITLE>Test</TITLE>
</HEAD>
<BODY>
<P>This body has an image embedded in it:
<removed image>
</BODY>
</HTML>

Never content to leave well enough alone, I actually have this script in my .procmailrc (at least until my indulgent but put-upon ISP complains about the system load):

s/\<((.+)?)IMG([ ]+)SRC([^"]+)"([^"]+)[^\>]+>/\<A HREF="\5"\>removed: \5\<\/A\>/gi

which produces this output:

<HTML>
<HEAD>
<TITLE>Test</TITLE>
</HEAD>
<BODY>
<P>This body has an image embedded in it:
<A HREF="quizzical.jpg">removed: quizzical.jpg</A>
</BODY>
</HTML>

I like this better, because the URL of the suspect tag is now displayed in-line as a hyperlink, and you can click on it if you still wish to view the original tag. Burley!

The actual rule in the .procmailrc looks like this:

:0 B h b
*IMG.*SRC
{   
    :0 f b w
    |/usr/bin/perl -p -e 's/\<((.+)?)IMG([ ]+)SRC([^"]+)"([^"]+)[^\>]+>/\<A HREF="\5"\>removed: \5\<\/A\>/gi'
}

Spy vs Spy: Many years ago Mad Magazine ran a cartoon strip called Spy vs Spy; the theme was measures and countermeasures. There is a similar "arms race" with regard to this sort of filtering. For instance, the above rule assumes that the SRC attribute directly follows the IMG keyword in the tag, which ain't necessarily so: it's good practice to put the SRC attribute first, before attributes such as WIDTH and ALT, but since when have spammers engaged in good practices? Furthermore, an increasing number of text/html parts are coming through encoded as quoted-printable (and even base64), which evades the search-and-replace string.

I still think it's about time for a processing agent which understands something of the logical structure of e-mails!

 

By the way, it's really handy to be able to run perl on the command line to test your rules. Presuming you have a file named test.html which you want to test your rules on, the command line (and output) might look like this:

inwa.net:~/test$ cat test.html | /usr/bin/perl -p -e 's/\<((.+)?)IMG([ ]+)SRC([^"]+)"([^"]+)[^\>]+>/\<A HREF="\5"\>removed: \5\<\/A\>/gi'
<HTML>
<HEAD>
<TITLE>Test</TITLE>
</HEAD>
<BODY>
<P>This body has an image embedded in it:
<A HREF="quizzical.jpg">removed: quizzical.jpg</A>
</BODY>
</HTML>


rev. date: 03-Jun-2002
rev. by: Fred Morris