I work in a school, and part of my job entails considering eSafety, as the bulk of users are children under 18. This usually takes the form of content filtering: blocking not only sites with inappropriate content but also those that fall under the "timewasting" category.
If you're familiar with parental control applications like NetNanny or CYBERsitter, you may already know that one way to circumvent a filter is to use a proxy or anonymizer... that is, until the one you're using is blocked. They spring up all the time, so blocking them is very much a game of cat and mouse.
I have been adding to my list of blocked anonymizers for quite a while now, because community-maintained blocklists don't always keep up with up-to-the-minute anonymizers. Recently I noticed that the bulk of the anonymizers I was blocking all used the same server software/script and website layout, just at different URLs and IPs. The domain names seem to follow a trend as well: moonpirate.info, swordfruit.com, goodchicken.info, bowlingcat.com, glueflower.com, plus many more. I decided to look into it further.
As with any decent filtering software, DansGuardian allows you to configure string matching within website content in addition to the standard URL blocklists. There are plenty of phrases present on the page that could be used to detect and block the page if found, for example "This is a site for bypassing internet censorship." I decided to add this to DansGuardian's configuration but it didn't appear to detect the string. The first thing I did was check the source code for the anonymizer in question - this is what I saw:
The author of the page has clearly thought about how to prevent the page from being detected by content filters by using JavaScript to output the entire HTML code for the page "on the fly" only two or three characters at a time, whilst spacing out the text with a load of blank variables (i.e. no visible difference to the user). The page the browser renders is the same, but because the content filter only looks at the code within the page it's filtering, breaking up the text into pieces is enough to thwart string filtering. If you would like to look at the code for yourself, here are the original and clean HTML versions (I removed all the JavaScript and kept the code in case I could find any information to trace the script back to an author - there doesn't appear to be anything indicative of this apart from a Google Analytics site ID).
Not only that, but looking at other anonymizers using the same script, the blocks of text are (seemingly) random lengths of two or three characters, in a random order. So for example, the word "filter" could be written as any of these on a number of different sites:
"fi" + "lt" + "er"
"fi" + "lte" + "r"
"fil" + "te" + "r"
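To make the effect concrete, here is a minimal Python sketch of why a literal phrase match misses the word. The `obfuscated` string is a made-up fragment for illustration, not copied from a real anonymizer, and `naive_match` and `join_fragments` are hypothetical helper names; a real content filter is considerably more involved.

```python
import re

def naive_match(source: str, phrase: str) -> bool:
    """Literal substring check against the raw page source."""
    return phrase in source

def join_fragments(source: str) -> str:
    """Concatenate every double-quoted string literal in the source, in order."""
    return "".join(re.findall(r'"([^"]*)"', source))

# Hypothetical fragment of an obfuscated page (illustration only).
obfuscated = 'document.write("fi" + "lt" + "er");'

# The raw source never contains the word as one contiguous string...
print(naive_match(obfuscated, "filter"))                  # False
# ...but it is there once the fragments are joined, as the browser would do.
print(naive_match(join_fragments(obfuscated), "filter"))  # True
```

The browser effectively performs the join step by running the JavaScript; the filter, which only sees the raw source, does not.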
So all I needed to do here was pick a word on the page and create filter rules for each way it can be expressed. Any single word should be alright, because it's safe to say that no website that isn't trying to circumvent filters would output text in this way.
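Writing the combinations out by hand is tedious, so here is a rough Python sketch that enumerates them. It assumes every chunk is exactly two or three characters and only covers chunks that start and end on the word itself; chunks that straddle a neighbouring word or space (such as ' b' or 'g I') would still need adding by hand. The function names `splits` and `to_rule` are my own for illustration.

```python
from typing import Iterator, List

def splits(word: str, sizes=(2, 3)) -> Iterator[List[str]]:
    """Yield every way to cut `word` into consecutive chunks whose lengths are in `sizes`."""
    if not word:
        yield []
        return
    for size in sizes:
        if size <= len(word):
            head, tail = word[:size], word[size:]
            for rest in splits(tail, sizes):
                yield [head] + rest

def to_rule(chunks: List[str]) -> str:
    """Format chunks as one comma-separated rule line, each chunk wrapped as <+ '...' +>."""
    return ",".join(f"<+ '{chunk}' +>" for chunk in chunks)

# Print one rule line per possible split of the chosen word.
for chunks in splits("bypassing"):
    print(to_rule(chunks))
```

For "bypassing" this yields five lines, one for each way of covering nine characters with chunks of two and three.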
Here is some code to place in DansGuardian's banned phrase list, to detect the word "bypassing". Each line is treated individually, but all comma-separated values on any given line must be present for that rule to pass as 'true':
<+ ' b' +>,<+ 'yp' +>,<+ 'as' +>,<+ 'si' +>,<+ 'ng' +> <+ ' b' +>,<+ 'yp' +>,<+ 'as' +>,<+ 'si' +>,<+ 'ng ' +> <+ ' b' +>,<+ 'yp' +>,<+ 'as' +>,<+ 'sin' +>,<+ 'g ' +> <+ ' b' +>,<+ 'yp' +>,<+ 'as' +>,<+ 'sin' +>,<+ 'g I' +> <+ ' b' +>,<+ 'yp' +>,<+ 'ass' +>,<+ 'in' +>,<+ 'g ' +> <+ ' b' +>,<+ 'yp' +>,<+ 'ass' +>,<+ 'in' +>,<+ 'g I' +> <+ ' b' +>,<+ 'ypa' +>,<+ 'ss' +>,<+ 'in' +>,<+ 'g '+> <+ ' b' +>,<+ 'ypa' +>,<+ 'ss' +>,<+ 'in' +>,<+ 'g I'+> <+ ' b' +>,<+ 'ypa' +>,<+ 'ss' +>,<+ 'ing' +> <+ ' b' +>,<+ 'ypa' +>,<+ 'ssi' +>,<+ 'ng' +> <+ ' b' +>,<+ 'ypa' +>,<+ 'ssi' +>,<+ 'ng ' +> <+ 'by' +>,<+ 'pa' +>,<+ 'ss' +>,<+ 'ing' +> <+ 'by' +>,<+ 'pa' +>,<+ 'ssi' +>,<+ 'ng' +> <+ 'by' +>,<+ 'pa' +>,<+ 'ssi' +>,<+ 'ng ' +> <+ 'by' +>,<+ 'pas' +>,<+ 'si' +>,<+ 'ng' +> <+ 'by' +>,<+ 'pas' +>,<+ 'si' +>,<+ 'ng ' +> <+ 'by' +>,<+ 'pas' +>,<+ 'sin' +>,<+ 'g ' +> <+ 'by' +>,<+ 'pas' +>,<+ 'sin' +>,<+ 'g I' +> <+ ' by' +>,<+ 'pa' +>,<+ 'ss' +>,<+ 'ing' +> <+ ' by' +>,<+ 'pa' +>,<+ 'ssi' +>,<+ 'ng' +> <+ ' by' +>,<+ 'pa' +>,<+ 'ssi' +>,<+ 'ng ' +> <+ ' by' +>,<+ 'pas' +>,<+ 'si' +>,<+ 'ng' +> <+ ' by' +>,<+ 'pas' +>,<+ 'si' +>,<+ 'ng ' +> <+ ' by' +>,<+ 'pas' +>,<+ 'sin' +>,<+ 'g ' +> <+ ' by' +>,<+ 'pas' +>,<+ 'sin' +>,<+ 'g I' +> <+ 'byp' +>,<+ 'as' +>,<+ 'si' +>,<+ 'ng' +> <+ 'byp' +>,<+ 'as' +>,<+ 'si' +>,<+ 'ng ' +> <+ 'byp' +>,<+ 'as' +>,<+ 'sin' +>,<+ 'g ' +> <+ 'byp' +>,<+ 'as' +>,<+ 'sin' +>,<+ 'g I' +> <+ 'byp' +>,<+ 'ass' +>,<+ 'in' +>,<+ 'g ' +> <+ 'byp' +>,<+ 'ass' +>,<+ 'in' +>,<+ 'g I' +> <+ 'byp' +>,<+ 'ass' +>,<+ 'ing' +> <+ 'r b' +>,<+ 'yp' +>,<+ 'as' +>,<+ 'si' +>,<+ 'ng' +> <+ 'r b' +>,<+ 'yp' +>,<+ 'as' +>,<+ 'si' +>,<+ 'ng ' +> <+ 'r b' +>,<+ 'yp' +>,<+ 'as' +>,<+ 'sin' +>,<+ 'g ' +> <+ 'r b' +>,<+ 'yp' +>,<+ 'as' +>,<+ 'sin' +>,<+ 'g I' +> <+ 'r b' +>,<+ 'yp' +>,<+ 'ass' +>,<+ 'in' +>,<+ 'g ' +> <+ 'r b' +>,<+ 'yp' +>,<+ 'ass' +>,<+ 'in' +>,<+ 'g I' +> <+ 'r b' +>,<+ 'ypa' +>,<+ 'ss' +>,<+ 'in' +>,<+ 'g '+> <+ 'r b' +>,<+ 'ypa' +>,<+ 'ss' +>,<+ 'in' +>,<+ 'g 
I'+> <+ 'r b' +>,<+ 'ypa' +>,<+ 'ss' +>,<+ 'ing' +> <+ 'r b' +>,<+ 'ypa' +>,<+ 'ssi' +>,<+ 'ng' +> <+ 'r b' +>,<+ 'ypa' +>,<+ 'ssi' +>,<+ 'ng ' +>
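If the matching logic isn't obvious, this short Python sketch models the behaviour as I've described it: a page is blocked when, for any one rule line, every phrase on that line appears in the page source. It is a simplified model and not DansGuardian's actual implementation, which may also preprocess the page (for example by normalizing case or whitespace). The sample page sources and the function names `parse_rule` and `page_is_blocked` are made up for illustration.

```python
import re
from typing import List

def parse_rule(line: str) -> List[str]:
    """Extract the phrases between < and > on one rule line."""
    return re.findall(r"<([^<>]*)>", line)

def page_is_blocked(page_source: str, rule_lines: List[str]) -> bool:
    """Block if, for any one rule line, every phrase on that line occurs in the page."""
    for line in rule_lines:
        phrases = parse_rule(line)
        if phrases and all(phrase in page_source for phrase in phrases):
            return True
    return False

rules = [
    "<+ 'byp' +>,<+ 'ass' +>,<+ 'ing' +>",
    "<+ 'by' +>,<+ 'pa' +>,<+ 'ssi' +>,<+ 'ng' +>",
]

# Hypothetical page sources, written for illustration only.
obfuscated = "document.write('for ' + 'byp' + 'ass' + 'ing' + ' in');"
ordinary = "<p>Tips for bypassing traffic jams on the ring road.</p>"

print(page_is_blocked(obfuscated, rules))  # True
print(page_is_blocked(ordinary, rules))    # False
```

The second page contains the word "bypassing" in plain text but none of the quoted, plus-delimited fragments, which is why these rules are unlikely to catch legitimate sites.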
Now when I try to access an anonymizer that I hadn't previously blocked, I am presented with the DansGuardian block page. One week after inserting these rules, I've not needed to block any more anonymizers... until the script is modified, of course.