Webpage Content Analyser

At work we use Dansguardian to provide content-based filtering for our users. In a few instances we found that Dansguardian’s default phraselists were blocking content that we wanted our users to have access to, so we needed to modify them.

We had a problem though: we didn’t know which words were common across a set of pages, so we had nothing to base our phraselist changes on. So I wrote a tool, WPA, that takes a webpage, strips all the formatting and gives a list of the most common phrases and words.
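To give a feel for the idea, here is a minimal sketch of the strip-and-count approach in Java. It is not the code inside WPA.jar: the class name PhraseCountSketch and all of its methods are made up for illustration, the markup stripping is a crude regex rather than a real HTML parser, and a "phrase" is assumed to mean a run of consecutive words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the source bundled in WPA.jar.
final class PhraseCountSketch {

    // Crude markup removal: drops script/style bodies, then any remaining tags.
    static String stripMarkup(String html) {
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        return noScripts.replaceAll("(?s)<[^>]*>", " ");
    }

    // Lower-cases the text and splits it into runs of letters.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Counts every run of `size` consecutive words (size 1 = single words).
    static Map<String, Integer> countPhrases(List<String> words, int size) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + size <= words.size(); i++) {
            String phrase = String.join(" ", words.subList(i, i + size));
            counts.merge(phrase, 1, Integer::sum);
        }
        return counts;
    }

    // Returns up to `limit` entries, most frequent first, ties broken alphabetically.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    public static void main(String[] args) {
        // A hard-coded page keeps the sketch runnable without network access.
        String html = "<html><body><h1>Free poker</h1>"
                + "<p>Play free poker online. Poker chips!</p></body></html>";
        List<String> words = tokenize(stripMarkup(html));
        System.out.println("Words:   " + top(countPhrases(words, 1), 3));
        System.out.println("Phrases: " + top(countPhrases(words, 2), 3));
    }
}
```

Counting two-word runs as well as single words matters here, because a phrase such as "free poker" is often a much better filter candidate than either word on its own.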

To use WPA do the following:

java -jar WPA.jar site1 site2 site3 …

You can also use it to see what words you should be banning if you have a set of bad pages you don’t want your users viewing. Your blocked domains list is a good input for this, since the results help you prevent access to similar content:

cat /etc/dansguardian/lists/global-block/domains | java -jar WPA.jar
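For the curious, a tool can accept its sites either as arguments or piped in, as in the two commands above, along these lines. This is only a sketch under assumptions: TargetListSketch and collectTargets are hypothetical names, and skipping blank lines and lines starting with # is my guess at sensible handling of a list file, not a description of what WPA actually does.

```java
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical illustration only; not the source bundled in WPA.jar.
final class TargetListSketch {

    // Uses the command-line arguments when there are any;
    // otherwise reads one target per line from the fallback input.
    static List<String> collectTargets(String[] args, Readable fallback) {
        if (args.length > 0) {
            return new ArrayList<>(Arrays.asList(args));
        }
        List<String> targets = new ArrayList<>();
        Scanner scanner = new Scanner(fallback);
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine().trim();
            // Assumption: blank lines and # comment lines are not targets.
            if (!line.isEmpty() && !line.startsWith("#")) {
                targets.add(line);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        for (String target : collectTargets(args, new InputStreamReader(System.in))) {
            System.out.println(target);
        }
    }
}
```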

All you need to do then is decide what phrases to add to your weighted phrase lists!

Of course the beauty of this program is that it’s cross platform thanks to Java, so you’ll be able to run this on Linux, Windows or Mac. If you want the source code for this then it’s all bundled up in the jar.

To get hold of the jar Click Here.
