Webpage Content Analyser

A tool to analyse the use of words on a set of webpages and return a count of common words and phrases. It was built with the intention of generating information to manipulate weighted phrase lists for dansguardian, but it certainly has other uses and can be extended.

Version 0.1 produces a list of words and phrases, each with a count of the number of times it appears across the whole set of pages. Coded using Java’s HashMaps, it’s quite quick, although the downloading will take a bit of time if you are using a set of over 100 sites. I’ve tested WPA with a list of over 13,000, so it should meet end users’ needs!
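
To give a feel for the HashMap approach, here is a minimal sketch of counting words and short phrases in a piece of text. It is not the code from the jar: the class name PhraseCounter, the count method and the maxWords limit are all made up for illustration, and the way WPA actually splits text into phrases may well differ.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the class used inside WPA.jar.
class PhraseCounter {

    /** Counts every word and every run of up to maxWords consecutive words. */
    static Map<String, Integer> count(String text, int maxWords) {
        Map<String, Integer> counts = new HashMap<>();
        // Lower-case, then split on anything that is not a letter, digit or apostrophe.
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}']+");
        for (int start = 0; start < words.length; start++) {
            // A leading separator produces one empty first element; skip it.
            if (words[start].isEmpty()) {
                continue;
            }
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= maxWords && start + len <= words.length; len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(words[start + len - 1]);
                // merge() inserts 1 for a new key, otherwise adds 1 to the old count.
                counts.merge(phrase.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("the cat sat on the mat, the cat slept", 2);
        counts.forEach((phrase, n) -> System.out.println(n + "\t" + phrase));
    }
}
```

Note that this sketch lets phrases run across punctuation (so "mat the" is counted as a phrase), which a real tool might want to avoid. HashMap lookups and updates are constant time on average, which is why the counting itself stays quick even when the downloading does not.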

Version 0.2 will have threading support, so a page can be analysed whilst content is being downloaded in the background. It will also have the option of formatting the output in dansguardian weighted phrase list format, so you can copy and paste the output.

To use WPA do the following:

java -jar WPA.jar site1 site2 site3 …

You can also use it to see which words you should be banning if you have a set of bad pages you don’t want your users to view. Your blocked domains list is a good input for this, as it helps you prevent access to similar content:

cat /etc/dansguardian/lists/global-block/domains | java -jar WPA.jar

All you need to do then is decide what phrases to add to your weighted phrase lists!
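
The two ways of calling WPA shown above (sites as arguments, or a list piped in) could be handled along these lines. This is only a sketch under my own assumptions: the SiteList class and its sites method are hypothetical names, and skipping blank lines and lines starting with # is a guess at what is sensible for a domains file, not a description of what the jar does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the class used inside WPA.jar.
class SiteList {

    /** Returns the sites from the arguments, or one per line from the reader if there are none. */
    static List<String> sites(String[] args, Reader input) throws IOException {
        if (args.length > 0) {
            return new ArrayList<>(Arrays.asList(args));
        }
        List<String> sites = new ArrayList<>();
        BufferedReader in = new BufferedReader(input);
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            // Assumption: blank lines and # comment lines are not sites.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            sites.add(line);
        }
        return sites;
    }

    public static void main(String[] args) throws IOException {
        for (String site : sites(args, new InputStreamReader(System.in))) {
            System.out.println(site);
        }
    }
}
```

Taking a Reader rather than reading System.in directly keeps the method easy to try out with a plain string.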

Of course, the beauty of this program is that it’s cross-platform thanks to Java, so you’ll be able to run it on Linux, Windows or Mac, as long as you have the Java runtime installed for your operating system. If you want the source code, it’s all bundled up in the jar.

To get hold of the jar, click here.
