Webpage Content Analyser v0.1.1

I’ve just updated the webpage content analyser, as the version I originally uploaded was a development version and didn’t actually work!

The new version still needs tweaking, but it gives you an idea of the kind of output it can produce. The invocation syntax and the download link are unchanged; both can be found on the web content analyser page in the right-hand menu.

Webpage Content Analyser

At work we use Dansguardian to provide content-based filtering for our users. In a few instances we found that Dansguardian’s default phraselists were blocking content that we wanted our users to have access to, so we needed to modify them.

We had a problem, though: we didn’t know which words were common across a set of pages, so we had nothing to base changes to our current phraselists on. So I wrote a tool, WPA, that takes a webpage, strips all the formatting and gives a list of the most common phrases and words.
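To give a feel for the idea, here is a minimal sketch of that kind of processing. It is not WPA’s actual source (that is in the jar); the class and method names are made up for illustration, the tag stripping is a crude regex rather than a real HTML parser, and HTML entities are left undecoded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the code shipped in WPA.jar.
public class PhraseCountSketch {

    // Crude markup removal: drop script/style bodies, then any remaining tags.
    static String stripMarkup(String html) {
        return html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                .replaceAll("(?s)<[^>]+>", " ");
    }

    // Count every run of n consecutive words (n = 1 counts single words).
    static Map<String, Integer> countPhrases(String text, int n) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^\\p{L}\\p{N}']+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= words.size(); i++) {
            String phrase = String.join(" ", words.subList(i, i + n));
            counts.merge(phrase, 1, Integer::sum);
        }
        return counts;
    }

    // Most frequent first; ties are broken alphabetically so output is stable.
    static List<Map.Entry<String, Integer>> mostCommon(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    public static void main(String[] args) {
        String html = "<html><style>p { color: red; }</style>"
                + "<p>Free games and <b>free games</b> online</p></html>";
        String text = stripMarkup(html);
        for (Map.Entry<String, Integer> e : mostCommon(countPhrases(text, 2), 3)) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

Counting two-word and three-word runs as well as single words matters here, because Dansguardian phraselists match phrases rather than just isolated words.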

To use WPA do the following:

java -jar WPA.jar site1 site2 site3 …

You can also use it to see which words you should be banning if you have a set of bad pages you don’t want your users to view. Your blocked domains list is a good input for this, as it helps prevent access to similar content:

cat /etc/dansguardian/lists/global-block/domains | java -jar WPA.jar

All you need to do then is decide what phrases to add to your weighted phrase lists!
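The two invocations shown earlier suggest that sites can come either from the command line or from standard input. A rough sketch of how a tool might support both is below. This is an assumption about the behaviour, not WPA’s real code: the names are hypothetical, and skipping blank lines and `#` comment lines is my own guess at what would be sensible for Dansguardian list files.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; not the code shipped in WPA.jar.
public class TargetInputSketch {

    // Prefer command-line arguments; otherwise read one target per line.
    static List<String> readTargets(String[] args, Reader input) {
        if (args.length > 0) {
            return new ArrayList<>(Arrays.asList(args));
        }
        return new BufferedReader(input).lines()
                .map(String::trim)
                // Assumed convention: ignore blank lines and '#' comments.
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A real tool would pass new InputStreamReader(System.in) here;
        // a fixed sample keeps this sketch from waiting on the terminal.
        Reader sample = new StringReader("# blocked domains\nexample.com\n\nexample.org\n");
        for (String target : readTargets(args, sample)) {
            System.out.println(target);
        }
    }
}
```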

Of course, the beauty of this program is that it’s cross-platform thanks to Java, so you’ll be able to run it on Linux, Windows or Mac. If you want the source code, it’s all bundled up in the jar.

To get hold of the jar, click here.