How I implemented a language-aware web scraper

I needed a huge sample of live Slovenian language. The simplest way would be to use digitized books, but that comes with its own problems. Many books are old and written in somewhat archaic language, and a lot of colloquial words and slang never appear in books. Accessing large volumes of them is also troublesome, since many have copyright restrictions attached. Another option was to parse well-known news sites, but that would still leave me with the slang problem. The problem called for a customized web scraper that would let me access vast amounts of user-written content on the web. One of the challenges in writing such a program was language detection, since I only needed words from a specific language; otherwise I would get nothing but a useless mess.

I called the scraper Prasker and it is available on GitHub for free (MIT licensed).

I wrote the scraper in Python. I approached the problem by using a dictionary of all the words of a specific language, Slovenian in my case. I found one on the web, but I could also have produced it with the very script I wrote, pointed at web portals in a known language and with the language detection feature turned off. When the program starts, it reads all the words into a set and keeps them in memory until the program finishes.
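
A minimal sketch of that loading step could look like this; the file name and the one-word-per-line format are my assumptions, not something taken from the actual Prasker code.

```python
def load_dictionary(path):
    """Read one word per line into a set for fast membership tests."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip().lower()
            if word:
                words.add(word)
    return words

slovenian_words = load_dictionary("slovenian_words.txt")  # hypothetical file name
```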

Then it reads the content of the first available URL. The parser splits the page content into one or more text blocks; one comment, for instance, might become one text block. This is partly a consequence of how the parser works, but there is another reason. The parser also checks for duplicate text blocks, which is why blocks shouldn't be too large, for example joined into one big block containing all the text of the whole page. That would certainly be easy to do, but then much of the text in the sample would be repeated.

Just think of how many links load the same (dynamic) content of a page, which would get parsed again and again. Take the text on Twitter, for example: "Have an account? Sign in". It is present at the top of every Twitter member timeline and every tweet page. Texts like this could make up a significant share of the final result and spoil its statistical value.

On the other hand, the blocks shouldn't be so small that they correspond to individual HTML tags. Various "style" tags that also carry text, such as <u> or <i>, would break the content into many little blocks, obscuring what the sentences originally were. That could interfere with any statistical analysis that needs to work on whole sentences.
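
A hedged sketch of how such block extraction could look, assuming BeautifulSoup is used for parsing; Prasker's actual parser may split blocks differently. Block-level tags such as <p> keep whole sentences together, inline tags like <i> or <u> stay inside their parent block, and a set of already-seen blocks filters out duplicates.

```python
from bs4 import BeautifulSoup

def extract_text_blocks(html, seen_blocks):
    """Split a page into medium-sized text blocks and skip exact duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for tag in soup.find_all(["p", "li", "blockquote", "td", "h1", "h2", "h3"]):
        text = tag.get_text(" ", strip=True)
        if text and text not in seen_blocks:  # drop repeated boilerplate like "Sign in"
            seen_blocks.add(text)
            blocks.append(text)
    return blocks
```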

So now I have the content of a page, divided into one or more arbitrarily long text blocks. Each block is then tested for language separately. Each word in the block is looked up in the given dictionary, and if 80% of the words have a match, the block is considered to be written in that language. The 80% threshold was set by guessing and a couple of experiments; it was not obtained systematically. A systematic analysis of where the "sweet spot" lies would also depend on the language, the specific dictionary file and the point in time, so I didn't bother. Once I found a value that "kinda works", I left it be.
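
A rough sketch of that per-block test; the simple regex tokenisation is my assumption, only the 80% default comes from the description above.

```python
import re

def block_in_language(block, dictionary, threshold=0.8):
    """Return True if at least `threshold` of the block's words are in the dictionary."""
    words = re.findall(r"\w+", block.lower())
    if not words:
        return False
    matches = sum(1 for word in words if word in dictionary)
    return matches / len(words) >= threshold
```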

The reason a threshold is needed at all is that if it were set too low or too high, the parser would stop doing what it is intended to do, that is, giving a sample of live language. There is a significant chance that the dictionary doesn't contain all the words of the spoken language at a given point in time; none does. On top of that, live languages tend to evolve, so even if a perfect dictionary were released today, tomorrow it would no longer be accurate. Also, and this might just be a feature of smaller languages such as Slovenian, there is substantial use of phrases from the culturally dominant language, which in Slovenia is English. Individual words or phrases are written in English even when the general text is in Slovenian. If, as an alternative solution, the dictionary contained the words of both languages simultaneously, it would no longer be able to tell one from the other. And the final reason for introducing the threshold is that such a system allows sloppiness in assembling the dictionary, which is very helpful in real-life projects. An approximate dictionary will do just fine in practice.

I have listed many reasons for lowering the threshold, but on the other hand, it shouldn't be too low either. At the extreme setting of 0% it is the same as having no dictionary at all, and above that, the higher you set it, the more accurate the result is. This is probably not so important when scraping Slovenian texts from the internet: if even a small percentage of Slovene words appears on a page, it is still very likely that the page, or the pages it links to, is basically in Slovenian. Why would anyone use obscure Slovene words otherwise? For a dominant language such as English, the probability is not in the parser's favour. A short Slovenian text might well consist of 50% English words, and I probably wouldn't want that gibberish in my, presumably, English sample. I do want it in the Slovenian sample though; it is the use of live language after all.

So, after a block of text is analysed, it is either found to be written in the dictionary language or not. Its language status is reported back to the parsing part of the program. If the block is not in the correct language, it is ignored; if it is, the program appends it to the output file. Furthermore, the parsing part of the program collects the language statuses of all blocks on the current web page and makes a statistical decision, again using the 80% threshold. If 80% or more of the words across all text blocks on the page are in the given language, the page itself, not just a particular text block, is finally recognised as being written in that language. In that case the links from the page are extracted and added to a queue, from which they will eventually be removed and processed. Pages that are not dominantly written in the specified language are where the recursive scraping stops; links from such pages are not picked up. This way I prevented the program from aimlessly browsing the wide multilingual internet in vain, rejecting one block of text after another, never realising that it is useless to look for Slovenian text on the Guardian web portal, for instance.
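
A simplified sketch of that page-level decision follows. The deque-based URL queue, the seed URL and the `extract_links` helper are my assumptions, and `extract_text_blocks` is the helper sketched earlier; this is not necessarily how Prasker structures it.

```python
from collections import deque
import re

url_queue = deque(["https://example.si/"])  # hypothetical seed URL
visited = set()
seen_blocks = set()

def process_page(html, dictionary, out_file, threshold=0.8):
    """Write matching blocks to the sample and decide whether to follow the page's links."""
    total_words = matched_words = 0
    for block in extract_text_blocks(html, seen_blocks):  # helper sketched above
        words = re.findall(r"\w+", block.lower())
        matches = sum(1 for word in words if word in dictionary)
        total_words += len(words)
        matched_words += matches
        if words and matches / len(words) >= threshold:
            out_file.write(block + "\n")  # block accepted into the language sample
    # Follow links only if the page as a whole passes the same threshold.
    if total_words and matched_words / total_words >= threshold:
        for link in extract_links(html):  # hypothetical link-extraction helper
            if link not in visited:
                visited.add(link)
                url_queue.append(link)
```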

I also implemented one extra feature on top of the described mechanism. The question is what to do with words that are not found in the dictionary but appear in a text block that was recognised as being in the dictionary's language anyway. These are potentially interesting words that get filtered out for free while I search for texts. If the program is given an output file for them, it stores them in that separate file for later review. Common words that might appear but that you don't want saved in the unknown-words collection can be specified via an "ignore dictionary". It works like this: the ignore dictionary is loaded into a set on program start, just like the normal dictionary. Every word that is not found in the dictionary is then looked up in the ignore dictionary; if it is not found there either, it is saved to the specified new-words file.
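
A hedged sketch of that feature, reusing the `load_dictionary` helper from above; the file names are illustrative, not Prasker's actual ones.

```python
import re

ignore_words = load_dictionary("ignore_words.txt")  # hypothetical file name

def collect_unknown_words(block, dictionary, ignore_words, new_words_file):
    """Save words that are in neither the main dictionary nor the ignore dictionary."""
    for word in re.findall(r"\w+", block.lower()):
        if word not in dictionary and word not in ignore_words:
            new_words_file.write(word + "\n")
```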

