Word Search is Useless for Japanese

As mentioned on moongift, Hatta's indexed search doesn't seem to work for the Japanese language. Actually it does work, but in a completely useless way.

It seems that Japanese doesn't usually use spaces to separate words – the grammar lets readers see where words end without them, similar to how it worked in ancient Latin. But Hatta's word indexer uses spaces and punctuation to separate words for indexing – so Japanese sentences, or even whole paragraphs, get indexed as single words. This is obviously pretty useless.
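To illustrate the problem, here is a minimal sketch of a splitter that breaks text on whitespace and punctuation. It is not Hatta's actual indexer code; the `split_words` function and the sample sentences are made up for this example.

```python
import re

# Hypothetical stand-in for a whitespace-and-punctuation word splitter.
# \w+ matches runs of letters, digits and underscores, including kana and kanji.
WORD_RE = re.compile(r"\w+")


def split_words(text):
    """Return the list of 'words' that such a splitter would index."""
    return WORD_RE.findall(text)


english = "Hatta is a small wiki engine."
japanese = "これは日本語の文章です。検索できません。"

# English: one index entry per word.
print(split_words(english))

# Japanese: only the full stops separate anything, so each whole
# sentence ends up as a single index entry.
print(split_words(japanese))
```

With an index like this, searching for a word such as 日本語 would only match if the index does exact word lookups against entire sentences, which is why the results look empty.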

I can think of several possible solutions, but none of them seem to be within my reach without help from someone who knows Japanese or similar languages.

I would be very interested in any information about how it is usually done in Japanese and other languages/scripts that don't use spaces for separating words. – Radomir Dopieralski


The ejSplitter was very efficient for COREblog (a blog engine for Zope). It should be trivial to port ejSplitter to Hatta. – Klaus Alexander Seistrup


Thank you, this is exactly what I was looking for. Unfortunately the whole ejSplitter required some Zope modules, so I just cut out the code I needed and put it in a SplitJapanese.py file. Whenever Hatta finds this file in its import path, it enables indexing of Japanese text. – Radomir Dopieralski
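The "enable it when the file is found" behaviour can be sketched roughly as an optional import. This is only an illustration of the pattern, not Hatta's real code: the module name `SplitJapanese` comes from the comment above, but the function name `split_japanese`, its signature (one string in, an iterable of words out) and the helpers `load_optional_splitter` and `index_words` are assumptions.

```python
import importlib
import re

WORD_RE = re.compile(r"\w+")


def load_optional_splitter(module_name="SplitJapanese",
                           func_name="split_japanese"):
    """Return the splitter function if the module is importable, else None.

    Both default names are assumptions made for this sketch.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, func_name, None)


def index_words(text, splitter=None):
    """Split on spaces and punctuation, then refine each chunk if possible."""
    words = []
    for chunk in WORD_RE.findall(text):
        if splitter is not None:
            # Assumed contract: the splitter takes a string and
            # returns an iterable of shorter words.
            words.extend(splitter(chunk))
        else:
            words.append(chunk)
    return words


# Without SplitJapanese.py on the import path this is None, and
# indexing falls back to plain space-and-punctuation splitting.
splitter = load_optional_splitter()
print(index_words("Hatta wiki", splitter))
```

The point of this design is that the Japanese splitter stays an optional extra: installations without the file behave exactly as before.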


Fixed Bugs