As mentioned on moongift, Hatta's indexed search doesn't seem to work for the Japanese language. Actually it does work, but in a completely useless way.
It seems that Japanese is usually written without spaces between words – the grammar lets readers see where words end without them, similar to how it worked in ancient Latin. But Hatta's word indexer uses spaces and punctuation to separate words for indexing – so Japanese sentences, or even whole paragraphs, get indexed as single words. This is obviously pretty useless.
I can think of several possible solutions, but none of them seem to be within my reach without help from someone who knows Japanese or similar languages:
- Write (or find a ready-made) word splitter that can handle Japanese text based on the grammar rules (probably too hard to do).
- Only index Kanji characters, and treat them all as separate words.
- Treat Kanji and runs of Hiragana and Katakana as separate words.
I would be very interested in any information about how it is usually done in Japanese and other languages/scripts that don't use spaces for separating words. – Radomir Dopieralski
The ejSplitter was very efficient for COREblog (a blog engine for Zope). It should be trivial to port ejSplitter to Hatta. – Klaus Alexander Seistrup
Thank you, this is exactly what I was looking for. Unfortunately the whole ejSplitter required some Zope modules, so I just cut out the code I needed and put it in a SplitJapanese.py file. Whenever Hatta finds this file in its import path, it enables indexing of Japanese text. – Radomir Dopieralski
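For readers curious how such an optional module can be picked up, here is a minimal sketch of the usual Python pattern. It is not copied from Hatta: the function name `split_japanese` is an assumption about what SplitJapanese.py exposes, and `split_words` is a hypothetical helper.

```python
import re

try:
    # Assumption: the module exposes a callable named split_japanese;
    # the real name inside SplitJapanese.py may differ.
    from SplitJapanese import split_japanese
except ImportError:
    # Module not on the import path: Japanese splitting stays disabled.
    split_japanese = None

WORD_RE = re.compile(r"\w+")


def split_words(text):
    """Split text into index words, using the Japanese splitter if present."""
    if split_japanese is not None:
        return list(split_japanese(text))
    # Fallback: plain splitting on spaces and punctuation.
    return WORD_RE.findall(text)
```

With this pattern nothing needs to be configured: dropping the file next to the wiki script is enough, and removing it brings back the plain splitting.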