Qizx/Open v0.3

net.xfra.qizxopen.util
Class DefaultWordExtractor

java.lang.Object
  |
  +--net.xfra.qizxopen.util.DefaultWordExtractor
All Implemented Interfaces:
WordExtractor

public class DefaultWordExtractor
extends java.lang.Object
implements WordExtractor

A default word extractor suitable for European languages compatible with ISO-8859-1.

By default, a word starts with a letter and may contain letters and digits. Characters are folded to lowercase and, unless setKeepAccents(true) is called, accented letters are mapped to the corresponding non-accented letters (e.g. é maps to 'e'). This behavior can be changed in subclasses by overriding isWordStart, isWordPart and mapChar.
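
The character normalization described above can be illustrated with a small standalone sketch. This is not the Qizx/Open source: the class AccentFoldingSketch and its fold method are hypothetical names, and stripping accents through java.text.Normalizer is an assumption about one reasonable way to obtain the documented effect.

```java
import java.text.Normalizer;

// Illustrative stand-in only; NOT the Qizx/Open implementation of mapChar.
class AccentFoldingSketch {

    // Hypothetical helper: lowercases c and, unless keepAccents is true,
    // replaces an accented letter by its base letter.
    static char fold(char c, boolean keepAccents) {
        char lower = Character.toLowerCase(c);
        if (keepAccents) {
            return lower;
        }
        // NFD decomposition splits an accented letter into its base letter
        // followed by combining marks; the base letter comes first.
        String decomposed =
            Normalizer.normalize(String.valueOf(lower), Normalizer.Form.NFD);
        return decomposed.charAt(0);
    }

    public static void main(String[] args) {
        // '\u00C9' is an uppercase E with acute accent.
        System.out.println(fold('\u00C9', false)); // accent stripped
        System.out.println(fold('\u00C9', true));  // accent kept
    }
}
```

Letters without a canonical decomposition (for example 'ß' or 'ø') pass through this sketch unchanged apart from lowercasing; whether DefaultWordExtractor treats them differently is not stated here.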


Constructor Summary
DefaultWordExtractor()
           
 
Method Summary
 char charAt(int ahead)
          Returns the character at current position + ahead, or 0 if after end.
 boolean isWordPart(char c)
          Returns true if a word may contain this character.
 boolean isWordStart(char c)
          Returns true if a word may begin with this character.
static void main(java.lang.String[] args)
           
 char mapChar(char c)
          Normalizes a character belonging to a word.
 char nextChar()
          Moves to the next character and returns it, or returns 0 if at end.
 char[] nextWord()
          Gets the next normalized word, or null if no more words.
 void setKeepAccents(boolean keep)
          If keep is true, accented letters are not mapped to their non-accented equivalents.
 void start(char[] text, int length)
          Starts the analysis of a new text chunk.
 int wordLength()
          Returns the original length of the last word returned by nextWord.
 int wordOffset()
          Returns the offset of the last word returned by nextWord.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DefaultWordExtractor

public DefaultWordExtractor()
Method Detail

start

public void start(char[] text,
                  int length)
Description copied from interface: WordExtractor
Starts the analysis of a new text chunk.

Specified by:
start in interface WordExtractor

isWordStart

public boolean isWordStart(char c)
Returns true if a word may begin with this character.

Specified by:
isWordStart in interface WordExtractor

isWordPart

public boolean isWordPart(char c)
Returns true if a word may contain this character.

Specified by:
isWordPart in interface WordExtractor

mapChar

public char mapChar(char c)
Description copied from interface: WordExtractor
Normalizes a character belonging to a word.

Specified by:
mapChar in interface WordExtractor

nextWord

public char[] nextWord()
Description copied from interface: WordExtractor
Gets the next normalized word, or null if no more words. Must return a new char array for each word.

Specified by:
nextWord in interface WordExtractor

charAt

public char charAt(int ahead)
Description copied from interface: WordExtractor
Returns the character at current position + ahead, or 0 if after end.

Specified by:
charAt in interface WordExtractor

nextChar

public char nextChar()
Description copied from interface: WordExtractor
Moves to the next character and returns it, or returns 0 if at end.

Specified by:
nextChar in interface WordExtractor
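
The look-ahead contract of charAt and nextChar can be pictured with a minimal cursor. The class CursorSketch is a hypothetical stand-in written for illustration; in particular, having nextChar advance first and then return the character at the new position is an assumed reading of the description above, not confirmed behavior of DefaultWordExtractor.

```java
// Illustrative stand-in only; NOT the Qizx/Open implementation.
class CursorSketch {
    private final char[] text;
    private final int length;
    private int pos = 0; // current position in text

    CursorSketch(char[] text, int length) {
        this.text = text;
        this.length = length;
    }

    // Character at current position + ahead, or 0 if outside the text.
    char charAt(int ahead) {
        int i = pos + ahead;
        return (i >= 0 && i < length) ? text[i] : '\0';
    }

    // Advances by one character (never past the end) and returns the
    // character now under the cursor, or 0 if the end was reached.
    char nextChar() {
        if (pos < length) {
            pos++;
        }
        return charAt(0);
    }
}
```

Using 0 as the end marker lets a caller peek ahead without bounds checks, at the cost of not being able to distinguish a genuine NUL character in the text from the end of input.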

wordOffset

public int wordOffset()
Description copied from interface: WordExtractor
Returns the offset of the last word returned by nextWord.

Specified by:
wordOffset in interface WordExtractor

wordLength

public int wordLength()
Description copied from interface: WordExtractor
Returns the original length of the last word returned by nextWord (most often equal to the length of the returned token).

Specified by:
wordLength in interface WordExtractor
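
The methods start, nextWord, wordOffset and wordLength are naturally used together in a scanning loop. The sketch below shows that pattern on a self-contained stand-in class, WordScanSketch (a hypothetical name), which imitates the documented default rules: words start on a letter, continue over letters and digits, and are lowercased. Accent folding is omitted, and the scanning logic is an assumption, not the Qizx/Open source.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in only; NOT the Qizx/Open implementation.
class WordScanSketch {
    private char[] text;
    private int length;
    private int pos;        // current scan position
    private int wordOffset; // offset of the last word returned
    private int wordLength; // original length of the last word returned

    void start(char[] text, int length) {
        this.text = text;
        this.length = length;
        this.pos = 0;
    }

    boolean isWordStart(char c) { return Character.isLetter(c); }
    boolean isWordPart(char c)  { return Character.isLetterOrDigit(c); }
    char mapChar(char c)        { return Character.toLowerCase(c); }

    // Returns the next normalized word in a new array, or null at end.
    char[] nextWord() {
        while (pos < length && !isWordStart(text[pos])) pos++;
        if (pos >= length) return null;
        wordOffset = pos;
        while (pos < length && isWordPart(text[pos])) pos++;
        wordLength = pos - wordOffset;
        char[] word = new char[wordLength];
        for (int i = 0; i < wordLength; i++) {
            word[i] = mapChar(text[wordOffset + i]);
        }
        return word;
    }

    int wordOffset() { return wordOffset; }
    int wordLength() { return wordLength; }

    // Typical caller loop: collects "word@offset+length" for each word.
    static List<String> scan(String input) {
        WordScanSketch ex = new WordScanSketch();
        char[] chars = input.toCharArray();
        ex.start(chars, chars.length);
        List<String> out = new ArrayList<>();
        for (char[] w = ex.nextWord(); w != null; w = ex.nextWord()) {
            out.add(new String(w) + "@" + ex.wordOffset() + "+" + ex.wordLength());
        }
        return out;
    }
}
```

Because nextWord returns a fresh array each time, the caller may keep the returned words; the offset and length refer to the original text, which is what an indexer would need to highlight or locate a match.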

setKeepAccents

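public void setKeepAccents(boolean keep)
If keep is true, accented letters are not mapped to their non-accented equivalents. By default they are mapped.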

main

public static void main(java.lang.String[] args)

 Copyright Xavier FRANC 2003-2004