public abstract class StatefulTokenizer
extends java.lang.Object

Rule sets for the individual states of the grammar are defined with
putRules(String, Rule[]), e.g. for string processing. Each rule in a set
produces one token with a type name, and it can switch to another state or
switch back to the previous state with the special state "#pop". The result
is a list of tokens with arbitrary types.

The list of produced tokens can be filtered: tokens of the same type can be
joined by registering the type with addJoinedType(Object), and tokens can be
omitted from the result for easier post-processing by registering the type
with addIgnoredType(Object).

Modifier and Type | Class and Description |
---|---|
protected static class | StatefulTokenizer.Rule: A regular expression based rule for building a parsing grammar. |
static class | StatefulTokenizer.Token: A token that designates a certain section of a text input. |

Modifier and Type | Field and Description |
---|---|
protected static java.lang.String | INITIAL_STATE: The name of the initial state. |

Modifier | Constructor and Description |
---|---|
protected | StatefulTokenizer(): Initializes the internal data structures of a new instance. |

Modifier and Type | Method and Description |
---|---|
protected void | addIgnoredType(java.lang.Object tokenType): Adds a token type to the set of tokens that should be ignored in the tokenizer output. |
protected void | addJoinedType(java.lang.Object tokenType): Adds a token type to the set of tokens that should get joined in the tokenizer output. |
protected void | putRules(StatefulTokenizer.Rule... rules): Sets the rules for the initial state in the grammar. |
protected void | putRules(java.lang.String name, StatefulTokenizer.Rule... rules): Sets the rules for the specified state in the grammar. |
java.util.List<StatefulTokenizer.Token> | tokenize(java.lang.String data): Analyzes the specified input string using different sets of rules and returns a list of token objects describing the content structure. |

protected static final java.lang.String INITIAL_STATE

protected StatefulTokenizer()

protected void addJoinedType(java.lang.Object tokenType)
tokenType - Type of the tokens that should be joined.

protected void addIgnoredType(java.lang.Object tokenType)
tokenType - Type of the tokens that should be ignored.

protected void putRules(StatefulTokenizer.Rule... rules)
rules - A sequence or an array with rules to be added.

protected void putRules(java.lang.String name, StatefulTokenizer.Rule... rules)
name - A unique name to identify the rule set.
rules - A sequence or an array with rules to be added.

public java.util.List<StatefulTokenizer.Token> tokenize(java.lang.String data)
data - Input string.
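
The interplay of rule sets, state switching with "#pop", joined types, and ignored types described in the class overview can be illustrated with a small standalone sketch. It does not use StatefulTokenizer itself, because the constructor of StatefulTokenizer.Rule is not shown on this page. The class TokenizerSketch, its Rule and Token classes, the state names "initial" and "string", and the choice to skip one character when no rule matches are all assumptions made for illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; none of these names come from the documented class.
class TokenizerSketch {
    /** Hypothetical rule: a regex, the token type it emits, and an optional state switch. */
    static final class Rule {
        final Pattern pattern;
        final String type;
        final String nextState; // null = stay, "#pop" = return to the previous state

        Rule(String regex, String type, String nextState) {
            this.pattern = Pattern.compile(regex);
            this.type = type;
            this.nextState = nextState;
        }
    }

    /** Hypothetical token: a type plus the matched text. */
    static final class Token {
        final String type;
        String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    final Map<String, List<Rule>> grammar = new HashMap<>();
    final Set<String> joinedTypes = new HashSet<>();
    final Set<String> ignoredTypes = new HashSet<>();

    List<Token> tokenize(String data) {
        Deque<String> states = new ArrayDeque<>();
        states.push("initial");
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < data.length()) {
            int next = pos + 1; // assumption: skip one character if no rule matches
            List<Rule> rules = grammar.getOrDefault(states.peek(), Collections.emptyList());
            for (Rule rule : rules) {
                Matcher m = rule.pattern.matcher(data).region(pos, data.length());
                if (!m.lookingAt() || m.end() == pos) {
                    continue; // rule does not match here, or matched nothing
                }
                Token last = tokens.isEmpty() ? null : tokens.get(tokens.size() - 1);
                if (last != null && last.type.equals(rule.type)
                        && joinedTypes.contains(rule.type)) {
                    last.text += m.group(); // join with the preceding token of the same type
                } else {
                    tokens.add(new Token(rule.type, m.group()));
                }
                if ("#pop".equals(rule.nextState)) {
                    if (states.size() > 1) {
                        states.pop(); // back to the previous state
                    }
                } else if (rule.nextState != null) {
                    states.push(rule.nextState); // enter another state
                }
                next = m.end();
                break; // first matching rule wins
            }
            pos = next;
        }
        // Drop ignored types only after joining is done.
        tokens.removeIf(token -> ignoredTypes.contains(token.type));
        return tokens;
    }

    static List<String> demo() {
        TokenizerSketch t = new TokenizerSketch();
        t.grammar.put("initial", Arrays.asList(
            new Rule("\"", "quote", "string"),
            new Rule("\\s+", "space", null),
            new Rule("\\w+", "word", null)));
        t.grammar.put("string", Arrays.asList(
            new Rule("\"", "quote", "#pop"),
            new Rule("[^\"]", "text", null)));
        t.joinedTypes.add("text");
        t.ignoredTypes.add("space");
        t.ignoredTypes.add("quote");

        List<String> out = new ArrayList<>();
        for (Token token : t.tokenize("say \"hi there\" now")) {
            out.add(token.type + ":" + token.text);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [word:say, text:hi there, word:now]
    }
}
```

In this sketch the "string" rule set matches one character at a time, so joining the "text" type is what turns the quoted content into a single token, and ignoring "space" and "quote" leaves only the tokens a caller would typically post-process. Whether the documented class filters ignored tokens before or after joining is not stated on this page.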