EControl Syntax Editor SDK
Token rules (parser)

Common rule properties are described in the "Common for Styles and Rules" section. 

 

A token is a text element. Each token has a start and an end position in the text. Tokens cannot overlap. 

Tokens are defined by the parsing procedure, which creates the token array. The token array is sequential, i.e. the end position of a token is less than the start position of the next token. 

 

Token rules are checked sequentially. As soon as one rule matches, the loop is broken and the remaining rules are not checked. 

 

Token rules are intended for detecting tokens. You can manage token rules on the "Parser" page of the "Syntax Lexer" dialog. 

 

Properties of token rule: 

Token type - integer value that is assigned to all tokens detected by this rule. It is required only if you want to use these tokens in the block detection algorithm. "Token type names" are used to simplify token type assignment. 

 

Token style - style that will be applied to all tokens detected by means of this rule. 

 

Rule expression - regular expression for detecting a token. The regular expression is applied at the current analysis position. If it matches, the current position is advanced by the length of the found token and the token rules loop is broken. See "Syntax of Regular Expressions" for more information. 
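
The matching loop described above can be sketched in Python. This is only an illustration of the idea, not the SDK implementation: the rule table, the names RULES and tokenize, and the choice to skip one character when no rule matches are all assumptions made for the example.

```python
import re

# Hypothetical rule table: (token type name, regular expression).
# Order matters: the first rule that matches at the current position wins.
RULES = [
    ("Identifier", r"[a-z_]\w*"),
    ("Float", r"\d+\.\d+"),
    ("Integer", r"\d+"),
    ("Symbol", r"[^\s\w]"),
]
COMPILED = [(name, re.compile(expr, re.IGNORECASE | re.MULTILINE))
            for name, expr in RULES]


def tokenize(text):
    """Return a list of (type name, start, end) tuples; end is exclusive."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, regex in COMPILED:
            m = regex.match(text, pos)   # applied at the current position only
            if m and m.end() > pos:      # ignore empty matches
                tokens.append((name, m.start(), m.end()))
                pos = m.end()            # advance by the length of the token
                break                    # the rules loop is broken
        else:
            pos += 1                     # assumption: unmatched text is skipped
    return tokens


tokens = tokenize("x1 = 3.14 + 42")
```

Here tokens contains an Identifier for "x1", a Float for "3.14" and an Integer for "42", with a Symbol token for "=" and for "+". The tokens do not overlap and appear in text order.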

 

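Default modifiers for regular expressions: (?imxr-s), i.e. ignore case (i), multi-line mode in which ^ and $ match the start and end of each line (m), comments and insignificant spaces allowed inside the expression (x), full Russian character set (r), and the metacharacter "." (dot) does not match the end of line "\n" (-s). 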
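
To experiment with these defaults outside the editor, they can be approximated with Python flags. This is an assumption-laden sketch: Python's re module is a different engine, it has no counterpart of the "r" modifier, and the name DEFAULT_FLAGS is invented for the example.

```python
import re

# Approximation of (?imx-s) with Python flags.
# The dot already excludes "\n" in Python unless re.DOTALL is given.
DEFAULT_FLAGS = re.IGNORECASE | re.MULTILINE | re.VERBOSE

keyword = re.compile(r"""
    ^ begin \b   # i: also matches BEGIN; m: ^ matches at every line start
    .*           # -s: the dot stops at the end of the line
""", DEFAULT_FLAGS)

text = "x := 1\nBEGIN work\nend"
matched = keyword.search(text).group(0)
```

With these flags matched is "BEGIN work": the spaces and the comments inside the expression are ignored, case is ignored, and the match does not run past the end of the line.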

 

The rule expression is the most powerful part of lexer configuration. Several examples are given below. 

 

Identifier 

 

[a-z_]\w* 

 

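The first character is a letter or an underscore; the remaining characters are word characters (letters, digits and underscore). If the first character is a letter or an underscore, the text matches the rule. Because case is ignored by default, [a-z] also matches uppercase letters. 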
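
A quick way to check the identifier expression is to try it in Python. This is a sketch only: Python's re engine stands in for the editor's engine, and the helper name identifier_at is made up for the example.

```python
import re

# re.IGNORECASE mirrors the default "i" modifier.
identifier = re.compile(r"[a-z_]\w*", re.IGNORECASE)


def identifier_at(text, pos=0):
    """Return the identifier starting exactly at pos, or None."""
    m = identifier.match(text, pos)
    return m.group(0) if m else None


first = identifier_at("Count1 := 0")     # "Count1"
underscore = identifier_at("_tmp")       # "_tmp"
digit_first = identifier_at("9lives")    # None: a digit cannot start a token
after_digit = identifier_at("9lives", 1) # "lives"
```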

 

String 

 

'.*?('|$) 

 

Matches from a single quote to the next single quote or to the end of the line. The "any character" repetition (.*?) is non-greedy, so the next condition ('|$) is checked before each position increment. ('|$) means either a single quote or the end of the line. 
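
The effect of the non-greedy repetition can be seen by comparing it with a greedy variant. The sketch below uses Python's re module as a stand-in for the editor's engine; the variable names are invented for the example.

```python
import re

# re.MULTILINE mirrors the default "m" modifier, so $ matches at each line end.
string_rule = re.compile(r"'.*?('|$)", re.MULTILINE)
greedy_rule = re.compile(r"'.*('|$)", re.MULTILINE)

line = "'abc' + 'def'"

closed = string_rule.match(line).group(0)                  # stops at the next quote
unclosed = string_rule.match("'abc\nnext line").group(0)   # stops at the line end
greedy = greedy_rule.match(line).group(0)                  # runs to the end of the line
```

The non-greedy rule yields 'abc' for the closed string and 'abc (without a closing quote) for the unclosed one. The greedy variant swallows the whole line, including the second string.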

 

Float const 

 

#with exp. dot is optional 

\d+ \.? \d+ e [\+\-]? \d+ | 

#without exp. dot is required 

\d+ \. \d+ 

 

\d+ - means at least one digit. 

\.? - means optional dot 

[\+\-]? - means optional "+" or "-". 
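
The float expression can be tried as written using Python's verbose mode, which also ignores spaces and "#" comments. This is a sketch under the assumption that Python's engine behaves like the editor's for this expression; the helper name float_at is hypothetical.

```python
import re

# re.VERBOSE and re.IGNORECASE mirror the default "x" and "i" modifiers.
float_rule = re.compile(r"""
    # with exp. dot is optional
    \d+ \.? \d+ e [\+\-]? \d+ |
    # without exp. dot is required
    \d+ \. \d+
""", re.IGNORECASE | re.VERBOSE)


def float_at(text):
    """Return the float constant at the start of text, or None."""
    m = float_rule.match(text)
    return m.group(0) if m else None


plain = float_at("3.14")        # "3.14"
with_exp = float_at("1.5e-3")   # "1.5e-3"
no_dot = float_at("12E+5")      # "12E+5"
integer = float_at("42")        # None: left for the integer rule
one_digit = float_at("1e5")     # None: see the note below
```

Note that in the first alternative the dot is optional but a \d+ is required on both sides of it, so without a dot at least two digits must precede the exponent. In this sketch "12E+5" matches while "1e5" does not.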

 

Integer const 

 

\d+ 

 

Note: the integer const rule must come after the float const rule; otherwise the leading digits of any float, up to the dot or the exponent symbol, would be treated as an integer. 
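
The consequence of rule order can be illustrated with a small sketch. The simplified float expression, the rule tuples and the function first_token are assumptions made for the example, not part of the SDK.

```python
import re

# Simplified rules, each a (token type name, compiled expression) pair.
FLOAT = ("Float", re.compile(r"\d+\.\d+"))
INTEGER = ("Integer", re.compile(r"\d+"))


def first_token(text, rules):
    """Check the rules in order and return the first match at position 0."""
    for name, regex in rules:
        m = regex.match(text)
        if m:
            return name, m.group(0)
    return None


right_order = first_token("3.14", [FLOAT, INTEGER])  # ("Float", "3.14")
wrong_order = first_token("3.14", [INTEGER, FLOAT])  # ("Integer", "3")
plain_int = first_token("42", [FLOAT, INTEGER])      # ("Integer", "42")
```

With the integer rule first, only "3" is consumed and the float is split into separate tokens.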

 

Single line comment 

 

//.* 

 

Any character from // to the end of the line. The metacharacter dot does not match the end of line, so placing the end-of-line metacharacter $ at the end of the expression is optional. 
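
That the trailing $ makes no difference can be checked with a short sketch, again using Python's re module as a stand-in for the editor's engine.

```python
import re

source = "x := 1; // set x\ny := 2;"

# re.MULTILINE mirrors the default "m" modifier.
plain = re.compile(r"//.*", re.MULTILINE)
anchored = re.compile(r"//.*$", re.MULTILINE)

without_dollar = plain.search(source).group(0)
with_dollar = anchored.search(source).group(0)
```

Both expressions select "// set x" and leave the second line untouched.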

 

Multiline comment 

 

(?s)\{.*?(\}|\Z) 

 

The modifier "s" is turned on so that the metacharacter "." (dot) also matches the end of line. The expression selects text from the symbol "{" to the symbol "}" or to the end of the text. If the end-of-text alternative is omitted, i.e. (?s)\{.*?\}, a comment that has not been closed yet is not detected. 
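
The role of the end-of-text alternative can be demonstrated by comparing the two variants on an unclosed comment. This is a sketch with Python's re module, which happens to support both (?s) and \Z; the variable names are invented for the example.

```python
import re

with_eot = re.compile(r"(?s)\{.*?(\}|\Z)")
without_eot = re.compile(r"(?s)\{.*?\}")

closed = "{ first\n  second } code"
unclosed = "{ never closed\ncode"

closed_match = with_eot.match(closed).group(0)      # up to and including "}"
unclosed_match = with_eot.match(unclosed).group(0)  # up to the end of the text
missing = without_eot.match(unclosed)               # None: nothing is detected
```

While the user is still typing a comment, only the variant with \Z treats the rest of the text as a comment.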

 

Multiline define 

 

(?-i)\#define
(.*\\\s*\n)* # lines with line folding
.*           # last line

 

This is an example of a complex expression for the preprocessor directive "define". 

(?-i)\#define - turns off ignore case and means that the token starts with #define 

(.*\\\s*\n)* - means any characters up to the symbol "\", then optional spaces, then the end of line. The whole group can be repeated any number of times (or not at all). 

.* - means all characters until the end of the line. 
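
The define expression can be tried on a folded directive with the sketch below. It relies on Python's re module, so two details are assumptions of the example: the (?-i) prefix is dropped because Python is case-sensitive unless re.IGNORECASE is given, and re.VERBOSE stands in for the default "x" modifier.

```python
import re

define_rule = re.compile(r"""
    \#define
    (.*\\\s*\n)*  # lines with line folding
    .*            # last line
""", re.VERBOSE)

# A two-line macro followed by ordinary code.
source = "#define MAX(a, b) \\\n    ((a) > (b) ? (a) : (b))\nint x;"

matched = define_rule.match(source).group(0)
upper = define_rule.match("#DEFINE X 1")  # None: the match is case-sensitive
```

The match covers both lines of the macro and stops before "int x;", because the second line does not end with a backslash.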

Copyright (c) 2004-2011. All rights reserved.