srctools.tokenizer

Parses text into groups of tokens.

This is used internally for parsing KV1, text DMX, FGDs, VMTs, etc. If available, this will be replaced with a faster Cython-optimised version.

The BaseTokenizer class implements various helper functions for navigating through the token stream. The Tokenizer class then takes text file objects, a full string or an iterable of strings and actually parses it into tokens, while IterTokenizer allows transforming the stream before the destination receives it.

Once the tokenizer is created, either iterate over it or call the tokenizer to fetch the next token/value pair. Lookahead is supported, accessed by the BaseTokenizer.peek() and BaseTokenizer.push_back() methods. Tokenizers also track the current line number as data is read, letting you raise BaseTokenizer.error(...) to easily produce an exception listing the relevant line number and filename.

Character escapes match utlbuffer.cpp in the SDK. Specifically, the following characters are escaped: \n, \t, \v, \b, \r, \f, \a, \, ?, ' and ". / and ? are accepted as escapes, but not produced since they’re unambiguous.

Constants

class srctools.tokenizer.Token

Bases: Enum

A token type produced by the tokenizer.

property has_value: bool

If true, this type has an associated value.

EOF = 0

Produced indefinitely after the end of the file is reached.

STRING = 1

Quoted or unquoted text.

NEWLINE = 2

Produced at the end of every line.

PAREN_ARGS = 3

Parenthesised (data).

DIRECTIVE = 4

#name (automatically casefolded).

COMMENT = 5

A // or /* */ comment.

BRACE_OPEN = 6

A { character.

BRACE_CLOSE = 7

A } character.

PAREN_OPEN = 8

A ( character. Only used if PAREN_ARGS is not.

PAREN_CLOSE = 9

A ) character.

PROP_FLAG = 11

A [!flag] property flag.

BRACK_OPEN = 12

A [ character. Only used if PROP_FLAG is not.

BRACK_CLOSE = 13

A ] character.

COLON = 14

A : character, if Tokenizer.colon_operator is enabled.

EQUALS = 15

A = character.

PLUS = 16

A + character, if Tokenizer.plus_operator is enabled.

COMMA = 17

A , character.

srctools.tokenizer.BARE_DISALLOWED: Final = frozenset({'\t', '\n', '\r', ' ', '"', "'", '(', ')', ',', ';', '=', '[', ']', '{', '}'})

Characters not allowed for bare strings. These must be quoted.

srctools.tokenizer.escape_text(text: str, multiline: bool = False) → str

Escape special characters and backslashes, so tokenising reproduces them.

This matches utlbuffer.cpp in the SDK. The following characters are escaped: \t, \v, \b, \r, \f, \a, \, ". /, ' and ? are accepted as escapes, but not produced since they’re unambiguous. In addition, \n is escaped only if multiline is false.

Parameters:
  • text – The text to escape.

  • multiline – If set, \n is left unescaped and passed through unchanged.
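
The escaping rule described above can be modelled in a few lines of plain Python. This is a minimal sketch for illustration only, not the library's implementation: the name escape_text_sketch and the lookup table are assumptions, and only the characters listed above are handled.

```python
# Illustrative stand-in for the documented escaping rule; not srctools code.
_ESCAPES = {
    "\\": "\\\\",  # a literal backslash
    '"': '\\"',    # a double quote
    "\t": "\\t",
    "\v": "\\v",
    "\b": "\\b",
    "\r": "\\r",
    "\f": "\\f",
    "\a": "\\a",
}


def escape_text_sketch(text: str, multiline: bool = False) -> str:
    """Escape the characters listed above; keep newlines if multiline is set."""
    out = []
    for char in text:
        if char == "\n" and not multiline:
            out.append("\\n")
        else:
            # Characters without an entry (including /, ' and ?) pass through.
            out.append(_ESCAPES.get(char, char))
    return "".join(out)


print(escape_text_sketch('say "hi"\n'))
print(escape_text_sketch("line 1\nline 2", multiline=True))
```

Working one character at a time means a backslash that is already in the input is escaped exactly once, and never re-escaped by a later pass.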

Errors

exception srctools.tokenizer.TokenSyntaxError(
message: str,
file: str | PathLike[str] | None = None,
line: int | None = None,
)

An error that occurred when parsing a file.

Normally this is created via BaseTokenizer.error() which formats text into the error and includes the filename/line number from the tokenizer.

The string representation will include the provided file and line number if present.

mess: str

The error message that occurred.

file: str | PathLike[str] | None

The filename of the file being parsed, or None if not known.

line_num: int | None

The line where the error occurred, or None if not applicable (EOF, for instance).

srctools.tokenizer.format_exc_fileinfo(
msg: str,
file: str | PathLike[str] | None,
line_num: int | None,
) → str

If a line number or file is provided, include those in the error message.

This is the logic for the str() form of TokenSyntaxError.

BaseTokenizer.error_type: type[TokenSyntaxError]

The exception class to produce if an error occurs. This must be a subtype of TokenSyntaxError, since it is passed the line number and filename in addition to the error message. The error() method can be used to intelligently construct an instance to raise.

BaseTokenizer.error(
message: str | Token,
/,
*args: object,
) → TokenSyntaxError

Construct a syntax error exception, ready to be raised.

This returns the TokenSyntaxError instance, with line number and filename attributes filled in. The message can be a Token with the associated string value to produce a wrong token error, or a string which will be {}-formatted with the positional args if they are present.
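
The string case of the message handling can be pictured with a tiny helper. This is a hedged sketch: format_message is a hypothetical name, and it assumes formatting is skipped entirely when no positional args are given, as the wording above suggests.

```python
def format_message(message: str, *args: object) -> str:
    """Apply {}-formatting only when positional args were supplied."""
    # Without args the message is used verbatim, so stray braces are harmless.
    return message.format(*args) if args else message


print(format_message("Unexpected {} after {}!", "token", "keyword"))
print(format_message("Stray { found"))
```

Skipping the format step when there are no args means a message that quotes a literal brace does not need to double it.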

Main API

class srctools.tokenizer.BaseTokenizer(
filename: str | PathLike[str] | bytes | None,
error: type[TokenSyntaxError] | None,
)

Provides an interface for processing text into tokens.

It then provides tools for using those to parse data. This is an abc.ABC; a subclass must be used to provide a source for the tokens.

filename: str | None

The filename that is being parsed. This is passed along to the error class, to produce relevant errors.

line_num: int

The line number of the last token. Can be changed, but is automatically updated whenever Token.NEWLINE tokens are seen.

push_back(tok: Token, value: str | None = None) → None

Return a token to the stream, so it will be produced again the next time a token is fetched.

The value is required for Token.STRING, PAREN_ARGS and PROP_FLAG, but ignored for other token types.

peek(
consume_newlines: bool = False,
) → tuple[Token, str]

Peek at the next token, without removing it from the stream.

Parameters:

consume_newlines – Skip over newlines until a non-newline is found. All tokens are preserved.
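
The lookahead offered by push_back() and peek() can be pictured as a small stack sitting in front of the token source. The sketch below models that idea in plain Python. PushbackStream and its internals are hypothetical names used for illustration and say nothing about how srctools implements lookahead. Token types are plain strings here, and the consume_newlines option is left out.

```python
from collections.abc import Iterable, Iterator


class PushbackStream:
    """Hypothetical model of a token stream with push-back lookahead."""

    def __init__(self, source: Iterable[tuple[str, str]]) -> None:
        self._source: Iterator[tuple[str, str]] = iter(source)
        self._pushed: list[tuple[str, str]] = []  # most recent push-back is last

    def __call__(self) -> tuple[str, str]:
        """Fetch the next pair, preferring anything that was pushed back."""
        if self._pushed:
            return self._pushed.pop()
        # Model "EOF is produced indefinitely" with a default value.
        return next(self._source, ("EOF", ""))

    def push_back(self, tok: str, value: str = "") -> None:
        """Return a pair to the stream so the next fetch produces it again."""
        self._pushed.append((tok, value))

    def peek(self) -> tuple[str, str]:
        """Look at the next pair without consuming it."""
        pair = self()
        self.push_back(*pair)
        return pair


stream = PushbackStream([("STRING", "key"), ("STRING", "value")])
print(stream.peek())  # looks at the first pair without consuming it
print(stream())       # fetches that same pair
print(stream())       # moves on to the second pair
```

In this model peek() is simply a fetch followed by a push-back, which is why the two methods are described together.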

for ... in skipping_newlines() → Iterator[tuple[Token, str]]

Iterate over the tokens, skipping newlines.

for ... in block(
name: str,
consume_brace: bool = True,
) → Iterator[str]

Helper iterator for parsing keyvalue style blocks.

This will first consume a {. Then it will skip newlines, and output each string section found. When } is found it terminates; anything else produces an appropriate error. This is safely re-entrant, and tokens can be taken or put back as required.
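
The sequence of steps described for block() can be sketched over a plain list of token pairs. This is a simplified model, not the library's code: Tok, iter_block and the use of ValueError are stand-ins for Token, block() and the tokenizer's error type.

```python
from collections.abc import Iterable, Iterator
from enum import Enum, auto


class Tok(Enum):
    """Stand-in for a handful of the token types listed above."""
    STRING = auto()
    NEWLINE = auto()
    BRACE_OPEN = auto()
    BRACE_CLOSE = auto()
    EQUALS = auto()


def iter_block(tokens: Iterable[tuple[Tok, str]], name: str) -> Iterator[str]:
    """Yield each string inside a { ... } block, skipping newlines."""
    stream = iter(tokens)
    # Step 1: the block must start with an opening brace.
    first = next(stream, None)
    if first is None or first[0] is not Tok.BRACE_OPEN:
        raise ValueError(f"Expected an opening brace for the {name} block")
    # Step 2: skip newlines, yield strings, stop at the closing brace.
    for tok, value in stream:
        if tok is Tok.NEWLINE:
            continue
        if tok is Tok.STRING:
            yield value
        elif tok is Tok.BRACE_CLOSE:
            return
        else:
            raise ValueError(f"Unexpected {tok.name} in the {name} block")
    # Running out of tokens means the block was never closed.
    raise ValueError(f"Unclosed {name} block")


tokens = [
    (Tok.BRACE_OPEN, "{"), (Tok.NEWLINE, "\n"),
    (Tok.STRING, "first"), (Tok.NEWLINE, "\n"),
    (Tok.STRING, "second"), (Tok.NEWLINE, "\n"),
    (Tok.BRACE_CLOSE, "}"),
]
print(list(iter_block(tokens, "example")))
```

Because the real method draws from the tokenizer itself, the caller can take or put back further tokens between iterations, which this list-based model does not show.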

expect(token: Token, skip_newline: bool = True) → str

Consume the next token, which should be the given type.

If it is not, this raises an error. If skip_newline is true, newlines will be skipped over. This does not apply if the desired token is newline.

class srctools.tokenizer.Tokenizer(
data: str | Iterable[str],
filename: str | PathLike[str] | bytes | None = None,
error: type[TokenSyntaxError] = TokenSyntaxError,
*,
periodic_callback: Callable[[], object] | None = None,
string_bracket: bool = False,
string_parens: bool = True,
allow_escapes: bool = True,
allow_star_comments: bool = False,
preserve_comments: bool = False,
colon_operator: bool = False,
plus_operator: bool = False,
)

Processes text data into groups of tokens.

This mainly groups strings and removes comments.

Due to many inconsistencies in Valve’s parsing of files, several options are available to control whether different syntaxes are accepted.

periodic_callback: Callable[[], object] | None

If set, this is called periodically after a few lines are parsed. Useful for aborting a parsing operation in response to external events.

string_bracket: bool

If set, [bracket] blocks are parsed as a single string-like block. If disabled these are parsed as BRACK_OPEN, STRING then BRACK_CLOSE.

string_parens: bool

If set, (parenthesised) blocks are parsed as a single string-like block. If disabled these are parsed as PAREN_OPEN, STRING then PAREN_CLOSE.

allow_escapes: bool

This determines whether \n-style escapes are expanded.

allow_star_comments: bool

If enabled, this allows /* */ comments. Otherwise, an immediate error is produced.

colon_operator: bool

This controls whether : produces COLON tokens, or is treated as part of a bare string.

plus_operator: bool

This controls whether + produces PLUS tokens, or is treated as part of a bare string.

preserve_comments: bool

Token.COMMENT tokens are produced if this is set. Otherwise, comments are removed from the stream.

class srctools.tokenizer.IterTokenizer(
source: Iterable[tuple[Token, str]],
filename: str | PathLike[str] | bytes | None = None,
error: type[TokenSyntaxError] = TokenSyntaxError,
)

Wraps a token iterator to provide the tokenizer interface.

This is useful to pre-process a token stream before parsing it with other code.

source: Iterator[tuple[Token, str]]

The underlying iterator which tokens are sourced from.
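
A pre-processing step of the kind described here can be written as a generator that reads (token, value) pairs and yields a modified stream. The sketch below is illustrative only: Tok is a stand-in for Token and tidy is a hypothetical helper that drops comments and collapses runs of newlines. With the real classes, the generator's output would be what gets passed as source.

```python
from collections.abc import Iterable, Iterator
from enum import Enum, auto


class Tok(Enum):
    """Stand-in for a few of the token types listed above."""
    STRING = auto()
    NEWLINE = auto()
    COMMENT = auto()


def tidy(source: Iterable[tuple[Tok, str]]) -> Iterator[tuple[Tok, str]]:
    """Drop comments and collapse consecutive newlines into one."""
    last_was_newline = False
    for tok, value in source:
        if tok is Tok.COMMENT:
            continue  # comments vanish without affecting newline tracking
        if tok is Tok.NEWLINE:
            if last_was_newline:
                continue  # already emitted a newline for this run
            last_was_newline = True
        else:
            last_was_newline = False
        yield tok, value


raw = [
    (Tok.STRING, "key"), (Tok.NEWLINE, "\n"),
    (Tok.COMMENT, " a note"), (Tok.NEWLINE, "\n"),
    (Tok.STRING, "value"),
]
for tok, value in tidy(raw):
    print(tok.name, repr(value))
```

Keeping the transform lazy means tokens are only pulled from the underlying source as the consumer asks for them.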