srctools.tokenizer

Parses text into groups of tokens.

This is used internally for parsing KV1, text DMX, FGDs, VMTs, etc. If available, a faster Cython-optimised version is used in place of the pure-Python implementation.

The BaseTokenizer class implements various helper functions for navigating through the token stream. The Tokenizer class then takes text file objects, a full string or an iterable of strings and actually parses it into tokens, while IterTokenizer allows transforming the stream before the destination receives it.

Once the tokenizer is created, either iterate over it or call the tokenizer to fetch the next token/value pair. Lookahead is supported through the BaseTokenizer.peek() and BaseTokenizer.push_back() methods. Tokenizers also track the current line number as data is read, letting you raise BaseTokenizer.error(...) to easily produce an exception listing the relevant line number and filename.

Character escapes match utlbuffer.cpp in the SDK. Specifically, the following characters are escaped: \n, \t, \v, \b, \r, \f, \a, \, ?, ' and ". / and ? are accepted as escapes when parsing, but are not produced when escaping text since they’re unambiguous.

Constants

class srctools.tokenizer.Token

Bases: Enum

A token type produced by the tokenizer.

property has_value: bool: If true, this type has an associated value.

EOF = 0: Produced indefinitely after the end of the file is reached.

STRING = 1: Quoted or unquoted text.

NEWLINE = 2: Produced at the end of every line.

PAREN_ARGS = 3: Parenthesised (data).

DIRECTIVE = 4: #name (automatically casefolded).

COMMENT = 5: A // or /* */ comment.

BRACE_OPEN = 6: A { character.

BRACE_CLOSE = 7: A } character.

PAREN_OPEN = 8: A ( character. Only used if PAREN_ARGS is not.

PAREN_CLOSE = 9: A ) character.

PROP_FLAG = 11: A [!flag] block.

BRACK_OPEN = 12: A [ character. Only used if PROP_FLAG is not.

BRACK_CLOSE = 13: A ] character.

COLON = 14: A : character, if Tokenizer.colon_operator is enabled.

EQUALS = 15: A = character.

PLUS = 16: A + character, if Tokenizer.plus_operator is enabled.

COMMA = 17: A , character.

srctools.tokenizer.BARE_DISALLOWED: Final = frozenset({'\t', '\n', '\r', ' ', '"', "'", '(', ')', ',', ';', '=', '[', ']', '{', '}'}): Characters not allowed for bare strings. These must be quoted.
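
As a rough illustration of how this set might be used when writing files, the sketch below decides whether a value can be emitted as a bare string. The set is copied from the value listed above so that the snippet runs without srctools installed. needs_quotes is a hypothetical helper, not part of the library, and treating the empty string as needing quotes is an assumption.

```python
# Copied from the documented value so this runs standalone; in real code,
# import BARE_DISALLOWED from srctools.tokenizer instead.
BARE_DISALLOWED = frozenset({
    '\t', '\n', '\r', ' ', '"', "'", '(', ')', ',', ';', '=',
    '[', ']', '{', '}',
})


def needs_quotes(value: str) -> bool:
    """Return True if value cannot be written as a bare string.

    Hypothetical helper. An empty value is assumed to need quotes,
    since a bare empty string would be invisible in the output.
    """
    return not value or not BARE_DISALLOWED.isdisjoint(value)
```

For instance, needs_quotes("hello world") is true because of the space, while a path such as "models/props/crate.mdl" contains no disallowed characters.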

srctools.tokenizer.escape_text(text: str, multiline: bool = False) → str

Escape special characters and backslashes, so tokenising reproduces them.

This matches utlbuffer.cpp in the SDK. The following characters are escaped: \t, \v, \b, \r, \f, \a, \ and ". The escapes for /, ' and ? are accepted when parsing, but not produced since those characters are unambiguous. In addition, \n is escaped only if multiline is false.

Parameters:

text – The text to escape.
multiline – If set, allow \n unchanged.
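
The rules above can be modelled in a few lines of pure Python. This is a minimal sketch of the documented behaviour, not the library's implementation; escape_text_sketch and the _ESCAPES table are hypothetical names used only for illustration.

```python
# Characters the description above lists as always escaped.
_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\t": "\\t",
    "\v": "\\v",
    "\b": "\\b",
    "\r": "\\r",
    "\f": "\\f",
    "\a": "\\a",
}


def escape_text_sketch(text: str, multiline: bool = False) -> str:
    """Escape text following the documented rules (illustrative only)."""
    out = []
    for char in text:
        if char == "\n" and not multiline:
            # Newlines are only escaped when multiline output is not allowed.
            out.append("\\n")
        else:
            # Characters without an entry, such as / ' and ?, pass through.
            out.append(_ESCAPES.get(char, char))
    return "".join(out)
```

With multiline left as false, a value containing a line break is written on a single line with a literal \n sequence. With multiline set, the line break is kept and only the other characters are escaped.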

Errors

exception srctools.tokenizer.TokenSyntaxError( message: str, file: str | PathLike[str] | None = None, line: int | None = None, )

An error that occurred when parsing a file.

Normally this is created via BaseTokenizer.error() which formats text into the error and includes the filename/line number from the tokenizer.

The string representation will include the provided file and line number if present.

mess: str: The error message that occurred.

file: str | PathLike[str] | None: The filename of the file being parsed, or None if not known.

line_num: int | None: The line where the error occurred, or None if not applicable (EOF, for instance).

srctools.tokenizer.format_exc_fileinfo( msg: str, file: str | PathLike[str] | None, line_num: int | None, ) → str

If a line number or file is provided, include those in the error message.

This is the logic for the str() form of TokenSyntaxError.

BaseTokenizer.error_type: type[TokenSyntaxError]: The exception class to produce if an error occurs. This must be a subtype of TokenSyntaxError, since it is passed the line number and filename in addition to the error message. The error() method can be used to intelligently construct an instance to raise.

BaseTokenizer.error( message: str | Token, /, *args: object, ) → TokenSyntaxError

Construct a syntax error exception, for the caller to raise.

This returns the TokenSyntaxError instance, with line number and filename attributes filled in. The message can be a Token with the associated string value to produce a wrong token error, or a string which will be {}-formatted with the positional args if they are present.
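
The "formatted with the positional args if they are present" rule can be pictured with the sketch below. It assumes that a message supplied without arguments is used verbatim, which is how the description above reads; format_message is a hypothetical stand-in, not the library's code.

```python
def format_message(message: str, *args: object) -> str:
    """Apply {}-formatting only when positional args were supplied."""
    if args:
        return message.format(*args)
    # No args: the message is used as-is, so literal braces are safe.
    return message
```

Under this assumption, a message such as "Unexpected {" needs no brace doubling when passed alone, while "Bad key {}" with one argument has the placeholder filled in.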

Main API

class srctools.tokenizer.BaseTokenizer( filename: str | PathLike[str] | bytes | None, error: type[TokenSyntaxError] | None, )

Provides an interface for processing text into tokens.

It then provides tools for using those to parse data. This is an abc.ABC; a subclass must be used to provide a source for the tokens.

filename: str | None: The filename that is being parsed. This is passed along to the error class, to produce relevant errors.

line_num: int: The line number of the last token. Can be changed, but is automatically updated whenever Token.NEWLINE tokens are seen.

push_back(tok: Token, value: str | None = None) → None

Return a token, so it will be reproduced when called again.

The value is required for Token.STRING, PAREN_ARGS and PROP_FLAG, but ignored for other token types.

peek( consume_newlines: bool = False, ) → tuple[Token, str]

Peek at the next token, without removing it from the stream.

Parameters:

consume_newlines – Skip over newlines until a non-newline is found. All tokens are preserved.
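
To show how peek() and push_back() relate, here is a toy model of push-back lookahead. It is a sketch under stated assumptions, not the srctools implementation: LookaheadSketch is a hypothetical class, token types are plain strings rather than Token members, and the consume_newlines option is not modelled.

```python
from collections.abc import Iterable, Iterator


class LookaheadSketch:
    """Toy model of push-back lookahead; not the srctools implementation."""

    def __init__(self, source: Iterable[tuple[str, str]]) -> None:
        self._source: Iterator[tuple[str, str]] = iter(source)
        # Tokens returned to the stream, most recently pushed last.
        self._pushback: list[tuple[str, str]] = []

    def __call__(self) -> tuple[str, str]:
        """Fetch the next token, preferring any pushed-back ones."""
        if self._pushback:
            return self._pushback.pop()
        # Mirror the documented behaviour of producing EOF indefinitely.
        return next(self._source, ("EOF", ""))

    def push_back(self, tok: str, value: str = "") -> None:
        """Return a token so the next call reproduces it."""
        self._pushback.append((tok, value))

    def peek(self) -> tuple[str, str]:
        """Look at the next token, then return it to the stream."""
        tok_and_value = self()
        self.push_back(*tok_and_value)
        return tok_and_value
```

In this model peek() is simply a fetch followed by a push_back(), which is why a peeked token is still delivered by the next call.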

for ... in skipping_newlines() → Iterator[tuple[Token, str]]: Iterate over the tokens, skipping newlines.

for ... in block( name: str, consume_brace: bool = True, ) → Iterator[str]

Helper iterator for parsing keyvalue style blocks.

This will first consume a {. Then it will skip newlines, and output each string section found. When } is found it terminates; any other token produces an appropriate error. This is safely re-entrant, and tokens can be taken or put back as required.
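
The control flow described above can be sketched over a plain list of token pairs. This is a simplified model rather than the library's code: Tok is a stand-in for a few Token members, block_sketch is a hypothetical name, ValueError stands in for TokenSyntaxError, and skipping newlines before the opening brace is an assumption.

```python
from collections.abc import Iterable, Iterator
from enum import Enum


class Tok(Enum):
    """Stand-in for a few srctools.tokenizer.Token members."""
    STRING = 1
    NEWLINE = 2
    BRACE_OPEN = 6
    BRACE_CLOSE = 7


def block_sketch(tokens: Iterable[tuple[Tok, str]], name: str) -> Iterator[str]:
    """Yield each string found inside a { ... } block."""
    stream = iter(tokens)
    # First consume the opening brace (newlines before it are assumed skipped).
    for tok, value in stream:
        if tok is Tok.BRACE_OPEN:
            break
        if tok is not Tok.NEWLINE:
            raise ValueError(f"Expected '{{' for {name} block, got {tok.name}")
    else:
        raise ValueError(f"Unexpected end of file before {name} block")
    # Then yield strings until the closing brace is reached.
    for tok, value in stream:
        if tok is Tok.STRING:
            yield value
        elif tok is Tok.BRACE_CLOSE:
            return
        elif tok is not Tok.NEWLINE:
            raise ValueError(f"Unexpected {tok.name} in {name} block")
    raise ValueError(f"Unclosed {name} block")
```

Because the generator only pulls one token at a time from the shared stream, the caller can read further tokens (for example, the value following each key) between iterations, which is the re-entrant usage the description refers to.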

expect(token: Token, skip_newline: bool = True) → str

Consume the next token, which should be the given type.

If it is not, this raises an error. If skip_newline is true, newlines will be skipped over. This does not apply if the desired token is newline.

class srctools.tokenizer.Tokenizer( data: str | Iterable[str], filename: str | PathLike[str] | bytes | None = None, error: type[TokenSyntaxError] = TokenSyntaxError, *, periodic_callback: Callable[[], object] | None = None, string_bracket: bool = False, string_parens: bool = True, allow_escapes: bool = True, allow_star_comments: bool = False, preserve_comments: bool = False, colon_operator: bool = False, plus_operator: bool = False, )

Processes text data into groups of tokens.

This mainly groups strings and removes comments.

Due to many inconsistencies in Valve’s parsing of files, several options are available to control whether different syntaxes are accepted.

periodic_callback: Callable[[], object] | None: If set, is called periodically after a few lines are parsed. Useful to abort parsing operations from external factors.

string_bracket: bool: If set, [bracket] blocks are parsed as a single string-like block. If disabled these are parsed as BRACK_OPEN, STRING then BRACK_CLOSE.

string_parens: bool: If set, (bracket) blocks are parsed as a single string-like block. If disabled these are parsed as PAREN_OPEN, STRING then PAREN_CLOSE.

allow_escapes: bool: This determines whether \n-style escapes are expanded.

allow_star_comments: bool: If enabled, this allows /* */ comments. Otherwise, an immediate error is produced.

colon_operator: bool: This controls whether : produces COLON tokens, or is treated as part of a bare string.

plus_operator: bool: This controls whether + produces PLUS tokens, or is treated as part of a bare string.

preserve_comments: bool: Token.COMMENT are produced if this is set.

class srctools.tokenizer.IterTokenizer( source: Iterable[tuple[Token, str]], filename: str | PathLike[str] | bytes | None = None, error: type[TokenSyntaxError] = TokenSyntaxError, )

Wraps a token iterator to provide the tokenizer interface.

This is useful to pre-process a token stream before parsing it with other code.

source: Iterator[tuple[Token, str]]: The underlying iterator which tokens are sourced from.
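
A pre-processing step of the kind described above is usually just a generator over (token, value) pairs. The sketch below drops comment tokens from a stream. It uses a stand-in Tok enum so that it runs without srctools installed, and strip_comments is a hypothetical helper; with the library available, a generator like this written against the real Token enum could be passed as the source argument.

```python
from collections.abc import Iterable, Iterator
from enum import Enum


class Tok(Enum):
    """Stand-in for a few srctools.tokenizer.Token members."""
    STRING = 1
    NEWLINE = 2
    COMMENT = 5


def strip_comments(
    source: Iterable[tuple[Tok, str]],
) -> Iterator[tuple[Tok, str]]:
    """Pass every token through unchanged, except comments."""
    for tok, value in source:
        if tok is not Tok.COMMENT:
            yield tok, value
```

Since the generator is lazy, tokens are only pulled from the underlying source as the consumer asks for them.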