srctools.tokenizer

Parses text into groups of tokens.

This is used internally for parsing KV1, text DMX, FGDs, VMTs, etc. If available, a faster Cython-optimised version replaces this implementation.

The BaseTokenizer class implements various helper functions for navigating through the token stream. The Tokenizer class then takes text file objects, a full string or an iterable of strings and actually parses it into tokens, while IterTokenizer allows transforming the stream before the destination receives it.

Once the tokenizer is created, either iterate over it or call the tokenizer to fetch the next token/value pair. One token of lookahead is supported, accessed by the BaseTokenizer.peek() and BaseTokenizer.push_back() methods. Tokenizers also track the current line number as data is read, letting you raise BaseTokenizer.error(...) to easily produce an exception listing the relevant line number and filename.
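The call, peek and push-back behaviour described above can be illustrated with a small standalone sketch. It does not use srctools and is not its implementation: `Kind` and `LookaheadStream` are hypothetical names invented here, and the sketch only assumes that tokens arrive as (type, value) pairs and that EOF repeats once the input runs out.

```python
from enum import Enum
from typing import Iterable, Iterator, Optional, Tuple


class Kind(Enum):
    """Hypothetical stand-in for a token type enum."""
    EOF = 0
    STRING = 1
    NEWLINE = 2


Pair = Tuple[Kind, str]


class LookaheadStream:
    """Wrap an iterable of (kind, value) pairs with one token of lookahead."""

    def __init__(self, source: Iterable[Pair]) -> None:
        self._source: Iterator[Pair] = iter(source)
        self._pushed: Optional[Pair] = None
        self.line_num = 1

    def __call__(self) -> Pair:
        """Fetch the next pair, preferring one that was pushed back."""
        if self._pushed is not None:
            pair, self._pushed = self._pushed, None
            return pair
        # Once the source is exhausted, keep producing EOF.
        pair = next(self._source, (Kind.EOF, ''))
        if pair[0] is Kind.NEWLINE:
            self.line_num += 1
        return pair

    def push_back(self, kind: Kind, value: str = '') -> None:
        """Store a single pair so the next call reproduces it."""
        if self._pushed is not None:
            raise ValueError('Only one token can be pushed back at once.')
        self._pushed = (kind, value)

    def peek(self) -> Pair:
        """Look at the next pair without removing it from the stream."""
        pair = self()
        self.push_back(*pair)
        return pair


stream = LookaheadStream([(Kind.STRING, 'key'), (Kind.NEWLINE, '\n')])
peeked = stream.peek()   # Looks ahead without consuming.
first = stream()         # The same pair is produced again.
```

Because a pushed-back pair bypasses the source, a newline that is peeked and then read is only counted once in `line_num`.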

exception srctools.tokenizer.TokenSyntaxError( message: str, file: str | _os.PathLike[str] | None = None, line: int | None = None, )

An error that occurred when parsing a file.

Normally this is created via BaseTokenizer.error() which formats text into the error and includes the filename/line number from the tokenizer.

The string representation will include the provided file and line number if present.

mess: str: The error message that occurred.

file: str | _os.PathLike[str] | None: The filename of the file being parsed, or None if not known.

line_num: int | None: The line where the error occurred, or None if not applicable (EOF, for instance).

srctools.tokenizer.BARE_DISALLOWED: Final = frozenset({'\t', '\n', '\r', ' ', '"', "'", '(', ')', ',', ';', '=', '[', ']', '{', '}'}): Characters not allowed for bare strings. These must be quoted.

class srctools.tokenizer.Token(value)

Bases: Enum

A token type produced by the tokenizer.

EOF = 0: Produced indefinitely after the end of the file is reached.

STRING = 1: Quoted or unquoted text.

NEWLINE = 2: Produced at the end of every line.

PAREN_ARGS = 3: Parenthesised (data).

DIRECTIVE = 4: #name (automatically casefolded).

COMMENT = 5: A // or /* */ comment.

BRACE_OPEN = 6: A { character.

BRACE_CLOSE = 7: A } character.

PROP_FLAG = 11: A [!flag] block.

BRACK_OPEN = 12: A [ character. Only produced when PROP_FLAG is not.

BRACK_CLOSE = 13: A ] character.

COLON = 14: A : character, if colon_operator is enabled.

EQUALS = 15: A = character.

PLUS = 16: A + character, if Tokenizer.plus_operator is enabled.

COMMA = 17: A , character.

has_value: If true, this type has an associated value.

class srctools.tokenizer.BaseTokenizer( filename: str | _os.PathLike[str] | None, error: Type[TokenSyntaxError], )

Provides an interface for processing text into tokens.

It then provides tools for using those to parse data. This is an abc.ABC; a subclass must be used to provide a source for the tokens.

filename: str | None: The filename that is being parsed. This is passed along to the error class, to produce relevant errors.

error_type: Type[TokenSyntaxError]: The exception class to produce if an error occurs. This must be a subtype of TokenSyntaxError, since it is passed the line number and filename in addition to the error message. The error() method can be used to intelligently construct an instance to raise.

line_num: int: The line number of the last token. Can be changed, but is automatically updated whenever Token.NEWLINE tokens are seen.

error( message: str | Token, *args: object, ) → TokenSyntaxError

Construct a syntax error exception, ready to be raised.

This returns the TokenSyntaxError instance, with line number and filename attributes filled in. The message can be a Token with the associated string value to produce a wrong token error, or a string which will be {}-formatted with the positional args if they are present.
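As a rough, standalone illustration of the return-then-raise pattern and the optional {}-formatting, here is a minimal sketch. `SketchSyntaxError` and `SketchTokenizer` are hypothetical stand-ins, not the srctools classes, and only the string form of the message is covered (the Token form is left out).

```python
from typing import Optional


class SketchSyntaxError(Exception):
    """Hypothetical stand-in for an exception carrying file and line details."""

    def __init__(self, message: str, file: Optional[str], line: Optional[int]) -> None:
        super().__init__(message)
        self.mess = message
        self.file = file
        self.line_num = line


class SketchTokenizer:
    """Hypothetical holder for the filename and the current line number."""

    def __init__(self, filename: Optional[str]) -> None:
        self.filename = filename
        self.line_num = 1

    def error(self, message: str, *args: object) -> SketchSyntaxError:
        # Only format when args are given, so a plain message may contain braces.
        if args:
            message = message.format(*args)
        # Return instead of raising, so the caller writes `raise tok.error(...)`.
        return SketchSyntaxError(message, self.filename, self.line_num)


tok = SketchTokenizer('example.vmt')
tok.line_num = 12
try:
    raise tok.error('Unknown key "{}"!', 'basetexture2')
except SketchSyntaxError as exc:
    caught = exc
```

Keeping the `raise` at the call site makes the control flow visible to readers and to static analysis, which is one plausible reason for returning the exception rather than raising it inside the helper.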

push_back( tok: Token, value: str | None = None, ) → None

Return a token to the stream, so it will be produced again by the next call.

Only one token can be pushed back at once. The value is required for Token.STRING, PAREN_ARGS and PROP_FLAG, but ignored for other token types.

peek() → Tuple[Token, str]: Peek at the next token, without removing it from the stream.

skipping_newlines() → Iterator[Tuple[Token, str]]: Iterate over the tokens, skipping newlines.

block( name: str, consume_brace: bool = True, ) → Iterator[str]

Helper iterator for parsing keyvalue style blocks.

This will first consume a { if consume_brace is true. Then it will skip newlines, and output each string section found. When } is found it terminates; anything else produces an appropriate error. This is safely re-entrant, and tokens can be taken or put back as required.
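The block-parsing behaviour can be sketched as a plain generator over a shared token iterator. This is a standalone illustration rather than the srctools code: `Kind` and this free-standing `block` function are hypothetical, it always consumes the opening brace, and it raises `ValueError` where a real tokenizer would produce its own error type.

```python
from enum import Enum
from typing import Dict, Iterator, List, Tuple


class Kind(Enum):
    """Hypothetical stand-in for the token type enum."""
    STRING = 1
    NEWLINE = 2
    BRACE_OPEN = 6
    BRACE_CLOSE = 7


def block(tokens: Iterator[Tuple[Kind, str]], name: str) -> Iterator[str]:
    """Yield each string inside a {...} block read from a shared iterator."""
    kind, value = next(tokens, (None, ''))
    if kind is not Kind.BRACE_OPEN:
        raise ValueError(f'Expected an opening brace for {name}, got {value!r}')
    for kind, value in tokens:
        if kind is Kind.BRACE_CLOSE:
            return
        if kind is Kind.STRING:
            yield value
        elif kind is not Kind.NEWLINE:
            raise ValueError(f'Unexpected {value!r} in {name} block')
    # The iterator ran out before a closing brace was seen.
    raise ValueError(f'Unclosed {name} block')


tokens = iter([
    (Kind.BRACE_OPEN, '{'), (Kind.NEWLINE, '\n'),
    (Kind.STRING, 'outer'), (Kind.BRACE_OPEN, '{'),
    (Kind.STRING, 'inner'), (Kind.BRACE_CLOSE, '}'), (Kind.NEWLINE, '\n'),
    (Kind.BRACE_CLOSE, '}'),
])

result: Dict[str, List[str]] = {}
for key in block(tokens, 'root'):
    # Re-entrant use: the nested call consumes the inner braces from the same iterator.
    result[key] = list(block(tokens, key))
```

Since both generators pull from the same iterator, the outer loop resumes exactly where the nested block stopped.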

expect( token: Token, skip_newline: bool = True, ) → str

Consume the next token, which should be the given type.

If it is not, this raises an error. If skip_newline is true, newlines will be skipped over. This does not apply if the desired token is newline.
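A minimal standalone sketch of this behaviour follows, assuming tokens arrive as (type, value) pairs from an iterator. `Kind` and this free-standing `expect` function are hypothetical, and `ValueError` stands in for the tokenizer's own error type.

```python
from enum import Enum
from typing import Iterator, Tuple


class Kind(Enum):
    """Hypothetical stand-in for the token type enum."""
    STRING = 1
    NEWLINE = 2
    BRACE_OPEN = 6


def expect(
    tokens: Iterator[Tuple[Kind, str]],
    wanted: Kind,
    skip_newline: bool = True,
) -> str:
    """Consume the next token and return its value, or fail if the type differs."""
    for kind, value in tokens:
        # Check for a match first, so asking for NEWLINE never skips one.
        if kind is wanted:
            return value
        if skip_newline and kind is Kind.NEWLINE:
            continue
        raise ValueError(f'Expected {wanted.name}, got {kind.name} ({value!r})')
    raise ValueError(f'Expected {wanted.name}, got end of input')


tokens = iter([(Kind.NEWLINE, '\n'), (Kind.NEWLINE, '\n'), (Kind.BRACE_OPEN, '{')])
brace = expect(tokens, Kind.BRACE_OPEN)  # Both newlines are skipped.
```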

class srctools.tokenizer.Tokenizer( data: str | Iterable[str], filename: str | _os.PathLike[str] | None = None, error: Type[TokenSyntaxError] = TokenSyntaxError, *, string_bracket: bool = False, allow_escapes: bool = True, allow_star_comments: bool = False, preserve_comments: bool = False, colon_operator: bool = False, plus_operator: bool = False, )

Processes text data into groups of tokens.

This mainly groups strings and removes comments.

Due to many inconsistencies in Valve’s parsing of files, several options are available to control whether different syntaxes are accepted.

string_bracket: bool: If set, [bracket] blocks are parsed as a single string-like block. If disabled these are parsed as BRACK_OPEN, STRING then BRACK_CLOSE.

allow_escapes: bool: This determines whether \n-style escapes are expanded.

allow_star_comments: bool: If enabled, this allows /* */ comments. Otherwise, an immediate error is produced.

colon_operator: bool: This controls whether : produces COLON tokens, or is treated as part of a bare string.

plus_operator: bool: This controls whether + produces PLUS tokens, or is treated as part of a bare string.

preserve_comments: bool: If set, Token.COMMENT tokens are produced instead of comments being removed.

class srctools.tokenizer.IterTokenizer( source: Iterable[Tuple[Token, str]], filename: str | _os.PathLike[str] = '', error: Type[TokenSyntaxError] = TokenSyntaxError, )

Wraps a token iterator to provide the tokenizer interface.

This is useful to pre-process a token stream before parsing it with other code.
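A pre-processing step of this kind is typically just a generator that takes (type, value) pairs and yields transformed pairs. The standalone sketch below shows one such filter; `Kind`, `substitute` and the replacement table are hypothetical names made up for illustration. In real use, an iterable like the one returned here would be what is passed as the `source` argument.

```python
from enum import Enum
from typing import Dict, Iterable, Iterator, Tuple


class Kind(Enum):
    """Hypothetical stand-in for the token type enum."""
    STRING = 1
    NEWLINE = 2


def substitute(
    tokens: Iterable[Tuple[Kind, str]],
    replacements: Dict[str, str],
) -> Iterator[Tuple[Kind, str]]:
    """Swap out known string values, passing every other token through unchanged."""
    for kind, value in tokens:
        if kind is Kind.STRING:
            value = replacements.get(value, value)
        yield kind, value


raw = [(Kind.STRING, '$mapname'), (Kind.STRING, 'sky'), (Kind.NEWLINE, '\n')]
processed = list(substitute(raw, {'$mapname': 'test_map'}))
```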

srctools.tokenizer.escape_text(text: str) → str

Escape special characters and backslashes, so tokenising reproduces them.

Specifically: backslash (\), double quote ("), tab, and newline.
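For illustration only, the same idea can be written with a translation table. This standalone sketch is not the srctools function: `escape_sketch` is a hypothetical name, and the output forms (`\\`, `\"`, `\t`, `\n`) are an assumption based on the \n-style escapes mentioned for allow_escapes.

```python
# Map each special character to its assumed backslash-escaped form.
_ESCAPES = str.maketrans({
    '\\': '\\\\',
    '"': '\\"',
    '\t': '\\t',
    '\n': '\\n',
})


def escape_sketch(text: str) -> str:
    """Escape backslashes, double quotes, tabs and newlines in one pass."""
    return text.translate(_ESCAPES)


quoted = escape_sketch('say "hi"\n')
path = escape_sketch('C:\\dir')
```

Using a single translate pass avoids the double-escaping that chained `str.replace()` calls can cause when backslashes are not handled first.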

srctools.tokenizer.format_exc_fileinfo( msg: str, file: str | _os.PathLike[str] | None, line_num: int | None, ) → str: If a line number or file is provided, include those in the error message.