A Simple Scripting Language – Part 3 

Part 1 | Part 2 | Part 3 | Part 4 | Part 5

In this post we’ll be looking at Lexical Analysis. In the previous article in the series, I created an interface for reading and peeking at characters from a file; now it is time to turn those characters into meaningful symbols. First we’ll define the symbols in the language, then look at some pseudo-code, and finally look at other uses for our Lexer.

Classically, Lexical Analysis is performed by creating a series of Regular Expressions defining each type of symbol in a language and converting them into a single Finite State Machine (FSM). The resulting FSM usually has hundreds of states and is more often than not created with a specialized tool. FSMs are very fast; however, I’m going to take a brute-force approach instead.

The majority of the symbols to be matched in a programming language are simple strings (e.g. “int”, “for”, “{”, “}”, “;”, etc.), where a string compare will suffice. Strings, numbers, identifiers and single-line comments are more complex, and this is where I’ll focus the majority of my time.

  • A string will be defined as a double-quote character (") followed by any number of non-double-quote characters and terminated by a second double-quote (").
  • A number will be defined as the digit 0, or a digit 1-9 followed by any number of digits 0-9.
  • An identifier (i.e. a variable or function name) will be defined as starting with an underscore (_) or an alphabetic character (a-z, A-Z), followed by zero or more alphabetic, numeric, or underscore characters (a-z, A-Z, 0-9, _).
  • A comment will be defined as starting with two forward slashes (//) and terminating at the first newline character found. Comments are useful to the programmer as documentation, but are meaningless to a compiler, so any comments found should be discarded.
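These character classes can be captured with a few small helper predicates. The sketch below is illustrative; the function names are my own and not part of the Lexer described here.

```c
#include <ctype.h>

/* Hypothetical helper predicates (not part of the article's Lexer.h)
 * encoding the character classes defined above. */

/* First character of an identifier: underscore or alphabetic. */
static int is_ident_start(int c)
{
    return c == '_' || isalpha(c);
}

/* Subsequent identifier characters: alphabetic, numeric or underscore. */
static int is_ident_part(int c)
{
    return c == '_' || isalnum(c);
}

/* First character of a number: any digit.  Whether more digits follow
 * depends on whether that first digit was 0. */
static int is_number_start(int c)
{
    return isdigit(c) != 0;
}
```

Centralizing these tests keeps the main token-reading loop readable, since the same classes are checked in several places.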

Reading a Token

The above steps can be combined into the following pseudo-code, which will form the basis of the Lexer_read function.

  1. remove any whitespace
  2. while the next two characters are “//”
    1. throw out all characters until a new line is found
    2. remove all whitespace
  3. for each symbol (simple strings)
    1. if the next characters match it   (FOUND A SYMBOL)
  4. if the next character is a 0 (FOUND AN INTEGER)
  5. if the next character is 1-9
    1. continue reading while the next character is 0-9
    2. (FOUND AN INTEGER)
  6. if the next character is a double quote
    1. continue reading until the next character is another double quote
    2. (FOUND A STRING)
  7. if the next character is an underscore or alphabetic (a-z,A-Z)
    1. continue reading while the next character is a-z, A-Z, 0-9 or _
    2. (FOUND AN ID)
  8. If none of the above match, it's an invalid Token
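The pseudo-code above can be sketched roughly as follows. Note this is a simplified illustration, not the actual Lexer_read: it reads from an in-memory string instead of the LookaheadReader, and the token-type names and the fixed symbol table are assumptions of mine.

```c
#include <ctype.h>
#include <string.h>

/* Illustrative token types; the real Lexer uses SYMBOL_* constants. */
enum { TOK_EOF, TOK_SYMBOL, TOK_INT, TOK_STRING, TOK_ID, TOK_ERROR };

/* A fixed symbol table standing in for Lexer_addSymbol.  Longer symbols
 * are listed first so "==" is matched before "=". */
static const char *symbols[] = { "==", "=", ";", "{", "}", 0 };

/* Reads one token starting at *src, advances *src past it, and copies
 * the token text into buf.  Returns the token type. */
static int read_token(const char **src, char *buf, int bufsize)
{
    const char *p = *src;
    int n = 0;

    for (;;) {
        while (isspace((unsigned char)*p)) p++;     /* 1. skip whitespace  */
        if (p[0] == '/' && p[1] == '/') {           /* 2. skip // comments */
            while (*p && *p != '\n') p++;
        } else {
            break;
        }
    }

    if (*p == '\0') { buf[0] = 0; *src = p; return TOK_EOF; }

    for (int i = 0; symbols[i]; i++) {              /* 3. simple strings   */
        int len = (int)strlen(symbols[i]);
        if (strncmp(p, symbols[i], len) == 0) {
            memcpy(buf, p, len); buf[len] = 0;
            *src = p + len;
            return TOK_SYMBOL;
        }
    }

    if (isdigit((unsigned char)*p)) {               /* 4./5. integers      */
        if (*p == '0') {                            /* a lone 0            */
            buf[n++] = *p++;
        } else {                                    /* 1-9 then any 0-9    */
            while (isdigit((unsigned char)*p) && n < bufsize - 1)
                buf[n++] = *p++;
        }
        buf[n] = 0; *src = p; return TOK_INT;
    }

    if (*p == '"') {                                /* 6. strings          */
        p++;
        while (*p && *p != '"' && n < bufsize - 1) buf[n++] = *p++;
        if (*p == '"') p++;                         /* closing quote       */
        buf[n] = 0; *src = p; return TOK_STRING;
    }

    if (*p == '_' || isalpha((unsigned char)*p)) {  /* 7. identifiers      */
        while ((*p == '_' || isalnum((unsigned char)*p)) && n < bufsize - 1)
            buf[n++] = *p++;
        buf[n] = 0; *src = p; return TOK_ID;
    }

    buf[0] = *p; buf[1] = 0; *src = p + 1;          /* 8. invalid token    */
    return TOK_ERROR;
}
```

One subtlety of the brute-force symbol loop: symbols must be tried longest-first, otherwise "=" would match before "==" ever gets a chance.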

The Interface

In the next article I’ll further define the syntax for the simple scripting language and start creating a Recursive Descent Parser. Just as a Lexer groups characters into meaningful tokens, a parser groups tokens into meaningful phrases (e.g. int x = 5; means assign 5 to the variable x). In order to build a parser, an interface for reading and peeking at the next Token/Symbol in the stream is required.

int Lexer_init(Lexer * lexer, const char * file);

Initializes a Lexer with a file path by creating a LookaheadReader under the hood. Returns 0 if the file could not be opened, and 1 otherwise.

void Lexer_addSymbol(Lexer * lexer, const char * symbol, int type);

// eg

#define SYMBOL_SEMICOLON 0

Lexer_addSymbol(lexer, ";", SYMBOL_SEMICOLON)

Defines a custom symbol with an index (0 to LEXER_MAX_USER_DEFINED_SYMBOLS – 1). All Tokens returned with text matching the specified string will also contain the numeric type specified.

int Lexer_next(Lexer * lexer, Token * token);

Gets the next token from the stream. Returns 0 if an End of File (SYMBOL_EOF) token was read, and 1 otherwise.

int Lexer_peek(Lexer * lexer, Token * token);

Peeks at the next token in the stream without consuming it. Returns 0 if an End of File (SYMBOL_EOF) token was read, and 1 otherwise.

void Lexer_dealloc(Lexer * lexer);

Closes the file and performs cleanup.

typedef struct Token
{
    int m_line;      // the line the token was found on
    int m_column;    // the column the token was found on
    int m_type;      // a numeric identifier listing what type of symbol was found
    string m_value;  // the text of the symbol
} Token;

The Token object contains the line and column the symbol/token was found on, the type of symbol and its text value. The type field is a built-in value (Lexer.h) of SYMBOL_INT, SYMBOL_ID, SYMBOL_STRING, SYMBOL_EOF (end of file), or one passed to Lexer_addSymbol. For character data I chose to create a custom reference-counted string structure. This allows me to return the same unique string for all instances of a symbol added using Lexer_addSymbol, as well as play nicely with a string Dictionary I will be using later. There is no one correct way to represent string data, so use whatever you prefer.
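As one possible shape for such a string type, consider the sketch below. This is my own minimal illustration of the reference-counting idea, not the structure actually used here.

```c
#include <stdlib.h>
#include <string.h>

/* A minimal reference-counted string: one shared allocation whose
 * lifetime is managed by a retain/release counter. */
typedef struct RCString {
    int   m_refs;   /* number of owners     */
    char *m_data;   /* NUL-terminated text  */
} RCString;

/* Allocates a new string with a reference count of 1. */
static RCString *RCString_create(const char *text)
{
    RCString *s = malloc(sizeof *s);
    s->m_refs = 1;
    s->m_data = malloc(strlen(text) + 1);
    strcpy(s->m_data, text);
    return s;
}

/* Adds an owner; returns the same string for convenient assignment. */
static RCString *RCString_retain(RCString *s)
{
    s->m_refs++;
    return s;
}

/* Drops an owner; frees the string when the last owner releases it. */
static void RCString_release(RCString *s)
{
    if (--s->m_refs == 0) {
        free(s->m_data);
        free(s);
    }
}
```

With this scheme, handing the same symbol text to many Tokens costs only a retain, rather than a fresh allocation and copy per Token.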

Other uses for a Lexer

For Trudy I’ve had to read JSON, Behaviour Trees, CSVs and Alias Wavefront OBJ files. All of these formats can be parsed using the above Lexer (albeit with some minor modifications).

I chose JSON to represent configuration data in the game because of its compactness and readability.  Instead of creating a whole tree structure in memory for each file, I processed each token as it was read.  Since I created the data, I can be certain that it will be in the correct order.

To parse JSON using the Lexer, symbols for “{”, “}”, “[”, “]”, “,” and “:” will need to be defined using Lexer_addSymbol.

#define SYMBOL_LBRACE  0      // {
#define SYMBOL_RBRACE  1      // }
#define SYMBOL_LARRAY  2      // [
#define SYMBOL_RARRAY  3      // ]
#define SYMBOL_COMMA   4      // ,
#define SYMBOL_COLON   5      // :

Beyond matching symbols, JSON contains object and array structures.

Array Matching

  1. Read a “[“
  2. While the next token isn’t “]”
    1. read the nth value of the array
    2. if the next token isn’t “]”, read the “,”.  If there is no comma, ERROR
  3. Read a “]”

Object Matching

  1. Read a “{“
  2. While the next token isn’t “}”
    1. read an ID, the name of the field
    2. read a “:”
    3. read the value of the field
    4. if the next token isn’t “}”, read the “,”.  If there is no comma, ERROR
  3. Read a “}”
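The array-matching steps can be sketched as below, operating on a list of already-lexed token strings. The function and the token-list representation are assumptions for illustration; object matching follows the same pattern with an extra ID and “:” read per field.

```c
#include <string.h>

/* Matches "[ value (, value)* ]" starting at tokens[*pos], following the
 * array-matching pseudo-code.  For simplicity a "value" here is any
 * single non-structural token (no nesting), and the token list is
 * assumed to be well-formed enough to contain the closing "]".
 * Returns the number of values read, or -1 on a syntax error. */
static int match_array(const char **tokens, int *pos)
{
    int count = 0;

    if (strcmp(tokens[*pos], "[") != 0) return -1;  /* 1. read a "["       */
    (*pos)++;

    while (strcmp(tokens[*pos], "]") != 0) {        /* 2. while not "]"    */
        (*pos)++;                                   /* 2.1 read the value  */
        count++;
        if (strcmp(tokens[*pos], "]") != 0) {       /* 2.2 expect a ","    */
            if (strcmp(tokens[*pos], ",") != 0) return -1;
            (*pos)++;
        }
    }
    (*pos)++;                                       /* 3. read the "]"     */
    return count;
}
```

Checking for “]” before demanding a comma is what allows both empty arrays and a final element with no trailing comma.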

You can see an example of a minimal JSON reader in this post’s code. Please note that if you wish to fully support JSON, single-quote strings, floating-point values and booleans should be added.

See you soon!

Follow Incubator Games or myself on twitter at  @incubatorgames or @igtristan respectively.

Posted by Tristan Campbell
@igtristan, programmer at Incubator Games