A Simple Scripting Language – Part 4 


Posted by Tristan on January 21st, 2011.

Categories: programming, trudy's mechanicals.

Part 1 | Part 2 | Part 3 | Part 4 | Part 5

In this post, I’ll be building a two pass Parser for a sub set of the final language.  This subset will be capable of basic variable initialization, and passing values to external and internal methods.  This Parser will be using the Lexer interface from the previous article in the series.

An Example Script

1
2
3
4
5
6
7
8
9
10
11
12
extern void doSomething: (string) str andInt: (int) i;

int doSomethingElse
{
   int x = 0;
   string str = "hi";
   [doSomething: str andInt: x];
   x = [doSomethingAgain: 99];
   return 1;
}

extern int doSomethingAgain: (int) x;

Untill this point, I haven’t posted any examples of what the actual syntax of the language will look like.  As you can see in the above example, it is similar to Objective-C but without objects (although I might add them later).  Similar to C like languages, the extern keyword declares that a user definable piece of code (a callback) will be invoked by the Virtual Machine that the language will be running on.  Also unlike C, but akin to modern languages like Java, there will be no pre-declaration of methods.

In order to allow a method to be called before it is declared, I need to create a multi-pass parser.  On the first pass I’ll read through all of the tokens in the file and create a dictionary of all the method signatures I’ve found and their parameter types.  On the second pass, whenever I encounter a method call, I can look at this dictionary to verify that the method signature exists and the correct types of parameters have been passed.

Passes:

  1. Determine all internal and external method signatures.
  2. Do everything else

Initially this parser will start out with only these two passes, but I may add additional ones in the future as needed.

Parsing the Language

Traditionally a parser is built from a series of productions written in a Context Free Grammar.   The left side of each production represents the name of the production being defined. The right side represents the Tokens and Productions that it will be composed of.

Here are the logical symbols I’ll be using to construct each Production:

  • ‘*’ stands for 0 or more of the token to the left of it
  • ‘+’ stands for 1 or more of the token to the left of it
  • ‘(‘ and ‘)’ group tokens
  • ‘|’ means or
  • anything prefixed with ‘Lexer_’ is a token type from the previous post’s Lexer.  (eg.  an identifier, string value, integer value)
  • anything in quotes (eg ‘int’) represents a user defined Token.

These are the productions needed to define the example script:

  1. File = (ExternalMethod | InternalMethod) *
  2. ExternalMethod = ‘extern’ MethodSignature ‘;’
  3. InternalMethod = MethodSignature Block
  4. MethodSignature =  ReturnType (Lexer_ID  | (Lexer_ID ‘:’ ‘(‘Type’)'  Lexer_ID ) +)
  5. ReturnType = ‘int’ | ‘string’ | ‘void’
  6. Type = ‘int’ | ‘string’
  7. Block = ‘{‘  Statement * ‘}’
  8. Statement = StringDeclareStatement | IntDeclareStatement | MethodCallStatement | AssignStatement | ReturnStatement
  9. StringDeclareStatement = (‘string’ Lexer_ID ‘=’ Expression ‘;’)| (‘string’ Lexer_ID ‘;’)
  10. IntDeclareStatement = (‘int’ Lexer_ID ‘=’ Expression ‘;’ ) | (‘int’ Lexer_ID ‘;’)
  11. MethodCallStatement = MethodCallExpression ‘;’
  12. AssignStatement = Lexer_ID ‘=’ Expression ‘;’
  13. ReturnStatement = (‘return’ Expression ‘;’) | (‘return’ ‘;’)
  14. Expression = Lexer_INT | Lexer_STRING | Lexer_ID | MethodCallExpression
  15. MethodCallExpression = ‘[' ( Lexer_ID  | (Lexer_ID ':' Expression)+ )  ']‘

Parser building tools often create a tree of nodes representing all the productions found in an entire file, this is known as an Abstract Syntax Tree.  However I’m going to create a Recursive Descent Parser where each of these rules , or group of rules will result in a C function.

Traversing through the Tokens in a File

In order to build a Recursive Decent Parser, I first need to create an interface to iterate through the output of the Lexer;  this is the TokenList.  This TokenList maintains an internal linked list of all the Tokens read from a file using the Lexer.   In addition the TokenList also provides functions for validating input and providing meaningful error messages to the user.  The TokenList interface is listed below:

int TokenList_init(TokenList * tl, Lexer * lexer);

This function initializes the TokenList with all of the Tokens that can be read from the passed Lexer.  Returns 1 if successful.

int TokenList_peekType(TokenList * tl);  -

This function returns the type of the next Token in the TokenList.  This corresponds to the m_type field in Token struct.

Token * TokenList_expect(TokenList * tl, int type, const char * message) ;

This function returns the next token in a list and if that token does not match the expected type it displays an error and stops compilation.

Token * TokenList_expectMultiple(TokenList * tl, int valid_symbol_count, int * array_of_symbols, const char * message);

This function is similar to TokenList_expect but allows for a choice of symbol types to match against.

void TokenList_error(TokenList * tl, Token * tok, const char * message);

This function displays an error blaming the token passed to it.  The message is displayed to screen along with the line and column that the Token was found on.  Eventually I’ll put in a user defined function pointer to be called instead, so that the choice of error handling behavior is up to the user.

Example: Using the TokenList to Parse an Integer Declaration Statement

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
Method * m = ...
TokenList * tl = ....

// if the next token is "int" its an Integer Declaration Statement
if (TokenList_peekType(tl) == SYMBOL_INT)
{
   // eat the "int" token
   TokenList_expect(tl, SYMBOL_INT, "");  

   // expect an ID that represents the variable name
   Token * variable_name = TokenList_expect(tl, LEXER_ID, "Expecting a variable name");

   // check that another variable by this name hasn't been declared in this method
   if (Method_addVariable(m, variable_name->m_value, SYMBOL_INT) == 0)
   {
      TokenList_error(tl, variable_name, "A variable by this name already exists");
   }

   // if we find a ";" the statement is over
   // if we find a "=" expect that an expression will follow
   if (TokenList_peekType(tl) == SYMBOL_SEMICOLON) {
      TokenList_expect(tl, SYMBOL_SEMICOLON, "");   // can't fail
   }
   else
   {
      // expect to see a "="
      Token * assign = TokenList_expect(tl, SYMBOL_ASSIGN, "Expecting a =");
      int expression_type;
      // call a function to evaluate the expression on the right hand side
      Pass1_expression(tl, m, &expression_type);

      // verify that the expression type matches the type of the variable
      if (expression_type != SYMBOL_INT) {
         TokenList_error(tl, assign, "Expecting an integer expression on the right side of =");
      }
      // check to see that the statement is terminated by a ";"
      TokenList_expect(tl, SYMBOL_SEMICOLON, "Expecting a ;");
   }
}

Recursive Decent Parser Implementation

In this post’s code you’ll find the following functions within Parser.c.  Listed on the right side of the function name is the list of productions that it handles.

Pass0

  • void Parser_pass0 – File
  • void Parser_pass0_method – ExternalMethod, InternalMethod
  • void Parser_pass0_block – does not directly correspond to a production, just checks that all ‘{‘ s and ‘}’ s come in pairs.

Results:

The first pass, creates a Dictionary of Method objects found in the entire file keyed by the method’s signature.  Each Method contains information about the number and type of parameters and whether the method is internal or external.

Pass 1

  • void Parser_pass1 – File
  • void Parser_pass1_method – ExternalMethod, InternalMethod
  • void Parser_pass1_block – Block
  • void Parser_pass1_statement – StringDeclareStatement, IntDeclareStatement, FunctionCallStatement, AssignStatement
  • void Parser_pass1_expression – Lexer_INT (integer value), Lexer_STRING (string constant), Lexer_ID (local variable)
  • void Parser_pass1_method_call_expression – FunctionCallExpression

Results:

Eventually an array of assembly op-codes.  At this point it just dumps the script to the screen for validation.

Conclusion

Next time I’ll start building the Virtual Machine  and have the Compiler start outputting instructions.  Over further posts I’ll  progressively add features to the language.

Download this weeks code.

As always, you can follow myself or Incubator Games on twitter at @igtristan and @incubatorgames respectively.

Trudy Concept Dump #5 


Posted by Radek on December 27th, 2010.

Categories: concepts, trudy's mechanicals.

Here is our final concept art update! Going forward we’ll start posting other art assets and in-game shots, but these are the last character and prop concepts.

tatjana01 Trudy Concept Dump #5

Early sketches of Tatjana, Valerjan's aunt, burden, and only family member.

tatjana05 Trudy Concept Dump #5

A closeup of Tatjana's "iron lung" chest that needs to be replaced periodically to keep her breathing.

Tatjana Trudy Concept Dump #5

The finalized Tatjana, moribund in her industrial wheelchair.

gattlingturret3 Trudy Concept Dump #5

Our elevator-styled Gatling Gun turrets.

sewerslug01 Trudy Concept Dump #5

Initial sketches of the Sewer Slug. As the only "monster" in the first part of the game, we gave it a fat bottom and ended up avoiding the insectoid look to make it stand out a bit more.

SewerSlug Trudy Concept Dump #5

Having spent centuries bathing in Trudy's waste, the Sewer Slugs have developed some unique abilities.

A Simple Scripting Language – Part 3 


Posted by Tristan on December 14th, 2010.

Categories: programming, trudy's mechanicals.

Part 1 | Part 2 | Part 3 | Part 4 | Part 5

In this post we’ll be looking at Lexical Analysis.  In the previous article in the series, I created an interface for reading and peeking characters from a file, now it is time to turn these characters into meaningful symbols.   First we’ll define the symbols in the language, look at some pseudo-code and finally look at other uses for our Lexer.

Classically Lexical Analysis is performed by creating a series of Regular Expressions defining each type of symbol in a language and converting them into a single Finite State Machine (FSM).  The resulting Finite State Machine usually has  hundreds of states  and is more often than not created with a specialized tool.   FSMs are very fast, however I’m going to take a brute force approach.

The majority of the symbols to be matched in a programming language are simple strings (eg.  “int”, “for”, “{“, “}”,”;” .. etc ), where  performing a string compare will suffice.  Strings, numbers, identifiers and single line comments are more complex and this is where I’ll focus the majority of my time.

  • A string will be defined as a double-quote character (“) followed by any number of non double-quote characters  and terminated by a second double-quote (“)
  • A number will be defined as the digit (0) or the digit (1-9) followed by any number of digits (0-9).
  • An identifier (ie. a variable or function name) will be defined as starting with an underscore (_) or an alpha character(a-z,A-Z) followed by 0 or more alpha, numeric, or underscore characters (a-z,A-Z,0-9,_)
  • A comment will be defined as starting with two forward-slashes (//) and terminating with the first new line character found.  Comments are useful to the programmer as documentation, but are meaningless to a Compiler.  Any comments found should be discarded.

Reading a Token

The above steps can be combined into the following pseudo-code which will the basis for the Lexer_read function.

  1. remove any whitespace
  2. while the next two characters are “//”
    1. throw out all characters until a new line is found
    2. remove all white-space
  3. for each symbol (simple strings)
    1. if the next characters match it   (FOUND A SYMBOL)
  4. if the next character is a 0 (FOUND AN INTEGER)
  5. if the next character is 1-9
    1. continue reading while the next character is 0-9
    2. (FOUND AN INTEGER)
  6. if the next character is a double quote
    1. continue reading until the next character is another double quote
    2. (FOUND A STRING)
  7. if the next character is an underscore or alphabetic (a-z,A-Z)
    1. continue reading while the next character is a-z, A-Z, 0-9 or _
    2. (FOUND AN ID)
  8. If none of the above match, its an invalid Token

The Interface

In the next article I’ll further define the syntax for the simple scripting language and start creating a Recursive Descent Parser.  As a Lexer groups characters into meaningful tokens, a parser groups tokens into meaningful phrases. (ie  int x = 5, means assign 5 to the variable x).  In order to build a parser,  an interface for reading and peeking at the next Token/Symbol in the stream is required.

int Lexer_init(Lexer * lexer, const char * file);

Initialize a Lexer with a file path by creating a LookaheadReader under the hood.  If it was not able to open the file 0 is returned else 1.

void Lexer_addSymbol(Lexer * lexer, const char * symbol, int type);

// eg

#define SYMBOL_SEMICOLON 0

Lexer_addSymbol(lexer, ";", SYMBOL_SEMICOLON)

Define a custom symbol, with an index (0 to LEXER_MAX_USER_DEFINED_SYMBOLS – 1).  All Token’s returned with text matching the specified string, will also contain the numeric type specified.

int Lexer_next(Lexer * lexer, Token * token);

Get the next token from the stream.  If an End of File (SYSTEM_EOF) token is returned it returns 0 else 1.

int Lexer_peek(Lexer * lexer, Token * token);

Peek at the next token from the stream.  If an End of File (SYSTEM_EOF) token is returned it returns 0 else 1.

void Lexer_dealloc(Lexer * lexer);

Close the file and perform cleanup.

typedef struct Token
{
int m_line;   // the line the token was found on
int m_column;  // the column the token was found on
int m_type;   // a numeric identifier listing what type of symbol was  found
string m_value;  // the text of the symbol
} Token;

The Token object contains the line and column the symbol/token was found on,  the type of symbol and the text value of it.  The type field is a built in value (Lexer.h) of SYMBOL_INT, SYMBOL_ID. SYMBOL_STRING, SYMBOL_EOF (end of file) or one passed to Lexer_addSymbol.  For character data I chose to create a custom reference counted string structure.   This allows me to return the same unique string all instances of a symbol added using Lexer_addSymbol,  as well as play nicely with a string Dictionary I will be using later.  There is no one correct way to represent string data, so use whatever you prefer.

Other uses for a Lexer

For Trudy I’ve had to read JSON, Behaviour Trees, CSVs and Alias Wavefront OBJ files.   All these different formats can be parsed using the above Lexer (albeit with some minor modifications).

I chose JSON to represent configuration data in the game because of its compactness and readability.  Instead of creating a whole tree structure in memory for each file, I processed each token as it was read.  Since I created the data, I can be certain that it will be in the correct order.

To parse JSON using the Lexer, symbols for “{“,  “}”,  “[",  "]“,  “,”, “:” will need to be defined using Lexer_addSymbol .

#define SYMBOL_LBRACE   0     // {
#define SYMBOL_RBRACE  1      // }
#define SYMBOL_LARRAY  2      // [
#define SYMBOL_RARRAY  3      // ]
#define SYMBOL_COMMA  4      // ,
#define SYMBOL_COLON     5      // :

Beyond matching symbols, JSON contains object and array structures.

Array Matching

  1. Read a “["
  2. While the next token isn't "]”
    1. read the nth value of the array
    2. if the next token isn’t “]”  read the  “,”.  If there is no comma ERROR
  3. Read a “]”

Object Matching

  1. Read a “{“
  2. While the next token isn’t “}”
    1. read an ID, the name of the field
    2. read a “:”
    3. read the value of the field
    4. if the next token isn’t “}”  read the  “,”.  If there is no comma ERROR
  3. Read a “}”

You can see an example of a minimal JSON reader in this posts code.  Please note, if you wish to fully support JSON, single quote strings, floating point values and booleans should be added.

See you soon!

Follow Incubator Games or myself on twitter at  @incubatorgames or @igtristan respectively.