A Simple Scripting Language – Part 4 

Part 1 | Part 2 | Part 3 | Part 4 | Part 5

In this post, I’ll be building a two pass Parser for a sub set of the final language.  This subset will be capable of basic variable initialization, and passing values to external and internal methods.  This Parser will be using the Lexer interface from the previous article in the series.

An Example Script

1
2
3
4
5
6
7
8
9
10
11
12
extern void doSomething: (string) str andInt: (int) i;

int doSomethingElse
{
   int x = 0;
   string str = "hi";
   [doSomething: str andInt: x];
   x = [doSomethingAgain: 99];
   return 1;
}

extern int doSomethingAgain: (int) x;

Untill this point, I haven’t posted any examples of what the actual syntax of the language will look like.  As you can see in the above example, it is similar to Objective-C but without objects (although I might add them later).  Similar to C like languages, the extern keyword declares that a user definable piece of code (a callback) will be invoked by the Virtual Machine that the language will be running on.  Also unlike C, but akin to modern languages like Java, there will be no pre-declaration of methods.

In order to allow a method to be called before it is declared, I need to create a multi-pass parser.  On the first pass I’ll read through all of the tokens in the file and create a dictionary of all the method signatures I’ve found and their parameter types.  On the second pass, whenever I encounter a method call, I can look at this dictionary to verify that the method signature exists and the correct types of parameters have been passed.

Passes:

  1. Determine all internal and external method signatures.
  2. Do everything else

Initially this parser will start out with only these two passes, but I may add additional ones in the future as needed.

Parsing the Language

Traditionally a parser is built from a series of productions written in a Context Free Grammar.   The left side of each production represents the name of the production being defined. The right side represents the Tokens and Productions that it will be composed of.

Here are the logical symbols I’ll be using to construct each Production:

  • ‘*’ stands for 0 or more of the token to the left of it
  • ‘+’ stands for 1 or more of the token to the left of it
  • ‘(‘ and ‘)’ group tokens
  • ‘|’ means or
  • anything prefixed with ‘Lexer_’ is a token type from the previous post’s Lexer.  (eg.  an identifier, string value, integer value)
  • anything in quotes (eg ‘int’) represents a user defined Token.

These are the productions needed to define the example script:

  1. File = (ExternalMethod | InternalMethod) *
  2. ExternalMethod = ‘extern’ MethodSignature ‘;’
  3. InternalMethod = MethodSignature Block
  4. MethodSignature =  ReturnType (Lexer_ID  | (Lexer_ID ‘:’ ‘(‘Type’)’  Lexer_ID ) +)
  5. ReturnType = ‘int’ | ‘string’ | ‘void’
  6. Type = ‘int’ | ‘string’
  7. Block = ‘{‘  Statement * ‘}’
  8. Statement = StringDeclareStatement | IntDeclareStatement | MethodCallStatement | AssignStatement | ReturnStatement
  9. StringDeclareStatement = (‘string’ Lexer_ID ‘=’ Expression ‘;’)| (‘string’ Lexer_ID ‘;’)
  10. IntDeclareStatement = (‘int’ Lexer_ID ‘=’ Expression ‘;’ ) | (‘int’ Lexer_ID ‘;’)
  11. MethodCallStatement = MethodCallExpression ‘;’
  12. AssignStatement = Lexer_ID ‘=’ Expression ‘;’
  13. ReturnStatement = (‘return’ Expression ‘;’) | (‘return’ ‘;’)
  14. Expression = Lexer_INT | Lexer_STRING | Lexer_ID | MethodCallExpression
  15. MethodCallExpression = ‘[' ( Lexer_ID  | (Lexer_ID ':' Expression)+ )  ']‘

Parser building tools often create a tree of nodes representing all the productions found in an entire file, this is known as an Abstract Syntax Tree.  However I’m going to create a Recursive Descent Parser where each of these rules , or group of rules will result in a C function.

Traversing through the Tokens in a File

In order to build a Recursive Decent Parser, I first need to create an interface to iterate through the output of the Lexer;  this is the TokenList.  This TokenList maintains an internal linked list of all the Tokens read from a file using the Lexer.   In addition the TokenList also provides functions for validating input and providing meaningful error messages to the user.  The TokenList interface is listed below:

int TokenList_init(TokenList * tl, Lexer * lexer);

This function initializes the TokenList with all of the Tokens that can be read from the passed Lexer.  Returns 1 if successful.

int TokenList_peekType(TokenList * tl);  -

This function returns the type of the next Token in the TokenList.  This corresponds to the m_type field in Token struct.

Token * TokenList_expect(TokenList * tl, int type, const char * message) ;

This function returns the next token in a list and if that token does not match the expected type it displays an error and stops compilation.

Token * TokenList_expectMultiple(TokenList * tl, int valid_symbol_count, int * array_of_symbols, const char * message);

This function is similar to TokenList_expect but allows for a choice of symbol types to match against.

void TokenList_error(TokenList * tl, Token * tok, const char * message);

This function displays an error blaming the token passed to it.  The message is displayed to screen along with the line and column that the Token was found on.  Eventually I’ll put in a user defined function pointer to be called instead, so that the choice of error handling behavior is up to the user.

Example: Using the TokenList to Parse an Integer Declaration Statement

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
Method * m = ...
TokenList * tl = ....

// if the next token is "int" its an Integer Declaration Statement
if (TokenList_peekType(tl) == SYMBOL_INT)
{
   // eat the "int" token
   TokenList_expect(tl, SYMBOL_INT, "");  

   // expect an ID that represents the variable name
   Token * variable_name = TokenList_expect(tl, LEXER_ID, "Expecting a variable name");

   // check that another variable by this name hasn't been declared in this method
   if (Method_addVariable(m, variable_name->m_value, SYMBOL_INT) == 0)
   {
      TokenList_error(tl, variable_name, "A variable by this name already exists");
   }

   // if we find a ";" the statement is over
   // if we find a "=" expect that an expression will follow
   if (TokenList_peekType(tl) == SYMBOL_SEMICOLON) {
      TokenList_expect(tl, SYMBOL_SEMICOLON, "");   // can't fail
   }
   else
   {
      // expect to see a "="
      Token * assign = TokenList_expect(tl, SYMBOL_ASSIGN, "Expecting a =");
      int expression_type;
      // call a function to evaluate the expression on the right hand side
      Pass1_expression(tl, m, &expression_type);

      // verify that the expression type matches the type of the variable
      if (expression_type != SYMBOL_INT) {
         TokenList_error(tl, assign, "Expecting an integer expression on the right side of =");
      }
      // check to see that the statement is terminated by a ";"
      TokenList_expect(tl, SYMBOL_SEMICOLON, "Expecting a ;");
   }
}

Recursive Decent Parser Implementation

In this post’s code you’ll find the following functions within Parser.c.  Listed on the right side of the function name is the list of productions that it handles.

Pass0

  • void Parser_pass0 – File
  • void Parser_pass0_method – ExternalMethod, InternalMethod
  • void Parser_pass0_block – does not directly correspond to a production, just checks that all ‘{‘ s and ‘}’ s come in pairs.

Results:

The first pass, creates a Dictionary of Method objects found in the entire file keyed by the method’s signature.  Each Method contains information about the number and type of parameters and whether the method is internal or external.

Pass 1

  • void Parser_pass1 – File
  • void Parser_pass1_method – ExternalMethod, InternalMethod
  • void Parser_pass1_block – Block
  • void Parser_pass1_statement – StringDeclareStatement, IntDeclareStatement, FunctionCallStatement, AssignStatement
  • void Parser_pass1_expression – Lexer_INT (integer value), Lexer_STRING (string constant), Lexer_ID (local variable)
  • void Parser_pass1_method_call_expression – FunctionCallExpression

Results:

Eventually an array of assembly op-codes.  At this point it just dumps the script to the screen for validation.

Conclusion

Next time I’ll start building the Virtual Machine  and have the Compiler start outputting instructions.  Over further posts I’ll  progressively add features to the language.

Download this weeks code.

As always, you can follow myself or Incubator Games on twitter at @igtristan and @incubatorgames respectively.

Posted by Tristan Campbell
@igtristan, programmer at Incubator Games
IncubatorGames Says:

[Post] A Simple Scripting Language – Part 4 – http://www.incubatorgames.com/index.php/

justradek Says:

RT @IncubatorGames: [Post] A Simple Scripting Language – Part 4 – http://www.incubatorgames.com/index.php/

Tweets that mention A Simple Scripting Language – Part 4 | Incubator Games -- Topsy.com Says:

[...] This post was mentioned on Twitter by Radek Koncewicz and Tristan Campbell, Incubator Games. Incubator Games said: [Post] A Simple Scripting Language – Part 4 – http://bit.ly/f9T6Gv [...]

Incubator Games » Blog Archive » A Simple Scripting Language – Part 1 Says:

[…] 1 | Part 2 | Part 3 | Part 4 | Part […]

Comment: