Posted by Tristan on November 26th, 2010.
Categories: programming, trudy's mechanicals.
The Look-ahead Reader
The first stage of any compiler is simply being able to read characters from a text file. While not the most glamorous part, what we will do in this post will set the foundation for everything that comes later. For the sake of simplicity and maintaining cross platform compatibility I chose to use C and read script files using the stdio file operators (fopen, fread, fclose).
If you need to brush up on your file access functions this site should get you up to speed.
Reading text from a file, that doesn’t seem too hard?
Not really, but there are several nuances in addition to reading characters.
As we all have been taught, calling a system interrupt repeatedly (fread is an abstraction on one) is hazardous to your program’s performance. Instead of reading 1 character at a time, we will read chunks of 1024 characters a time. The naive approach would be to load 1024 bytes from the file. Once all that text is processed, load the next 1024 until we reach the end of file.
At this point we would have a BufferedReader.
In order to facilitate next weeks step, Lexical Analysis, we also need the ability to peek (look at without reading) several characters in advance (thus the term Look-ahead Reader). Lexical Analysis is the act of taking characters and transforming them into meaningful symbols. (ie the characters “i” “n” “t” form “int” which defines a type of data).
However a simple BufferedReader is not sufficient to support peek operations. For example, if we are currently examining the last character in the buffer, the next character cannot be peeked at without overwriting all the data in the buffer. However if we split our buffer into 2 halves (bufferA and bufferB) we can easily support look-ahead.
- initially read both bufferA and bufferB
- when you’ve processed all the data in bufferA and have started reading bufferB, populate bufferA from the file
- when you’ve processed all the data in bufferB and have started reading bufferA, populate bufferB from the file
- when you’ve reached the end of file stop populating the buffers.
This will guarantee that there is always half a buffer size of text available to peek at at all times. Special care must be taken in the cast that a user tries to peek beyond the end of the file. A meaningful character should be returned to indicate this condition. In this implementation I chose to return a ‘\0′ (NULL character).
Also to allow for meaningful error messages later on, the Look-ahead Reader will keep track of the current line and column in the file. Nothing is more frustrating than knowing there is an error but having no idea what code caused it.
What does the interface look like?
I’ve taken a pseudo object-oriented approach to structuring my C, borrowing slightly from Objective-C. The following are the functions contained within LookaheadReader.h.
Initialize the passed “reader” to read from the file indicated by “path”. It opens the file, and populates the internal buffer with 1024 characters.
Returns the next character, and reads chunks of 512 bytes from the file when needed into the internal buffer. If you try to read past the end of the file ‘\0′ will be returned.
Peeks at the next nth character. 0 indicates the next character to be read, 1 the one after that… etc. If you try to peek past the end of the file ‘\0′ will be returned.
Discards characters until it peeks a non white-space character.
Closes the file.
Lets see some code!
For this project I will be supplying the C code under the MIT license. If there is enough interest in porting the code to another language besides C, I may make additional versions available.
Next week, we’ll start planning the syntax of our language and use the Look-ahead Reader to create meaningful symbols out of a text file.
You can follow myself or Incubator Games on twitter at @igtristan or @incubatorgames respectively.
See you next week!