Blogger Widgets

Total Page visits

Thursday, August 15, 2013

Input Buffering Techniques in Compiler Design


      Some efficiency issues concerned with the buffering of input.
      A two-buffer input scheme that is useful when lookahead on the input is necessary to identify tokens.
      Techniques for speeding up the lexical analyser, such as the use of sentinels to mark the buffer end.
      There are three general approaches to the implementation of a lexical analyser:
 
         1. Use a lexical-analyser generator, such as Lex compiler to produce the lexical analyser from a regular expression based specification. In this, the generator provides routines for reading and buffering the input.
 
    2. Write the lexical analyser in a conventional systems-programming language, using I/O facilities of that language to read the input.     
      
      3. Write the lexical analyser in assembly language and explicitly manage the reading of input.

 Buffer pairs:
     Because of a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.
      The scheme to be discussed:
      Consists a buffer divided into two N-character halves.
                               
N – Number of characters on one disk block, e.g., 1024 or 4096.
     Read N characters into each half of the buffer with one system read command.
     If fewer than N characters remain in the input, then eof is read into the buffer after the input characters.
     Two pointers to the input buffer are maintained.
     The string of characters between two pointers is the current lexeme.
     Initially both pointers point to the first character of the next lexeme to be found.
     Forward pointer, scans ahead until a match for a pattern is found.
     Once the next lexeme is determined, the forward pointer is set to the character at its right end.
     If the forward pointer is about to move past the halfway mark, the right half is filled with N new input characters.
     If the forward pointer is about to move past the right end of the buffer, the left half is filled with N new characters and the forward pointer wraps around to the beginning of the buffer.
     Disadvantage of this scheme:
     This scheme works well most of the time, but the amount of lookahead is limited.
     This limited lookahead may make it impossible to recognize tokens in situations where the distance that the forward pointer must travel is more than the length of the buffer.
     For example: DECLARE ( ARG1, ARG2, … , ARGn ) in PL/1 program;
     Cannot determine whether the DECLARE is a keyword or an array name until the character that follows the right parenthesis.
     Sentinels:
     In the previous scheme, must check each time the move forward pointer that have not moved off one half of the buffer. If it is done, then must reload the other half.
     Therefore the ends of the buffer halves require two tests for each advance of the forward pointer.
     This can reduce the two tests to one if it is extend each buffer half to hold a sentinel character at the end.
     The sentinel is a special character that cannot be part of the source program. (eof character is used as sentinel).
                                       
      In this, most of the time it performs only one test to see whether forward points to an eof.
      Only when it reach the end of the buffer half or eof, it performs more tests.
      Since N input characters are encountered between eof’s, the average number of tests per input character is very close to 1.

No comments: