Blogger Widgets

Total Page visits

Thursday, August 15, 2013

Specification of Tokens

         Regular expressions are notation for specifying patterns.
      Each pattern matches a set of strings.
      Regular expressions will serve as names for sets of strings.
      Strings and Languages:
      The term alphabet or character class denotes any finite set of symbols.
      e.g., set {0,1} is the binary alphabet.
      The term sentence and word are often used as synonyms for the term string.
      The length of a string s is written as |s| - is the number of occurrences of symbols in s.
      e.g., string “banana” is of length six.
      The empty string denoted by ε – length of empty string is zero.
      The term language denotes any set of strings over some fixed alphabet.
      e.g., {ε} – set containing only empty string is language under φ.
      If x and y are strings, then the concatenation of x and y (written as xy) is the string formed by appending y to x. x = dog and y = house; then xy is doghouse.
      sε = εs = s.
      s0 = ε, s1 = s, s2 = ss, s3 = sss, … so on.

Table1.2 Terms for parts of a string
TERM
DEFINITION
Prefix of s

A string obtained by removing zero or more trailing symbols of string s; e.g., ban is a prefix of banana.
Suffix of s

A string formed by deleting zero or more of the leading symbols of s; e.g., nana is a suffix of banana.
Substring of s

A string obtained by deleting a prefix and a suffix from s; e.g., nan is a substring of banana.
Proper prefix, suffix, or substring of s
Any nonempty string x that is a prefix, suffix or substring of s that s <> x.
Subsequence of s

Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., baaa is a subsequence of banana.
      Operations on Languages:
     There are several operations that can be applied to languages:

Table 1.3 Definitions of operations on languages L and M

OPERATION
DEFINITION
Union of L and M. written LυM
L υ M = { s | s is in L or s is in M }
Concatenation of L and M. written LM
LM = { st | s is in L and t is in M }
Kleene closure of L.
written L*

 
L* denotes “zero or more concatenation of” L.  
Positive closure of L.
written L+

L+ denotes “one or more Concatenation of” L.

      Regular Expressions:
     It allows defining the sets to form tokens precisely.
     e.g., letter ( letter | digit) *
     Defines a Pascal identifier – which says that the identifier is formed by a letter followed by zero or more letters or digits.
     A regular expression is built up out of simpler regular expressions using a set of defining rules.
     Each regular expression r denotes a language L(r).
      The rules that define the regular expressions over alphabet ∑.
     (Associated with each rule is a specification of the language denoted by the regular expression being defined)
      1. ε is a regular expression that denotes {ε}, i.e. the set containing the empty string.
      2. If a is a symbol in ∑, then a is a regular expression that denotes {a}, i.e. the set containing the string a.
      3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then
     a) (r) | (s) is a regular expression denoting the languages L(r) U L(s).
     b) (r)(s) is a regular expression denoting the languages L(r)L(s).
     c) (r)* is a regular expression denoting the languages (L(r))*.
     d) (r) is a regular expression denoting the languages L(r).
      A language denoted by a regular expression is said to be a regular set.
      The specification of a regular expression is an example of a recursive definition.
      Rule (1) and (2) form the basis of the definition.
      Rule (3) provides the inductive step.


Table 1.4 Algebraic Properties of regular expressions

AXIOM
DESCRIPTION
r|s = s|r
| is commutative
r|(s|t) = (r|s)|t
| is associative
(rs)t = r(st)
Concatenation is associative
r(s|t) = rs|rt
(s|t)r = sr|tr
Concatenation distributes over |

εr = r
rε = r
ε is the identity element for concatenation

r* = (r|ε)*
Relation between * and ε
r** = r*
* Is idempotent

      Regular Definition:
     If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
      d1 → r1
      d2 → r2
     
      dn → rn
     Where each di is a distinct name, and each ri is a regular expression over the symbols in ∑ U {d1, d2, … , di-1}, i.e., the basic symbols and the previously defined names.
     e.g. (regular definition in bold):
      letter → A|B|…|Z|a|b|…|z
      digit → 0|1|…|9
      id → letter ( letter | digit ) *

      Notational Shorthand:
     This shorthand is used in certain constructs that occur frequently in regular expressions.
      1. one or more instance: unary postfix operator + means “one or more instances of”. If r is a regular expression that denotes the language L(r), the r+ is a regular expression that denotes the language (L(r))+. Similarly unary postfix operator * means “zero or more instances of”. The two algebraic identities r* and r+ relate the kleene and positive closure operators.
      2. zero or one instance: unary postfix operator ? means “zero or one instance of”. The notation r? is a shorthand for r|ε.
      3. character class: the notation [abc] is a shorthand for a|b|c.

No comments: