Specification of Tokens ~ My View On Computers and World

Thursday, August 15, 2013

Specification of Tokens

• Regular expressions are notation for specifying patterns.

• Each pattern matches a set of strings.

• Regular expressions will serve as names for sets of strings.

• Strings and Languages:

• The term alphabet or character class denotes any finite set of symbols.

• e.g., set {0,1} is the binary alphabet.

• The term sentence and word are often used as synonyms for the term string.

• The length of a string s is written as |s| - is the number of occurrences of symbols in s.

• e.g., string “banana” is of length six.

• The empty string denoted by ε – length of empty string is zero.

• The term language denotes any set of strings over some fixed alphabet.

• e.g., {ε} – set containing only empty string is language under φ.

• If x and y are strings, then the concatenation of x and y (written as xy) is the string formed by appending y to x. x = dog and y = house; then xy is doghouse.

• sε = εs = s.

• s⁰ = ε, s¹ = s, s² = ss, s³ = sss, … so on.

Table1.2 Terms for parts of a string

TERM	DEFINITION
Prefix of s	A string obtained by removing zero or more trailing symbols of string s; e.g., ban is a prefix of banana.
Suffix of s	A string formed by deleting zero or more of the leading symbols of s; e.g., nana is a suffix of banana.
Substring of s	A string obtained by deleting a prefix and a suffix from s; e.g., nan is a substring of banana.
Proper prefix, suffix, or substring of s	Any nonempty string x that is a prefix, suffix or substring of s that s <> x.
Subsequence of s	Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., baaa is a subsequence of banana.

• Operations on Languages:

– There are several operations that can be applied to languages:

Table 1.3 Definitions of operations on languages L and M

OPERATION	DEFINITION
Union of L and M. written LυM	L υ M = { s \| s is in L or s is in M }
Concatenation of L and M. written LM	LM = { st \| s is in L and t is in M }
Kleene closure of L. written L*	*L denotes “zero or more concatenation of” L.**
Positive closure of L. written L+	L⁺ denotes “one or more Concatenation of” L.

• Regular Expressions:

– It allows defining the sets to form tokens precisely.

– e.g., letter ( letter | digit) *

– Defines a Pascal identifier – which says that the identifier is formed by a letter followed by zero or more letters or digits.

– A regular expression is built up out of simpler regular expressions using a set of defining rules.

– Each regular expression r denotes a language L(r).

• The rules that define the regular expressions over alphabet ∑.

– (Associated with each rule is a specification of the language denoted by the regular expression being defined)

• 1. ε is a regular expression that denotes {ε}, i.e. the set containing the empty string.

• 2. If a is a symbol in ∑, then a is a regular expression that denotes {a}, i.e. the set containing the string a.

• 3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then

– a) (r) | (s) is a regular expression denoting the languages L(r) U L(s).

– b) (r)(s) is a regular expression denoting the languages L(r)L(s).

– c) (r)* is a regular expression denoting the languages (L(r))*.

– d) (r) is a regular expression denoting the languages L(r).

• A language denoted by a regular expression is said to be a regular set.

• The specification of a regular expression is an example of a recursive definition.

• Rule (1) and (2) form the basis of the definition.

• Rule (3) provides the inductive step.

Table 1.4 Algebraic Properties of regular expressions

AXIOM	DESCRIPTION
r\|s = s\|r	\| is commutative
r\|(s\|t) = (r\|s)\|t	\| is associative
(rs)t = r(st)	Concatenation is associative
r(s\|t) = rs\|rt (s\|t)r = sr\|tr	Concatenation distributes over \|
εr = r rε = r	ε is the identity element for concatenation
r* = (r\|ε)*	*Relation between and ε**
r** = r*	* Is idempotent

• Regular Definition:

– If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

• d₁→ r₁

• d₂ → r₂

• …

• d_n → r_n

– Where each d_i is a distinct name, and each r_i is a regular expression over the symbols in ∑ U {d₁, d₂, … , d_i-1}, i.e., the basic symbols and the previously defined names.

– e.g. (regular definition in bold):

• letter → A|B|…|Z|a|b|…|z

• digit → 0|1|…|9

• id → letter ( letter | digit ) *

• Notational Shorthand:

– This shorthand is used in certain constructs that occur frequently in regular expressions.

• 1. one or more instance: unary postfix operator ⁺ means “one or more instances of”. If r is a regular expression that denotes the language L(r), the r⁺ is a regular expression that denotes the language (L(r))⁺. Similarly unary postfix operator * means “zero or more instances of”. The two algebraic identities r* and r⁺ relate the kleene and positive closure operators.

• 2. zero or one instance: unary postfix operator ? means “zero or one instance of”. The notation r? is a shorthand for r|ε.

• 3. character class: the notation [abc] is a shorthand for a|b|c.

Pages

Total Page visits

Thursday, August 15, 2013

Specification of Tokens

No comments: