• Regular expressions are notation for specifying patterns.
• Each pattern matches a set of strings.
• Regular expressions will serve as names for sets of strings.
• Strings and Languages:
• The term alphabet or character class denotes any finite set of symbols.
• e.g., set {0,1} is the binary alphabet.
• The term sentence and word are often used as synonyms for the term string.
• The length of a string s is written as |s| - is the number of occurrences of symbols in s.
• e.g., string “banana” is of length six.
• The empty string denoted by ε – length of empty string is zero.
• The term language denotes any set of strings over some fixed alphabet.
• e.g., {ε} – set containing only empty string is language under φ.
• If x and y are strings, then the concatenation of x and y (written as xy) is the string formed by appending y to x. x = dog and y = house; then xy is doghouse.
• sε = εs = s.
• s0 = ε, s1 = s, s2 = ss, s3 = sss, … so on.
Table1.2 Terms for parts of a string
TERM
|
DEFINITION
|
Prefix of s
|
A string obtained by removing zero or more trailing symbols of string s; e.g., ban is a prefix of banana.
|
Suffix of s
|
A string formed by deleting zero or more of the leading symbols of s; e.g., nana is a suffix of banana.
|
Substring of s
|
A string obtained by deleting a prefix and a suffix from s; e.g., nan is a substring of banana.
|
Proper prefix, suffix, or substring of s
|
Any nonempty string x that is a prefix, suffix or substring of s that s <> x.
|
Subsequence of s
|
Any
string formed by deleting zero or more not necessarily contiguous
symbols from s; e.g., baaa is a subsequence of banana.
|
• Operations on Languages:
– There are several operations that can be applied to languages:
Table 1.3 Definitions of operations on languages L and M
OPERATION
|
DEFINITION
|
Union of L and M. written LυM
|
L υ M = { s | s is in L or s is in M }
|
Concatenation of L and M. written LM
|
LM = { st | s is in L and t is in M }
|
Kleene closure of L.
written L*
|
L* denotes “zero or more concatenation of” L.
|
Positive closure of L.
written L+
|
L+ denotes “one or more Concatenation of” L.
|
• Regular Expressions:
– It allows defining the sets to form tokens precisely.
– e.g., letter ( letter | digit) *
– Defines a Pascal identifier – which says that the identifier is formed by a letter followed by zero or more letters or digits.
– A regular expression is built up out of simpler regular expressions using a set of defining rules.
– Each regular expression r denotes a language L(r).
• The rules that define the regular expressions over alphabet ∑.
– (Associated with each rule is a specification of the language denoted by the regular expression being defined)
• 1. ε is a regular expression that denotes {ε}, i.e. the set containing the empty string.
• 2. If a is a symbol in ∑, then a is a regular expression that denotes {a}, i.e. the set containing the string a.
• 3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then
– a) (r) | (s) is a regular expression denoting the languages L(r) U L(s).
– b) (r)(s) is a regular expression denoting the languages L(r)L(s).
– c) (r)* is a regular expression denoting the languages (L(r))*.
– d) (r) is a regular expression denoting the languages L(r).
• A language denoted by a regular expression is said to be a regular set.
• The specification of a regular expression is an example of a recursive definition.
• Rule (1) and (2) form the basis of the definition.
• Rule (3) provides the inductive step.
Table 1.4 Algebraic Properties of regular expressions
AXIOM
|
DESCRIPTION
|
r|s = s|r
|
| is commutative
|
r|(s|t) = (r|s)|t
|
| is associative
|
(rs)t = r(st)
|
Concatenation is associative
|
r(s|t) = rs|rt
(s|t)r = sr|tr
|
Concatenation distributes over |
|
εr = r
rε = r
|
ε is the identity element for concatenation
|
r* = (r|ε)*
|
Relation between * and ε
|
r** = r*
|
* Is idempotent
|
• Regular Definition:
– If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
• d1 → r1
• d2 → r2
• …
• dn → rn
– Where each di is a distinct name, and each ri is a regular expression over the symbols in ∑ U {d1, d2, … , di-1}, i.e., the basic symbols and the previously defined names.
– e.g. (regular definition in bold):
• letter → A|B|…|Z|a|b|…|z
• digit → 0|1|…|9
• id → letter ( letter | digit ) *
• Notational Shorthand:
– This shorthand is used in certain constructs that occur frequently in regular expressions.
• 1. one or more instance: unary postfix operator + means “one or more instances of”. If r is a regular expression that denotes the language L(r), the r+ is a regular expression that denotes the language (L(r))+. Similarly unary postfix operator * means “zero or more instances of”. The two algebraic identities r* and r+ relate the kleene and positive closure operators.
• 2. zero or one instance: unary postfix operator ? means “zero or one instance of”. The notation r? is a shorthand for r|ε.
• 3. character class: the notation [abc] is a shorthand for a|b|c.
No comments:
Post a Comment