Editing String

Jump to: navigation, search

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.
Latest revision Your text
Line 2: Line 2:
  
 
==Definitions==
 
==Definitions==
An '''alphabet''', usually denoted <math>\Sigma</math> is a finite nonempty set (whose size is denoted <math>|\Sigma|</math>). It is often assumed that the alphabet is totally ordered, but this is not always necessary. (Since the alphabet is finite, we can always impose a total order when it would be useful to do so, such as when constructing a [[dictionary]].) An element of this the alphabet is known as a '''character'''. Unless otherwise specified, we will always assume that two characters from the same alphabet can be compared for equality in constant time.
+
An '''alphabet''', usually denoted <math>\Sigma</math> is a finite nonempty set (whose size is denoted <math>|\Sigma|</math>). It is often assumed that the alphabet is totally ordered, but this is not always necessary. (Since the alphabet is finite, we can always impose a total order when it would be useful to do so, such as when constructing a [[dictionary]].) An element of this the alphabet is known as a '''character'''. The size of the alphabet can affect the asymptotic complexity of string algorithms. In particular, if <math>N</math> is the size of the input, and the alphabet size is bounded by a constant independent of <math>N</math>, then it is called a '''constant''' alphabet; if its size is <math>O(N)</math>, it is called '''integer''' alphabet; and if it is larger than this then it is called a '''general''' alphabet.
  
 
The set of <math>n</math>-tuples of <math>\Sigma</math> is denoted <math>\Sigma^n</math>. A '''string of length <math>n</math>''' is an element of <math>\Sigma^n</math>. The set <math>\Sigma^*</math> is defined <math>\Sigma^0 \cup \Sigma^1 \cup ...</math>; an element of <math>\Sigma^*</math> is known simply as a '''string''' over <math>\Sigma</math>. The '''empty string''', denoted <math>\epsilon</math> or <math>\lambda</math>, is the unique element of <math>\Sigma^0</math>. The length of a string <math>S</math> is denoted <math>|S|</math>. (Note that the usual definition of "string" requires strings to have finite length, although arbitrarily long strings exist.)
 
The set of <math>n</math>-tuples of <math>\Sigma</math> is denoted <math>\Sigma^n</math>. A '''string of length <math>n</math>''' is an element of <math>\Sigma^n</math>. The set <math>\Sigma^*</math> is defined <math>\Sigma^0 \cup \Sigma^1 \cup ...</math>; an element of <math>\Sigma^*</math> is known simply as a '''string''' over <math>\Sigma</math>. The '''empty string''', denoted <math>\epsilon</math> or <math>\lambda</math>, is the unique element of <math>\Sigma^0</math>. The length of a string <math>S</math> is denoted <math>|S|</math>. (Note that the usual definition of "string" requires strings to have finite length, although arbitrarily long strings exist.)
Line 29: Line 29:
 
* <math>|ST| = |S|+|T|</math>
 
* <math>|ST| = |S|+|T|</math>
 
For example, concatenating ''there'' and ''in'' gives ''therein''. So do concatenating ''the'' and ''rein'', or <math>\epsilon</math> and ''therein'', or ''therein'' and <math>\epsilon</math>. Note that in general, concatenation is not commutative.
 
For example, concatenating ''there'' and ''in'' gives ''therein''. So do concatenating ''the'' and ''rein'', or <math>\epsilon</math> and ''therein'', or ''therein'' and <math>\epsilon</math>. Note that in general, concatenation is not commutative.
 
===Alphabet size===
 
The size of the alphabet can affect the asymptotic complexity of string algorithms. In particular, if <math>N</math> is the size of the input, then we can classify alphabets as follows<ref>Juha Kärkkäinen and Peter Sanders. 2003. Simple linear work suffix array construction. In ''Proceedings of the 30th international conference on Automata, languages and programming'' (ICALP'03), Jos C. M. Baeten, Jan Karel Lenstra, Joachim Parrow, and Gerhard J. Woeginger (Eds.). Springer-Verlag, Berlin, Heidelberg, 943-955. Retrieved 2010-11-13 from [http://monod.uwaterloo.ca/~cs482/suffix-icalp03.pdf http://monod.uwaterloo.ca/~cs482/suffix-icalp03.pdf].</ref>:
 
* If <math>|\Sigma| \in O(1)</math>, that is, it is bounded above by a constant ''independent of <math>N</math>'', then the alphabet is said to be '''constant'''.
 
* If the characters are integers bounded by <math>O(N^c)</math> for some <math>c</math>, then the alphabet is said to be '''integer'''. (This is equivalent to saying that the characters are integers with at most <math>O(N)</math> digits.) Note that this implies <math>|\Sigma| \in O(N^c)</math> and equivalently <math>\log|\Sigma| \in O(\log N)</math>.
 
* All alphabets are '''general''', even when they are neither constant nor integer.
 
The dividing lines may seem arbitrary, but they are placed where they are because for many algorithms a better asymptotic complexity guarantee (in terms of <math>N</math>) may be achieved when the alphabet is constant rather than when it is integer, or when it is integer rather than when it is general. For example, assume that we want to construct a [[trie]] of a set of strings of total size <math>N</math> using constant space. We need to store the links from each node (of which there may be up to <math>O(N)</math>) to its children. These links should be indexed by character, so that we can trace down a path of characters from root to leaf when looking for a word.
 
* When the alphabet is constant, we can simply store the child links of each node in an array and walk each link in constant time, and a search will take <math>O(N)</math> time (because the trie's height cannot possibly be more than <math>N</math>), and furthermore the space complexity is <math>O(N|\Sigma|) = O(N)O(1) = O(N)</math>.
 
* When the alphabet is integer, this approach will not meet the linear space restriction. Instead, we can store the links at each node in a [[map]] (from character to [[pointer]]), which will allow us to perform the walk in <math>O(N\log|\Sigma|) = O(N\log N)</math> time. The total extra space usage (that is, the total size of all the maps) is proportional to the total size of the maps, which, being equal to the total number of links, is <math>O(N)</math>.
 
* If the alphabet is general, and we do the same, then the time taken to walk down the tree will be <math>O(N \log |\Sigma|)</math> again; but this is no longer bounded by <math>O(N \log N)</math>.
 
  
 
==Treatment in programming languages==
 
==Treatment in programming languages==
Line 50: Line 40:
 
* ''Find'' the first occurrence of a given character in a given string, or find the first occurrence of a given string as a substring of another string.
 
* ''Find'' the first occurrence of a given character in a given string, or find the first occurrence of a given string as a substring of another string.
 
* ''Compare'' two strings [[Lexicographic order|lexicographically]].
 
* ''Compare'' two strings [[Lexicographic order|lexicographically]].
 
==References==
 
<references/>
 

Please note that all contributions to PEGWiki are considered to be released under the Attribution 3.0 Unported (see PEGWiki:Copyrights for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource. Do not submit copyrighted work without permission!

Cancel | Editing help (opens in new window)