Editing String

The '''string''' is one of the most fundamental objects in computer science. Strings are intimately familiar to us in the form of text and therefore it is no wonder that the theory of strings (which is to be distinguished from the field of theoretical physics known as "string theory") is one of the most extensively studied subdisciplines of computer science.

==Definitions==
An '''alphabet''', usually denoted <math>\Sigma</math> is a finite nonempty set (whose size is denoted <math>|\Sigma|</math>). It is often assumed that the alphabet is totally ordered, but this is not always necessary. (Since the alphabet is finite, we can always impose a total order when it would be useful to do so, such as when constructing a [[dictionary]].) An element of this the alphabet is known as a '''character'''. The size of the alphabet can affect the asymptotic complexity of string algorithms. In particular, if <math>N</math> is the size of the input, and the alphabet size is bounded by a constant independent of <math>N</math>, then it is called a '''constant''' alphabet; if its size is <math>O(N)</math>, it is called '''integer''' alphabet; and if it is larger than this then it is called a '''general''' alphabet.

The set of <math>n</math>-tuples of <math>\Sigma</math> is denoted <math>\Sigma^n</math>. A '''string of length <math>n</math>''' is an element of <math>\Sigma^n</math>. The set <math>\Sigma^*</math> is defined <math>\Sigma^0 \cup \Sigma^1 \cup ...</math>; an element of <math>\Sigma^*</math> is known simply as a '''string''' over <math>\Sigma</math>. The '''empty string''', denoted <math>\epsilon</math> or <math>\lambda</math>, is the unique element of <math>\Sigma^0</math>. The length of a string <math>S</math> is denoted <math>|S|</math>. (Note that the usual definition of "string" requires strings to have finite length, although arbitrarily long strings exist.)

It follows from the definition that all strings are [[sequence]]s, but not all sequences are strings.

For ease of conceptualization, we shall usually assign a ''symbol'', a graphical representation, to each character of the alphabet and, considering a string as a sequence of characters, render it as a sequence of symbols. We shall usually do so in '''boldface''', but in this section we shall use ''italics'' to avoid confusion with terms being defined, hence, ''PEG''. On other occasions we may choose to represent them with the ordered list notation, hence, [''P'',''E'',''G'']. (This form is useful when the symbols consist of more than one glyph, as can be the case when they are integers; see below.) We will often number the characters of strings, sometimes starting from zero, sometimes starting from one.

Examples of alphabets include:
* The binary alphabet, with exactly two characters (<math>|\Sigma|=2</math>), usually denoted ''0'' and ''1''. In virtually all computers ever built, data are strings over the binary alphabet, and are known as bit strings.
* The Latin alphabet, with the characters ''a'', ''A'', ''b'', ''B'', ..., ''z'', ''Z'' (<math>|\Sigma|=52</math>). English words may be considered strings over the Latin alphabet.
* The Unicode character set (the size of this alphabet depends somewhat on what we consider to be a valid Unicode character). Text files may be considered strings over this alphabet.
* The set of integers <math>\Sigma = \{0, 1, ..., N-1\}</math> for some <math>N \in \mathbb{N}</math>, for which <math>|\Sigma| = N</math>. Here the characters are also integers, and their corresponding symbols might have more than one glyph when <math>N > 10</math>. (Recall that the definition of ''character'' is broader than the common concept of the character as the smallest, indivisible element of writing.)
* The set of nitrogenous bases in DNA, {''A'', ''C'', ''G'', ''T''}, with <math>|\Sigma|=4</math>. Codons are considered members of <math>\Sigma^3</math>, and DNA sequences members of <math>\Sigma^*</math>.
* The set of universal proteinogenic amino acids, {''A'', ''C'', ''D'', ''E'', ''F'', ''G'', ''H'', ''I'', ''K'', ''L'', ''M'', ''N'', ''P'', ''Q'', ''R'', ''S'', ''T'', ''V'', ''W'', ''Y''} (<math>|\Sigma|=20</math>). The primary structure of a protein is represented as a string over this alphabet. This and the preceding example are important alphabets in bioinformatics.

A '''substring''' of a string <math>S = c_1 c_2 ... c_n</math> is a string with the same alphabet as <math>S</math> given by <math>s = c_i c_{i+1} ... c_j</math> for some <math>i, j \in \mathbb{N}</math> with <math>i \leq j \leq n</math>. Note that the empty string is considered a substring of all other strings. In other words, it is a series of (possibly zero) characters occurring consecutively in a string, taken in order. For example, ''the'', ''in'', ''here'', and ''therein'' are all substrings of ''therein'', as is <math>\epsilon</math>; however, ''tin'' is not (the characters are not consecutive in the original string), nor is ''rine'' (the characters do not occur in order in the original string). A '''prefix''' is a substring with <math>i = 1</math>, so that <math>\epsilon</math>, ''the'', ''there'', and ''therein'' are prefixes of ''therein''. A '''suffix''' is a substring with <math>j = n</math>, so that <math>\epsilon</math>, ''in'', ''rein'', ''herein'', and ''therein'' are suffixes of ''therein''. We write <math>s \sqsubset S</math> to indicate that <math>s</math> is a prefix of <math>S</math>. Likewise the notation <math>s \sqsupset S</math> denotes that <math>s</math> is a suffix of <math>S</math>. We say that <math>s</math> is a '''proper prefix''' of <math>S</math> when <math>s \sqsubset S</math> and <math>s \neq S</math>, with '''proper suffix''' defined analogously.

These examples may have given the misleading impression that strings which do not represent "valid" words, such as ''ther'', are not valid strings. This is entirely untrue; ''ther'' is also a valid prefix of ''therein''. The definition of a string says nothing about the ''validity'' or ''meaning'' of a string. A '''language''' is a (possibly empty, often infinite) subset of <math>\Sigma^*</math>; an element of a language is often called a ''word'' (although this term should be used with caution). Thus, while every element of <math>\Sigma^*</math> is a valid string, not all are necessarily valid words in a language over <math>\Sigma^*</math>. For example, let <math>\Sigma</math> be defined as the Latin alphabet and <math>L \subseteq \Sigma^*</math> be the language consisting of the representations of all valid English words. (Note that we have been careful to distinguish the words themselves from their representations as strings.) Then ''ther'' is certainly a valid ''string'', being an element of <math>\Sigma^*</math>, but is not a valid ''word'' in <math>L</math>, since it does not represent an English word.

Thus, a string of length <math>N</math> has <math>N+1</math> prefixes, <math>N+1</math> suffixes, <math>N</math> proper prefixes, <math>N</math> proper suffixes, and <math>1+N(N+1)/2</math> substrings. (Here we consider two substrings to be distinct if they start or end at different indices, except that the empty string is counted only once; see also [[counting distinct substrings]].)

Two strings with the same alphabet can be '''concatenated'''. To concatenate two strings is to link them together to give a new string which begins with the first and ends with the second with no overlap. Formally, the concatenation of strings <math>S</math> and <math>T</math> is denoted <math>ST</math> and has the following three properties:
* <math>S \sqsubset ST</math>
* <math>T \sqsupset ST</math>
* <math>|ST| = |S|+|T|</math>
For example, concatenating ''there'' and ''in'' gives ''therein''. So do concatenating ''the'' and ''rein'', or <math>\epsilon</math> and ''therein'', or ''therein'' and <math>\epsilon</math>. Note that in general, concatenation is not commutative.

==Treatment in programming languages==
In many programming languages, strings are superficially similar to [[array]]s of characters, in that the syntax for accessing a string's characters usually match the syntax for accessing array elements. Some programming languages, such as C, go as far as to implement strings as arrays of characters. Other programming languages may have a separate library and separate set of functions for handling common string operations, such as:
* ''Insert'' one string into another at a specified position.
:* Special case: ''Concatenate'' two strings (insert a string at the beginning or the end of another).
::* Special case: ''Append'' a string or a character (which may be considered a string of length one) to the end of an existing string.
* ''Erase'' a substring of a given string at a specified position with a specified length.
:* Special case: erasing from the end may be faster.
* ''Copy'': Return a substring of a given string at a specified location with a specified length.
* ''Find'' the first occurrence of a given character in a given string, or find the first occurrence of a given string as a substring of another string.
* ''Compare'' two strings [[Lexicographic order|lexicographically]].
@@ Line 2: / Line 2: @@
 ==Definitions==
-An '''alphabet''', usually denoted <math>\Sigma</math> is a finite nonempty set (whose size is denoted <math>|\Sigma|</math>). It is often assumed that the alphabet is totally ordered, but this is not always necessary. (Since the alphabet is finite, we can always impose a total order when it would be useful to do so, such as when constructing a [[dictionary]].) An element of this the alphabet is known as a '''character'''. Unless otherwise specified, we will always assume that two characters from the same alphabet can be compared for equality in constant time.
+An '''alphabet''', usually denoted <math>\Sigma</math> is a finite nonempty set (whose size is denoted <math>|\Sigma|</math>). It is often assumed that the alphabet is totally ordered, but this is not always necessary. (Since the alphabet is finite, we can always impose a total order when it would be useful to do so, such as when constructing a [[dictionary]].) An element of this the alphabet is known as a '''character'''. The size of the alphabet can affect the asymptotic complexity of string algorithms. In particular, if <math>N</math> is the size of the input, and the alphabet size is bounded by a constant independent of <math>N</math>, then it is called a '''constant''' alphabet; if its size is <math>O(N)</math>, it is called '''integer''' alphabet; and if it is larger than this then it is called a '''general''' alphabet.
 The set of <math>n</math>-tuples of <math>\Sigma</math> is denoted <math>\Sigma^n</math>. A '''string of length <math>n</math>''' is an element of <math>\Sigma^n</math>. The set <math>\Sigma^*</math> is defined <math>\Sigma^0 \cup \Sigma^1 \cup ...</math>; an element of <math>\Sigma^*</math> is known simply as a '''string''' over <math>\Sigma</math>. The '''empty string''', denoted <math>\epsilon</math> or <math>\lambda</math>, is the unique element of <math>\Sigma^0</math>. The length of a string <math>S</math> is denoted <math>|S|</math>. (Note that the usual definition of "string" requires strings to have finite length, although arbitrarily long strings exist.)
@@ Line 29: / Line 29: @@
 * <math>|ST| = |S|+|T|</math>
 For example, concatenating ''there'' and ''in'' gives ''therein''. So do concatenating ''the'' and ''rein'', or <math>\epsilon</math> and ''therein'', or ''therein'' and <math>\epsilon</math>. Note that in general, concatenation is not commutative.
-===Alphabet size===
-The size of the alphabet can affect the asymptotic complexity of string algorithms. In particular, if <math>N</math> is the size of the input, then we can classify alphabets as follows<ref>Juha Kärkkäinen and Peter Sanders. 2003. Simple linear work suffix array construction. In ''Proceedings of the 30th international conference on Automata, languages and programming'' (ICALP'03), Jos C. M. Baeten, Jan Karel Lenstra, Joachim Parrow, and Gerhard J. Woeginger (Eds.). Springer-Verlag, Berlin, Heidelberg, 943-955. Retrieved 2010-11-13 from [http://monod.uwaterloo.ca/~cs482/suffix-icalp03.pdf http://monod.uwaterloo.ca/~cs482/suffix-icalp03.pdf].</ref>:
-* If <math>|\Sigma| \in O(1)</math>, that is, it is bounded above by a constant ''independent of <math>N</math>'', then the alphabet is said to be '''constant'''.
-* If the characters are integers bounded by <math>O(N^c)</math> for some <math>c</math>, then the alphabet is said to be '''integer'''. (This is equivalent to saying that the characters are integers with at most <math>O(N)</math> digits.) Note that this implies <math>|\Sigma| \in O(N^c)</math> and equivalently <math>\log|\Sigma| \in O(\log N)</math>.
-* All alphabets are '''general''', even when they are neither constant nor integer.
-The dividing lines may seem arbitrary, but they are placed where they are because for many algorithms a better asymptotic complexity guarantee (in terms of <math>N</math>) may be achieved when the alphabet is constant rather than when it is integer, or when it is integer rather than when it is general. For example, assume that we want to construct a [[trie]] of a set of strings of total size <math>N</math> using constant space. We need to store the links from each node (of which there may be up to <math>O(N)</math>) to its children. These links should be indexed by character, so that we can trace down a path of characters from root to leaf when looking for a word.
-* When the alphabet is constant, we can simply store the child links of each node in an array and walk each link in constant time, and a search will take <math>O(N)</math> time (because the trie's height cannot possibly be more than <math>N</math>), and furthermore the space complexity is <math>O(N|\Sigma|) = O(N)O(1) = O(N)</math>.
-* When the alphabet is integer, this approach will not meet the linear space restriction. Instead, we can store the links at each node in a [[map]] (from character to [[pointer]]), which will allow us to perform the walk in <math>O(N\log|\Sigma|) = O(N\log N)</math> time. The total extra space usage (that is, the total size of all the maps) is proportional to the total size of the maps, which, being equal to the total number of links, is <math>O(N)</math>.
-* If the alphabet is general, and we do the same, then the time taken to walk down the tree will be <math>O(N \log |\Sigma|)</math> again; but this is no longer bounded by <math>O(N \log N)</math>.
 ==Treatment in programming languages==
@@ Line 50: / Line 40: @@
 * ''Find'' the first occurrence of a given character in a given string, or find the first occurrence of a given string as a substring of another string.
 * ''Compare'' two strings [[Lexicographic order|lexicographically]].
-==References==
-<references/>