Difference between revisions of "String"
Revision as of 04:29, 29 May 2011
The string is one of the most fundamental objects in computer science. Strings are intimately familiar to us in the form of text and therefore it is no wonder that the theory of strings (which is to be distinguished from the field of theoretical physics known as "string theory") is one of the most extensively studied subdisciplines of computer science.
Definitions
An alphabet, usually denoted <math>\Sigma</math>, is a finite nonempty set (whose size is denoted <math>|\Sigma|</math>). It is often assumed that the alphabet is totally ordered, but this is not always necessary. (Since the alphabet is finite, we can always impose a total order when it would be useful to do so, such as when constructing a dictionary.) An element of the alphabet is known as a character. The size of the alphabet can affect the asymptotic complexity of string algorithms. In particular, if <math>N</math> is the size of the input and the alphabet size is bounded by a constant independent of <math>N</math>, then it is called a constant alphabet; if its size is <math>O(N)</math>, it is called an integer alphabet; and if it is larger than this, it is called a general alphabet.
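The set of <math>n</math>-tuples of <math>\Sigma</math> is denoted <math>\Sigma^n</math>. A string of length <math>n</math> is an element of <math>\Sigma^n</math>. The set <math>\Sigma^*</math> is defined as <math>\Sigma^0 \cup \Sigma^1 \cup \cdots</math>; an element of <math>\Sigma^*</math> is known simply as a string over <math>\Sigma</math>. The empty string, denoted <math>\epsilon</math> or <math>\lambda</math>, is the unique element of <math>\Sigma^0</math>. The length of a string <math>S</math> is denoted <math>|S|</math>. (Note that the usual definition of "string" requires strings to have finite length, although arbitrarily long strings exist.)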
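As a small illustration of these definitions, the following Python sketch enumerates all strings of a given length over the binary alphabet. The names SIGMA and strings_of_length are chosen here for illustration only, and strings are represented as tuples to mirror the tuple-based definition.

```python
from itertools import product

# Illustrative alphabet (an assumption for this sketch): the binary alphabet.
SIGMA = ("0", "1")

def strings_of_length(alphabet, n):
    """Return every n-tuple over the alphabet, i.e. every string of length n."""
    return list(product(alphabet, repeat=n))

# There is exactly one string of length 0: the empty string (the empty tuple).
print(strings_of_length(SIGMA, 0))

# The number of strings of length n is the alphabet size raised to the power n.
for n in range(4):
    print(n, len(strings_of_length(SIGMA, n)))
```

The set of all strings is the union of these sets over every length, so it is infinite even though each individual string is finite.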
It follows from the definition that all strings are sequences, but not all sequences are strings.
For ease of conceptualization, we shall usually assign a symbol, a graphical representation, to each character of the alphabet and, considering a string as a sequence of characters, render it as a sequence of symbols. We shall usually do so in boldface, but in this section we shall use italics to avoid confusion with terms being defined, hence, PEG. On other occasions we may choose to represent them with the ordered list notation, hence, [P,E,G]. (This form is useful when the symbols consist of more than one glyph, as can be the case when they are integers; see below.) We will often number the characters of strings, sometimes starting from zero, sometimes starting from one.
Examples of alphabets include:
- The binary alphabet, with exactly two characters (<math>|\Sigma| = 2</math>), usually denoted 0 and 1. In virtually all computers ever built, data are strings over the binary alphabet, and are known as bit strings.
- The Latin alphabet, with the characters a, A, b, B, ..., z, Z (<math>|\Sigma| = 52</math>). English words may be considered strings over the Latin alphabet.
- The Unicode character set (the size of this alphabet depends somewhat on what we consider to be a valid Unicode character). Text files may be considered strings over this alphabet.
- The set of integers <math>\{0, 1, \ldots, k-1\}</math> for some <math>k \geq 1</math>, for which <math>|\Sigma| = k</math>. Here the characters are also integers, and their corresponding symbols might have more than one glyph when <math>k > 10</math>. (Recall that the definition of character is broader than the common concept of the character as the smallest, indivisible element of writing.)
- The set of nitrogenous bases in DNA, {A, C, G, T}, with <math>|\Sigma| = 4</math>. Codons are considered members of <math>\Sigma^3</math>, and DNA sequences members of <math>\Sigma^*</math>.
- The set of universal proteinogenic amino acids, {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} (<math>|\Sigma| = 20</math>). The primary structure of a protein is represented as a string over this alphabet. This and the preceding example are important alphabets in bioinformatics.
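A substring of a string <math>S = [s_1, s_2, \ldots, s_n]</math> is a string <math>T</math> with the same alphabet as <math>S</math> given by <math>T = [s_i, s_{i+1}, \ldots, s_j]</math> for some <math>i, j</math> with <math>1 \leq i \leq j+1 \leq n+1</math> (taking <math>j = i-1</math> gives the empty string). In other words, it is a series of (possibly zero) characters occurring consecutively in a string, taken in order. Note that the empty string is considered a substring of every string. For example, the, in, here, and therein are all substrings of therein, as is <math>\epsilon</math>; however, tin is not (the characters are not consecutive in the original string), nor is rine (the characters do not occur in order in the original string).
A prefix is a substring with <math>i = 1</math>, so that <math>\epsilon</math>, the, there, and therein are prefixes of therein. A suffix is a substring with <math>j = n</math>, so that <math>\epsilon</math>, in, rein, herein, and therein are suffixes of therein. We write <math>T \sqsubset S</math> to indicate that <math>T</math> is a prefix of <math>S</math>. Likewise the notation <math>T \sqsupset S</math> denotes that <math>T</math> is a suffix of <math>S</math>. We say that <math>T</math> is a proper prefix of <math>S</math> when <math>T \sqsubset S</math> and <math>T \neq S</math>, with proper suffix defined analogously.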
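The substring, prefix, and suffix examples for therein can be checked mechanically. The following is a minimal Python sketch; the helper names substrings, prefixes, and suffixes are introduced here for illustration. Note that Python slices are zero-based and half-open, so s[i:j] covers positions i through j-1, which may differ from the indexing used in a given mathematical definition.

```python
def substrings(s):
    """Return the set of distinct substrings of s, including the empty string."""
    n = len(s)
    # s[i:j] with 0 <= i <= j <= n; the case i == j yields the empty string.
    return {s[i:j] for i in range(n + 1) for j in range(i, n + 1)}

def prefixes(s):
    """Return all prefixes of s, from the empty string up to s itself."""
    return [s[:k] for k in range(len(s) + 1)]

def suffixes(s):
    """Return all suffixes of s, from s itself down to the empty string."""
    return [s[k:] for k in range(len(s) + 1)]

word = "therein"
subs = substrings(word)

# Substrings must be consecutive and in order.
print([t in subs for t in ["the", "in", "here", "therein", ""]])
print("tin" in subs, "rine" in subs)

print(prefixes(word))
print(suffixes(word))
```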
These examples may have given the misleading impression that strings which do not represent "valid" words, such as ther, are not valid strings. This is entirely untrue; ther is also a valid prefix of therein. The definition of a string says nothing about the validity or meaning of a string. A language is a (possibly empty, often infinite) subset of <math>\Sigma^*</math>; an element of a language is often called a word (although this term should be used with caution). Thus, while every element of <math>\Sigma^*</math> is a valid string, not all are necessarily valid words in a language over <math>\Sigma</math>. For example, let <math>\Sigma</math> be defined as the Latin alphabet and <math>L</math> be the language consisting of the representations of all valid English words. (Note that we have been careful to distinguish the words themselves from their representations as strings.) Then ther is certainly a valid string, being an element of <math>\Sigma^*</math>, but is not a valid word in <math>L</math>, since it does not represent an English word.
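The distinction between a string over an alphabet and a word in a language can be sketched in Python as follows. The set TOY_LANGUAGE is a tiny hypothetical stand-in for the set of English words, and the helper names are assumptions made for this illustration.

```python
import string

# The Latin alphabet with both cases: 52 characters.
LATIN = set(string.ascii_letters)

# Hypothetical toy language: a small finite subset of all strings over LATIN.
TOY_LANGUAGE = {"the", "there", "here", "in", "rein", "herein", "therein"}

def is_string_over(s, alphabet):
    """True if every character of s belongs to the alphabet."""
    return all(c in alphabet for c in s)

def is_word(s, language):
    """True if s is an element of the language."""
    return s in language

# "ther" is a valid string over the alphabet, but not a word in the toy language.
print(is_string_over("ther", LATIN), is_word("ther", TOY_LANGUAGE))
print(is_string_over("there", LATIN), is_word("there", TOY_LANGUAGE))
```

A real language such as English cannot be captured by a short hard-coded set, so membership would in practice be delegated to a dictionary or a grammar.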
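It follows from the definitions of substring, prefix, and suffix that a string of length <math>n</math> has <math>n+1</math> prefixes, <math>n+1</math> suffixes, <math>n</math> proper prefixes, <math>n</math> proper suffixes, and <math>\frac{n(n+1)}{2} + 1</math> substrings. (Here we consider two substrings to be distinct if they start or end at different indices, except that the empty string is counted only once; see also counting distinct substrings.)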
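Under this counting convention, a string of length n has n(n+1)/2 nonempty substrings (one for each choice of start and end position) plus the empty string. The following Python sketch, with illustrative function names, checks that count by brute force and contrasts it with the number of substrings that are distinct as strings.

```python
def count_by_position(n):
    """Count substrings of a length-n string by (start, end) positions,
    counting the empty string exactly once."""
    nonempty = sum(1 for i in range(n) for j in range(i + 1, n + 1))
    return nonempty + 1

def count_distinct(s):
    """Count substrings of s that differ as strings, including the empty one."""
    n = len(s)
    return len({s[i:j] for i in range(n + 1) for j in range(i, n + 1)})

# The positional count matches the closed form for small n.
for n in range(8):
    assert count_by_position(n) == n * (n + 1) // 2 + 1

# When characters repeat, fewer substrings are distinct as strings.
print(count_by_position(3), count_distinct("aaa"), count_distinct("abc"))
```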
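Two strings with the same alphabet can be concatenated. To concatenate two strings is to link them together to give a new string which begins with the first and ends with the second with no overlap. Formally, the concatenation of strings <math>S</math> and <math>T</math> is denoted <math>ST</math> and has the following three properties:
- <math>S</math> is a prefix of <math>ST</math>.
- <math>T</math> is a suffix of <math>ST</math>.
- <math>|ST| = |S| + |T|</math>.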
For example, concatenating there and in gives therein. So does concatenating the and rein, or <math>\epsilon</math> and therein, or therein and <math>\epsilon</math>. Note that in general, concatenation is not commutative.
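The behaviour of concatenation described above can be checked with Python's built-in string type, where + concatenates. This is a minimal sketch; the function name concat_properties_hold is introduced here for illustration.

```python
def concat_properties_hold(s, t):
    """Check that s + t begins with s, ends with t, and has no overlap."""
    st = s + t
    return (
        st.startswith(s)                   # the first string is a prefix
        and st.endswith(t)                 # the second string is a suffix
        and len(st) == len(s) + len(t)     # lengths add, so nothing overlaps
    )

# Several different pairs concatenate to the same string.
pairs = [("there", "in"), ("the", "rein"), ("", "therein"), ("therein", "")]
for s, t in pairs:
    assert s + t == "therein"
    assert concat_properties_hold(s, t)

# Concatenation is not commutative in general.
print("there" + "in", "in" + "there")
```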
Treatment in programming languages
In many programming languages, strings are superficially similar to arrays of characters, in that the syntax for accessing a string's characters usually matches the syntax for accessing array elements. Some programming languages, such as C, go so far as to implement strings as arrays of characters. Other programming languages may have a separate library and a separate set of functions for handling common string operations, such as:
- Insert one string into another at a specified position.
  - Special case: Concatenate two strings (insert a string at the beginning or the end of another).
  - Special case: Append a string or a character (which may be considered a string of length one) to the end of an existing string.
- Erase a substring of a given string at a specified position with a specified length.
  - Special case: Erasing from the end may be faster.
- Copy: Return a substring of a given string at a specified location with a specified length.
- Find the first occurrence of a given character in a given string, or find the first occurrence of a given string as a substring of another string.
- Compare two strings lexicographically.
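The operations listed above can be sketched in Python, used here purely as an example language. The helper names insert, erase, and copy are hypothetical and written for this illustration; str.find and the comparison operators are built in. Because Python strings are immutable, each helper returns a new string instead of modifying its argument, and positions are zero-based.

```python
def insert(s, pos, t):
    """Return s with t inserted so that t starts at index pos."""
    return s[:pos] + t + s[pos:]

def erase(s, pos, length):
    """Return s with `length` characters removed starting at index pos."""
    return s[:pos] + s[pos + length:]

def copy(s, pos, length):
    """Return the substring of s of the given length starting at index pos."""
    return s[pos:pos + length]

s = insert("thein", 3, "re")      # insert "re" after "the"
print(s)
print(erase(s, 3, 2))             # remove the two characters just inserted
print(copy(s, 1, 4))              # four characters starting at index 1

# Find: str.find returns the first index of a match, or -1 if there is none.
print(s.find("e"), s.find("rein"), s.find("tin"))

# Compare: Python orders strings lexicographically by code point,
# so a proper prefix sorts before the longer string.
print("the" < "there", "there" < "therein")
```

Appending is the special case insert(s, len(s), t), which in Python is simply s + t.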