Array

From PEGWiki
Revision as of 07:17, 5 March 2011 by Brian (Talk | contribs) (Created page with "The '''array''' is a data structure considered fundamental in imperative programming languages. We shall take the definition of the term to be ''a collection of objects, all ...")

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

The array is a data structure considered fundamental in imperative programming languages. We shall take the definition of the term to be a collection of objects, all of the same size and type, regularly spaced in random-access memory (RAM). Because the objects are regularly spaced in memory, their addresses occur in arithmetic progression. Therefore, when we assign the object at the lowest memory address an index of, for example, zero, the next object after that an index of one, and so on up to the nth object, which receives an index of n-1, we can access any object in the array in constant time just by knowing the index (using the formula for the ith term of an arithmetic progression). This property makes the array data structure useful and applicable to a wide variety of different algorithms, and many other data structures contain arrays, using them to organize information internally. In particular, an array is often the ideal data structure for a finite sequence.

In functional programming languages based upon the lambda calculus, arrays are not fundamental ingredients, as the lambda calculus is not aware of the organization of computer memory. Nevertheless, they may have language support for the sake of efficiency.

Terminology

The size of an array is the number of objects it contains. The stored objects themselves are known as elements, entries, or values of the array. An element of an array is uniquely identified in that array by an index, which is usually a natural number (as in the preceding paragraph), but may be any kind of object taken from a finite well-ordered set. (For example, the Pascal programming language allows the use of characters as array indices, as characters may be well-ordered according to their ASCII values; it allows boolean indices, with false = 0 and true = 1; but it does not allow reals. All programming languages that support arrays support natural numbers as indices, and some do not support any other type of index.) Indices do not appear in the array itself; the choice of index is an arbitrary convention we use to abstract the locations of elements in arrays, without using the addresses themselves, which can be cumbersome to work with and may change every time a program is run and memory for the array allocated. If an element of an array is located at index i, the element immediately following it in the array, if there is one, occurs at an index equal to the successor of i; if i is an integer then this is i+1.

A segment or slice of an array consists of exactly those values of the array occurring at indices taken from a contiguous range. For example, we may choose the entire array, or we may choose no elements at all, or we may choose the fifth element, or we may choose the ninth, tenth, eleventh, and twelfth elements; all these are considered segments or slices.

Address computation

If each element in an array occupies x bytes of memory, then the difference in addresses of consecutive elements in the array is some value d \geq x. It may be exactly x, in which case the array takes up exactly the same amount of space as its individual elements combined, but, depending on the programming language, d may also be somewhat greater than x in order to keep the elements word-aligned. (In practice, most compilers will not choose d > x, simply because it wastes space.)

Now, suppose this array is indexed by members of some finite well-ordered set I. Define a bijection f between I and \mathbb{N} (starting from zero) as follows: since I is well-ordered, it has a least element l; we shall write f(0) = l. Then, either I is exhausted, or it contains a next smaller element, l'; then we shall write f(1) = l', and so on until the elements of I have been put into order in one-to-one correspondence with \{0, 1, ..., n-1\} (where n = |I|).

Furthermore, suppose that the first element in the array, the element located at the lowest memory address, in fact has address a. Then the element located at index i will have address a+df^{-1}(i). Likewise, if we know the address a' of an element, then its index is given by f((a'-a)/d).

For example, suppose that a program is counting the number of words that start with each letter of the Latin alphabet. Then it will probably have an array indexed by the alphabet in which the values are, say, integers 4 bytes long. So the element at index a holds the number of words that start with a, the element at index b holds the number of words that start with b, and so on up to z. Then, we will have f(x) defined to be the (x+1)th letter of the alphabet, hence f(0) = a, ..., f(25) = z. So suppose that the first element of the array (the count of words starting with a) has address 92631507. Then, suppose we wish to look up the number of words starting with the letter s. s is the 19th letter of the alphabet, so we will find this entry at address 92631507+4×18 = 92631579. Likewise, suppose we wish to know which element is located at address 92631555. To do this, we let x = 92631555-92631507 = 48, whence x/4 = 12 and the index we seek is f(12), the thirteenth letter of the alphabet, which is m. (This example illustrates that it's really not as complicated as it sounds; the mapping of indices to addresses is very intuitive.)

Multidimensional arrays

The visual interpretation of the array is, perhaps, a line of boxes, each of which is a variable (a space in memory that holds one of the array's elements). So, for example, suppose there were fifty people in a line and we wanted to store their names. Then we could assign each person a number according to his or her position in the line, and use these as indices to an array with 50 elements, each of which holds the name of one person (presumably as a fixed-length string or a pointer to a variable-length string)?

It becomes clear fairly quickly, however, that this mode of thinking is severely limiting. For example, suppose that there are 50 rows of 50 occupied chairs each, and we want to store the names of all 2500 people listening to a speech. What do we do now? One method is to imagine that we asked the first row to stand up and form a line off to the left side, then we asked the second row to stand up and line up behind the people who used to be in the first row, in such a way that if Alice is originally seated to the left of Bob, she will end up in front of Bob in the line; and so on down until all 2500 people have lined up, and we assign the number 0 to the first person in the line, 1 to the second person, and eventually 2499 to the last person. (In a bird's eye view facing the podium, we are assigning IDs in English reading order: left to right then top to bottom.) Then we could use an array indexed by the integers from 0 to 2499 to hold information about the names of all the people attending.

But this is cumbersome. Suppose that the guest sitting in the 12th row and the 34th column is randomly selected to win a prize, and we wish to call him out by name. Then we have to compute that his ID is (34-1) + 50\cdot(12-1) = 583, and look up the array element at index 583. It would be much more convenient if we could simply use the ordered pair (34,12) as an index --- not a location along a line, but the coordinates of a point in a plane. The array is being used, after all, to represent a table or matrix or grid of things, not a line.

This, in fact, is the principle behind a multidimensional array, an array indexed by n-tuples, where n > 1. (Such an array is said to be n-dimensional.) It is not hard to form a bijection between tuples and \mathbb{N}; in this example, we have a two-dimensional array with f(0) = (1,1), f(1) = (1,2), ..., f(49) = (1,50), f(50) = (2,1), ..., f(99) = (2,50), ..., f(2450) = (50,1), ..., f(2499) = (50,50). The point is that, in order to ease the cognitive burden on the programmer and aid visualization, the details of the address translation should be taken care of by the compiler and hidden from the programmer. Many programming languages support arrays with several dimensions; these are particularly important in dynamic programming. The size of a multidimensional array is the product of the sizes of the individual sets whose Cartesian product was taken to form the set of tuples as indices for the array. The elements of the tuple need not be all of the same type.

(It is important to remember that a multidimensional array is still, in a sense, a line of values sitting in memory, and that the convention of indexing them by tuples is only for convenience and has no real existence within the machine. Accordingly, the interpretation of the array determines its abstract nature, and not vice versa; whether the first element of an ordered pair as the index represents the row or the column is irrelevant as long as one is consistent, and they may represent, in fact, whatever the programmer wants; the first element may identify a star system and the second one of the stars in the system identified by the first, for example, or the interpretation may be totally abstract, with no physical nature.)