Data structure

From PEGWiki
Jump to: navigation, search

A data structure or container is a scheme for organizing data in memory (such as random-access memory, a hard disk, or across the hard disks of a network of data servers). Data structures are designed with efficiency and compactness in mind; that is, the best data structures are the ones that do not use very much more space than is taken up by the raw data themselves, and allow the data to be efficiently operated upon. Nearly all high-level programming languages have built-in support for some data structures; for example, imperative programming languages usually have built-in support for the array, functional programming languages usually have built-in support for the singly linked list, and stack-based programming languages, as the name implies, automatically organize all data onto a stack.

Data structures are always used in conjunction with algorithms; the algorithm determines what to do with data, and the data structure determines how this is done. For this reason, algorithms and data structures form the foundation of computer science. Choosing appropriate data structures for the implementation of an algorithm can greatly boost the algorithm's efficiency.

Anatomy[edit]

Strictly speaking, the data structure consists of two parts: the interface and the implementation, although the latter is what is usually being referred to when the term data structure is used, because it is often highly nontrivial in comparison to the former. In short, the interface is the outside of an algorithm, whereas the implementation is the inside.

Interface[edit]

The interface is the means by which a data structure may be operated on; the set of all operations that a data structure directly allows. For example, a stack allows an element to be pushed onto the top or removed from the top; these two operations are part of the stack's interface. A stack does not allow us to peek at the fifth element from the top without disturbing the elements above it, therefore this is not part of the interface. A stack allows us to pop two elements from the top, but this is probably not part of the interface either, because as long as the stack provides the means to pop one element in its interface, we can simply do this twice to achieve the desired effect. Knowing the interface of a data structure is knowing enough information to incorporate the data structure into a larger data structure or into an algorithm.

Implementation[edit]

At some level, a data structure must actually do what its interface tells us it does. The means by which this is accomplished is known as the implementation. The implementation consists of the details concerning how the data are in fact organized in memory and what happens to them when an operation from the interface is executed. It consists of the details of what algorithms are used to operate on the data and what smaller data structures, if any, are incorporated into a given data structure. The bulk of the code associated with a data structure is usually implementation. For example, we can implement a stack by equipping an extensible array with a push and pop operation, where pushing adds an element on the end and popping removes one. We could also equip a linked list with the same operations; all the data are still there, but they are stored differently in memory. A third implementation would be to take a deque and restrict its interface, providing only the abilities to push and pop at the end. The implementation of a data structure can affect the efficiency of the algorithm that uses it, but does not affect its behavior.

Abstract data type[edit]

If all we know about a data structure is its interface, it is said to be an abstract data type. For example, many programming languages' standard libraries contain associative arrays or dictionaries, which generalize the concept of arrays by allowing any totally ordered types as indices. There are many ways of implementing these, such as by binary search trees, hash tables, or skip lists, but the language standard usually does not specify which one, and the programmer who uses the library has no need to know which implementation is used. Hence the associative array is said to be an abstract data type.

Analogy[edit]

A secretary's desk and filing cabinets can be considered a data structure. The desk is the interface; here you can ask the secretary to add a file to the records, or to retrieve a file that already exists in the records. The filing cabinet is the implementation; it is where the data are actually stored, neatly alphabetized and sorted. If a company places an ad in the papers, seeking a new secretary, the company is, in effect, advertising for an abstract data type; they need the tasks of storing and retrieving files (among others) to be performed, but do not specify which files should go in which drawers and the like. (In this example, the secretary himself/herself corresponds to the CPU.)

Properties[edit]

A data structure is said to be immutable if it is not possible to modify it after it has been created; in order to be useful, an immutable data structure must be fed all the data it is to contain at the time of its creation, and if we wish to add, remove, or replace the data it contains, we must actually use it to create a new data structure of the same type that is identical except for the modifications we wish to make. An example is the String in Java. A data structure that allows its content to be changed is said to be mutable. A data structure can be made immutable by simply not providing anything in the interface that would modify it, or using language-specific constructs such as a const declaration to block write access to what would otherwise be mutable. In a purely functional programming language such as Haskell, with strict referential transparency, all data structures are immutable, whereas in assembly language, whatever data structures one creates will be mutable.

A data structure is said to be persistent when, after updating it, it is still possible to retrieve or operate on previous versions of the data structure. Arrays, for example, are not persistent, because after we change an element of an array, the old value of that element is lost. We could, of course, simply take the collection of all versions a data structure has gone through to be a persistent representation of that data structure, but this will only be useful if the size of this is much less than the sum of the sizes of all the versions the data structure has gone through. (This is the case, for example, with properly implemented immutable tree data structures in which the new version shares most of its nodes with the old version.)

Important data structures[edit]

The following are among the most fundamental and ubiquitous abstract data types:

  • Strings are finite sequences of characters.
  • Stacks allow items to be added and removed. When an item is removed, it is always the one most recently added.
  • Queues allow items to be added and removed. When an item is removed, it is always the remaining item added earliest.
  • Deques are generalizations of both stacks and queues; they allow addition and removal of elements at "both ends".
  • Priority queues are like queues, but the "most important" item is always the first to be removed. The "importance", or priority, of an element is set when it is added.
  • Set data structures, or symbol tables, maintain sets (collections of distinct items) and allow us to add an item, remove an item, or check whether an item is present.
  • Associative arrays or dictionaries are like sets, but two items are added at a time; the first may be used to "look up" the second.

The following are common implementations of data structures:

  • Arrays store a sequence of data sequentially in random-access memory.
  • Adjacency matrices are two-dimensional arrays used to represent graphs.
  • Linked lists store a sequence of data in random-access memory by having each datum "link" to the next (by holding the memory address of the next item).
  • Adjacency lists are collections of sequential containers (typically arrays or linked lists) used to represent graphs.
  • Trees organize data in a hierarchial structure by assigning to all data but one, the root, a parent.
  • Binary search trees are trees in which each node has at most two children, called left and right, and the value stored in a left child is always less than or equal to that stored in its parent, and the value stored in a right child is always greater than that stored in its parent. They are usually used to implement symbol tables and dictionaries.
  • Binary heaps are complete binary trees in which the value stored in each node is strictly greater than the values stored in its child nodes. They are often used to implement priority queues.
  • Tries or prefix trees are trees used to contain sets of strings.