The Algorithms Design Manual, Data structures

Contents:

Data Structures

Dictionaries

Input Description : A set of $n$ records, each identified by one or more fields.
Problem description : Build and maintain a data structure to efficiently locate, insert, and delete the record associated with any query key $q$.

It is more important to avoid using a bad data structure than to identify the single best option available.
Questions to ask:
- How many items will you have in your data structure?
- Do you know the relative frequencies of insert, delete, and search operations?
- Can we assume that the access pattern for keys will be uniform and random?
- Is it critical that individual operations be fast, or only that the total amount of work done over the entire program be minimized?
Under the hood:
- Unsorted linked lists or arrays: For small data sets.
- Sorted linked lists or arrays
- Hash tables
  - How do I deal with collisions? Open addressing or bucketing?
  - How big should the table be? At least as big as $m$, the number of items you expect to put in the table, it better be prime.
  - What hash functions should I choose? For strings,
    
    $$H(S,j) = \sum_{i=0}^{m-1}\alpha^{m-(i+1)}\times char(s_{i+j})\mod{m}$$
    $\alpha$ is the size of the alphabet, $char(x)$ maps a character to its ASCII code. This can be computer efficiently as,
    
    $$H(S,j+1) = (H(S,j) - \alpha^{m-1}char(s_j))\alpha + s_{j+m}$$
    . In the end, make sure that the distribution is as uniform as possible.
- Binary search trees
  - Elegant data structures that support fast insertions, deletions and queries.
  - Do I need self balancing trees? Red-black trees, splay trees.
  - Use the best implementations!!
- B-trees
  - For data sets so large that they will not fit in the memory.
  - Uses caches and secondary storage.
  - Look for Cache-oblivious algorithms
- Skip lists
  - Somewhat of a cult data structure.
Implementations
- C++ : std::map
  - Implemented as Binary Search Tree.
- LEDA
  - hashing, perfect hashing, B-trees, red-black trees, random search trees, skip lists.
- Java Collections and Data Structures Library in Java
Related : Sorting, searching

Priority Queues

Input description: A set of records with numerically or otherwise totally-ordered keys.
Problem description: Build and maintain a data structure for providing quick access to the smallest or largest key in the set.

Very useful in simulations(as future events ordered by time).
Not required if insertions, deletions and queries are not intermixed.
Questions to ask:
- What other operations do you need?
- Do you know the maximum data structure size in advance?
- Might you change the priority of the elements already in the queue?
Choices:
- Sorted array or list
- Binary heaps
- Bounded height priority queue
  - Useful for small, discrete range of keys.
- Binary Search trees
  - Very useful when you need more dictionary operations or the size is unbounded.
- Fibonacci and pairing heaps
  - Speed up decrease-key operation.
Implementations
- C++ STL std::priority_queue
  - Uses a std::vector or std::dequeue.
- LEDA
  - Binary heaps, Fibonacci heaps, pairing heaps, Emde-Boas trees and bounded height priority queues
- Java Collections : java.util.PriorityQueue
Related : Dictionaries, sorting, shortest path.

Suffix Trees and Arrays

Input description: A reference string $S$.
Problem description: Build a data structure to quickly find all places where an arbitrary query string $q$ occurs in $S$.

Phenomenally useful for solving string problems.
A suffix tree is simply a trie of all the proper suffixes of $S$.
Can be constructed in $O(n)$ by making clever use of pointers!!
What can you do with it?
- Find all occurrences of $q$ as a substring of $S$ in $O(|q|+k)$, where $k$ is the number of occurrences.
- Longest substring common to a set of strings.
- Find the longest palindrome in $S$.
They get even better when used as suffix arrays.
Implementations
- C implementation by Schurmann and Stoye.
- Many C/C++ implementations in Pizza&Chili corpus
- Java
  - BioJava::SuffixTree
Related : String Matching , Text Compression, Longest Common Substring

Graph Data Structures

Input description: A graph $G$.
Problem description: Represent the graph $G$ using a flexible, efficient data structure.

Adjacency matrices and adjacency lists.
For most things, adjacency lists are the way to go, unless the graphs are very small or very dense.
Questions to ask:
- How big will your graph be?
- How dense will your graph be?
- Which algorithms will you be implementing?
- Will you be modifying the graph over the course of your application?
Prefer existing implementations.
Graphs have different flavors, like Planar graphs, Hypergraphs, hierarchical graphs(representation).
Implementation
- C++ : LEDA, Boost
- Java : JUNG, Data Structures Library(JDSL), JGraphT
- Stanford Graphbase
Related : Set data structures, graph partition

Set data structures

Input description: A universe of items $U = \{u_1, \ldots , u_n}\$ on which is defined a collection of subsets $S = \{S_1, \ldots , S_m\}$.
Problem description: Represent each subset so as to efficiently $(1)$ test whether $u_i \in S_j$, $(2)$ compute the union or intersection of $S_i$ and $S_j$, and $(3)$ insert or delete members of $S$.

In mathematical terms, unordered collection of objects drawn from a fixed universal set.
Uses a single canonical order for implementations, typically sorted.
Distinguished from dictionaries(no fixed universal set) and strings(order is important).
Multisets : An element can occur more than once.
A system of subsets can also be represented as a hypergraph.
Representations:
- Bit vectors : For a fixed universal set. Space efficient, fast intersection and union. Not very good for sparse subsets.
- Containers and dictionaries
- Bloom filters : Bit vectors with hashing.
Set partition : When an item can be in only one set.
Representations for set partition structures:
- Collection of containers : Membership testing is costly.
- Generalized bit vector : Union, intersection takes $O(n),\; n = |U|$.
- Dictionary with a subset attribute : : Union, intersection are slow.
- Union-find data structures : Awesome asymptotic performance!!! Even better with path compression. Does not allow breaking subsets created by unions.
Implementations
- C++ STL set and multiset. LEDA.
- Java Collections HashSet and TreeSet
Related : Generating subsets, generating partitions, set cover, minimum spanning tree.

Kd-Trees

Input description: A set $S$ of $n$ points or more complicated geometric objects in $k$ dimensions.
Problem description: Construct a tree that partitions space by half-planes such that each object is contained in its own box-shaped region.

Spatial data structures that hierarchically decompose space into a small number of cells, each containing a few representatives from an input set of points.
Flavors differ in how the splitting plane is selected:
- Cycling through the dimensions
- Cutting along the largest dimension
- Quadtrees or Octtrees : Cutting (axes-parallel planes) simultaneously along all the dimensions.
- BSP-trees : General(arbitrary) cutting planes to carve up cells. More complicated to work with.
- R-trees : Objects are partitioned into (possibly overlapping) boxes.
Advantages :
- Point location.
- Nearest neighbor search.
- Range search
- Partial key search
Effective upto say 20 dimensions.
Implementations
- KDTREE2 : C++ and Fortran
- Terralib : C++
Related : Nearest-neighbor search, point location, range search.

Part 2 of the The Algorithms Design Manual series

Previous articles

Next articles