Contents:
Set and String Problems
Set Cover
Input description: A collection of subsets \(S = {S_1, \ldots , S_m}\) of the universal set
\(U = {1, \ldots , n}\).
Problem description: What is the smallest subset \(T\) of \(S\) whose union equals
the universal set—i.e. , \( \cup_{i=1}^{|T|} T_i = U\)?
- Several variations of set problem,
- Are you allowed to cover elements more than once? Distinction between set cover and set packing.
- Are your sets derived from the edges or vertices of a graph? Can be modeled as vertex cover instead.
- Do your subsets contain only two elements each? You are in luck. Becomes NP-complete with three variables.
- Do you want to cover elements with sets, or sets with elements
- Greedy is the most natural and effective heuristic for set cover.
- Simulated annealing is likely to produce somewhat better set covers.
- Backtracking can be used to guarantee an optimal solution, but is computationally very expensive.
- Alternative, integer programming formulation of set cover.
- Related : Matching, vertex cover, set packing
Set Packing
Input description: A set of subsets \(S = {S_1, \ldots , S_m}\) of the universal set \(U =
{1, \ldots , n}\).
Problem description: Select (an ideally small) collection of mutually disjoint subsets from \(S\) whose union is the universal set.
- Must every element appear in exactly one selected subset? This will be NP-complete is yes.
- Does each element have its own singleton set?
- What is the penalty for covering elements twice?
- The same heuristics as in set cover work here.
- Related : Independent set, set cover
String Matching
Input description: A text string \(t\) of length \(n\). A pattern string \(p\) of length \(m\).
Problem description: Find the first (or all) instances of pattern \(p\) in the text.
- Arises in almost all text-processing applications.
- Issues to look for,
- Are your search patterns and/or texts short? Simple \(O(mn)\) would suffice.
- What about longer texts and patterns? Preprocessing the search pattern can help improve the jumps on finding a mismatch.
- Do I expect to find the pattern or not? Check backwards and perform better jumps.
- Will you perform multiple queries on the same text? Use suffix trees or suffix arrays.
- Will you search many texts using the same patterns? Build finite automaton. Also useful when searching for regular expressions.
- What if our text or pattern contains a spelling error? Use Approximate string matching algorithms.
- Implementations : Strmat, grep and variants, Boost string algorithms.
- Related : Suffix trees, approximate string matching
Approximate String Matching
Input description: A text string \(t\) and a pattern string \(p\).
Problem description: What is the minimum-cost way to transform \(t\) to \(p\) using insertions, deletions, and substitutions?
- A fundamental problem because we live in an error-prone world.
- Dynamic programming provides the basic approach. \(D[i,j]\) = cost of editing \(i\) characters of string \(p\) into first \(j\) characters of text \(t\). The recurrence selected the minimum of,
- if \(p_i = t_j\) then \(D[i-1, j-1]\) else \(D[i-1, j-1] +\) substitution cost.
- \(D[i-1, j] +\) deletion cost of \(p_j\).
- \(D[i, j+1] +\) deletion cost of \(t_j\).
- Several issues remain however,
- Do I match the pattern against the full text, or against a substring?
- How should I select the substitution and insertion/deletion costs?
- How do I find the actual alignment of the strings?
- What if the two strings are very similar to each other?
- Is your pattern short or long? You can use bit-parallel algorithms, which can be many times faster than dynamic programming.
- How can I minimize the required storage? Quadratic space required by dp table is only required if you need to reconstruct the actual sequence alignment.
- Should I score long runs of indels differently?
- Does similarity mean strings that sound alike?
- Implementations : agrep, nrgrep, TRE
- Related : String matching, longest common substring
Text Compression
Input description: A text string \(S\).
Problem description: Create a shorter text string \(S'\) such that \(S\) can be correctly reconstructed from \(S'\).
- Decreasing secondary storage prices seem to have increased interest in data compression, because there is more data to compress.
- Several questions arise while selecting the right compression algorithm,
- Must we recover the exact input text after compression? Lossy vs Lossless encoding. Documents can't use lossy compression. Images/videos can.
- Can I simplify my data before I compress it? Any information that can be removed would help in compression later.
- Does it matter whether the algorithm is patented? Just something to look out for.
- How do I compress image data? Use JPEG.
- Must compression run in real time? Depends on the application.
- Several dozen algorithms are available,
- Huffman codes, greedy variable length encoding based on frequency.
- Lempel-ziv algorithms (LZV compression), just use this.
- Implementations : gzip, bzip2
- Related : Shortest common superstring, cryptography
Cryptography
Input description: A plaintext message \(T\) or encrypted text \(E\), and a key \(k\).
Problem description: Encode \(T\) (decode \(E\)) using \(k\) giving \(E\) (\(T\)).
- Foundation for the modern internet.
- Three classes of cryptosystems everyone should be aware of,
- Ceaser shifts, substitute on character for something else. Easy to break using statistical analysis.
- Block shuffle ciphers, DES(now crackable), AES, Triple DES.
- Public key cryptography, RSA
- Never try to roll out your own novel cryptosystem. It is not going to work.
- Certain problems related to cryptography,
- How can I validate the integrity of data against random corruption? use checksums, cyclic redundancy check.
- How can I validate the integrity of data against deliberate corruption? MD5, SHA-256.
- How can I prove that a file has not been changed? Digital signatures.
- How can I restrict access to copyrighted material? Stream ciphers are used, but are needed to be protected by the law.
- Implementations : Nettle, Crypto++
- Related : Factoring and primality testing, text compression
Finite State Machine Minimization
Input description: A deterministic finite automaton \(M\).
Problem description: Create the smallest deterministic finite automaton \(M'\) such that \(M'\) behaves identically to \(M\).
- Useful in control systems, compilers, software and hardware design.
- Minimizing the state machine can reduce both storage and execution costs.
- Consider three different problems,
- Minimizing deterministic finite state machine, be removing redundancy.
- Constructing deterministic machines from nondeterministic machines, this can however be worse than NP-complete, as the number of states blow up exponentially.
- Constructing machines from regular expressions, can construct either NFA or DFA.
- Related : Satisfiability, string matching
Longest Common Substring/Subsequence
Input description: A set \(S\) of strings \(S_1, \ldots , Sn\).
Problem description: What is the longest string \(S'\) such that all the characters
of \(S'\) appear as a substring or subsequence of each \(S_i, 1 \le i \le n\)?
- Arises when finding similarity among multiple sequences.
- Issues arising include,
- Are you looking for a common substring?
- Are you looking for a common scattered subsequence?
- What if there are relatively few sets of matching characters?
- What if the strings are permutations?
- What if we have more than two strings to align?
- Related : Approximate string matching, shortest common superstring
Shortest Common Superstring
Input description: A set of strings \(S = {S_1, \ldots , S_m}\).
Problem description: Find the shortest string \(S'\) that contains each string \(S_i\) as
a substring of \(S'\).
- This can be useful in data compression, DNA sequence assembly.
- Finding a super-sequence is not a problem, but finding the shortest super-sequence is NP-complete.
- Easily reducible to asymmetric TSP problem.
- A greedy heuristic: find the substrings with maximum overlap and merge them, repeat.
- Related : Suffix trees, text compression
Comments
comments powered by Disqus