| Solve health-related algorithmic problems through crowdsourcing. | Algorithm Development Through Crowdsourcing |
The following is a description of a challenge and the problem statement that was used for it. Please note that at this point we do not expect you to develop the problem statement (as shown below). Our research team will work with you to translate your algorithm into a challenge statement for TopCoder.
We selected a genome-sequencing problem in the emerging field of immunomics, the systems-level study of the immune system. A main goal in this field is to map the relationship between specific sets of B and T cells (both are types of white blood cells), which comprise the adaptive branch of the immune system, and the viral, bacterial, cancer, and autoimmune antigens that they target. These cells' specificity for particular antigens depends on which antibodies (for B cells) and T cell receptors (TCRs) they happen to make, which in turn depends on the sequence of their antibody or TCR genes. Unlike other genes, antibody and TCR are not encoded as single genes in the genome. Instead they are built up combinatorially in each cell from gene segments, with the result that each new cell can have a gene with a different DNA sequence. Genomic diversity is increased further by insertion or deletion of nucleotides at the joins between segments, by mutation in the resulting gene, and by combinatorial pairing of the encoded protein products. The protein products are called light and heavy chains for antibodies, and alpha and beta (or gamma and delta) chains for TCR. Heavy, beta, and delta chains have three segments; the others have two. A first step toward the goal of an immunology gene map is being able to determine, given a known set of gene segments (e.g. obtained through genome sequencing), which segments comprise each gene, for each of the ~106 genes in the ~106 B and T cells in a vial of a patient's blood. The optimal algorithm solution needed to be maximally accurate, better than existing alternative predictors such as BLAST, MegaBLAST or IMGT/V-QUEST, and fast enough - e.g., 106 sequences/minute on a typical desktop computer - to analyze multiple samples in a reasonable amount of time on standard, off-the-shelf, computer hardware.
Sequence alignment is a well-studied task that comes in many flavors. One problem involves determining how various pieces of DNA combine to form longer sequences which are then used for various purposes. In particular, we have three sets of sequences: A, B and C. By selecting sequences a, b, and c from A, B and C, we can append them to achieve a longer sequences abc. However, this process is not without complexities, and during the appending process a number of insertions, deletions, and mutations might be made, making it difficult to tell what the original a, b and c were.
You will be given String[ ]'s A, B and C representing the sets mentioned above. You will also be given a number of query strings. For each query, you must determine which sequence abc most closely matches the query string, where closely is defined by Levenshtein distance. You should return a int[ ] with three element for each query string. Thus, ret[i*3] gives the index of the element of A to use for query i (indexed from 0), while element ret[i*3+1] gives the element of B and ret[i*3+2] gives the element of C.
The query strings will be formed by selecting one string from each set. For each of the three selected strings, two independent random integers d1 and d2 between 0 and 10, inclusive will be chosen and d1 characters will be deleted from the front of the string and d2 characters will be deleted from the end. Next, four integers (i1 ... i4) will be chosen by sampling from an exponential distribution with lambda = 0.3, and then taking the floor of the sampled value. i1 random characters will be inserted before a, i2 random characters between a and b, i3 random characters between b and c, and i4 random characters after c. Finally, each character will be changed to a different character with probability 0.1.
Each of the sets A, B and C will be randomly (and independently) selected from one of four biological datasets. Thus, there are 64 possible values for the sets (4 for each A, B and C).
Your solution should find abc triples that are close to queries quickly. For a set of queries, we find the total Levenshtein distance summed over all queries, DIST. The total runtime over all queries in the set is TIME. Your score will then be DIST * 0.01 + ln(TIME). Your overall score will be the sum over all test cases of 1,000,000/score.
An offline tester is available.
| Class: | SequenceAlignment |
| Method: | recover |
| Parameters: | String[ ], String[ ], String[ ], String[ ] |
| Returns: | int[ ] |
| Method signature: (be sure your method is public) |
int[ ] recover(String[ ] A, String[ ] B, String[ ] C, String[ ] queries) |
Harvard Catalyst will work with selected applicants to develop the complete problem statement.