Jul 022010
 

Top-10

Author: Felix Halim

This part of review is writen by the author himself (Felix Halim).

Brute Force
The naive solution is to sort the dictionary using the three ordering rules (length, lexicographic, index). For each query, we do linear scan for each worded (in sorted order) and terminate as soon as we collect 10 words that has the query substring. This solution is too slow O(N log N + Q * N). Binary search can help to find the starting index for each query and the scan can start that index but it’s still too slow (it’s still doing the linear scan). Here is the sample code: top10_tle_strstr.cpp

Suffix-Array and Segment-Tree
By looking at the large number of queries, it suggests that each query should be answered in less than linear time. To answer a query in less than linear time, we need to pre-process the N words in the dictionary using some "well known" data structure (i.e. suffix-array or suffix-tree). All the words in the dictionary can be concatenated together and separated by a separator (e.g. '#') forming a very long string of K characters. Suffix-array is then applied on this long string.

Suffix-array made it easy to find a substring in a very long text of K characters in O(log K) time. There exists a simple way to construct suffix-array in O(K * log^2(K)). Suffix-array can give us a range of indexes in the suffix array that has a certain prefix (the query substring) in O(log K) time. The "Top 10" answers lies in this range. That is, it satisfies the first rule of the problem (all the "Top 10" words have to contain the query substring). Unfortunately, the range can be very big and finding the actual "Top 10" answers in that range can be as costly as the naive linear scan.

This is where the segment-tree comes to the rescue. If we construct a segment-tree on the suffix-array,
we are able to do range-query in the suffix-array to select the "Top 10" answers in O(log(N) * 10 * 10). Each node in the segment-tree will collect and merge the "Top 10" answers from its children nodes and from itself. The "Top 10" is maintained using insertion sort algorithm thus it requires the O(10 * 10) time for each merge.

During the merging, the three ordering rules (length, lexicographic, index) are applied. The first ordering (by length) is easy to compute. The second ordering (by lexicographic) is a little tricky. The second ordering for the N words can be precalculated after the suffix-array is constructed. For each word, we just need to mark its position in the suffix-array. Using this "suffix-position", the second ordering can be checked as simple as comparing two integers. The third ordering is trivial.

The complexity of this solution is O(K * log^2(K) + N * log(K) + K * log(K) * 10 * 10 + Q * log(K) * 10 * 10) which is composed of suffix-array + suffix-position + segment-tree-on-suffix-array + segment-tree-query where K is the total number of characters in the dictionary. This solution runs about 2 seconds using judge's input which is fast enough to get this problem accepted. Here is the sample code : top10_ac_sa_smtree.cpp

Suffix-Tree or Suffix-Array + Longest Common Prefix
The concatenated words in the dictionary can be pre-processed using suffix-tree data structure. Each node in this tree can be annotated to contain the "suffix-position" using a regular tree traversal O(K). Using the suffix-position, we can pre-calculate the "Top 10" answers for each node in the suffix-tree in O(K * 10 * 10) since there are at most 2*K nodes in the suffix-tree. Each query is answered by traversing the tree to the correct node and print the pre-calculated "Top 10" answers in that node O(query length).

The total complexity is O(K + K * 10 * 10 + L) where K is the total number of characters in the dictionary and L is the total number of characters in the queries. This solution runs about 0.75 seconds using judge's input. However, this solution is not necessary to get the problem accepted. This solution is only for those who enjoy the beauty of suffix-tree. The suffix-tree can be constructed in at least two ways. First is using the Ukkonen's method, second is using the suffix-array + longest common prefix to build the suffix-tree (the slower method which runs in about 1.0 seconds). Here is the sample code: top10_ac_sfxtree.cpp or top10_ac_sa_lcp_sfxtree.cpp

This problem was solved only by 1 team: UESTC_floyd from University of Electronic Science and Technology of China, minute 259.


  16 Responses to “ACM-ICPC 2009 Jakarta – Problem Set & Analysis”

  1. Wah, akhirnya jadi juga :)) *Cepet* juga ya, ga perlu sampai satu tahun udah jadi =)) …

  2. Wogh.. siapa itu yang dihitamkan di YM yang lagi away?

    Btw, CONGRATS! Akhirnya jadi juga ini blog…

  3. Itu si husin, gw belum ganti nama di YM gw, jadi masih muncul ym-id dia. Diitemin biar ga diprotes ama orangnya πŸ˜€

  4. kyknya problem A ada yg lupa ditutup tagnya.

    ternyata write up H cuma gitu tok… πŸ˜› tau gitu harusnya bisa jadi dari dulu2 dong =))

  5. berhubung tampaknya komentarnya gak dibaca:
    Ax

  6. ah sial, tagnya gak bener.
    A x

  7. ah dodol ini
    < i > A < sub > x < sub > < /i >

  8. @mahli: ngapain lu?

  9. maaf, saya masih agak belum paham,
    himpunan S itu menyatakan himpunan apa ya?

    makasih sebelumnya..

  10. @akbar: yang soal Alien ya? S itu himpunan evil alien yang sudah dibom. Contoh: S = {0,1,1,0}, artinya ada 4 evil alien, alien kedua dan ketiga sudah dibom (0 = masih hidup, 1 = sudah dibom).

  11. @suhendry : saya sudah mulai paham,

    Dari yang saya tangkap:
    1. dp[i] : minimal bom untuk membunuh set i evil alien.
    2. dp[i] = min(dp[i], dp[i&(~v[j])]+1); untuk mengupdate apakah untuk membunuh minimal i alien diperlukan dp[i] bom, atau dp[i&(~v[j])] + 1 bom.

    Tapi masih ada yang saya belum mengerti mengenai:
    dp[i] = min(dp[i], dp[i&(~v[j])]+1);

    Misalkan di looping awal:
    – i = {0,0,0,1} = 1, dan dp[i] = 15;
    – ketika kita bom di koordinat j = (0,0), maka set evil yang akan mati = v[(0,0)] = {0,0,1,1} = 3.

    Jadi dp[1] = min(dp[1],dp[1&(~3)]+1)
    dp[1] = min(dp[1],dp[1&(12)]+1)
    dp[1] = min(dp[1],dp[12]+1).

    Tetapi 12 itu = {1,1,0,0} (artinya alien 1 dan 2 dari kiri mati)
    sehingga misalkan saja dp[12]+1 < dp[1], maka dp[1] = dp[12]+1 dan itu menjadi tidak valid karena dp[1] menyatakan set evil {0,0,0,1} atau alien keempat mati, sedangkan dp[12] hanya alien 1 dan 2 saja yang mati sedangkan alien 4 tidak.

    Kalau salah mohon dikoreksi..Maaf ya saya jadi banyak tanya…:D hehe

  12. horey…akhirnya….yg ditunggu-tunggu….sangat cepat nih…….:)

  13. Susu.. mana warnanyaaaaa? yang penjelasan gw…

  14. @felixh: huehehehehe, maaf, fixed πŸ˜€

  15. Problem E – A + B

    The smallest possible A + B can be obtained from A and B in their smallest possible number base. To find this you only need to find the smallest digit in A (and B) which is x and its smallest base is x + 1.

    bukannya yang bener gini ya?

    The smallest possible A + B can be obtained from A and B in their smallest possible number base. To find this you only need to find the LARGEST digit in A (and B) which is x and its smallest base is x + 1.

    CMIIW. thx

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

(required)

(required)