Author: Felix Halim
This part of review is writen by the author himself (Felix Halim).
The naive solution is to sort the dictionary using the three ordering rules (length, lexicographic, index). For each query, we do linear scan for each worded (in sorted order) and terminate as soon as we collect 10 words that has the query substring. This solution is too slow O(N log N + Q * N). Binary search can help to find the starting index for each query and the scan can start that index but it’s still too slow (it’s still doing the linear scan). Here is the sample code: top10_tle_strstr.cpp
Suffix-Array and Segment-Tree
By looking at the large number of queries, it suggests that each query should be answered in less than linear time. To answer a query in less than linear time, we need to pre-process the N words in the dictionary using some "well known" data structure (i.e. suffix-array or suffix-tree). All the words in the dictionary can be concatenated together and separated by a separator (e.g. '#') forming a very long string of K characters. Suffix-array is then applied on this long string.
Suffix-array made it easy to find a substring in a very long text of K characters in O(log K) time. There exists a simple way to construct suffix-array in O(K * log^2(K)). Suffix-array can give us a range of indexes in the suffix array that has a certain prefix (the query substring) in O(log K) time. The "Top 10" answers lies in this range. That is, it satisfies the first rule of the problem (all the "Top 10" words have to contain the query substring). Unfortunately, the range can be very big and finding the actual "Top 10" answers in that range can be as costly as the naive linear scan.
This is where the segment-tree comes to the rescue. If we construct a segment-tree on the suffix-array,
we are able to do range-query in the suffix-array to select the "Top 10" answers in O(log(N) * 10 * 10). Each node in the segment-tree will collect and merge the "Top 10" answers from its children nodes and from itself. The "Top 10" is maintained using insertion sort algorithm thus it requires the O(10 * 10) time for each merge.
During the merging, the three ordering rules (length, lexicographic, index) are applied. The first ordering (by length) is easy to compute. The second ordering (by lexicographic) is a little tricky. The second ordering for the N words can be precalculated after the suffix-array is constructed. For each word, we just need to mark its position in the suffix-array. Using this "suffix-position", the second ordering can be checked as simple as comparing two integers. The third ordering is trivial.
The complexity of this solution is O(K * log^2(K) + N * log(K) + K * log(K) * 10 * 10 + Q * log(K) * 10 * 10) which is composed of suffix-array + suffix-position + segment-tree-on-suffix-array + segment-tree-query where K is the total number of characters in the dictionary. This solution runs about 2 seconds using judge's input which is fast enough to get this problem accepted. Here is the sample code : top10_ac_sa_smtree.cpp
Suffix-Tree or Suffix-Array + Longest Common Prefix
The concatenated words in the dictionary can be pre-processed using suffix-tree data structure. Each node in this tree can be annotated to contain the "suffix-position" using a regular tree traversal O(K). Using the suffix-position, we can pre-calculate the "Top 10" answers for each node in the suffix-tree in O(K * 10 * 10) since there are at most 2*K nodes in the suffix-tree. Each query is answered by traversing the tree to the correct node and print the pre-calculated "Top 10" answers in that node O(query length).
The total complexity is O(K + K * 10 * 10 + L) where K is the total number of characters in the dictionary and L is the total number of characters in the queries. This solution runs about 0.75 seconds using judge's input. However, this solution is not necessary to get the problem accepted. This solution is only for those who enjoy the beauty of suffix-tree. The suffix-tree can be constructed in at least two ways. First is using the Ukkonen's method, second is using the suffix-array + longest common prefix to build the suffix-tree (the slower method which runs in about 1.0 seconds). Here is the sample code: top10_ac_sfxtree.cpp or top10_ac_sa_lcp_sfxtree.cpp
This problem was solved only by 1 team: UESTC_floyd from University of Electronic Science and Technology of China, minute 259.