The KWIC index system accepts an ordered set of lines, each line is an ordered set of words, and each word is an ordered set of characters. Any line may be "circularly shifted" by repeatedly removing the first word and appending it at the end of the line. The KWIC index system outputs a listing of all circular shifts of all lines in alphabetical order. This is a small system. Except under extreme circumstances (huge data base, no supporting software), such a system could be produced by a good programmer within a week or two.Here's my code:
#!/usr/bin/env python
def main():
f = open("kwic.txt", "rU")
out = open("kwic-output.txt", "w")
final = []
for line in f:
words = line.split()
count = len(words)
for i in xrange(count):
final.append(makestr(words))
cycle(words)
final.sort()
for ele in final:
out.write(ele + "\n")
def makestr(li):
s = ""
first = 1
for ele in li:
if first == 1:
first = 0
s += ele
else:
s += " " + ele
return s
def cycle(li):
tmp = li[0]
del li[0]
li.append(tmp)
return li
if __name__ == '__main__': main()
By the way, if someone knows a more elegant way to avoid having an extra space in makestr, let me know. I'm aware of the option of deleting the first character after making the string, but I don't consider that very nice either.
Messages (4)
Unix (sh/ksh/bash):
while read line; do n=$(echo "$line" | rs -h); n=${n#* }; echo "$line" | rs -yg1 $n $((n+1)); done | sed -e 's/^[^ ]* *//' -e 's/ */ /g' | sort -df
Note: there are supposed to be two spaces before the * in the sed patterns, but blogger won't let me use <pre>.
I think you might want the string method "join" for the space problem. See how I used it below. Thanks for the fun problem, by the way!
Underscores for leading spaces:
def lineToCycles(line):
____toks = line.split(' ')
____def rot(first):
________return ' '.join(toks[first:] + toks[:first])
____return [rot(i) for i in range(len(toks))]
def kwic(lines):
____result = [];
____for line in lines:
________result += lineToCycles(line)
____result.sort()
____return result
# For instance,
print kwic(['blah foo bar', 'one two three'])
# prints ['bar blah foo', 'blah foo bar', 'foo bar blah', 'one two three', 'three one two', 'two three one']
Oh right, Python has a join function built in already :-)
I decided it'd be convenient to avoid an extra space if combing recursively and wrote this in Scheme:
(define makestr
__(lambda (li)
____(cond ((null? li) "")
______((null? (cdr li)) (car li))
______((string-append (car li) " " (makestr (cdr li)))))))
You can get shorter...
def line_to_cycles(line):
____toks = line.split(' ')
____return [" ".join(toks[i:] + toks[:i])
________________for i in range(len(toks))]
def kwic(lines):
____return sorted(line_to_cycles(line) for line in lines)
-T.