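Text summarization in classifier4J
Package net.sf.classifier4J.summariser
The class SimpleSummariser implements the ISummariser interface method public String summarise(String input, int numSentences),
which provides a simple text summarization feature. Put simply, it picks out the sentences that contain the most frequent words
and returns them in the order they appear in the text to form the summary. It is a simple and workable approach.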
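To make "most frequent words" concrete, here is a minimal standalone sketch of counting word frequencies while skipping stop words. It does not use classifier4J: the class name WordFrequencyDemo, the method wordFrequencies, and the shortened stop-word list are all made up for illustration, and the library's own Utilities.getWordFrequency may split and filter words differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class; not part of classifier4J.
class WordFrequencyDemo {

    // A shortened stop-word list chosen for this example only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "and", "the", "of", "is", "on"));

    // Counts how often each non-stop word occurs, ignoring case.
    static Map<String, Integer> wordFrequencies(String input) {
        // LinkedHashMap keeps words in the order they were first seen.
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Split on runs of non-word characters (spaces, punctuation).
        for (String word : input.toLowerCase().split("\\W+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {cat=2, sat=1, mat=1, asleep=1}
        System.out.println(wordFrequencies("The cat sat on the mat. The cat is asleep."));
    }
}
```

In this toy input "cat" is the only word that occurs twice, so any sentence containing it would be the first candidate for the summary.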
/**
 * @see net.sf.classifier4J.summariser.ISummariser#summarise(String input, int numSentences)
 */
public String summarise(String input, int numSentences) {
    // Step 1: compute the frequency of every word in the input text.
    // This uses the getWordFrequency method of the helper class Utilities, which covers
    // splitting the text into words, filtering stop words, and so on.
    // The stop words here are: String[] stopWords = { "a", "and", "the", "me", "i", "of", "if", "it", "is",
    // "they", "there", "but", "or", "to", "this", "you", "in", "your", "on", "for", "as",
    // "are", "that", "with", "have", "be", "at", "or", "was", "so", "out", "not", "an" };
    // get the frequency of each word in the input
    Map wordFrequencies = Utilities.getWordFrequency(input);

    // Step 2: collect the set of the 100 most frequent words.
    // now create a set of the X most frequent words
    Set mostFrequentWords = getMostFrequentWords(100, wordFrequencies);

    // Step 3: split the input into an array of sentences.
    // This uses String's split("(\\.|!|\\?)+(\\s|\\z)")
    // break the input up into sentences
    // workingSentences is used for the analysis, but
    // actualSentences is used in the results so that the
    // capitalisation will be correct.
    String[] workingSentences = Utilities.getSentences(input.toLowerCase());
    String[] actualSentences = Utilities.getSentences(input);

    // Step 4: select up to numSentences sentences that contain the most frequent words.
    // iterate over the most frequent words, and add the first sentence
    // that includes each word to the result
    Set outputSentences = new LinkedHashSet();
    Iterator it = mostFrequentWords.iterator();
    while (it.hasNext()) {
        String word = (String) it.next();
        for (int i = 0; i < workingSentences.length; i++) {
            if (workingSentences[i].indexOf(word) >= 0) {
                outputSentences.add(actualSentences[i]);
                break;
            }
            if (outputSentences.size() >= numSentences) {
                break;
            }
        }
        if (outputSentences.size() >= numSentences) {
            break;
        }
    }

    // Step 5: put the selected sentences back into the order in which they appear in the text.
    List reorderedOutputSentences = reorderSentences(outputSentences, input);

    // Step 6: add the necessary separators to form the summary.
    StringBuffer result = new StringBuffer("");
    it = reorderedOutputSentences.iterator();
    while (it.hasNext()) {
        String sentence = (String) it.next();
        result.append(sentence);
        result.append("."); // This isn't always correct - perhaps it should be whatever symbol the sentence finished with
        if (it.hasNext()) {
            result.append(" ");
        }
    }
    return result.toString();
}
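
The method above calls two helpers whose bodies are not shown here: getMostFrequentWords (step 2) and reorderSentences (step 5). The following is only a sketch of what such helpers could look like, written with generics for readability. The class name SummariserHelpersSketch is hypothetical and the bodies are my own assumptions based on how the helpers are used above, so the library's actual code may well differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for the two helpers; not the library's own code.
class SummariserHelpersSketch {

    // Returns up to `count` words, most frequent first.
    static Set<String> getMostFrequentWords(int count, Map<String, Integer> wordFrequencies) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(wordFrequencies.entrySet());
        // Sort by descending frequency. List.sort is stable, so words with
        // equal counts keep the iteration order of the input map.
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (result.size() >= count) {
                break;
            }
            result.add(entry.getKey());
        }
        return result;
    }

    // Orders the selected sentences by the position of their first occurrence in the input.
    static List<String> reorderSentences(Set<String> sentences, String input) {
        List<String> result = new ArrayList<>(sentences);
        result.sort(Comparator.comparingInt((String s) -> input.indexOf(s.trim())));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("dogs", 1);
        freq.put("cats", 2);
        freq.put("purr", 1);
        System.out.println(getMostFrequentWords(2, freq)); // prints [cats, dogs]

        Set<String> picked = new LinkedHashSet<>();
        picked.add("Dogs bark");
        picked.add("Cats sleep a lot");
        // prints [Cats sleep a lot, Dogs bark]
        System.out.println(reorderSentences(picked, "Cats sleep a lot. Dogs bark. Cats also purr"));
    }
}
```

One design point worth noting: ordering by indexOf assumes each selected sentence occurs only once in the input. A sentence that is repeated verbatim would always be placed at its first occurrence.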