1. All classifiers (BayesianClassifier and VectorClassifier)
inherit from AbstractCategorizedTrainableClassifier
IClassifier
|__ICategorisedClassifier (whether a category is supplied)
|__ITrainableClassifier (whether the classifier can be trained)
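2. Several important fields
IWordsDataSource wordsData; // storage backend for per-word records such as category and frequency
ITokenizer tokenizer; // splits text into a collection of words
IStopWordProvider stopWordProvider; // the stop word list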
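To make the roles of tokenizer and stopWordProvider concrete, here is a minimal sketch of the preprocessing they perform together. The class name TokenizeSketch, the splitting rule, and the stop word list are all assumptions made for illustration; they are not the library's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the ITokenizer + IStopWordProvider pair (not library code).
final class TokenizeSketch {
    // Assumed stop word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    // Lowercase the input, split on runs of non-word characters, drop stop words.
    static List<String> tokenize(String input) {
        List<String> words = new ArrayList<>();
        for (String token : input.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [price, cheap, watch, low]
        System.out.println(tokenize("The price of a cheap watch is low"));
    }
}
```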
One wordsData implementation, JDBCWordsDataSource, is a JDBC-based data source; its database table is defined as follows:
CREATE TABLE word_probability (
word VARCHAR(255) NOT NULL,
category VARCHAR(20) NOT NULL,
match_count INT DEFAULT 0 NOT NULL,
nonmatch_count INT DEFAULT 0 NOT NULL,
PRIMARY KEY(word, category)
)
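The table keeps one row per (word, category) pair with two counters. As a rough illustration of that layout, the sketch below holds the same data in memory. WordProbabilityTableSketch and its method names are hypothetical and are not part of JDBCWordsDataSource.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory analogue of the word_probability table (not library code).
final class WordProbabilityTableSketch {
    // Key = (word, category), mirroring PRIMARY KEY(word, category).
    // Value = {match_count, nonmatch_count}, both starting at 0 like the column defaults.
    private final Map<List<String>, int[]> rows = new HashMap<>();

    private int[] row(String word, String category) {
        return rows.computeIfAbsent(Arrays.asList(word, category), k -> new int[2]);
    }

    void addMatch(String word, String category) { row(word, category)[0]++; }

    void addNonMatch(String word, String category) { row(word, category)[1]++; }

    int matchCount(String word, String category) {
        int[] r = rows.get(Arrays.asList(word, category));
        return r == null ? 0 : r[0];
    }

    int nonMatchCount(String word, String category) {
        int[] r = rows.get(Arrays.asList(word, category));
        return r == null ? 0 : r[1];
    }

    public static void main(String[] args) {
        WordProbabilityTableSketch table = new WordProbabilityTableSketch();
        table.addMatch("cheap", "spam");
        table.addMatch("cheap", "spam");
        table.addNonMatch("cheap", "spam");
        // prints 2 / 1
        System.out.println(table.matchCount("cheap", "spam")
                + " / " + table.nonMatchCount("cheap", "spam"));
    }
}
```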
3. Several important methods
3.1 classify
// Determines the probability that a string matches the criteria for a given category
public double classify(String category, String input) throws WordsDataSourceException {
    // First, the tokenizer splits the input text into a collection of words (words)
    // Then wordsData is queried for the word frequencies under the given category
    WordProbability[] wps = calcWordsProbability(category, words);
    // Compute the overall probability, normalise it, and return the match value
    return normaliseSignificance(calculateOverallProbability(wps));
}
In calculateOverallProbability(wps), each individual wordProb is computed as follows:
wordProb = matchCnt / (matchCnt + nonmatchCnt)
and overallProb is computed as:
overallProb = (wordProb[0] * wordProb[1] * ......) /
((wordProb[0] * wordProb[1] * ......) + ((1 - wordProb[0]) * (1 - wordProb[1]) * ......))
I don't see why this amounts to an implementation of the Bayes method :(
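One way to read the formula, under three assumptions that these notes do not confirm for the library: each wordProb is treated as P(category | word_i), the words are independent given the category, and the prior is P(category) = P(not category) = 0.5. Bayes' rule gives P(category | words) = P(category) * prod P(word_i | category) divided by the same term plus P(not category) * prod P(word_i | not category). Substituting P(word_i | category) = wordProb[i] * P(word_i) / P(category), and the matching expression with (1 - wordProb[i]) for the other side, the P(word_i) factors and the equal priors cancel, leaving exactly the ratio above. The sketch below computes both formulas. The class name, the neutral value 0.5 for unseen words, and the 0.01/0.99 clamp (used only to avoid a 0/0 division) are assumptions.

```java
// Hypothetical sketch of the two formulas above (not library code).
final class BayesCombineSketch {
    // Assumed bounds that keep a single word from forcing the result to exactly 0 or 1.
    static final double LOWER = 0.01;
    static final double UPPER = 0.99;

    // wordProb = matchCnt / (matchCnt + nonmatchCnt), clamped to [LOWER, UPPER].
    static double wordProbability(int matchCnt, int nonmatchCnt) {
        int total = matchCnt + nonmatchCnt;
        if (total == 0) {
            return 0.5; // assumed neutral value for a word that was never seen
        }
        double p = (double) matchCnt / total;
        return Math.max(LOWER, Math.min(UPPER, p));
    }

    // overallProb = prod(p_i) / (prod(p_i) + prod(1 - p_i)).
    static double overallProbability(double[] wordProbs) {
        double match = 1.0;
        double nonMatch = 1.0;
        for (double p : wordProbs) {
            match *= p;
            nonMatch *= (1.0 - p);
        }
        return match / (match + nonMatch);
    }

    public static void main(String[] args) {
        double[] probs = { wordProbability(3, 1), wordProbability(3, 2) }; // 0.75 and 0.6
        // 0.45 / (0.45 + 0.10), roughly 0.818
        System.out.println(overallProbability(probs));
    }
}
```

Note that two neutral words (0.5 and 0.5) combine to 0.5, while several words that each lean the same way push the result toward 0 or 1 faster than any single word does.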
3.2 isMatch
// Determines whether a string matches the criteria for a given category
protected boolean isMatch(String category, String input) throws WordsDataSourceException {
    // sth. omitted ......
    // It appears to first obtain the match value via classify, then compare it with cutoff to decide whether the input matches
    // So cutoff appears to act as the threshold value here
    double matchProbability = classify(category, input);
    return (matchProbability >= cutoff);
}
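The comparison in isMatch can be isolated as a small sketch. CutoffSketch is a hypothetical name and the value 0.9 is only an example; these notes do not state the library's actual cutoff.

```java
// Hypothetical sketch of the threshold comparison in isMatch (not library code).
final class CutoffSketch {
    // A match probability equal to the cutoff counts as a match, because the comparison is >=.
    static boolean isMatch(double matchProbability, double cutoff) {
        return matchProbability >= cutoff;
    }

    public static void main(String[] args) {
        double cutoff = 0.9; // example value, an assumption
        System.out.println(isMatch(0.95, cutoff)); // prints true
        System.out.println(isMatch(0.50, cutoff)); // prints false
    }
}
```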
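3.3 teachMatch
// Trains the classifier by indicating that an input text belongs to this category
protected void teachMatch(String category, String input) throws WordsDataSourceException {
    // Adds the data directly to wordsData
}
3.4 teachNonMatch
// Trains the classifier by indicating that an input text does not belong to this category
protected void teachNonMatch(String category, String input) throws WordsDataSourceException {
    // Adds the data directly to wordsData
}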
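As a rough picture of what "adding data to wordsData" could mean for training, the sketch below tracks a single category and bumps a per-word counter on each teach call. TeachSketch, its simplified single-argument methods, and the whitespace-and-punctuation splitting are assumptions, not the library's behaviour.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical single-category training sketch (not library code).
final class TeachSketch {
    // word -> {matchCnt, nonmatchCnt} for the one category being trained.
    private final Map<String, int[]> counts = new HashMap<>();

    // Increment counter `index` (0 = match, 1 = non-match) for every word in the input.
    private void teach(String input, int index) {
        for (String word : input.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                counts.computeIfAbsent(word, k -> new int[2])[index]++;
            }
        }
    }

    void teachMatch(String input) { teach(input, 0); }

    void teachNonMatch(String input) { teach(input, 1); }

    int matchCount(String word) {
        int[] c = counts.get(word);
        return c == null ? 0 : c[0];
    }

    int nonMatchCount(String word) {
        int[] c = counts.get(word);
        return c == null ? 0 : c[1];
    }

    public static void main(String[] args) {
        TeachSketch spam = new TeachSketch();
        spam.teachMatch("cheap watch");
        spam.teachMatch("cheap pills");
        spam.teachNonMatch("cheap flight");
        // prints 2 / 1
        System.out.println(spam.matchCount("cheap") + " / " + spam.nonMatchCount("cheap"));
    }
}
```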