Table of Contents

    24 June 2006
    Volume 41 Issue 3
    The Shark-Search algorithm based on clustering links
    SU Qi,XIANG Kun and SUN Bin
    J4. 2006, 41(3):  1-04 .  doi:
    Based on an analysis of the focused-crawling algorithm Shark-Search, an improved Shark-Search algorithm with link clustering is proposed. The new algorithm is validated by several comparative experiments. The results show that it identifies the relevance between a link and the focused topic more effectively.
    Portrait retrieval based on news environment
    WANG Tai-feng,YUAN Ping-bo,JIA Ji-min,YU Meng-hai
    J4. 2006, 41(3):  5-10 .  doi:
    A novel solution for personal portrait search is provided, based on the content of news reports. First, named entities (NE) and associated keywords are extracted from the news. Then queries formed from the NEs are issued to a general image search engine to get candidate images. Finally, two filtering steps are applied to get the optimal portrait: filtering with the keywords associated with the NE, to keep images actually relevant to the person in the news, and face detection, to select the optimal portrait. Preliminary experiments show that this solution works well for retrieving a portrait of a person in the news. Moreover, displaying portraits beside the news content helps readers read news more efficiently and focus more easily on the news that interests them.
    Applying fuzzy cluster algorithm to Web information retrieval
    GAO Xiang,WANG Min
    J4. 2006, 41(3):  11-12 .  doi:
    The Internet has been developing rapidly in recent years, and the amount of information on it is growing greatly. How to efficiently discover high-quality information on the Web is an important research issue. This paper discusses the creation of a system that uses a fuzzy clustering algorithm to search Web documents; the method is evaluated by experiments over HTML files from a Web site. The experimental results, as described at the end of Section 4, show preliminarily that the fuzzy-logic-based clustering algorithm can enhance search quality efficiently.
    An approach to generate Boolean query in question and answering retrieval system
    HE Jing
    J4. 2006, 41(3):  13-17 .  doi:
    A novel approach is proposed to generate Boolean queries using surface grammar information. The process generates an initial query first, and then iteratively adjusts the query according to the retrieved results until a suitable query is obtained. Experiments on the TREC 2004/2005 QA track data sets show that the approach greatly improves precision, coverage and redundancy without much increase in complexity (2.5 iterations on average).
    Locality preserving kernel method and its application
    WAN Hai-ping,HE Hua-can,ZHOU Yan-quan
    J4. 2006, 41(3):  18-20 .  doi:
    Kernel methods are now a powerful alternative in many machine learning tasks. Practice shows that they achieve better performance if domain knowledge can be incorporated. The relationship between global and local information when processing data is discussed. A text categorization test is conducted to compare our method with several other classification methods, and ours outperforms them.
    Using text blocks based on multiple templates hidden Markov model for text information extraction
    WANG Lei,CHEN Zhi-ping,LI Zhi-cheng
    J4. 2006, 41(3):  19-24 .  doi:
    Because varied training data sources are unfavorable for learning optimal model parameters, a novel text information extraction algorithm based on a hidden Markov model with multiple templates is proposed. It uses format information and list separators to segment the text, and then extracts text information by combining the emission (releasing) probability parameters obtained from universal training with the initial and transition probability parameters trained using multiple form templates. Experimental results show better precision and recall than a simple hidden Markov model.
    Research on semanticbased Web data search engine
    SHI Yi-yi
    J4. 2006, 41(3):  23-29 .  doi:
    Massive heterogeneous and unstructured Web data makes it difficult to find the necessary information. Using Web service technology, this article proposes a method of building a metadata registry (MDR) and an ontology management system (OMS) based on the semantics of metadata and ontology. By producing a distributed closure and then implementing a search engine (SBWSE) using the ontology representation language RDF, the method helps achieve highly efficient access to distributed heterogeneous data and addresses the problem of incomplete and incorrect search results.
    Chinese question answering systemoriented Chinese parsing
    ZHANG Liang,WANG Hai-mei,HUANG He-yan,ZHANG Xiao-fei
    J4. 2006, 41(3):  30-33 .  doi:
    Chinese syntactic parsing is a key part of Chinese language study and Chinese information processing, as well as a difficult one. The characteristics of Chinese question sentences are discussed. In Chinese parsing oriented to Chinese question answering systems, corpus-based syntactic processes are applied. Features of Chinese question sentences, such as their short length, question words and question structures, are employed. Preliminary experimental results show that the performance can achieve the design goal.
    An improved KNN classification algorithm based on particle swarm optimization
    ZHANG Guo-ying,SHA Yun,JIANG Hui-na
    J4. 2006, 41(3):  34-36 .  doi:
    An efficient algorithm, PSOKNN, is proposed to reduce the computational complexity of the KNN text classification algorithm. It is based on particle swarm optimization, whose random and directed global search ability is used to search within the training document set. While searching for the k nearest neighbors of a test sample, the particle swarm moves in jumps, and document vectors that cannot be among the k closest vectors are discarded quickly. In classifying Reuters-21578, the accuracy of PSOKNN is the same as that of KNN, and PSOKNN reduces the classification cost by approximately 70% relative to KNN.
    The advertisement spam image filtering method based on image content
    XU Yang-yang,YUAN Hua
    J4. 2006, 41(3):  37-41 .  doi:
    In recent years, the problem of spam email has become more and more serious. To evade text-based spam filters, spammers embed text in images, so more and more spam images appear in emails. To solve this problem, an advertisement spam image filtering method based on image content is proposed. First, the text areas are extracted from the image, and then the features of these text areas are used to filter the spam images. The experimental results show that the method is effective.
    New word identification based on large-scale corpus
    SHI Shui-cai,YU Hong-kui,LV Xue-qiang,LI Yu-qin
    J4. 2006, 41(3):  42-45 .  doi:
    String frequency statistics, substring reduction and several filtering methods are used in a Chinese new-word mining system, which identifies new words using character, word and N-gram dictionaries derived from statistics over a large-scale corpus. With a system based on these methods, new words can be identified without limits on length or domain.
    Naive Bayes Chinese text classification based on core words of class
    YUAN Fang,YUAN Jun-ying
    J4. 2006, 41(3):  46-49 .  doi:
    From the viewpoint of manual classification, words in the title, abstract and keywords are more important than others. A classification model based on the core words of a class is put forward. The model extracts core words from the title, abstract and keywords, and strengthens their effect through weighting. Experiments with Naive Bayes classification indicate that this method can effectively improve the classification precision of Chinese texts.
    Research on email filtering by the frequency of the terms in character fields
    LIU Hui,MA Jun,LEI Jing-sheng,LIAN Li
    J4. 2006, 41(3):  50-53 .  doi:
    A novel method for Email filtering is proposed based on the information of character fields and the frequency of the terms in the character fields. The techniques used in the method are discussed, which include selecting the characters of text documents, the constructing the character lexicons as well as the computation of the weights of the term frequency (TF). In addition, an improved probabilistic model for the computation of the similarity of among text documents is provided. Experiments show that the new method is better than traditional Rocchio method in terms of recall, precision and some other evaluation targets.
    Clustering Web users based on interest similarity
    ZHANG Wen-dong,YI Yi-hu
    J4. 2006, 41(3):  54-57 .  doi:
    Clustering users by their browsing interests is a key aspect of Web mining. Page content and browsing path are taken into account in measuring browsing interest. A new method is then put forward to calculate interest similarity, and users are clustered using a transitive closure algorithm. Experimental results show that the new method can efficiently improve clustering precision.
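    The abstract does not give the similarity measure or the closure procedure. As a minimal sketch of one common reading, the example below assumes a precomputed symmetric similarity matrix and a hypothetical threshold: users whose similarity reaches the threshold are related, and taking the transitive closure of that relation yields the clusters (equivalently, the connected components of the thresholded graph). The function name, matrix values and threshold are illustrative assumptions, not taken from the paper.

```python
def cluster_by_transitive_closure(sim, threshold):
    """Group indices linked by the transitive closure of the relation
    sim[i][j] >= threshold (i.e. connected components of that graph)."""
    n = len(sim)
    parent = list(range(n))  # union-find forest, one node per user

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri  # merge the two groups

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical pairwise interest similarities for four users.
similarity = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.7, 0.1],
    [0.1, 0.7, 1.0, 0.2],
    [0.0, 0.1, 0.2, 1.0],
]
clusters = cluster_by_transitive_closure(similarity, threshold=0.6)
print(clusters)
```

    Users 0 and 2 are not directly similar at this threshold, but both are similar to user 1, so the closure places all three in one cluster; this chaining effect is the main design consequence of a transitive closure approach.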
    Dimensionality reduction based on spectral graph and its application
    WAN Hai-ping,HE Hua-can
    J4. 2006, 41(3):  58-60 .  doi:
    Most machine learning tasks face the problem of dimensionality reduction, both to extract meaningful features and for processing convenience. In these topological spaces, Euclidean distance is usually adopted to measure similarity between objects. It is argued that in many learning tasks the path from one object to another is also a proper alternative. The relationship between local and global information when selecting features is also discussed. A dimensionality reduction method incorporating both path and distance features is proposed based on spectral graph theory; it aims to preserve locally meaningful neighborhood structures in the original data. In experiments on both face recognition and information retrieval, it achieves positive results.
    An algorithm to cluster the search results based on the association rules
    SONG Chun-fang,SHI Bing
    J4. 2006, 41(3):  61-65 .  doi:
    A method is offered to cluster search results. It uses association rules to obtain the salient phrases in the Web documents, and each salient phrase represents a corresponding cluster. The documents in a cluster are related to its salient phrase. The clusters are then ranked according to the score assigned to each of them, and finally the results are displayed to the users.
    AA SVM multiclassifier based on the weighted threshold strategy
    CAO Hong,DONG Shou-bin,ZHANG Ling
    J4. 2006, 41(3):  66-69 .  doi:
    algorithm named OVAWWT are presented to improve the equitableness and the precision of classifiers, a multiclassifier of SVMlight named MSVMlight based on the OVAWWT strategy is implemented, and two experiments on CWT100G data set are constructed, one to compare MSVMlight with other classifiers, and the other to compare WRCut strategy with RCut strategy. The results show that compared with other classifiers MSVMlight has a higher precision rate and shorter training time and that OVAWWT algorithm can improve the precision rate of OVA.
    A hybrid classifier based on rough sets and BP neural networks
    BAI Ru-jiang,WANG Xiao-yue
    J4. 2006, 41(3):  70-75 .  doi:
    A hybrid classifier is presented based on the combination of rough set theory and a BP neural network. First, the documents are represented using the vector space model. Second, the feature vectors are reduced using rough sets. Finally, the documents are classified by the BP neural network. Experimental results show that the RoughANN algorithm is effective for text classification and performs better in classification precision, stability and fault tolerance than traditional classification methods (Bayesian classifiers, SVM and kNN), especially for complex classification problems with many feature vectors.
    Entry page search algorithm based on URL-type prior probabilities
    HU Jungang,DONG Shou-bin,CHEN Xiao-zhi,ZHANG Yuan-feng
    J4. 2006, 41(3):  76-80 .  doi:
    Entry page (home page) retrieval aims to retrieve just one correct document, and the queries are usually short Web page names. As a result, finding an entry page with high initial precision is quite difficult. Following the unigram language model, the authors extract the fields of Web page content that are useful for finding Chinese entry pages for baseline retrieval, and then build a new model combining content fields with non-content features of Web pages (e.g. the URL-type prior, shown to have the strongest predictive power). According to the prior probabilities of URL types, the relationship between an entry page and its subpages is discovered. Based on this relationship, a new algorithm is proposed in which the entry page is extracted from relevant subpages (PERS). Finally, the results are reranked, and PERS achieves a large improvement in entry page retrieval performance.
    The study on automatic classification of digital documents of scientific papers
    LI Sen,MA Jun,ZHAO Yan,LEI Jing-sheng
    J4. 2006, 41(3):  81-84 .  doi:
    Since scientific papers are usually semi-structured documents, a hierarchical classification model based on the metadata of scientific papers is proposed, where the metadata include titles, keyword sets, abstracts and so on. Experiments show that the precision of classification based on paper metadata is close to that of classification based on the full text. Furthermore, the classification precision is better than that of the best known classification algorithm if the papers are classified according to a taxonomy of application domains as follows: first, the metadata are used to classify papers roughly at the higher levels of the taxonomy; then the full texts are used to classify these papers at the lower levels. Since the metadata are smaller than the full text and the number of papers classified into a subclass is smaller than the total number of papers, the new model improves the efficiency of paper classification when the number of classes is large and the documents are evenly distributed over the given taxonomy.
    Research on standard sample generation system for email filtering
    XU Xuan,DING Wei
    J4. 2006, 41(3):  85-89 .  doi:
    For lack of a standard Chinese mail dataset, the performance of various spam filter systems cannot be evaluated. The problem of standard sample generation is studied further by analyzing the problems in collecting email samples. The design of a standard sample generation system for use in a real environment is also given. A standard email dataset for evaluating email filter systems is provided, and it will eventually be developed into a base corpus for email filtering techniques.
    Ontology based on focused crawler
    ZHENG Jian-zhen,LIN Kun-hui,ZHOU Chang-le,KANG Kai
    J4. 2006, 41(3):  90-94 .  doi:
    A focused crawler can fetch large quantities of domain resources from the Web in a short time. It is very helpful to both focused search engines and data mining companies. To overcome the deficiency of the keyword-based topic filtering strategy widely used nowadays, the paper proposes a concept-based topic filtering strategy derived from the idea of concept congregation. The paper also proposes an authority-modified weight calculation formula based on the differing importance of Web page information. In this way, real-time concept-based Web page filtering can be achieved. To further improve the focused crawler's efficiency, the paper also proposes a link prediction algorithm. Finally, comparative experiments show that the strategies proposed in this paper are practical.
    Chinese Web page feature selection method based on sequential data mining
    GU Feng,LIU Chen-xi,WU Yangyang
    J4. 2006, 41(3):  95-99 .  doi:
    A method is proposed to select feature candidates from Chinese websites on the basis of sequential data mining, and it is used in a Chinese website classification model. The method uses an improved PAT tree data structure to mine the frequent strings in the same class of Chinese websites, calculates the net frequency, mines frequent meaningful words, phrases and English words from Chinese websites, and obtains text features with the help of the CHI algorithm. Experiments show that the algorithm not only mines most of the features selected by the traditional algorithm, but also mines some new meaningful personal names, place names, new words, phrases and foreign words.
    A question answering system based on question pattern match
    XIAN Jian,MO Xuan-lang and XI Jiang-qing
    J4. 2006, 41(3):  100-103 .  doi:
    A question answering (QA) system stores questions and answers in a database and organizes them. The system can answer students' questions automatically by means of semantic understanding, a technology for understanding natural language. The design and implementation of a QA system are presented. For question analysis, the system adopts a two-sided maximum matching method, intelligent question pattern matching and semantic analysis based on the dependency parse tree. The high accuracy and answer recall rates obtained in our tests show that the system is effective.
    An optimized method for minimal cubing approach
    MA Jia-sai,ZHANG Yong-jun
    J4. 2006, 41(3):  104-107 .  doi:
    OLAP operations on highdimensional data set can be done by using minimal cubing approach. The high dimensions can be partitioned properly to accelerate querying analysis by studying the historical records of OLAP operations. By doing this the efficiency of OLAP operations can be improved with the similar space complexity.
    The study of Chinese Webpage classification based on block importance
    DUAN Xin,MA Jun,SONG Ling
    J4. 2006, 41(3):  108-111 .  doi:
    Webpage classification is more difficult than that for puretext documents because of noisy information in Web pages. A Web page can be segmented into multiple blocks and the importance of blocks in a Web page for classification is not equivalent, which can be utilized to improve the quality of Webpage classification. Several revalent methods for blocksegmentation in a Web page are introduced, and then it is validated that the method for Chinese Webpage classification based on block importance is better than the one for traditional methods.
    Chunk parsing for sentences based on SVM
    IN Yu-ming,LI You
    J4. 2006, 41(3):  112-115 .  doi:
    The system uses a statisticalbased model based on SVM(Support Vector Machine) to recognize chunks from Chinese sentences. The SVM models use files which have been marked chunks by hand. Through selecting chunks' characteristic parameter and multiclass SVM models, the system finishes chunking. The algorithm and the results are given, and the model's characteristic is analyzed.
    Rough sets information retrieval model based on mutual information
    FU Xue-feng,LIU Qiu-yun,WANG Ming-wen
    J4. 2006, 41(3):  116-119 .  doi:
    In information retrieval, polysemy and synonymy lead to uncertainty, which reduces retrieval effectiveness. A model based on mutual information is proposed, in which the uncertainty is captured by rough sets. First, the mutual information between the words of the training corpus is computed, and then the mutual information is used to build an equivalence relation through fuzzy clustering. An information retrieval model based on the upper and lower approximations of rough sets is proposed and implemented in light of the equivalence relation. Experiments show that the model improves information retrieval.
    A user interest modelbased personalized information
    ZHANG Yu,YUAN Fang
    J4. 2006, 41(3):  120-125 .  doi:
    At present, information search tools are mostly designed around the needs of all users rather than personal interests, which leads to lower precision. A personalized search method based on user interests is proposed, which organizes the user's interests into a tree structure using ODP. It provides search results matching the user's interests when the user issues a query. It also takes into account a human physiological characteristic, the attenuation of human memory over time, so that the user profile is updated as time passes. The experimental results suggest that the method can achieve personalized recommendation based on the user's interests, and can therefore better fulfill the needs of users.
    Information retrieval model based on Markov Network
    CAO Ying,WANG Ming-wen,TAO Hong-liang
    J4. 2006, 41(3):  126-130 .  doi:
    A novel method based on a Markov network is proposed and implemented, which constructs the Markov network from the co-occurrence information of the terms in the collection. The relevance probability between a document and a query is computed by adding additional evidence sources to the model from the term-space Markov network. Experimental results show that the Markov network method performs more effectively than other methods.
    Structure and content-based extraction of topical information from Web pages
    WU Peng-fei,MENG Xiang-zeng,LIU Jun-xiao,MA Feng-juan
    J4. 2006, 41(3):  131-134 .  doi:
    Combining a Web page's internal features and external structural layout, a mapping table is suggested to transform the view of the Web page. Based on structural and heuristic rules for Web page segmentation and identification, and on the vector space model for Web content analysis, the approach accurately obtains the topical content of the Web page with high semantic cohesiveness. Experimental results show that this method is well suited to topical information extraction from Web pages with complex structure.
    Web news retrieval based on split vector space model
    WANG Wei-dong,SONG Dan,SONG Ren-jie
    J4. 2006, 41(3):  135-138 .  doi:
    Based on an analysis of the deficiencies of the traditional vector space retrieval model, a Web news retrieval approach based on a split vector space model is presented. Instead of using a single term vector as the event representation, the terms are split into four semantic classes (names, temporal expressions, spatial terms and contents) according to their semantic differences, forming four vector spaces, and the classes are processed and weighted separately. Temporal expressions are formalized and spatial terms are augmented with geographic information, and these data are used in retrieval. The approach is supported by experiments.
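    The abstract does not state how the four class vectors are compared or combined. As a rough sketch under stated assumptions, the example below scores two documents by taking a cosine similarity within each semantic class and summing the results with per-class weights. The raw term-count vectors, the weight values and the sample terms are all hypothetical, and the formalization of temporal expressions and the geographic augmentation mentioned in the abstract are not modeled.

```python
import math
from collections import Counter

# The four semantic classes named in the abstract.
CLASSES = ("names", "temporal", "spatial", "contents")


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_similarity(doc_a, doc_b, weights):
    """Weighted sum of per-class cosine similarities.

    Each document is a dict mapping a class name to its list of terms;
    a missing class is treated as an empty vector.
    """
    return sum(
        weights[c] * cosine(Counter(doc_a.get(c, [])), Counter(doc_b.get(c, [])))
        for c in CLASSES
    )


# Hypothetical weights and documents, for illustration only.
weights = {"names": 0.3, "temporal": 0.2, "spatial": 0.2, "contents": 0.3}
query = {
    "names": ["smith"],
    "temporal": ["2006-06"],
    "spatial": ["jinan"],
    "contents": ["flood", "rescue"],
}
story = {
    "names": ["smith"],
    "temporal": ["2006-07"],
    "spatial": ["jinan"],
    "contents": ["flood", "damage"],
}
score = split_similarity(query, story, weights)
print(round(score, 3))
```

    Keeping the classes separate means a shared person or place name contributes through its own weight rather than being diluted among ordinary content terms, which is the motivation the abstract gives for splitting the vector.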
    Automatic text classification model based on random forest
    ZHANG Wei-hua,WANG Ming-wen,GAN Li-xin
    J4. 2006, 41(3):  139-143 .  doi:
    With the rapid development of the World Wide Web, text classification has become a key technology for organizing and processing large amounts of document data. A random forest is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error of a forest converges to a limit as the number of trees becomes large. In experiments, random forests are compared with C4.5, KNN, SMO and SVM; the results show that their performance is higher than that of C4.5 and comparable with that of KNN, SMO and SVM. Random forests are a promising technique for text categorization.
    Class information feature selection method for text classification
    YU Jun-ying,WANG Ming-wen,SHENG Jun
    J4. 2006, 41(3):  144-148 .  doi:
    With the explosion of Web documents, text classification becomes more important in information retrieval applications. Because of the high dimensionality, it is very difficult to estimate the statistical characteristics of samples. This leads to overfitting ("over study") and reduces classifiers' performance, so feature selection and extraction before analysis are necessary. A class information feature selection method is proposed, in which the class information of the training documents is taken into account while keeping as much document information as possible. Experiments show that this method performs well and is consistently better than OCFS and CHI on macro-averaged F1.