JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE)

A survey of web spam and anti-spam techniques

LI Zhi-chao, YU Hui-jia, LIU Yi-qun, MA Shao-ping

J4. 2011, 46(5): 1-8.

With the growth of Web information, search engines have become the principal means of information retrieval, and whether a page is accessed is largely determined by its ranking in search results. Some sites boost their page rankings not by improving page quality but by exploiting the characteristics of search engines to deceive them; this practice is called Web spam. Web spam is one of the major challenges facing search engines. This survey introduces common forms of Web spam and presents effective anti-spam techniques.

A survey of collaborative Web search

SUN Jing-yu, CHEN Jun-jie, YU Xue-li, LI Xian-hua

J4. 2011, 46(5): 9-15.

Research on collaborative Web search is still at an exploratory stage, and its research questions and directions are not yet clear. This review explains how collaborative Web search arose and how it can be classified. Current theories and practice of collaborative Web search systems are then summarized. Finally, the problems and shortcomings of current research are pointed out and future research directions are proposed.

A survey of product reviews mining

XI Ya-hui, ZHANG Ming, YUAN Fang, WANG Yu

J4. 2011, 46(5): 16-23.

The basic concept and general framework of product reviews mining are described, and the task is divided into four subtasks. Recent research findings on each subtask by scholars at home and abroad are systematically introduced, and future research directions in the field are pointed out.

Algorithm for dynamic maintenance of the index library for a distributed search engine

ZENG Jian-ping, WU Cheng-rong, GONG Ling-hui

J4. 2011, 46(5): 24-27.

Distributed search engines serve numerous users while their indexes are updated frequently, which can delay responses to user requests. An algorithm for dynamic index library maintenance based on an in-memory map table is proposed. The index library is organized according to a configured time granularity. By maintaining a table of available index lists in memory, index updates can be performed efficiently. Experiments show that the proposed algorithm can index new data in a timely manner while decreasing the response time of user queries.

A collaborative filtering recommendation mechanism based on user profile in unstructured P2P networks

LIU Jian, YIN Chun-xia, YUAN Fu-yong

J4. 2011, 46(5): 28-33.

Collaborative filtering is one of the most successful technologies applied in information recommender systems. However, as the number of users and the amount of information to be filtered increase, the computational complexity of these systems grows quickly, and most centralized recommender systems face a scalability problem. To solve this problem, a distributed collaborative filtering recommendation mechanism with an unstructured P2P architecture is proposed. In this mechanism, the content of a resource is represented by a vector built with the lexical chain method, and the user profile is then represented by a set of preferred resources. In addition, as the user's interests change, the proposed mechanism dynamically re-forms the set of neighbor peers to provide real-time personalized recommendations.

Deep directional collection of Web data

XIA Tian

J4. 2011, 46(5): 34-38.

Based on human Web surfing behavior, crawling directions are restricted by extracted crawling subpages, and the associations within cross-page compound objects are realized through property inheritance between crawled data items. A generalized crawl process supporting deep directional collection is then designed and implemented. Experimental results on hot posts from the Tianya site show that this method can collect complicated objects without changing the main procedure, and has high collection efficiency.

A privacy protection method based on a trust and information flow model

GAO Feng, HE Jing-sha

J4. 2011, 46(5): 39-43.

Existing trust-based privacy protection methods simply map trust ranks to privacy sensitivity grades, which cannot express the dynamic and context-dependent characteristics of trust and privacy information. To solve this problem, a trust-based information flow model is proposed, and analysis shows that the model is reasonable and secure. Using this model and controlling the access granularity of privacy information, a privacy protection method is proposed that uses trust to protect privacy effectively and safely.

The search result diversification approach based on the HITS algorithm

CHEN Fei, ZHANG Min, LIU Yi-qun, MA Shao-ping

J4. 2011, 46(5): 44-48.

To address the problem that users' diverse needs cannot be precisely obtained, or that the documents provided cannot cover all aspects of the needs behind a specific query, a new method is proposed based on the link-parsing feature of the HITS algorithm. In this method, the possibility is calculated directly from the diversity of the documents in the search result list for a query, and the result list is then reranked based on this value. Experimental results on TREC's large-scale data collections verify that the method is effective.
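
For readers unfamiliar with HITS, the following is a minimal sketch of the standard hub/authority iteration only; it is not the authors' diversification or reranking method, which the abstract does not specify. The function name `hits`, the `links` dictionary format, and the iteration count are assumptions made for illustration.

```python
import math


def hits(links, iterations=50):
    """Standard HITS iteration (illustrative sketch).

    links maps each page to the list of pages it links to.
    Returns (hub, authority) score dictionaries with unit L2 norm.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical three-page graph: "a" and "b" both link to "c".
hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
```

In this toy graph, "c" receives all the authority weight while "a" and "b" share the hub weight equally.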

Query expansion method based on the user interest and term relation

XU Jian-min, CHEN Zhen-ya, CUI Yan

J4. 2011, 46(5): 49-53.

The traditional query expansion method cannot retrieve information according to the requirements of different users, so a new query expansion method based on user interest and term relations is proposed. A series of terms reflecting users' interests is obtained by mining Web log records. The weights of the original query terms are adjusted according to the synonym relatedness between these terms and the original words, and the original queries are then expanded using the ontology relevance between these terms and the original words. Experimental results show that the approach improves both the accuracy ratio and the recall compared with the traditional query expansion method.

Extended information retrieval model based on the Markov network cliques

SHI Song, WANG Ming-wen, TU Wei, HE Shi-zhu

J4. 2011, 46(5): 54-57.

Query expansion based on a global analysis model is a common and effective approach to improving information retrieval performance. First, a Markov network model is built by calculating the similarity between terms. Second, the description of the relationships between candidate terms is strengthened, and the clique structure is extracted from the Markov network. Finally, the candidate terms and query terms in the clique structure are merged for query expansion. Experimental results show that query expansion based on the Markov random walk matrix performs better than query expansion based on the similarity matrix, and that query expansion based on the clique extraction method performs better than query expansion based on the general extraction method.

Research on large-scale text hierarchies combining relevant category information

HE Shi-zhu, WANG Ming-wen, ZHOU Jun-jun, SHI Song

J4. 2011, 46(5): 58-62.

The deep classification model is an effective paradigm for solving large-scale classification problems. An improved model based on this paradigm is proposed. First, a new method is used to evaluate the effectiveness of the search stage independently. Second, category and document information are used together to select category candidates. Finally, a Rocchio classifier is trained on the class centroids, and the information of related categories is used at the same time to determine the final category. Experiments on the ODP corpus show that the proposed approach outperforms other recent methods.
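
The centroid-based classification step mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' model: it uses plain class centroids and cosine similarity, omits the negative-prototype term of the classic Rocchio formula, and ignores the related-category information described in the abstract. All function names and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(docs):
    """docs: list of (category, {term: weight}) pairs.

    Returns {category: centroid vector}, where each centroid is the
    average of the term-weight vectors of that category's documents.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for category, vector in docs:
        counts[category] += 1
        for term, weight in vector.items():
            sums[category][term] += weight
    return {
        category: {t: w / counts[category] for t, w in terms.items()}
        for category, terms in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(vector, centroids):
    """Assign the category whose centroid is most similar to the vector."""
    return max(centroids, key=lambda c: cosine(vector, centroids[c]))


# Hypothetical toy training set.
centroids = train_centroids([
    ("sports", {"ball": 1.0, "team": 1.0}),
    ("sports", {"ball": 1.0}),
    ("tech", {"code": 1.0, "bug": 1.0}),
])
```

In a deep classification setting, a classifier of this kind would be trained only on the small set of candidate categories returned by the search stage, which is what keeps the approach tractable for large hierarchies.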

Image retrieval based on salient regions and nonsubsampled Contourlet transform

ZHANG Hui-yun, ZHANG Xin-ming, LI Shuang, GUO Wen-lu

J4. 2011, 46(5): 63-66.

To improve the quality of image retrieval, a new image retrieval method is proposed based on salient-region histograms and the nonsubsampled Contourlet transform. First, the image is divided into a salient region and a background region using salient points, and the histograms of the two regions are computed as color features. Second, the image is decomposed by the nonsubsampled Contourlet transform to obtain texture information. Finally, images are retrieved by combining the color features and the texture information. Experimental results show that the proposed method has better retrieval performance and higher efficiency.

Study on Deep Web pages mining based on Apriori algorithm

LI Gui, HAN Zi-yang, ZHENG Xin-lu, LI Zheng-yu

J4. 2011, 46(5): 67-70.

The maximal frequent association pages in Deep Web sites are recognized using the Apriori algorithm, and the non-maximal frequent association pages are pruned. All the maximal frequent association pages are then obtained by traversing the website. Experimental results on data extraction from several real estate Deep Web sites show that the algorithm is feasible and valid.
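
As a rough illustration of the technique named above, the sketch below runs a textbook Apriori pass and then keeps only the maximal frequent itemsets. It assumes that each transaction is a set of page identifiers seen together (for example, in one navigation path); that modelling choice, the function names, and the sample data are assumptions, since the abstract does not describe how pages are encoded.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset that occurs in at
    least min_support transactions (textbook level-wise Apriori)."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}  # 1-item candidates
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def maximal(frequent):
    """Keep only itemsets with no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical page sets observed in four navigation paths.
paths = [
    {"list", "detail", "map"},
    {"list", "detail"},
    {"list", "detail", "map"},
    {"list", "contact"},
]
frequent = apriori(paths, min_support=2)
```

With a minimum support of 2, the only maximal frequent itemset in this toy data is {list, detail, map}; "contact" is dropped because it appears once.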

Research on spam detection techniques based on clustering

JIANG Sheng-yi, PANG Guan-song, ZHANG Jian-jun

J4. 2011, 46(5): 71-76.

With the surge of email spam, detecting it has become an important and urgent problem. To overcome the defects of kNN spam detection, an improved kNN spam detection approach based on clustering is proposed. First, using the least distance principle, the training email text samples are divided into several hyperspheres of approximately equal radius; the texts in each hypersphere may come from one or more categories. Second, the clusters (hyperspheres) are tagged using a majority voting mechanism, which means that each cluster is tagged with the category that contains the most texts in the cluster, and the detection model consists of the tagged clusters. Finally, email texts are classified with the kNN approach. Experimental results show that the proposed approach can substantially reduce the amount of text similarity computation and performs better than iMBL, Naïve Bayesian, and Stacking. Furthermore, the detection model constructed by the proposed approach can be incrementally updated, which makes it highly feasible in real-world applications.
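
The three steps above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: emails are assumed to be already converted to numeric feature vectors, Euclidean distance stands in for the text similarity measure, and the function names, radius value, and sample data are hypothetical.

```python
import math
from collections import Counter


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_clusters(samples, radius):
    """One-pass clustering by the least distance principle.

    samples: list of (vector, label) pairs. A sample joins its nearest
    cluster if the cluster center lies within `radius`; otherwise it
    starts a new cluster. Each cluster is then tagged by majority vote.
    """
    clusters = []
    for vec, label in samples:
        nearest = min(clusters, key=lambda c: distance(vec, c["center"]),
                      default=None)
        if nearest is not None and distance(vec, nearest["center"]) <= radius:
            nearest["members"].append((vec, label))
            n = len(nearest["members"])
            # Recompute the center as the mean of the member vectors.
            nearest["center"] = [
                sum(m[0][i] for m in nearest["members"]) / n
                for i in range(len(vec))
            ]
        else:
            clusters.append({"center": list(vec), "members": [(vec, label)]})
    for c in clusters:
        c["label"] = Counter(l for _, l in c["members"]).most_common(1)[0][0]
    return clusters


def classify(vec, clusters, k=1):
    """kNN over tagged cluster centers instead of all training samples."""
    nearest = sorted(clusters, key=lambda c: distance(vec, c["center"]))[:k]
    return Counter(c["label"] for c in nearest).most_common(1)[0][0]


# Hypothetical 2-D feature vectors.
training = [
    ((0.0, 0.0), "ham"), ((0.1, 0.0), "ham"), ((0.0, 0.1), "spam"),
    ((5.0, 5.0), "spam"), ((5.1, 5.0), "spam"),
]
model = build_clusters(training, radius=1.0)
```

Because a query is compared only with cluster centers, the number of similarity computations depends on the number of clusters rather than on the number of training emails, and new samples can be folded into the model with the same one-pass step.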

The study of customer churn prediction model for telecom

JIANG Sheng-yi, WANG Lian-xi

J4. 2011, 46(5): 77-81.

The accuracy of existing customer churn prediction models is too low, so a new customer churn prediction model is established by combining clustering analysis with a classification prediction technique based on statistical learning. The results of the model make it possible to distinguish customer groups and the churn propensity of each group. Some customer retention measures are then proposed, which can help telecom enterprises make decisions in customer relationship management.

Privacy preserving approaches for relational multiple sensitive attributes

LI Li, YUAN Fang, XI Ya-hui

J4. 2011, 46(5): 82-85.

Directly applying existing single sensitive attribute privacy protection methods to data with multiple sensitive attributes can disclose private data. First, the idea of using a lossy join to protect private data was adopted, and the records of the tables were clustered to guarantee the division of sensitivity ranks in the tables. Then, the records were grouped according to the frequency comparison strategy, and a cluster-based grouping algorithm for data containing multiple sensitive attributes was proposed. Experimental results indicated that this algorithm could prevent privacy disclosure and strengthen the security of published data.

A dynamic network overlapping communities detecting algorithm based on local betweenness

WANG Li, ZHANG Jing-yang, XU Li-heng

J4. 2011, 46(5): 86-90.

To address the problems of distortion in static networks and high time complexity in dynamic networks, an overlapping community detection algorithm for dynamic networks is proposed. Its central idea is that the edge with the maximum edge betweenness, or the node with the maximum split betweenness, is the key linking edge or node. Taking both the dynamics and the overlap of network communities into account, communities are found by removing the edge with the maximum edge betweenness or splitting the node with the maximum split betweenness. Finally, with modularity used to control the detection process, the resulting community structure becomes more reasonable.

Adaptively species-based multimodal particle swarm optimization

LIU Yu, LV Ming-wei, LI Wei-jia, LI Wen-tao

J4. 2011, 46(5): 91-96.

The adaptively species-based particle swarm optimization (ASPSO) is proposed based on the analysis of particle swarm optimizer (PSO), niching techniques and multimodal particle swarm optimization algorithms. The ASPSO is comprehensively tested and compared with ANPSO and SPSO. Experimental results show that ASPSO has a success rate as high as ANPSO and SPSO in solving low dimensional problems, and has better performance in solving high dimensional and difficult problems than other existing multimodal particle swarm optimization algorithms.

Research of cooperative PSO for attribute reduction optimization

DING Wei-ping, WANG Jian-dong, DUAN Wei-hua, SHI Quan

J4. 2011, 46(5): 97-102.

To address the problem of attribute reduction optimization, an improved cooperative PSO algorithm named AR-CPSO is proposed, building on the optimization advantages of PSO. In searching for minimal attribute sets, the particle swarms improve their optimization ability by splitting reduction vectors into several parts and by learning social cognition from cooperative neighbour clusters in the attribute space. An adaptive reinforcement penalty function method is incorporated into the algorithm to obtain the optimal reduction sets. AR-CPSO maintains the diversity and cooperation of the populations and can escape from local optima. Experimental results show that AR-CPSO has an outstanding ability to find the global optimum and performs better in cooperative attribute reduction.

An improved set description of negative knowledge processing in Fuzzy knowledge

ZHANG Sheng-li, PAN Zheng-hua

J4. 2011, 46(5): 103-109.

Negation in fuzzy knowledge can be divided into three distinct classes: contradictory negation, opposite negation, and medium negation. FScom is a fuzzy set used to differentiate these three negations. On the basis of FScom, the set description of fuzzy knowledge and its different kinds of negation is further investigated, and an improved fuzzy set, IFScom, is proposed. Its characteristics, operations, and related properties are also investigated.

Active semi-supervised nearest neighbour learning

YANG Yang, WANG Li-hong, LIU Qi-cheng

J4. 2011, 46(5): 110-115.

A semi-supervised nearest neighbour classification algorithm is proposed, in which both labeled points and pairwise constraints are employed to determine the labels of data points. To solve the problem that some data points may not be assigned any class label, ratio sorting is designed to reduce the number of conflict points. An active learning strategy based on the Citation-kNN score is developed to search for valuable supervision information and to improve the quality of clustering by querying the label of a point that is incompatible with its neighbours. Experiments show that the learning strategy can improve clustering performance, and a comparison with COP-kmeans and CCL illustrates the efficiency of the active SNN in terms of CRI.

Establishment of the genetic transformation system of soybean and its optimum conditions

JI Dan-dan, WANG Peng, XIANG Feng-ning

J4. 2011, 46(5): 116-122.

An exogenous gene transformation method, in which germinating embryos are co-cultured with Agrobacterium tumefaciens GV3101 under vacuum infiltration, was established in different soybean varieties. Several parameters of Agrobacterium-mediated gene transformation were compared to determine the optimum conditions. The results indicated that the optimal pre-culture time, infection time, and co-culture time were 3 days, 6 hours, and 3 days, respectively; the optimal concentration of Agrobacterium was OD600 = 0.6, and the optimal concentration of acetosyringone was 200 μmol·L⁻¹. Under the optimized conditions, the pCAMBIA1304 vector was transferred by GV3101 into soybean cv. Ludou 11 and Wei 6823, yielding 510 Ludou 11 and 591 Wei 6823 regenerated plants, respectively. Among them, 444 and 462 plants showed resistance on MB+06mg·L⁻¹ IAA medium containing 25 mg·L⁻¹ kanamycin. From the resistant plants, 13 and 15 transgenic plants were identified by GUS and PCR analysis. The transformation efficiencies were 2.2% and 2.5%, respectively.

Analysis of performance for heterogeneous multi-core processors by Hill-Marty's deduction

BIAN Dong, ZENG Ming, ZENG Fan-tai

J4. 2011, 46(5): 123-126.

By using HillMarty′s deduction of speedup ratio(It does not take into the chip resources expended on shared caches, interconnection networks, and memory controllers), the structure of BCE′s distribution is obtained which make the performance of the heterogeneous multicore more optimal than that of the homogeneous multicore. Two cases are dsicussed: one is that the more powerful core in the heterogeneous multicore processors and the core in the homogeneous multicore processors are the same, and the other is that the less powerful core in the heterogeneous multicore processors and the core in the homogeneous multicore processors are the same. Therefore, the best design solutions of the heterogeneous multicore can be provided.
