JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE) ›› 2014, Vol. 49 ›› Issue (12): 30-35.doi: 10.6040/j.issn.1671-9352.3.2014.230

Previous Articles     Next Articles

An variable length code algorithm compression inverted index

MAO Fu-lin, QU You-li   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Received:2014-08-28 Revised:2014-10-27 Online:2014-12-20 Published:2014-12-20

Abstract: The efficiency of text search engines relies on data structure: inverted index. And the more large space is need to storage the inverted index. A new compression algorithm was proposed. For the given document collections. Terrier, the information retrival tool, was used to build inverted index, and the state-of-the-art compression techniques was used to compress inverted file. Then the compress ratio was confirmed by comparing the file size. Experiments show that the new compression techniques can get much better compress ratio.

Key words: integer compression, inverted index, index compression

CLC Number: 

  • TP391
[1] KOBAYASHI M, TAKEDA K. Information retrieval on the web[J]. ACM Computing Surveys (CSUR), 2000, 32(2):144-173.
[2] ANH V N, MOFFAT A. Improved word-aligned binary compression for text indexing[J]. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(6):857-861.
[3] SHIEH W Y, CHUNG C P. A statistics-based approach to incrementally update inverted files[J]. Information Processing & Management, 2005, 41(2):275-288.
[4] Wikipedia.Unary coding[EB/OL].[2014-03-05].http://en.wikipedia.org/wiki/Unary- coding.
[5] SCHOLER F, WILLIAMS H E, YIANNIS J, et al. Compression of inverted indexes for fast query evaluation[C]//Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2002:222-229.
[6] ELIAS P. Universal codeword sets and representations of the integers [J]. IEEE Transactions on Information Theory, 1975, 21(2):194-203.
[7] RICE R F. Some practical universal noiseless coding techniques [D]. Pasadena: California Institute of Technology, 1979.
[8] MOFFAT A, STUIVER L. Binary interpolative coding for effective index compression[J]. Information Retrieval, 2000, 3(1):25-47.
[9] HEMAN S. Super-scalar database compression between RAM and CPU-cache [D]. Amsterdam: University of Amsterdam, 2005.
[10] SOMASUNDARAM K, DOMNIC S. Extended golomb code for integer representation[J]. IEEE Transactions on Multimedia, 2007, 9(2):239-246.
[11] Wikipedia.Truncated binary encoding[EB/OL].[2014-03-11]. http://en.wikipedia.org/wiki/Truncated-binary-encoding.
[12] GLORY V, DOMNIC S. Re-ordered FEGC and block based FEGC for inverted file compression [J]. International Journal of Information Retrieval Research (IJIRR), 2013, 3(1):71-88.
[1] SONG Xing-shen, YANG Yue-xiang, JIANG Yu. Efficient multiple sets intersection using SIMD instructions [J]. JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2018, 53(3): 54-62.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!