《山东大学学报(理学版)》 ›› 2024, Vol. 59 ›› Issue (7): 27-43.doi: 10.6040/j.issn.1671-9352.1.2023.062
Wei SHAO1,2(),Gaoyu ZHU1,2,Lei YU1,*(),Jiafeng GUO1
摘要:
目前大多研究通过一些降维方法将高维向量转化为低维向量表示,再应用相关向量检索优化技术实现快速相似性检索,从而提高大模型应用表现。当前针对高维数据的降维方法繁多分散,在不同的研究背景下所采用的降维方法不尽相同,同样地,在向量检索技术上也存在大量不同的检索思路与优化方法。通过综述近期的降维和检索算法的主要思路及其优化方法,有助于产生二者之间的启发性联系,支撑后续高维向量检索优化算法研究的展开和深入。
中图分类号:
1 |
LI Hailin . Asynchronism-based principal component analysis for time series data mining[J]. Expert Systems with Applications, 2014, 41 (6): 2842- 2850.
doi: 10.1016/j.eswa.2013.10.019 |
2 |
FISHER R A . The use of multiple measurements in taxonomic problems[J]. Annals of Eugenics, 1936, 7 (2): 179- 188.
doi: 10.1111/j.1469-1809.1936.tb02137.x |
3 | COX T F , COX M A A . Multidimensional scaling[M]. Boca Raton: CRC Press, 2000. |
4 | 何进荣, 丁立新, 李照奎, 等. 基于边界判别投影的数据降维[J]. 软件学报, 2014, 25 (4): 826- 838. |
HE Jinrong , DING Lixin , LI Zhaokui , et al. Margin discriminant projection for dimensionality reduction[J]. Journal of Software, 2014, 25 (4): 826- 838. | |
5 | 张田昊. 数据降维算法研究及其应用[D]. 上海: 上海交通大学, 2008. |
ZHANG Tianhao. Research on dimensionality reduction algorithms and its applications[D]. Shanghai: Shanghai Jiao Tong University, 2008. | |
6 |
ZHANG Zhenyue , ZHA Hongyuan . Principal manifolds and nonlinear dimensionality reduction via tangent space alignment[J]. SIAM Journal on Scientific Computing, 2004, 26 (1): 313- 338.
doi: 10.1137/S1064827502419154 |
7 | 龚铁梁. 数据降维算法研究及其应用[D]. 武汉: 湖北大学, 2012. |
GONG Tieliang. Research on dimensionality reduction algorithms and applications[D]. Wuhan: Hubei University, 2012. | |
8 | SOALE A N . Projection expectile regression for sufficient dimension reduction[J]. Computational Statistics & Data Analysis, 2023, 180, 107666. |
9 | TENENBAUM J. Mapping a manifold of perceptual observations[C]//Proceedings of the 10th International Conference on Neural Information Processing Systems. Denver: ACM, 1997: 682-688. |
10 | LEE J A, ARCHAMBEAU C, VERLEYSEN M. Locally linear embedding versus isotop[C]//ESANN'2003 proceedings-European Symposium on Artificial Neural Networks. Bruges: D-Side, 2003: 527-534 |
11 | BELKIN M, NIYOGI P. Laplacian eigenmaps and spectral techniques for embedding and clustering[C]//Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. Columbia: ACM, 2001: 585-591. |
12 |
WEINBERGER K Q , SAUL L K . Unsupervised learning of image manifolds by semidefinite programming[J]. InternationalJournal of Computer Vision, 2006, 70 (1): 77- 90.
doi: 10.1007/s11263-005-4939-z |
13 | HE X, NIYOGI P. Locality preserving projections[C]//Proceedings of the 16th International Conference on Neural Information Processing Systems. Columbia: ACM, 2003: 153-160. |
14 | 吴晓婷, 闫德勤. 改进的非线性数据降维方法及其应用[J]. 计算机工程与应用, 2011, 47 (2): 156- 159. |
WU Xiaoting , YAN Deqin . Improved non-linear data dimensionality reduction method and its application[J]. Computer Engineering and Applications, 2011, 47 (2): 156- 159. | |
15 |
宋欣, 叶世伟. 基于直接估计梯度思想的数据降维算法[J]. 计算机工程, 2008, 34 (8): 205- 207.
doi: 10.3969/j.issn.1000-3428.2008.08.073 |
SONG Xin , YE Shiwei . Data dimensionality reduction algorithm based on direct estimate grads[J]. Computer Engineering, 2008, 34 (8): 205- 207.
doi: 10.3969/j.issn.1000-3428.2008.08.073 |
|
16 |
马思远, 贺萍. 改进的局部线性嵌入及其混成数据降维算法[J]. 计算机与数字工程, 2022, 50 (12): 2616- 2621.
doi: 10.3969/j.issn.1672-9722.2022.12.002 |
MA Siyuan , HE Ping . Improved locally linear embedding and hybrid algorithm for data dimensionality reduction[J]. Computer & Digital Engineering, 2022, 50 (12): 2616- 2621.
doi: 10.3969/j.issn.1672-9722.2022.12.002 |
|
17 | DONOHO D L , GRIMES C . Hessian eigenmaps: locally linear embedding techniques for high-dimensional data[J]. Proceedings of the National Academy of Sciences of the United States of America, 2003, 100 (10): 5591- 5596. |
18 | 邹艳. 高维数据降维方法的研究[D]. 成都: 西南交通大学, 2012. |
ZOU Yan. Research on dimension reduction methods of high dimensional data[D]. Chengdu: Southwest Jiaotong University, 2012. | |
19 | 宋欣, 叶世伟. 一种在源数据稀疏情况下的数据降维算法[J]. 计算机工程与应用, 2007, 43 (28): 181- 183. |
SONG Xin , YE Shiwei . Data dimensionality reduction algorithm when source data is spare[J]. Computer Engineering and Applications, 2007, 43 (28): 181- 183. | |
20 |
KOKIOPOULOU E , SAAD Y . Orthogonal neighborhood preserving projections: a projection-based dimensionality reduction technique[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29 (12): 2143- 2156.
doi: 10.1109/TPAMI.2007.1131 |
21 | 王宪保, 陈诗文, 姚明海. 基于正则化的半监督等距映射数据降维方法[J]. 电子与信息学报, 2016, 38 (1): 241- 245. |
WANG Xianbao , CHEN Shiwen , YAO Minghai . Data dimensionality reduction method of semi-supervised isometric mapping based on regularization[J]. Journal of Electronics & Information Technology, 2016, 38 (1): 241- 245. | |
22 | LI Haifeng, JIANG Tao, ZHANG Keshu. Efficient and robust feature extraction by maximum margin criterion[C]//Proceedings of the 16th International Conference on Neural Information Processing Systems. Columbia: ACM, 2003: 97-104. |
23 | 罗廷金. 基于流形学习的数据降维算法研究[D]. 长沙: 国防科学技术大学, 2013. |
LUO Tingjin. Research on dimension reduction algorithms based on manifold learning[D]. Changsha: National University of Defense Technology, 2013. | |
24 | 王雷. 基于全局统计与局部几何性质的数据降维算法研究[D]. 合肥: 中国科学技术大学, 2009. |
WANG Lei. Research on global statistic and local geometry based dimensionality reduction algorithm[D]. Hefei: The University of Science and Technology of China, 2009. | |
25 | 胡昭华, 宋耀良. 基于Autoencoder网络的数据降维和重构[J]. 电子与信息学报, 2009, 31 (5): 1189- 1192. |
HU Zhaohua , SONG Yaoliang . Dimensionality reduction and reconstruction of data based on autoencoder network[J]. Journal of Electronics & Information Technology, 2009, 31 (5): 1189- 1192. | |
26 | 申中华. 数据降维技术的建模研究与应用: 特征降维及其应用[D]. 无锡: 江南大学, 2008. |
SHEN Zhonghua. Research and application of dimensionality reduction techniques: feature reduction and its applications[D]. Wuxi: Jiangnan University, 2008. | |
27 | HINTON G E, ROWEIS S. Stochastic neighbor embeddingg[C]//Proceedings of the 15th International Conference on Neural Information Processing Systems. Cambridge: ACM, 2002: 857-864. |
28 | 魏世超, 李歆, 张宜弛, 等. 基于E-t-SNE的混合属性数据降维可视化方法[J]. 计算机工程与应用, 2020, 56 (6): 66- 72. |
WEI Shichao , LI Xin , ZHANG Yichi , et al. Dimension reduction and visualization of mixed-type data based on E-t-SNE[J]. Computer Engineering and Applications, 2020, 56 (6): 66- 72. | |
29 | KHOSHROU A, DORSMAN A B, PAUWELS E J. SVD-based visualisation and approximation for time series data in smart energy systems[C]//2017 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe). Turin: IEEE, 2017: 1-6. |
30 | BOWALA I , FERNANDO M . A novel model for time-series data clustering based on piecewise SVD and BIRCH for stock data analysis on Hadoop platform[J]. Advances in Science, Technology and Engineering Systems Journal, 2017, 2 (3): 855- 864. |
31 | COIFMAN R R , LAFON S . Diffusion maps[J]. Applied and Computational Harmonic Analysis, 2006, 21 (1): 5- 30. |
32 | JUVONEN A , SIPOLA T , HÄMÄLÄINEN T . Online anomaly detection using dimensionality reduction techniques for HTTP log analysis[J]. Computer Networks, 2015, 91 (C): 46- 56. |
33 | SHATKAY H . The Fourier transform: a primer[M]. Providence: Brown University, 1995: 1- 20. |
34 | SIFUZZAMAN M , ISLAM M R , ALI M Z . Application of wavelet transform and its advantages compared to Fourier transform[J]. Journal of Physical Sciences, 2009, 13, 121- 134. |
35 | GAO Wenxu , MA Zhengming , XIONG Chenkui , et al. Dimensionality reduction of SPD data based on Riemannian manifold tangent spaces and local affinity[J]. Applied Intelligence, 2023, 53 (2): 1887- 1911. |
36 | FRANCIS D P , RAIMOND K , KATHRINE G J W . Dimensionality reduction of large datasets with explicit feature maps[J]. Intelligent Decision Technologies, 2023, 17 (2): 457- 470. |
37 | BERA D , PRATAP R , VERMA B D . Dimensionality reduction for categorical data[J]. IEEE Transactions on Knowledge and Data Engineering, 2023, 35 (4): 3658- 3671. |
38 | HALLAJI E , FARAJZADEH-ZANJANI M , RAZAVI-FAR R , et al. Constrained generative adversarial learning for dimensionality reduction[J]. IEEE Transactions on Knowledge and Data Engineering, 2023, 35 (3): 2394- 2405. |
39 | WEI Wei , YUE Qin , FENG Kai , et al. Unsupervised dimensionality reduction based on fusing multiple clustering results[J]. IEEE Transactions on Knowledge and Data Engineering, 2023, 35 (3): 3211- 3223. |
40 | ISMKHAN H , IZADI M . Proposing a dimensionality reduction technique with an inequality for unsupervised learning from high-dimensional big data[J]. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53 (6): 3880- 3889. |
41 | BADER J, NELSON D, CHAI-ZHANG T, et al. Neural network for nonlinear dimension reduction through manifold recovery[C]//2019 IEEE MIT Undergraduate Research Technology Conference (URTC). Cambridge: IEEE, 2019: 1-4. |
42 | CUI Yuming, FANG Yinglan. Research on PCA data dimension reduction algorithm based on entropy weight method[C]//2020 2nd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI). Taiyuan: IEEE, 2020: 392-396. |
43 | SHI Yang, ZHANG Aidong. A shrinking-based dimension reduction approach for multi-dimensional analysis[C]//Proceedings of 16th International Conference on Scientific and Statistical Database Management. Santorini Island: IEEE, 2004: 427-428. |
44 | GAO Weixu , MA Zhengming , YUAN Xuejing . Dimensionality reduction algorithm of tensor data based on orthogonal tucker decomposition and local discrimination difference[J]. Applied Intelligence, 2022, 52 (12): 14518- 14540. |
45 | BEHERA A P , SINGH J , VERMA S , et al. Data visualization through non linear dimensionality reduction using feature based Ricci flow embedding[J]. Multimedia Tools and Applications, 2022, 81 (11): 14831- 14850. |
46 | ISLAM M T , XING L . A data-driven dimensionality-reduction algorithm for the exploration of patterns in biomedical data[J]. Nature Biomedical Engineering, 2021, 5 (6): 624- 635. |
47 | VOGELSTEIN J T , BRIDGEFORD E W , TANG M , et al. Supervised dimensionality reduction for big data[J]. Nature Communications, 2021, 12 (1): 2872. |
48 | INDYK P, MOTWANI R. Approximate nearest neighbors: towards removing the curse of dimensionality[C]//Proceedings of the thirtieth annual ACM symposium on Theory of computing-STOC'98. Dallas: ACM, 1998: 604-613. |
49 | YIANILOS P N. Data structures and algorithms for nearest neighbor search in general metric spaces[C]//Proceedings of the Fourth Annual ACM-SIAM symposium on Discrete algorithms. Austin: ACM, 1993: 311-321. |
50 | CAYTON L. Fast nearest neighbor retrieval for bregman divergences[C]//Proceedings of the 25th international conference on Machine learning. Helsinki: ACM, 2008: 112-119. |
51 | BENTLEY J L . Multidimensional binary search trees used for associative searching[J]. Communications of the ACM, 1975, 18 (9): 509- 517. |
52 | SPOTIFY. Github-spotify/annoy: about approximate nearest neighbors in C++/Python optimized for memory usage and loading/saving to disk[EB/OL]. [2023-07-25]. https://github.com/spotify/annoy. |
53 | DASGUPTA S, FREUND Y. Random projection trees and low dimensional manifolds[C]//Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing. Columbia: ACM, 2008: 537-546. |
54 | SILPA-ANAN C, HARTLEY R. Optimised KD-trees for fast image descriptor matching[C]//2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage: IEEE, 2008: 1-8. |
55 | MUJA M , LOWE D G . Scalable nearest neighbor algorithms for high dimensional data[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36 (11): 2227- 2240. |
56 | FUKUNAGA K , NARENDRA P M . A branch and bound algorithm for computing k-nearest neighbors[J]. IEEE Transactions on Computers, 1975, 100 (7): 750- 753. |
57 | NAVARRO G . Searching in metric spaces by spatial approximation[J]. The VLDB Journal, 2002, 11 (1): 28- 46. |
58 | BEYGELZIMER A, KAKADE S, LANGFORD J. Cover trees for nearest neighbor[C]//Proceedings of the 23rd Internatio-nal Conference on Machine Learning. Pittsburgh: ACM, 2006: 97-104. |
59 | WANG H Y, CAO J, SHU L C, et al. Locality sensitive hashing revisited: filling the gap between theory and algorithm analysis[C]//Proceedings of the 22nd ACM International Conference on Information & Knowledge Managementt-CIKM'13. San Francisco: ACM, 2013: 1969-1978. |
60 | DATAR M, IMMORLICA N, INDYK P, et al. Locality-sensitive hashing scheme based on p-stable distributions[C]//Proceedings of the Twentieth Annual Symposium on Computational Geometry. New York: ACM, 2004: 253-262. |
61 | LYU Q, JOSEPHSON W, WANG Z, et al. Multi-probe LSH: efficient indexing for high-dimensional similarity search[C]//Proceedings of the 33rd International Conference on Very Large Data Bases. Vienna: ACM, 2007: 950-961. |
62 | LYU Q , JOSEPHSON W , WANG Z , et al. Intelligent probing for locality sensitive hashing: multi-probe LSH and beyond[J]. Proceedings of the VLDB Endowment, 2017, 10 (12): 2021- 2024. |
63 | ZOLOTAREV V M . One-dimensional stable distributions[M]. Providence: American Mathematical Society, 1986. |
64 | CHARIKAR M S. Similarity estimation techniques from rounding algorithms[C]//Proceedings of the thiry-fourth annual ACM Symposium on Theory of Computing. Montreal: ACM, 2002: 380-388. |
65 | PANIGRAHY R. Entropy based nearest neighbor search in high dimensions[C]//Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm-SODA'06. Miami: ACM, 2006: 1186-1195. |
66 | LIU W, WANG J, KUMAR S, et al. Hashing with graphs[C]//Proceedings of the 28th International Conference on Machine Learning. Bellevue: International Machine Learning Society, 2011: 1-8. |
67 | WANG Jianfeng, WANG Jingdong, YU Nenghai, et al. Order preserving hashing for approximate nearest neighbor search[C]//Proceedings of the 21st ACM International Conference on Multimedia. Barcelona: ACM, 2013: 133-142. |
68 | WANG J, LIU W, SUN A X, et al. Learning hash codes with listwise supervision[C]///2013 IEEE International Conference on Computer Vision. Sydney: IEEE, 2013: 3032-3039. |
69 | JIN Zhongming, HU Yao, LIN Yue, et al. Complementary projection hashing[C]//Proceedings of the IEEE International Conference on Computer Vision. Sydney: IEEE, 2013: 257-264. |
70 | JOLY A, BUISSON O. Random maximum margin hashing[C]//CVPR. Colorado Springs: IEEE, 2011: 873-880. |
71 | JEGOU H , DOUZE M , SCHMID C . Product quantization for nearest neighbor search[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 33 (1): 117- 128. |
72 | GONG Y C , LAZEBNIK S , GORDO A , et al. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35 (12): 2916- 2929. |
73 | KONG Weihao, LI Wujun. Isotropic hashing[C]//Proceedings of the Advances in Neural Information Processing Systems, [S. l. ]: Curran Associates, 2012: 1-9. |
74 | BABENKO A, LEMPITSKY V. Additive quantization for extreme vector compression[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014: 931-938. |
75 | ZHANG T, DU C, WANG J D. Composite quantization for approximate nearest neighbor search[C]//Proceedings of the 31st International Conference on Machine Learning-Volume 32. Beijing: ACM, 2014: 838-846. |
76 | SALAKHUTDINOV R , HINTON G . Semantic hashing[J]. International Journal of Approximate Reasoning, 2009, 50 (7): 969- 978. |
77 | GRAY R M , NEUHOFF D L . Quantization[J]. IEEE Transactions on Information Theory, 1998, 44 (6): 2325- 2383. |
78 | GE Tiezheng , HE Kaiming , KE Qifa , et al. Optimized product quantization[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36 (4): 744- 755. |
79 | JEGOU H, DOUZE M, SCHMID C, et al. Aggregating local descriptors into a compact image representation[C]//2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco: IEEE, 2010: 3304-3311. |
80 | WANG Mengzhao, XU Xiangliang, YUE Qiang, et al. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search[EB/OL]. (2021-01-29)[2023-10-18]. http://arxiv.org/abs/2101.12631. |
81 | VIRMAJOKI O, FRANTI P. Divide-and-conquer algorithm for creating neighborhood graph for clustering[C]//Proceedings of the 17th International Conference on Pattern Recognition. Cambridge: IEEE, 2004: 264-267. |
82 | DONG W, MOSES C, LI K. Efficient k-nearest neighbor graph construction for generic similarity measures[C]//Proceedings of the 20th International Conference on World Wide Web. Hyderabad: ACM, 2011: 577-586. |
83 | HAJEBI K, ABBASI-YADKORI Y, SHAHBAZI H, et al. Fast approximate nearest-neighbor search with k-nearest neighbor graph[C]//Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence-Volume Two. Barcelona: ACM, 2011: 1312-1317. |
84 | UNO T, SUGIYAMA M, TSUDA K. Efficient construction of neighborhood graphs by the multiple sorting method[EB/OL]. (2009-04-21)[203-10-18]. http://arxiv.org/abs/0904.3151. |
85 | FU Cong, CAI Deng. EFANNA: an extremely fast approximate nearest neighbor search algorithm based on kNN graph[EB/OL]. (2016-09-23)[2023-10-18]. http://arxiv.org/abs/1609.07228. |
86 | MALKOV Y , PONOMARENKO A , LOGVINOV A , et al. Approximate nearest neighbor algorithm based on navigable small world graphs[J]. Information Systems, 2014, 45, 61- 68. |
87 | MALKOV Y A , YASHUNIN D A . Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42 (4): 824- 836. |
88 | YAHOOJAPAN. Github-yahoojapan/NGT: neighborhood graph and tree for indexing high-dimensional data[EB/OL]. [2023-08-22]. https://github.com/yahoojapan/NGT. |
89 | FU Cong, XIANG Chao, WANG Changxu, et al. Fast approximate nearest neighbor search with the navigating spreading-out graph[EB/OL]. (2017-07-17)[2023-10-13]. https://arxiv.org/abs/1707.00143. |
90 | JAYARAM Subramanya S, DEVVRIT F, SIMHADRI H V, et al. DiskANN: fast accurate billion-point nearest neighbor search on a single node[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 13766-13776. |
91 | HARWOOD B, DRUMMOND T. FANNG: fast approximate nearest neighbour graphs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 5713-5722. |
92 | BARANCHUK D, PERSIYANOV D, SINITSIN A, et al. Learning to route in similarity graphs[C]//Proceedings of the 36th International Conference on Machine Learning, Long Beach: PMLR, 2019: 475-484. |
93 | LI C L, ZHANG M J, ANDERSEN D G, et al. Improving approximate nearest neighbor search through learned adaptive early termination[C]//Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. Portland: ACM, 2020: 2539-2554. |
94 | MUÑOZ J A V , DIAS Z , DA SILVA TORRES R . A genetic programming approach for searching on nearest neighbors graphs[J]. Multimedia Tools and Applications, 2022, 81 (16): 23449- 23472. |
[1] | 张凌,任雪芳. 数据智能分类与分类智能检索-识别[J]. 《山东大学学报(理学版)》, 2020, 55(10): 7-14. |
[2] | 曾雪强,叶震麟,左家莉,万中英,吴水秀. 一种块增量偏最小二乘算法[J]. 《山东大学学报(理学版)》, 2019, 54(3): 93-101. |
|