JOURNAL OF SHANDONG UNIVERSITY (NATURAL SCIENCE), 2024, Vol. 59, Issue (3): 95-106. doi: 10.6040/j.issn.1671-9352.7.2023.2681

Identification and statistical analysis methods of personal information disclosure in open government data

Haisu CHEN, Jiachun LIAO*, Sicheng YAO

  1. Research Center of Big Data Technology, Nanhu Laboratory, Jiaxing 314002, Zhejiang, China
  • Received: 2023-04-29 Online: 2024-03-20 Published: 2024-03-06
  • Contact: Jiachun LIAO


To promote the protection of personal information when data are opened, this paper conducts an in-depth analysis of the current state of personal information disclosure in open government data. First, datasets are obtained from relevant platforms and pre-processed, and the datasets that contain personal information are classified based on features such as field and table names. Then, sensitive information identification methods are applied to identify and extract various types of personal information in the data, and the extracted information is mapped back to individuals to count the total number of individuals involved and to detect their associated data. Data visualizations are then used to examine the current state of personal information disclosure. Although some open government data platforms may have implemented measures such as data categorization and de-identification, the published open datasets still contain a large amount of personal information. Data categorization and classification, sensitive information identification, and data desensitization therefore need to be made more standardized and accurate.

Key words: big data privacy, personal information, open government data, information identification, statistical analysis

Figure: Application framework


Figure: Data size and number of fields of the datasets from the test platform

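Table 1

Sample records from the Marital Conflict dataset (137 records in total)

Main appeal | Mediation status
Liu* married her husband Chen** in 2010, and they have two children…… | Contacted the ** police station to ask about the situation and asked the station to help issue a domestic violence warning letter……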

Table 2

Sample records from the Elders Information dataset (4 619 records in total)

Door sensor ID | Elder's ID number | Elder's name
8632170****4067 | ******1939******47 | Xu**
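
Field names such as those in Table 2 are the kind of feature the abstract mentions for flagging datasets that contain personal information. The following is a minimal sketch of such a check, not the authors' implementation: the keyword list `PERSONAL_FIELD_KEYWORDS` and the function `flag_personal_fields` are hypothetical, and the paper's actual feature set is not given here. The sample field names are the original Chinese column headers of the Elders Information dataset (door sensor ID, elder's ID number, elder's name).

```python
# Hypothetical keyword list (an assumption for illustration): name, ID document
# number, ID card, mobile phone, telephone, bank card, license plate.
PERSONAL_FIELD_KEYWORDS = ("姓名", "证件号", "身份证", "手机", "电话", "银行卡", "车牌")


def flag_personal_fields(field_names):
    """Return the field names that contain any of the assumed keywords."""
    return [
        name
        for name in field_names
        if any(keyword in name for keyword in PERSONAL_FIELD_KEYWORDS)
    ]


# Original column headers of the Elders Information dataset sample.
fields = ["门磁ID", "老人证件号码", "老人姓名"]
print(flag_personal_fields(fields))  # ['老人证件号码', '老人姓名']
```

A keyword check of this kind only looks at field names, so it would presumably serve as a first-pass filter before the content-level identification methods are applied.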

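Table 3

Listing of baseline algorithms for identification

Identification type | Target information | Identification method
Uniformly coded information | ID card number | Regular expression:
Uniformly coded information | Mobile phone number | Regular expression:
r'^((\+?[0-9]{1,4})|(\(\+86\)))?(13[0-9]|14[57]|15[012356789]|17[03678]|18[0-9])\d{8}$'
Uniformly coded information | License plate number | Regular expression:
r'^[京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁琼使领][A-HJ-NP-Z](?:((\d{5}[A-HJK])|([A-HJK][A-HJ-NP-Z0-9][0-9]{4}))|[A-HJ-NP-Z0-9]{4}[A-HJ-NP-Z0-9挂学警港澳])$'
Uniformly coded information | Bank card number | Regular expression:
r'(?<![0-9a-zA-Z\-])[1-9](?:\d{11,18})(?![0-9a-zA-Z\-])'
Name information | Name | Named entity recognition with the LAC tool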
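
The baseline mobile-number pattern is anchored with `^` and `$`, so it accepts a value only when the entire field is a phone number. The sketch below illustrates that behaviour using the pattern from Table 3 with the spaces introduced by text extraction removed (`{1, 4}` read as `{1,4}`). The helper name `baseline_is_phone` and the sample values are made up for illustration.

```python
import re

# Baseline mobile-number pattern from Table 3 (extraction spaces removed).
BASELINE_PHONE = re.compile(
    r'^((\+?[0-9]{1,4})|(\(\+86\)))?'
    r'(13[0-9]|14[57]|15[012356789]|17[03678]|18[0-9])\d{8}$'
)


def baseline_is_phone(value):
    """Return True only when the whole field value is a mobile number."""
    return BASELINE_PHONE.match(value) is not None


print(baseline_is_phone("13812345678"))             # True
print(baseline_is_phone("+8613812345678"))          # True
print(baseline_is_phone("Contact: 13812345678"))    # False: number is embedded in text
```

Because of the anchors, a number embedded in a free-text field is not matched, which is one plausible reason an anchored pattern behaves differently on structured columns than on narrative text.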

Table 4

Listing of developed algorithms for identification

Identification type | Target information | Identification method
Uniformly coded information | ID card number | Regular expression:
r'[1-9]\d{5}(?:18|19|(?:[23]\d))\d{2}(?:(?:0[1-9])|(?:10|11|12))(?:(?:[0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]'
Uniformly coded information | Mobile phone number | Regular expression:
r'(?<![0-9a-zA-Z\-])(?:\+?86)?1(?:(?:34[0-8])|(?:8\d{2})|(?:(?:[35][0-35-9]|4[14-9]|6[567]|7[0-8]|9[12389])\d))\d{7}(?![0-9a-zA-Z\-])'
Uniformly coded information | License plate number | Regular expression:
r'(?<![锅容管瓶梯起索游车]\d{2}(?=[京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁台琼使领军北南成广沈济空海]{1}[A-Z]{1}[A-Z0-9]{4}(?:[A-Z0-9挂领学警港澳]{1}|[A-Z0-9]{2}\(\d{2}\))))[京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁台琼使领军北南成广沈济空海]{1}[A-Z]{1}[A-Z0-9]{4}(?:[A-Z0-9挂领学警港澳]{1}|[A-Z0-9]{2})(?!\d)'
Uniformly coded information | Bank card number | Regular expression:
r'(?<![0-9a-zA-Z\-])[1-9](?:\d{11,18})(?![0-9a-zA-Z\-])'
Name information | Name | Named entity recognition with the HanLP tool
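
The ID card and mobile-number patterns in Table 4 carry no `^`/`$` anchors, and the mobile pattern uses lookarounds instead, so they can be searched for inside free text. The sketch below applies the two patterns (extraction spaces removed) to a synthetic string. The function name `extract_coded_info` and the sample values are made up for illustration and do not come from the datasets.

```python
import re

# ID card and mobile-number patterns from Table 4 (extraction spaces removed).
ID_CARD = re.compile(
    r'[1-9]\d{5}(?:18|19|(?:[23]\d))\d{2}'
    r'(?:(?:0[1-9])|(?:10|11|12))'
    r'(?:(?:[0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]'
)
MOBILE = re.compile(
    r'(?<![0-9a-zA-Z\-])(?:\+?86)?'
    r'1(?:(?:34[0-8])|(?:8\d{2})|(?:(?:[35][0-35-9]|4[14-9]|6[567]|7[0-8]|9[12389])\d))'
    r'\d{7}(?![0-9a-zA-Z\-])'
)


def extract_coded_info(text):
    """Find ID card numbers and mobile numbers anywhere in a text value."""
    # Both patterns contain only non-capturing groups, so findall returns
    # the full matched strings.
    return {"id_card": ID_CARD.findall(text), "mobile": MOBILE.findall(text)}


# Synthetic sample: the last token is preceded by "A-", so the lookbehind
# in the mobile pattern rejects it.
sample = "Contact 13812345678, ID 110101199003071234, order no. A-13812345678"
print(extract_coded_info(sample))
# {'id_card': ['110101199003071234'], 'mobile': ['13812345678']}
```

The lookbehind and lookahead in the mobile pattern reject candidates that are glued to letters, digits, or hyphens, which filters out digit runs that are part of longer codes.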

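Table 5

Identification results for the Marital Conflict dataset

Algorithm | Uniformly coded information (missed/false detections) | Names (missed/false detections)
Baseline identification algorithm | 0/4 | 10/20
Proposed identification algorithm | 0/0 | 1/4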

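Table 6

Identification results for the Elders Information dataset

Algorithm | Uniformly coded information (missed/false detections) | Names (missed/false detections)
Baseline identification algorithm | 0/11 | 184/0
Proposed identification algorithm | 0/0 | 0/0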
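
Tables 5 and 6 report missed detections and false detections (漏检/误检) for each algorithm. The page does not state how these counts were obtained. One plausible reading, sketched below, is a comparison of the detected items against a manually annotated reference set. The function `detection_errors` and the sample sets are hypothetical, and counting over sets ignores repeated occurrences of the same value.

```python
from typing import Set, Tuple


def detection_errors(annotated: Set[str], detected: Set[str]) -> Tuple[int, int]:
    """Return (missed, false) counts for one information type.

    missed: annotated items that the algorithm did not detect
    false:  detected items that are not in the annotation
    """
    missed = len(annotated - detected)
    false_hits = len(detected - annotated)
    return missed, false_hits


# Made-up example: one annotated name is missed, one non-name is detected.
annotated = {"Liu*", "Chen**", "Xu**"}
detected = {"Liu*", "Xu**", "2010"}
print(detection_errors(annotated, detected))  # (1, 1)
```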

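Table 7

Number of people affected by direct disclosure of personal information within single datasets from the test platform

Number of people involved, by type of disclosed personal information (persons)
Domain label | Basic personal information | Personal identity information | Personal health and physiological information | Personal education and employment information | Personal property information | Other personal information
Social assistance | 42 111 | 24 | 19 413 | 0 | 8 | 0
Market supervision | 10 110 | 3 | 0 | 10 094 | 0 | 0
Science and technology innovation | 2 965 | 0 | 0 | 2 965 | 0 | 0
Meteorological services | 1 027 | 0 | 0 | 0 | 0 | 0
Ecology and environment | 1 260 | 0 | 0 | 667 | 0 | 0
Living services | 256 | 7 | 1 | 29 | 17 | 94
Urban construction and housing | 159 | 0 | 0 | 0 | 0 | 0
Education and culture | 64 162 | 2 143 | 0 | 61 138 | 0 | 0
Geospatial | 123 | 0 | 0 | 0 | 0 | 0
Transportation | 42 | 59 | 0 | 59 | 0 | 0
Credit services | 399 | 19 | 0 | 0 | 0 | 0
Institutions and organizations | 1 | 0 | 0 | 0 | 0 | 0
Industry and agriculture | 16 | 0 | 0 | 0 | 0 | 0
Other | 12 | 0 | 0 | 12 | 0 | 0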


Figure: Summarized number of people re-identified through data linking on the test platform


Figure: Summarized number of people re-identified through data linking on the test platform, under given degrees of confidence and amounts of information extension

