2016, 5(2):0-0. DOI: 10.12146/j.issn.2095-3135.201602000
Abstract:Big data is driving a new wave of scientific and technological innovation, providing new momentum and new opportunities for socioeconomic transformation and upgrading and for strengthening national competitiveness. Many countries have therefore announced big data development plans. In September 2015, China's State Council issued the Action Outline for Promoting the Development of Big Data, which requires government departments to promote data sharing and calls for "integrating big data development with scientific research and innovation, advancing basic research and core technology breakthroughs, forming a big data product system, and improving the big data industry chain." Big data has deep scientific and technical substance; in recent years it has spurred extensive research across disciplines and brought changes in technology, business models, and even ways of thinking to many industries. The articles in this big data special issue come from the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (hereinafter "SIAT"), the Institute of Computing Technology, Chinese Academy of Sciences, the University of Science and Technology of China, Shenzhen University, and other institutions. The issue is organized around three themes: big data platforms and enabling technologies, big data applications, and big data security and privacy. In "Research on the Construction and Application of Big Data Analytics Platforms", Academician Guoliang Chen and Professor Zhexue Huang of Shenzhen University discuss key issues and industry experience in building big data platform infrastructure. In "A Survey of Metadata Management in Large-Scale Distributed File Systems", Professors Chengzhong Xu and Yang Wang of SIAT give a detailed analysis and survey of metadata management, a key problem in big data file systems. Dr. Yun Zhang, Dr. Qinglan Li, Professor Fengfeng Zhou, and Yujie Lin of SIAT review progress in big data research on multimedia, meteorology, biology, and the Internet, respectively, and present their recent results. As big data applications spread, security and privacy problems become increasingly prominent, so this issue includes two papers on these topics. Professor Qingshan Jiang addresses the security threats posed by current mobile malware and proposes a new method that provides effective detection of malware on the Android platform. Dr. Ling Yin considers the privacy-exposure problems arising in research on large-scale mobile phone location data and studies the relationship between individual re-identification risk and data utility. This special issue presents recent results in several important directions of big data engineering and science, reflecting the explorations that some domestic research institutes and universities have made in this field. China's Internet and many of its application markets are now the largest in the world, providing valuable material and empirical opportunities for big data research. We believe that, through close cooperation and sustained effort by researchers and industry, China's big data research and application capabilities will enjoy ever broader room for development.
2016, 5(2):2-18. DOI: 10.12146/j.issn.2095-3135.201602001
Abstract:The big data analytics platform is an indispensable infrastructure for big data processing and applications. Based on our research activities, practical experience with big data analytics, and lessons learnt from industrial projects, this paper addressed the platform design, mainstream technologies, and industrial cases of big data analytics platforms. Firstly, the main functions and architecture of such platforms were analyzed. Then the key enabling technologies were introduced, with a focus on the architecture of Spark and its core components. Finally, three application case studies were presented in the areas of manufacturing, retail, and smart grids.
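Spark's core abstraction, a chain of transformations over distributed collections, can be sketched on a single machine with plain Python; the records and field names below are hypothetical illustrations, not from the paper:

```python
from functools import reduce

# Toy "RDD"-style pipeline: the same map/filter/reduce transformation
# chain Spark applies across partitions, run here on a plain list.
records = ["retail,120", "grid,80", "retail,45", "manufacturing,200"]

parsed = [r.split(",") for r in records]                     # map
retail = [(k, int(v)) for k, v in parsed if k == "retail"]   # filter + map
total = reduce(lambda acc, kv: acc + kv[1], retail, 0)       # reduce
```

In Spark the same chain would be expressed with `map`, `filter`, and `reduce` on an RDD or DataFrame, with each stage executed in parallel over partitions.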
2016, 5(2):19-28. DOI: 10.12146/j.issn.2095-3135.201602002
Abstract:Mobile phone location data are a newly emerging data source with great potential to support human mobility research. However, recent studies have indicated that many users can be easily re-identified based on their unique activity patterns. Privacy protection procedures usually change the original data and cause a loss of data utility for analysis purposes. The need for detailed data for activity analysis while avoiding potential privacy risks therefore presents a challenge. The aim of this study is to reveal the re-identification risks for mobile users in a Chinese city and to examine the quantitative relationship between re-identification risk and data utility for aggregated mobility analysis. The first step was to evaluate the re-identification risks in Shenzhen City, a metropolis in China. A spatial generalization approach to protecting privacy was then proposed and implemented, and spatially aggregated analysis was used to assess the loss of data utility after privacy protection. The results demonstrate that the re-identification risks in Shenzhen City clearly differ from those reported for regions in Western countries, which proves the spatial heterogeneity of re-identification risks in mobile phone location data. A uniform mathematical relationship was also found between re-identification risk (x) and data utility (y) for both attack models: y = -ax^b + c (a, b, c > 0; 0 < x < 1). The discovered mathematical relationship provides data publishers with useful guidance on choosing the right tradeoff between privacy and utility. Overall, this study contributes to a better understanding of re-identification risks and provides a privacy-utility tradeoff benchmark for improving privacy protection when sharing detailed trajectory data.
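The fitted relationship y = -ax^b + c can be evaluated directly; the coefficient values below are illustrative placeholders, since the paper's fitted coefficients are not given in the abstract:

```python
def data_utility(risk, a=0.8, b=2.0, c=0.9):
    """Evaluate the fitted privacy-utility curve y = -a*x**b + c
    for a re-identification risk x in (0, 1), with a, b, c > 0."""
    assert a > 0 and b > 0 and c > 0 and 0 < risk < 1
    return -a * risk ** b + c
```

With coefficients of this form, a data publisher can read off how much utility the anonymized data retain at any permitted risk level and pick the tradeoff point accordingly.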
2016, 5(2):29-40. DOI: 10.12146/j.issn.2095-3135.201602003
Abstract:Currently, the number of mobile malware programs is growing explosively, and the increasingly large feature library poses challenges to security solution providers. Traditional detection methods cannot deal with the huge amount of data promptly and effectively, and mobile malware detection methods based on machine learning suffer from excessive numbers of features, low detection accuracy and unbalanced data. In this paper, a feature selection method based on the mean and variance of samples was proposed to reduce the number of features without affecting classification. Different feature extraction algorithms, including Principal Component Analysis, Karhunen-Loève Transformation and Independent Component Analysis, were implemented to construct an ensemble learning model for high detection accuracy. To solve the problem of unbalanced Android App samples, a multi-level classification model based on the decision tree was also developed. Experimental results show that the proposed methods can detect Android malware effectively, with accuracy increased by 6.41%, 3.96% and 3.36%, respectively.
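The mean-and-variance feature selection idea can be sketched as a per-feature separation score; the exact scoring formula below is an assumption for illustration, since the abstract does not give the paper's formula:

```python
from statistics import mean, pstdev

def separation_score(benign, malware):
    """Score one feature by how far apart the class means are
    relative to the in-class spread (larger = more discriminative)."""
    spread = pstdev(benign) + pstdev(malware)
    return abs(mean(benign) - mean(malware)) / (spread + 1e-9)

def select_features(benign_rows, malware_rows, k):
    """Keep the k feature indices with the highest separation scores."""
    n = len(benign_rows[0])
    scores = [
        separation_score([r[j] for r in benign_rows],
                         [r[j] for r in malware_rows])
        for j in range(n)
    ]
    return sorted(range(n), key=lambda j: scores[j], reverse=True)[:k]
```

Features whose class means overlap (low score) are dropped before training, shrinking the feature library without hurting classification.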
2016, 5(2):41-56. DOI: 10.12146/j.issn.2095-3135.201602004
Abstract:The development of science and technology has brought rapid growth of data, of which video and image data account for a high percentage. How to handle these data efficiently and find valuable information in them is a hot research topic. Big data are characterized by four Vs: volume, velocity, variety, and value, representing large amounts of data, quick data processing, various data types, and low value density, respectively. Video big data share all these characteristics, and often come with much greater data redundancy than other types of data; as a result, they call for more efficient techniques for compression and processing. Research on video big data is primarily carried out along four dimensions: video data representation, intelligent video analysis, video compression and transmission, and video display and quality evaluation. Recent trends show that video representation is becoming more realistic and intelligent, and video analysis more accurate in identification and classification thanks to deep neural networks. At the same time, video compression promises to be more efficient with new methods that reduce coding complexity, and less redundant with the help of visual-perception-aware coding algorithms. In step with more advanced video representation, video display devices are undergoing hardware upgrades, guided by a comprehensive methodology of video quality evaluation centered on quality of experience instead of the traditional criteria developed for image quality assessment.
2016, 5(2):57-72. DOI: 10.12146/j.issn.2095-3135.201602005
Abstract:Metadata of file systems are the data in charge of maintaining the namespace, permission semantics and locations of file data blocks. Metadata operations can account for up to 80% of total file system operations, so the performance of metadata services substantially impacts the overall performance of distributed file systems, especially with the advent of the big data era, which poses great pressure on the underlying storage systems. This paper reports the state-of-the-art research on metadata services in large-scale distributed file systems. The study was conducted from the three perspectives commonly used to characterize file systems: high scalability, high performance and high availability, focusing on their respective major challenges as well as the mainstream technologies developed for them. Additionally, some open problems in this research area were identified and analyzed, which could serve as a reference for related studies.
2016, 5(2):73-84. DOI: 10.12146/j.issn.2095-3135.201602006
Abstract:A tropical cyclone (TC) is a destructive weather system, so accurate and timely forecasts of a TC's intensity and track are vital for disaster prevention and mitigation. This study proposed statistical regression methods to forecast TC intensity change 12, 24, 36 and 48 hours ahead over the Northwest Pacific Ocean. In addition to the conventional climatology and persistence factors, this study took into account the effect of land on TC intensity change by introducing a new factor, the ratio of sea to land, into the statistical regression models. Three sets of TC samples for the years 2000-2014, ocean basin samples, offshore samples, and total TC samples, were used to develop the intensity forecasting models. The 1°×1° Final Operational Global Analysis data from the National Centers for Environmental Prediction/National Center for Atmospheric Research were used as the predictors for the environmental effects. Two methods, stepwise regression and principal component analysis, were employed to develop the TC intensity forecasting models. With the ratio of sea to land taken into account, the intensity forecasting performance for offshore TCs was significantly improved. The proposed models are therefore valuable for practical disaster prediction.
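The new sea-to-land ratio predictor can be illustrated on a toy land/sea grid as the fraction of sea cells within a fixed window around the TC center; the grid, window shape, and cell coding here are hypothetical, not the paper's actual definition:

```python
def sea_to_land_ratio(grid, ci, cj, radius):
    """Fraction of grid cells within `radius` (Chebyshev distance) of the
    TC center (ci, cj) that are sea. grid cells: 0 = land, 1 = sea."""
    total = sea = 0
    for i in range(max(0, ci - radius), min(len(grid), ci + radius + 1)):
        for j in range(max(0, cj - radius), min(len(grid[0]), cj + radius + 1)):
            total += 1
            sea += grid[i][j]
    return sea / total

# Toy land-sea mask: left half land, right half sea (a coastline).
mask = [[0, 0, 1, 1] for _ in range(4)]
```

A TC over open ocean gets a ratio near 1, while one approaching the coast gets a lower value, letting the regression model capture the weakening effect of land.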
2016, 5(2):85-96. DOI: 10.12146/j.issn.2095-3135.201602007
Abstract:Genome assembly is one of the challenges in metagenomic analysis. It is usually assumed that the sequencing reads come from the same genome; however, the mobile elements active in microbial genomes cast doubt on this assumption. This work formulated the issue as a binary classification problem, since accurate discrimination of mobile elements from chromosomes could greatly facilitate metagenomic analysis. After quantifying the sequencing reads in the metagenome, the collaboration of binary classification algorithms with feature selection algorithms, including ReliefF, the chi-squared test, and Fisher's t-test, was investigated. All feature subsets were tested using classification algorithms including logistic regression, extreme learning machine, support vector machine and random forest. Experimental results demonstrate that the model based on the ReliefF algorithm and the random forest algorithm achieves over 95% accuracy with only 100 features, outperforming the model that uses all 690 features.
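One common way to quantify sequencing reads as fixed-length feature vectors for such a binary classifier is k-mer frequency counting; the abstract does not specify the paper's 690 features, so the k-mer scheme below is an illustrative stand-in:

```python
from collections import Counter
from itertools import product

def kmer_features(sequence, k=2):
    """Normalized k-mer frequency vector over the A/C/G/T alphabet,
    usable as input features for a binary classifier (4**k features)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    windows = max(1, len(sequence) - k + 1)
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[km] / windows for km in kmers]
```

Each read maps to the same 4^k-dimensional vector regardless of its length, so feature selection (e.g. ReliefF) and classifiers (e.g. random forest) can then operate on a uniform input.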
2016, 5(2):97-108. DOI: 10.12146/j.issn.2095-3135.201602008
Abstract:In this paper, a sports news search engine, Geeking, was introduced, which contains four functional modules: web crawling, champion list building, search processing and the user interface. Geeking provides query correction, query auto-completion, search result sorting, news clustering, keyword highlighting and snapshot visualization. Given a query, the system automatically completes it according to the search logs and hot news keywords. If a query returns no results, the system corrects it and recommends alternative query terms. Related documents are retrieved quickly using the champion lists. Document relevance is calculated from tf-idf values together with other factors such as news headlines and release time. To cluster similar news, the longest common subsequence and the Levenshtein distance are used to measure the similarity between news headlines, which is taken as the similarity between the documents. Test results show that Geeking is fast and stable.
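The headline-similarity step can be sketched with a standard Levenshtein implementation; the paper also uses the longest common subsequence, but only the edit-distance half is shown here, and the length normalization is an assumption:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming,
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def headline_similarity(h1, h2):
    """Similarity in [0, 1] from normalized edit distance; headlines
    scoring near 1 are treated as covering the same news story."""
    if not h1 and not h2:
        return 1.0
    return 1 - levenshtein(h1, h2) / max(len(h1), len(h2))
```

Clustering then groups articles whose headline similarity exceeds a threshold, avoiding a full-text comparison of the documents.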