Big data is driving a new round of technological innovation, bringing new impetus and opportunities for economic and social transformation and for strengthening national competitiveness. Many countries have therefore launched initiatives to develop big data. In recent years, big data has prompted extensive research across a variety of disciplines and has changed the technology, business models and ways of thinking of many industries. This special issue is organized around big data platforms and supporting technologies, big data applications, and security and privacy, and it presents recent explorations and achievements in big data engineering and science.

Cheng-Zhong Xu, Chair Professor

IEEE Fellow, Computer and Information Science,

Dean, Faculty of Science and Technology,

Interim Director, Institute of Collaborative Innovation,

University of Macau, Macau, China. 

Prof. Xu’s research interests lie in parallel and distributed computing, cloud computing, big data, and data-driven intelligence applications in smart cities and self-driving vehicles.


Guang Tan, Professor

School of Intelligent Systems Engineering, Sun Yat-Sen University, Shenzhen, China.

Prof. Tan’s research focuses on perception and network systems.

  • 1  Preface
    XU Chengzhong TAN Guang
    2016, 5(2):0-0. DOI: 10.12146/j.issn.2095-3135.201602000
    2  Review on Construction and Application of Big Data Analytical Platform
    WANG Qiang LI Junjie CHEN Xiaojun HUANG Zhexue CHEN Guoliang
    2016, 5(2):2-18. DOI: 10.12146/j.issn.2095-3135.201602001
    The big data analytics platform is an indispensable infrastructure for big data processing and applications. Based on our research activities, practical experience with big data analytics, and lessons learnt from industrial projects, this paper addressed the platform design, mainstream technologies, and industrial cases of big data analytics platforms. First, the main functions and architecture of such platforms were analyzed. Then, the key enabling technologies were introduced, with a focus on the architecture of Spark and its core components. Finally, three application case studies were presented in the areas of large-scale manufacturing, retail, and smart grids.
    3  Re-Identification Risk Versus Data Utility for Aggregated Mobility Research Using Mobile Phone Location Data
    YIN Ling HU Jinxing WANG Qian WANG Wei CAI Zhiling
    2016, 5(2):19-28. DOI: 10.12146/j.issn.2095-3135.201602002
    Mobile phone location data is a newly emerging data source with great potential to support human mobility research. However, recent studies have indicated that many users can be easily re-identified based on their unique activity patterns. Privacy protection procedures usually change the original data and cause a loss of data utility for analysis purposes. Reconciling the need for detailed data for activity analysis with the avoidance of potential privacy risks is therefore a challenge. The aim of this study is to reveal the re-identification risks for mobile users in a Chinese city and to examine the quantitative relationship between re-identification risk and data utility for an aggregated mobility analysis. The first step was to evaluate the re-identification risks in Shenzhen City, a metropolis in China. A spatial generalization approach to protecting privacy was then proposed and implemented, and spatially aggregated analysis was used to assess the loss of data utility after privacy protection. The results show that the re-identification risks in Shenzhen City are clearly different from those reported for regions in Western countries, which demonstrates the spatial heterogeneity of re-identification risks in mobile phone location data. A uniform mathematical relationship has also been found between re-identification risk (x) and data utility (y) for both attack models: y = -ax^b + c (a, b, c > 0; 0 < x < 1). The discovered mathematical relationship provides data publishers with useful guidance on choosing the right tradeoff between privacy and utility. Overall, this study contributes to a better understanding of re-identification risks and provides a privacy-utility tradeoff benchmark for improving privacy protection when sharing detailed trajectory data.
    4  Malware Detection Techniques by Mining Massive Behavioral Data of Mobile Apps
    ZHANG Wei REN Huan ZHANG Kai LI Chengming JIANG Qingshan
    2016, 5(2):29-40. DOI: 10.12146/j.issn.2095-3135.201602003
    Currently, the number of mobile malware programs is growing explosively, and the increasingly large feature library poses challenges to security solution providers. Traditional detection methods cannot deal with the huge amount of data promptly and effectively. Mobile malware detection methods based on machine learning suffer from excessive numbers of features, low detection accuracy and unbalanced data. In this paper, a feature selection method based on the mean and variance of samples was proposed to reduce the number of features without affecting classification. Different feature extraction algorithms, including Principal Component Analysis, the Karhunen-Loève Transform and Independent Component Analysis, were implemented to construct an ensemble learning model for high detection accuracy. To solve the problem of unbalanced data in Android App samples, a multi-level classification model based on the decision tree was also developed. Experimental results show that the proposed methods can detect Android malware effectively, and the accuracy is increased by 6.41%, 3.96% and 3.36%, respectively.
    5  A Review on Video Big Data
    LIU Xiangkai ZHANG Yun ZHANG Huan LI Na FAN Chunling XIE Zuqing ZHU Linwei
    2016, 5(2):41-56. DOI: 10.12146/j.issn.2095-3135.201602004
    Advances in science and technology have brought rapid growth in data, of which video and image data account for a high percentage. How to handle these data efficiently and extract valuable information from them is a hot topic. Big data are characterized by four Vs: volume, velocity, variety, and value, representing large amounts of data, quick data processing, various data types, and low value density, respectively. Video big data share all these characteristics, and often come with much greater data redundancy than other types of data. As a result, they call for more efficient techniques for compression and processing. Research on video big data is primarily carried out along four dimensions: video data representation, intelligent video analysis, video compression and transmission, and video display and quality evaluation. Recent trends show that video representation is becoming more realistic and intelligent, and video analysis more accurate in identification and classification thanks to deep neural networks. At the same time, video compression promises to be more efficient with new methods to reduce coding complexity, and less redundant with the help of visual-perception-aware coding algorithms. In accordance with more advanced video representation, video display devices are undergoing hardware upgrades, guided by a comprehensive methodology of video quality evaluation that is centered on quality of experience, instead of the traditional criteria developed for image quality assessment.
    6  A Review on Metadata Management in Large-Scale Distributed File Systems
    WANG Yang LIU Xing XU Chengzhong JIANG Song WANG Gang WEN Tao FAN Xiaopeng LU Ping
    2016, 5(2):57-72. DOI: 10.12146/j.issn.2095-3135.201602005
    File system metadata is the data responsible for maintaining the namespace, permission semantics and locations of file data blocks. Metadata operations can account for up to 80% of total file system operations. As such, the performance of metadata services substantially affects the overall performance of distributed file systems, especially in the big data era, which puts great pressure on the underlying storage systems. This paper reports the state-of-the-art research on metadata services in large-scale distributed file systems. The study was conducted from three perspectives that are commonly used to characterize file systems: high scalability, high performance and high availability, with a focus on the major challenges of each and the mainstream technologies developed to address them. Additionally, some open problems in this research area were identified and analyzed, which could serve as a reference for related studies.
    7  Statistical Models for Tropical Cyclone Intensity Forecasting Based on Meteorological Big Data
    TANG Tingting LI Qinglan LI Guangxin PENG Yulong
    2016, 5(2):73-84. DOI: 10.12146/j.issn.2095-3135.201602006
    A tropical cyclone (TC) is a destructive weather system. Accurate and timely forecasting of a TC’s intensity and track is vital for disaster prevention and mitigation. This study proposed statistical regression methods to forecast the TC’s intensity change for 12, 24, 36 and 48 hours in the future over the Northwest Pacific Ocean. In addition to the conventional factors of climatology and persistence, this study took into account the land effect on the TC’s intensity change by introducing a new factor, i.e. the ratio of sea to land, into the statistical regression models. Three sets of TC samples (ocean basin samples, offshore samples, and all TC samples) for the years 2000 to 2014 were used to develop the intensity forecasting models. The 1°×1° Final Operational Global Analysis data from the National Centers for Environmental Prediction/National Center for Atmospheric Research were used as the predictors for the environmental effects. Two methods, stepwise regression and principal component analysis, were employed to develop the TC intensity forecasting models. Owing to the consideration of the ratio of sea to land, the intensity forecasting performance for offshore TCs was significantly improved. Therefore, the proposed models are valuable for practical disaster prediction.
    8  Accurate Detection of Mobile Sequence in Metagenome
    PENG Chao WANG Pu GE Ruiquan ZHOU Fengfeng
    2016, 5(2):85-96. DOI: 10.12146/j.issn.2095-3135.201602007
    Genome assembly is one of the challenges in metagenomic analysis. It is usually assumed that the sequencing reads are from the same genome. However, the mobile elements active in microbial genomes call this assumption into question. This work formulated the issue as a binary classification problem. The accurate discrimination of mobile elements from chromosomes could greatly facilitate metagenomic analysis. After quantifying the sequencing reads in the metagenome, the combination of binary classification algorithms with feature selection algorithms, including ReliefF, the chi-squared test, and Fisher’s t-test, was investigated. All feature subsets were tested using classification algorithms such as logistic regression, extreme learning machine, support vector machine and random forest. Experimental results demonstrate that the model based on the ReliefF algorithm and the random forest algorithm achieves over 95% accuracy with only 100 features, which outperforms the model utilizing all 690 features.
    9  Geeking: a Sports News Search Engine System Based on Champion List
    LIN Yujie CHEN Xinquan GAO Yan XIAO Kafei HU Hongxiang HUA Qiang
    2016, 5(2):97-108. DOI: 10.12146/j.issn.2095-3135.201602008
    This paper introduces Geeking, a sports news search engine that contains four functional modules: web crawling, champion list building, search processing and user interface. Geeking provides query correction, query auto-completion, search result sorting, news clustering, keyword highlighting and snapshot visualization. Given a query, the system automatically completes it according to the search logs and hot news keywords. If no results are returned, the system corrects the query and provides recommended query terms. Related documents are retrieved quickly using the champion list. Document relevance is calculated from tf-idf values and other factors such as news headlines and release time. To cluster similar news, the longest common subsequence and the Levenshtein distance are used to measure the similarity between news headlines, and this headline similarity is taken as the similarity between documents. Test results show that Geeking is fast and stable.
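
The last abstract names two string measures for comparing news headlines: the longest common subsequence and the Levenshtein distance. The sketch below illustrates those two measures only. It is not the Geeking implementation: the function names, the normalization by the longer headline length, and the equal `weight` used to combine the two scores are assumptions made for illustration, since the abstract does not say how the measures are combined.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def headline_similarity(a: str, b: str, weight: float = 0.5) -> float:
    """Hypothetical combined score in [0, 1]; the weighting is an assumption."""
    if not a and not b:
        return 1.0
    longest = max(len(a), len(b))
    lcs_score = lcs_length(a, b) / longest
    edit_score = 1.0 - levenshtein(a, b) / longest
    return weight * lcs_score + (1.0 - weight) * edit_score
```

Headlines whose score exceeds some chosen threshold could then be placed in the same cluster; the threshold would have to be tuned on real data.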

