Geeking: a Sports News Search Engine System Based on Champion List

Fund Project: National Natural Science Foundation of China (61433012, U1435215, 11171086); Natural Science Foundation of Hebei Province (F2013201064)

Abstract:

This paper introduces Geeking, a sports news search engine, and describes its architecture and functions. The system consists of four modules: web crawling, champion list construction, search processing, and the user interface. Its main functions are query correction, query auto-completion, search result ranking, clustering of similar news, keyword highlighting on the results page, and web page snapshots. When a user enters a query, the system auto-completes it from search logs and trending news keywords; if no relevant results are found, it corrects the query and recommends alternative query terms. During retrieval, the champion list is used to quickly locate the documents relevant to the query terms, and document relevance is computed by combining tf-idf weights with factors such as the news headline and release time; results are ranked by score. For clustering similar news, the longest common subsequence and the Levenshtein (edit) distance are combined to measure the similarity between news headlines, and headline similarity is taken to represent the similarity between news documents. Test results indicate that the functions of the champion-list-based Geeking system work well together and that retrieval responses are fast.
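
The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not give Geeking's actual formulas, so everything here is an assumption made for illustration: the function names `build_champion_lists` and `search`, the champion list size `r`, the log-scaled tf-idf weighting, the additive `title_boost`, and the exponential recency decay with `half_life_days`. The sketch only shows the general idea of a champion list: each term keeps its `r` highest-weighted documents, and a query is scored only against the union of those short lists.

```python
import math
from collections import Counter, defaultdict


def build_champion_lists(docs, r=2):
    """Keep, for every term, only the r documents with the highest tf-idf weight.

    docs maps a document id to its list of tokens.
    """
    n = len(docs)
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    postings = defaultdict(list)
    for doc_id, counts in tf.items():
        for term, count in counts.items():
            weight = (1 + math.log(count)) * math.log(n / df[term])
            postings[term].append((doc_id, weight))
    # Sort by descending weight; break ties by document id for determinism.
    return {
        term: sorted(plist, key=lambda p: (-p[1], p[0]))[:r]
        for term, plist in postings.items()
    }


def search(query_terms, champions, titles, age_days,
           title_boost=0.5, half_life_days=7.0):
    """Score only documents found in the champion lists of the query terms.

    title_boost and half_life_days are hypothetical tuning parameters.
    """
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in champions.get(term, []):
            scores[doc_id] += weight
    for doc_id in scores:
        # Assumed title factor: a fixed bonus if any query term is in the title.
        if any(term in titles[doc_id] for term in query_terms):
            scores[doc_id] += title_boost
        # Assumed time factor: the score halves every half_life_days.
        scores[doc_id] *= 0.5 ** (age_days[doc_id] / half_life_days)
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "d1": ["nba", "finals", "warriors", "win"],
    "d2": ["nba", "trade", "rumors"],
    "d3": ["football", "league", "finals"],
    "d4": ["nba", "nba", "draft"],
}
titles = {
    "d1": ["warriors", "win", "finals"],
    "d2": ["trade", "rumors"],
    "d3": ["league", "finals"],
    "d4": ["draft", "preview"],
}
age_days = {"d1": 0, "d2": 0, "d3": 14, "d4": 0}

champions = build_champion_lists(docs, r=2)
ranking = search(["nba", "finals"], champions, titles, age_days)
print(ranking)
```

In this toy data, `d2` contains the query term `nba` but is never scored, because it is not among the top two documents for that term. That is the trade-off of a champion list: less work per query at the cost of possibly missing lower-weighted matches.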

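The headline similarity used for clustering similar news can be sketched in the same spirit. The abstract states only that the longest common subsequence and the edit distance are combined; how they are combined is not given. The equal-weight average below (`alpha=0.5`), the normalization by the longer headline, the `threshold` value, and the greedy grouping in `cluster_titles` are therefore assumptions for illustration, not the system's documented method.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def title_similarity(a, b, alpha=0.5):
    """Assumed combination: weighted mean of two scores normalized to [0, 1]."""
    if not a and not b:
        return 1.0
    longest = max(len(a), len(b))
    lcs_score = lcs_length(a, b) / longest
    edit_score = 1.0 - edit_distance(a, b) / longest
    return alpha * lcs_score + (1 - alpha) * edit_score


def cluster_titles(titles, threshold=0.7):
    """Greedy grouping (an assumption): join the first cluster whose
    first headline is similar enough, otherwise start a new cluster."""
    clusters = []
    for title in titles:
        for cluster in clusters:
            if title_similarity(title, cluster[0]) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters


clusters = cluster_titles([
    "Warriors win NBA finals",
    "Warriors win the NBA finals",
    "Transfer window opens",
])
print(clusters)
```

Comparing headlines rather than full article bodies keeps each comparison cheap, which matches the abstract's choice to let headline similarity stand in for document similarity.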
Citing format
LIN Yujie, CHEN Xinquan, GAO Yan, et al. Geeking: a Sports News Search Engine System Based on Champion List[J]. Journal of Integration Technology, 2016, 5(2): 97-108

History
  • Online publication date: 2016-04-01