Vector Space Model (VSM)
TF-IDF (term frequency–inverse document frequency): TF is the term frequency, and IDF is the inverse document frequency.
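As a compact reminder of how the two quantities combine, one common form of the weight is written out below. This is a sketch under stated assumptions: the symbols w_{i,j} and tf_{i,j} are notation introduced here, while d (the number of documents), df_i (the number of documents containing term i), and the base-10 logarithm lg follow the worked example in this post. Other TF-IDF variants (smoothed or normalized) exist and give different numbers.

```latex
% Weight of term i in document j (raw term frequency times base-10 IDF)
w_{i,j} = tf_{i,j} \times \lg\frac{d}{df_i}
```

A term that appears in every document has df_i = d, so its weight is tf_{i,j} × lg(1) = 0 and it contributes nothing to the comparison.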
For the rest of the theory, search for the keywords above and explore on your own.
II. A TF-IDF Worked Example
1. Problem
Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
Using TF-IDF vectorization, determine how similar the query Q is to each of the documents D1, D2, and D3.
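2. Analysis
This document collection contains three documents, so d = 3. The IDF of a term depends only on df_i, the number of documents that contain it:
A term found in one document (e.g. "silver"): lg(d/df_i) = lg(3/1) = 0.477
A term found in two documents (e.g. "gold", "truck"): lg(d/df_i) = lg(3/2) = 0.176
A term found in all three documents (e.g. "of", "in", "a"): lg(d/df_i) = lg(3/3) = 0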
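The analysis above can be reproduced with a short standard-library sketch. It is not the code from this post: the helper names (`tokenize`, `weights`, `cosine`) are made up for illustration, tokenization is assumed to be lowercasing plus whitespace splitting, the weight is assumed to be raw term frequency times lg(d/df_i), and cosine similarity is assumed as the similarity measure, since the problem statement does not name one.

```python
import math
from collections import Counter

# Documents and query from the problem statement.
docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"


def tokenize(text):
    # Assumption: lowercase and split on whitespace; no stemming or stop words.
    return text.lower().split()


# Raw term counts per document.
doc_tf = {name: Counter(tokenize(text)) for name, text in docs.items()}

d = len(docs)  # number of documents, d = 3
vocab = sorted(set().union(*doc_tf.values()))

# df[t]: number of documents containing term t; idf[t] = lg(d / df[t]).
df = {t: sum(1 for tf in doc_tf.values() if t in tf) for t in vocab}
idf = {t: math.log10(d / df[t]) for t in vocab}


def weights(tf):
    # TF-IDF vector over the shared vocabulary (assumed: raw tf * idf).
    return [tf[t] * idf[t] for t in vocab]


def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


q_vec = weights(Counter(tokenize(query)))
similarity = {name: cosine(q_vec, weights(tf)) for name, tf in doc_tf.items()}

# Print documents from most to least similar to the query.
for name, score in sorted(similarity.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```

Under these assumptions the ranking comes out as D2, then D3, then D1: D2 is the only document containing "silver", the rarest query term, and it contains it twice.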
3. Code
Here is the complete code:
import numpy as np
import pandas as pd
import math
#1.