
A slacker's study diary, unorganized notes, read at your own risk!

Tools:

spaCy: official site https://spacy.io/

Released in 2014, billed as industrial-strength.

Tokenization, POS tagging, syntactic parsing, named entity recognition, plus you can download GloVe pre-trained word vectors (what a nice tool, I should install it again some day. I installed it once before, but I didn't understand word vectors back then, and its named entity recognition didn't seem accurate enough, so I gave up on it).

NLTK: more academic and stable; this is the one I'm currently stuck with.

Its features are roughly the same as spaCy's, but I don't know whether it can be hooked up to word vectors.

Word vectors:

word2vec: includes CBOW and skip-gram. For details see the paper "word2vec Parameter Learning Explained". It explains things well, though I still only half understand the derivations... Google provides pre-trained word vector models, but downloading them requires getting over the wall, which I have no way to do.

TensorFlow has an example implementation of word2vec, but that example is not very efficient.
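For reference, a minimal sketch of training word2vec locally with gensim (the same library this post uses later for the dictionary and corpora); my_tokenized_texts is a hypothetical list of token lists, and the vector_size parameter name is from gensim 4.x (older versions call it size):

from gensim.models import Word2Vec

# my_tokenized_texts: hypothetical list of token lists, e.g. the output of step 2 below
model = Word2Vec(
    sentences=my_tokenized_texts,
    vector_size=100,   # dimensionality of the word vectors ('size' in gensim < 4.0)
    window=5,          # context window size
    min_count=5,       # ignore words that appear fewer than 5 times
    sg=1,              # 1 = skip-gram, 0 = CBOW
)
print(model.wv.most_similar('happy', topn=5))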

GloVe: said to be even better than word2vec, and pre-trained word vector models can be downloaded online. It is also said to train faster than word2vec. I haven't tried it yet, so I can't vouch for any of this...
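Since the pre-trained GloVe files are plain text (one word per line followed by its floats), they can be loaded without any special tooling; a rough sketch, with a hypothetical file path:

import numpy as np

def load_glove(path='F:/corpus/glove/glove.6B.100d.txt'):   # hypothetical path
    vectors = dict()
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove()
print(glove['happy'][:5])   # first 5 dimensions of the vector for 'happy'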

Question: word vectors can represent words, but... then what?

Deep neural networks:

Not for now; I'll run through text classification first and then get into them.

Start with this: http://blog.csdn.net/kevindelily/article/details/52679125?locationNum=15


Text classification consists of eight steps:

1. Open the file and read in the data

2. Preprocess the text: tokenization, lemmatization, case conversion

3. Build the dictionary

4. Represent each document as a vector

5. Convert the word-count corpus into a TF-IDF corpus; this way each word is no longer just an ID but carries some semantic weight

6. Reduce dimensionality by converting to an LSI model

7. Convert to a matrix that sklearn can use

8. Classify; here I use Linear Discriminant Analysis

Below is the code, step by step:

1. Open the file:

The data file is in CSV format; the code is as follows:

import csv
import copy

def read_from_csv(filename='F:/corpus/mypersonality_final/mypersonality_final.csv'):
    record = MyPersonality()   # simple record class holding one author's data
    data = list()
    flag = 0
    with open(filename, 'r', encoding='utf-8', errors='ignore') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            authID = row['#AUTHID']
            if record.authID != authID:
                if flag == 1:
                    data.append(copy.copy(record))
                flag = 1
                record.authID = row['#AUTHID']
                record.status = row['STATUS']
                record.sExt = row['sEXT']
                record.sNeu = row['sNEU']
                record.sAgr = row['sAGR']
                record.sCon = row['sCON']
                record.sOpn = row['sOPN']
                record.cExt = 1 if row['cEXT'] == 'y' else 0
                record.cNeu = 1 if row['cNEU'] == 'y' else 0
                record.cAgr = 1 if row['cAGR'] == 'y' else 0
                record.cCon = 1 if row['cCON'] == 'y' else 0
                record.cOpn = 1 if row['cOPN'] == 'y' else 0
            else:
                record.status = record.status + '\n' + row['STATUS']
        data.append(record)
    print("From", filename, "successfully read data: ", len(data))
    return data
The key statements are:

Open the file: open(filename), here inside a with block

Read rows as dictionaries: reader = csv.DictReader(csv_file), which lets you access columns by their header names

Because the file is opened with a with statement, it is closed automatically when the block ends; an explicit csv_file.close() is not needed.

2. Preprocess the text

import nltk

def preprocess_word(data, w_filename='F:/corpus/mypersonality_final/processed_word.txt'):
    # Sentence splitter
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    # Word tokenizer
    word_tokenizer = nltk.WordPunctTokenizer()
    # Lemmatizer (NLTK's lemmatization is not that great)
    lemmatizer = nltk.WordNetLemmatizer()
    # Stop-word list (mostly punctuation)
    stop_list = set('! * ( )  + - = < > , . ...  \\ " : \' ? \n'.split(' '))
    # Collect the raw texts
    texts = list()
    for record in data:
        texts.append(record.status)
    # Each author's posts are processed separately: split into sentences,
    # tokenize, lemmatize, lowercase, and remove stop words
    preprocessed_words = list()
    for text in texts:
        sentences = sent_tokenizer.tokenize(text)
        status_list = list()
        for sentence in sentences:
            tokens = word_tokenizer.tokenize(sentence)
            for token in tokens:
                word = lemmatizer.lemmatize(token).lower()
                if word not in stop_list:
                    status_list.append(word)
        preprocessed_words.append(status_list)
    # Write the result to a file
    with open(w_filename, 'w') as w_file:
        for text in preprocessed_words:
            for word in text:
                w_file.write(word + ' ')
            w_file.write('\n')
    print('successfully processed: ', len(preprocessed_words), ' records')
    return preprocessed_words

Key statements:

Sentence splitting: sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

sentences = sent_tokenizer.tokenize(text)

Tokenization: word_tokenizer = nltk.WordPunctTokenizer()

words = word_tokenizer.tokenize(sentence)

Lemmatization: lemmatizer = nltk.WordNetLemmatizer()

word = lemmatizer.lemmatize(word)

Also, stop words are usually removed; instead of a hand-written list, NLTK's built-in stop-word list can be used, as in the sketch below.
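A small sketch of using NLTK's standard English stop-word list in place of the punctuation list hard-coded above (the stopwords corpus needs to be downloaded once):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')                 # one-time download
stop_list = set(stopwords.words('english'))
words = ['this', 'is', 'a', 'happy', 'status']
print([w for w in words if w not in stop_list])   # -> ['happy', 'status']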

3. Build the dictionary

import gensim

# Build the dictionary
def build_dictionary(processed_word, dict_filename='F:/corpus/mypersonality_final/mypersonality_dict.dict'):
    dictionary = gensim.corpora.Dictionary(processed_word)
    # Drop words that appear in fewer than 5 documents, and words that appear in more
    # than 10% of all documents (don't ask why these two numbers, they were picked at random)
    dictionary.filter_extremes(no_below=5, no_above=0.1)
    dictionary.save(dict_filename)
    return dictionary
Key statements:

dictionary = gensim.corpora.Dictionary(processed_word)

dictionary.filter_extremes(no_below=5, no_above=0.1)
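To sanity-check the no_below / no_above thresholds, the dictionary can be inspected before and after filtering; a small sketch:

dictionary = gensim.corpora.Dictionary(processed_word)
print('vocabulary size before filtering:', len(dictionary))
dictionary.filter_extremes(no_below=5, no_above=0.1)
print('vocabulary size after filtering: ', len(dictionary))
# dictionary.dfs maps token id -> number of documents containing that token
print('sample document frequencies:', list(dictionary.dfs.items())[:5])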

4. Represent each document as a vector: here, as word counts (bag of words)

# Convert to the bag-of-words (word count) model
corpus_bow = list()
for text in processed_word:
    bow = dictionary.doc2bow(document=text)
    corpus_bow.append(bow)
print('corpus bow is done')
# Save the hard-won bag-of-words corpus
corpus_bow_filename = 'F:/corpus/mypersonality_final/mypersonality_corpus_bow.mm'
gensim.corpora.MmCorpus.serialize(corpus_bow_filename, corpus_bow)
Key statements:

dictionary.doc2bow(document=text)

Saving: gensim.corpora.MmCorpus.serialize(corpus_bow_filename, corpus_bow)
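Because the corpus is serialized (and the dictionary was saved in step 3), both can be loaded back later without redoing the preprocessing; a small sketch:

dictionary = gensim.corpora.Dictionary.load('F:/corpus/mypersonality_final/mypersonality_dict.dict')
corpus_bow = gensim.corpora.MmCorpus('F:/corpus/mypersonality_final/mypersonality_corpus_bow.mm')
print(len(corpus_bow), 'documents,', len(dictionary), 'terms')
print(corpus_bow[0][:5])   # first few (token_id, count) pairs of document 0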

5. Convert the word-count corpus into a TF-IDF corpus

import os

# Add semantic weighting: convert to TF-IDF
corpus_tfidf_filename = 'F:/corpus/mypersonality_final/mypersonality_corpus_tfidf.mm'
if not os.path.exists(corpus_tfidf_filename):
    # Build the TF-IDF model
    tfidf_model = gensim.models.TfidfModel(corpus=corpus_bow, dictionary=dictionary)
    # Apply the TF-IDF model to the whole corpus
    corpus_tfidf = [tfidf_model[doc] for doc in corpus_bow]
    gensim.corpora.MmCorpus.serialize(corpus_tfidf_filename, corpus_tfidf)
else:
    corpus_tfidf = gensim.corpora.MmCorpus(corpus_tfidf_filename)
print('corpus tfidf is done, length is: ', len(corpus_tfidf))
Key statements:

tfidf_model = gensim.models.TfidfModel(corpus=corpus_bow, dictionary=dictionary)

corpus_tfidf = [tfidf_model[doc] for doc in corpus_bow]

plus the save, as before.

6. Dimensionality reduction: convert to LSI (latent semantic indexing), which uses singular value decomposition to map the high-dimensional vectors into a lower-dimensional space

# Reduce dimensionality: convert to the LSI model
corpus_lsi_filename = 'F:/corpus/mypersonality_final/mypersonality_corpus_lsi.mm'
if not os.path.exists(corpus_lsi_filename):
    # Build the LSI model; num_topics is the dimensionality of each document afterwards
    lsi_model = gensim.models.LsiModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=200)
    # Apply the LSI model to the whole corpus
    corpus_lsi = [lsi_model[doc] for doc in corpus_tfidf]
    gensim.corpora.MmCorpus.serialize(corpus_lsi_filename, corpus_lsi)
else:
    corpus_lsi = gensim.corpora.MmCorpus(corpus_lsi_filename)
print('corpus lsi is done, length is: ', len(corpus_lsi))

Key statement:

lsi_model = gensim.models.LsiModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=200)
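To get a feel for what the 200 LSI dimensions actually capture, the freshly built model can print its strongest topics; a small sketch (only applicable in the branch where lsi_model was just trained):

# Show the top 5 LSI 'topics' (singular vectors) with their highest-weighted words
for topic in lsi_model.print_topics(num_topics=5, num_words=8):
    print(topic)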
7. Convert to a matrix that sklearn can use

import scipy.sparse

# Convert the gensim corpus to a dense array via a CSR matrix
def to_array(corpus):
    cols = list()
    rows = list()
    data = list()
    line_count = 0
    for line in corpus:
        for elem in line:
            rows.append(line_count)
            cols.append(elem[0])
            data.append(elem[1])
        line_count += 1
    lsi_csr_matrix = scipy.sparse.csr_matrix((data, (rows, cols))).toarray()
    return lsi_csr_matrix
Key statement: lsi_csr_matrix = scipy.sparse.csr_matrix((data, (rows, cols))).toarray()

Just that one line. Happy? (gensim also has a helper that does the same conversion; see the sketch below.)
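For what it's worth, a sketch of that helper, gensim.matutils.corpus2csc, which returns a terms x documents sparse matrix, hence the transpose:

from gensim import matutils

# corpus2csc gives a (num_terms x num_docs) sparse matrix; transpose to docs x terms
lsi_matrix = matutils.corpus2csc(corpus_lsi, num_terms=200).T.toarray()
print(lsi_matrix.shape)   # (num_documents, 200)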

8. Classification

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Split into training and test sets (roughly 80/20)
rarray = np.random.random(size=len(csr_lsi_matrix))
train_set = list()
train_tag = list()
test_set = list()
test_tag = list()
for i in range(len(csr_lsi_matrix)):
    if rarray[i] < 0.8:
        train_set.append(csr_lsi_matrix[i, :])
        train_tag.append(data[i].cExt)
    else:
        test_set.append(csr_lsi_matrix[i, :])
        test_tag.append(data[i].cExt)
# Train a linear discriminant analysis classifier and evaluate on the test set
lda = LinearDiscriminantAnalysis(solver='svd', store_covariance=True)
lda_res = lda.fit(train_set, train_tag)
test_pred = lda_res.predict(test_set)
test_error = 0
for i in range(len(test_tag)):
    if test_pred[i] != test_tag[i]:
        test_error += 1
print('test_error = ', float(test_error) / float(len(test_tag)))

There are still plenty of problems here, for example no ten-fold cross-validation, and only the error rate is computed, with no F-score; I'll fix that later (a rough sketch of both is at the end of this post).

Key statements:

lda = LinearDiscriminantAnalysis(solver='svd', store_covariance=True)

lda_res = lda.fit(train_set, train_tag)

test_pred = lda_res.predict(test_set)

The final results of this program are pretty poor; it is only good for practice, and there is still a lot of room for improvement.
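As a first improvement on the evaluation, here is a rough sketch of ten-fold cross-validation with an F1 score using sklearn, assuming the csr_lsi_matrix and cExt labels from step 8:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X = np.asarray(csr_lsi_matrix)
y = np.asarray([record.cExt for record in data])

lda = LinearDiscriminantAnalysis(solver='svd')
accuracy = cross_val_score(lda, X, y, cv=10, scoring='accuracy')
f1 = cross_val_score(lda, X, y, cv=10, scoring='f1')
print('mean accuracy:', accuracy.mean())
print('mean F1:      ', f1.mean())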