Question about building a co-occurrence matrix

Asked 6 days ago, Updated 6 days ago, 1 views

Hello, everyone. I have a question about text analysis. After finishing the NLP preprocessing, I want to build a co-occurrence matrix from how often words appear near each other. I used the code below. It used to work fine, but it suddenly stopped working, so I am posting this question. Thank you for taking a look.

import collections
import pandas as pd
import numpy as np


def co_occurrence(sentences, window_size):
    d = collections.defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use a tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of this sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1

    # build a symmetric DataFrame from the pair counts
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df


df = pd.read_csv('data.csv', encoding = 'utf-8')

# The data file is uploaded here: http://naver.me/x1eYJPQ2

df['nlp'] = df["nlp"].str.replace("'", "") 
df['nlp'] = df["nlp"].str.replace(",", "") 
df['nlp'] = df["nlp"].str.replace("・", "")
df['nlp'] = df["nlp"].str.replace("[", "") 
df['nlp'] = df["nlp"].str.replace("]", "") 
corpus = df.corpus.tolist()


df = co_occurrence(corpus, 3)

df.to_csv('co_occurrence.csv', encoding = 'utf-8')
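
For checking the counting logic on something much smaller than the full file, here is a minimal sketch of the same windowed pair counting without pandas. The name count_pairs and the two toy sentences are made up for illustration and are not part of the code above.

```python
from collections import Counter


def count_pairs(sentences, window_size):
    """Count unordered word pairs that appear within `window_size` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            # Look only forward, so each pair in a window is counted once.
            for neighbour in tokens[i + 1 : i + 1 + window_size]:
                counts[tuple(sorted((token, neighbour)))] += 1
    return counts


# Toy corpus (hypothetical): with a window of 1, only adjacent words are paired.
pairs = count_pairs(["a b c", "a b"], window_size=1)
```

If the counts on a toy input like this look right, the problem is more likely in how the file is read than in the function itself.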


2022-09-20 15:49

1 Answer

I somehow got it to run. I made two changes.

Change the encoding:

# df = pd.read_csv('data2.csv', encoding = 'utf-8')
df = pd.read_csv('data2.csv', encoding = 'euc-kr')
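
If it is not known in advance which encoding a file uses, one option is to try a short list of candidates in order. This is only a sketch: guess_encoding is a hypothetical helper, and the candidate list is an assumption based on the two encodings mentioned here. Decoding without an error does not prove the encoding is right, so the strict one (utf-8) is tried first.

```python
def guess_encoding(raw, candidates=("utf-8", "euc-kr")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError(f"none of {candidates} can decode the data")


# Korean text saved as EUC-KR is not valid UTF-8, so the fallback is chosen.
sample = "nlp\n단어 분석\n".encode("euc-kr")
detected = guess_encoding(sample)
```

For a real file, the bytes could be read with open('data2.csv', 'rb') and the result passed to pd.read_csv through its encoding argument.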

Replace the column name that does not exist (corpus) with the one that does (nlp):

# corpus = df.corpus.tolist()
corpus = df.nlp.tolist()
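
A mistake like this is easier to spot if the column name is checked before it is used. The following is a minimal sketch; require_column is a hypothetical helper, not a pandas function.

```python
def require_column(columns, name):
    """Return `name` if it is among `columns`, otherwise raise a descriptive KeyError."""
    if name not in columns:
        raise KeyError(f"column {name!r} not found; available: {list(columns)}")
    return name


# Hypothetical column list standing in for df.columns.
available = ["id", "nlp"]
checked = require_column(available, "nlp")
```

With the DataFrame above, this could be used as corpus = df[require_column(df.columns, "nlp")].tolist(), so that a wrong name fails with a message listing the columns that do exist.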


2022-09-20 15:49


