
[Python] 2. Daum Movie site data collection by web crawling (data preprocessing and visualization: bar chart, dot (distribution) plot)

by ITyranno 2023. 12. 11.

 

 

 

 

 

 

 

 

Let's explore the world of programming.

 

 

 

 

 

 

 

 

For Part 1, please see the previous post.

 

2023.12.08 - [IT/Python] - [Python] 1. Daum Movie site data collection by web crawling

 


 

 

 

 

 

 

 

<  Daum Movie site data collection by web crawling (data preprocessing and visualization)  >

 

 

 

 

 

 

Collected data

Movie title, rating, comment

 

Data to generate

Positive/negative

 

 

 

 

 

URL

 

https://movie.daum.net

 


 

 

 

 

 

Importing libraries

### Import libraries
# - Library for tabular (row/column) data
import pandas as pd

 

 

 

 

 

<  Reading the external file  >

file_path = "./data/movie_reviews.txt"
df_org = pd.read_csv(file_path,
                     ### Tell pandas the delimiter (the file is tab-separated)
                     delimiter="\t",
                     names=["title", "score", "comment", "label"])
df_org

### Columns: title -> "title", rating -> "score", review -> "comment", positive/negative -> "label"
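To see what delimiter="\t" and names=[...] do without the real file, here is a minimal sketch that parses a tab-separated string held in memory. The two sample rows and the 1/0 label values are made up for illustration; they are not taken from the collected data.

```python
import io

import pandas as pd

# Hypothetical sample: two tab-separated rows with no header line
sample = "Movie A\t9\tgreat\t1\nMovie B\t3\tboring\t0\n"

# names= supplies the column names because the text itself has no header row
df_sample = pd.read_csv(io.StringIO(sample),
                        delimiter="\t",
                        names=["title", "score", "comment", "label"])

print(df_sample.shape)           # (number of rows, number of columns)
print(list(df_sample.columns))   # the names passed in above
```

Without names=, pandas would treat the first review as the header row, so that review would be lost from the data.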

 

 

 

 

 

 

 

 

Checking the data info: missing values

### Check the data info: the non-null count per column reveals missing values
df_org.info()
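info() shows missing values only indirectly, through the non-null counts. If an explicit count per column is easier to read, something like the following sketch could be used. The small frame here is hypothetical; on the real data the same call would be df_org.isna().sum().

```python
import pandas as pd

# Hypothetical frame with one missing comment
df_demo = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "score": [9, 3, 7],
    "comment": ["great", None, "fine"],
})

# isna() marks missing cells as True; sum() counts the True values per column
missing_per_column = df_demo.isna().sum()
print(missing_per_column)
```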

 

 

 

 

 

 

 

 

 

Checking basic statistics

### Check basic statistics (describe() summarizes the numeric columns)
df_org.describe()

 

 

 

 

 

 

 

 

 

Checking the rating (score) distribution

### Count how many reviews gave each rating (score)
df_org["score"].value_counts()

 

 

 

 

 

 

 

 

 

Checking the positive/negative distribution

### Count the positive/negative labels
df_org["label"].value_counts()
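When the balance between positive and negative reviews matters more than the raw counts, value_counts() can also return proportions. A small sketch, assuming the labels are stored as 1 and 0 (the actual encoding comes from Part 1 and may differ):

```python
import pandas as pd

# Hypothetical labels: 1 = positive, 0 = negative (an assumption for this demo)
labels = pd.Series([1, 1, 0, 1, 0], name="label")

counts = labels.value_counts()                 # absolute counts, most frequent first
shares = labels.value_counts(normalize=True)   # the same values as fractions of the total

print(counts)
print(shares)
```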

 

 

 

 

 

 

 

Checking for duplicate rows

### Check for duplicate rows
# - keep=False : flags every copy of a duplicated row (True if the row has a duplicate, False otherwise)
df_org[df_org.duplicated(keep=False)]

 

 

 

 

 

 

 

 

Removing duplicates

### Extract the duplicate rows
# - The first copy of each row is left out; only the later copies are selected
df_del = df_org[df_org.duplicated()]
len(df_del)

 

### Remove the duplicates
df_new = df_org.drop_duplicates()
len(df_new)
df_new.info()
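The difference between duplicated(keep=False), duplicated() and drop_duplicates() is easiest to see on a tiny frame. The three rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the first two rows are identical
df_demo = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B"],
    "score": [9, 9, 3],
})

all_copies = df_demo.duplicated(keep=False)  # True for every row that has a twin
extra_copies = df_demo.duplicated()          # True only for the second and later copies
deduped = df_demo.drop_duplicates()          # keeps the first copy of each row

print(all_copies.tolist())
print(extra_copies.tolist())
print(len(deduped))
```

The number of rows flagged by duplicated() plus the number of rows left after drop_duplicates() should equal the original row count, which is a quick sanity check on the real data as well.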

 

 

 

 

 

 

 

 

 

<  Exploring the data  >

 

 

Extracting only the movie titles

### Extract only the movie titles
df_new["title"].unique()

 

 

 

 

 

 

Checking the number of reviews per movie title

### Count the reviews for each movie title
df_new["title"].value_counts()

 

 

 

 

 

 

 

 

 

 

Checking basic rating statistics for each movie

### Check basic rating statistics for each movie
# - Group by movie title and aggregate the ratings
movie_info = df_new.groupby("title")["score"].describe()

### Sort the rows in descending order of review count
movie_info = movie_info.sort_values(by=["count"], axis=0, ascending=False)
movie_info
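describe() returns eight statistics per movie. If only a few of them are needed, agg() with a list of names is one possible alternative. A minimal sketch on made-up ratings:

```python
import pandas as pd

# Hypothetical ratings for two movies
df_demo = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B", "Movie A"],
    "score": [10, 8, 4, 6],
})

# One row per title, with only the requested statistics as columns
summary = (df_demo.groupby("title")["score"]
           .agg(["count", "mean"])
           .sort_values(by="count", ascending=False))
print(summary)
```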

 

 

 

 

 

 

 

 

 

<  Data visualization  >

 

 

 

Libraries

### Visualization library
import matplotlib.pyplot as plt

### Font configuration helpers
from matplotlib import font_manager, rc

 

 

 

Font settings

### Set a font with Korean glyphs (Malgun Gothic ships with Windows)
plt.rc("font", family="Malgun Gothic")

### Keep the minus sign from rendering as a broken glyph
plt.rcParams["axes.unicode_minus"] = False
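Malgun Gothic is a Windows font, so the setting above may not take effect on other systems. One way to handle that is to choose the font name by operating system. This is only a sketch: pick_korean_font is a hypothetical helper, and the macOS and Linux font names are assumptions (NanumGothic in particular usually has to be installed separately).

```python
import platform


def pick_korean_font():
    """Return a font family name likely to contain Hangul glyphs on this OS."""
    system = platform.system()
    if system == "Windows":
        return "Malgun Gothic"   # the font used in this post
    if system == "Darwin":       # macOS
        return "AppleGothic"     # assumption: present on a default macOS install
    return "NanumGothic"         # assumption: installed separately on Linux


font_name = pick_korean_font()
print(font_name)
```

The result could then be passed on in place of the fixed name, as in plt.rc("font", family=font_name).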

 

 

 

 

<  Visualizing the average rating per movie  >

The movie with the highest average rating is drawn in orange,
and all the others in lightgrey.

### numpy is used by the average-rating plots that follow
import numpy as np

 

 

 

 

Getting the movie titles as a list

### Get the movie titles as a list
"""
 - array() : the array type used in numpy (similar to a Python list)
           : however, it stores elements of a single type only.
             Indexing and iteration otherwise work much like a Python list.
"""
# - unique() : returns the distinct values as a numpy array
# - tolist() : converts the array to a Python list
movie_title = df_new["title"].unique().tolist()
# movie_title

### Compute the average rating per movie
# - Dictionary that will hold the average ratings
avg_score = {}

for m_title in movie_title:
    ### Compute the average rating of this movie
    avg = df_new[df_new["title"] == m_title]["score"].mean()
    # print(f"average rating : {avg}")

    ### Store it in the dictionary
    # key is the title, value is the average rating
    avg_score[m_title] = avg

print(f"Final dictionary : {avg_score}")
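The loop above can also be written as a single groupby. The sketch below uses made-up ratings; sort=False is passed so that the titles keep their order of first appearance, which matches the order unique() produces in the loop version.

```python
import pandas as pd

# Hypothetical ratings; "Movie B" appears first on purpose
df_demo = pd.DataFrame({
    "title": ["Movie B", "Movie A", "Movie A"],
    "score": [4, 10, 6],
})

# Mean score per title, converted to a {title: average} dictionary
avg_demo = df_demo.groupby("title", sort=False)["score"].mean().to_dict()
print(avg_demo)
```

Without sort=False, groupby orders the titles alphabetically, so the bars in a chart built from the dictionary would appear in a different order.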

 

 

 

 

 

 

 

Drawing the bar chart

### Visualize the average rating per movie
# - Set the figure width and height
plt.figure(figsize=(20, 20))

# - Figure title
plt.title("Average rating per movie (bar chart)", fontsize=17, fontweight="bold")

### Highest average rating, computed once before the loop
max_avg = max(avg_score.values())

### Draw one bar per movie
for k, v in avg_score.items():
    ### Pick the color: orange for the highest average, lightgrey otherwise
    color = "orange" if v == max_avg else "lightgrey"
    # print(color)

    # Draw the bar
    plt.bar(k, v, color=color)

    ### Show the average rating as text above the bar
    # - "%.2f"%v : the value to display (two decimal places)
    plt.text(k, v, "%.2f"%v, horizontalalignment="center",
                             verticalalignment="bottom")

### Label the x and y axes
plt.xlabel("Movie title", fontweight="bold")
plt.ylabel("Average rating", fontweight="bold")

### Rotate the x-axis tick labels
plt.xticks(rotation=75)

### Save the graph as an image
# - savefig() must be called before plt.show().
# - The ./img directory must already exist.
plt.savefig("./img/average_rating_per_movie_bar_chart.png")

plt.show()
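The per-bar color choice can also be prepared up front as a list, one color per movie. A minimal sketch with made-up averages (avg_demo stands in for avg_score):

```python
# Hypothetical averages standing in for avg_score
avg_demo = {"Movie A": 8.2, "Movie B": 6.5, "Movie C": 7.9}

# Highest average, computed once
best = max(avg_demo.values())

# One color per movie, in the same order as the dictionary
colors = ["orange" if v == best else "lightgrey" for v in avg_demo.values()]
print(colors)
```

Because plt.bar accepts a list of colors, such a list would allow all bars to be drawn in one call, for example plt.bar(list(avg_demo.keys()), list(avg_demo.values()), color=colors).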

 

 

 

 

 

 

 

 

 

 

 

 

Drawing the dot (distribution) plots

Plot the rating distribution of each movie
 - Ten small graphs are placed inside one large figure
  --> Each of these small graphs is called a subplot.

 

 

 

### Plot the rating distribution of each movie
# - Ten small graphs are placed inside one large figure
#  --> Each of these small graphs is called a subplot.
# - Create a 5-row, 2-column grid of subplots
# - subplots(number of rows, number of columns, figsize=overall figure size)
# - fig : the large figure
# - axs : the 5x2 array of axes (one per subplot)
fig, axs = plt.subplots(5, 2, figsize=(15, 25))

### Do this first when filling several subplots in a for loop
# - flatten() : turns the 5x2 array of axes into a 1-D array of 10 axes
axs = axs.flatten()

### Put each movie's graph into its subplot
# - zip() stops at the shortest input, so at most 10 movies are drawn
for title, avg, ax in zip(avg_score.keys(), avg_score.values(), axs):
    # print(f"{title} / {avg} / {ax}")

    ### x-axis: review index, y-axis: the rating of each review
    ### Count the reviews of this movie
    num_reviews = len(df_new[df_new["title"] == title])
    # - arange(num) : creates the sequence 0, 1, ..., num-1
    x = np.arange(num_reviews)
    # print(f"x ------> {x}")

    ### Ratings go on the y-axis
    y = df_new[df_new["title"] == title]["score"]
    # print(f"y--------- {y}")

    ### Give each subplot a title
    subtitle = f"{title} ({num_reviews} reviews)"
    ax.set_title(subtitle, fontsize=15, fontweight="bold")

    ### Draw the dots
    # - "o" : marker symbol that draws points
    ax.plot(x, y, "o")

    ### Show each movie's average rating as a red dashed line
    # - axhline() : draws a horizontal line in the subplot
    ax.axhline(avg, color="red", linestyle="--")


plt.savefig("./img/rating_distribution_per_movie.png")

plt.show()
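Three details of the loop above are easy to check without drawing anything: what flatten() does to a 5x2 grid, which values arange() produces, and how zip() behaves when there are more movies than subplots. The sketch uses a plain integer array as a stand-in for the array of axes; the 12 titles are made up.

```python
import numpy as np

# Stand-in for the 5x2 array of axes returned by plt.subplots(5, 2)
grid = np.arange(10).reshape(5, 2)
flat = grid.flatten()            # 1-D, read row by row
print(flat.shape)

# arange(4) yields 0, 1, 2, 3: the end value itself is excluded
x = np.arange(4)
print(x.tolist())

# zip() stops at the shortest input, so 12 titles and 10 slots give 10 pairs
titles = [f"Movie {i}" for i in range(12)]
pairs = list(zip(titles, flat))
print(len(pairs))
```

If the data contained more than 10 movies, the extra ones would simply be left out of the figure, so the grid size would need to grow with the number of titles.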

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
