Recommender Systems | TensorFlow Recommenders Review & Hands-On

TensorFlow Recommenders, TFRS?!

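The core concept of TFRS, Google's recommender library, is a two-stage pipeline: retrieval and ranking. Computing the similarity between every user and every one of hundreds or thousands of items in a short time is impractical and inefficient. TFRS therefore first narrows the full catalog down to a set of candidate items the user is likely to enjoy (the retrieval stage), and then orders those candidates to produce the final recommendations (the ranking stage).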

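Retrieval: the candidate-selection step; its goal is to efficiently discard items the user is not interested in
- Must be computationally efficient because it may have to handle millions of candidates (ScaNN)
- Scores are computed as the inner product of the user embedding and the item embedding
- Uses implicit feedback data (preferences the user expresses indirectly, e.g. viewing history or purchase history), evaluated with FactorizedTopK metrics

Retrieval two-tower model (query tower & candidate tower)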
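
To make the inner-product scoring concrete, here is a minimal pure-Python sketch of what a two-tower retrieval step does at query time. The IDs and embedding values are invented for illustration, and a real system would use learned embeddings and an efficient index rather than a full sort.

```python
# Toy embeddings: IDs and values are invented for illustration only.
user_embeddings = {"U1": [1.0, 0.0], "U2": [0.0, 1.0]}
item_embeddings = {"R1": [0.9, 0.1], "R2": [0.2, 0.8], "R3": [0.5, 0.5]}


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_id, k=2):
    """Return the k item IDs with the highest dot-product score for a user."""
    query = user_embeddings[user_id]
    scores = {item: dot(query, emb) for item, emb in item_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


top_u1 = retrieve("U1")  # scores: R1=0.9, R3=0.5, R2=0.2
top_u2 = retrieve("U2")  # scores: R2=0.8, R3=0.5, R1=0.1
```

Because the score is just a dot product, item embeddings can be computed once and stored, which is what makes the retrieval stage cheap compared with ranking.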

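Ranking: the step that orders the candidates selected by retrieval
- Because it works only on the already-narrowed candidate set, computational efficiency matters less, so more complex layers and more flexible architectures can be used
- Uses explicit feedback data (preferences the user states directly, e.g. movie ratings), evaluated with an RMSE metric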
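
As a rough illustration of the ranking stage, the sketch below computes RMSE between true and predicted ratings and then orders a few candidates by predicted rating. The candidate IDs and rating values are hypothetical, and the predictions stand in for the output of a ranking model that is not shown here.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length rating lists."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical candidates that survived retrieval, with made-up ratings.
candidates = ["R1", "R2", "R3"]
true_ratings = [4.0, 3.0, 5.0]
predicted_ratings = [3.5, 3.0, 4.0]

error = rmse(true_ratings, predicted_ratings)

# Rank the candidates by predicted rating, highest first.
pairs = sorted(zip(candidates, predicted_ratings), key=lambda p: p[1], reverse=True)
ranked = [candidate for candidate, _ in pairs]
```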

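Hands-On with TFRS

Ease of use is one of the main design goals of TFRS, so a recommender system can be built (and even deployed) with little code. The walkthrough below covers the details.

TFRS can be extended in many ways: you can add extra user or item information (images, text, and so on) as features, or stack more layers to build deeper models. All of this is covered in the tutorials.

(The TFRS tutorials also come with the MovieLens data, loaded through tensorflow-datasets, so it is easy to run a quick test.)

In this walkthrough, we build a system that recommends suitable job postings based on users' past job application history.

The data and its description are available from the DACON Kookmin University AI Analysis Competition. Since the application history is implicit feedback, we recommend job postings using only a retrieval model.

First, import the required libraries.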

!pip install -q --upgrade tensorflow-recommenders tensorflow-datasets

import os
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import time
import datetime
from typing import Dict, Text  # used by the type hints in compute_loss

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

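The competition provides a lot of additional information, but here we build the retrieval model from the users' application history alone (which resume applied to which posting).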

# `path` is the directory that holds the competition CSV files.
apply_train_df = pd.read_csv(path+'apply_train.csv') # 57946
resume = pd.read_csv(path+'resume.csv') # 8482/ 8482
resume_certificate = pd.read_csv(path+'resume_certificate.csv') #12975/ 5976
resume_education = pd.read_csv(path+'resume_education.csv') # 8482/ 8482
resume_language = pd.read_csv(path+'resume_language.csv') # 869/ 820
recruitment = pd.read_csv(path+'recruitment.csv') # 6695/ 6695
company = pd.read_csv(path+'company.csv') # 2377

apply_train_df.head(3)
# 	resume_seq	recruitment_seq
# 0	U05833	R03838
# 1	U06456	R02144

After some light preprocessing, the DataFrame has to be converted into a tf.data dataset of tensors.

apply_train_df['resume_seq'] = apply_train_df['resume_seq'].astype(str)
apply_train_df['recruitment_seq'] = apply_train_df['recruitment_seq'].astype(str)

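# Build a tf.data.Dataset of (resume, posting) pairs from the DataFrame.
apply_train_tf = tf.data.Dataset.from_tensor_slices(
    dict(apply_train_df[['resume_seq', 'recruitment_seq']]))

# Rename the features to the names used in the rest of the walkthrough.
ratings = apply_train_tf.map(lambda x: {
    "movie_id": x["recruitment_seq"],
    "user_id": x["resume_seq"],
})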

movies = tf.data.Dataset.from_tensor_slices(dict(apply_train_df[['recruitment_seq']]))
movies = movies.map(lambda x : {'movie_id': x["recruitment_seq"]})
movies = movies.map(lambda x :  x["movie_id"])

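# The MovieLens tutorial buckets timestamps at this point. `ratings` here only
# has `user_id` and `movie_id`, so the step is left commented out (it would
# raise a KeyError on "timestamp") and is not used by the model below.
# timestamps = np.concatenate(list(ratings.map(lambda x: x["timestamp"]).batch(100)))
# max_timestamp = timestamps.max()
# min_timestamp = timestamps.min()
# timestamp_buckets = np.linspace(
#     min_timestamp, max_timestamp, num=1000,
# )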

unique_movie_titles = np.unique(np.concatenate(list(movies.batch(1000))))
unique_user_ids = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["user_id"]))))
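
The conversion above can be pictured without TensorFlow. The sketch below mimics, in plain Python, how a dict of columns is sliced into one feature dict per row and how the ID vocabularies are then collected. The helper `slice_rows` and the toy IDs are hypothetical; this illustrates the idea behind `from_tensor_slices` and is not its implementation.

```python
# Toy column-oriented data, shaped like dict(apply_train_df[[...]]).
columns = {
    "resume_seq": ["U1", "U2", "U1"],
    "recruitment_seq": ["R7", "R3", "R3"],
}


def slice_rows(cols):
    """Yield one {column: value} dict per row (hypothetical helper)."""
    keys = list(cols)
    for values in zip(*(cols[key] for key in keys)):
        yield dict(zip(keys, values))


# Rename the features, as the `ratings` mapping does.
records = [
    {"user_id": row["resume_seq"], "movie_id": row["recruitment_seq"]}
    for row in slice_rows(columns)
]

# Sorted unique IDs, analogous to np.unique over the batched dataset.
unique_user_ids = sorted({r["user_id"] for r in records})
unique_item_ids = sorted({r["movie_id"] for r in records})
```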

Build the query tower model.

embedding_dimension = 32 # higher values may be more accurate, but are slower and more prone to overfitting
user_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_user_ids, mask_token=None),
  # We add an additional embedding to account for unknown tokens.
  tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
])
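
The `+ 1` in the embedding size is easier to see with a small stand-in for the lookup layer. The sketch below assumes the behavior of `StringLookup` with `mask_token=None` and a single out-of-vocabulary bucket: index 0 is reserved for unknown tokens and known tokens map to 1..N. The helper names and IDs are hypothetical.

```python
def build_lookup(vocabulary):
    """Map each known token to 1..N, keeping index 0 for unknown tokens."""
    return {token: i + 1 for i, token in enumerate(vocabulary)}


def lookup(table, token):
    """Return the token's index, or 0 if it is not in the vocabulary."""
    return table.get(token, 0)


vocabulary = ["U001", "U002", "U003"]  # toy IDs for illustration
table = build_lookup(vocabulary)

# One embedding row per known ID plus one row for the unknown bucket.
embedding_rows = len(vocabulary) + 1

indices = [lookup(table, t) for t in ["U001", "U003", "U999"]]
```

A user ID that was never seen in training (here `"U999"`) still gets a valid row, so inference does not fail on new users, although the embedding for that row carries no user-specific information.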

Build the candidate tower model in the same way.

movie_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_movie_titles, mask_token=None),
  tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension)
])

Because this is implicit feedback data, we evaluate training with FactorizedTopK metrics.

We also set up the task layer, which returns the loss.

task = tfrs.tasks.Retrieval(
  metrics=tfrs.metrics.FactorizedTopK(
    candidates=movies.batch(128).map(movie_model)
  )
)
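
The top-k metrics reported during training can be read as "how often is the true item among the k highest-scoring candidates". The sketch below is a simplified pure-Python version of that idea with made-up scores; it is not the TFRS implementation, which scores against the full candidate dataset in batches.

```python
def top_k_accuracy(score_rows, true_indices, k):
    """Fraction of queries whose true candidate is among the k best scores."""
    hits = 0
    for scores, true_idx in zip(score_rows, true_indices):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if true_idx in order[:k]:
            hits += 1
    return hits / len(true_indices)


# One row of candidate scores per query (values invented for illustration).
score_rows = [
    [0.9, 0.1, 0.4],  # true candidate 0 is ranked 1st
    [0.3, 0.2, 0.8],  # true candidate 0 is ranked 2nd
    [0.5, 0.1, 0.7],  # true candidate 1 is ranked 3rd
]
true_indices = [0, 0, 1]

top_1 = top_k_accuracy(score_rows, true_indices, k=1)
top_2 = top_k_accuracy(score_rows, true_indices, k=2)
top_3 = top_k_accuracy(score_rows, true_indices, k=3)
```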

Now we combine everything above into the final model and train it.

class RetrievalModel(tfrs.Model):

  def __init__(self, user_model, movie_model):
    super().__init__()
    self.movie_model: tf.keras.Model = movie_model
    self.user_model: tf.keras.Model = user_model
    self.task: tf.keras.layers.Layer = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    user_embeddings = self.user_model(features["user_id"])
    movie_embeddings = self.movie_model(features["movie_id"])

    return self.task(user_embeddings, movie_embeddings)

model = RetrievalModel(user_model, movie_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

tf.random.set_seed(1025)
train = apply_train_tf.map(lambda x: {
    "movie_id": x["recruitment_seq"],
    "user_id": x["resume_seq"]
}).shuffle(400_000, seed=1025).batch(1000).cache()

model.fit(train, epochs=100)
# Epoch 1/100 58/58 [==============================] - 2s 27ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_5_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_10_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_50_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_100_categorical_accuracy: 0.0000e+00 - loss: 6884.5233 - regularization_loss: 0.0000e+00 - total_loss: 6884.5233 
# Epoch 2/100 58/58 [==============================] - 1s 22ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_5_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_10_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_50_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_100_categorical_accuracy: 0.0000e+00 - loss: 6415.6956 - regularization_loss: 0.0000e+00 - total_loss: 6415.6956
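
The size of the loss in the log is easier to interpret with a small sketch. It assumes that the default loss of `tfrs.tasks.Retrieval` is a softmax cross-entropy over the candidates in the same batch, summed over the batch (this is my understanding of the library default, not something stated in this post). Under that assumption, untrained embeddings give roughly ln(1000) ≈ 6.9 per example, or about 6,900 for a batch of 1,000, which is in line with the first epoch above.

```python
import math


def in_batch_softmax_loss(query_embs, candidate_embs):
    """Summed cross-entropy where row i's positive is candidate i."""
    total = 0.0
    for i, query in enumerate(query_embs):
        logits = [sum(a * b for a, b in zip(query, cand)) for cand in candidate_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total


# Toy embeddings for illustration only.
aligned = in_batch_softmax_loss([[2.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 2.0]])
misaligned = in_batch_softmax_loss([[2.0, 0.0], [0.0, 2.0]], [[0.0, 2.0], [2.0, 0.0]])

# All-zero embeddings give uniform scores: the loss is n * ln(n).
n = 4
uniform = in_batch_softmax_loss([[0.0, 0.0]] * n, [[0.0, 0.0]] * n)
```

The loss is small when each query scores its own candidate highest and large when it prefers another row's candidate, which is what pushes matching user and posting embeddings together during training.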

Let's make recommendations with the trained model.

index = tfrs.layers.factorized_top_k.BruteForce(model.user_model, k=5)
# Index every candidate ID together with its embedding from the candidate tower.
index.index_from_dataset(
    tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(model.movie_model)))
)

# Get recommendations.
_, titles = index(tf.constant(resume.resume_seq.values))

preds = titles.numpy().astype(str)
preds[1]
# array(['R00395', 'R00395', 'R00395', 'R00395', 'R00395'], dtype='<U6')
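
The result above lists the same posting five times. A likely cause is that the candidate dataset holds one entry per application row, so a popular posting appears many times in the index and each copy gets the same score. The pure-Python sketch below reproduces the effect with made-up IDs and scores and shows that deduplicating the candidates first removes it. In the TensorFlow pipeline, one option would be to build the candidates from `unique_movie_titles` instead.

```python
def top_k(scores, candidate_ids, k):
    """Return the k candidate IDs with the highest scores (ties keep order)."""
    order = sorted(range(len(candidate_ids)), key=lambda i: scores[i], reverse=True)
    return [candidate_ids[i] for i in order[:k]]


# One ID per application row, so a popular posting repeats (toy data).
candidate_ids = ["R1", "R1", "R1", "R2", "R3"]
item_score = {"R1": 0.9, "R2": 0.5, "R3": 0.1}  # made-up scores for one user

scores = [item_score[c] for c in candidate_ids]
with_duplicates = top_k(scores, candidate_ids, k=3)

# Deduplicate while preserving first-seen order, then score again.
unique_ids = list(dict.fromkeys(candidate_ids))
unique_scores = [item_score[c] for c in unique_ids]
deduplicated = top_k(unique_scores, unique_ids, k=3)
```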

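That's it for this short walkthrough. The tutorials are very well organized, so adding user or item features or stacking more layers is easy to try. The introductory YouTube video from the Google team is also easy to follow, so I recommend it : )

References

Intro video (YouTube): https://www.youtube.com/watch?v=jz0-satrmrA

Tutorials: https://www.tensorflow.org/recommenders?hl=ko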

github : https://github.com/tensorflow/recommenders

