Google Developers Japan: Introduction to TensorFlow Datasets and Estimators



Wednesday, October 18, 2017

This article is a translation, with additions, of the Google Developers Blog post "Introduction to TensorFlow Datasets and Estimators" by the TensorFlow team. See the original post for details.

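Dataset: A new API for creating input pipelines (the part of a program that reads in data).

Estimator: A high-level API for creating TensorFlow models. It provides pre-made models for common machine learning tasks, and you can also create your own custom models.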

Sample model

The model classifies Iris flowers using measurements of their sepals and petals.

From left to right: Iris setosa (by Radomil, CC BY-SA 3.0), Iris versicolor (by Dlanglois, CC BY-SA 3.0), and Iris virginica (by Frank Mayfield, CC BY-SA 2.0).

The feature values are float32 numbers.

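Overview of Datasets

Dataset: The base class, containing methods to create and transform datasets. It also lets you initialize a dataset from in-memory data or from a Python generator.

TextLineDataset: Reads lines from text files.

TFRecordDataset: Reads records from TFRecord files.

FixedLengthRecordDataset: Reads fixed-size records from binary files.

Iterator: Provides access to the elements of a dataset one at a time.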
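To make the roles of these classes more concrete, here is a minimal plain-Python sketch (no TensorFlow involved) of the two reading styles, line-oriented and fixed-length, and of consuming elements one at a time through an iterator. The helper names read_lines and read_fixed_length and the sample data are hypothetical; they only loosely mirror what TextLineDataset and FixedLengthRecordDataset do.

```python
import io
import struct


def read_lines(text_file):
    """Yield one line at a time, without its trailing newline."""
    for line in text_file:
        yield line.rstrip("\n")


def read_fixed_length(binary_file, record_bytes):
    """Yield consecutive records of exactly record_bytes bytes."""
    while True:
        record = binary_file.read(record_bytes)
        if len(record) < record_bytes:
            return  # End of file (a trailing partial record is dropped).
        yield record


# In-memory stand-ins for real files (hypothetical sample data).
text_source = io.StringIO("5.1,3.5,1.4,0.2,0\n7.0,3.2,4.7,1.4,1\n")
binary_source = io.BytesIO(struct.pack("<4f", 5.1, 3.5, 1.4, 0.2) * 2)

line_iter = read_lines(text_source)
first_line = next(line_iter)   # Access the elements one at a time.
second_line = next(line_iter)

# Each record holds four little-endian float32 values (16 bytes).
records = list(read_fixed_length(binary_source, record_bytes=16))
first_record = struct.unpack("<4f", records[0])
```

In both cases the reader is lazy: nothing is loaded until the consumer asks for the next element, which is the same idea an Iterator expresses for a Dataset.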

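The dataset used in the sample code

The label (the species of Iris) is encoded as an integer:

Iris setosa is 0

Iris versicolor is 1

Iris virginica is 2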

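Defining the dataset

    feature_names = [
        'SepalLength',
        'SepalWidth',
        'PetalLength',
        'PetalWidth']

    def input_fn():
        ...<code>...
        return ({ 'SepalLength':[values], ..<etc>.., 'PetalWidth':[values] },
                [IrisFlowerType])

The input function returns two elements:

The first element is a dict in which each input feature name is a key, mapped to the list of that feature's values for the training batch.

The second element is the list of labels for the training batch.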
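As a concrete illustration of that return structure, the following plain-Python sketch builds a (features, labels) pair for a batch of three flowers. The function name toy_input_fn and the sample values are made up for illustration, and a real input function would return tensors rather than Python lists.

```python
feature_names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']


def toy_input_fn():
    """Return (features, labels) for one hypothetical batch of three flowers."""
    rows = [
        # SepalLength, SepalWidth, PetalLength, PetalWidth, label
        (5.1, 3.5, 1.4, 0.2, 0),
        (7.0, 3.2, 4.7, 1.4, 1),
        (6.3, 3.3, 6.0, 2.5, 2),
    ]
    # First element: dict mapping each feature name to its column of values.
    features = {
        name: [row[i] for row in rows]
        for i, name in enumerate(feature_names)
    }
    # Second element: one label per example, in the same order.
    labels = [row[-1] for row in rows]
    return features, labels


features, labels = toy_input_fn()

# Every feature list has the same length as the label list (the batch size).
batch_size = len(labels)
all_same_length = all(len(v) == batch_size for v in features.values())
```

Because features and labels describe the same batch, position i in every list refers to the same flower.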

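The input_fn used in this article takes the following arguments:

file_path: The data file to read.

perform_shuffle: Whether to shuffle the data.

repeat_count: The number of times to iterate over the rows of the dataset. For example, 1 gives a dataset that reads each row exactly once, while None gives a dataset that keeps reading the rows over and over without end.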
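The effect of repeat_count can be sketched in plain Python. This is only an analogy built on itertools, not how the Dataset API is implemented, and the helper name repeat_rows is hypothetical.

```python
import itertools


def repeat_rows(rows, repeat_count=1):
    """Iterate over rows repeat_count times; None means repeat forever."""
    if repeat_count is None:
        return itertools.cycle(rows)
    return itertools.chain.from_iterable(
        itertools.repeat(rows, repeat_count))


rows = ["row-a", "row-b"]

once = list(repeat_rows(rows, 1))
twice = list(repeat_rows(rows, 2))
# An endless stream has to be cut off explicitly by the consumer.
endless_prefix = list(itertools.islice(repeat_rows(rows, None), 5))
```

With None, the stream never ends on its own, so whoever consumes it decides when to stop.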

    import tensorflow as tf

    # feature_names is the list of feature names defined earlier.
    def my_input_fn(file_path, perform_shuffle=False, repeat_count=1):
        def decode_csv(line):
            parsed_line = tf.decode_csv(line, [[0.], [0.], [0.], [0.], [0]])
            label = parsed_line[-1:]  # Last element is the label
            del parsed_line[-1]  # Delete last element
            features = parsed_line  # Everything (but last element) are the features
            d = dict(zip(feature_names, features)), label
            return d

        dataset = (tf.contrib.data.TextLineDataset(file_path)  # Read text file
                   .skip(1)  # Skip header row
                   .map(decode_csv))  # Transform each elem by applying decode_csv fn
        if perform_shuffle:
            # Randomizes input using a window of 256 elements (read into memory)
            dataset = dataset.shuffle(buffer_size=256)
        dataset = dataset.repeat(repeat_count)  # Repeats dataset this # times
        dataset = dataset.batch(32)  # Batch size to use
        iterator = dataset.make_one_shot_iterator()
        batch_features, batch_labels = iterator.get_next()
        return batch_features, batch_labels

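TextLineDataset: The Dataset API takes care of much of the memory management needed when working with file-based datasets. For example, you can read datasets that are far larger than memory, or read multiple files by passing a list as the argument.

shuffle: Reads buffer_size elements and shuffles their order.

map: Calls the decode_csv function with each element of the dataset as its argument. Because this example uses TextLineDataset, each element is one line of CSV text.

decode_csv: Splits each line into fields, supplying default values where needed, and then returns a dict of field keys and values together with the label. The map function above replaces each element (line) of the dataset with this result.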
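The stages described above can be mimicked in plain Python to see what each one does to the data. This is a conceptual sketch only: the function names are hypothetical, the CSV text is made-up sample data, and the shuffle is a simple buffer-based shuffle that approximates, rather than reproduces, the behavior of the Dataset shuffle method.

```python
import random

feature_names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']


def decode_csv_line(line):
    """Split one CSV line into a (features dict, label) pair."""
    fields = line.split(",")
    # Empty feature fields fall back to a default of 0.0.
    values = [float(f) if f else 0.0 for f in fields[:-1]]
    label = int(fields[-1]) if fields[-1] else 0
    return dict(zip(feature_names, values)), label


def shuffle_with_buffer(elements, buffer_size, rng):
    """Yield elements in random order using a bounded buffer."""
    buffer = []
    for element in elements:
        buffer.append(element)
        if len(buffer) > buffer_size:
            # Emit a random element once the buffer is over capacity.
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:  # Drain what is left at the end of the input.
        yield buffer.pop(rng.randrange(len(buffer)))


def batch(elements, batch_size):
    """Group consecutive elements into lists of at most batch_size."""
    current = []
    for element in elements:
        current.append(element)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


csv_text = """SepalLength,SepalWidth,PetalLength,PetalWidth,Species
5.1,3.5,1.4,0.2,0
7.0,3.2,4.7,1.4,1
6.3,3.3,6.0,2.5,2
4.9,,1.4,0.2,0"""

lines = csv_text.splitlines()[1:]                      # like skip(1)
decoded = (decode_csv_line(line) for line in lines)    # like map(decode_csv)
shuffled = shuffle_with_buffer(decoded, 2, random.Random(0))
batches = list(batch(shuffled, 3))                     # like batch(3)
```

Note that a small buffer only mixes elements that are close together in the input, which is why the buffer size matters for how thoroughly the data is shuffled.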

    next_batch = my_input_fn(FILE, True)  # Will return 32 random elements

    # Now let's try it out, retrieving and printing one batch of data.
    # Although this code looks strange, you don't need to understand
    # the details.
    with tf.Session() as sess:
        first_batch = sess.run(next_batch)
    print(first_batch)

    # Output
    ({'SepalLength': array([ 5.4000001, ...<repeat to 32 elems>], dtype=float32),
      'PetalWidth': array([ 0.40000001, ...<repeat to 32 elems>], dtype=float32),
      ...
     },
     [array([[2], ...<repeat to 32 elems>], dtype=int32)]  # Labels
    )

Overview of Estimators

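Use a pre-made Estimator - These are Estimators defined in advance to generate a specific type of model. This article shows an example that uses one of them, DNNClassifier.

Use the Estimator base class - This lets you customize how the model is created by supplying a model_fn function. This approach will be covered on another occasion.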

    # Create the feature_columns, which specifies the input to our model.
    # All our input features are numeric, so use numeric_column for each one.
    feature_columns = [tf.feature_column.numeric_column(k) for k in feature_names]

    # Create a deep neural network classifier.
    # Use the DNNClassifier pre-made estimator
    classifier = tf.estimator.DNNClassifier(
        feature_columns=feature_columns,  # The input features to our model
        hidden_units=[10, 10],  # Two layers, each with 10 neurons
        n_classes=3,
        model_dir=PATH)  # Path to where checkpoints etc are stored

Training the model

    # Train our model, using the previously defined function my_input_fn
    # Input to training is a file with training examples
    # Stop training after 8 iterations of train data (epochs)
    classifier.train(
        input_fn=lambda: my_input_fn(FILE_TRAIN, True, 8))

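The expression lambda: my_input_fn(FILE_TRAIN, True, 8) is what connects the Dataset to the Estimator. The train method expects an input_fn that can be called without arguments, so the lambda wraps my_input_fn and supplies the file_path, shuffle setting, and repeat_count:

FILE_TRAIN: The path of the training data file.

True: Shuffle the data.

8: Repeat the dataset 8 times.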
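The reason for the lambda can be shown without TensorFlow. In the sketch below, run_with_input_fn is a hypothetical stand-in for a method such as train that calls its input_fn with no arguments, my_input_fn is a dummy that only reports the arguments it received, and the file name is invented. functools.partial is shown as an equivalent alternative to lambda.

```python
import functools


def my_input_fn(file_path, perform_shuffle=False, repeat_count=1):
    """Dummy input function: just report the arguments it was called with."""
    return {"file_path": file_path,
            "perform_shuffle": perform_shuffle,
            "repeat_count": repeat_count}


def run_with_input_fn(input_fn):
    """Stand-in for a caller that invokes input_fn with no arguments."""
    return input_fn()


FILE_TRAIN = "iris_training.csv"  # Hypothetical path, for illustration only.

# Both wrappers bind the arguments now and defer the call until later.
via_lambda = run_with_input_fn(lambda: my_input_fn(FILE_TRAIN, True, 8))
via_partial = run_with_input_fn(
    functools.partial(my_input_fn, FILE_TRAIN, True, 8))
```

Either form works; the point is that the caller receives a zero-argument callable and never needs to know which file or settings were chosen.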

Evaluating the trained model

    # Evaluate our model using the examples contained in FILE_TEST
    # Return value will contain evaluation_metrics such as: loss & average_loss
    evaluate_result = classifier.evaluate(
        input_fn=lambda: my_input_fn(FILE_TEST, False, 4))
    print("Evaluation results")
    for key in evaluate_result:
        print("   {}, was: {}".format(key, evaluate_result[key]))

The model state is stored under the model_dir=PATH that was specified when the DNNClassifier was created.

Making predictions with the trained model

    # Let's predict the examples in FILE_TEST, repeat only once.
    predict_results = classifier.predict(
        input_fn=lambda: my_input_fn(FILE_TEST, False, 1))
    print("Predictions on test file")
    for prediction in predict_results:
        # Will print the predicted class, i.e: 0, 1, or 2 if the prediction
        # is Iris Setosa, Versicolor, Virginica, respectively.
        print(prediction["class_ids"][0])

Making predictions on in-memory data

The previous example read its data from FILE_TEST. To predict from data held in memory, the predict call itself stays the same and only the input function changes:

    # Let's create a memory dataset for prediction.
    # We've taken the first 3 examples in FILE_TEST.
    prediction_input = [[5.9, 3.0, 4.2, 1.5],  # -> 1, Iris Versicolor
                        [6.9, 3.1, 5.4, 2.1],  # -> 2, Iris Virginica
                        [5.1, 3.3, 1.7, 0.5]]  # -> 0, Iris Setosa

    def new_input_fn():
        def decode(x):
            x = tf.split(x, 4)  # Need to split into our 4 features
            # When predicting, we don't need (or have) any labels
            return dict(zip(feature_names, x))  # Then build a dict from them

        # The from_tensor_slices function will use a memory structure as input
        dataset = tf.contrib.data.Dataset.from_tensor_slices(prediction_input)
        dataset = dataset.map(decode)
        iterator = dataset.make_one_shot_iterator()
        next_feature_batch = iterator.get_next()
        return next_feature_batch, None  # In prediction, we have no labels

    # Predict all our prediction_input
    predict_results = classifier.predict(input_fn=new_input_fn)

    # Print results
    print("Predictions on memory data")
    for idx, prediction in enumerate(predict_results):
        class_id = prediction["class_ids"][0]  # Get the predicted class (index)
        if class_id == 0:
            print("I think: {}, is Iris Setosa".format(prediction_input[idx]))
        elif class_id == 1:
            print("I think: {}, is Iris Versicolor".format(prediction_input[idx]))
        else:
            print("I think: {}, is Iris Virginica".format(prediction_input[idx]))

Here Dataset.from_tensor_slices() takes the place of TextLineDataset as the data source.

Visualizing with TensorBoard

    # Replace PATH with the actual path passed as model_dir argument when the
    # DNNClassifier estimator was created.
    tensorboard --logdir=PATH

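Summary

The source code used in this article is publicly available.

A Jupyter notebook by Josh Gordon explains how to use these APIs. With it you can learn how to run a larger example that includes many different types of input features, which this article does not cover (this article used only numeric features).

For Datasets, see the new chapter in the Programmer's Guide and the reference documentation.

For Estimators, see the new chapter in the Programmer's Guide and the reference documentation.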

Kaz Sato - Staff Developer Advocate, Google Cloud