On Keras's validation_split and shuffle

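model.fit(X_train, y_train, epochs=2, verbose=1, validation_split=0.1, shuffle=True)

  • Keras's validation_split does not sample the validation set at random; it simply takes the tail of the data (the last 10% when validation_split=0.1).
  • Keras's shuffle is applied after validation_split, so it only reorders the training portion that remains.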

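Conclusion: when the samples are not evenly distributed across the dataset (for example, when they are sorted by label), shuffle the data manually first and only then use validation_split to carve out the validation set. Do not rely on Keras's built-in shuffle for this.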
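A minimal sketch of the manual shuffle recommended here, assuming the features and labels are NumPy arrays of equal length. The helper name `shuffle_in_unison`, the fixed seed, and the toy data are illustrative assumptions, not part of Keras.

```python
import numpy as np

def shuffle_in_unison(X, y, seed=0):
    """Hypothetical helper: shuffle X and y with the same random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    return X[perm], y[perm]

# Toy data sorted by label: 90 positives followed by 10 negatives.
X = np.arange(100, dtype=np.float32).reshape(-1, 1)
y = np.array([1] * 90 + [0] * 10)

X_shuf, y_shuf = shuffle_in_unison(X, y)

# The shuffled arrays can then be passed to fit, so that the tail slice
# taken by validation_split is no longer a single block of one class:
# model.fit(X_shuf, y_shuf, epochs=2, validation_split=0.1, shuffle=True)
```

Using one permutation for both arrays keeps every sample paired with its own label, which separate calls to a shuffle function would not guarantee.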

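I ran into this during a competition. I used validation_split to take the validation set without shuffling manually first, so the validation set consisted entirely of negative samples (negatives happened to make up about 10% of the data), and of course the results looked wonderful...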

Only then did I discover that shuffle runs after the split, and that validation_split does not sample at random.
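The failure described above can be reproduced on toy data without training anything. The sketch below assumes labels sorted so that all negatives come last; `tail_validation_labels` is a hypothetical helper that only imitates the tail slice, it is not a Keras function.

```python
from collections import Counter

def tail_validation_labels(labels, validation_split):
    """Hypothetical helper: labels a tail split would put in the validation set."""
    split_at = int(len(labels) * (1.0 - validation_split))
    return labels[split_at:]

# Toy dataset sorted by label: 900 positives, then 100 negatives.
labels = [1] * 900 + [0] * 100

val_labels = tail_validation_labels(labels, validation_split=0.1)

print("all data:  ", Counter(labels))
print("validation:", Counter(val_labels))  # only the negative class shows up
```

Comparing the label counts of the full data with those of the tail slice is a cheap check to run before trusting a validation score.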

The docstring for validation_split in the source code

validation_split: Float between 0 and 1.
Fraction of the training data to be used as validation data.
The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.
The validation data is selected from the last samples in the x and y data provided, before shuffling.

Source code implementing validation_split

elif validation_split and 0. < validation_split < 1.:
    if any(K.is_tensor(t) for t in x):
        raise ValueError(
            'If your data is in the form of symbolic tensors, '
            'you cannot use `validation_split`.')
    do_validation = True
    if hasattr(x[0], 'shape'):
        split_at = int(int(x[0].shape[0]) * (1. - validation_split))
    else:
        split_at = int(len(x[0]) * (1. - validation_split))
    x, val_x = (slice_arrays(x, 0, split_at),
                slice_arrays(x, split_at))
    y, val_y = (slice_arrays(y, 0, split_at),
                slice_arrays(y, split_at))
    sample_weights, val_sample_weights = (
        slice_arrays(sample_weights, 0, split_at),
        slice_arrays(sample_weights, split_at))
    if self._uses_dynamic_learning_phase():
        val_ins = val_x + val_y + val_sample_weights + [0.]
    else:
        val_ins = val_x + val_y + val_sample_weights
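The order of operations in the snippet above (tail split first, shuffling only afterwards) can be imitated with plain Python lists. This is a simplified sketch, not Keras's actual training loop: `split_then_shuffle`, the seed, and the per-epoch reshuffle of a copied list are assumptions made for illustration.

```python
import random

def split_then_shuffle(samples, validation_split, epochs, seed=0):
    """Sketch: take the tail as validation data, then shuffle only the rest."""
    split_at = int(len(samples) * (1.0 - validation_split))
    train, val = samples[:split_at], samples[split_at:]

    rng = random.Random(seed)
    epoch_orders = []
    for _ in range(epochs):
        order = train[:]    # copy so the original order is untouched
        rng.shuffle(order)  # shuffling never reaches the validation tail
        epoch_orders.append(order)
    return epoch_orders, val

samples = list(range(20))
epoch_orders, val = split_then_shuffle(samples, validation_split=0.2, epochs=3)

print("validation:", val)
for i, order in enumerate(epoch_orders):
    print("epoch", i, "training order:", order)
```

However the training part is reordered, the validation samples are always the same last block of the input, which is why shuffling has to happen before the data is handed to fit.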

Reference: Keras source code