model.fit(X_train, y_train, epochs=2, verbose=1, validation_split=0.1, shuffle=True)
- Keras's validation_split does not sample the validation set at random; it simply takes the last 10% of the data (when validation_split=0.1).
- Keras's shuffle is applied after the validation_split has been taken.
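Conclusion: when the samples are not evenly distributed, shuffle the data manually first and only then use validation_split to take the validation set. Do not rely on Keras's built-in shuffle for this.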
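One way to shuffle by hand before calling fit is to apply the same random permutation to the features and the labels. The sketch below uses NumPy only; shuffle_together and its seed parameter are illustrative names, not part of Keras, and it assumes the inputs are NumPy arrays of equal length.

```python
import numpy as np


def shuffle_together(x, y, seed=0):
    """Return x and y reordered by one shared random permutation (hypothetical helper)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    # Indexing both arrays with the same order keeps each sample paired with its label.
    return x[order], y[order]


# Toy data (an assumption for illustration): 90 positives followed by 10 negatives.
X_demo = np.arange(100).reshape(100, 1)
y_demo = np.array([1] * 90 + [0] * 10)
X_shuffled, y_shuffled = shuffle_together(X_demo, y_demo, seed=42)
```

The shuffled arrays can then be passed to model.fit together with validation_split, so that the tail slice Keras holds out is a random sample rather than whatever happened to be stored last.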
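I ran into this during a competition. I used validation_split to take the validation set without shuffling manually first, and the samples used for validation turned out to be all negative (negative samples happened to make up about 10% of the data), so of course the results looked great...
Only afterwards did I find out that shuffle runs after the split, and that validation_split does not sample at random.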
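The effect is easy to reproduce without training anything. The sketch below builds a toy label array sorted by class (900 positives followed by 100 negatives, an assumption chosen to resemble the situation described above) and counts the labels that fall into the tail slice. validation_tail is a hypothetical helper that mimics the slicing; it is not a Keras function.

```python
import numpy as np


def validation_tail(y, validation_split):
    """Return the labels that a tail split would hold out for validation."""
    split_at = int(len(y) * (1.0 - validation_split))
    return y[split_at:]


# Labels sorted by class: 900 positives, then 100 negatives.
y_sorted = np.array([1] * 900 + [0] * 100)
val_labels = validation_tail(y_sorted, validation_split=0.1)

# Count how many samples of each class landed in the validation set.
negatives_in_val = int((val_labels == 0).sum())
positives_in_val = int((val_labels == 1).sum())
print(negatives_in_val, positives_in_val)  # 100 0
```

In this toy case the training portion contains no negative samples at all, so the validation metrics say little about performance on the real class mix.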
The docstring for validation_split in the source code:
validation_split: Float between 0 and 1.
Fraction of the training data to be used as validation data.
The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.
The validation data is selected from the last samples in the x and y data provided, before shuffling.
Source code implementing validation_split:
elif validation_split and 0. < validation_split < 1.:
    if any(K.is_tensor(t) for t in x):
        raise ValueError(
            'If your data is in the form of symbolic tensors, '
            'you cannot use `validation_split`.')
    do_validation = True
    if hasattr(x[0], 'shape'):
        split_at = int(int(x[0].shape[0]) * (1. - validation_split))
    else:
        split_at = int(len(x[0]) * (1. - validation_split))
    x, val_x = (slice_arrays(x, 0, split_at),
                slice_arrays(x, split_at))
    y, val_y = (slice_arrays(y, 0, split_at),
                slice_arrays(y, split_at))
    sample_weights, val_sample_weights = (
        slice_arrays(sample_weights, 0, split_at),
        slice_arrays(sample_weights, split_at))
    if self._uses_dynamic_learning_phase():
        val_ins = val_x + val_y + val_sample_weights + [0.]
    else:
        val_ins = val_x + val_y + val_sample_weights
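To see what the slicing above amounts to, here is a simplified NumPy sketch of the same arithmetic for a single input array. It is an illustration only: split_tail is a made-up name, and it leaves out the sample weights, the tensor check, and the list-of-arrays handling that the real code performs.

```python
import numpy as np


def split_tail(x, y, validation_split):
    """Split at int(n * (1 - validation_split)); the tail becomes the validation set."""
    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be strictly between 0 and 1")
    split_at = int(len(x) * (1.0 - validation_split))
    # No shuffling happens here: both parts keep the original sample order.
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


x_all = np.arange(10)
y_all = np.arange(10) * 10
(x_tr, y_tr), (x_val, y_val) = split_tail(x_all, y_all, validation_split=0.1)
print(x_tr.tolist())   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(x_val.tolist())  # [9]
```

Because split_at is computed from the array length alone, the held-out samples are always the contiguous block at the end, whatever order the data was stored in.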
Reference: Keras source code