Implementing Logistic Regression with TensorFlow

Table of Contents
  1. Problem statement
  2. Data preprocessing
  3. Building the computation graph
  4. Running the computation graph
  5. Model evaluation
  6. Visualizing the predictions
  7. Bonus features

Problem statement

Implement logistic regression trained with mini-batch gradient descent, using TensorFlow.
Dataset: moons dataset
— From "Hands-On Machine Learning with Scikit-Learn and TensorFlow", Chapter 9, Exercise 12
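
As a quick refresher, logistic regression estimates the probability that a sample belongs to the positive class and thresholds that probability at 0.5:

$$\hat{p} = \sigma(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}, \qquad \hat{y} = \begin{cases} 1 & \text{if } \hat{p} \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$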


*Bonus features:

• Define the computation graph inside a logistic_regression() function so it can be reused
• Save checkpoints at regular intervals during training, and save the final model when training finishes
• If training is interrupted, restore from the latest checkpoint
• Define the graph using name scopes
• Add summaries and visualize the learning curves in TensorBoard
• Tune the hyperparameters (e.g. learning rate, batch size) and inspect the learning curves


Data preprocessing

First, load the dataset.

from sklearn.datasets import make_moons

m = 1000  # number of samples
# load the dataset
X_moons, y_moons = make_moons(m, noise=0.1, random_state=42)

Next, visualize the dataset to get an intuitive feel for it.

# visualize the data
import matplotlib.pyplot as plt

# y_moons == 1 selects the indices of the positive samples
plt.plot(X_moons[y_moons == 1, 0], X_moons[y_moons == 1, 1], 'go', label='Positive')
plt.plot(X_moons[y_moons == 0, 0], X_moons[y_moons == 0, 1], 'r^', label='Negative')
plt.legend()
plt.show()

moons dataset

Prepend a bias term (a column of ones) as feature 0 of every sample.

import numpy as np

# add the bias term
X_moons_with_bias = np.c_[np.ones((m, 1)), X_moons]

The labels need to be reshaped from (m,) to (m, 1).

# reshape the labels from 1-D to 2-D
y_moons_column_vector = y_moons.reshape(-1, 1)

Split the dataset into a training set and a test set with an 80/20 ratio.

# fraction of the dataset held out as the test set
test_ratio = 0.2

# number of test samples
test_size = int(m * test_ratio)

# training set
X_train = X_moons_with_bias[:-test_size]
y_train = y_moons_column_vector[:-test_size]

# test set
X_test = X_moons_with_bias[-test_size:]
y_test = y_moons_column_vector[-test_size:]
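
For reference, the same 80/20 split could also be obtained with scikit-learn's train_test_split. This is only a sketch of the alternative; it reshuffles the data, so the resulting split differs from the manual split used in the rest of this post:

from sklearn.model_selection import train_test_split

# alternative split (illustration only; the post keeps the manual split above)
X_train_alt, X_test_alt, y_train_alt, y_test_alt = train_test_split(
    X_moons_with_bias, y_moons_column_vector, test_size=0.2, random_state=42)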

Define a helper function that samples a random mini-batch; it will come in handy during training.

def random_batch(X_train, y_train, batch_size):
    ''' Sample a random mini-batch.

    :param X_train: full training set samples
    :param y_train: full training set labels
    :param batch_size: size of each batch
    :return: a batch of samples and labels
    '''
    rnd_indices = np.random.randint(0, len(X_train), size=batch_size)
    X_batch = X_train[rnd_indices]
    y_batch = y_train[rnd_indices]
    return X_batch, y_batch
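
Note that the indices are drawn with replacement, so an "epoch" below is only an approximate pass over the training set. A quick sanity check of the helper (shapes only):

X_batch, y_batch = random_batch(X_train, y_train, batch_size=5)
print(X_batch.shape, y_batch.shape)  # (5, 3) (5, 1)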


Building the computation graph

The moons dataset has only 2 features.

# number of features
n_inputs = 2

Build the computation graph.

import tensorflow as tf

# input data
X = tf.placeholder(tf.float32, shape=(None, n_inputs + 1), name='X')
y = tf.placeholder(tf.float32, shape=(None, 1), name='y')

# weight initialization
theta = tf.Variable(tf.random_uniform([n_inputs + 1, 1], -1.0, 1.0, seed=42), name='theta')

# compute the logits
logits = tf.matmul(X, theta, name='logits')

Two ways to compute the sigmoid function:

# option 1 (from the definition)
y_proba = 1 / (1 + tf.exp(-logits))

# option 2 (built-in function)
y_proba = tf.sigmoid(logits)

Compute the logistic regression loss:

# option 1 (from the definition)
epsilon = 1e-7  # keep the logarithms away from log(0)
loss = -tf.reduce_mean(y * tf.log(y_proba + epsilon) + (1 - y) * tf.log(1 - y_proba + epsilon), name='loss')

# option 2 (built-in function)
loss = tf.losses.log_loss(y, y_proba, epsilon=epsilon)
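
Both versions compute the same cross-entropy (log) loss; the small epsilon is there only for numerical stability:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{p}^{(i)} + \epsilon\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{p}^{(i)} + \epsilon\right) \right]$$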

Define the learning rate, the gradient descent optimizer, and the variable-initialization node.

learning_rate = 0.01  # learning rate

# gradient descent optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
# training node
training_op = optimizer.minimize(loss)

# variable initialization node
init = tf.global_variables_initializer()
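
Each run of training_op performs one plain gradient descent update of the parameters, with learning rate $\eta = 0.01$:

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} J(\theta)$$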

Training hyperparameters:

# number of training epochs (i.e. passes over the whole dataset)
n_epochs = 1000

# number of samples per mini-batch
batch_size = 50

# number of batch updates that make up one epoch
n_batches = int(np.ceil(m / batch_size))


Running the computation graph

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            # fetch a mini-batch
            X_batch, y_batch = random_batch(X_train, y_train, batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        loss_val = loss.eval(feed_dict={X: X_test, y: y_test})
        # print the current loss every 100 epochs
        if epoch % 100 == 0:
            print('Epoch:', epoch, '\tLoss:', loss_val)
    # predict on the test set
    y_proba_val = y_proba.eval(feed_dict={X: X_test, y: y_test})

Samples with a predicted probability greater than or equal to 0.5 are classified as positive.

y_pred = (y_proba_val >= 0.5)


Model evaluation

Evaluate the model with precision and recall.

from sklearn.metrics import precision_score, recall_score

# precision
p_score = precision_score(y_test, y_pred)
# recall
r_score = recall_score(y_test, y_pred)

print('Precision score:', p_score)
print('Recall score:', r_score)
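
Precision and recall can be complemented with accuracy and the confusion matrix from the same sklearn.metrics module; this is a small optional addition, not part of the original exercise:

from sklearn.metrics import accuracy_score, confusion_matrix

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))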


Visualizing the predictions

y_pred_idx = y_pred.reshape(-1)

# columns 1 and 2 are the two raw features (column 0 is the bias term)
plt.plot(X_test[y_pred_idx, 1], X_test[y_pred_idx, 2], 'go', label='Positive')
plt.plot(X_test[~y_pred_idx, 1], X_test[~y_pred_idx, 2], 'r^', label='Negative')
plt.legend()
plt.show()

Visualization of the predicted classes


Bonus features

Since logistic regression is a linear classifier (as the visualization above also shows), its predictions are not particularly good here.
We therefore add polynomial features, i.e. 4 extra features ($x_{1}^2$, $x_{2}^2$, $x_{1}^3$, $x_{2}^3$).

# add 4 extra features
X_train_enhanced = np.c_[X_train,
                         X_train[:, 1] ** 2,
                         X_train[:, 2] ** 2,
                         X_train[:, 1] ** 3,
                         X_train[:, 2] ** 3]
X_test_enhanced = np.c_[X_test,
                        X_test[:, 1] ** 2,
                        X_test[:, 2] ** 2,
                        X_test[:, 1] ** 3,
                        X_test[:, 2] ** 3]
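
As an aside, scikit-learn's PolynomialFeatures can generate such features automatically. Note that a degree-3 expansion also includes the interaction terms (x1*x2, x1^2*x2, ...), so it yields 9 features rather than the 6 hand-picked ones used below; this sketch is for comparison only:

from sklearn.preprocessing import PolynomialFeatures

# degree-3 expansion of the two raw features (the bias column is added back manually)
poly = PolynomialFeatures(degree=3, include_bias=False)
X_train_poly = np.c_[np.ones((len(X_train), 1)), poly.fit_transform(X_train[:, 1:])]
X_test_poly = np.c_[np.ones((len(X_test), 1)), poly.transform(X_test[:, 1:])]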

Wrap the logistic regression model in a function.

def logistic_regression(X, y, initializer=None, seed=42, learning_rate=0.01):
    ''' Logistic regression.

    :param X: samples
    :param y: labels
    :param initializer: weight initializer
    :param seed: random seed
    :param learning_rate: learning rate
    :return: sigmoid probabilities, loss, training op, loss summary, init op, saver
    '''
    n_inputs_with_bias = int(X.get_shape()[1])
    with tf.name_scope('logistic_regression'):  # use a name scope
        with tf.name_scope('model'):
            if initializer is None:
                initializer = tf.random_uniform([n_inputs_with_bias, 1], -1.0, 1.0, seed=seed)
            theta = tf.Variable(initializer, name='theta')
            logits = tf.matmul(X, theta)
            y_proba = tf.sigmoid(logits)
        with tf.name_scope('train'):
            loss = tf.losses.log_loss(y, y_proba, scope='loss')
            optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
            training_op = optimizer.minimize(loss)
            loss_summary = tf.summary.scalar('log_loss', loss)
        with tf.name_scope('init'):
            init = tf.global_variables_initializer()
        with tf.name_scope('save'):
            saver = tf.train.Saver()
    return y_proba, loss, training_op, loss_summary, init, saver

Build the log directory name.

from datetime import datetime

def log_dir(prefix=''):
    now = datetime.utcnow().strftime('%Y%m%d%H%M%S')
    root_logdir = 'tf_logs'
    if prefix:
        prefix += '-'
    name = prefix + 'run-' + now
    return '{}/{}/'.format(root_logdir, name)

Build the computation graph.

# number of features; note the 4 extra polynomial features
n_inputs = 2 + 4

# log directory
logdir = log_dir('logreg')

X = tf.placeholder(tf.float32, shape=(None, n_inputs + 1), name='X')
y = tf.placeholder(tf.float32, shape=(None, 1), name='y')

# logistic_regression() is already wrapped up, so just call it
y_proba, loss, training_op, loss_summary, init, saver = logistic_regression(X, y)

# save the graph structure for TensorBoard
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

Run the computation graph.

import os

n_epochs = 10001
batch_size = 50
n_batches = int(np.ceil(m / batch_size))

checkpoint_path = './tmp/my_logreg_model.ckpt'
checkpoint_epoch_path = checkpoint_path + '.epoch'
final_model_path = './my_logreg_model'

# make sure the checkpoint directory exists (the Saver will not create it)
os.makedirs('./tmp', exist_ok=True)

with tf.Session() as sess:
    # check whether the checkpoint_epoch_path file exists
    if os.path.isfile(checkpoint_epoch_path):
        with open(checkpoint_epoch_path, 'rb') as f:
            # the file stores the epoch number at the last saved checkpoint
            start_epoch = int(f.read())
        print('Training was interrupted. Continuing at epoch', start_epoch)
        # restore the session
        saver.restore(sess, checkpoint_path)
    else:
        # start from scratch
        start_epoch = 0
        sess.run(init)

    for epoch in range(start_epoch, n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = random_batch(X_train_enhanced, y_train, batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

        # once per epoch, compute the loss on the test set together with its summary
        loss_val, summary_str = sess.run([loss, loss_summary], feed_dict={X: X_test_enhanced, y: y_test})
        # append the loss summary, tagged with the current epoch number
        file_writer.add_summary(summary_str, epoch)

        # save a checkpoint every 500 epochs
        if epoch % 500 == 0:
            print('Epoch:', epoch, '\tLoss:', loss_val)
            saver.save(sess, checkpoint_path)
            # overwrite the file with the new epoch number each time
            with open(checkpoint_epoch_path, 'wb') as f:
                f.write(b'%d' % (epoch + 1))

    # save the final model
    saver.save(sess, final_model_path)
    # predict on the test set
    y_proba_val = y_proba.eval(feed_dict={X: X_test_enhanced, y: y_test})
    # if training was not interrupted, remove the checkpoint_epoch_path file
    os.remove(checkpoint_epoch_path)

Compute the predictions:

y_pred = (y_proba_val >= 0.5)

Print the precision and recall:

print('Precision score:', precision_score(y_test, y_pred))
print('Recall score:', recall_score(y_test, y_pred))
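
As a side note, the model saved at final_model_path can be restored later in a fresh session for inference; a minimal sketch, assuming the same graph-construction code (the placeholders plus logistic_regression()) has been executed first:

with tf.Session() as sess:
    # restore the trained variables from final_model_path
    saver.restore(sess, final_model_path)
    y_proba_val = y_proba.eval(feed_dict={X: X_test_enhanced})
    y_pred = (y_proba_val >= 0.5)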

Visualize the predictions:

y_pred_idx = y_pred.reshape(-1)

plt.plot(X_test[y_pred_idx, 1], X_test[y_pred_idx, 2], 'go', label='Positive')
plt.plot(X_test[~y_pred_idx, 1], X_test[~y_pred_idx, 2], 'r^', label='Negative')
plt.legend()
plt.show()

Visualization of the predicted classes

As can be seen, adding the 4 extra features noticeably improves the predictions.


Now for some rather hand-wavy tuning of the learning rate and the batch size…

from scipy.stats import reciprocal

n_search_iterations = 10

# make sure the directory for the saved models exists
os.makedirs('./model', exist_ok=True)

for search_iteration in range(n_search_iterations):
    batch_size = np.random.randint(1, 100)
    # reciprocal is the reciprocal (log-uniform) distribution,
    # see https://en.wikipedia.org/wiki/Reciprocal_distribution
    # it is a reasonable choice when you have no idea of the optimal order of magnitude of a hyperparameter
    learning_rate = reciprocal.rvs(0.0001, 0.1, random_state=search_iteration)

    n_inputs = 2 + 4
    logdir = log_dir('logdir')

    print('Iteration', search_iteration)
    print(' logdir:', logdir)
    print(' batch size:', batch_size)
    print(' learning rate:', learning_rate)
    print(' training: ', end='')

    X = tf.placeholder(tf.float32, shape=(None, n_inputs + 1), name='X')
    y = tf.placeholder(tf.float32, shape=(None, 1), name='y')

    y_proba, loss, training_op, loss_summary, init, saver = logistic_regression(X, y, learning_rate=learning_rate)

    file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

    n_epochs = 10001
    n_batches = int(np.ceil(m / batch_size))

    final_model_path = './model/my_logreg_model_%d' % search_iteration

    with tf.Session() as sess:
        sess.run(init)

        for epoch in range(n_epochs):
            for batch_index in range(n_batches):
                X_batch, y_batch = random_batch(X_train_enhanced, y_train, batch_size)
                sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            loss_val, summary_str = sess.run([loss, loss_summary], feed_dict={X: X_test_enhanced, y: y_test})
            file_writer.add_summary(summary_str, epoch)
            if epoch % 500 == 0:
                print('.', end='')
        print()

        saver.save(sess, final_model_path)

        y_proba_val = y_proba.eval(feed_dict={X: X_test_enhanced, y: y_test})
        y_pred = (y_proba_val >= 0.5)

        print(' Precision:', precision_score(y_test, y_pred))
        print(' Recall:', recall_score(y_test, y_pred))

Training output:

Iteration 0
logdir: tf_logs/logdir-run-20171202101244/
batch size: 54
learning rate: 0.00443037524522
training: .....................
Precision: 0.979797979798
Recall: 0.979797979798
Iteration 1
logdir: tf_logs/logdir-run-20171202101623/
batch size: 22
learning rate: 0.00178264971514
training: .....................
Precision: 0.979797979798
Recall: 0.979797979798
Iteration 2
logdir: tf_logs/logdir-run-20171202102501/
batch size: 74
learning rate: 0.00203228544324
training: .....................
Precision: 0.969696969697
Recall: 0.969696969697
Iteration 3
logdir: tf_logs/logdir-run-20171202102742/
batch size: 58
learning rate: 0.00449152382514
training: .....................
Precision: 0.979797979798
Recall: 0.979797979798
Iteration 4
logdir: tf_logs/logdir-run-20171202103106/
batch size: 61
learning rate: 0.0796323472178
training: .....................
Precision: 0.980198019802
Recall: 1.0
Iteration 5
logdir: tf_logs/logdir-run-20171202103417/
batch size: 92
learning rate: 0.000463425058329
training: .....................
Precision: 0.912621359223
Recall: 0.949494949495
Iteration 6
logdir: tf_logs/logdir-run-20171202103630/
batch size: 74
learning rate: 0.0477068184194
training: .....................
Precision: 0.98
Recall: 0.989898989899
Iteration 7
logdir: tf_logs/logdir-run-20171202103916/
batch size: 58
learning rate: 0.000169404470952
training: .....................
Precision: 0.9
Recall: 0.909090909091
Iteration 8
logdir: tf_logs/logdir-run-20171202104242/
batch size: 61
learning rate: 0.0417146119941
training: .....................
Precision: 0.980198019802
Recall: 1.0
Iteration 9
logdir: tf_logs/logdir-run-20171202104602/
batch size: 92
learning rate: 0.000107429229684
training: .....................
Precision: 0.882352941176
Recall: 0.757575757576

Let's open TensorBoard (e.g. by running tensorboard --logdir tf_logs) and look at the learning curves of the 10 runs.

Learning curves of the 10 runs

As can be seen, run 4 (counting from 0) reaches the lowest loss.
The best hyperparameters found are:

Hyperparameter    Value
learning rate     0.0796323472178
batch size        61
batch size 61
分享到 评论