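Python Machine Learning for Beginners: From Zero to Practical Applications

Overview

Machine learning is a core branch of artificial intelligence that lets computers learn patterns from data and make predictions. This tutorial starts from scratch, introduces the basic concepts systematically, and walks through 5 complete hands-on cases so that you pick up the core skills of machine learning development in Python.

Whether you are new to programming or a developer with some experience, this tutorial offers a clear learning path and runnable code examples. By the end, you should be able to complete common machine learning tasks on your own and have a solid foundation for moving on to deep learning.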

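Chapter 1: Machine Learning Fundamentals

1.1 What Is Machine Learning?

Machine learning is a technique that lets a computer system improve its performance automatically through experience (data). Unlike traditional programming, it does not solve a problem with explicitly written rules; it discovers the rules by analyzing data.

Traditional programming vs. machine learning:

Traditional programming	Machine learning
Input data + rules → output	Input data + expected output → learned rules
All logic must be defined by hand	Patterns are discovered from data automatically
Suits problems with clear rules	Suits problems whose patterns are complex and hard to describe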
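
The contrast in the table can be made concrete with a toy example. The sketch below is illustrative only: the "number of links" feature, the hand-picked threshold, and the training data are all made up for this example. It compares a rule written by hand with a rule (a single threshold) learned from labeled examples.

```python
# Traditional programming: a human writes the rule.
def rule_based_is_spam(num_links: int) -> bool:
    return num_links > 3  # threshold chosen by hand


# Machine learning: the rule (here, a single threshold) is learned from labeled data.
def learn_threshold(samples: list[tuple[int, bool]]) -> int:
    """Return the threshold t that makes `num_links > t` agree with most labels."""
    candidates = sorted({x for x, _ in samples})
    best_t, best_correct = candidates[0], -1
    for t in candidates:
        correct = sum((x > t) == label for x, label in samples)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


# Hypothetical training data: (number of links in an email, is it spam?)
training_data = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]
threshold = learn_threshold(training_data)
print(threshold)      # 2
print(6 > threshold)  # True: the learned rule flags this email as spam
```

Real models learn many parameters at once instead of a single threshold, but the idea is the same: the data and the expected outputs go in, and the rule comes out.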

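1.2 The Three Main Types of Machine Learning

1.2.1 Supervised Learning

Supervised learning is the most common type of machine learning. We provide labeled training data, and the model learns the mapping from inputs to outputs.

Common tasks:

Classification: predict a discrete class, e.g. spam detection
Regression: predict a continuous value, e.g. house price prediction

Typical algorithms:

Linear regression, logistic regression
Decision trees, random forests
Support vector machines (SVM)
Neural networks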
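
To show what "learning the mapping from inputs to outputs" means in the simplest case, here is a minimal sketch of one-feature linear regression using the closed-form least-squares solution. The data is made up and follows y = 2x + 1 exactly; the function name `fit_line` is an illustrative choice, not a library API.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up labeled data that follows y = 2x + 1 exactly
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)         # 2.0 1.0
print(slope * 5.0 + intercept)  # 11.0, the prediction for an unseen input
```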

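1.2.2 Unsupervised Learning

Unsupervised learning works with unlabeled data; the goal is to discover hidden structure in the data.

Common tasks:

Clustering: group similar data points
Dimensionality reduction: reduce the number of features
Anomaly detection: identify unusual data points

Typical algorithms:

K-Means clustering
Hierarchical clustering
Principal component analysis (PCA)
Autoencoders

1.2.3 Reinforcement Learning

Reinforcement learning learns an optimal policy by interacting with an environment and adjusting its behavior according to reward signals.

Application areas:

Game AI (e.g. AlphaGo)
Robot control
Recommender systems
Autonomous driving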

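1.3 The Machine Learning Workflow

A complete machine learning project usually goes through the following steps:

Data collection → data preprocessing → feature engineering → model selection → model training → model evaluation → model deployment

Key points:

Data quality sets the upper bound on model performance
Feature engineering often matters more than the choice of model
Guard against both overfitting and underfitting
Monitor and update the model continuously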

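Chapter 2: Setting Up a Python Machine Learning Environment

2.1 Installing the Required Libraries

bash

# Create a virtual environment (recommended)
python -m venv ml_env
source ml_env/bin/activate  # Linux/Mac
# or: ml_env\Scripts\activate  # Windows

# Install the core libraries
pip install numpy pandas matplotlib scikit-learn jupyter

2.2 Core Libraries

Library	Purpose
NumPy	Numerical computing, multi-dimensional arrays
Pandas	Data processing and analysis
Matplotlib	Data visualization
Scikit-learn	Machine learning algorithms
Jupyter	Interactive development environment

2.3 Verifying the Installation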

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn  # required for sklearn.__version__ below
from sklearn import datasets

print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")
print("Environment is ready!")

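Chapter 3: Hands-On Cases

Case 1: Iris Classification (Multi-Class Problem)

This is the "Hello World" of machine learning: a multi-class classification task on the classic Iris dataset.

3.1.1 Data Exploration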

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load the data
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Build a DataFrame for easier inspection
df = pd.DataFrame(X, columns=feature_names)
df['species'] = pd.Categorical.from_codes(y, target_names)

print(f"Dataset shape: {X.shape}")
print(f"Feature names: {feature_names}")
print(f"Classes: {target_names}")
print(df.head())

# Visualize the per-class distribution of each feature
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for i, ax in enumerate(axes.flat):
    if i < len(feature_names):
        for species in target_names:
            mask = df['species'] == species
            ax.hist(df.loc[mask, feature_names[i]],
                    alpha=0.5, label=species, bins=15)
        ax.set_xlabel(feature_names[i])
        ax.set_ylabel('Count')
        ax.legend()
        ax.set_title(f'Distribution of {feature_names[i]}')

plt.tight_layout()
plt.savefig('iris_distribution.png', dpi=150)
plt.show()

3.1.2 Model Training and Evaluation

python

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize features (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict
y_pred = knn.predict(X_test_scaled)

# Evaluate
cm = confusion_matrix(y_test, y_pred)  # compute once and reuse below
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=target_names))
print("\nConfusion matrix:")
print(cm)

# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
plt.imshow(cm, cmap='Blues')
plt.colorbar()
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('KNN Iris Classification Confusion Matrix')
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        # Off-diagonal cells (misclassifications) are drawn in red
        plt.text(j, i, str(cm[i, j]),
                 ha='center', va='center', color='red' if i != j else 'black')
plt.savefig('iris_confusion_matrix.png', dpi=150)
plt.show()
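
To make clear what `KNeighborsClassifier` does when it predicts, the following is a minimal pure-Python sketch of the k-nearest-neighbors idea: measure the distance from the query to every training point, take the k closest, and let them vote. It is not scikit-learn's implementation (which uses optimized neighbor search), and the 2-D points are made up, not taken from the Iris data.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up 2-D points, not the Iris data
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(train_X, train_y, (1.1, 1.0)))  # small
print(knn_predict(train_X, train_y, (5.0, 4.8)))  # large
```

Because the prediction depends directly on distances, a feature with a large numeric range would dominate the vote, which is why the features are standardized before KNN is trained.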

3.1.3 Model Optimization

python

from sklearn.model_selection import cross_val_score, GridSearchCV

# Cross-validation over a range of K values
k_range = range(1, 31)
cv_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

# Plot cross-validation accuracy against K
plt.figure(figsize=(10, 6))
plt.plot(k_range, cv_scores, marker='o')
plt.xlabel('K')
plt.ylabel('Cross-validation accuracy')
plt.title('Choosing K')
plt.grid(True, alpha=0.3)
plt.savefig('k_value_selection.png', dpi=150)
plt.show()

# Grid search for the best hyperparameters
param_grid = {
    'n_neighbors': list(range(1, 31)),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

grid_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test_scaled)
print(f"Test accuracy: {accuracy_score(y_test, y_pred_best):.4f}")

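Key takeaways:

Explore the data before modeling
Feature standardization matters for distance-based algorithms such as KNN
Cross-validation helps choose the best hyperparameters
The confusion matrix shows which kinds of errors the model makes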
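
One refinement worth knowing: in the grid search of this case, the scaler is fit on the whole training set before cross-validation, so every validation fold has a small influence on the scaling statistics. A common way to avoid this is to put the scaler and the classifier into a scikit-learn `Pipeline`, so that scaling is refit inside each fold. The sketch below assumes scikit-learn is installed and uses a smaller parameter grid than the one in the case, purely to keep it short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The scaler is refit on the training part of every CV fold,
# so validation folds never leak into the scaling statistics.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Step parameters are addressed as "<step name>__<parameter>"
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)  # raw features: the pipeline handles scaling

print(search.best_params_)
print(f"Test accuracy: {search.score(X_test, y_test):.4f}")
```

On a dataset as small and clean as Iris the difference is usually negligible, but the pipeline form is the safer habit for real projects.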

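Case 2: California Housing Price Prediction (Regression Problem)

House price prediction is a classic regression problem. Here we predict the median house value of California districts. The Boston housing dataset that tutorials traditionally used was removed from scikit-learn in version 1.2, so this case uses the California housing dataset instead.

3.2.1 Loading and Exploring the Data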

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Load the California housing dataset (load_boston was removed in scikit-learn 1.2)
housing = fetch_california_housing()
X = housing.data
y = housing.target
feature_names = housing.feature_names

# Build a DataFrame
df = pd.DataFrame(X, columns=feature_names)
df['MedHouseValue'] = y

print(f"Dataset shape: {X.shape}")
print(f"Features: {feature_names}")
print(df.describe())

# Correlation analysis
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
labels = list(correlation_matrix.columns)  # 8 features plus the target column
plt.imshow(correlation_matrix, cmap='coolwarm', aspect='auto')
plt.colorbar()
plt.xticks(range(len(labels)), labels, rotation=45, ha='right')
plt.yticks(range(len(labels)), labels)
plt.title('Feature Correlation Heatmap')
for i in range(len(labels)):
    for j in range(len(labels)):
        value = correlation_matrix.iloc[i, j]
        color = 'white' if abs(value) > 0.5 else 'black'
        plt.text(j, i, f'{value:.2f}', ha='center', va='center', color=color, fontsize=8)
plt.tight_layout()
plt.savefig('housing_correlation.png', dpi=150)
plt.show()

3.2.2 Comparing Multiple Models

python

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define several models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

# Train and evaluate
results = []

for name, model in models.items():
    # Linear models use the standardized features; tree ensembles do not need scaling
    if name in ['Linear Regression', 'Ridge Regression', 'Lasso Regression']:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    results.append({
        'Model': name,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2
    })

    print(f"{name}:")
    print(f"  RMSE: {rmse:.4f}")
    print(f"  MAE: {mae:.4f}")
    print(f"  R²: {r2:.4f}")
    print()

# Visual comparison of the results
results_df = pd.DataFrame(results)
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.bar(results_df['Model'], results_df['RMSE'])
plt.xlabel('Model')
plt.ylabel('RMSE')
plt.title('RMSE by Model (lower is better)')
plt.xticks(rotation=45, ha='right')

plt.subplot(1, 2, 2)
plt.bar(results_df['Model'], results_df['R²'])
plt.xlabel('Model')
plt.ylabel('R²')
plt.title('R² by Model (higher is better)')
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.savefig('housing_model_comparison.png', dpi=150)
plt.show()

3.2.3 Feature Importance Analysis

python

# Random forest feature importances
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(12, 6))
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), [feature_names[i] for i in indices], rotation=45, ha='right')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Random Forest Feature Importances')
plt.tight_layout()
plt.savefig('housing_feature_importance.png', dpi=150)
plt.show()

# Visualize the predictions.
# Gradient boosting is used here as the example; in practice, pick whichever
# model scored best in the comparison from section 3.2.2.
best_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('True value')
plt.ylabel('Predicted value')
plt.title('Predicted vs. True')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('housing_predictions.png', dpi=150)
plt.show()

print("Final model (Gradient Boosting):")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"  MAE: {mean_absolute_error(y_test, y_pred):.4f}")
print(f"  R²: {r2_score(y_test, y_pred):.4f}")

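Key takeaways:

Regression problems are evaluated with metrics such as RMSE, MAE, and R²
Comparing several models helps choose the best algorithm
Feature importance analysis helps explain the model
Residual analysis checks the model's assumptions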
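
The three regression metrics used in this case are simple to compute by hand, which helps when interpreting them. The sketch below implements their standard definitions in pure Python on made-up numbers; it is meant as an illustration of the formulas, and `mean_squared_error`, `mean_absolute_error`, and `r2_score` remain the practical choice.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R²) computed directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return math.sqrt(mse), mae, r2


# Made-up values for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R²={r2:.4f}")
# RMSE=0.7071  MAE=0.5000  R²=0.9000
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging, and R² compares the model's squared error with that of always predicting the mean.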

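Case 3: Handwritten Digit Recognition (Image Classification)

We recognize handwritten digits from the MNIST dataset, a classic introductory task in computer vision.

3.3.1 Loading and Visualizing the Data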

python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load the MNIST dataset
print("Loading the MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X = mnist.data
y = mnist.target.astype(np.uint8)

print(f"Dataset shape: {X.shape}")
print(f"Label shape: {y.shape}")

# Visualize some samples
fig, axes = plt.subplots(5, 10, figsize=(15, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(X[i].reshape(28, 28), cmap='gray')
    ax.set_title(f'Label: {y[i]}')
    ax.axis('off')
plt.tight_layout()
plt.savefig('mnist_samples.png', dpi=150)
plt.show()

# Split the data (use a smaller sample to speed up training)
X_sample = X[:10000]
y_sample = y[:10000]

X_train, X_test, y_train, y_test = train_test_split(
    X_sample, y_sample, test_size=0.2, random_state=42, stratify=y_sample
)

print(f"Training set: {X_train.shape}, test set: {X_test.shape}")

3.3.2 Model Training

python

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train an SVM model
print("Training the SVM model...")
svm_model = SVC(kernel='rbf', C=10, gamma='scale', random_state=42)
svm_model.fit(X_train_scaled[:5000], y_train[:5000])  # use part of the data to speed up training

# Predict
y_pred_svm = svm_model.predict(X_test_scaled)

# Evaluate
print(f"SVM accuracy: {accuracy_score(y_test, y_pred_svm):.4f}")
print("\nSVM classification report:")
print(classification_report(y_test, y_pred_svm))

# Train a random forest model (tree models do not need scaled features)
print("\nTraining the random forest model...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Evaluate
print(f"Random forest accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print("\nRandom forest classification report:")
print(classification_report(y_test, y_pred_rf))

3.3.3 Error Analysis

python

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_svm)

plt.figure(figsize=(10, 8))
plt.imshow(cm, cmap='Blues')
plt.colorbar()
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('SVM Confusion Matrix')
for i in range(10):
    for j in range(10):
        # White text on dark cells, black text on light cells, so every count is readable
        plt.text(j, i, str(cm[i, j]), ha='center', va='center',
                 color='white' if cm[i, j] > cm.max() / 2 else 'black')
plt.tight_layout()
plt.savefig('mnist_confusion_matrix.png', dpi=150)
plt.show()

# Look at misclassified samples
errors = y_pred_svm != y_test
error_indices = np.where(errors)[0]

if len(error_indices) > 0:
    fig, axes = plt.subplots(3, 5, figsize=(15, 9))
    for i, ax in enumerate(axes.flat):
        ax.axis('off')  # also hides panels left empty when there are fewer than 15 errors
        if i < len(error_indices):
            idx = error_indices[i]
            ax.imshow(X_test[idx].reshape(28, 28), cmap='gray')
            ax.set_title(f'True: {y_test[idx]}, predicted: {y_pred_svm[idx]}')
    plt.tight_layout()
    plt.savefig('mnist_errors.png', dpi=150)
    plt.show()
    print(f"Misclassified samples: {len(error_indices)}")
    print(f"Error rate: {len(error_indices) / len(y_test):.4f}")

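Key takeaways:

Image data needs to be flattened and standardized
SVMs perform well on small-scale image classification
Error analysis reveals the model's weaknesses
For large-scale data, consider deep learning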
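
A confusion matrix with 100 cells can be hard to read at a glance. One simple way to turn it into an actionable summary is to list the most frequent (true label, predicted label) mistakes. The sketch below does this in pure Python; the label lists are made up for illustration and are not real MNIST predictions, and `most_confused_pairs` is a hypothetical helper name.

```python
from collections import Counter


def most_confused_pairs(y_true, y_pred, top=3):
    """Return the most frequent (true label, predicted label) mistakes."""
    mistakes = Counter(
        (t, p) for t, p in zip(y_true, y_pred) if t != p
    )
    return mistakes.most_common(top)


# Made-up labels for illustration, not real MNIST predictions
y_true = [4, 9, 4, 4, 7, 3, 9, 4, 5, 3]
y_pred = [9, 9, 9, 4, 1, 3, 4, 9, 5, 8]

for (true_label, predicted), count in most_confused_pairs(y_true, y_pred):
    print(f"true {true_label} predicted as {predicted}: {count} times")
```

With the variables from this case, the same function could be called as `most_confused_pairs(y_test, y_pred_svm)` to see which digit pairs the SVM mixes up most often.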

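Case 4: Customer Segmentation (Unsupervised Learning: Clustering)

We use K-Means to segment customer data and help the business understand its customer groups.

3.4.1 Generating and Exploring the Data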

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Generate simulated customer data
rng = np.random.default_rng(42)
n_samples = 500

# Four customer groups: [age, annual income, spending score]
centers = np.array([
    [20, 3000, 5],    # young, low spending
    [35, 8000, 15],   # middle-aged, medium spending
    [50, 15000, 25],  # mature, high spending
    [28, 5000, 8]     # young adults, growing
])
# make_blobs applies one cluster_std per cluster to every feature, so it cannot give
# age and income different spreads. Draw each feature with its own spread instead.
# The per-feature spreads below are illustrative assumptions.
feature_std = np.array([3, 800, 2])
y_true = rng.integers(0, len(centers), size=n_samples)
X = centers[y_true] + rng.normal(size=(n_samples, 3)) * feature_std

# Build a DataFrame
df = pd.DataFrame(X, columns=['Age', 'Annual Income', 'Spending Score'])
df['Age'] = df['Age'].clip(18, 70)  # keep ages in a plausible range
df['Annual Income'] = df['Annual Income'].clip(1000, 50000)
df['Spending Score'] = df['Spending Score'].clip(1, 100)

print(f"Dataset shape: {df.shape}")
print(df.describe())

# Visualize the data
fig = plt.figure(figsize=(15, 5))

ax1 = fig.add_subplot(131, projection='3d')
ax1.scatter(df['Age'], df['Annual Income'], df['Spending Score'],
            c='skyblue', alpha=0.6, s=50)
ax1.set_xlabel('Age')
ax1.set_ylabel('Annual Income')
ax1.set_zlabel('Spending Score')
ax1.set_title('3D Data Distribution')

ax2 = fig.add_subplot(132)
scatter = ax2.scatter(df['Age'], df['Annual Income'],
                      c=df['Spending Score'], cmap='viridis', alpha=0.6)
ax2.set_xlabel('Age')
ax2.set_ylabel('Annual Income')
ax2.set_title('Age vs. Income (color = spending score)')
plt.colorbar(scatter, label='Spending Score')

ax3 = fig.add_subplot(133)
df[['Annual Income', 'Spending Score']].plot.hexbin(
    x='Annual Income', y='Spending Score', gridsize=20, cmap='YlOrRd', ax=ax3)
ax3.set_xlabel('Annual Income')
ax3.set_ylabel('Spending Score')
ax3.set_title('Income vs. Spending Score Density')

plt.tight_layout()
plt.savefig('customer_data_exploration.png', dpi=150)
plt.show()

3.4.2 K-Means Clustering

python

# Standardize the three customer features
feature_cols = ['Age', 'Annual Income', 'Spending Score']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[feature_cols])

# Use the elbow method and the silhouette score to choose K
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, labels))

# Plot the elbow curve and the silhouette scores
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(K_range, inertias, marker='o', linewidth=2)
ax1.set_xlabel('Number of clusters K')
ax1.set_ylabel('Inertia')
ax1.set_title('Elbow Method: Choosing K')
ax1.grid(True, alpha=0.3)
# inertias[2] is the value at K=4; the label offset is scaled to the curve's own range
ax1.annotate('Elbow', xy=(4, inertias[2]),
             xytext=(6, inertias[2] + 0.3 * (inertias[0] - inertias[2])),
             arrowprops=dict(arrowstyle='->', color='red'))

ax2.plot(K_range, silhouette_scores, marker='s', color='green', linewidth=2)
ax2.set_xlabel('Number of clusters K')
ax2.set_ylabel('Silhouette score')
ax2.set_title('Silhouette Score: Clustering Quality')
ax2.grid(True, alpha=0.3)
ax2.axhline(y=max(silhouette_scores), color='red', linestyle='--',
            label=f'Maximum: {max(silhouette_scores):.3f}')
ax2.legend()

plt.tight_layout()
plt.savefig('kmeans_optimal_k.png', dpi=150)
plt.show()

# Cluster with the chosen K
optimal_k = 4  # chosen from the elbow plot
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df['Cluster'] = kmeans.fit_predict(X_scaled)

# Evaluate clustering quality
print(f"Silhouette score: {silhouette_score(X_scaled, df['Cluster']):.4f}")
print(f"Davies-Bouldin index: {davies_bouldin_score(X_scaled, df['Cluster']):.4f}")

# Cluster centers, converted back to the original units
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
centers_df = pd.DataFrame(cluster_centers, columns=feature_cols)
centers_df['Cluster'] = range(optimal_k)
centers_df['Samples'] = df['Cluster'].value_counts().sort_index().values

print("\nCluster centers:")
print(centers_df)

3.4.3 Visualizing and Analyzing the Clusters

python

# Reduce to 2 dimensions with PCA for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(15, 5))

# Colored by cluster
plt.subplot(1, 3, 1)
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['Cluster'],
                      cmap='viridis', alpha=0.6, s=50)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
plt.title('K-Means Clusters (PCA projection)')
plt.colorbar(scatter, label='Cluster')

# Colored by annual income
plt.subplot(1, 3, 2)
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['Annual Income'],
                      cmap='RdYlBu_r', alpha=0.6, s=50)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
plt.title('Colored by Annual Income')
plt.colorbar(scatter, label='Annual Income')

# Colored by spending score
plt.subplot(1, 3, 3)
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['Spending Score'],
                      cmap='YlOrRd', alpha=0.6, s=50)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
plt.title('Colored by Spending Score')
plt.colorbar(scatter, label='Spending Score')

plt.tight_layout()
plt.savefig('customer_clustering_visualization.png', dpi=150)
plt.show()

# Per-cluster feature summary
print("\nCluster profiles:")
for cluster in range(optimal_k):
    cluster_data = df[df['Cluster'] == cluster]
    print(f"\nCluster {cluster} ({len(cluster_data)} customers):")
    print(f"  Mean age: {cluster_data['Age'].mean():.1f}")
    print(f"  Mean annual income: ${cluster_data['Annual Income'].mean():.0f}")
    print(f"  Mean spending score: {cluster_data['Spending Score'].mean():.1f}")

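Key takeaways:

Unsupervised learning does not need labeled data
The elbow method and the silhouette score help choose the number of clusters
Feature standardization matters for clustering
Clustering results need to be interpreted in business terms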
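
To see what `KMeans` does internally, here is a minimal pure-Python sketch of Lloyd's algorithm, the classic K-Means procedure: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The data and the hand-picked starting centroids are made up; scikit-learn's `KMeans` additionally uses smarter initialization (k-means++) and multiple restarts.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


# Made-up 2-D data with two obvious groups; initial centroids are picked by hand
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print([len(c) for c in clusters])  # [3, 3]
```

Since both steps rely on Euclidean distance, a feature with a much larger scale (such as income next to age) would dominate the assignment, which is why the case standardizes the features first.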

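Case 5: Spam Detection (Text Classification)

We use naive Bayes for a spam-style text classification task to show the text processing workflow. The code below uses two categories of the 20 Newsgroups dataset as a stand-in for spam and non-spam messages; the same steps apply to a real labeled spam dataset.

3.5.1 Preparing the Data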

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import re

# Load the 20 Newsgroups data (comp.graphics and rec.sport.baseball serve as the two example classes)
print("Loading data...")
categories = ['comp.graphics', 'rec.sport.baseball']
train_data = fetch_20newsgroups(subset='train', categories=categories,
                                remove=('headers', 'footers', 'quotes'))
test_data = fetch_20newsgroups(subset='test', categories=categories,
                               remove=('headers', 'footers', 'quotes'))

X_train = train_data.data
y_train = train_data.target
X_test = test_data.data
y_test = test_data.target

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"Classes: {train_data.target_names}")

# Text preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove everything except letters and whitespace (digits, punctuation, symbols)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse repeated whitespace
    text = ' '.join(text.split())
    return text

# Apply the preprocessing
X_train_clean = [preprocess_text(text) for text in X_train]
X_test_clean = [preprocess_text(text) for text in X_test]

# Inspect a sample
print("\nOriginal text sample:")
print(X_train[0][:200] + "...")
print("\nAfter preprocessing:")
print(X_train_clean[0][:200] + "...")

3.5.2 Feature Extraction and Model Training

python

# Extract features with TF-IDF
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=2,
    max_df=0.8,
    ngram_range=(1, 2)  # use unigrams and bigrams
)

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_clean)
X_test_tfidf = tfidf_vectorizer.transform(X_test_clean)

print(f"Feature matrix shape: {X_train_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")

# Train a naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

# Predict
y_pred_nb = nb_model.predict(X_test_tfidf)

# Evaluate
print("\n=== Naive Bayes ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_nb):.4f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred_nb, target_names=train_data.target_names))

# Train a logistic regression model for comparison
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_tfidf, y_train)
y_pred_lr = lr_model.predict(X_test_tfidf)

print("\n=== Logistic Regression ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred_lr, target_names=train_data.target_names))

3.5.3 Model Analysis and Optimization

python

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_nb)

plt.figure(figsize=(10, 8))
plt.imshow(cm, cmap='Blues')
plt.colorbar()
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Naive Bayes Confusion Matrix')
for i in range(2):
    for j in range(2):
        # White text on dark cells, black text on light cells
        plt.text(j, i, str(cm[i, j]), ha='center', va='center',
                 color='white' if cm[i, j] > cm.max() / 2 else 'black', fontsize=20)
plt.xticks(range(2), train_data.target_names, rotation=45, ha='right')
plt.yticks(range(2), train_data.target_names)
plt.tight_layout()
plt.savefig('spam_confusion_matrix.png', dpi=150)
plt.show()

# Most characteristic terms (top 20 per class)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Get the highest-weighted features of each class
def get_top_features(model, feature_names, n=20):
    top_features = {}
    for i, class_name in enumerate(train_data.target_names):
        # feature_log_prob_[i] holds log P(term | class i). MultinomialNB has no
        # coef_ attribute in current scikit-learn releases, so it is not used here.
        log_prob = model.feature_log_prob_[i]
        top_indices = np.argsort(log_prob)[-n:][::-1]
        top_features[class_name] = [(feature_names[idx], log_prob[idx]) for idx in top_indices]
    return top_features

top_features_nb = get_top_features(nb_model, feature_names)

print("\nTop 10 terms per class:")
for class_name, features in top_features_nb.items():
    print(f"\n{class_name}:")
    for word, score in features[:10]:
        print(f"  {word}: {score:.4f}")

# Analyze predicted probabilities
y_proba = nb_model.predict_proba(X_test_tfidf)

# Find the samples the model is least sure about (probability closest to 0.5)
uncertainty = np.abs(y_proba[:, 0] - 0.5)
uncertain_indices = np.argsort(uncertainty)[:5]

print("\nThe 5 most uncertain samples:")
for idx in uncertain_indices:
    print(f"\nSample {idx}:")
    print(f"Text: {X_test_clean[idx][:150]}...")
    print(f"Predicted probabilities: {y_proba[idx]}")
    print(f"True label: {train_data.target_names[y_test[idx]]}")
    print(f"Predicted label: {train_data.target_names[y_pred_nb[idx]]}")

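Key takeaways:

Text data needs preprocessing and feature extraction
TF-IDF is a widely used text feature representation
Naive Bayes is well suited to text classification
Analyzing characteristic terms helps explain the model's decisions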
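
To show what TF-IDF computes, the sketch below builds the weights by hand for a tiny made-up corpus: each term's count in a document is multiplied by an inverse document frequency, so terms that appear in many documents count for less. It uses the smoothed IDF formula that scikit-learn applies by default, `ln((1 + n) / (1 + df)) + 1`, but omits the L2 row normalization that `TfidfVectorizer` also performs by default, so the numbers will not match the vectorizer's output exactly.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (no row normalization)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({term: count * idf[term] for term, count in tf.items()})
    return weights


# Made-up mini corpus
docs = [
    "free prize click now",
    "meeting agenda for monday",
    "click here for free prize prize",
]
weights = tfidf(docs)
for term, weight in sorted(weights[2].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the third document, "prize" gets the largest weight because it occurs twice, while "here" outranks "click" because it appears in only one document of the corpus.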

Chapter 4: Machine Learning Best Practices

4.1 Data Quality First

python

# Data quality checklist
def data_quality_check(df):
    print("=== Data quality check ===")
    print(f"Shape: {df.shape}")
    print("\nMissing values per column:")
    print(df.isnull().sum())
    print("\nMissing values (%):")
    print((df.isnull().sum() / len(df) * 100).round(2))
    print("\nData types:")
    print(df.dtypes)
    print("\nDescriptive statistics:")
    print(df.describe())
    print(f"\nDuplicate rows: {df.duplicated().sum()}")

4.2 Preventing Overfitting

Common methods:

Cross-validation
Regularization (L1/L2)
Dropout (neural networks)
Early stopping
Data augmentation
Simplifying the model
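
As an illustration of how L2 regularization restrains a model, the sketch below fits a one-parameter model y ≈ w·x with a penalty λ·w². Minimizing Σ(y − w·x)² + λ·w² gives the closed form w = Σxy / (Σx² + λ), so a larger λ pulls the weight toward zero. The data is made up, and the no-intercept model is a simplifying assumption chosen to keep the formula short.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope for y ≈ w * x (no intercept) with an L2 penalty lam * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


# Made-up data: roughly y = 2x with a little noise
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(f"lambda={lam:>5}: w={ridge_slope(xs, ys, lam):.3f}")
```

With λ = 0 this is ordinary least squares; as λ grows, the fitted weight shrinks, trading a little training accuracy for a simpler, more stable model. This is the same mechanism that `Ridge(alpha=...)` controls in Case 2.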

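4.3 Feature Engineering Techniques

Feature scaling: standardization, normalization
Encoding: one-hot encoding, label encoding
Feature selection: filter, wrapper, and embedded methods
Feature construction: polynomial features, interaction features
Handling class imbalance: oversampling, undersampling, SMOTE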
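
The first two techniques in the list can be sketched in a few lines of pure Python. The example below standardizes a numeric column (z-scores, using the population standard deviation as `StandardScaler` does) and one-hot encodes a categorical column. The column values are made up, and in a real project `StandardScaler` and `OneHotEncoder` from scikit-learn would normally be used instead.

```python
import statistics


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Map each category to a 0/1 vector with one position per distinct category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


# Made-up columns
ages = [20.0, 30.0, 40.0]
cities = ["Beijing", "Shanghai", "Beijing"]

scaled_ages = standardize(ages)
print([round(z, 4) for z in scaled_ages])  # [-1.2247, 0.0, 1.2247]

categories, encoded = one_hot(cities)
print(categories)  # ['Beijing', 'Shanghai']
print(encoded)     # [[1, 0], [0, 1], [1, 0]]
```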

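4.4 Choosing Evaluation Metrics

Problem type	Primary metrics	Secondary metrics
Binary classification	Accuracy, F1, AUC	Precision, Recall
Multi-class classification	Accuracy, Macro-F1	Confusion matrix
Regression	RMSE, MAE	R², Adjusted R²
Clustering	Silhouette, DBI	Inertia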
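
The table lists accuracy next to F1 for classification because accuracy alone can mislead on imbalanced data. The sketch below computes precision, recall, and F1 from their definitions on a made-up example in which a model finds only half of the positive cases yet still reaches 90% accuracy.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up imbalanced labels: 2 positives among 10 samples
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.90 precision=1.00 recall=0.50 f1=0.67
```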

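Chapter 5: Frequently Asked Questions (FAQ)

Q1: Which machine learning algorithm should I choose?

A: There is no universal algorithm. The choice depends on:

The amount of data
Feature types (numerical or categorical)
Problem type (classification, regression, or clustering)
Interpretability requirements
Training time constraints

Suggested process:

Start with simple models (linear models, decision trees)
Try several models and compare them
Adjust and optimize based on the results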

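Q2: How should I handle missing values?

A: It depends on the situation:

Drop: when the proportion of missing values is high (>50%)
Fill: with the mean, median, or mode
Predictive imputation: predict the missing value from other features
Flag missingness: add a missing-indicator variable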
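
Two of these options, filling with the median and adding a missing-indicator column, combine naturally. The sketch below shows both on a made-up income column, using `None` to represent a missing entry; with pandas, `fillna` and `isnull` would play the same roles.

```python
import statistics


def impute_median(values):
    """Fill None with the median of the observed values and return a missing-indicator column."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return filled, was_missing


# Made-up income column with two missing entries
incomes = [3000, None, 5000, 8000, None]
filled, was_missing = impute_median(incomes)
print(filled)       # [3000, 5000, 5000, 8000, 5000]
print(was_missing)  # [0, 1, 0, 0, 1]
```

The indicator column lets a model learn whether "the value was missing" is itself informative, which the filled value alone would hide.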

Q3: What is a good train/test split ratio?

A: Common ratios:

Small datasets: 80/20 or 70/30
Large datasets: 90/10 or 95/5
Cross-validation gives a more reliable estimate
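
Cross-validation is more reliable because every sample is used for testing exactly once, so the estimate does not hinge on one lucky or unlucky split. The sketch below generates the index sets for k consecutive folds in pure Python. It is a simplified illustration: it does not shuffle or stratify, which scikit-learn's `KFold` and `StratifiedKFold` can do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(n_samples=7, k=3):
    print(f"train={train} test={test}")
# train=[3, 4, 5, 6] test=[0, 1, 2]
# train=[0, 1, 2, 5, 6] test=[3, 4]
# train=[0, 1, 2, 3, 4] test=[5, 6]
```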

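Q4: What should I do if the model's accuracy is low?

A: Troubleshooting steps:

Check the data quality
Try different feature engineering
Tune the model's hyperparameters
Try different algorithms
Collect more data

Q5: How do I save and load a model?

python

import joblib

# Save the model (`model` is any fitted scikit-learn estimator, e.g. best_model from Case 1)
joblib.dump(model, 'model.pkl')

# Load the model
model = joblib.load('model.pkl')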

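Chapter 6: Summary and Next Steps

6.1 Recap

Fundamentals: the three main types of machine learning and the overall workflow
Environment setup: the core Python machine learning libraries
Hands-on cases: 5 machine learning projects of different types
Best practices: data quality, overfitting prevention, feature engineering, and more

6.2 Suggested Learning Path

Beginner (completed):

✅ Master the basic algorithms (linear regression, KNN, decision trees)
✅ Get comfortable with scikit-learn
✅ Complete basic hands-on projects

Intermediate (next step):

Study ensemble learning in depth (Random Forest, XGBoost, LightGBM)
Learn the basics of deep learning (neural networks, CNN, RNN)
Take part in Kaggle competitions

Advanced:

Study cutting-edge papers and algorithms
Learn model deployment (Flask, FastAPI, Docker)
Master MLOps workflows

6.3 Recommended Resources

Books:

Machine Learning (《机器学习》) by Zhou Zhihua
Hands-On Machine Learning
Pattern Recognition and Machine Learning

Online courses:

Coursera: Machine Learning (Andrew Ng)
fast.ai: Practical Deep Learning
Andrew Ng's Deep Learning Specialization

Practice platforms:

Kaggle: data science competitions
Google Colab: free GPUs
Hugging Face: pretrained models

Author's note: Machine learning is a highly practical field, and the best way to learn it is to build projects yourself. I hope this tutorial gives you a solid foundation and starts you on your machine learning journey!

Last updated: 2026-03-21
Word count: approximately 8,500 characters (original Chinese text)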
