2026-07-01 05:44:16 · Cybersecurity articles · Source: ZONE.CI 全球网

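Article summary: This document explores how the multi-armed bandit and the UCB algorithm can be applied in cybersecurity, focusing on how the exploration-exploitation trade-off can optimize the contest between phishing emails and an anti-phishing system. It provides complete Python code, including the MultiArmedBandit class, the ε-greedy strategy, and UCB applied to firewall policy selection, and shows how dynamically adjusting policy selection probabilities can improve defensive performance. Overall score: 80. Categories: AI security, security tools, security operations, malware, security awareness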

AI "养蛊" (a gu-jar cage match): letting phishing emails and an anti-phishing system fight it out

秋名山上的小柠

蚁景网络安全

June 30, 2026, 17:40 · Hunan

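MAB

The multi-armed bandit, abbreviated MAB.

With the same environment, action, and state, the feedback may be 1 or it may be 0.

In other words, the environment's feedback is not a fixed value; it is random.

Picture five functions, i.e. five sources of feedback. The first function returns 1 with probability 20% and 0 with probability 80%.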
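
The five-function picture can be sketched roughly as follows. Only the 20% figure for the first function comes from the text; the other four probabilities, the helper name `make_feedback`, and the fixed seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable (assumption)


def make_feedback(p_one):
    """Return a function that yields 1 with probability p_one, else 0."""
    def feedback():
        return int(rng.random() < p_one)
    return feedback


# 0.2 is from the text; the remaining probabilities are made up for this sketch.
probabilities = [0.2, 0.5, 0.7, 0.4, 0.6]
feedback_functions = [make_feedback(p) for p in probabilities]

# Calling the same function repeatedly gives different answers,
# but the long-run average approaches its hidden probability.
n_calls = 10_000
observed = [sum(f() for _ in range(n_calls)) / n_calls for f in feedback_functions]
for p, freq in zip(probabilities, observed):
    print(f"p = {p:.1f}, observed frequency of 1 = {freq:.3f}")
```

The learner never sees `probabilities`; it only sees the stream of 0s and 1s, which is why it has to estimate each arm from experience.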

Code implementation:

import numpy as np
import pandas as pd


class MultiArmedBandit:
    def __init__(self, n_arms, true_rewards):
        self.n_arms = n_arms
        self.true_rewards = true_rewards
        self.estimates = np.zeros(n_arms)  # estimated mean reward of each arm
        self.action_counts = np.zeros(n_arms)  # number of times each arm was chosen

    def select_arm(self, epsilon):
        if np.random.rand() < epsilon:
            return np.random.randint(self.n_arms)  # explore
        else:
            return np.argmax(self.estimates)  # exploit

    def update_estimates(self, chosen_arm, reward):
        self.action_counts[chosen_arm] += 1
        # Update the reward estimate as a running mean
        self.estimates[chosen_arm] += (reward - self.estimates[chosen_arm]) / self.action_counts[chosen_arm]


def simulate_bandit(n_arms, true_rewards, n_rounds, epsilon):
    bandit = MultiArmedBandit(n_arms, true_rewards)
    rewards = np.zeros(n_rounds)

    for t in range(n_rounds):
        chosen_arm = bandit.select_arm(epsilon)
        reward = np.random.normal(true_rewards[chosen_arm], 1)  # normally distributed reward
        bandit.update_estimates(chosen_arm, reward)
        rewards[t] = reward

    return np.cumsum(rewards)  # cumulative reward after each round


# Parameters
n_arms = 5
true_rewards = [1.0, 1.5, 2.0, 0.5, 1.2]  # true mean reward of each arm
n_rounds = 1000
epsilon = 0.1

cumulative_rewards = simulate_bandit(n_arms, true_rewards, n_rounds, epsilon)
results_df = pd.DataFrame({
    'Round': np.arange(1, n_rounds + 1),
    'Cumulative Rewards': cumulative_rewards
})

print(results_df)

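Class definition: MultiArmedBandit

n_arms: the number of arms (slot machines)

true_rewards: the true mean reward of each arm

estimates: the current estimate of each arm's mean reward, initially all zeros

action_counts: how many times each arm has been pulled, used when updating the mean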

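Selecting an arm: select_arm(self, epsilon)

The method first draws a random number.

With probability ε it explores, i.e. picks an arm at random; with probability 1 - ε it exploits, i.e. picks the arm with the highest current estimated reward.

For example, with epsilon = 0.1:

10% of the time it explores at random;
90% of the time it picks the arm with the best estimate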
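
As a rough check of that split, the sketch below calls an ε-greedy selector many times against a frozen set of estimates and counts how often the best arm comes out. The estimate values, the standalone `select_arm` function, and the seed are assumptions for illustration; the selection rule itself is the one described above.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed (assumption)


def select_arm(estimates, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(estimates)))  # explore: any arm, uniformly
    return int(np.argmax(estimates))              # exploit: best estimate so far


estimates = np.array([1.0, 1.4, 2.1, 0.6, 1.1])  # hypothetical current estimates
epsilon = 0.1
n_draws = 20_000

picks = np.array([select_arm(estimates, epsilon) for _ in range(n_draws)])
share_best = np.mean(picks == np.argmax(estimates))

# Exploration can also land on the best arm, so its expected share is
# (1 - epsilon) + epsilon / n_arms = 0.9 + 0.1 / 5 = 0.92, not 0.90.
expected_share = (1 - epsilon) + epsilon / len(estimates)
print(f"share of pulls on the best arm: {share_best:.3f} (expected about {expected_share:.2f})")
```

The 90% figure describes the exploit branch only; because a random exploration step can also hit the currently best arm, that arm ends up slightly above 90% of all pulls.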

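Updating the estimate: update_estimates()

The update used in the code is a running mean: Q \leftarrow Q + \frac{R - Q}{N}

R is the reward actually received this time; N is the number of times this arm has been chosen; Q is the estimate of this arm's expected reward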
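
A small worked example may make the roles of R, N, and Q concrete. The reward sequence below is hypothetical; the update line is the same running-mean step that update_estimates() performs.

```python
import numpy as np

rewards = [1.0, 0.0, 1.0, 1.0, 0.0]  # hypothetical rewards from a single arm

q = 0.0  # Q: running estimate, starts at 0 like `estimates`
n = 0    # N: number of pulls so far, like `action_counts`
for r in rewards:
    n += 1
    q += (r - q) / n  # move Q toward R by a step of size 1/N
    print(f"N = {n}, R = {r}, Q = {q:.4f}")

# The incremental form gives the same number as averaging all rewards at once.
batch_mean = float(np.mean(rewards))
print(f"incremental Q = {q:.4f}, batch mean = {batch_mean:.4f}")
```

The incremental form only needs the previous Q and the count N, so no reward history has to be stored.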

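Simulation function: simulate_bandit()

Initialize a MultiArmedBandit instance;
run the experiment for n_rounds rounds;
in each round:

use select_arm() to decide which machine to pull;
draw a normally distributed reward around the true mean true_rewards[chosen_arm];
update the estimate with update_estimates();
record the current reward and the cumulative reward

The result is shown in the figure: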

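UCB

UCB (upper confidence bound) is a strategy selection method for the exploration-exploitation problem and is widely used for multi-armed bandits.

Its core idea is to estimate the potential payoff of each option, and use that estimate to balance exploring new options against exploiting the best option known so far.

Basic principles

1. Exploration and exploitation: exploration means trying new options to gather more information; exploitation means choosing the best currently known option to maximize payoff.

2. Computing the UCB value: for each option, the algorithm computes an upper confidence bound, the UCB value, which combines the observed success rate with an exploration term.

Formula (the same one used in the firewall code later on):

UCB_i = X_i + \sqrt{\frac{2 \ln n}{n_i}}

X_i is the success rate of option i, i.e. its average payoff; n is the total number of attempts so far; n_i is the number of attempts of option i

The first term is the average success rate known so far. The second term is the confidence interval: the less an option has been tried, the larger this term becomes. Think of eating at the canteen: you know the stall you have eaten at 10 times is just average, but you may still want to try a stall you have never eaten at. That is UCB's exploration mechanism.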
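
The canteen intuition can be put into numbers with the standard UCB1 score, mean plus sqrt(2 ln n / n_i), which is also what the firewall code later in the article computes. The visit counts below and the helper name `ucb_value` are made up for illustration.

```python
import math


def ucb_value(success_rate, n_total, n_i):
    """UCB value: average success rate plus an exploration bonus."""
    return success_rate + math.sqrt(2 * math.log(n_total) / n_i)


# Hypothetical canteen history: (good meals, visits)
shops = {"familiar stall": (6, 10), "barely tried stall": (0, 1)}
n_total = sum(visits for _, visits in shops.values())

scores = {
    name: ucb_value(good / visits, n_total, visits)
    for name, (good, visits) in shops.items()
}
for name, score in scores.items():
    print(f"{name}: UCB = {score:.3f}")

next_choice = max(scores, key=scores.get)
print("next visit:", next_choice)
```

Even though the barely tried stall has the worse average so far, its exploration bonus is large enough that it is chosen next; as its visit count grows, the bonus shrinks and the average takes over.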

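Application scenarios: UCB is widely used in online ad recommendation, A/B testing, dynamic pricing, machine learning model selection, and similar areas, especially where decisions and feedback happen in real time.

A plain-language explanation of UCB: take a left-handed person who, when picking something up, uses the right hand 20% of the time and the left hand 80% of the time. At the start both hands get tried, but with different probabilities, so how often each hand is chosen affects the UCB value of each side (you can think of it as a Q-table). We can then use these frequency-dependent values to dynamically adjust how likely we are to pick the side with the higher value.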

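Firewall policies

Suppose there are five firewall policies, each with a different success rate at intercepting attacks.

In a real project you would not write the success rates out explicitly. After all, if you already knew which policy intercepts best, you would simply prefer that one.

Here the probabilities are assumed to be unknown to the selector.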

import numpy as np
import pandas as pd


# Five simulated checks; the selector does not know these success rates.
def check1(payload):
    return np.random.rand() < 0.5  # 50% success rate


def check2(payload):
    return np.random.rand() < 0.7  # 70% success rate


def check3(payload):
    return np.random.rand() < 0.4  # 40% success rate


def check4(payload):
    return np.random.rand() < 0.3  # 30% success rate


def check5(payload):
    return np.random.rand() < 0.6  # 60% success rate


# Put all check functions into a list
check_functions = [check1, check2, check3, check4, check5]


# Firewall policy selector
class FirewallPolicySelector:
    def __init__(self, n_policies):
        self.n_policies = n_policies
        self.successes = np.zeros(n_policies)
        self.attempts = np.zeros(n_policies)

    def select_policy(self):
        total_attempts = np.sum(self.attempts)
        if total_attempts == 0:
            return np.random.randint(self.n_policies)  # nothing tried yet: pick at random
        # The 1e-5 avoids division by zero and gives untried policies a very large bonus
        ucb_values = self.successes / (self.attempts + 1e-5) + np.sqrt(2 * np.log(total_attempts) / (self.attempts + 1e-5))
        return np.argmax(ucb_values)  # pick the policy with the highest UCB value

    def update(self, chosen_policy, success):
        self.attempts[chosen_policy] += 1
        self.successes[chosen_policy] += success


# Simulate the firewall policy optimization process
def simulate_firewall(n_policies, n_rounds):
    policy_selector = FirewallPolicySelector(n_policies)
    results = []

    for t in range(n_rounds):
        chosen_policy = policy_selector.select_policy()
        payload = np.random.randint(0, 100)  # random stand-in for an attack sample
        success = check_functions[chosen_policy](payload)  # run the selected check
        policy_selector.update(chosen_policy, success)
        results.append((t + 1, int(chosen_policy), bool(success)))

    results_df = pd.DataFrame(results, columns=['Round', 'Chosen policy', 'Intercepted'])
    return results_df


# Parameters
n_policies = len(check_functions)  # number of policies
n_rounds = 1000

# Run the simulation
results_df = simulate_firewall(n_policies, n_rounds)

# Keep only the rounds that were intercepted
successful_results = results_df[results_df['Intercepted'] == 1]

# Print the observed success rate of each policy
print("\nSuccess rate of each policy:")
print(results_df.groupby('Chosen policy')['Intercepted'].mean())

# Show the intercepted rounds
print("\nIntercepted rounds:")
print(successful_results)

# Count how often each policy was chosen, ordered by policy index
policy_counts = results_df['Chosen policy'].value_counts().sort_index()

# Build a DataFrame of policies and their selection counts
result_df = pd.DataFrame({'Times chosen': policy_counts}).reset_index()
result_df.columns = ['Chosen policy', 'Times chosen']

# Label each row with the policy it actually describes (0-based index + 1)
result_df.index = [f'Policy {p + 1}' for p in result_df['Chosen policy']]
print(result_df)

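Firewall policy selector class: FirewallPolicySelector

n_policies: the number of policies; successes[i]: the number of successes of policy i; attempts[i]: the number of times policy i has been tried

Core of policy selection: select_policy()

This uses the UCB formula already given above.

Simulating the firewall: simulate_firewall()

The loop runs n_rounds times, for example 1000 rounds:

select a policy, simulate an attack, check whether it was intercepted, and finally update that policy's statistics

In short, this code simulates an adaptive firewall policy selection system based on UCB. By tracking each detection policy's historical success rate and attempt count, it automatically picks the most effective policy over many rounds of attacks, balances "exploring new methods" against "exploiting the known best", and eventually tends toward the policy with the highest interception rate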
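
The claim that selection drifts toward the best policy can be checked with a compact, seeded rerun. This is a sketch rather than the article's code: it reuses the five success rates of check1 to check5, but swaps in the common UCB1 initialization of trying every policy once in place of the 1e-5 smoothing, and the seed and round count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed (assumption)
true_rates = np.array([0.5, 0.7, 0.4, 0.3, 0.6])  # same rates as check1..check5
n_policies = len(true_rates)
successes = np.zeros(n_policies)
attempts = np.zeros(n_policies)

n_rounds = 10_000
for t in range(n_rounds):
    if t < n_policies:
        chosen = t  # try every policy once before using the formula
    else:
        # t equals the total number of attempts made so far
        ucb = successes / attempts + np.sqrt(2 * np.log(t) / attempts)
        chosen = int(np.argmax(ucb))
    attempts[chosen] += 1
    successes[chosen] += rng.random() < true_rates[chosen]

for i in range(n_policies):
    print(f"policy {i + 1}: chosen {int(attempts[i])} times, "
          f"observed rate {successes[i] / attempts[i]:.2f}")
most_chosen = int(np.argmax(attempts))
print("most chosen policy:", most_chosen + 1)
```

The weaker policies still receive some pulls, because their exploration bonus grows slowly while they are ignored; this is the cost UCB pays for never ruling an option out completely.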

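The result is shown in the figure:

Other scenarios fit just as well, such as malicious code detection or email screening, since they all come down to policy selection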

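Email attack and defense

Suppose role A uses an MAB model, i.e. reinforcement learning, to optimize the content of phishing emails

and role B uses Q-learning, also reinforcement learning, to identify phishing email content

You could of course put malware on one side and antivirus software on the other and run the same kind of cage match, haha

The overall flow: the attacker keeps sending different kinds of phishing emails, and the defender gradually learns while identifying them. The attacker also records which content succeeds more often and leans toward that high-success content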

import numpy as np
import pandas as pd


class PhishingContentOptimizer:
    def __init__(self, contents, phishing_probabilities, epsilon=0.1):
        self.contents = contents  # list of phishing email templates
        self.phishing_probabilities = phishing_probabilities  # probability each template is identified as phishing
        self.epsilon = epsilon  # exploration rate
        self.success_counts = np.zeros(len(contents))  # times each template evaded detection
        self.total_counts = np.zeros(len(contents))  # times each template was tried

    def select_content(self):
        if np.random.rand() < self.epsilon:
            return self.contents[np.random.randint(len(self.contents))]  # explore: random template
        else:
            success_rates = self.success_counts / (self.total_counts + 1e-5)  # avoid division by zero
            return self.contents[np.argmax(success_rates)]  # exploit: highest success rate

    def update(self, chosen_content, success):
        index = self.contents.index(chosen_content)
        self.total_counts[index] += 1
        if success:
            self.success_counts[index] += 1


class QLearningPhishingDetector:
    def __init__(self, actions, learning_rate=0.1, discount_factor=0.9, exploration_rate=1.0):
        self.q_table = {}  # Q-table: state -> list of Q-values, one per action
        self.actions = actions  # available actions
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_rate = exploration_rate
        self.exploration_decay = 0.99  # multiplicative decay of the exploration rate

    def get_action(self, state):
        if state not in self.q_table:
            self.q_table[state] = [0] * len(self.actions)
        if np.random.rand() < self.exploration_rate:
            return np.random.choice(self.actions)  # explore
        else:
            return self.actions[np.argmax(self.q_table[state])]  # exploit

    def update_q_value(self, state, action, reward, next_state):
        current_q = self.q_table[state]
        max_future_q = max(self.q_table.get(next_state, [0] * len(self.actions)))
        # Move Q(state, action) toward reward + discount * best future value
        current_q[action] += self.learning_rate * (reward + self.discount_factor * max_future_q - current_q[action])

    def decay_exploration(self):
        self.exploration_rate *= self.exploration_decay


# Sample phishing email templates
contents = [
    "Your account shows unusual activity. Please verify it immediately.",
    "Congratulations, you have won a prize. Click the link to claim it.",
    "Important notice: please update your account information.",
    "You have a new message. Please check it.",
    "System upgrade: please confirm your information.",
]

# Probability that each template is identified as phishing
phishing_probabilities = {
    contents[0]: 0.1,
    contents[1]: 0.3,
    contents[2]: 0.6,
    contents[3]: 0.5,
    contents[4]: 0.4,
}

# Role A: the content optimizer
optimizer = PhishingContentOptimizer(contents, phishing_probabilities)

# Role B: the phishing email detector
actions = [0, 1]  # 0: normal email, 1: phishing email
detector = QLearningPhishingDetector(actions)

# Pretraining phase
pretrain_steps = 50
for _ in range(pretrain_steps):
    chosen_content = contents[np.random.randint(len(contents))]  # random template
    action = detector.get_action(chosen_content)  # detector's verdict
    # A "phishing" verdict succeeds with the template's detection probability
    detected = action == 1 and np.random.rand() < phishing_probabilities[chosen_content]
    reward = 1 if detected else -1
    detector.update_q_value(chosen_content, action, reward, chosen_content)
    detector.decay_exploration()

# Simulate 100 phishing attacks; both sides learn after every email
results = []
for _ in range(100):
    chosen_content = optimizer.select_content()
    action = detector.get_action(chosen_content)
    detected = action == 1 and np.random.rand() < phishing_probabilities[chosen_content]

    # Role A succeeds when the email was not caught
    optimizer.update(chosen_content, not detected)

    # Role B is rewarded for a successful detection and penalized otherwise
    reward = 1 if detected else -1
    detector.update_q_value(chosen_content, action, reward, chosen_content)
    detector.decay_exploration()

    results.append({
        'Chosen content': chosen_content,
        'Detection result': 'phishing email' if action == 1 else 'normal email'
    })

results_df = pd.DataFrame(results)
print(results_df)

# How often each template was used
content_counts = results_df['Chosen content'].value_counts()
most_used_content = content_counts.idxmax()
most_used_count = content_counts.max()

# Rounds that used the most frequent template
most_used_results = results_df[results_df['Chosen content'] == most_used_content]

print(f"\nMost used content: {most_used_content}, times used: {most_used_count}")
print("\nResults for the most used content:")
print(most_used_results)

# Share of emails the detector judged to be normal
normal_email_count = results_df[results_df['Detection result'] == 'normal email'].shape[0]
total_count = results_df.shape[0]
normal_email_percentage = (normal_email_count / total_count) * 100

print(f"\nShare of emails judged normal: {normal_email_percentage:.2f}%")

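Phishing content optimizer: PhishingContentOptimizer

contents: all phishing email templates;

phishing_probabilities: the probability that each template is identified as phishing, i.e. how easily it is seen through;

epsilon: the exploration rate of the ε-greedy algorithm; 0.1 means a 10% chance of random exploration;

success_counts: how many times each email "successfully fooled the detector";

total_counts: how many times each template has been used

Choosing content: select_content

Before sending each email, the optimizer decides which content to send based on historical success rates:

90% of the time it picks the email with the highest success rate;
10% of the time it picks one at random to explore new possibilities

This way the attacker gradually focuses on the most effective email content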

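Phishing email detector: QLearningPhishingDetector

Updating the Q-value: update_q_value

After taking action a in state s and receiving reward r, with max_future_q being the largest potential value of the next state s', the current Q-value is nudged a little toward the new expected value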
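
One update step can be traced by hand. The sketch below applies the standard Q-learning rule, Q(s, a) ← Q(s, a) + α [r + γ max Q(s', ·) - Q(s, a)], with the class defaults α = 0.1 and γ = 0.9; the starting Q-values and the state name are hypothetical.

```python
learning_rate = 0.1    # alpha, the class default
discount_factor = 0.9  # gamma, the class default

# Hypothetical Q-table with one state; index 0 = "normal", index 1 = "phishing".
q_table = {"email_template": [0.0, 0.5]}

state = next_state = "email_template"  # the article's loop passes the same state as next_state
action = 1   # the detector said "phishing"
reward = 1   # and the detection succeeded

current_q = q_table[state][action]
max_future_q = max(q_table[next_state])
target = reward + discount_factor * max_future_q  # r + gamma * max Q(s', .)
q_table[state][action] = current_q + learning_rate * (target - current_q)

print(f"target = {target:.3f}, updated Q = {q_table[state][action]:.3f}")
```

With a learning rate of 0.1, the Q-value moves only a tenth of the way toward the target, from 0.5 to 0.595, which is the "nudged a little" part of the description.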

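Pretraining phase: let the detector learn first

Simulate 50 training emails so the detector gets a first idea of which emails are likely to be identified as phishing

Reward logic:

flagged as phishing and the identification holds → reward +1;
otherwise → penalty -1

If the detector judges an email to be a "phishing email", the corresponding probability decides whether the identification really succeeds; otherwise the identification counts as failed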
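
The rule above can be isolated in a small helper to make its edge cases visible. The function name `detection_outcome` and the seed are assumptions; the logic follows the description: only a "phishing" verdict can succeed, and only with the template's probability.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed (assumption)


def detection_outcome(action, phishing_probability):
    """Return (detected, reward) for one email under the rule described above."""
    # Only a "phishing" verdict (action == 1) can succeed, and then only
    # with the template's detection probability.
    detected = action == 1 and rng.random() < phishing_probability
    reward = 1 if detected else -1
    return bool(detected), reward


# A "normal" verdict never counts as a detection, whatever the probability.
print(detection_outcome(action=0, phishing_probability=0.9))
# A "phishing" verdict on a template with probability 1.0 always succeeds.
print(detection_outcome(action=1, phishing_probability=1.0))
# A "phishing" verdict on a template with probability 0.0 never succeeds.
print(detection_outcome(action=1, phishing_probability=0.0))
```

Every email in this simulation is a phishing email, so under this rule the "normal" verdict is always penalized; the detector is effectively learning how often saying "phishing" pays off for each template.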

Then the attacker updates that email's success rate, the defender updates its Q-value, and the exploration rate keeps decaying
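
How far the exploration rate actually decays can be worked out directly. The sketch assumes the values used in the article's code: an initial rate of 1.0, a decay factor of 0.99 per email, 50 pretraining emails, and 100 attack rounds.

```python
exploration_rate = 1.0     # initial value, the class default
exploration_decay = 0.99   # multiplier applied after every email

pretrain_steps = 50
attack_rounds = 100

rate_after_pretraining = exploration_rate * exploration_decay ** pretrain_steps
rate_after_attacks = rate_after_pretraining * exploration_decay ** attack_rounds

print(f"after {pretrain_steps} pretraining emails: {rate_after_pretraining:.3f}")
print(f"after {attack_rounds} more attack rounds: {rate_after_attacks:.3f}")
```

Under these settings the detector still acts at random in roughly 60% of cases when the attack phase begins and in about 22% at the end, so a good share of its verdicts in a 100-round run reflect exploration, not what it has learned.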

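Note that there is a pretraining step here: the phishing email detector runs first and learns a few things, telling apart which emails are phishing and which are normal

Only then is the phishing attack itself simulated; the results are shown in the figure below

The results look fairly scattered and not very realistic. The Q-learning part could be replaced with a neural network

A GAN is essentially the same process of one AI sparring with another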

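Disclaimer:

The programs and techniques described in this article are intended only for lawful, compliant security research and teaching. They aim to improve cybersecurity defenses and are clearly technical research in nature.

Any organization or individual that uses the content of this article without authorization for attacks, sabotage, or other illegal purposes bears all resulting legal liability, civil compensation, and joint liability alone; this site bears no joint liability.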

Reposted from: 蚁景网络安全, by 秋名山上的小柠, 《Ai养蛊:让钓鱼邮件和反钓鱼邮件系统打一架》