2026-06-30 07:31:29 网络安全文章来源：ZONE.CI 全球网 0 阅读模式

文章总结： 本文详细解析Redis缓存穿透、击穿和雪崩三大问题的核心区别与防护策略。穿透指查询不存在数据导致数据库压力，可通过缓存空值、布隆过滤器拦截；击穿是热点key过期瞬间大量请求回源，需加互斥锁或逻辑过期；雪崩因大量key同时过期或缓存故障，应随机化过期时间并构建多级缓存。文档提供Java/Python代码示例、Redis配置方案及四层防护体系，强调监控告警与故障演练的重要性。 综合评分： 82 文章分类： 解决方案,实战经验,安全建设,性能优化,其他


Redis Cache Penetration, Breakdown, and Avalanche: Interview Essentials and Production Defenses


马哥Linux运维

2026年6月27日 13:00 广东



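1. Background

Caching is standard practice for speeding up backend services, and Redis is the first choice at most internet companies. But once a cache sits in front of a database, almost every team eventually lives through the story of "the cache stopped helping and the database went down". The three most common forms of this incident are known as cache penetration, cache breakdown, and cache avalanche.

All three are staple interview questions and frequent production failures. On the surface each looks like "the cache missed", but the root causes differ and so do the defenses; treating them as one problem makes it easy to miss the point.

Typical scenarios look like this:

Penetration: a campaign page carries a user ID in the URL. An attacker scripts a sequential walk over IDs 1-999999 against the detail endpoint. The cache holds none of these entries and most of them do not exist in the database either (the users do not exist), yet every request reaches the database. Database QPS jumps from 200 to 5000 and CPU hits 100%.
Breakdown: the cache entry for one blockbuster item on the product detail page (an iPhone launch, a Double 11 bestseller) happens to expire, and tens of thousands of QPS bypass the cache and hit the database. The connection pool fills instantly and every endpoint times out.
Avalanche: caches bulk-warmed at 3 a.m. all get the same 6-hour TTL. Six hours later every key expires at once and traffic on the order of tens of millions of QPS falls back to the database. The database collapses and the whole site returns 502.

Root cause, symptoms, and defenses differ from scenario to scenario. This article takes the three problems apart one by one and grounds each in concrete commands, configuration, scripts, and rollback steps.

Goals of this article:

Let junior and mid-level engineers explain the difference between penetration, breakdown, and avalanche in an interview
Let readers recognize which of the three they are facing in production
Let readers design defenses for their own business scenario
Provide Lua scripts, configuration examples, and monitoring alerts that can be put to use directly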

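2. Applicable Scenarios

Any service that uses Redis as its cache
High-QPS endpoints (product detail, user profile, recommendation feeds, feed streams)
Marketing campaigns, flash sales, limited-time offers, launches, new product releases
Businesses with large user bases and key counts (hundreds of millions of keys)
Businesses with a clear hot/cold split (20% hot data, 80% cold data)
Cache warm-up, scheduled refresh, bulk loading
Multi-level cache architectures (local cache + Redis)
Highly available Redis architectures (replication, Sentinel, Cluster)
Abuse scenarios (CC attacks, crawlers, fake orders, credential stuffing)
Services with small database connection pools (the easiest to knock over when the cache fails)
Businesses with strict cache consistency requirements (finance, inventory, flash sales)
Cache capacity planning and scaling
Choosing a Redis persistence strategy (RDB / AOF / hybrid)
Big-key and hot-key governance
Redis monitoring, alerting, and failure drills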

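3. Core Concepts

3.1 Core differences between the three problems

| Problem | What misses | Typical cause | Typical symptom |
| --- | --- | --- | --- |
| Penetration | Keys that exist in neither the cache nor the database | Sequential-ID scans, attacks, cleaned-up data | Database returns mostly empty results at high QPS |
| Breakdown | A single hot key | The hot key expires, is deleted, or is evicted | Single-key QPS spike, the same SQL repeated in the slow log |
| Avalanche | Many keys at once, or the whole cache | Identical TTLs, Redis outage or failover | Hit rate falls off a cliff, database QPS surges |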

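3.2 Cache penetration in detail

Cache penetration means querying data that exists in neither the cache nor the database. Every such request bypasses the cache and lands on the database.

Why such data exists:

Business design: user IDs, order numbers, and SKUs increase monotonically, so an attacker can enumerate them
System migration: old data has been cleaned up and new data has not been warmed yet
Data cleanup: expired cold data is purged periodically
Attacks: abusers deliberately probe with nonexistent IDs

Signs of penetration:

The Redis hit rate is normal or somewhat low, and the number of database queries roughly matches the number of Redis misses
Database queries all return empty results, yet QPS is high
Database connection pool usage rises, but not because of slow SQL
Redis memory usage does not change (there is nothing to cache)

3.3 Cache breakdown in detail

Cache breakdown means that at the moment a hot key expires, a large number of requests hit the database simultaneously.

Why it happens:

The hot key has a TTL and expires when it comes due
The hot key is deleted explicitly with DEL
The hot key is evicted under memory pressure (maxmemory-policy)

Signs of breakdown:

QPS on a single key spikes
Database connection pool usage surges
The same SQL statement appears in the database slow log in large numbers
That one key is missing from Redis while all other keys are still present

3.4 Cache avalanche in detail

Cache avalanche means that a large number of keys expire at the same time, or the cache as a whole becomes unavailable, so that all requests fall back to the database.

Common causes:

Bulk warm-up assigns the same TTL to every key
The cache service goes down as a whole (Redis process crash, full disk, OOM, network partition)
A Redis failover causes a brief outage
Deleting a big key blocks Redis

Signs of avalanche:

Many keys miss at the same time
Database QPS surges instantly
Redis monitoring shows the hit rate falling off a cliff
Redis process, memory, or network anomalies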

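3.5 Key Redis mechanisms

3.5.1 Expiration

Redis expires keys with lazy deletion plus periodic deletion:

Lazy deletion: a key is checked for expiry only when it is accessed
Periodic deletion: a background task runs 10 times per second by default (`hz 10`, i.e. every 100 ms) and samples a portion of the keys that have a TTL

If many keys expire at the same moment, the first access to each one finds it gone and falls through to the database, and periodic deletion does not guarantee that every expired key is removed promptly.

3.5.2 Eviction policy

maxmemory-policy decides which keys are evicted when memory is full:

noeviction: reject writes (default)
allkeys-lru: evict the least recently used key among all keys
allkeys-lfu: evict the least frequently used key among all keys (Redis 4.0+)
volatile-lru: evict the least recently used key among keys that have a TTL set
volatile-lfu: evict the least frequently used key among keys that have a TTL set (Redis 4.0+)
allkeys-random: evict a random key among all keys
volatile-random: evict a random key among keys that have a TTL set
volatile-ttl: evict the key with the shortest remaining TTL among keys that have a TTL set

Production environments most often use allkeys-lru or volatile-lru.

3.5.3 Persistence

RDB: periodic snapshots; fast to restore on restart, but writes made after the last snapshot can be lost
AOF: logs every write; safer, but the file is larger and recovery is slower
Hybrid persistence (Redis 4.0+): an RDB snapshot followed by incremental AOF

3.5.4 High-availability architectures

Replication: a master with replicas, which also allows read/write splitting
Sentinel: automatic failover on top of replication
Cluster: sharded cluster with horizontal scaling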

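3.6 Common misconceptions

Misconception 1: a high cache hit rate means everything is fine.

Wrong. The hit rate is an aggregate over all keys. When a single hot key breaks down it barely moves (in case 9.1 it stays at 98% while the database is saturated). Penetration does show up as misses, because lookups for nonexistent keys count toward keyspace_misses, but a scan against one endpoint can be diluted by healthy traffic elsewhere. The hit rate has to be read together with database QPS and per-key traffic.

Misconception 2: a Bloom filter solves cache penetration completely.

Wrong. A Bloom filter has a false-positive rate, so a badly sized filter still lets some requests through; it also has to be populated first, which leaves an unprotected window at startup.

Misconception 3: cache breakdown only needs a lock.

Wrong. A lock makes requests queue and raises the risk of timeouts. Combine it with logical expiry, early refresh, or never-expiring keys.

Misconception 4: cache avalanche only needs a random offset on the TTL.

Wrong. Redis availability, multi-level caching, and circuit breaking with degradation also have to be considered.

Misconception 5: once Redis is clustered there is nothing left to worry about.

Wrong. Cluster still jitters during failover, slot migration, and node scaling, so the application needs its own fault tolerance.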

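4. Overall Troubleshooting and Defense Approach

4.1 Three-step triage

When a cache alert fires, work through these three steps:

1. Determine which of penetration, breakdown, or avalanche it is
2. Collect concrete evidence: miss rate, key distribution, database QPS, slow SQL
3. Apply a defense or a temporary stopgap

How to tell them apart:

Single-key QPS spike + database load concentrated on one SQL statement → breakdown
Many keys missing at once + overall database pressure → avalanche
Database returning many empty results + low cache hit rate → penetration
Redis entirely unavailable + heavy database pressure → avalanche (infrastructure level)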
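The decision rules above can be written down as a small lookup, which is handy as a runbook helper. This is only a sketch: the `Symptoms` fields and the `classify` function are hypothetical names introduced here, and real incidents often show mixed signals that need human judgment.

```python
from dataclasses import dataclass


@dataclass
class Symptoms:
    """Observed signals; every field name here is illustrative."""
    redis_available: bool          # can the application reach Redis at all?
    single_key_qps_spike: bool     # one key dominates the traffic increase
    mass_key_miss: bool            # many different keys miss at the same time
    db_mostly_empty_results: bool  # most database lookups return no rows


def classify(s: Symptoms) -> str:
    """Map symptoms to a cache problem, mirroring the rules listed above."""
    if not s.redis_available:
        return "avalanche (infrastructure)"
    if s.mass_key_miss:
        return "avalanche"
    if s.single_key_qps_spike:
        return "breakdown"
    if s.db_mostly_empty_results:
        return "penetration"
    return "unknown"


verdict = classify(Symptoms(True, True, False, False))
```

The checks run from the widest blast radius to the narrowest, so a Redis outage is never misread as a single-key problem.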

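4.2 Layers of defense

Defense has four layers, from the outside in:

Client-side rate limiting → access layer (WAF/CDN/gateway) → application layer (business-level protection) → data layer (database protection)

Every layer needs its own protection; do not rely on one layer alone.

4.3 General principles

Every caching strategy needs a fallback for "the cache has failed"
Every database access path needs a fallback for "the cache was bypassed"
Every Redis operation needs a fallback for "Redis is unavailable"
Defenses must support gradual rollout, rollback, and verification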

5. Hands-On Defenses

5.1 缓存穿透的实战防护

5.1.1 Option 1: cache null values

Core idea: when the database returns nothing, still write a marker into the cache.

java

// Java example (redis, db, and serialize are placeholder helpers)
public Object getData(String key) {
    Object value = redis.get(key);
    if (value != null) {
        return value;
    }

    // A null marker means the database was checked recently and had no row,
    // so answer from the marker instead of querying again
    if (redis.exists("null:" + key)) {
        return null;
    }

    // Query the database
    Object data = db.query(key);

    if (data == null) {
        // Cache the null marker with a short TTL
        redis.setex("null:" + key, 60, "1");
    } else {
        redis.setex(key, 3600, serialize(data));
    }

    return data;
}

Notes:

Give null markers their own key prefix (for example null:xxx)
The TTL of a null marker should be much shorter than that of a real value (60 s vs 3600 s)
When the row is actually written to the database, delete the null marker

Risks:

An attacker who keeps sending nonexistent keys will fill the cache with null markers
Mitigation: cap the number of null-marker keys, or use a Bloom filter

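5.1.2 Option 2: Bloom filter

Core idea: load every existing key into a Bloom filter ahead of time and check each incoming request against the filter first.

Properties of a Bloom filter:

False positives are possible (an absent item may be reported as present)
False negatives are not (a present item is always reported as present)
Small memory footprint
Insert and lookup are both O(1)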
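To make these properties concrete, here is a minimal in-process Bloom filter. It is a teaching sketch, not the RedisBloom implementation: the `BloomFilter` class, its double-hashing scheme over SHA-256, and the chosen sizes are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter backed by a bit array."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(num_bits=100_000, num_hashes=7)
for uid in range(1, 1001):
    bf.add(f"user:{uid}")
```

Every added ID is reported as present, while an ID that was never added is rejected unless all of its bit positions happen to collide with bits set by other items. That collision is the false positive, and it is why a "maybe" answer still has to be checked against the cache and database.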

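RedisBloom is a separate module that can be loaded into Redis 4.0+; it is not part of the redis-tools package:

bash

# redis-tools only provides redis-cli; it does not contain RedisBloom
# Debian/Ubuntu
apt install redis-tools
# Obtain redisbloom.so separately (build it from source, or use a distribution that bundles it such as Redis Stack), then load it
redis-server --loadmodule /path/to/redisbloom.so

# Create a Bloom filter
BF.RESERVE myfilter 0.01 1000000
# 0.01: target false-positive rate (1%)
# 1000000: expected number of items

# Add items
BF.ADD myfilter user:100
BF.MADD myfilter user:101 user:102 user:103

# Query
BF.EXISTS myfilter user:100
# 1 means the item may exist
# 0 means the item definitely does not exist

# Batch query
BF.MEXISTS myfilter user:101 user:200 user:300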

应用代码：

java

public&nbsp;Object&nbsp;getData(String key)&nbsp;{
&nbsp; &nbsp;&nbsp;// 1. 布隆过滤器判断
&nbsp; &nbsp;&nbsp;Boolean&nbsp;exists&nbsp;=&nbsp;redisTemplate.execute((RedisCallback<Boolean>) conn ->
&nbsp; &nbsp; &nbsp; &nbsp; conn.scriptingCommands().eval(
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;"return redis.call('BF.EXISTS', KEYS[1], ARGV[1])".getBytes(),
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ReturnType.BOOLEAN,&nbsp;1,
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;"myfilter".getBytes(),
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; key.getBytes()
&nbsp; &nbsp; &nbsp; &nbsp; )
&nbsp; &nbsp; );

&nbsp; &nbsp;&nbsp;if&nbsp;(exists ==&nbsp;null&nbsp;|| !exists) {
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;return&nbsp;null; &nbsp;// 一定不存在，直接返回
&nbsp; &nbsp; }

&nbsp; &nbsp;&nbsp;// 2. 查 Redis
&nbsp; &nbsp;&nbsp;Object&nbsp;value&nbsp;=&nbsp;redis.get(key);
&nbsp; &nbsp;&nbsp;if&nbsp;(value !=&nbsp;null) {
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;return&nbsp;value;
&nbsp; &nbsp; }

&nbsp; &nbsp;&nbsp;// 3. 查数据库
&nbsp; &nbsp;&nbsp;Object&nbsp;data&nbsp;=&nbsp;db.query(key);
&nbsp; &nbsp;&nbsp;if&nbsp;(data !=&nbsp;null) {
&nbsp; &nbsp; &nbsp; &nbsp; redis.setex(key,&nbsp;3600, serialize(data));
&nbsp; &nbsp; }

&nbsp; &nbsp;&nbsp;return&nbsp;data;
}

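Risks and notes:

The filter has to be populated first, so there is an unprotected window at startup
Do not set the false-positive rate lower than needed: the filter grows with ln(1/p), so going from 1% to 0.01% roughly doubles the bits per item
A Bloom filter cannot delete items; removing keys means building a new filter (BF.RESERVE plus re-adding every live key) and switching over, which is cumbersome
Not suitable for data sets with frequent inserts and deletes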
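The memory cost of a given false-positive rate can be estimated with the standard Bloom filter formulas, m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions. The sketch below applies them to the `BF.RESERVE myfilter 0.01 1000000` example. The helper name `bloom_size` is made up here, and RedisBloom's actual allocation may differ from this theoretical figure.

```python
import math
from typing import Tuple


def bloom_size(expected_items: int, error_rate: float) -> Tuple[int, int]:
    """Return (bits, hash_count) for an optimally sized classic Bloom filter."""
    bits = math.ceil(-expected_items * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / expected_items * math.log(2)))
    return bits, hashes


bits_1pct, k_1pct = bloom_size(1_000_000, 0.01)      # 1% false positives
bits_001pct, k_001pct = bloom_size(1_000_000, 0.0001)  # 0.01% false positives

for label, bits in (("1%", bits_1pct), ("0.01%", bits_001pct)):
    print(f"{label}: {bits / 8 / 1024 / 1024:.2f} MiB")
```

For one million items this comes to roughly 9.6 bits per item at 1% and roughly 19.2 bits per item at 0.01%, so tightening the rate by a factor of 100 about doubles the memory.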

5.1.3 方案三：参数合法性校验

核心思想：在应用层先做基本校验，过滤明显非法的请求。

python

# Python 示例
def&nbsp;get_user_info(user_id):
&nbsp; &nbsp;&nbsp;# 1. 基础校验
&nbsp; &nbsp;&nbsp;if&nbsp;not&nbsp;user_id&nbsp;or&nbsp;not&nbsp;user_id.isdigit():
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;return&nbsp;None
&nbsp; &nbsp; user_id =&nbsp;int(user_id)
&nbsp; &nbsp;&nbsp;if&nbsp;user_id <&nbsp;1&nbsp;or&nbsp;user_id >&nbsp;99999999:
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;return&nbsp;None

&nbsp; &nbsp;&nbsp;# 2. 查缓存
&nbsp; &nbsp; cache_key =&nbsp;f"user:{user_id}"
&nbsp; &nbsp; data = redis.get(cache_key)
&nbsp; &nbsp;&nbsp;if&nbsp;data:
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;return&nbsp;data

&nbsp; &nbsp;&nbsp;# 3. 查数据库
&nbsp; &nbsp; data = db.query("SELECT * FROM users WHERE id = %s", user_id)
&nbsp; &nbsp;&nbsp;if&nbsp;data:
&nbsp; &nbsp; &nbsp; &nbsp; redis.setex(cache_key,&nbsp;3600, data)
&nbsp; &nbsp;&nbsp;return&nbsp;data

适用场景：

参数有明确范围（用户 ID、订单号、SKU）
业务允许提前校验
不适合完全无规律的数据（搜索词）

5.1.4 Option 4: rate limiting

Limit QPS per IP, per user, or per endpoint, and reject requests over the limit outright.

java

// Sliding-window rate limiting on a Redis sorted set.
// The four calls are not atomic; under concurrency, run the same logic as one Lua script (see 6.6).
public boolean rateLimit(String key, int limit, int windowSec) {
    long now = System.currentTimeMillis();
    long windowStart = now - windowSec * 1000L;

    // Record this request; the member must be unique, otherwise requests
    // arriving in the same millisecond collapse into a single entry
    redis.zadd(key, now, now + "-" + UUID.randomUUID());
    // Remove entries that fell out of the window
    redis.zremrangeByScore(key, 0, windowStart);
    // Count the requests inside the window
    long count = redis.zcard(key);
    // Let idle keys expire
    redis.expire(key, windowSec);

    return count <= limit;
}
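The sliding-window logic is easier to reason about without Redis in the way. The sketch below keeps the timestamps in an in-process `deque` instead of a sorted set. `SlidingWindowLimiter` is a hypothetical class for illustration, and unlike the Redis version it only limits within a single process.

```python
from collections import deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within any `window_ms` span."""

    def __init__(self, limit: int, window_ms: int) -> None:
        self.limit = limit
        self.window_ms = window_ms
        self.events: Dict[str, Deque[int]] = {}

    def allow(self, key: str, now_ms: int) -> bool:
        q = self.events.setdefault(key, deque())
        # Drop timestamps that fell out of the window (same role as ZREMRANGEBYSCORE).
        while q and q[0] <= now_ms - self.window_ms:
            q.popleft()
        if len(q) >= self.limit:
            return False  # over the limit; rejected requests are not recorded
        q.append(now_ms)
        return True


limiter = SlidingWindowLimiter(limit=3, window_ms=1000)
decisions = [limiter.allow("ip:1.2.3.4", t) for t in (0, 1, 2, 3, 1000, 1001)]
```

Checking the count before recording the request means a client that keeps hammering after being rejected does not extend its own penalty. Whether that is desirable is a policy choice.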

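5.1.5 Comparison

| Option | Best suited for | Pros | Cons |
| --- | --- | --- | --- |
| Cache null values | Attacks that are not sustained | Simple | A sustained attack fills memory |
| Bloom filter | Known key set | Small footprint, fast | Empty window at startup, no deletion |
| Parameter validation | Parameters follow a pattern | Highest performance | Does not fit every scenario |
| Rate limiting | Any scenario | General-purpose | May throttle legitimate users |

Recommended production combination: Bloom filter + cached null values + rate limiting.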

5.2 缓存击穿的实战防护

5.2.1 Option 1: mutex (distributed lock)

Core idea: only one thread loads the data; the others wait.

java

public Object getData(String key) throws InterruptedException {
    // Bounded retries instead of unbounded recursion
    for (int attempt = 0; attempt < 20; attempt++) {
        Object value = redis.get(key);
        if (value != null) {
            return value;
        }

        // Try to take the lock: SET lockKey lockValue NX EX 10
        String lockKey = "lock:" + key;
        String lockValue = UUID.randomUUID().toString();
        // A Jedis-style client is assumed, where "OK" signals success
        boolean locked = "OK".equals(redis.set(lockKey, lockValue, "NX", "EX", 10));

        if (!locked) {
            // Another caller is loading; wait briefly, then re-check the cache
            Thread.sleep(50);
            continue;
        }

        try {
            // Double check
            value = redis.get(key);
            if (value != null) {
                return value;
            }

            // Query the database
            value = db.query(key);
            if (value != null) {
                redis.setex(key, 3600, serialize(value));
            }
            return value;
        } finally {
            // Release the lock only if we still own it (Lua keeps compare-and-delete atomic)
            String script = "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";
            redis.eval(script, 1, lockKey, lockValue);
        }
    }
    // Gave up waiting: fail fast rather than pile onto the database
    throw new IllegalStateException("cache rebuild timed out for key " + key);
}

Notes:

The lock value must be unique (a UUID) so that one caller cannot delete another caller's lock
The lock must have an expiry to avoid deadlock
Releasing the lock must go through a Lua script to stay atomic
Keep the wait short so that requests do not pile up

Risks:

Waiting for the lock raises request latency
If the lock expires before the load finishes, several threads may load at the same time
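The effect of the mutex can be demonstrated inside one process with ordinary thread locks. This sketch is not a distributed lock: `SingleFlightCache`, `slow_loader`, and the key name are illustrative, and a per-key `threading.Lock` stands in for `SET NX EX`.

```python
import threading
import time


class SingleFlightCache:
    """In-process cache where only one thread per key runs the loader."""

    def __init__(self, loader):
        self._loader = loader
        self._data = {}
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        value = self._data.get(key)
        if value is not None:
            return value
        with self._lock_for(key):
            value = self._data.get(key)  # double check after acquiring the lock
            if value is None:
                value = self._loader(key)
                self._data[key] = value
            return value


calls = []


def slow_loader(key):
    calls.append(key)   # record every trip to the "database"
    time.sleep(0.05)    # simulate query latency
    return f"value-for-{key}"


cache = SingleFlightCache(slow_loader)
results = []
threads = [
    threading.Thread(target=lambda: results.append(cache.get("product:sku123")))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Twenty concurrent readers produce one load. The double check after acquiring the lock is what prevents the waiting threads from each querying again once the first one finishes.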

5.2.2 Option 2: logical expiry

Core idea: the key carries no Redis TTL; the application stores an expireAt field inside the value and checks it itself.

java

public class CacheData<T> {
    private T data;
    private long expireAt;  // logical expiry time (epoch milliseconds)
    // getters and setters omitted
}

public Object getData(String key) {
    CacheData data = redis.get(key);
    if (data == null) {
        // Nothing cached (cold start or eviction): load from the DB.
        // This path can still break down, so guard it with the mutex from 5.2.1.
        return loadFromDb(key);
    }

    if (data.getExpireAt() > System.currentTimeMillis()) {
        return data.getData();  // still fresh
    }

    // Logically expired: refresh in the background and serve the old value
    asyncRefresh(key);
    return data.getData();
}

private void asyncRefresh(String key) {
    // Let only one caller trigger the refresh; without this guard every request
    // arriving after the logical expiry would submit its own database query
    String lockKey = "refresh:" + key;
    if (!"OK".equals(redis.set(lockKey, "1", "NX", "EX", 10))) {
        return;
    }
    executor.submit(() -> {
        try {
            Object newData = db.query(key);
            CacheData cache = new CacheData();
            cache.setData(newData);
            cache.setExpireAt(System.currentTimeMillis() + 3600 * 1000);
            redis.set(key, cache);
        } finally {
            redis.del(lockKey);
        }
    });
}

Pros: breakdown does not occur, because a value can always be returned (even a stale one).

Cons:

Data may not be the latest (users must accept inconsistency within the window)
More complex to implement
Asynchronous refresh consumes extra resources

5.2.3 方案三：永不过期 + 后台刷新

核心思想：key 不设 TTL，由后台定时刷新。

java

@Scheduled(fixedRate = 60000)&nbsp;&nbsp;// 每 60 秒刷新一次
public&nbsp;void&nbsp;refreshHotKeys()&nbsp;{
&nbsp; &nbsp;&nbsp;for&nbsp;(String key : hotKeys) {
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;Object&nbsp;data&nbsp;=&nbsp;db.query(key);
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;CacheData&nbsp;cache&nbsp;=&nbsp;new&nbsp;CacheData();
&nbsp; &nbsp; &nbsp; &nbsp; cache.setData(data);
&nbsp; &nbsp; &nbsp; &nbsp; cache.setExpireAt(Long.MAX_VALUE); &nbsp;// 永不过期
&nbsp; &nbsp; &nbsp; &nbsp; redis.set(key, cache);
&nbsp; &nbsp; }
}

优点：彻底避免击穿。

缺点：

数据更新滞后
占用内存
需要识别”热 key”

5.2.4 方案四：提前刷新

核心思想：在 key 过期前主动刷新。

java

@Scheduled(fixedRate = 30000)
public&nbsp;void&nbsp;preRefreshHotKeys()&nbsp;{
&nbsp; &nbsp;&nbsp;for&nbsp;(String key : hotKeys) {
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;CacheData&nbsp;data&nbsp;=&nbsp;redis.get(key);
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;if&nbsp;(data ==&nbsp;null) {
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;continue;
&nbsp; &nbsp; &nbsp; &nbsp; }
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;long&nbsp;remaining&nbsp;=&nbsp;data.getExpireAt() - System.currentTimeMillis();
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;if&nbsp;(remaining <&nbsp;300000) { &nbsp;// 剩余不到 5 分钟
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;// 异步刷新
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; asyncRefresh(key);
&nbsp; &nbsp; &nbsp; &nbsp; }
&nbsp; &nbsp; }
}

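5.2.5 Comparison

| Option | Pros | Cons | Best suited for |
| --- | --- | --- | --- |
| Mutex | Strong consistency | Lock-wait latency | Read-after-write |
| Logical expiry | Non-blocking | Data may be stale | Brief inconsistency is acceptable |
| Never expire | Simple | Data lags behind | Delayed updates are tolerated |
| Early refresh | Non-blocking | Complex to implement | Hot keys are known in advance |

Recommended for production: logical expiry combined with a mutex. Logical expiry is the safety net and the mutex keeps the reload consistent.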

5.3 缓存雪崩的实战防护

5.3.1 Option 1: randomize TTLs

Core idea: keep large numbers of keys from expiring at the same time.

java

int baseTtl = 3600;  // base TTL in seconds
int randomTtl = new Random().nextInt(301);  // 0-300 s of jitter (the bound is exclusive)
int realTtl = baseTtl + randomTtl;
redis.setex(key, realTtl, data);

python

# Python
import random
base_ttl = 3600
real_ttl = base_ttl + random.randint(0, 300)  # randint includes both ends
redis.setex(key, real_ttl, data)

Notes:

Keep the random range reasonable (too wide and some keys live far longer than intended)
Critical business data can use a wider random range
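A quick simulation shows how much the jitter flattens the expiry peak. The numbers are illustrative (100,000 keys written in the same second, base TTL 3600 s, up to 300 s of jitter), and `jittered_ttl` is a helper name introduced for this sketch.

```python
import random
from collections import Counter


def jittered_ttl(base_ttl: int, max_jitter: int, rng: random.Random) -> int:
    """Base TTL plus a uniform random offset in [0, max_jitter] seconds."""
    return base_ttl + rng.randint(0, max_jitter)


rng = random.Random(42)  # seeded so the simulation is repeatable
ttls = [jittered_ttl(3600, 300, rng) for _ in range(100_000)]

# Largest number of keys that expire within the same second.
peak_with_jitter = max(Counter(ttls).values())
peak_without_jitter = len(ttls)  # with a fixed TTL every key expires together
```

With a fixed TTL all 100,000 keys expire in one second. With 300 s of jitter the load is spread over 301 possible seconds, so the worst second sees on the order of a few hundred expirations.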

5.3.2 方案二：多级缓存

核心思想：本地缓存 + Redis，本地缓存兜底。

java

public&nbsp;Object&nbsp;getData(String key)&nbsp;{
&nbsp; &nbsp;&nbsp;// L1: 本地缓存（Caffeine / Guava）
&nbsp; &nbsp;&nbsp;Object&nbsp;value&nbsp;=&nbsp;localCache.getIfPresent(key);
&nbsp; &nbsp;&nbsp;if&nbsp;(value !=&nbsp;null) {
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;return&nbsp;value;
&nbsp; &nbsp; }

&nbsp; &nbsp;&nbsp;// L2: Redis
&nbsp; &nbsp; value = redis.get(key);
&nbsp; &nbsp;&nbsp;if&nbsp;(value !=&nbsp;null) {
&nbsp; &nbsp; &nbsp; &nbsp; localCache.put(key, value); &nbsp;// 回填 L1
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;return&nbsp;value;
&nbsp; &nbsp; }

&nbsp; &nbsp;&nbsp;// L3: 数据库
&nbsp; &nbsp; value = db.query(key);
&nbsp; &nbsp;&nbsp;if&nbsp;(value !=&nbsp;null) {
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;int&nbsp;realTtl&nbsp;=&nbsp;3600&nbsp;+&nbsp;new&nbsp;Random().nextInt(300);
&nbsp; &nbsp; &nbsp; &nbsp; redis.setex(key, realTtl, value);
&nbsp; &nbsp; &nbsp; &nbsp; localCache.put(key, value);
&nbsp; &nbsp; }
&nbsp; &nbsp;&nbsp;return&nbsp;value;
}

注意：

L1 缓存的过期时间要短，避免内存堆积
L1 缓存更新要广播（多节点时）
L1 缓存不适合数据一致性要求高的场景

5.3.3 方案三：Redis 高可用

核心思想：避免 Redis 单点。

三种高可用方案：

主从：手动切换，简单但恢复慢
哨兵：自动切换，可用性高
Cluster：分片 + 自动切换，扩展性好

bash

# Redis Sentinel 配置（sentinel.conf）
sentinel monitor mymaster 10.0.0.11 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1

bash

# Redis Cluster 配置（redis.conf）
cluster-enabled&nbsp;yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
appendonly&nbsp;yes

5.3.4 方案四：熔断降级

核心思想：Redis 不可用时，返回降级数据或错误。

java

@HystrixCommand(fallbackMethod = "fallbackGetData")
public&nbsp;Object&nbsp;getData(String key)&nbsp;{
&nbsp; &nbsp;&nbsp;Object&nbsp;value&nbsp;=&nbsp;redis.get(key);
&nbsp; &nbsp;&nbsp;if&nbsp;(value !=&nbsp;null) {
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;return&nbsp;value;
&nbsp; &nbsp; }
&nbsp; &nbsp;&nbsp;return&nbsp;db.query(key);
}

public&nbsp;Object&nbsp;fallbackGetData(String key)&nbsp;{
&nbsp; &nbsp;&nbsp;// 降级：返回静态数据、缓存数据、错误码
&nbsp; &nbsp;&nbsp;return&nbsp;defaultData;
}

5.3.5 Option 5: cache warm-up

Core idea: avoid a flood of requests falling back to the database on a cold start.

bash

#!/bin/bash
# Warm-up script example: bulk-load hot data into Redis at startup
# -N suppresses the column header row, which would otherwise be written as a key
mysql -N -h db_host -uuser -ppass db_name -e "SELECT id, data FROM hot_keys" | \
while IFS=$'\t' read -r key value; do
    # Add 0-599 s of jitter so the warmed keys do not all expire together
    ttl=$((7200 + RANDOM % 600))
    redis-cli -h redis_host SETEX "key:$key" "$ttl" "$value"
done

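5.3.6 Comparison

| Option | Protects against | Difficulty | Benefit |
| --- | --- | --- | --- |
| TTL randomization | Simultaneous expiry | Low | High |
| Multi-level cache | Redis unavailable | Medium | High |
| Redis high availability | Redis unavailable | Medium | High |
| Circuit breaking and degradation | Redis unavailable | Medium | Medium |
| Cache warm-up | Cold start | Medium | Medium |

Recommended production combination: TTL randomization + multi-level cache + Redis high availability + circuit breaking and degradation.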

6. Common Commands

6.1 Redis 基础命令

bash

# 连接 Redis
redis-cli -h 10.0.0.1 -p 6379 -a <password>

# 看基本信息
redis-cli INFO server
redis-cli INFO clients
redis-cli INFO memory
redis-cli INFO stats
redis-cli INFO replication
redis-cli INFO keyspace

# 看 key 列表（生产环境慎用）
redis-cli --scan --pattern&nbsp;'user:*'&nbsp;|&nbsp;head&nbsp;-20

# 看 key 的 TTL
redis-cli TTL key:user:100

# 看 key 的类型
redis-cli TYPE key:user:100

# 看 key 的 value
redis-cli GET key:user:100

# 删 key
redis-cli DEL key:user:100

# 看慢查询
redis-cli SLOWLOG GET 10
redis-cli SLOWLOG LEN
redis-cli SLOWLOG RESET &nbsp;# 慎用

# 看客户端连接
redis-cli CLIENT LIST
redis-cli CLIENT KILL ID <id>

# 看内存
redis-cli MEMORY USAGE key:user:100
redis-cli MEMORY STATS

6.2 监控 Redis 状态

bash

# 实时监控 QPS
redis-cli --stat

# 输出示例：
# ------- data ------ --------------------- load -------------------- - child -
# keys &nbsp; &nbsp; &nbsp; mem &nbsp; &nbsp; &nbsp;clients blocked requests &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;connections
# 1000 &nbsp; &nbsp; &nbsp; 1.20M &nbsp; &nbsp;50 &nbsp; &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; 100 (+0.1%) &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 100
# 1001 &nbsp; &nbsp; &nbsp; 1.21M &nbsp; &nbsp;55 &nbsp; &nbsp; &nbsp;0 &nbsp; &nbsp; &nbsp; 200 (+100.0%) &nbsp; &nbsp; &nbsp; &nbsp; 110

# 看延迟
redis-cli --latency
# 输出示例：
# min: 0, max: 1, avg: 0.15 (1234 samples)

# 看延迟历史
redis-cli --latency-history
# 输出示例：
# min: 0, max: 1, avg: 0.15 (1234 samples)
# min: 0, max: 2, avg: 0.18 (5678 samples)

# 看延迟分布
redis-cli --latency-dist

# 看 big key
redis-cli --bigkeys

# Find hot keys (requires maxmemory-policy to be an LFU policy such as allkeys-lfu)
redis-cli --hotkeys

6.3 Redis 集群命令

bash

# 集群信息
redis-cli -c -h <host> -p <port> CLUSTER INFO

# 集群节点
redis-cli -c -h <host> -p <port> CLUSTER NODES

# 集群 slots
redis-cli -c -h <host> -p <port> CLUSTER SLOTS

# 看 key 在哪个 slot
redis-cli -c -h <host> -p <port> CLUSTER KEYSLOT key:user:100

# Manual failover (must be sent to a replica of the master being replaced)
redis-cli -c -h <replica_host> -p <port> CLUSTER FAILOVER

6.4 复制相关命令

bash

# 看主从状态（5.0+）
redis-cli INFO replication

# 输出示例：
# role:master
# connected_slaves:2
# slave0:ip=10.0.0.12,port=6379,state=online,offset=12345,lag=0
# slave1:ip=10.0.0.13,port=6379,state=online,offset=12345,lag=0
# master_repl_offset:12345

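# Check replication lag by comparing offsets (do not use DEBUG SLEEP here; it blocks the server)
redis-cli -h <master> INFO replication | grep -E 'master_repl_offset|slave[0-9]+:'
redis-cli -h <replica> INFO replication | grep -E 'slave_repl_offset|master_link_status|master_last_io_seconds_ago'
# lag= in the master's slaveN line is the number of seconds since that replica's last ACK

# Re-point a replica at a master (use with care). REPLICAOF replaces SLAVEOF since Redis 5.0
redis-cli -h <replica> REPLICAOF <master_ip> <master_port>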

6.5 Lua 脚本：分布式锁

lua

-- lock.lua
-- KEYS[1]: lock key
-- ARGV[1]: lock value (UUID)
-- ARGV[2]: expire seconds
if&nbsp;redis.call('SET', KEYS[1], ARGV[1],&nbsp;'NX',&nbsp;'EX', ARGV[2])&nbsp;then
&nbsp; &nbsp;&nbsp;return&nbsp;1
else
&nbsp; &nbsp;&nbsp;return&nbsp;0
end

lua

-- unlock.lua
-- KEYS[1]: lock key
-- ARGV[1]: lock value
if&nbsp;redis.call('GET', KEYS[1]) == ARGV[1]&nbsp;then
&nbsp; &nbsp;&nbsp;return&nbsp;redis.call('DEL', KEYS[1])
else
&nbsp; &nbsp;&nbsp;return&nbsp;0
end

调用：

bash

# 加锁
redis-cli --eval&nbsp;lock.lua lock:user:100 , <uuid> 10

# 释放锁
redis-cli --eval&nbsp;unlock.lua lock:user:100 , <uuid>

6.6 Lua 脚本：限流

lua

-- rate_limit.lua
-- KEYS[1]: rate limit key
-- ARGV[1]: current timestamp (ms)
-- ARGV[2]: window (ms)
-- ARGV[3]: limit
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

-- Drop entries that fell out of the window, then count what is left
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)

if count >= limit then
    return 0
end

-- The member must be unique: a bare timestamp would merge requests
-- that arrive in the same millisecond into one entry
redis.call('ZADD', key, ARGV[1], ARGV[1] .. '-' .. count)
redis.call('PEXPIRE', key, window)
return 1

6.7 Lua 脚本：缓存空值防穿透

lua

-- set_null_cache.lua
-- KEYS[1]: null cache key
-- ARGV[1]: ttl (seconds)
if&nbsp;redis.call('EXISTS', KEYS[1]) ==&nbsp;0&nbsp;then
&nbsp; &nbsp; redis.call('SET', KEYS[1],&nbsp;'1',&nbsp;'EX', ARGV[1])
&nbsp; &nbsp;&nbsp;return&nbsp;1
else
&nbsp; &nbsp;&nbsp;return&nbsp;0
end

6.8 监控 Redis 关键指标

bash

# 实时指标（每秒刷新）
watch -n 1&nbsp;'redis-cli INFO stats | grep instantaneous'

# 输出示例：
# instantaneous_ops_per_sec:1000
# instantaneous_input_kbps:50.0
# instantaneous_output_kbps:200.0

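# Key metrics:
# used_memory: memory in use
# used_memory_peak: peak memory
# used_memory_rss: resident (physical) memory
# mem_fragmentation_ratio: memory fragmentation ratio
# connected_clients: client connections
# blocked_clients: blocked client connections
# total_connections_received: total connections accepted
# total_commands_processed: total commands processed
# instantaneous_ops_per_sec: QPS
# keyspace_hits: number of hits
# keyspace_misses: number of misses
# hit ratio: not an INFO field; compute keyspace_hits / (keyspace_hits + keyspace_misses)
# expired_keys: number of expired keys
# evicted_keys: number of evicted keys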
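Since the hit ratio has to be derived, a small helper can compute it from the text that `redis-cli INFO stats` prints. The sketch below parses a captured sample. The function names and the sample values are made up for illustration; the counters are cumulative since server start, so for alerting a rate over a time window is more useful than this lifetime ratio.

```python
from typing import Dict, Optional


def parse_info(text: str) -> Dict[str, str]:
    """Parse `redis-cli INFO` output into a dict, skipping section headers."""
    stats: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        stats[name] = value
    return stats


def hit_ratio(stats: Dict[str, str]) -> Optional[float]:
    """keyspace_hits / (keyspace_hits + keyspace_misses), or None if no lookups yet."""
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


# Sample text standing in for the output of `redis-cli INFO stats`.
sample = """# Stats
keyspace_hits:9800
keyspace_misses:200
expired_keys:15
"""
ratio = hit_ratio(parse_info(sample))
```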

7. Configuration Examples

7.1 redis.conf 关键配置

conf

# /etc/redis/redis.conf

# 网络
bind 0.0.0.0 -::*
protected-mode yes
port 6379
timeout 300
tcp-keepalive 60

# 通用
daemonize yes
pidfile /var/run/redis/redis-server.pid
loglevel notice
logfile /var/log/redis/redis-server.log
databases 16

# 内存管理
maxmemory 8gb
maxmemory-policy allkeys-lru
maxmemory-samples 5

# 持久化
save 900 1
save 300 10
save 60 10000
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite yes
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-use-rdb-preamble yes

# 慢查询
slowlog-log-slower-than 10000
slowlog-max-len 128

# 客户端
maxclients 10000
tcp-backlog 511

# 安全
requirepass <your_strong_password>
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command KEYS ""
rename-command CONFIG "CONFIG_b9f8c2a3" &nbsp;# 重命名敏感命令

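Notes:

The password and the rename-command entries must be adjusted for the actual environment
After a command is renamed with rename-command, every client that uses it must be updated to match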

7.2 Redis Sentinel 配置

conf

# /etc/redis/sentinel.conf
port 26379
daemonize yes
pidfile /var/run/redis/sentinel.pid
logfile /var/log/redis/sentinel.log

sentinel monitor mymaster 10.0.0.11 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 60000

# 密码
sentinel auth-pass mymaster <your_strong_password>

# 通知脚本（可选）
sentinel notification-script mymaster /opt/scripts/notify.sh
sentinel client-reconfig-script mymaster /opt/scripts/reconfig.sh

7.3 Redis Cluster 配置

conf

# /etc/redis/redis-cluster.conf
port 6379
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 15000
cluster-require-full-coverage no

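# Staleness limit for failover (not the replica count): a replica disconnected from its master for longer than
# (cluster-node-timeout * factor) + repl-ping-replica-period will not attempt a failover.
# The number of replicas per master is set at creation time with --cluster-replicas.
cluster-replica-validity-factor 10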

# 启用 AOF
appendonly yes

启动集群：

bash

# 6 个节点：3 主 3 从
redis-cli --cluster create \
&nbsp; 10.0.0.11:6379 10.0.0.12:6379 10.0.0.13:6379 \
&nbsp; 10.0.0.14:6379 10.0.0.15:6379 10.0.0.16:6379 \
&nbsp; --cluster-replicas 1

# 重新分片
redis-cli --cluster reshard 10.0.0.11:6379

# 增减节点
redis-cli --cluster add-node <new_node> <existing_node>
redis-cli --cluster del-node <existing_node> <node_id>

7.4 systemd 服务文件

ini

# /etc/systemd/system/redis.service
[Unit]
Description=Redis In-Memory Data Store
After=network.target

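[Service]
# The redis.conf in 7.1 sets "daemonize yes", so systemd has to track the forked process through its PID file
Type=forking
PIDFile=/var/run/redis/redis-server.pid
RuntimeDirectory=redis
User=redis
Group=redis
ExecStart=/usr/bin/redis-server /etc/redis/redis.conf
# With requirepass set, shutdown needs credentials; redis-cli reads them from the REDISCLI_AUTH environment variable
ExecStop=/usr/bin/redis-cli shutdown
Restart=always
RestartSec=3
LimitNOFILE=65535
PrivateTmp=true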

[Install]
WantedBy=multi-user.target

7.5 Rate-limiting configuration (gateway layer)

Distributed rate limiting with Nginx + Redis:

nginx

# /etc/nginx/conf.d/limit.conf
# Requires OpenResty (or ngx_http_lua_module) with lua-resty-redis.
# Connections are opened per request: cosockets are not available in init_worker_by_lua.
server {
    listen 80;

    location / {
        access_by_lua_block {
            local redis = require "resty.redis"
            local red = redis:new()
            red:set_timeouts(100, 100, 100)  -- connect, send, read timeouts in ms

            local ok, err = red:connect("10.0.0.1", 6379)
            if not ok then
                -- Fail open: do not block traffic when Redis is unreachable
                ngx.log(ngx.ERR, "rate limit: redis connect failed: ", err)
                return
            end
            -- If requirepass is set, call red:auth("<password>") here

            local key = "rate_limit:" .. ngx.var.remote_addr
            local now = math.floor(ngx.now() * 1000)
            local window = 60000
            local limit = 100

            local res, eval_err = red:eval([[
                local key = KEYS[1]
                local now = tonumber(ARGV[1])
                local window = tonumber(ARGV[2])
                local limit = tonumber(ARGV[3])
                redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
                local count = redis.call('ZCARD', key)
                if count >= limit then
                    return 0
                end
                redis.call('ZADD', key, ARGV[1], ARGV[1] .. '-' .. count)
                redis.call('PEXPIRE', key, window)
                return 1
            ]], 1, key, now, window, limit)

            if not res then
                ngx.log(ngx.ERR, "rate limit: eval failed: ", eval_err)
            end
            red:set_keepalive(10000, 100)  -- return the connection to the pool

            if res == 0 then
                ngx.status = 429
                ngx.say("rate limit exceeded")
                return ngx.exit(429)
            end
        }

        proxy_pass http://backend;  # placeholder upstream name
    }
}

Note: the limit thresholds must be tuned to the business baseline.

8. Observing Logs and Metrics

8.1 Redis 慢查询日志

bash

# 配置
slowlog-log-slower-than 10000 &nbsp;# 10ms
slowlog-max-len 128

# 查看
redis-cli SLOWLOG GET 10

# 输出示例：
# 1) 1) (integer) 14
# &nbsp; &nbsp;2) (integer) 1309448221
# &nbsp; &nbsp;3) (integer) 15
# &nbsp; &nbsp;4) 1) "GET"
# &nbsp; &nbsp; &nbsp; 2) "key:user:100"
# &nbsp; &nbsp;5) "10.0.0.1:12345"
# &nbsp; &nbsp;6) ""
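# Meaning: entry ID 14, Unix timestamp, execution time of 15 microseconds, the command GET key:user:100,
# then the client address and client name. The sample only shows the format: with
# slowlog-log-slower-than 10000, only commands slower than 10000 microseconds are actually logged.

How to read it:

Slow entries concentrated on certain keys → big keys or hot keys
Slow entries concentrated on certain commands → slow commands (KEYS, LRANGE over large ranges, complex Lua)
Slow entries concentrated at certain times → business peak hours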

8.2 Redis 监控指标

8.2.1 关键指标

bash

# 用 redis_exporter 抓取
# /etc/prometheus/prometheus.yml
scrape_configs:
&nbsp; - job_name:&nbsp;'redis'
&nbsp; &nbsp; static_configs:
&nbsp; &nbsp; &nbsp; - targets: ['redis-exporter:9121']

8.2.2 关键指标列表

8.2.3 Example alerting rules

yaml

# Prometheus alerting rules (evaluated by Prometheus; Alertmanager only routes the resulting alerts)
groups:
- name: redis
  rules:
  - alert: RedisDown
    expr: redis_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis instance is down"

  - alert: RedisMemoryHigh
    # redis_exporter exposes redis_memory_used_bytes and redis_memory_max_bytes;
    # max is 0 when maxmemory is unset, so guard the division
    expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85 and redis_memory_max_bytes > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis memory usage above 85%"

  - alert: RedisHitRateLow
    expr: |
      rate(redis_keyspace_hits_total[5m]) /
      (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
      < 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis hit rate below 80%"

  - alert: RedisClientsHigh
    expr: redis_connected_clients > 8000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis client connections too high"

  - alert: RedisSlowQuery
    expr: redis_slowlog_length > 0
    for: 5m
    labels:
      severity: info
    annotations:
      summary: "Redis slow queries detected"

  - alert: RedisReplicationLag
    expr: redis_master_repl_offset - redis_slave_repl_offset > 1000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis replication lag"

Note: thresholds must be tuned to the business baseline, and metric names should be checked against the exporter version in use.

8.3 Redis 状态抓取脚本

bash

#!/bin/bash
# redis_status.sh
# 用途：定期抓取 Redis 状态
# 用法：*/1 * * * * /opt/scripts/redis_status.sh >> /var/log/redis_status.log 2>&1

REDIS_CLI="redis-cli -h 10.0.0.1 -p 6379 -a <password>"

echo&nbsp;"===&nbsp;$(date)&nbsp;==="
echo&nbsp;"--- server ---"
$REDIS_CLI&nbsp;INFO server | grep -E&nbsp;'redis_version|os|process_id|uptime_in_seconds'
echo&nbsp;"--- clients ---"
$REDIS_CLI&nbsp;INFO clients | grep -E&nbsp;'connected_clients|blocked_clients|maxclients'
echo&nbsp;"--- memory ---"
$REDIS_CLI&nbsp;INFO memory | grep -E&nbsp;'used_memory:|used_memory_peak:|used_memory_rss:|mem_fragmentation_ratio:|maxmemory:'
echo&nbsp;"--- stats ---"
$REDIS_CLI&nbsp;INFO stats | grep -E&nbsp;'instantaneous_ops_per_sec|keyspace_hits|keyspace_misses|expired_keys|evicted_keys'
echo&nbsp;"--- replication ---"
$REDIS_CLI&nbsp;INFO replication | grep -E&nbsp;'role:|connected_slaves:|master_repl_offset:'
echo&nbsp;"--- keyspace ---"
$REDIS_CLI&nbsp;INFO keyspace
echo&nbsp;"--- slowlog ---"
$REDIS_CLI&nbsp;SLOWLOG LEN

注意：脚本里使用明文密码只是示意，生产环境必须用环境变量或受控的凭据文件。

9. Troubleshooting Walkthroughs

9.1 故障排查案例 1：缓存击穿

现象：

监控告警：商品详情接口 5xx 飙升
业务反馈：iPhone 新品首发活动开始，详情页打不开

初步判断：

单 key（iPhone SKU）QPS 突增
数据库连接池满
Redis 命中率正常

命令检查：

bash

# 1. 看 Redis 命中率
redis-cli INFO stats | grep -E&nbsp;'keyspace_hits|keyspace_misses'

# 2. 看 key 是否存在
redis-cli EXISTS product:iphone:sku123

# 3. 看数据库连接数
mysql -e&nbsp;'SHOW STATUS LIKE "Threads_connected";'
mysql -e&nbsp;'SHOW STATUS LIKE "Threads_running";'

# 4. 看数据库慢 SQL
mysqldumpslow -s t /var/log/mysql/slow.log |&nbsp;head&nbsp;-10

关键指标：

命中率：98%（正常）
单个 key 不存在：YES（已被淘汰或过期）
数据库 Threads_running：500/500（连接池满）
慢 SQL：SELECT * FROM products WHERE sku = 'sku123'

根因定位：

iPhone SKU 是热点 key，TTL 过期后大量请求回源
单 SQL 不慢，但瞬间并发太高把数据库打挂

修复方案：

java

// 立即上线逻辑过期方案
public&nbsp;Object&nbsp;getProduct(String sku)&nbsp;{
&nbsp; &nbsp;&nbsp;String&nbsp;key&nbsp;=&nbsp;"product:"&nbsp;+ sku;
&nbsp; &nbsp; CacheData<Product> data = redis.get(key);

&nbsp; &nbsp;&nbsp;if&nbsp;(data ==&nbsp;null) {
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;// 缓存不存在
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;return&nbsp;loadFromDbWithLock(sku);
&nbsp; &nbsp; }

&nbsp; &nbsp;&nbsp;if&nbsp;(data.getExpireAt() < System.currentTimeMillis()) {
&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;// 已过期，异步刷新，返回旧数据
&nbsp; &nbsp; &nbsp; &nbsp; asyncRefresh(sku);
&nbsp; &nbsp; }

&nbsp; &nbsp;&nbsp;return&nbsp;data.getData();
}

验证：

Redis QPS 正常
数据库 QPS 平稳
接口响应时间 P99 平稳

回滚方案：

把热点 key 重新设回 Redis，恢复原 TTL
把逻辑过期代码回滚到原来的查 DB 逻辑

复盘总结：

营销活动必须提前预热热点 key
热点 key 应该用逻辑过期，避免突然击穿
数据库连接池要预留容量

9.2 故障排查案例 2：缓存雪崩

现象：

凌晨 3 点告警：Redis 内存使用率突然降到 0
同时数据库 QPS 飙升
5xx 飙升

初步判断：

大量 key 同时过期
缓存回源打挂数据库

命令检查：

bash

# 1. 看 Redis key 数量
redis-cli DBSIZE

# 2. 看 Redis 命中率
redis-cli INFO stats | grep keyspace

# 3. 看数据库 QPS
mysql -e&nbsp;'SHOW GLOBAL STATUS LIKE "Queries";'

# 4. 看应用日志
grep&nbsp;"cache miss"&nbsp;/var/log/app/app.log |&nbsp;wc&nbsp;-l

关键指标：

Redis DBSIZE：1000（之前 100 万）
命中率：0%（缓存空了）
数据库 QPS：50000
cache miss 日志：每秒 1 万

根因定位：

凌晨 3 点批量预热脚本把缓存全部刷新
全部设置了相同 TTL（6 小时）
6 小时后同时过期

修复方案：

java

// 立即：把热点 key 重新预热
// 修改：TTL 加随机值
int&nbsp;baseTtl&nbsp;=&nbsp;3600;
int&nbsp;randomTtl&nbsp;=&nbsp;new&nbsp;Random().nextInt(600);
int&nbsp;realTtl&nbsp;=&nbsp;baseTtl + randomTtl;
redis.setex(key, realTtl, data);

bash

# 立即预热
for&nbsp;key&nbsp;in&nbsp;$(mysql -N -e&nbsp;"SELECT id FROM products WHERE is_hot=1");&nbsp;do
&nbsp; &nbsp; value=$(mysql -N -e&nbsp;"SELECT data FROM products WHERE id=$key")
&nbsp; &nbsp; redis-cli SETEX&nbsp;"product:$key"&nbsp;$((3600&nbsp;+ RANDOM %&nbsp;600))&nbsp;"$value"
done

验证：

Redis key 数量恢复
数据库 QPS 下降
5xx 下降

回滚方案：

暂时回退到旧版本
关闭预热脚本，避免再次雪崩

复盘总结：

批量预热必须 TTL 随机化
预热和过期要错峰
缓存高可用要多副本

9.3 Troubleshooting case 3: cache penetration

Symptoms:

Alert: low Redis hit rate
Database QPS spikes
The application log fills with "data not found"

Initial assessment:

Too many queries for data that does not exist
Possibly a business bug, possibly a malicious attack

Command checks:

bash

# 1. Check the Redis hit rate
redis-cli INFO stats | grep keyspace

# 2. Find the queries for which the database returned nothing
grep "data not found" /var/log/app/app.log | awk '{print $NF}' | sort | uniq -c | sort -rn | head

# 3. Check the distribution of client IPs
grep "data not found" /var/log/app/app.log | awk '{print $1}' | sort | uniq -c | sort -rn | head

# 4. Check the distribution of request parameters
grep "data not found" /var/log/app/app.log | awk '{print $(NF-1)}' | sort | uniq -c | sort -rn | head

Key indicators:

Hit rate: 30% (very low)
Most not-found requests come from the same IP range
Request parameters increase sequentially from 1 to 99999

Root cause:

An attacker is enumerating user IDs in order with a script
Most of those users do not exist
The goal of the attack is to learn which IDs exist
Fix:

java

// Deploy a Bloom filter immediately
public Object getUserInfo(String userId) {
    // 1. Bloom filter: an ID it has never seen cannot exist, so stop here
    if (!bloomFilter.mightContain(userId)) {
        return null;
    }

    // 2. Look up Redis
    Object value = redis.get("user:" + userId);
    if (value != null) {
        return value;
    }
    // A cached-null marker means the DB was checked recently and had no such user
    if (redis.get("null:user:" + userId) != null) {
        return null;
    }

    // 3. Look up the database
    Object data = db.query("SELECT * FROM users WHERE id = ?", userId);
    if (data != null) {
        redis.setex("user:" + userId, 3600, data);
    } else {
        // Cache the empty result with a short TTL
        redis.setex("null:user:" + userId, 60, "1");
    }
    return data;
}

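Verification:

The Redis hit rate recovers
Database QPS drops
The attacking IPs are still sending requests, but the Bloom filter blocks them all

Rollback plan:

Take a backup before disabling the Bloom filter
It can be switched off dynamically with a Lua script

Postmortem:

Interfaces keyed by user ID must sit behind a Bloom filter
Cached empty values must have a TTL
Attacks must also be handled with a WAF and rate limiting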
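
The property the fix relies on is that a Bloom filter never gives a false negative: if `mightContain` returns false, the ID was definitely never added. The sketch below illustrates that with a deliberately simple in-process filter. It is not the `bloomFilter` object used above (that one would normally come from a library such as Guava or from RedisBloom); `SimpleBloomFilter`, its sizes and its hash functions are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter for illustration; not tuned for production use. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Sets hashCount bits for the item. */
    void put(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(item, i));
        }
    }

    /** False means "definitely absent"; true means "possibly present". */
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: the i-th position is (h1 + i * h2) mod size.
    private int index(String item, int i) {
        long h1 = item.hashCode();
        long h2 = fnv1a(item) | 1; // forced odd so the step is never zero
        return (int) Math.floorMod(h1 + i * h2, (long) size);
    }

    // 32-bit FNV-1a over the UTF-8 bytes.
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h;
    }
}
```

Because a true answer only means "possibly present", the Redis and database lookups after the filter are still required. A standard Bloom filter also cannot delete items, so deleted users keep passing the filter until it is rebuilt.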

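10. Risk Reminders

10.1 Cache design risks

Do not use Redis as a database: Redis is a cache, not durable storage
Do not cache everything: caching cold data wastes memory
Do not store big keys in Redis: keep a single key within 10 KB and a value within 100 KB
Do not run heavy computation inside Redis: commands execute on a single thread, so a CPU-heavy one blocks every other command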

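10.2 Risks of dangerous commands

KEYS *: blocks Redis while it walks the whole keyspace; forbid it in production
FLUSHDB / FLUSHALL: wipe the current database / every database; use with extreme care
DEBUG SLEEP: blocks Redis; for debugging only
CONFIG SET: changes configuration at runtime and can destabilize the instance
SHUTDOWN: stops Redis immediately; first confirm whether there is unsaved data

bash

# Always take a backup before a high-risk operation,
# for example before FLUSHDB
redis-cli BGSAVE
# Wait for BGSAVE to finish: LASTSAVE returns the Unix time of the last successful save
redis-cli LASTSAVE
# Inspect the most recent RDB file
ls -lh /var/lib/redis/dump.rdb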

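10.3 Failover risks

The service is briefly unavailable at the moment of failover
After failover the replicas need time to catch up with the new master
Avoid issuing writes while the failover is in progress
Check data consistency after the failover

10.4 Persistence risks

AOF persistence consumes disk I/O
An AOF rewrite consumes CPU while it runs
Generating an RDB forks a child process, which can block the server briefly
Do not keep persistence files on the root partition
Back up persistence files to another machine on a schedule

10.5 Lua script risks

Every key a Lua script touches must be declared explicitly (via KEYS)
Keep Lua scripts simple so they do not block the server for long
Plan for script failure: Redis does not undo writes a script made before it errored, so the caller must be able to compensate
Do not make critical operations depend entirely on Lua

10.6 Sharded cluster risks

Cluster can be briefly unavailable during slot migration
A big key lives in a single slot, so Cluster cannot spread it across nodes
Cluster rejects multi-key operations whose keys fall in different slots (use a hash tag to co-locate them)
Transactions are likewise restricted when keys span slots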
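
The hash-tag rule is easier to reason about with the slot calculation in front of you. Redis Cluster maps a key to `CRC16(key) mod 16384`, and when the key contains a non-empty `{...}` section only that section is hashed. The sketch below reimplements that calculation for illustration; `ClusterSlot` is a hypothetical class, not part of any Redis client, and in practice `CLUSTER KEYSLOT` on the server gives the authoritative answer.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative reimplementation of the Redis Cluster key-to-slot mapping. */
final class ClusterSlot {
    private ClusterSlot() {}

    /** CRC16 (XMODEM variant: polynomial 0x1021, initial value 0). */
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    /** Slot for a key, honouring the first non-empty {hash tag} if present. */
    static int slot(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) { // at least one character between the braces
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        // Both keys hash only "user:100", so they land in the same slot.
        System.out.println(slot("{user:100}:profile") == slot("{user:100}:orders"));
    }
}
```

Keys that share a tag can be used together in MGET, transactions and Lua scripts, but they also share one node, so an overly popular tag recreates the hot-key problem.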

11. Verification

11.1 Verifying cache protection

bash

# 1. Simulate requests with redis-cli
for i in {1..100}; do
    redis-cli GET "user:$i"
done

# 2. Watch how the hit rate changes
redis-cli INFO stats | grep keyspace

# 3. Verify with a load test
wrk -t 10 -c 100 -d 60s 'https://api.example.com/products/hot-sku'

# 4. Simulate an attack:
# a script that walks sequentially through IDs that do not exist
for i in {1..10000}; do
    curl "https://api.example.com/users/$i"
done

11.2 Performance verification

bash

# 1. Redis benchmark
redis-benchmark -h 10.0.0.1 -p 6379 -t set,get -n 100000 -q

# 2. Application-level benchmark
wrk -t 10 -c 100 -d 60s --latency 'https://api.example.com/products/1'

# 3. Database QPS monitoring (Queries is cumulative; diff two samples)
mysql -e 'SHOW GLOBAL STATUS LIKE "Queries";'

11.3 Verifying the defence strategies

bash

# 1. Verify the Bloom filter
redis-cli BF.EXISTS myfilter user:99999

# 2. Verify the distributed lock (the value is a unique owner token; uuidgen is one way to produce it)
redis-cli SET lock:user:100 "$(uuidgen)" NX EX 10
redis-cli GET lock:user:100

# 3. Verify rate limiting (-s -o /dev/null keeps response bodies out of the status-code count)
for i in {1..200}; do
    curl -s -o /dev/null -w "%{http_code}\n" "https://api.example.com/api/data"
done | sort | uniq -c
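
When reading the status-code counts from the rate-limit check, it helps to know what a limiter is expected to do with a burst of 200 requests. The sketch below is a fixed-window counter, one of the simplest limiting schemes. The article does not say which algorithm the service uses, so the class `FixedWindowLimiter` and the limit of 100 requests per second are assumptions for illustration.

```java
/** Minimal fixed-window rate limiter (illustrative; single process only). */
final class FixedWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private long windowStart = -1; // -1 means no window has been opened yet
    private int count;

    FixedWindowLimiter(int limit, long windowMillis) {
        if (limit <= 0 || windowMillis <= 0) {
            throw new IllegalArgumentException("limit and window must be positive");
        }
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** Returns true if the request at nowMillis is allowed. */
    synchronized boolean tryAcquire(long nowMillis) {
        if (windowStart < 0 || nowMillis - windowStart >= windowMillis) {
            windowStart = nowMillis; // open a new window and reset the counter
            count = 0;
        }
        if (count < limit) {
            count++;
            return true;
        }
        return false; // over the limit: the caller would answer with HTTP 429
    }

    public static void main(String[] args) {
        FixedWindowLimiter limiter = new FixedWindowLimiter(100, 1000);
        int allowed = 0;
        for (int i = 0; i < 200; i++) {
            if (limiter.tryAcquire(0)) {
                allowed++;
            }
        }
        System.out.println("allowed=" + allowed + " rejected=" + (200 - allowed));
    }
}
```

With these assumed numbers, a burst of 200 requests inside one window yields 100 accepted and 100 rejected, which is the kind of split the `uniq -c` output should show if limiting is active.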

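11.4 High-availability verification

bash

# <master>, <sentinel> and <node> are placeholders for the real host addresses

# 1. Sentinel failover test
# Deliberately kill the master
redis-cli -h <master> SHUTDOWN NOSAVE
# Check whether Sentinel has promoted a new master
redis-cli -h <sentinel> -p 26379 SENTINEL get-master-addr-by-name mymaster

# 2. Cluster failure test
# Same as above, but watch the cluster state
redis-cli -c -h <node> CLUSTER INFO

# 3. Data consistency check after failover
# Compare key counts and values before and after the switch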

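12. Rollback Plans

12.1 Rolling back the Bloom filter

bash

# Disable the Bloom filter (in the application config)
# and switch back to the original Redis cache logic

# If RedisBloom was loaded as a module, list the loaded modules
redis-cli MODULE LIST
# Unload it, using the name that MODULE LIST reports (RedisBloom usually registers as "bf")
redis-cli MODULE UNLOAD bf

Note: before unloading the module, confirm that no key still holds a Bloom filter. Redis refuses to unload a module that registers custom data types, so if MODULE UNLOAD fails, remove the loadmodule line from the config file and restart instead.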

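12.2 Rolling back Lua scripts

bash

# Revert the Lua script to its previous implementation
# Do not patch it ad hoc through EVAL: the server-side script cache is lost on restart

# Suggested version layout for Lua scripts:
# /opt/scripts/lua/v1/  # current version
# /opt/scripts/lua/v2/  # new version
# Switch between them through a symlink
ln -sfn /opt/scripts/lua/v1 /opt/scripts/lua/current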

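12.3 Rolling back configuration changes

bash

# 1. Back up before changing anything
cp /etc/redis/redis.conf /etc/redis/redis.conf.bak.$(date +%Y%m%d_%H%M%S)

# 2. Enable AOF
redis-cli CONFIG SET appendonly yes
redis-cli CONFIG REWRITE

# 3. Disable dangerous commands
# rename-command cannot be changed with CONFIG SET; add it to redis.conf and restart:
#   rename-command KEYS ""

# 4. Adjust maxmemory
redis-cli CONFIG SET maxmemory 8gb

# 5. Roll back
# Overwrite the config with the backup file
cp /etc/redis/redis.conf.bak.<timestamp> /etc/redis/redis.conf
# Restart so the restored file takes effect (Redis does not re-read redis.conf while running)
systemctl restart redis

Note: CONFIG REWRITE writes the in-memory configuration back to the config file, but it does not necessarily cover every setting. In production, edit the file directly and use CONFIG SET only to apply the same change to the running instance.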

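12.4 Emergency rollback

If Redis is in serious trouble and you need to stop the bleeding right away:

bash

# 1. Force a failover to a replica (if Sentinel is deployed)
redis-cli -h <sentinel> -p 26379 SENTINEL failover mymaster

# 2. Move traffic to another Redis instance
# Change the Redis address in the application config
# and push it through the config center

# 3. Temporarily block certain commands
# rename-command is not settable at runtime; on Redis 6+ an ACL rule works without a restart
# (assumes the application connects as the "default" user)
redis-cli ACL SETUSER default -flushall

# 4. Force the data to be saved
redis-cli BGSAVE
redis-cli BGREWRITEAOF

Note: switching Redis instances is a high-risk operation and must be rehearsed in advance.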

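13. Production Considerations

13.1 Memory management

Monitor the ratio of used_memory to maxmemory
Set a sensible maxmemory-policy
Analyse big keys regularly
Keep the number of keys under control

bash

# Find big keys
redis-cli --bigkeys -i 0.1

# Check the key distribution
redis-cli INFO keyspace
# db0:keys=1000,expires=500,avg_ttl=3600000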

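13.2 Persistence strategy

Use hybrid AOF + RDB persistence
Set AOF fsync to everysec (the default)
Back up to another machine on a schedule
Monitor the size of the RDB/AOF files

13.3 High-availability architecture

At least 1 master and 1 replica
1 master and 2 replicas for important workloads
At least 3 Sentinel nodes
At least 3 masters and 3 replicas for Cluster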

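13.4 Client configuration

Use a connection pool (JedisPool, or Lettuce with its pooling support)
Set a sensible maxTotal
Enable retries, but cap the number of attempts
Set timeouts

java

import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

// JedisPool configuration
JedisPoolConfig config = new JedisPoolConfig();
config.setMaxTotal(200);         // upper bound on connections in the pool
config.setMaxIdle(50);
config.setMinIdle(10);
config.setMaxWaitMillis(3000);   // fail after 3 s instead of blocking forever when the pool is exhausted
config.setTestOnBorrow(false);
config.setTestWhileIdle(true);   // validate idle connections in the background

// Arguments: pool config, host, port, timeout in milliseconds, password
JedisPool pool = new JedisPool(config, "10.0.0.1", 6379, 3000, "password");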

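13.5 Monitoring and alerting

Must monitor: connection count, memory, QPS, hit rate, slow queries, replication lag
Must alert on: memory > 80%, hit rate < 80%, replication lag above threshold, any slow query
Must alert on: instance unavailable, master/replica failover, AOF rewrite failure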
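
The hit-rate threshold is computed from two counters in `INFO stats`: `keyspace_hits` and `keyspace_misses`. A minimal sketch of that calculation follows, assuming the monitoring agent already has the raw `INFO stats` text. The class `HitRate` and its convention of reporting 1.0 when there have been no lookups are choices made for this example.

```java
/** Computes the cache hit rate from the text of `INFO stats` (illustrative helper). */
final class HitRate {
    private HitRate() {}

    static double hitRate(String infoStats) {
        long hits = -1;
        long misses = -1;
        for (String line : infoStats.split("\\r?\\n")) {
            if (line.startsWith("keyspace_hits:")) {
                hits = Long.parseLong(line.substring("keyspace_hits:".length()).trim());
            } else if (line.startsWith("keyspace_misses:")) {
                misses = Long.parseLong(line.substring("keyspace_misses:".length()).trim());
            }
        }
        if (hits < 0 || misses < 0) {
            throw new IllegalArgumentException("keyspace_hits/keyspace_misses not found");
        }
        long total = hits + misses;
        // No lookups yet: report 1.0 so an idle instance does not trigger the alert.
        return total == 0 ? 1.0 : (double) hits / total;
    }

    /** Mirrors the "hit rate < 80%" alert rule. */
    static boolean shouldAlert(String infoStats) {
        return hitRate(infoStats) < 0.80;
    }
}
```

Both counters are cumulative since the server started (or since `CONFIG RESETSTAT`), so an alert is more responsive when it is computed from the difference between two samples rather than from the lifetime totals.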

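13.6 Security

A password is mandatory
High-risk commands must be renamed
The bind address must be restricted
Firewall / security group rules should allow access only from application servers
Never expose Redis to the public internet

13.7 Application-side fault tolerance

Wrap every Redis operation in try-catch
Have a degradation path for when Redis is unavailable
Do not treat Redis as the only source of data
Persist important data to the database asynchronously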
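
The first two rules above usually end up as one small wrapper that every cache read goes through. Below is a sketch of such a wrapper, with hypothetical names (`CacheFallback`, `getOrFallback`). It treats any runtime exception from the cache call as "cache unavailable", which matches how common Java Redis clients report connection failures, but a real implementation would also log the error and cap the fallback traffic.

```java
import java.util.function.Supplier;

/** Illustrative wrapper: read from the cache, degrade to a fallback on failure or miss. */
final class CacheFallback {
    private CacheFallback() {}

    static <T> T getOrFallback(Supplier<T> cacheRead, Supplier<T> fallback) {
        try {
            T cached = cacheRead.get();
            if (cached != null) {
                return cached; // cache hit
            }
        } catch (RuntimeException e) {
            // Cache unavailable: fall through to the fallback instead of failing the request.
        }
        return fallback.get(); // cache miss or cache error
    }

    public static void main(String[] args) {
        String value = CacheFallback.<String>getOrFallback(
                () -> { throw new IllegalStateException("redis down"); },
                () -> "from-db");
        System.out.println(value);
    }
}
```

If Redis is down for everyone at once, this fallback sends all traffic to the database, so it belongs behind a rate limiter or circuit breaker rather than in place of one.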

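13.8 Capacity planning

Estimate Redis capacity before business volume grows
Keep a single instance under 10 GB of memory (so the fork for RDB does not block for long)
Prefer Cluster when business volume is large
Monitor the memory growth trend and scale out ahead of time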

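14. Summary

Cache penetration, cache breakdown and cache avalanche are the three most frequent problems in Redis-backed applications. This article has covered the root cause, symptoms and defence strategy for each:

Penetration:

Root cause: the data exists in neither the cache nor the DB
Defence: cache empty values, Bloom filter, parameter validation, rate limiting
Key point: filter illegal requests at the business layer

Breakdown:

Root cause: a single hot key expires
Defence: distributed lock, logical expiration, never-expire keys, early refresh
Key point: identify hot keys and protect them in advance

Avalanche:

Root cause: a large number of keys expire together, or Redis as a whole is unavailable
Defence: TTL randomization, multi-level cache, high availability, circuit breaking and degradation
Key point: high-availability architecture plus business degradation

The best practice in production is to combine several layers of defence:

Client-side rate limiting + WAF at the access layer + Bloom filter at the business layer + logical expiration at the cache layer + connection pool isolation at the DB layer

Key points for interview answers:

State the difference between penetration, breakdown and avalanche clearly
Ground each defence in concrete commands and configuration
Be able to explain combined defences for complex production scenarios

Final acceptance checks:

Given a cache breakdown case, can you pin down the specific key and its concurrency?
Given a cache avalanche alert, can you warm the cache back up quickly?
Given a cache penetration attack, can you block it with a Bloom filter?

If you can do all three, this article has served its purpose.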

Reposted from: 马哥Linux运维, "Redis Cache Penetration/Breakdown/Avalanche: Interview Must-Knows + Production Defence Strategies"