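Source: ZONE.CI 全球网, 2026-06-16 04:23:25

Summary: This article explains how Redis master-replica replication and Sentinel work and how to deploy them, covering full and partial resynchronization, Sentinel monitoring, and the failover flow. It focuses on common production problems such as data loss during replication and split-brain, and provides configuration file examples plus the steps for building a single-host pseudo-cluster.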

Redis Master-Replica Replication, Sentinel Setup, and Automatic Failover


马哥Linux运维

June 13, 2026, 15:42, Guangdong


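Before we start

A standalone Redis instance is only good enough for development and testing. In production it rarely lasts long before something goes wrong: memory blows up, the disk fills, the node crashes, a slow query stalls the master, or a power cut in the machine room takes all the data with it.

Solving these problems means turning Redis into a highly available setup built on replication plus failover. This article covers only the most classic combination, which is also the one most widely used in production:

One master with multiple replicas, for read/write splitting and data redundancy
Sentinel, for health checks and automatic failover

Cluster mode is out of scope (Redis Cluster is a separate sharding mechanism whose commands, internals, and pitfalls all differ from Sentinel). What you learn here applies whenever a single business Redis instance holds a dataset one machine can carry (typically up to 8~32 GB of memory) and has to serve 24/7.

Intended readers:

Junior and mid-level engineers who run Redis day to day
DevOps engineers who want to build a highly available Redis from scratch
Engineers who have already been bitten by broken replication links, failed Sentinel elections, or data loss from split-brain

What you will be able to do after reading:

Deploy one master, two replicas, and three Sentinels
Read the key fields of INFO replication
Write a production-ready redis.conf and sentinel.conf
Run a failover drill yourself
Troubleshoot split-brain, replication lag, Sentinel false positives, and client stalls during switchover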

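1. How Redis replication works: what is actually transferred

1.1 The replication flow step by step

Redis replication is neither a pure push nor a simple pull. It is a hybrid: the master produces an RDB snapshot plus a command stream, and the replica initiates the connection and asks for them. The full flow:

The replica connects to the master and sends PSYNC (Redis 2.8+).
The master checks whether this is the replica's first connection:

First connection: a full resync is triggered.
Not the first: the master checks whether the replica's replid and offset match its own history. If they do, a partial resync is triggered.

Full resync: the master forks a child process to generate an RDB. Write commands that arrive in the meantime are kept in the replication buffer. When the RDB is complete the master sends it to the replica, the replica loads it into memory, and then replays the buffered commands.
Partial resync: the master continuously appends write commands to repl_backlog (a fixed-size ring buffer), and the replica uses its own replication offset to fetch only the part it missed.

Two fields are central to replication:

replid: the replication ID. It changes when the master restarts, which forces replicas into a full resync.
offset: the replication offset, counted in bytes of the command stream. It grows on the master as writes are propagated and on the replica as they are received.

Whether master and replica are "in sync" comes down to the difference master_repl_offset - slave_repl_offset. A small difference means the replica is keeping up; a large one means the link was interrupted or slow at some point.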
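The offset comparison can be sketched in a few lines of Python. This is a minimal illustration that works on the text returned by INFO replication on the master; the helper names parse_info and replica_lag_bytes and the sample values are assumptions made for this example, not part of any Redis client.

```python
def parse_info(text):
    """Parse `INFO replication` output into a dict of raw string values."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Replication"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_lag_bytes(master_info):
    """Return {"ip:port": bytes behind the master} for every slaveN entry."""
    master_offset = int(master_info["master_repl_offset"])
    lags = {}
    for key, value in master_info.items():
        # slave0, slave1, ... hold comma-separated key=value attributes
        if key.startswith("slave") and key[5:].isdigit():
            attrs = dict(item.split("=", 1) for item in value.split(","))
            address = f'{attrs["ip"]}:{attrs["port"]}'
            lags[address] = master_offset - int(attrs["offset"])
    return lags


# Illustrative sample, shaped like the master's INFO replication output
SAMPLE = """\
# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6380,state=online,offset=12345,lag=1
slave1:ip=127.0.0.1,port=6381,state=online,offset=12001,lag=1
master_repl_offset:12345
"""

print(replica_lag_bytes(parse_info(SAMPLE)))
# {'127.0.0.1:6380': 0, '127.0.0.1:6381': 344}
```

In a real check the text would come from a client call or from redis-cli, and the byte difference would be compared against a threshold derived from the workload's normal write rate.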

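1.2 What triggers a full resync

In practice the following situations always trigger a full resync (slow, possibly minutes, and the replica needs enough memory to take it):

The replica is new, so the master has no replication history for it.
The master was restarted and its replid changed.
The replica's replid does not match, for example a replica of A was attached to B by mistake.
repl_backlog is too small: after a long disconnect, the replica's offset has already been overwritten.

So production configuration should:

Set repl-backlog-size well above the 1mb default, which is far too small (this article uses 1gb).
Avoid frequent master restarts (a kill -9 followed by a restart leads to a full resync).
Avoid long-running BGSAVE, DEBUG RELOAD, SHUTDOWN NOSAVE and similar operations on the master.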
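One way to choose a backlog size instead of guessing is to multiply the write rate of the replication stream by the longest disconnect you want to survive. The sketch below does that arithmetic; the function names and the safety_factor of 2 are assumptions for illustration, and the write rate has to come from your own measurements (for example the growth of master_repl_offset over time).

```python
def backlog_bytes(write_bytes_per_sec, max_disconnect_sec, safety_factor=2.0):
    """Bytes of backlog needed so a replica can partially resync after an outage."""
    return int(write_bytes_per_sec * max_disconnect_sec * safety_factor)


def format_size(n):
    """Render a byte count as a redis.conf size such as 480mb, rounding up."""
    for unit, size in (("gb", 1024 ** 3), ("mb", 1024 ** 2), ("kb", 1024)):
        if n >= size:
            return f"{-(-n // size)}{unit}"  # ceiling division
    return str(n)


# Example: 2 MB/s of replicated writes, replicas may be away for up to 120 s
needed = backlog_bytes(2 * 1024 * 1024, 120)
print("repl-backlog-size", format_size(needed))
# repl-backlog-size 480mb
```

The backlog lives in the master's memory, so the result also has to fit next to maxmemory on the host.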

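1.3 What asynchronous replication implies

Redis replication is asynchronous. The master applies a command in memory, replies OK to the client right away, and only then streams the command to replicas.

That leaves an unavoidable hazard:

The master replies OK, then crashes before the replica receives the command.
Sentinel promotes a replica to master, and the commands the old master had not yet sent are lost.

This is the root cause of data loss on failover, and it is also why writes accepted during a split-brain are lost. The min-replicas-to-write and min-replicas-max-lag settings shown later mitigate the problem but cannot eliminate it. If you really need strong consistency, look beyond plain Redis, for example:

Dual writes to master and replica at the application layer
An in-house KV store built on Raft
A niche option such as RedisRaft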

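2. How Sentinel works

2.1 Responsibilities of the Sentinel group

Sentinel is a separate process. You deploy 3~5 of them as a group, and they have four jobs:

Monitoring: continuously PING masters and replicas to check that they are alive.
Notification: alert administrators through an API when a node has a problem.
Automatic failover: when the master goes down, promote one replica to master and repoint the other replicas at it.
Configuration provider: clients connect to the Sentinels, which tell them who the current master is.

One common misconception: Sentinel is not a "backup" of the master.

Sentinel stores no data.
Sentinel forwards no requests.
Sentinel is only the referee that decides who the master is.

Replication of the data itself happens between the Redis nodes; Sentinel takes no part in it.

2.2 Subjective and objective down

Sentinel decides that a node is down in two steps:

Subjectively down (SDOWN): a single Sentinel gets no valid PING reply from the node within the timeout and marks it SDOWN. The timeout is down-after-milliseconds, 30 seconds by default.
Objectively down (ODOWN): the Sentinels ask each other "do you also think it is down?". If at least quorum Sentinels (usually a majority) agree, the node is ODOWN.

Only masters go through ODOWN (a replica going down usually has little impact). Once the master is ODOWN, the Sentinels start electing a new master.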

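2.3 Leader election and failover

The Sentinels first elect a leader among themselves to carry out the failover:

Each Sentinel that sees the master as ODOWN can ask to become leader, and votes are granted first come, first served.
The candidate requests votes from the other Sentinels.
A candidate that collects votes from more than half of the Sentinels (itself included) becomes leader.
If a round produces no winner, another round follows; if there is still no result within failover-timeout, the attempt is abandoned.

The leader then picks one replica to promote:

Discard every replica that is down, disconnected, or has not answered a PING in the last 5 seconds.
Sort by priority: the lower the replica-priority (slave-priority in older versions), the better. A priority of 0 means the replica is never promoted.
On equal priority, compare replication offset: larger is better (most up-to-date data).
If still tied, compare runid: the lexicographically smaller one wins.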
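The selection rules above translate naturally into a filter followed by a sort key. The following is a simplified model, not Sentinel's actual code: the Replica class, its field names, and the 5000 ms cut-off parameter are assumptions for this sketch, and real Sentinel applies additional checks on how long a replica's link to the master has been down.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    priority: int               # replica-priority; 0 means "never promote"
    offset: int                 # replication offset, larger is fresher
    runid: str
    online: bool = True
    last_pong_ms_ago: int = 0   # time since the last valid PING reply


def pick_new_master(replicas, max_pong_age_ms=5000):
    """Return the replica a simplified Sentinel leader would promote, or None."""
    candidates = [
        r for r in replicas
        if r.online and r.priority > 0 and r.last_pong_ms_ago <= max_pong_age_ms
    ]
    if not candidates:
        return None
    # Lower priority first, then higher offset, then smaller runid
    return min(candidates, key=lambda r: (r.priority, -r.offset, r.runid))


replicas = [
    Replica("6380", priority=100, offset=12000, runid="bbb"),
    Replica("6381", priority=100, offset=12345, runid="ccc"),
    Replica("6382", priority=10, offset=500, runid="aaa", online=False),
]
print(pick_new_master(replicas).name)
# 6381
```

The offline node loses despite its better priority because filtering happens before sorting, which is also why a stale but preferred replica cannot win a real election.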

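Once a replica is chosen:

Send SLAVEOF NO ONE to the new master so that it stops replicating.
Send SLAVEOF <newmaster-ip> <newmaster-port> to the other replicas so that they follow the new master.
Record the old master as a replica of the new master (applied when the old master comes back online).

A pitfall from old versions (Redis < 2.8): after choosing a master, the old Sentinel only updated its own internal master list and did not proactively notify the old master. The old master would keep serving as a master until clients fetched the configuration again. Redis 2.8+ no longer has this problem, but people on legacy deployments still run into it.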

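2.4 Minimum number of Sentinels

1 Sentinel: pointless, a single point of failure.
2 Sentinels: no usable majority. If one goes down, the survivor cannot win a leader election, so failover cannot happen.
3 Sentinels: the standard setup, tolerates 1 Sentinel failure.
5 Sentinels: tolerates 2 Sentinel failures.

An odd number of Sentinels is strongly recommended. Three is usually enough in production; use five if you have the machines.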
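The difference between quorum and majority is easy to check with a small model. In this sketch (the function name and arguments are illustrative assumptions) quorum decides whether the master can be marked ODOWN, while the leader election additionally needs a majority of all known Sentinels, dead ones included.

```python
def failover_possible(total_sentinels, alive_sentinels, quorum):
    """Model whether a failover can start with the given number of live Sentinels."""
    majority = total_sentinels // 2 + 1
    can_mark_odown = alive_sentinels >= quorum
    # The leader needs votes from a majority and from at least quorum Sentinels
    can_elect_leader = alive_sentinels >= max(majority, quorum)
    return can_mark_odown and can_elect_leader


for total, alive, quorum in [(3, 2, 2), (3, 1, 2), (2, 1, 1), (5, 3, 3), (3, 2, 3)]:
    print(total, alive, quorum, failover_possible(total, alive, quorum))
# 3 2 2 True
# 3 1 2 False
# 2 1 1 False
# 5 3 3 True
# 3 2 3 False
```

The (2, 1, 1) row is the two-Sentinel trap: the survivor can declare ODOWN on its own but cannot collect a majority of votes, so nothing gets promoted.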

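3. Quick hands-on: a single-host pseudo-cluster

3.1 Preparing the environment

To validate the whole flow quickly, open 6 ports on one machine to simulate one master, two replicas, and three Sentinels.

# Demo environment: Redis 7.x on CentOS 7
# Option 1: install from yum
# (the EPEL package for CentOS 7 is usually far older than 7.x; check with redis-server --version)
yum install -y epel-release
yum install -y redis

# Option 2: build 7.x from source
cd /usr/local/src
wget https://download.redis.io/releases/redis-7.2.4.tar.gz
tar xzf redis-7.2.4.tar.gz
cd redis-7.2.4
make -j $(nproc)
make install PREFIX=/usr/local/redis
ln -s /usr/local/redis/bin/* /usr/local/bin/

Risk note: make sure gcc and make are installed before building from source. With PREFIX set, make install writes to /usr/local/redis/bin; it is the ln -s step that puts the binaries on the PATH, and it fails if a file of the same name already exists in /usr/local/bin/. Running several versions side by side requires your own path isolation.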

3.2 Creating the directories

mkdir -p /opt/redis-cluster/{6379,6380,6381}/{data,log,conf,run}
mkdir -p /opt/redis-sentinel/{26379,26380,26381}/{log,conf,run}

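3.3 Redis node configuration

Shared base configuration (save it as /opt/redis-cluster/6379/conf/redis-common.conf and pull it in with include). The file hard-codes the port and the 6379 paths, so each node needs its own copy: for 6380 and 6381, copy it into that node's conf directory and replace every 6379 with the node's port.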

# Listen address; do not use 0.0.0.0 in production
bind 127.0.0.1 -::1
port 6379
daemonize yes
pidfile /opt/redis-cluster/6379/run/redis_6379.pid
logfile /opt/redis-cluster/6379/log/redis.log
dir /opt/redis-cluster/6379/data

# RDB persistence
save 900 1
save 300 10
save 60 10000
dbfilename dump.rdb

# AOF
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite yes
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

# Replication
repl-backlog-size 1gb
repl-backlog-ttl 3600
replica-priority 100
client-output-buffer-limit replica 512mb 128mb 60

# Memory
maxmemory 2gb
maxmemory-policy allkeys-lru

# Security
requirepass Redis@2026
masterauth Redis@2026
protected-mode yes

# Slow log
slowlog-log-slower-than 10000
slowlog-max-len 128

# Rename or disable dangerous commands
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command CONFIG "CONFIG_b9f3c1a2"
rename-command DEBUG ""

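Risk notes:

requirepass must be complex and managed separately in production; never use a sample value like Redis@2026.
masterauth must match requirepass, because when a master is demoted to replica it uses this password to connect to the new master.
protected-mode yes only lets loopback clients in while no password is set, so a node bound to a non-loopback address without a password rejects remote connections. This is a common surprise.
rename-command hides dangerous commands, but any client that needs them must know the new name. That includes Sentinel, which issues CONFIG REWRITE during failover: if CONFIG is renamed, tell Sentinel with sentinel rename-command <master-name> CONFIG <new-name> (Redis 5.0+). CONFIG, FLUSHALL, FLUSHDB, and KEYS should be hidden or disabled.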

Master node, /opt/redis-cluster/6379/conf/redis.conf:

include /opt/redis-cluster/6379/conf/redis-common.conf

# 6379 is the master, so no replicaof
replica-read-only yes
replica-serve-stale-data yes

Replica node, /opt/redis-cluster/6380/conf/redis.conf:

include /opt/redis-cluster/6380/conf/redis-common.conf

# Follow 6379
replicaof 127.0.0.1 6379
replica-read-only yes
replica-serve-stale-data yes

Replica node, /opt/redis-cluster/6381/conf/redis.conf:

include /opt/redis-cluster/6381/conf/redis-common.conf

replicaof 127.0.0.1 6379
replica-read-only yes
replica-serve-stale-data yes

replica-read-only has long defaulted to yes; it is written out here to make it explicit that replicas are not writable. replica-serve-stale-data yes lets a replica keep answering reads from stale data while its link to the master is down. Do not switch it to no lightly: most commands on a disconnected replica then fail with MASTERDOWN Link with MASTER is down and replica-serve-stale-data is set to 'no'.

3.4 Starting the Redis nodes

# Start the three Redis instances
redis-server /opt/redis-cluster/6379/conf/redis.conf
redis-server /opt/redis-cluster/6380/conf/redis.conf
redis-server /opt/redis-cluster/6381/conf/redis.conf

# Check the processes
ps -ef | grep redis-server

# Check the listening ports
ss -tunlp | grep redis

3.5 Verifying replication

# Log in to the master
redis-cli -p 6379 -a Redis@2026
127.0.0.1:6379> PING
PONG
127.0.0.1:6379> INFO replication
# role:master
# connected_slaves:2
# slave0:ip=127.0.0.1,port=6380,state=online,offset=...,lag=1
# slave1:ip=127.0.0.1,port=6381,state=online,offset=...,lag=1
# master_replid:xxxxxxxx
# master_repl_offset:12345
127.0.0.1:6379> SET test:1 hello
OK
127.0.0.1:6379> GET test:1
"hello"

Passing the password with redis-cli -a exposes it in the process list and shell history, and redis-cli prints a warning about it. In production, prefer the REDISCLI_AUTH environment variable, redis-cli --askpass (Redis 6.0+), or an interactive AUTH after connecting.

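# Log in to a replica
redis-cli -p 6380 -a Redis@2026
127.0.0.1:6380> INFO replication
# role:slave
# master_host:127.0.0.1
# master_port:6379
# master_link_status:up
# master_sync_in_progress:0
# slave_repl_offset:12345
# master_replid:xxxxxxxx
127.0.0.1:6380> GET test:1
"hello"

If the replica cannot GET a key that was just written on the master, the most likely causes are:

The master has requirepass set but the replica has no masterauth.
A firewall is blocking 6379.
bind is set to 127.0.0.1 while master and replica try to reach each other over an internal IP, so the connection is refused.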

4. Deploying Sentinel

4.1 Sentinel configuration

Sentinel configuration file /opt/redis-sentinel/26379/conf/sentinel.conf:

# The Sentinel itself
bind 127.0.0.1
port 26379
daemonize yes
pidfile /opt/redis-sentinel/26379/run/sentinel_26379.pid
logfile /opt/redis-sentinel/26379/log/sentinel.log
dir /opt/redis-sentinel/26379

# The master to monitor
# sentinel monitor <master-name> <ip> <port> <quorum>
# Use a business name for master-name rather than mymaster
sentinel monitor redis-ha 127.0.0.1 6379 2

# Authentication (needed if the master has requirepass)
sentinel auth-pass redis-ha Redis@2026

# Timeout for subjective down
sentinel down-after-milliseconds redis-ha 5000

# How many replicas resync with the new master at the same time (keeps them from saturating it)
sentinel parallel-syncs redis-ha 1

# Overall failover timeout
sentinel failover-timeout redis-ha 60000

# Notification script (optional)
# sentinel notification-script redis-ha /opt/sh/notify.sh

# Client reconfiguration script (optional)
# sentinel client-reconfig-script redis-ha /opt/sh/reconfig.sh

The configurations for 26380 and 26381 (for example /opt/redis-sentinel/26380/conf/sentinel.conf) are almost identical; change port, pidfile, logfile, and dir. A quorum of 2 means that at least 2 of the 3 Sentinels must consider the master down before failover starts.

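Sentinel parameters in detail:

down-after-milliseconds: how long without a PONG before a Sentinel marks the node SDOWN. Production values are typically 5000~10000. Too large and failover is slow; too small and network jitter causes false positives.
parallel-syncs: how many replicas start replicating from the new master at the same time once it is chosen. 1 means one at a time (recommended, keeps the resync storm under control). Larger values put heavy load on the master.
failover-timeout: the total time allowed for one failover, including replica selection, replication, and reconfiguration. If it is not finished in time, the attempt is abandoned.
auth-pass: the password Sentinel uses to connect to the master and replicas.
notification-script: called when a master or replica enters SDOWN or ODOWN; the event name is passed as an argument.
client-reconfig-script: called after failover completes, with the arguments <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>.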
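A client-reconfig-script receives those seven positional arguments, so the script mostly consists of parsing them and acting on the new address. The sketch below assumes the argument order documented for Sentinel (master name, role, state, old address, new address); the ReconfigEvent type and function name are made up for this example, and what to do with the event (update a config store, reload a proxy) is left out.

```python
import sys
from typing import NamedTuple


class ReconfigEvent(NamedTuple):
    master_name: str
    role: str        # "leader" or "observer"
    state: str       # "failover" for this script
    from_ip: str
    from_port: int
    to_ip: str
    to_port: int


def parse_reconfig_args(argv):
    """Turn the 7 arguments Sentinel passes to the script into a typed event."""
    if len(argv) != 7:
        raise ValueError(f"expected 7 arguments, got {len(argv)}")
    name, role, state, from_ip, from_port, to_ip, to_port = argv
    return ReconfigEvent(name, role, state, from_ip, int(from_port), to_ip, int(to_port))


if __name__ == "__main__" and len(sys.argv) == 8:
    event = parse_reconfig_args(sys.argv[1:])
    print(f"{event.master_name}: master moved to {event.to_ip}:{event.to_port}")
```

Keeping the script short matters because Sentinel runs it during failover handling and expects it to finish quickly.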

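4.2 Starting Sentinel

redis-sentinel /opt/redis-sentinel/26379/conf/sentinel.conf
redis-sentinel /opt/redis-sentinel/26380/conf/sentinel.conf
redis-sentinel /opt/redis-sentinel/26381/conf/sentinel.conf

# Alternatively, start redis-server in Sentinel mode
redis-server /opt/redis-sentinel/26379/conf/sentinel.conf --sentinel

Note that after startup (with either command) Sentinel rewrites its own sentinel.conf and adds what it discovered automatically. This confuses many people: once Sentinel is running, lines such as sentinel known-replica and sentinel known-sentinel appear after the sentinel monitor directive. This is normal; do not edit them by hand. It also means the configuration file must be writable by the Sentinel process.

# Follow the Sentinel log
tail -F /opt/redis-sentinel/26379/log/sentinel.log

You should see output similar to: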

+monitor master redis-ha 127.0.0.1 6379 quorum 2
+slave slave 127.0.0.1:6380 127.0.0.1 6380 @ redis-ha 127.0.0.1 6379
+slave slave 127.0.0.1:6381 127.0.0.1 6381 @ redis-ha 127.0.0.1 6379
+sentinel sentinel 127.0.0.1:26380 127.0.0.1 26380 @ redis-ha 127.0.0.1 6379
+sentinel sentinel 127.0.0.1:26381 127.0.0.1 26381 @ redis-ha 127.0.0.1 6379

4.3 Verifying the Sentinel group

# Log in to any Sentinel
# (the sentinel.conf above sets no requirepass, so -a is not needed here)
redis-cli -p 26379
127.0.0.1:26379> SENTINEL masters
1)  1) "name"
    2) "redis-ha"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6379"
    ...

# Master status
127.0.0.1:26379> SENTINEL master redis-ha
1) "name"
2) "redis-ha"
3) "ip"
4) "127.0.0.1"
5) "port"
6) "6379"
7) "runid"
8) "xxxxxxxx"
9) "flags"
10) "master"
11) "num-slaves"
12) "2"
13) "num-other-sentinels"
14) "2"
15) "quorum"
16) "2"
...

# Replicas
127.0.0.1:26379> SENTINEL replicas redis-ha

# Other Sentinels
127.0.0.1:26379> SENTINEL sentinels redis-ha

# What a client asks to find the current master
127.0.0.1:26379> SENTINEL get-master-addr-by-name redis-ha
1) "127.0.0.1"
2) "6379"

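Core Sentinel commands:

SENTINEL masters: list all monitored masters.
SENTINEL master <name>: show details of one master.
SENTINEL replicas <name> (SENTINEL slaves <name> in older versions): list the replicas.
SENTINEL sentinels <name>: list the other Sentinels.
SENTINEL get-master-addr-by-name <name>: get the address of the current master.
SENTINEL failover <name>: force a failover without waiting for SDOWN / ODOWN. Useful for drills.
SENTINEL reset <name>: clear Sentinel's internal state and rediscover replicas and Sentinels.
SENTINEL ckquorum <name>: check whether the quorum can still be reached.
SENTINEL flushconfig: write the in-memory configuration back to disk.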
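Replies such as SENTINEL master come back as a flat list of alternating field names and values, which is awkward to check by eye. The sketch below folds such a list into a dict and applies a few sanity checks; the function names, the sample reply, and the specific checks are assumptions for illustration, and many client libraries already perform the first step for you.

```python
def pairs_to_dict(flat):
    """Fold ["name", "redis-ha", "ip", "127.0.0.1", ...] into a dict."""
    if len(flat) % 2:
        raise ValueError("reply must contain an even number of items")
    it = iter(flat)
    return dict(zip(it, it))


def master_health(reply, expected_sentinels):
    """Return a list of human-readable problems found in a SENTINEL master reply."""
    info = pairs_to_dict(reply)
    flags = info["flags"].split(",")
    problems = []
    if "s_down" in flags or "o_down" in flags:
        problems.append("master flagged down: " + info["flags"])
    if int(info["num-other-sentinels"]) != expected_sentinels - 1:
        problems.append("this Sentinel does not see all of its peers")
    if int(info["num-slaves"]) == 0:
        problems.append("no replica available for promotion")
    return problems


# Illustrative reply, trimmed to the fields used above
REPLY = [
    "name", "redis-ha", "ip", "127.0.0.1", "port", "6379",
    "flags", "master", "num-slaves", "2",
    "num-other-sentinels", "2", "quorum", "2",
]
print(master_health(REPLY, expected_sentinels=3))
# []
```

An empty list means the three checks passed; anything else is worth looking at before relying on automatic failover.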

5. Failover drills

5.1 Drill 1: kill the master and verify automatic failover

Simulate a master crashing unexpectedly in production:

# Confirm the current master
redis-cli -p 26379 SENTINEL get-master-addr-by-name redis-ha
# 1) "127.0.0.1"
# 2) "6379"

# Kill the master hard
kill -9 $(cat /opt/redis-cluster/6379/run/redis_6379.pid)

# Follow the Sentinel log in real time
tail -F /opt/redis-sentinel/26379/log/sentinel.log

Expected log (in chronological order):

# Sentinel 1
+sdown master redis-ha 127.0.0.1 6379
# Sentinels 2 and 3 report sdown as well
+odown master redis-ha 127.0.0.1 6379 #quorum 2/2
+new-epoch 1
+try-failover master redis-ha 127.0.0.1 6379
+vote-for-leader xxx 1
+elected-leader master redis-ha 127.0.0.1 6379
+failover-state-select-slave master redis-ha 127.0.0.1 6379
+selected-slave slave 127.0.0.1:6380 127.0.0.1 6380 @ redis-ha 127.0.0.1 6379
+failover-state-send-slaveof-noone slave 127.0.0.1:6380 127.0.0.1 6380 @ redis-ha 127.0.0.1 6379
+failover-state-wait-promotion slave 127.0.0.1:6380 127.0.0.1 6380 @ redis-ha 127.0.0.1 6379
+promoted-slave slave 127.0.0.1:6380 127.0.0.1 6380 @ redis-ha 127.0.0.1 6379
+failover-state-reconf-slaves master redis-ha 127.0.0.1 6379
+slave-reconf-sent slave 127.0.0.1:6381 127.0.0.1 6381 @ redis-ha 127.0.0.1 6379
+slave-reconf-inprog slave 127.0.0.1:6381 127.0.0.1 6381 @ redis-ha 127.0.0.1 6379
+slave-reconf-done slave 127.0.0.1:6381 127.0.0.1 6381 @ redis-ha 127.0.0.1 6379
+failover-end master redis-ha 127.0.0.1 6379
+switch-master redis-ha 127.0.0.1 6379 127.0.0.1 6380

The failover is complete.

# Query the master address again
redis-cli -p 26379 SENTINEL get-master-addr-by-name redis-ha
# 1) "127.0.0.1"
# 2) "6380"

# Write test
redis-cli -p 6380 -a Redis@2026 SET failover:test 1
redis-cli -p 6380 -a Redis@2026 GET failover:test
# "1"

# Check that the other replica has caught up
redis-cli -p 6381 -a Redis@2026 GET failover:test
# "1"

5.2 Drill 2: restart the old master and verify that it rejoins

Once the old master is repaired, start it again:

redis-server /opt/redis-cluster/6379/conf/redis.conf

# Check the Sentinel log a few seconds later
tail -F /opt/redis-sentinel/26379/log/sentinel.log

Expected log:

# After the restart, Sentinel sees the old master come back and turns it into a replica
-sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ redis-ha 127.0.0.1 6380
+convert-to-slave slave 127.0.0.1:6379 127.0.0.1 6379 @ redis-ha 127.0.0.1 6380

# Confirm that 6379 is now a replica of 6380
redis-cli -p 6379 -a Redis@2026 INFO replication | grep -E "role|master_host|master_port"
# role:slave
# master_host:127.0.0.1
# master_port:6380

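5.3 Drill 3: manual forced failover

# Leave the master running and trigger a failover by hand
redis-cli -p 26379 SENTINEL failover redis-ha

The command returns OK. Sentinel enters the failover flow immediately, without waiting for SDOWN.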

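5.4 Drill 4: network partition and split-brain

Simulate a split-brain in which the old master keeps serving clients while the Sentinels have already switched to a new master:

# Block traffic to 6379 with an iptables rule (demo only)
# Do not drop traffic like this in production without assessing the impact first
iptables -I INPUT -p tcp --dport 6379 -j DROP
# 6379 is still running, but Sentinel PINGs no longer reach it
# Caveat: this rule drops every TCP connection to port 6379, loopback clients included.
# To keep writing to the old master during the drill, use a multi-host setup and
# restrict the rule to the Sentinel and replica addresses with -s.

What to observe:

The old master (6379) can still accept SET from clients that reach it, but 6380 / 6381 no longer receive its replication stream.
The Sentinels mark 6379 objectively down and elect a new master.
Writes made to the old master during this window are never replicated to the new master, so they are lost.

After recovery:

# Restore the network
iptables -D INPUT -p tcp --dport 6379 -j DROP

Once the old master is reachable again, Sentinel automatically turns it into a replica of the new master. Data on the old master that was never replicated is not replayed; it is simply lost.

Key settings for limiting data loss from split-brain:

# The old master needs at least 1 replica that acknowledged within the last 10 seconds
min-replicas-to-write 1
min-replicas-max-lag 10

This makes the master refuse writes when, from its own point of view, all replicas have fallen behind. It returns (error) NOREPLICAS Not enough good replicas to write., which bounds how long isolated writes can be accepted during a split-brain.

Risk note: with min-replicas-to-write 1, the master rejects all writes if every replica is down. A typical production setting is 1 replica with a maximum lag of 10 seconds. When designing error handling, be explicit about the trade-off: it is better to return an error than to let users believe a write succeeded and lose it afterwards.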
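The decision the master makes under these two settings can be modeled in a few lines. This is a simplified sketch of the rule rather than Redis source code: the function name is an assumption, and the lag values stand for the seconds since each replica's last acknowledgement as seen by the master.

```python
def accepts_writes(replica_lags_sec, min_replicas_to_write, min_replicas_max_lag):
    """Model whether a master would accept a write under the min-replicas settings."""
    if min_replicas_to_write == 0:
        return True  # feature disabled
    good = sum(1 for lag in replica_lags_sec if lag <= min_replicas_max_lag)
    return good >= min_replicas_to_write


print(accepts_writes([1, 2], 1, 10))    # True: both replicas are fresh
print(accepts_writes([15, 30], 1, 10))  # False: every replica is too far behind
print(accepts_writes([], 1, 10))        # False: no replicas connected at all
```

The second case is the split-brain situation: the isolated master keeps running, but once the lag of every replica exceeds the limit it starts answering writes with NOREPLICAS, so at most about min-replicas-max-lag seconds of writes can be stranded.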

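6. Production configuration templates

6.1 Multi-host production deployment

The pseudo-cluster is for validation only. Production needs at least three machines:

| Host | Role | Port |
| --- | --- | --- |
| redis-prod-01 | master | 6379 |
| redis-prod-02 | slave1 | 6379 |
| redis-prod-03 | slave2 | 6379 |
| redis-prod-01 | sentinel | 26379 |
| redis-prod-02 | sentinel | 26380 |
| redis-prod-03 | sentinel | 26381 |

Running Sentinel on the same hosts as Redis saves machines, but a host failure then takes down a Sentinel and a Redis node together. If you have machines to spare, put the Sentinels on their own hosts.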

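6.2 Redis configuration: production differences

# Differences from the single-host pseudo-cluster

# Listen on the internal network
bind 10.20.0.11

# Protected mode off (internal bind + strong password already in place)
# Protected mode only restricts access while no password is set, so with requirepass it can also stay at yes
protected-mode no

# Disable or restrict dangerous commands
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command CONFIG "CONFIG_x9f3"
rename-command KEYS ""

# Slow log
slowlog-log-slower-than 5000
slowlog-max-len 256

# Client output buffers
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 1024mb 256mb 300
client-output-buffer-limit pubsub 64mb 16mb 60

# Protocol version
# RESP2 works with most clients; Redis 6+ also supports RESP3, which a client opts into with HELLO 3

# Disable THP (transparent huge pages)
# This is an OS-level setting, not part of redis.conf
# echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Enable lazyfree (keeps large-key reclamation from blocking the main thread)
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
lazyfree-lazy-user-del yes

Risk notes:

protected-mode no removes the safety net that keeps a password-less instance from accepting remote connections. Anyone who can reach 6379 can then attempt to brute-force the password, so it must be combined with bind, a strong password, and a firewall.
rename-command KEYS "" makes a production hazard like KEYS * unusable, but it also takes the command away from operators during troubleshooting. A safer choice is rename-command KEYS "KEYS_x9f3": hidden, but still reachable.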

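6.3 Sentinel configuration: production differences

bind 10.20.0.11
port 26379
daemonize yes
logfile /var/log/redis/sentinel.log
dir /var/lib/redis/sentinel

sentinel monitor redis-ha 10.20.0.11 6379 2
sentinel auth-pass redis-ha <STRONG-PASSWORD>
sentinel down-after-milliseconds redis-ha 5000
sentinel parallel-syncs redis-ha 1
sentinel failover-timeout redis-ha 60000

# Forbid changing notification-script / client-reconfig-script at runtime through SENTINEL SET
sentinel deny-scripts-reconfig yes

# Optional: hostname support and the announced address
# sentinel resolve-hostnames no
# sentinel announce-ip 10.20.0.11
# sentinel announce-port 26379

announce-ip / announce-port are meant for multi-NIC and NAT setups: they set the address this Sentinel advertises to its peers, and therefore the address clients see when they list the Sentinels. Without them the advertised address is the auto-detected local one, which may be unreachable from another network segment.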

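6.4 Firewall and security groups

What production must have in place:

Allow 6379 between the Redis nodes.
Allow 26379 between the Sentinel nodes.
Allow 26379 from clients to the Sentinels.
Allow 6379 from clients to Redis.
Expose only 6379 on Redis nodes to the public network (if public access is needed at all); routing all business traffic over the internal network is recommended.
If public access is not needed, close the public entry point entirely.

iptables example (adjust as needed). ACCEPT rules must come before the DROP rule for the same port, because rules are evaluated in order:

# Redis nodes to each other
iptables -A INPUT -p tcp --dport 6379 -s 10.20.0.0/24 -j ACCEPT
# Business clients to Redis
iptables -A INPUT -p tcp --dport 6379 -s 10.30.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 6379 -j DROP

# Sentinels to each other
iptables -A INPUT -p tcp --dport 26379 -s 10.20.0.0/24 -j ACCEPT
# Business clients to Sentinel
iptables -A INPUT -p tcp --dport 26379 -s 10.30.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 26379 -j DROP

Risk note: before changing iptables, keep a second root session open. Do not run service iptables save and then systemctl restart iptables right away, or you may lock yourself out.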

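7. Client access

7.1 Connecting to the master with redis-cli

In production you normally do not hard-code the master address; you discover it through Sentinel:

# Get the current master
redis-cli -h sentinel-01 -p 26379 -a <PASSWORD> SENTINEL get-master-addr-by-name redis-ha

# Then connect to the returned IP / port
redis-cli -h 10.20.0.11 -p 6379 -a <PASSWORD> SET foo bar

Calling SENTINEL get-master-addr-by-name before every command is not recommended; use a connection pool or a client SDK instead.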

7.2 Lettuce (Java)

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.sentinel.api.StatefulRedisSentinelConnection;
import io.lettuce.core.sentinel.api.sync.RedisSentinelCommands;

public class LettuceSentinelDemo {
    public static void main(String[] args) {
        // Sentinel connection URI
        RedisURI sentinelUri = RedisURI.builder()
                .withSentinel("10.20.0.11", 26379)
                .withSentinel("10.20.0.12", 26380)
                .withSentinel("10.20.0.13", 26381)
                .withSentinelMasterId("redis-ha")
                .withPassword("<PASSWORD>".toCharArray())
                .build();

        RedisClient client = RedisClient.create();

        // Open a connection to Sentinel itself
        StatefulRedisSentinelConnection<String, String> sentinelConn = client.connectSentinel(sentinelUri);
        RedisSentinelCommands<String, String> sentinel = sentinelConn.sync();

        // Look up the current master address
        System.out.println(sentinel.getMasterAddrByName("redis-ha"));

        sentinelConn.close();

        // Business connection: Lettuce resolves the master through Sentinel
        StatefulRedisConnection<String, String> conn = client.connect(sentinelUri);
        conn.sync().set("key", "value");
        conn.close();

        client.shutdown();
    }
}

Lettuce uses long-lived asynchronous connections by default and can pick up a master switch automatically. It is a reasonable default for production projects; Jedis is synchronous and leaves more of the connection management and failover handling to the application.

7.3 redis-py (Python)

from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("10.20.0.11", 26379), ("10.20.0.12", 26380), ("10.20.0.13", 26381)],
    socket_timeout=0.5,
    password="<PASSWORD>",
)

# Connection to the master
master = sentinel.master_for("redis-ha", password="<PASSWORD>", db=0)
master.set("foo", "bar")
print(master.get("foo"))

# Connection to a replica (reads)
slave = sentinel.slave_for("redis-ha", password="<PASSWORD>", db=0)
print(slave.get("foo"))

sentinel.master_for wraps a connection pool internally and reconnects automatically when the master changes.

7.4 Go client (go-redis)

package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

func main() {
    client := redis.NewFailoverClient(&redis.FailoverOptions{
        MasterName:    "redis-ha",
        SentinelAddrs: []string{"10.20.0.11:26379", "10.20.0.12:26380", "10.20.0.13:26381"},
        Password:      "<PASSWORD>",
        DB:            0,
    })
    defer client.Close()

    ctx := context.Background()
    if err := client.Set(ctx, "foo", "bar", 0).Err(); err != nil {
        panic(err)
    }
    val, err := client.Get(ctx, "foo").Result()
    if err != nil {
        panic(err)
    }
    fmt.Println(val)
}

go-redis supports Sentinel mode through NewFailoverClient; the import path above is the v9 one.

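8. Production tuning

8.1 Key kernel settings

# Disable transparent huge pages (strongly recommended for Redis)
# These are shell commands, not sysctl keys; they apply immediately but do not survive a reboot
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# To make them persistent, add the same two lines to /etc/rc.local
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/defrag

# /etc/sysctl.conf (apply with sysctl -p)
# Socket queues
net.core.somaxconn = 1024
net.ipv4.tcp_max_syn_backlog = 1024

# Memory (depends on the workload)
vm.overcommit_memory = 1
# Meaning of the value:
# 0: heuristic (default)
# 1: always allow overcommit (recommended for Redis)
# 2: never overcommit

# Note: with overcommit_memory=1 the OOM killer may kill processes, so leave memory headroom

# File descriptors
# /etc/security/limits.conf
redis soft nofile 65535
redis hard nofile 65535

Risk notes:

With vm.overcommit_memory=1 the kernel no longer checks whether a requested allocation fits, so a process can mmap a large amount of memory at once and trigger an OOM. A common rule is to give the host 1.5~2 times the memory Redis is allowed to use.
Leaving transparent_hugepage enabled causes Redis latency spikes of tens or even hundreds of milliseconds. After changing it, restart Redis so that its memory is allocated afresh.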

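8.2 Redis performance settings

# Disabling AOF + RDB
# AOF disk writes can slow the main thread, more noticeably when the disk is busy
# But with AOF off in production a crash means data loss, so weigh the trade-off

# Workloads with many small keys (Redis 7.x directive names)
hash-max-listpack-entries 512
hash-max-listpack-value 64
list-max-listpack-size -2
zset-max-listpack-entries 128
zset-max-listpack-value 64
# Before 7.0 these directives were spelled *-ziplist-*; 7.x still accepts those names as aliases
# The old list-max-ziplist-entries / list-max-ziplist-value pair is obsolete: lists use a single size directive

The purpose of ziplist / listpack is to store many small elements in a compact encoding and save memory. The drawback is that a collection with too many or too large elements is converted to the regular structure. Watch for this in production when key sizes are unevenly distributed.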

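8.3 Eviction policy

# The default is noeviction (writes fail with an OOM error once memory is full); production commonly uses allkeys-lru
maxmemory-policy allkeys-lru

# Alternatives
# volatile-lru: evict only keys that have a TTL
# allkeys-lfu: LFU algorithm (Redis 4.0+)
# volatile-ttl: evict the keys with the shortest TTL first
# volatile-random / allkeys-random

allkeys-lfu is a good choice on Redis 4.0+ and copes with sudden hot spots better than LRU. Its behavior can be tuned with lfu-log-factor and lfu-decay-time.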

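9. Monitoring and alerting on replication lag

9.1 Key metrics

The difference between master_repl_offset and slave_repl_offset.
Replica lag in seconds: the lag= field of each slaveN line in the master's INFO replication.
master_last_io_seconds_ago: reported by the replica, seconds since its last interaction with the master.
master_link_down_since_seconds: reported by the replica while the link is down, seconds since the master-replica link was lost.
connected_slaves: number of replicas currently connected.
blocked_clients: number of blocked clients (BLPOP / BRPOP and similar).
instantaneous_ops_per_sec: QPS.
used_memory_peak / used_memory_rss: peak memory and actual resident memory.
mem_fragmentation_ratio: memory fragmentation ratio; above 1.5, consider a restart or a migration.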

9.2 Feeding Prometheus with redis_exporter

wget https://github.com/oliver006/redis_exporter/releases/download/v1.58.0/redis_exporter-v1.58.0.linux-amd64.tar.gz
tar xzf redis_exporter-v1.58.0.linux-amd64.tar.gz
cd redis_exporter-v1.58.0.linux-amd64
cp redis_exporter /usr/local/bin/

Start redis_exporter:

# One exporter per Redis node, running next to it (adjust -redis.addr per node)
redis_exporter \
  -redis.addr redis://10.20.0.11:6379 \
  -redis.password <PASSWORD> \
  -web.listen-address :9121 \
  &

Prometheus scrape configuration (prometheus.yml):

scrape_configs:
  - job_name: 'redis_exporter'
    static_configs:
      - targets:
          - '10.20.0.11:9121'
          - '10.20.0.12:9121'
          - '10.20.0.13:9121'
    metrics_path: /metrics

For Grafana, the Redis Dashboard for Prometheus (ID 11835) is a good starting point, or you can build your own panels.

9.3 Common alert rules

# Metric names follow redis_exporter; they can differ between exporter versions, so check them against /metrics
groups:
  - name: redis_alerts
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis node {{ $labels.instance }} is unreachable"

      - alert: RedisReplicationBroken
        # Evaluate masters only; replicas always report 0 connected replicas
        expr: redis_connected_slaves < 1 and on(instance) redis_instance_info{role="master"}
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis master {{ $labels.instance }} has no replica"

      - alert: RedisReplicationLag
        # Per-replica offsets as exported on the master
        expr: (redis_master_repl_offset - on(instance) group_right redis_connected_slave_offset_bytes) > 1048576
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis replication lag exceeds 1MB"

      - alert: RedisMemoryHigh
        expr: (redis_memory_used_bytes / redis_memory_max_bytes) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage is above 80%"

      - alert: RedisRejectedConnections
        expr: increase(redis_rejected_connections_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Redis is rejecting connections"

      - alert: RedisKeyEvictions
        expr: increase(redis_evicted_keys_total[10m]) > 1000
        labels:
          severity: warning
        annotations:
          summary: "Redis is evicting keys at an abnormal rate"

Do not pick thresholds out of thin air; tune them against the business baseline. The redis_connected_slaves threshold is 1 because the number of replicas is fixed in a Sentinel setup, but keep in mind that with several masters the metric is counted per master.

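10. Failure cases

10.1 Case 1: broken replication link

Symptoms:

Business monitoring shows slow responses when reading from replicas.
Clients report READONLY You can't write against a read only replica.

Initial hypotheses:

Is the client writing to a replica (usually a code bug)?
Is the replication link down?

Checks:

# Replication state on the replica
redis-cli -p 6380 -a <PASSWORD> INFO replication
# ...
# master_link_status:down
# master_sync_in_progress:0
# master_link_down_since_seconds:120
# master_host:10.20.0.11

# Replica state as seen by the master
redis-cli -p 6379 -a <PASSWORD> INFO replication
# slave0:ip=10.20.0.12,port=6379,state=offline

Key indicators:

master_link_status:down means the replica is not receiving anything from the master.
master_link_down_since_seconds keeps growing.

Root cause candidates:

Network problems (firewall / security group / routing).
OOM or a process crash on the master.
The master was restarted, its replid changed, and the replica is stuck in an unfinished full resync.
repl-backlog is too small and was overwritten during the disconnect.

Fix:

# Re-issue the replication command to restart the sync (REPLICAOF is the Redis 5.0+ name)
redis-cli -p 6380 -a <PASSWORD> SLAVEOF 10.20.0.11 6379

# Check whether a sync is in progress
redis-cli -p 6380 -a <PASSWORD> INFO replication | grep master_sync_in_progress

# Watch transfer progress during a full resync
redis-cli -p 6380 -a <PASSWORD> INFO replication | grep -E "master_last_io_seconds|slave_repl_offset"

Verification:

# Wait for master_link_status to become up
redis-cli -p 6380 -a <PASSWORD> INFO replication | grep master_link_status
# master_link_status:up

# Check that the offsets have converged
redis-cli -p 6379 -a <PASSWORD> INFO replication | grep master_repl_offset
redis-cli -p 6380 -a <PASSWORD> INFO replication | grep slave_repl_offset

Review:

Raise repl-backlog-size to 1gb or more.
For cross-datacenter replication, do not set down-after-milliseconds too low.
A full resync puts heavy pressure on memory and bandwidth; monitor the free memory on the master.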

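10.2 Case 2: Sentinel fails to elect a master

Symptoms:

The master is down and the Sentinel log is stuck at +try-failover.
Clients cannot connect for a long time.

Initial hypotheses:

Is the Sentinel group unhealthy (quorum cannot be reached)?
Is there a network problem between the Sentinels?

Checks:

# State of the Sentinel group
redis-cli -p 26379 -a <PASSWORD> SENTINEL sentinels redis-ha

# Quorum (the value is printed on the line after the field name)
redis-cli -p 26379 -a <PASSWORD> SENTINEL master redis-ha | grep -A1 quorum

# Sentinel log
tail -F /var/log/redis/sentinel.log

Key indicators:

num-other-sentinels: how many peers this Sentinel has discovered; it should equal the number of Sentinels minus 1.
quorum: the number of votes needed to mark the master objectively down.
Whether +sdown and +odown appear in the Sentinel log.

Root cause candidates:

The Sentinels cannot reach each other, so each one only sees itself.
The quorum is misconfigured, for example quorum 3 with 3 Sentinels: all 3 must agree that the master is down, so a single unavailable Sentinel blocks failover.
A Sentinel was started with the wrong bind IP, so its peers never learn its real address.

Fix:

# Check that every Sentinel answers PING (repeat with -h for each Sentinel host)
redis-cli -p 26379 -a <PASSWORD> PING

# Reset the Sentinel state for this master
redis-cli -p 26379 -a <PASSWORD> SENTINEL reset redis-ha

# If a Sentinel is not running, restart it
systemctl status redis-sentinel
systemctl restart redis-sentinel

Verification:

# The Sentinel should list its peers
redis-cli -p 26379 -a <PASSWORD> SENTINEL sentinels redis-ha

# Trigger one failover to verify
redis-cli -p 26379 -a <PASSWORD> SENTINEL failover redis-ha

Review:

Deploy an odd number of Sentinels.
Set quorum to N/2 + 1 in general (2 for 3 Sentinels).
The Sentinel port (26379 by default) must be reachable between all Sentinels.
Avoid monitoring several Redis master groups with the same Sentinel group; SENTINEL reset with a broad pattern affects all of them.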

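10.3 Case 3: client writes lost after split-brain

Symptoms:

After a network glitch recovers, clients report "I definitely wrote that key, where did it go?".
Monitoring shows different DBSIZE values on the old master and the new master.

Initial hypotheses:

A split-brain happened on top of asynchronous replication.
The client is not idempotent.

Checks:

# Role history of the old master
redis-cli -p 6379 -a <PASSWORD> INFO replication | grep -E "role|connected_slaves"

# Replica state
redis-cli -p 6380 -a <PASSWORD> INFO replication | grep -E "role|master_link_status"

# Key point: compare master_replid on both nodes
redis-cli -p 6379 -a <PASSWORD> INFO replication | grep master_replid
redis-cli -p 6380 -a <PASSWORD> INFO replication | grep master_replid

Key indicators:

master_replid identical on both nodes: they are on the same replication stream.
master_replid different: the nodes are not (or not yet) on the same stream. After a failover the previous ID is kept in master_replid2, which shows that a master switch took place.
DBSIZE: the total key count; a mismatch indicates lost data.

Root cause:

The client was writing exactly when the master was cut off by a network partition.
The Sentinels demoted the old master to replica.
The client kept writing to the old master while the replication link was down.
After the network recovered, the isolated data on the old master was not copied to the new master.

Fix:

Make writes idempotent at the application layer: use SET key value NX or version numbers.
Configure min-replicas-to-write 1 + min-replicas-max-lag 10 so that the master refuses writes when no replica is keeping up.
Add retries: when a write fails, the client logs it or retries against the new master.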
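The retry idea can be sketched independently of any client library. In the example below, write is any callable that performs one idempotent write; the function name, the WriteRejected exception, and the backoff values are assumptions for illustration. A real client raises its own exception types, which would have to be added to the except clause.

```python
import time


class WriteRejected(Exception):
    """Hypothetical stand-in for server errors such as NOREPLICAS or READONLY."""


def write_with_retry(write, attempts=3, base_delay=0.2, sleep=time.sleep):
    """Call an idempotent write, retrying with exponential backoff on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return write()
        except (ConnectionError, TimeoutError, WriteRejected) as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.2 s, 0.4 s, ...
    raise last_error


# Demo with a fake write that fails twice before succeeding
calls = {"n": 0}


def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise WriteRejected("NOREPLICAS Not enough good replicas to write.")
    return "OK"


delays = []
print(write_with_retry(flaky_write, sleep=delays.append), delays)
# OK [0.2, 0.4]
```

Retrying is only safe because the write is idempotent; a non-idempotent command such as INCR could be applied twice if the first attempt actually reached the server before the connection failed.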

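Verification:

# Simulate the partition on the master host (multi-host setup): drop traffic to 6379 from everyone except the host itself
iptables -I INPUT -p tcp --dport 6379 ! -s <master-ip> -j DROP
# Once min-replicas-max-lag seconds pass without replica acknowledgements, a local write should fail (if min-replicas is configured)
redis-cli -h <master-ip> -p 6379 -a <PASSWORD> SET k v
# (error) NOREPLICAS Not enough good replicas to write.

# Restore
iptables -D INPUT -p tcp --dport 6379 ! -s <master-ip> -j DROP

Review:

Split-brain is an inherent risk of Redis asynchronous replication, not a bug.
Write-sensitive business (money, orders) must consider a strongly consistent solution.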

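10.4 Case 4: clients still connected to the old master after failover

Symptoms:

After the Sentinel switchover, client logs fill with write errors such as READONLY You can't write against a read only replica. (or MASTERDOWN Link with MASTER is down and replica-serve-stale-data is set to 'no'. when stale reads are disabled).
Some clients are still connected to the old master (which is now a replica).

Initial hypotheses:

The client does not handle READONLY / MASTERDOWN errors correctly.
Connections in the client's pool were not invalidated.

Checks:

# Client logs
grep -E "READONLY|MASTERDOWN|Connection refused|Connection reset" app.log

# Client configuration (Java Lettuce)
# Lettuce reconnects automatically by default, but some custom clients may not retry

Key indicators:

The time window in which READONLY / MASTERDOWN appears in the client logs.
connected_clients on the Redis side.

Root cause:

Lettuce, when connected through a Sentinel RedisURI, looks the master up through Sentinel again on reconnect, so it can follow a master switch.
Jedis uses JedisSentinelPool; after a switch the client has to call getResource() again to obtain a connection to the new master.
In-house clients are the most likely to hit this: a connection to the wrong master during failover is not detected automatically.

Fix:

Lettuce: connect with a Sentinel RedisURI (client.connect(sentinelUri)), or use MasterReplica for topology-aware read/write splitting.
Jedis: when JedisSentinelPool hands out a connection that throws, retry with a fresh getResource().
Application layer: catch connection errors and force a reconnect.

// Jedis example
JedisSentinelPool pool = new JedisSentinelPool("redis-ha", sentinels, config);
Jedis jedis = null;
try {
    jedis = pool.getResource();
    jedis.set("k", "v");
} catch (JedisConnectionException e) {
    // Hand the broken connection back (the pool discards it), then retry once
    if (jedis != null) {
        jedis.close();
        jedis = null;
    }
    jedis = pool.getResource();
    jedis.set("k", "v");
} finally {
    if (jedis != null) {
        jedis.close();
    }
}

Verification:

# Trigger a failover
redis-cli -p 26379 -a <PASSWORD> SENTINEL failover redis-ha

# Client writes should fail briefly and then recover

Review:

Clients must use a library that supports automatic reconnection through Sentinel.
In-house clients need a full "Sentinel subscribe + master switch listener + connection rebuild" flow.
Tests must include a forced failover case.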
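For an in-house client, the listener part of that flow boils down to reacting to the +switch-master event, whose payload is documented as master name, old IP, old port, new IP, new port. The sketch below covers only the parsing and the address update; the class and function names are assumptions for this example, and the actual pub/sub subscription and connection rebuild are left out.

```python
def parse_switch_master(payload):
    """Parse "<name> <old-ip> <old-port> <new-ip> <new-port>" into a dict."""
    parts = payload.split()
    if len(parts) != 5:
        raise ValueError(f"unexpected +switch-master payload: {payload!r}")
    name, old_ip, old_port, new_ip, new_port = parts
    return {
        "master": name,
        "old": (old_ip, int(old_port)),
        "new": (new_ip, int(new_port)),
    }


class MasterAddressCache:
    """Keeps the last known master address for one monitored master name."""

    def __init__(self, name, address):
        self.name = name
        self.address = address

    def on_message(self, channel, payload):
        """Return True if the cached address changed and connections must be rebuilt."""
        if channel != "+switch-master":
            return False
        event = parse_switch_master(payload)
        if event["master"] != self.name:
            return False
        self.address = event["new"]
        return True


cache = MasterAddressCache("redis-ha", ("127.0.0.1", 6379))
changed = cache.on_message("+switch-master", "redis-ha 127.0.0.1 6379 127.0.0.1 6380")
print(changed, cache.address)
# True ('127.0.0.1', 6380)
```

A pub/sub message can be missed while the subscription itself is reconnecting, so a robust client also re-queries SENTINEL get-master-addr-by-name whenever a connection is rebuilt.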

十一、常见坑合集

11.1 密码与认证

主从之间 requirepass 和 masterauth 必须一致，主从双向都要设。
哨兵要设 sentinel auth-pass，否则哨兵连不上 master 会被 SDOWN。
protected-mode yes 时，如果 bind 不是 127.0.0.1，必须有密码。
bind 0.0.0.0 + 没密码 = 公开 Redis，几小时内被挖矿。

11.2 复制与同步

repl-backlog-size 1mb 默认值在生产一定要改，1GB 起。
不要用 DEBUG RELOAD 模拟 master 重启（replid 不变，复制不会断），用 SHUTDOWN 才是真。
replica-serve-stale-data no 会在 slave 断连后所有读返回错误，不推荐。
写从节点不报错但 replica-read-only no 会导致写从节点的数据丢失，生产禁止关。

11.3 哨兵

quorum 3 在 3 哨兵集群里几乎永远凑不齐。
哨兵 bind 0.0.0.0 时，客户端会拿到所有 IP，可能连不上内网 IP。
哨兵自动写回的 sentinel.conf 不要手改，会被覆盖。
sentinel reset 是清空内部状态，会触发 slave 重新发现，慎用。
SENTINEL flushconfig 立刻把内存配置写回磁盘，但不会重置 sentinel 内部对节点的发现。

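11.4 Clients

Never hard-code a Sentinel IP / port as if it were the master; resolve the master with SENTINEL get-master-addr-by-name.
A plain connection pool does not reconnect to the new master by itself after a switch; the client SDK has to handle that.
Redis with TLS needs the CA / cert configured on the client side, and the same goes for Sentinel.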

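11.5 Monitoring

connected_slaves counts the replicas attached to a master; on a replica it is 0 (it has no replicas of its own).
A surge in rejected_connections means maxclients has been reached and the master is refusing connections.
A surge in evicted_keys means memory pressure; a surge in expired_keys alone usually just reflects TTLs coming due.
mem_fragmentation_ratio > 1.5 means heavy fragmentation; a restart or a version change may be needed.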
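The fragmentation threshold above can be checked mechanically. This is a minimal sketch that reads INFO memory output on stdin; the function name check_fragmentation and its exit codes are assumptions for illustration, and the sample input is hand-written.

```shell
#!/bin/bash
# check_fragmentation <threshold>: reads INFO memory output on stdin.
# Exit codes (illustrative): 0 = OK, 1 = above threshold, 3 = field missing.
check_fragmentation() {
    local threshold="$1"
    tr -d '\r' | awk -F: -v t="$threshold" '
        $1 == "mem_fragmentation_ratio" {
            found = 1
            ratio = $2
        }
        END {
            if (!found) { print "UNKNOWN"; exit 3 }
            if (ratio + 0 > t + 0) { print "WARN ratio=" ratio; exit 1 }
            print "OK ratio=" ratio
        }'
}

# Hand-written sample instead of a live server:
printf 'used_memory:1048576\r\nmem_fragmentation_ratio:1.12\r\n' | check_fragmentation 1.5
```

Against a live node the input would come from redis-cli INFO memory piped into the function.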

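12. Operations Checklist

12.1 Routine operations

12.2 Emergency triage

# 1. Find out who the current master is
redis-cli -p 26379 SENTINEL get-master-addr-by-name redis-ha

# 2. Force a failover (Sentinel picks the replica to promote; you cannot name one here)
redis-cli -p 26379 SENTINEL failover redis-ha

# 3. Shut down the broken master
redis-cli -p 6379 -a <PASSWORD> SHUTDOWN NOSAVE

# 4. Re-point a replica of the broken node to the new master (REPLICAOF is the Redis 5+ name for SLAVEOF)
redis-cli -p 6379 -a <PASSWORD> SLAVEOF <new-master-ip> 6379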

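Risk notes:

SHUTDOWN NOSAVE skips the final save and stops the process right away. Do not use it in production unless you fully understand the consequences.
SENTINEL failover interrupts traffic for a few seconds; notify the business side first.
Changing requirepass without telling the clients leads to 100% write failures.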

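12.3 Data recovery

If the RDB / AOF is corrupted:

# Back up the damaged file first
cp appendonly.aof appendonly.aof.broken

# Try to repair it with redis-check-aof
redis-check-aof --fix appendonly.aof

# Check an RDB with redis-check-rdb
redis-check-rdb dump.rdb

If the repair fails, the remaining option is to set appendonly no in redis.conf and restart the node (CONFIG SET cannot be used while the server is down), then promote a healthy replica to master.

Risk note: repairing an AOF modifies the file in place, so always cp a copy first as a safety net.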

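13. Notes on Upgrading to Redis 7.x

13.1 ACL changes

Redis 6.0 introduced ACLs as a stable feature, and Redis 7.0 extended them (for example selectors and separate read / write key permissions). When moving production from 5.x to 7.x:

requirepass still works, but migrating to ACLs is recommended.
The user model changes from a single default user to multiple users.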

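# Old style
requirepass foobar

# New style (recommended)
user default on ><STRONG-PASSWORD> ~* &* +@all
# on: enabled
# >: password
# ~*: all keys
# &*: all pub/sub channels
# +@all: all commands

Business users can be created separately:

user appuser on ><APP-PASSWORD> ~app:* &* +@read +@write -@admin -@dangerous
# appuser can only read and write keys matching app:*
# -@admin blocks CONFIG, SHUTDOWN and similar commands
# FLUSHALL belongs to @write and @dangerous, so -@dangerous is needed to block it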

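13.2 RESP3

Redis 6+ supports RESP3, but every connection starts in RESP2 and only switches after the client sends HELLO 3, so old clients keep working unchanged. Recent Lettuce and go-redis releases negotiate RESP3 by default; check your client's documentation, since some libraries (redis-py, for example) need it enabled explicitly.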

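13.3 Client output buffers

client-output-buffer-limit is set per client class. The values below raise the replica and pubsub limits above the defaults:

client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 1024mb 256mb 300
client-output-buffer-limit pubsub 64mb 16mb 60

All three classes (normal, replica, pubsub) were available long before Redis 6. Redis 7 additionally offers maxmemory-clients to cap the total memory used by client connections.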

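13.4 Persistence

Redis 7 bumps the RDB file format version, so older versions cannot load files written by 7.x. When upgrading, bring new nodes up on 7.x, and move old nodes to 6.x first and then to 7.x rather than skipping major versions.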

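14. Rollback Plans

14.1 Rolling back a new Sentinel deployment

Save the old Sentinel configuration: cp /opt/redis-sentinel/26379/conf/sentinel.conf sentinel.conf.bak.
Deploy the new Sentinels on different ports and cut traffic over only after they are verified.
Cut-over order: stop clients, stop the old Sentinels, start the new Sentinels, update the client configuration.
If the clients' MasterName stays the same, changing SentinelAddrs alone is enough to switch.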

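14.2 Rolling back a Redis version

Example: a bug shows up after upgrading to 7.0.5 and you need to go back to 7.0.4:

Stop the Redis instance to be rolled back (remove it from Sentinel monitoring first).
Back up the data: cp -r /opt/redis-cluster/6379/data /backup/.
Start it with the old binary.
Check that INFO replication looks normal.
Tell the Sentinels to rediscover the node.

Risk note: rolling back across major versions (incompatible AOF / RDB) loses data; take a full backup and pause business writes first.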

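15. Summary

Redis master-replica plus Sentinel is the high-availability setup that is "used the most, the most stable, and the easiest to get wrong". This article has walked through the principles, deployment, configuration, drills, case studies, monitoring, and risk points.

A pre-launch checklist:

[ ] repl-backlog-size is 1gb or more
[ ] min-replicas-to-write / min-replicas-max-lag are set
[ ] requirepass / masterauth match
[ ] 3 or 5 Sentinels, quorum set to N/2 + 1
[ ] Sentinel announce-ip is set (multi-NIC hosts)
[ ] Client SDK is Lettuce / go-redis v9 / redis-py or another library that reconnects through Sentinel
[ ] Monitoring covers redis_up, redis_connected_slaves, redis_replication_lag, used_memory
[ ] Alert thresholds are tuned to the business baseline
[ ] At least one manual failover drill done
[ ] At least one master restart / kill drill done
[ ] At least one split-brain drill done, and the write policy covers it
[ ] Client write errors are retried, and the business logic is idempotent
[ ] Firewall / security groups / SELinux allow the Redis and Sentinel ports
[ ] transparent_hugepage is disabled
[ ] vm.overcommit_memory is 1, or the risk has been assessed
[ ] slowlog-log-slower-than is lowered
[ ] Dangerous commands are hidden with rename-command
[ ] redis_exporter and Sentinel metrics are scraped by Prometheus
[ ] AOF is at least everysec, with 2x disk space reserved
[ ] SENTINEL failover is drilled regularly
[ ] Client connection pool is configured correctly; traffic interruption during failover < 5 seconds
[ ] ACLs are in place, with least-privilege business accounts
[ ] Upgrade / rollback plans are in the Runbook
[ ] On-call engineers know where the Sentinel and Redis passwords are stored
[ ] Redis process is pinned to CPUs (taskset -c 0,1 redis-server ...)
[ ] Key metrics are routed to the on-call channel
[ ] The business has a degradation plan for "Redis unavailable"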

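16. Performance Baseline and Capacity Planning

16.1 Baseline performance test

Before deploying, run a benchmark on your own hardware so you know what numbers to expect:

# redis-benchmark ships with Redis
redis-benchmark -h 127.0.0.1 -p 6379 -a <PASSWORD> -c 50 -n 100000 -t set,get,lpush,lpop
# -c: concurrent connections
# -n: total requests
# -t: commands to test

# SET only
redis-benchmark -h 127.0.0.1 -p 6379 -a <PASSWORD> -c 100 -n 1000000 -t set -P 16
# -P 16: pipeline 16 commands per request

# Typical measured results (vary widely between machines; payload size is set with -d):
# 1-byte SET: 80k-120k QPS per instance
# 1-byte GET: 100k-150k QPS per instance
# 100-byte SET: 50k-80k QPS per instance
# pipeline 16: 300k-500k QPS per instance

For production sizing, plan on 1/4 to 1/2 of the measured QPS to leave headroom. Because commands run on a single thread, CPU clock speed and NIC bandwidth are the main bottlenecks.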

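16.2 Memory capacity planning

# Find the largest keys
redis-cli -p 6379 -a <PASSWORD> --bigkeys
# Reports the biggest key found for each data type (by length / element count)
# Note: --bigkeys walks the whole keyspace with SCAN, which puts load on a large instance

# Memory used by a single key
redis-cli -p 6379 -a <PASSWORD> MEMORY USAGE <key>
# Returns the number of bytes that one key occupies

# Memory fragmentation
redis-cli -p 6379 -a <PASSWORD> INFO memory | grep -E "used_memory|mem_fragmentation"

Rules of thumb for memory planning:

Total data estimate: current data × 1.5 (one year of growth) × 1.5 (backups, logs, buffers) = Redis memory required.
Keep 30% free for repl-backlog, client-output-buffer, and keys waiting for lazyfree reclamation.
On a 64GB machine, keep Redis maxmemory at or below 50GB.
Keep a single key under 10KB (unless you are sure it will not block the main thread).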
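A worked example of the sizing rule, as a small sketch: it applies the two 1.5x factors and then checks the result against a maxmemory cap. The function names required_memory_gb and fits_on_host are hypothetical, and the 10 GB starting point is just sample input.

```shell
#!/bin/bash
# required_memory_gb <current_data_gb>
# Applies the growth factor (1.5) and the overhead factor (1.5).
required_memory_gb() {
    awk -v cur="$1" 'BEGIN { printf "%.1f\n", cur * 1.5 * 1.5 }'
}

# fits_on_host <required_gb> <maxmemory_cap_gb>
# Succeeds when the requirement stays within the cap.
fits_on_host() {
    awk -v need="$1" -v cap="$2" 'BEGIN { exit !(need <= cap) }'
}

need=$(required_memory_gb 10)
echo "10 GB today -> plan for ${need} GB"
if fits_on_host "$need" 50; then
    echo "fits under a 50 GB maxmemory cap"
fi
```

For 10 GB of current data the rule gives 22.5 GB, comfortably under the 50 GB cap suggested for a 64GB host.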

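16.3 Network bandwidth planning

Redis mostly consumes network:

One connection at 10K QPS × 100 bytes = about 1 MB/s of traffic.
1000 such clients = 1 GB/s.
A replication peak of 100 MB/s is normal (initial full sync).

Use at least a 10 Gbps NIC, ideally with a dedicated NIC or VLAN for master-replica traffic.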
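The bandwidth figures can be reproduced with a few lines of arithmetic. The sketch below uses decimal units (1 MB = 1,000,000 bytes); the function names are illustrative.

```shell
#!/bin/bash
# bandwidth_mb_per_s <qps_per_connection> <bytes_per_request> <connections>
bandwidth_mb_per_s() {
    echo $(( $1 * $2 * $3 / 1000000 ))
}

# mbytes_to_gbps <MB/s>: 8 bits per byte, decimal units
mbytes_to_gbps() {
    awk -v mb="$1" 'BEGIN { printf "%.1f\n", mb * 8 / 1000 }'
}

one=$(bandwidth_mb_per_s 10000 100 1)
all=$(bandwidth_mb_per_s 10000 100 1000)
echo "1 connection: ${one} MB/s; 1000 connections: ${all} MB/s = $(mbytes_to_gbps "$all") Gbps"
```

1000 MB/s is 8 Gbps, which is why a 10 Gbps NIC is the floor rather than a comfortable margin for that load.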

16.4 Slow query analysis

# Show recent slow queries
redis-cli -p 6379 -a <PASSWORD> SLOWLOG GET 20

# Number of entries currently in the slow log
redis-cli -p 6379 -a <PASSWORD> SLOWLOG LEN

# Reset
redis-cli -p 6379 -a <PASSWORD> SLOWLOG RESET

Format of each SLOWLOG entry:

1) 1) (integer) 1001            # unique ID
   2) (integer) 1700000000      # Unix timestamp
   3) (integer) 12000           # execution time (microseconds)
   4) 1) "GET"                  # command
      2) "hugekey"              # argument
   5) "10.20.0.11:54321"        # client address
   6) ""                        # client name

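Most common causes of slow queries:

KEYS * (forbidden in production).
LRANGE key 0 -1 on a large list.
HGETALL on a large hash.
A single key that is too large.
DEL on a big key.
EXPIRE on a nonexistent key is O(1) by itself; if it shows up in SLOWLOG, suspect a host-level stall (swap or transparent huge pages, for example) rather than the command.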

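16.5 Big key governance

Big keys (>10KB) are a hidden bomb in Redis:

# Find big keys with redis-cli --bigkeys
redis-cli -p 6379 -a <PASSWORD> --bigkeys -i 0.1
# -i 0.1: sleep 0.1 s per 100 keys scanned to reduce the impact on production

# Analyze the RDB with redis-rdb-tools
# After installing it
rdb -c memory /var/lib/redis/dump.rdb > memory.csv
# Sort by size (column 4)
sort -t, -k4 -nr memory.csv | head -20

Principles:

Split keys: break user:1000:profile into user:1000:profile:name and user:1000:profile:age.
Change the structure: split a hash of 1 million fields into 100 hashes of 10,000 fields each.
Use UNLINK instead of DEL (asynchronous deletion, Redis 4+).
Use SCAN instead of KEYS.
Avoid writing a big key in one shot at the business layer.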
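Splitting one huge hash into 100 smaller ones needs a stable rule that maps each field to a bucket. One possible approach, sketched here as an assumption rather than the article's own method, is a CRC of the field name modulo the bucket count. The function name bucket_key and the key names are hypothetical.

```shell
#!/bin/bash
# bucket_key <base_key> <field> <bucket_count>
# Maps a field to one of <bucket_count> smaller hashes using the POSIX
# cksum CRC of the field name, so the same field always lands in the same bucket.
bucket_key() {
    local base="$1" field="$2" buckets="$3"
    local crc
    crc=$(printf '%s' "$field" | cksum | cut -d' ' -f1)
    echo "${base}:$(( crc % buckets ))"
}

# Prints the bucketed hash name to use for this field:
bucket_key "user:profiles" "uid:1000" 100
```

The application would then run HSET / HGET against the returned key instead of the original big hash. Changing the bucket count later remaps fields, so it should be fixed up front.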

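17. Security Hardening Checklist

17.1 Authentication and authorization

# ACL rules are quoted so the shell does not treat > as a redirect or & as "run in background"

# Create a read-only account
redis-cli -p 6379 -a <ADMIN-PASSWORD> ACL SETUSER readonly on '><RO-PASSWORD>' '~*' '&*' +@read

# Create a read-write account
redis-cli -p 6379 -a <ADMIN-PASSWORD> ACL SETUSER readwrite on '><RW-PASSWORD>' '~app:*' '&*' +@read +@write -@admin -@dangerous

# Disable the default user
# Only after another admin user exists and replicas / Sentinel authenticate as a named user; otherwise you lock yourself out
redis-cli -p 6379 -a <ADMIN-PASSWORD> ACL SETUSER default off

# List all users
redis-cli -p 6379 -a <ADMIN-PASSWORD> ACL LIST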

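17.2 Network isolation

# Allow the internal network only
iptables -A INPUT -p tcp --dport 6379 ! -s 10.20.0.0/16 -j DROP

# Do not listen on public interfaces (redis.conf)
# bind 10.20.0.11

# Enable TLS (Redis 6+)
# Generate a self-signed certificate
openssl req -x509 -newkey rsa:4096 -keyout /etc/redis/redis.key -out /etc/redis/redis.crt -days 365 -nodes -subj "/CN=redis-prod"
cat /etc/redis/redis.key /etc/redis/redis.crt > /etc/redis/redis.pem

# redis.conf
tls-port 6380
tls-cert-file /etc/redis/redis.crt
tls-key-file /etc/redis/redis.key
# ca.crt is not created by the commands above; with the self-signed certificate, point this at redis.crt or supply your own CA
tls-ca-cert-file /etc/redis/ca.crt
tls-auth-clients no

Risk note: TLS lowers Redis throughput by roughly 10%~30%; enable it when needed.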

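17.3 Disabling commands

# Must disable
# Caveat: with CONFIG and SHUTDOWN renamed to "", the CONFIG SET / SHUTDOWN operations used elsewhere in this article stop working,
# and Sentinel needs CONFIG during failover unless sentinel rename-command is configured to match
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command CONFIG ""
rename-command DEBUG ""
rename-command SHUTDOWN ""
rename-command KEYS ""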

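17.4 Auditing

# Open-source Redis has no built-in audit log; auditing has to be assembled from ACLs plus monitoring
# ACL LOG (Redis 6+) records failed authentications and commands denied by ACL rules
# Option: attach an external tool such as redis-audit
# Or collect audit events with ELK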

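18. Key Features in Redis 7.x

18.1 Functions

# Load a library; a function is callable only after redis.register_function
redis-cli -p 6379 -x FUNCTION LOAD REPLACE <<'EOF'
#!lua name=mylib
redis.register_function('myfunc', function(keys, args)
  return redis.call('SET', keys[1], args[1])
end)
EOF

# Call
redis-cli -p 6379 FCALL myfunc 1 k v

# List
redis-cli -p 6379 FUNCTION LIST

# Delete
redis-cli -p 6379 FUNCTION DELETE mylib

Functions address the script-management pain of EVAL / EVALSHA: libraries are named, stored on the server, persisted, and replicated, so clients no longer have to reload scripts.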

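18.2 Multi-threaded I/O

Redis 6+ introduced I/O threads:

# redis.conf
io-threads 4
io-threads-do-reads yes

Note: command execution is still single-threaded. The extra threads only read from and write to the network sockets.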

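18.3 Client-side caching (Tracking)

# Enable tracking on a connection
redis-cli -p 6379 CLIENT TRACKING ON
# When a tracked key is modified, Redis pushes an invalidation message to the client
# Suited to "local cache" scenarios
# Tracking is per connection, so a one-shot redis-cli call only shows the syntax;
# a real client enables it on a long-lived connection (RESP3, or RESP2 with REDIRECT)

# Broadcast mode
redis-cli -p 6379 CLIENT TRACKING ON BCAST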

18.4 Redis 7.2 cluster sharding improvements

Redis 7.2 improves multi-shard failover in Cluster, but Cluster mode is out of scope for this article.

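19. Automation Scripts

19.1 Monitoring replication lag

#!/bin/bash
# check_redis_replication.sh
# Run against the master: compare master_repl_offset with each replica's offset
# (from the slaveN lines) and alert when the gap exceeds the threshold

REDIS_CLI="redis-cli -p 6379 -a <PASSWORD> --no-auth-warning"
THRESHOLD=1048576  # 1MB

INFO=$($REDIS_CLI INFO replication | tr -d '\r')
MASTER_OFFSET=$(echo "$INFO" | awk -F: '$1 == "master_repl_offset" {print $2}')

if [ -z "$MASTER_OFFSET" ]; then
    echo "[UNKNOWN] Cannot read master_repl_offset"
    exit 3
fi

STATUS=0
# slaveN lines look like: slave0:ip=10.20.0.12,port=6379,state=online,offset=1234,lag=0
while IFS= read -r line; do
    SLAVE_OFFSET=$(echo "$line" | sed -n 's/.*offset=\([0-9]*\).*/\1/p')
    LAG=$((MASTER_OFFSET - SLAVE_OFFSET))
    if [ "$LAG" -gt "$THRESHOLD" ]; then
        echo "[CRITICAL] ${line%%:*} replication lag: ${LAG} bytes"
        STATUS=2
    fi
done < <(echo "$INFO" | grep -E '^slave[0-9]+:')

[ "$STATUS" -eq 0 ] && echo "[OK] All replicas within ${THRESHOLD} bytes"
exit $STATUS

Risk note: hard-coding the password in a script is unsafe. In production, inject it through an environment variable (redis-cli reads REDISCLI_AUTH) or a file readable only by the owner (chmod 600).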

19.2 Manual failover drill

#!/bin/bash
# force_failover.sh
# Trigger a Sentinel failover and wait until the master address actually changes

SENTINEL_HOST="10.20.0.11"
SENTINEL_PORT="26379"
SENTINEL_PASS="<SENTINEL-PASSWORD>"
MASTER_NAME="redis-ha"

sentinel() {
    redis-cli -h "$SENTINEL_HOST" -p "$SENTINEL_PORT" -a "$SENTINEL_PASS" --no-auth-warning "$@"
}

# get-master-addr-by-name prints the IP and the port on two lines; join them as ip:port
current_master() {
    sentinel SENTINEL get-master-addr-by-name "$MASTER_NAME" | paste -sd: -
}

OLD_MASTER=$(current_master)
echo "[$(date)] Triggering failover for ${MASTER_NAME} (current master: ${OLD_MASTER})"

REPLY=$(sentinel SENTINEL failover "$MASTER_NAME")
if [ "$REPLY" != "OK" ]; then
    echo "[$(date)] Failover failed: ${REPLY}"
    exit 1
fi

# Wait for the switch: the address must differ from the old master,
# because get-master-addr-by-name is never empty for a monitored group
for i in {1..30}; do
    sleep 1
    NEW_MASTER=$(current_master)
    if [ -n "$NEW_MASTER" ] && [ "$NEW_MASTER" != "$OLD_MASTER" ]; then
        echo "[$(date)] New master: $NEW_MASTER"
        exit 0
    fi
done
echo "[$(date)] Failover timeout"
exit 2

19.3 Automated scale-out

#!/bin/bash
# add_slave.sh <NEW_SLAVE_IP> [SLAVE_PORT]
# Add a replica node on a new machine

NEW_SLAVE_IP=$1
SLAVE_PORT=${2:-6379}
MASTER_IP="10.20.0.11"
MASTER_PORT=6379

if [ -z "$NEW_SLAVE_IP" ]; then
    echo "Usage: $0 <NEW_SLAVE_IP> [SLAVE_PORT]"
    exit 1
fi

# 1. Start Redis on the new machine
#    redis.conf must set daemonize yes (otherwise this ssh session never returns)
#    and masterauth (otherwise the replica cannot authenticate to the master)
ssh "$NEW_SLAVE_IP" << EOF
redis-server /opt/redis/conf/redis.conf
EOF

# 2. Wait for Redis to come up
sleep 5

# 3. Make it a replica
redis-cli -h "$NEW_SLAVE_IP" -p "$SLAVE_PORT" -a <PASSWORD> SLAVEOF "$MASTER_IP" "$MASTER_PORT"

# 4. Verify
sleep 10
ROLE=$(redis-cli -h "$NEW_SLAVE_IP" -p "$SLAVE_PORT" -a <PASSWORD> INFO replication | grep role)
echo "New slave role: $ROLE"

Risk note: any script that runs batch operations needs a dry-run mode. The examples here do not have one; before using them in production, add one, for example a -n flag that only prints the commands it would run.

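20. Quick Answers to Common Operations Questions

20.1 `MISCONF Redis is configured to save RDB snapshots` error

# Cause: disk full / not writable / fork failed
# Handling:
# 1. Free up disk space
# 2. Temporarily disable the check
redis-cli -p 6379 -a <PASSWORD> CONFIG SET stop-writes-on-bgsave-error no
# 3. Re-enable it once the disk is fixed
redis-cli -p 6379 -a <PASSWORD> CONFIG SET stop-writes-on-bgsave-error yes

20.2 `LOADING Redis is loading the dataset in memory`

# Cause: Redis is loading the RDB / AOF at startup
# Clients cannot run commands during this phase
# Handling: wait, or have the client SDK retry with backoff until loading finishes

20.3 `OOM command not allowed when used memory > 'maxmemory'`

# Memory is full
# 1. Add memory
# 2. Raise maxmemory
redis-cli -p 6379 -a <PASSWORD> CONFIG SET maxmemory 4gb
# 3. Change the eviction policy
redis-cli -p 6379 -a <PASSWORD> CONFIG SET maxmemory-policy allkeys-lru
# 4. Find and clean up big keys
redis-cli -p 6379 -a <PASSWORD> --bigkeys

20.4 Client `READONLY` error

# Cause: the client is writing to a replica, either by misconfiguration
# or because the master it was connected to has been demoted by a failover
# Handling: resolve the current master through Sentinel and reconnect to it

20.5 `MASTERDOWN` error

# Cause: the client is talking to a replica whose link to its master is down
# while replica-serve-stale-data is set to no
# Handling: retry later or read from another replica; for writes, ask Sentinel for the current master

20.6 `CLUSTERDOWN` error

# Specific to Cluster mode; not seen in the Sentinel setup described here
# Operators should still be able to tell them apart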
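The error prefixes above map naturally onto client-side actions. The sketch below shows one way a wrapper script might dispatch on the first word of the error reply; the function name classify_redis_error and the action labels are made up for illustration.

```shell
#!/bin/bash
# classify_redis_error "<error reply>" -> suggested client action (labels are illustrative)
classify_redis_error() {
    case "$1" in
        READONLY*)    echo "re-resolve-master" ;;       # writing to a replica or demoted master
        MASTERDOWN*)  echo "retry-or-switch-replica" ;; # replica lost its master link
        LOADING*)     echo "retry-with-backoff" ;;      # dataset still loading
        OOM*)         echo "alert-memory" ;;            # maxmemory reached
        MISCONF*)     echo "alert-persistence" ;;       # RDB save is failing
        CLUSTERDOWN*) echo "not-sentinel-mode" ;;       # Cluster-only error
        *)            echo "unknown" ;;
    esac
}

classify_redis_error "READONLY You can't write against a read only replica."
```

A real client library would do the same dispatch on the error type it receives, then trigger a Sentinel lookup or a backoff as appropriate.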

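21. Relationship to Cluster Mode

21.1 Sentinel vs Cluster

Sentinel and Cluster cannot be mixed on the same instances. A single master-replica group goes with Sentinel; multiple groups (shards) go with Cluster. Common practice in production:

Data < 32GB: Sentinel (one master, several replicas).
Data > 32GB: Cluster.
Cross-datacenter: Sentinel (cross-datacenter Cluster is a separate topic).

21.2 Running Sentinel and Cluster side by side

Not recommended in production. If you must:

Run separate Sentinel / Cluster instances for different businesses.
Keep ports apart and do not share configuration files.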

22. Production Failure Drill Runbook

22.1 Drill plan

# Quarterly drill plan
- drill: single Redis node down
  frequency: monthly
  impact: transparent to the business (replicas keep serving reads; writes are briefly unavailable)
  verify: automatic Sentinel failover completes in < 30s

- drill: kill -9 on the master process
  frequency: monthly
  impact: same as above
  verify: clients reconnect in < 10s after the switch

- drill: master OOM
  frequency: quarterly
  impact: master rejects writes
  verify: alert fires in time, operators step in

- drill: master disk full
  frequency: quarterly
  impact: AOF writes fail, master may go down
  verify: monitoring alert + emergency expansion

- drill: one Sentinel down
  frequency: quarterly
  impact: the Sentinel group can still fail over
  verify: Sentinel count drops by 1, election still works

- drill: two Sentinels down
  frequency: every six months
  impact: Sentinel cannot declare ODOWN
  verify: losing 2 of 3 Sentinels is unacceptable (no majority, no failover)

- drill: network partition (split brain)
  frequency: every six months
  impact: brief data loss
  verify: min-replicas write rejection takes effect

- drill: Redis deployment migration
  frequency: one-off
  impact: traffic interruption < 5 minutes
  verify: full replay

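22.2 Drill step template

# 1. Announce in advance
# Send a notice 24 hours ahead: date, time, impact scope, rollback plan

# 2. Prepare the rollback
cp -r /opt/redis-cluster/6379/conf /backup/conf-$(date +%F)
cp -r /opt/redis-cluster/6379/data /backup/data-$(date +%F)

# 3. Run the drill
# 4. Watch the metrics
# 5. Verify the business
# 6. Record the process
# 7. Review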

23. On-Call Quick Reference

23.1 Five-step emergency triage

The business reports Redis is unavailable:

redis-cli -p 26379 SENTINEL get-master-addr-by-name redis-ha
redis-cli -p 6379 PING
redis-cli -p 6379 INFO replication

Check the master:

INFO server (uptime, redis_version)
INFO clients (connected_clients, blocked_clients)
INFO memory (used_memory, mem_fragmentation_ratio)
INFO stats (instantaneous_ops_per_sec, total_connections_received)

Check the replicas:

master_link_status / master_last_io_seconds_ago on every replica

Check Sentinel:

SENTINEL masters / SENTINEL replicas redis-ha / SENTINEL sentinels redis-ha

Check the logs:

/var/log/redis/redis.log
/var/log/redis/sentinel.log
journalctl -u redis -u redis-sentinel --since "5 min ago"
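The replica check in the triage steps can be scripted so that on-call does not have to eyeball INFO output. This is a minimal sketch reading a replica's INFO replication on stdin; the function name replica_link_ok and the 10-second idle limit are assumptions, and the sample input is hand-written.

```shell
#!/bin/bash
# replica_link_ok <max_idle_seconds>: reads a replica's INFO replication on stdin.
# Succeeds when the link is up and the last master I/O is recent enough.
replica_link_ok() {
    local max_idle="$1"
    tr -d '\r' | awk -F: -v max="$max_idle" '
        $1 == "master_link_status"         { status = $2 }
        $1 == "master_last_io_seconds_ago" { idle = $2 }
        END {
            if (status == "up" && idle != "" && idle + 0 <= max + 0) {
                print "OK"
                exit 0
            }
            print "BAD status=" status " idle=" idle
            exit 1
        }'
}

# Hand-written sample instead of a live replica:
printf 'role:slave\r\nmaster_link_status:up\r\nmaster_last_io_seconds_ago:1\r\n' | replica_link_ok 10
```

Against a live replica the input would come from redis-cli INFO replication, run once per replica address.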

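23.2 Repair priorities

Restore the business first: manual failover / bring up a new master.
Then restore monitoring: confirm the dashboards have data.
Finally review: write the incident report and update the Runbook.

23.3 When to wake someone up at night

Page someone only when:

Redis has been unavailable for more than 5 minutes.
Data is inconsistent (DBSIZE does not match / the business reports missing keys).
Client writes are failing on a large scale.
Sentinel cannot reach a decision.

Everything else can wait until daytime.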

24. Integration with Other Components

24.1 Redis + ELK

Ship Redis logs to ELK:

# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/redis/redis.log
      - /var/log/redis/sentinel.log
    fields:
      type: redis
    fields_under_root: true

output.logstash:
  hosts: ["logstash:5044"]

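24.2 Redis + Prometheus + Grafana

redis_exporter + Prometheus was covered earlier; alert rules are pushed through Alertmanager.

Recommended Grafana dashboards:

Redis Dashboard for Prometheus (ID 11835)
Node Exporter Full (ID 1860) for host-level metrics

24.3 Redis + application layer

Things to watch at the application layer:

Tell hot keys and big keys apart.
Tell loss-tolerant data from data that must not be lost (cache vs persistence).
Write strategy: cache first and then DB, or the other way round.
Retry strategy: retry on failure and fall back to the DB.
Metrics: P99 / P999 latency of business reads and writes against Redis.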

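24.4 Redis + Kubernetes

Common building blocks for running Redis with Sentinel on K8s:

StatefulSet + Headless Service (recommended).
One PVC per Redis Pod.
Sentinel deployed separately.
Helm chart: helm repo add bitnami https://charts.bitnami.com/bitnami, then helm install redis bitnami/redis.

Extra points for K8s deployments:

Node affinity / anti-affinity: master and replicas must not share a node.
Persistent volumes need guaranteed IOPS.
A PodDisruptionBudget limits how many Redis Pods a rolling update or node drain can take down at once, so the master and its replicas are not evicted together.
Use an init container to disable transparent_hugepage.
Readiness probe: redis-cli PING.
Liveness probe: redis-cli INFO replication (does not depend on the node's role).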

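25. One Last Pass Before Launch

Go through this list once more:

[ ] redis.conf has been reviewed.
[ ] sentinel.conf has been reviewed.
[ ] Firewall rules reviewed.
[ ] Monitoring dashboards reviewed.
[ ] Alert rules reviewed.
[ ] At least 3 failure drills completed.
[ ] Client SDK versions unified.
[ ] Business idempotency logic reviewed.
[ ] Backup and restore drill completed.
[ ] Upgrade / rollback Runbook complete.
[ ] On-call staff are familiar with the setup.
[ ] An SRE shadows the first incident.

That is the end of the article. Redis master-replica plus Sentinel looks simple, but when production really breaks the troubleshooting chain is long. Drill enough, wire up all the monitoring, and get the client right: those three things decide whether you sleep well once it is in production.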


Reposted from: 马哥Linux运维, 《Redis 主从复制、哨兵模式搭建与自动故障切换》