Ceph-Placement Group States

2017-05-31 19:02:33
A Brief Look at PG States

PG stands for Placement Group. Let's take a quick look at the states a PG can be in:

  • Creating: Ceph is still creating the placement group.
  • Scrubbing: Ceph is checking the placement group for inconsistencies.
  • Degraded: Ceph has not replicated some objects in the placement group the correct number of times yet.
  • Inconsistent: Ceph detects inconsistencies in one or more replicas of an object in the placement group (e.g. objects are the wrong size, objects are missing from one replica after recovery finished, etc.).
  • Peering: The placement group is undergoing the peering process.
  • Repair: Ceph is checking the placement group and repairing any inconsistencies it finds (if possible).
  • Recovering: Ceph is migrating/synchronizing objects and their replicas.
  • Backfill: Ceph is scanning and synchronizing the entire contents of a placement group instead of inferring what contents need to be synchronized from the logs of recent operations. Backfill is a special case of recovery.
  • Wait-backfill: The placement group is waiting in line to start backfill.
  • Backfill-toofull: A backfill operation is waiting because the destination OSD is over its full ratio.
  • Incomplete: Ceph detects that a placement group is missing information about writes that may have occurred, or does not have any healthy copies. If you see this state, try to start any failed OSDs that may contain the needed information. In the case of an erasure-coded pool, temporarily reducing min_size may allow recovery.
  • Remapped: The placement group is temporarily mapped to a different set of OSDs from what CRUSH specified.
  • Undersized: The placement group has fewer copies than the configured pool replication level.
  • Peered: The placement group has peered, but cannot serve client IO due to not having enough copies to reach the pool’s configured min_size parameter. Recovery may occur in this state, so the pg may heal up to min_size eventually.
    For more details, see the official PLACEMENT GROUP STATES documentation (linked in the references at the end of this post).
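
To see which PGs on your own cluster are currently in which of these states, a few standard ceph CLI commands are handy. This is just a minimal sketch, assuming a working admin node with the ceph CLI configured; the PG id 0.44 is the example used later in this post.

ceph pg stat                  # one-line summary of PG counts per state
ceph health detail            # lists the PGs behind a HEALTH_WARN / HEALTH_ERR
ceph pg dump_stuck unclean    # PGs stuck in a non active+clean state
ceph pg 0.44 query            # detailed peering/recovery info for a single PG
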
How to Understand PGs

Let's take a look at the state of the cluster we built earlier:

[root@admin ~]# ceph osd tree
ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 17.94685 root default
-2  5.98228     host node1
 0  1.99409         osd.0       up  1.00000          1.00000
 1  1.99409         osd.1       up  1.00000          1.00000
 2  1.99409         osd.2       up  1.00000          1.00000
-3  5.98228     host node2
 3  1.99409         osd.3       up  1.00000          1.00000
 4  1.99409         osd.4       up  1.00000          1.00000
 5  1.99409         osd.5       up  1.00000          1.00000
-4  5.98228     host node3
 6  1.99409         osd.6       up  1.00000          1.00000
 7  1.99409         osd.7       up  1.00000          1.00000
 8  1.99409         osd.8       up  1.00000          1.00000
[root@admin ~]#

Every pool has a numeric ID. The rbd pool created by the system by default has ID 0, so every PG in the rbd pool is named with the prefix 0.. Let's look at the contents of the current directory under osd.0:

[root@node1 ~]# ls /var/lib/ceph/osd/ceph-0/current/
0.0_head 0.16_TEMP 0.1c_head 0.23_TEMP 0.2b_head 0.38_TEMP 0.46_head 0.4d_TEMP 0.50_head 0.55_TEMP 0.64_head 0.6e_TEMP 0.72_head 0.79_TEMP 0.8_head
0.0_TEMP 0.17_head 0.1c_TEMP 0.24_head 0.2b_TEMP 0.42_head 0.46_TEMP 0.4e_head 0.50_TEMP 0.59_head 0.64_TEMP 0.6_head 0.72_TEMP 0.7b_head 0.8_TEMP
0.10_head 0.17_TEMP 0.1d_head 0.24_TEMP 0.2c_head 0.42_TEMP 0.48_head 0.4e_TEMP 0.53_head 0.59_TEMP 0.65_head 0.6_TEMP 0.75_head 0.7b_TEMP commit_op_seq
0.10_TEMP 0.18_head 0.1d_TEMP 0.25_head 0.2c_TEMP 0.43_head 0.48_TEMP 0.4f_head 0.53_TEMP 0.5c_head 0.65_TEMP 0.70_head 0.75_TEMP 0.7d_head meta
0.13_head 0.18_TEMP 0.20_head 0.25_TEMP 0.2_head 0.43_TEMP 0.49_head 0.4f_TEMP 0.54_head 0.5c_TEMP 0.6d_head 0.70_TEMP 0.77_head 0.7d_TEMP nosnap
0.13_TEMP 0.19_head 0.20_TEMP 0.2a_head 0.2_TEMP 0.44_head 0.49_TEMP 0.4_head 0.54_TEMP 0.63_head 0.6d_TEMP 0.71_head 0.77_TEMP 0.7e_head omap
0.16_head 0.19_TEMP 0.23_head 0.2a_TEMP 0.38_head 0.44_TEMP 0.4d_head 0.4_TEMP 0.55_head 0.63_TEMP 0.6e_head 0.71_TEMP 0.79_head 0.7e_TEMP

Each OSD's current directory holds a subset of the cluster's PGs, and the rbd pool's PGs appear as directories named 0.xxx_head.
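
To double-check which numeric ID each pool has (and therefore which prefix its PGs use), you can ask the cluster directly. The two commands below are a hedged illustration; the exact output depends on your cluster:

ceph osd lspools              # lists pool ids and names, e.g. the default "0 rbd"
ceph osd pool ls detail       # same information plus replica size, pg_num, etc.
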

Now let's write a file (char.txt) into the cluster via rados, storing it under the object name char, and use the command below to see where that object will actually be placed:

[root@admin ~]# ceph osd map rbd char
osdmap e89 pool 'rbd' (0) object 'char' -> pg 0.98165844 (0.44) -> up ([0,3,8], p0) acting ([0,3,8], p0)

As you can see, the object lands in PG 0.44, and this PG lives on osd.0, osd.3 and osd.8. There are three OSDs because the cluster's replica count is 3; a 0.44_head directory can be found under current on each of osd.0, osd.3 and osd.8, and the three replicas of the same file (char.txt here) are stored in these three same-named PGs.
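
If you only care about the PG-to-OSD mapping (and not the object hash), ceph pg map reports the same up and acting sets. A small hedged example:

ceph pg map 0.44              # prints the osdmap epoch plus the up and acting OSD sets for PG 0.44
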

Run the command to store char.txt in the cluster, then inspect the PG directory on each OSD:

[root@admin ~]# rados -p rbd put char char.txt 

[root@node1 ~]# ll /var/lib/ceph/osd/ceph-0/current/0.44_head/
total 4
-rw-r--r-- 1 ceph ceph 4 May 31 18:20 char__head_98165844__0
-rw-r--r-- 1 ceph ceph 0 May 31 10:41 __head_00000044__0

[root@node2 ~]# ll /var/lib/ceph/osd/ceph-3/current/0.44_head/
total 4
-rw-r--r-- 1 ceph ceph 4 May 31 18:20 char__head_98165844__0
-rw-r--r-- 1 ceph ceph 0 May 31 10:41 __head_00000044__0

[root@node3 ~]# ll /var/lib/ceph/osd/ceph-8/current/0.44_head/
total 4
-rw-r--r-- 1 ceph ceph 4 May 31 18:20 char__head_98165844__0
-rw-r--r-- 1 ceph ceph 0 May 31 10:41 __head_00000044__0

As you can see, these three OSDs each hold one of the three replicas of the file. This is the foundation of Ceph's multi-replica design: a piece of data is stored as multiple copies distributed across OSDs according to a set of placement rules, and all of those copies live under same-named PGs, i.e. under same-named directories. With that, we have a concrete, hands-on picture of what a PG is.
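
Going the other way, you can also ask which PGs a given OSD or pool holds, which is a quick way to confirm that same-named PG directories on different OSDs really belong to the same logical PG. A hedged sketch (these subcommands exist in Jewel and later releases):

ceph pg ls-by-osd osd.0       # all PGs that have a replica on osd.0
ceph pg ls-by-primary osd.0   # only the PGs for which osd.0 is the primary
ceph pg ls-by-pool rbd        # all PGs belonging to the rbd pool
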

PG troubleshooting
  • Degraded
    As we saw above, each PG here has three replicas stored on different OSDs, and under normal (fault-free) conditions the PG is in the active+clean (a+c) state. So what state does PG 0.44 enter if osd.3, one of the OSDs holding it, goes down? Let's try it:

    [neo@node2 ~]$ sudo systemctl stop  ceph-osd@3
    [neo@node2 ~]$ sudo ceph pg dump_stuck |egrep ^0.44
    ok
    0.44 active+undersized+degraded [0,8] 0 [0,8] 0

    Here we went to node2, manually stopped osd.3, and then checked the state of PG 0.44. Its state is now active+undersized+degraded: when one of the OSDs hosting a PG goes down, the PG enters the undersized+degraded state, and the [0,8] that follows means two surviving copies of 0.44 remain on osd.0 and osd.8. Can we still read the file we just wrote? Yes, it reads back normally:

    [root@admin test]# rados  -p rbd get char char.txt
    [root@admin test]# cat char.txt
    123

    Degraded simply means that after a failure such as an OSD going down, Ceph marks every PG on that OSD as degraded; the cluster can still read and write data normally in this state.
    As for undersized, my understanding is that the number of live copies of PG 0.44 is now 2, which is smaller than the replica count of 3, so the PG gets this label. It is not a serious problem either.

  • Remapped
    This shows Ceph's strong self-healing ability. After an OSD has been down for 5 minutes (the 300 s default), it is marked out, which you can think of as Ceph deciding the OSD no longer belongs to the cluster. PG 0.44 is then mapped to other OSDs, again according to the placement rules. After the remapping, PG 0.44 appears on a replacement OSD, but at first only the directory is created there; the missing data still has to be backfilled from the surviving copies to the new OSD, and a PG that is being backfilled is marked backfilling.
    So when a PG is in the remapped+backfilling state, you can think of it as being in the middle of a self-healing process in which it copies its own data back to full redundancy (a few related commands are sketched after this list).
  • Recover
    Let's run a test first. Bring all OSDs back up so the cluster is healthy, then go to PG 0.44 on osd.3, delete the file char__head_98165844__0, and tell Ceph to scrub this PG:

    [root@node2 0.44_head]# pwd && rm -f char__head_98165844__0
    /var/lib/ceph/osd/ceph-3/current/0.44_head
    [root@node2 0.44_head]# ceph pg scrub 0.44
    instructing pg 0.44 on osd.0 to scrub
    [root@node2 0.44_head]# ceph pg dump |egrep ^0.44
    dumped all in format plain
    0.44 1 0 0 0 0 4 1 1 active+clean+inconsistent 2017-06-01 11:54:54.980396 89'1 137:181 [0,3,8] 0 [0,3,8] 0 89'1 2017-06-01 11:54:54.980279 0'0 2017-05-27 11:44:16.205723

    As you can see, 0.44 now carries an additional inconsistent state. As the name implies, Ceph has detected that this PG's replicas are inconsistent, which makes the meaning of the state easy to understand.
    To repair the lost file, simply run ceph pg repair 0.44 and Ceph will copy the missing file back from another replica. This is another example of Ceph healing itself (a scrub-and-repair sketch follows this list).
    Now imagine another scenario: while osd.3 is down, we write to the char object. Since two replicas are still available in the cluster, the write succeeds, but the data on osd.3 does not get updated. A while later osd.3 comes back online, and Ceph notices that the char file on osd.3 is stale, so it restores osd.3's data from the other OSDs until it is up to date. While this restore is in progress, the PG is marked recovering.
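
As mentioned in the Remapped item above, the 5-minute grace period before a down OSD is marked out comes from the mon_osd_down_out_interval option (default 300 s). The commands below are a hedged sketch of how you might inspect or temporarily override that behavior during planned maintenance; replace <id> with your monitor's id and run the admin-socket command on the monitor node itself:

ceph daemon mon.<id> config get mon_osd_down_out_interval   # current grace period in seconds
ceph osd set noout                                          # stop OSDs from being marked out during maintenance
ceph osd unset noout                                        # clear the flag afterwards
ceph -w                                                     # watch PGs move through remapped/backfilling live
ceph pg dump pgs_brief | egrep 'remapped|backfill'          # list only the PGs that are currently moving
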
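As a follow-up to the Recover item, here is a hedged sketch of the scrub-and-repair cycle used in the test above. rados list-inconsistent-obj needs a (deep-)scrub to have run first and is only available on Jewel and newer releases:

ceph pg deep-scrub 0.44                                 # force a deep scrub of the PG
rados list-inconsistent-obj 0.44 --format=json-pretty   # show which objects/replicas are inconsistent
ceph pg repair 0.44                                     # copy good replicas over the bad one
ceph pg dump | egrep ^0.44                              # confirm the PG returns to active+clean
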

References
大话Ceph–PG那点事儿
PLACEMENT GROUP STATES
Ceph 实战
解析 Ceph : OSD , OSDMap 和 PG, PGMap
