译文 | Apache Pulsar 集群如何确保消息不丢

原文作者 Jack Vanlightly,翻译 Sijia@StreamNative,如需转载,请后台留言。
英文链接: https://jack-vanlightly.com/b...
关于 Apache Pulsar
Apache Pulsar 是 Apache 软件基金会顶级项目,是下一代云原生分布式消息流平台,集消息、存储、轻量化函数式计算为一体,采用计算与存储分离架构设计,支持多租户、持久化存储、多机房跨区域数据复制,具有强一致性、高吞吐、低延时及高可扩展性等流数据存储特性。
GitHub 地址: http://github.com/apache/pulsar/

阅读本文需要大约 15 分钟。

消息系统

Apache Pulsar 的工作原理介绍了 Pulsar 的协议和存储模型,主要强调了(1)Pulsar 计算与存储分离;(2)Pulsar 将 topic 分成 ledger 与分片,无需重平衡,即可向 bookie 自动写入新数据的特性。本文将会使用 Blockade 工具关闭节点、降低网速,并丢失网络数据包进行测试。在消息如何在 RabbitMQ 集群中丢失消息如何在 Kafka 集群中丢失文中,我使用了相同的自动化测试工具,本文的测试不仅包含数据丢失,还包括消息排序和消息重复。

我已经将测试使用的代码上传到 GitHub,你可以在 ChaosTesingCode 仓库中的 Pulsar 文件夹里查看。

在每个测试场景中,我们都创建了新的 blockade 集群,并且进行了相同的配置:

  • Apache Pulsar broker 数量
  • Apache BookKeeper 节点(Bookie)数量
  • Ensemble size (E)
  • Write quorum size (Qw)
  • Ack quorum size (Qa)

客户端通过 proxy 与 broker 进行通信。

译文 | Apache Pulsar 集群如何确保消息不丢_第1张图片

在每个测试中,我们都发送消息并添加干扰操作,比如关闭 topic 所属的 Pulsar broker,关闭当前 ledger ensemble 中的 1 个 bookie,从 ZooKeeper 中隔离节点等。

消息发送后,启动 reader 读取并计算消息条数,确保接收到了所有 ack 的消息,并且确保消息是按照正确顺序被读取。

如果你想看测试方法,请阅读全文;如果你只想看测试结果,建议直接看结语部分。

测试剖析

在测试中,producer 以最快速度发送消息,我设置的发送消息速率为 2 万 - 3 万条/秒。有些测试需要较长的时间,所以我增加了发送消息的总数,但不降低 producer 的发送速率。

禁用消息去重。

通过调用 pulsar-test.py python 脚本进行测试。测试需要以下参数:

  • 测试类型(no-fail、kill-broker、kill-bookie、kill-bookies[n]、isolate-broker-from-zk、isolate-bookie-from-zk、custom-isolation[partition|partition])。测试类型会在测试部分进行解释。
  • 测试名称
  • 测试运行次数
  • 发送消息数
  • 开始干扰操作时消息的位置。例如,50000 指在接收到第 5 万次 ack 时,干扰操作开始。启用干扰操作通常需要一段时间,一般会在启用后的第几十万条消息处开始。
  • Ledger 配置(E、Qw、Qa),格式为 E-Qw-Qa,如 2-2-1
  • 节点计数,格式为 brokers-bookies,如 3-3
  • 是否启用数据去重(是/否)

blockade.yml 文件都存储在 cluster/blockade-files 目录中。每个文件中的 broker 和 bookie 数量各不相同。在调用 blockade 前,需将相应文件复制到 cluster 目录中。

调用 pulsar-test.py 执行以下操作:

  1. 如果 blockade 集群正在运行,使用 blockade ` destroy `命令终止集群。
  2. 复制相应的 blockade.yml
  3. blockade up
  4. 发布消息。消息内容为递增的整数,用于检查消息顺序。
  5. 开始干扰操作
  6. 消息发布一结束,立刻开始读取数据
  7. 分析并记录结果
  8. 重复“测试运行次数“

在终端可以查看所有输出。也可以在 automated/test-output 目录下的 [test-name]-output.txt 文件中查看输出。可以在 [test-name]-duplicates.txt 和 [test-name]-out-of-order.txt 文件中分别查看重复消息和乱序消息。

运行结果的格式如下:


Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 2000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0

追踪消息:

  • 消息发出后,将消息对应的整数作为键添加到 messages_sent 词典,并添加空列表作为值。
  • Ack 消息接收成功后,添加消息对应的整数值到 messages_pos_acked 集合。
  • Ack 消息接收失败后,添加消息所对应的整数值到 messages_neg_acked 集合。

读取消息时,在 messages_sent 词典中查找,并添加消息 ID 到相应列表。通过这种方式即可检测到丢失的和重复的消息。

  • 丢失已 ack 的消息 = messages_sent 词典与 messages_pos_acked 集合中重合的所有值为空列表的键
  • 接收到未 ack 的消息 = messages_sent 词典中值为空列表,但不在 messages_pos_acked 集合中的所有
  • 重复消息 = messages_sent 词典中,值列表包含多个消息 ID 的键
  • 乱序消息 = 整数值小于上一条消息,词典中值为非空的消息 ID 列表(即未被消费过)

由于 producer 一次只允许处理 1 万条消息,当接收的消息比发送的消息滞后 1 万时,producer 会暂停并等待 ack 追赶。在测试中,我们也的确见到过 1 万。

Pulsar 客户端在消息发送失败时自动重试,直到发送超时。因此,当消息成功写入 BookKeeper,但 Pulsar broker 在向客户端发送 ack 前出现故障时,会出现消息重复。测试中默认关闭消息去重。

默认 E、Qw、Qa

测试中出现了很奇怪的现象,在关闭 bookie 后,reader 和 consumer 不能进行消费。后来我们发现这和 E、Qw、Qa 的默认值有关。游标存储在由上述默认值创建的 ledger 中,如果 bookie 数量不够,则无法创建订阅。

在 blockade.yml 文件中为所有 bookie 的 environment 进行设置:

environment: {
   "clusterName": "cluster-1",
   "zookeeperServers": "zk1:2181",
   "managedLedgerDefaultEnsembleSize": 4,
   "managedLedgerDefaultWriteQuorum": 3,
   "managedLedgerDefaultAckQuorum": 2
}

要保证集群正常工作,以上配置至少需要 4 个 bookie。因此,在使用少量 bookie 测试时,需要保证 bookie 数量与使用场景最少需要 bookie 的数量相同(还要考虑到 bookie 可能会出故障)。

在所有的测试场景中,E、Qw、Qa 默认值为 2、2、2,bookie 数量不少于 3。这样的设置保证了在关闭 bookie 时,不会丢失游标数据,并且可以创建新的 reader 和 consumer。

测试场景 0 - 对比实验组

在第一个测试场景中,发送 200 万条消息,无干扰,检查消息是否全部读取,并且读取顺序正确,消息不重复。对于这一场景,我们测试了 5 次。

最后一次测试的终端输出如下。

$ python pulsar-test.py no-fail no-fail-test1 5 2000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #2 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #3 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #4 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #5 on topic no-fail-test1_5  ------------
2018-10-19 16:30:44.795 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.208.9:6650
2018-10-19 16:30:44.799 INFO  ClientConnection:285 | [192.168.208.1:34476 -> 192.168.208.9:6650] Connected to broker
2018-10-19 16:30:45.965 INFO  BatchMessageContainer:41 | { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://vanlightly/cluster-1/ns1/no-fail-test1_5] [producerName_ = ] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 0] [averageBatchSize = 0]} BatchMessageContainer constructed
2018-10-19 16:30:45.965 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, ] Getting connection from pool
2018-10-19 16:30:47.313 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-19 16:30:47.313 INFO  ClientConnection:287 | [192.168.208.1:34482 -> 192.168.208.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-19 16:30:47.894 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, ] Created producer on broker [192.168.208.1:34482 -> 192.168.208.9:6650] 
Send count: 56645 Ack count: 50000 Pos: 50000 Neg: 0
No chaos action to perform
Send count: 102275 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 151578 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 207610 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 255241 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 309300 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 354595 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 407730 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 459060 Ack count: 450000 Pos: 450000 Neg: 0
Send count: 503590 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 557272 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 601064 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 653045 Ack count: 650000 Pos: 650000 Neg: 0
Send count: 701051 Ack count: 700000 Pos: 700000 Neg: 0
Send count: 752087 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 800455 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 851194 Ack count: 850000 Pos: 850000 Neg: 0
Send count: 904089 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 958704 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1007863 Ack count: 1000000 Pos: 1000000 Neg: 0
Send count: 1053645 Ack count: 1050000 Pos: 1050000 Neg: 0
Send count: 1102725 Ack count: 1100000 Pos: 1100000 Neg: 0
Send count: 1150921 Ack count: 1150000 Pos: 1150000 Neg: 0
Send count: 1201317 Ack count: 1200000 Pos: 1200000 Neg: 0
Send count: 1254143 Ack count: 1250000 Pos: 1250000 Neg: 0
Send count: 1303459 Ack count: 1300000 Pos: 1300000 Neg: 0
Send count: 1358449 Ack count: 1350000 Pos: 1350000 Neg: 0
Send count: 1408474 Ack count: 1400000 Pos: 1400000 Neg: 0
Send count: 1456386 Ack count: 1450000 Pos: 1450000 Neg: 0
Send count: 1505650 Ack count: 1500000 Pos: 1500000 Neg: 0
Send count: 1550343 Ack count: 1550000 Pos: 1550000 Neg: 0
Send count: 1601060 Ack count: 1600000 Pos: 1600000 Neg: 0
Send count: 1652008 Ack count: 1650000 Pos: 1650000 Neg: 0
Send count: 1701559 Ack count: 1700000 Pos: 1700000 Neg: 0
Send count: 1751662 Ack count: 1750000 Pos: 1750000 Neg: 0
Send count: 1801550 Ack count: 1800000 Pos: 1800000 Neg: 0
Send count: 1852799 Ack count: 1850000 Pos: 1850000 Neg: 0
Send count: 1901281 Ack count: 1900000 Pos: 1900000 Neg: 0
Send count: 1951079 Ack count: 1950000 Pos: 1950000 Neg: 0
Send count: 2000000 Ack count: 2000000 Pos: 2000000 Neg: 0
2018-10-19 16:31:51.177 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, cluster-1-2-0] Closed producer
2018-10-19 16:31:51.346 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.208.9:6650
2018-10-19 16:31:51.348 INFO  ClientConnection:285 | [192.168.208.1:34574 -> 192.168.208.9:6650] Connected to broker
2018-10-19 16:31:51.357 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, reader-5a64ed, 0] Getting connection from pool
2018-10-19 16:31:51.645 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-19 16:31:51.645 INFO  ClientConnection:287 | [192.168.208.1:34578 -> 192.168.208.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-19 16:31:51.716 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, reader-5a64ed, 0] Created consumer on broker [192.168.208.1:34578 -> 192.168.208.9:6650] 
Last confirmed entry: [0, 2557]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 58]
Received: 100000 Curr Entry: [0, 115]
Received: 150000 Curr Entry: [0, 177]
Received: 200000 Curr Entry: [0, 236]
Received: 250000 Curr Entry: [0, 300]
Received: 300000 Curr Entry: [0, 386]
Received: 350000 Curr Entry: [0, 446]
Received: 400000 Curr Entry: [0, 506]
Received: 450000 Curr Entry: [0, 569]
Received: 500000 Curr Entry: [0, 629]
Received: 550000 Curr Entry: [0, 695]
Received: 600000 Curr Entry: [0, 754]
Received: 650000 Curr Entry: [0, 816]
Received: 700000 Curr Entry: [0, 892]
Received: 750000 Curr Entry: [0, 955]
Received: 800000 Curr Entry: [0, 1021]
Received: 850000 Curr Entry: [0, 1086]
Received: 900000 Curr Entry: [0, 1152]
Received: 950000 Curr Entry: [0, 1211]
Received: 1000000 Curr Entry: [0, 1268]
Received: 1050000 Curr Entry: [0, 1339]
Received: 1100000 Curr Entry: [0, 1410]
Received: 1150000 Curr Entry: [0, 1472]
Received: 1200000 Curr Entry: [0, 1534]
Received: 1250000 Curr Entry: [0, 1595]
Received: 1300000 Curr Entry: [0, 1654]
Received: 1350000 Curr Entry: [0, 1713]
Received: 1400000 Curr Entry: [0, 1779]
Received: 1450000 Curr Entry: [0, 1836]
Received: 1500000 Curr Entry: [0, 1896]
Received: 1550000 Curr Entry: [0, 1971]
Received: 1600000 Curr Entry: [0, 2043]
Received: 1650000 Curr Entry: [0, 2101]
Received: 1700000 Curr Entry: [0, 2164]
Received: 1750000 Curr Entry: [0, 2226]
Received: 1800000 Curr Entry: [0, 2297]
Received: 1850000 Curr Entry: [0, 2367]
Received: 1900000 Curr Entry: [0, 2433]
Received: 1950000 Curr Entry: [0, 2496]
Received: 2000000 Curr Entry: [0, 2557]
Read phase complete with message (0,2557,378,-1)
2018-10-19 16:32:35.080 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, reader-5a64ed, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 2000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0

可以在 test-output/no-fail-test1-output.txt 文件中查看简要结果。

16:11:45 Start test
16:13:54: 
16:13:54: Test Run #1 on topic no-fail-test1_1  ------------
16:15:41: Results --------------------------------------------
16:15:41: Final send count: 2000000
16:15:41: Final ack count: 2000000
16:15:41: Final positive ack count: 2000000
16:15:41: Final negative ack count: 0
16:15:41: Messages received: 2000000
16:15:41: Acked messages missing: 0
16:15:41: Non-acked messages received: 0
16:15:41: Out-of-order: 0
16:15:41: Duplicates: 0
16:15:41: ----------------------------------------------------
16:17:57: 
16:17:57: Test Run #2 on topic no-fail-test1_2  ------------
16:19:44: Results --------------------------------------------
16:19:44: Final send count: 2000000
16:19:44: Final ack count: 2000000
16:19:44: Final positive ack count: 2000000
16:19:44: Final negative ack count: 0
16:19:44: Messages received: 2000000
16:19:44: Acked messages missing: 0
16:19:44: Non-acked messages received: 0
16:19:44: Out-of-order: 0
16:19:44: Duplicates: 0
16:19:44: ----------------------------------------------------
16:22:01: 
16:22:01: Test Run #3 on topic no-fail-test1_3  ------------
16:23:52: Results --------------------------------------------
16:23:52: Final send count: 2000000
16:23:52: Final ack count: 2000000
16:23:52: Final positive ack count: 2000000
16:23:52: Final negative ack count: 0
16:23:52: Messages received: 2000000
16:23:52: Acked messages missing: 0
16:23:52: Non-acked messages received: 0
16:23:52: Out-of-order: 0
16:23:52: Duplicates: 0
16:23:52: ----------------------------------------------------
16:26:20: 
16:26:20: Test Run #4 on topic no-fail-test1_4  ------------
16:28:18: Results --------------------------------------------
16:28:18: Final send count: 2000000
16:28:18: Final ack count: 2000000
16:28:18: Final positive ack count: 2000000
16:28:18: Final negative ack count: 0
16:28:18: Messages received: 2000000
16:28:18: Acked messages missing: 0
16:28:18: Non-acked messages received: 0
16:28:18: Out-of-order: 0
16:28:18: Duplicates: 0
16:28:18: ----------------------------------------------------
16:30:43: 
16:30:43: Test Run #5 on topic no-fail-test1_5  ------------
16:32:36: Results --------------------------------------------
16:32:36: Final send count: 2000000
16:32:36: Final ack count: 2000000
16:32:36: Final positive ack count: 2000000
16:32:36: Final negative ack count: 0
16:32:36: Messages received: 2000000
16:32:36: Acked messages missing: 0
16:32:36: Non-acked messages received: 0
16:32:36: Out-of-order: 0
16:32:36: Duplicates: 0
16:32:36: ----------------------------------------------------

从终端输出可以看出速度并不是很快。五个测试共用了大约 20 分钟。主要问题在于创建 blockade 集群速度较慢。使用 RabbitMQ 和 Kafka,运行速度很快;但 Pulsar 集群占用了大量 CPU,启动缓慢。原因可能和我们运行了 8 个节点(1 ZK、1 Proxy、3 Broker、3 Bookie)有关,最多时节点数达到 11 个。

测试中没有出现消息丢失、重复或乱序的问题。

测试场景 1 - 关闭 topic broker owner

参数配置 2-2-1 3-3。

在测试中,pulsar-test.py 使用 shell 脚本检测 broker owner,并在消息传输时将其关闭。

参数配置为 E=2、Qw=2、Qa=1,3 个 broker,3 个 bookie,这种配置下冗余最小。每条消息有两个副本,但 broker 接收到任一 bookie 的一个 ack 后,就会将 ack 发送到客户端。

在消息传输时关闭 broker 不会造成数据丢失。只有在 Qa bookie 持久化消息到磁盘后,消息才被 ack,并且由 broker 发送 ack。如果关闭 broker,最差的情况是 reader 读取未被 ack 的消息。Bookie 可能已经持久化消息,但 broker 在向客户端发送 ack 前出现故障。在启用新 broker 后,恢复期间会检测到这些消息,并确保关闭 ledger 前生成消息副本,使这些消息对 reader 可用。

以下是某次测试的完整输出。在每次测试中,发送 100 万条消息,在发送到第 5 万条消息时关闭 owner。我们进行了 5 次该测试。

$ python pulsar-test.py kill-broker kill-broker-test1 5 1000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic kill-broker-test1_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-19 20:31:48.389 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar2:6650
2018-10-19 20:31:48.389 INFO  ClientConnection:287 | [192.168.96.1:52202 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar2:6650
Send count: 51706 Ack count: 50000 Pos: 50000 Neg: 0
Send count: 101199 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 155831 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 208015 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 250923 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 303241 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 350970 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 400041 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 450612 Ack count: 450000 Pos: 450000 Neg: 0
-------------------------------------------------
pulsar2 is the topic owner, killing pulsar2!!!!!!
-------------------------------------------------
Send count: 501553 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 551276 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 602319 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 651554 Ack count: 650000 Pos: 650000 Neg: 0
2018-10-19 20:32:07.710 INFO  ClientConnection:1237 | [192.168.96.1:52202 -> 192.168.96.9:6650] Connection closed
2018-10-19 20:32:07.710 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Schedule reconnection in 0.1 s
2018-10-19 20:32:07.710 INFO  ClientConnection:195 | [192.168.96.1:52202 -> 192.168.96.9:6650] Destroyed connection
2018-10-19 20:32:07.810 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Getting connection from pool
2018-10-19 20:32:07.815 ERROR ClientConnection:792 | [192.168.96.1:52196 -> 192.168.96.9:6650] Failed lookup req_id: 3 error: 6
2018-10-19 20:32:07.815 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Schedule reconnection in 0.192 s
2018-10-19 20:32:08.009 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Getting connection from pool
2018-10-19 20:32:08.500 INFO  ConnectionPool:53 | Deleting stale connection from pool for pulsar://pulsar2:6650 use_count: -1 @ 0
2018-10-19 20:32:08.500 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar2:6650
2018-10-19 20:32:08.501 INFO  ClientConnection:287 | [192.168.96.1:52276 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar2:6650
-------------------------------------------------
pulsar2 KILLED!
-------------------------------------------------
Send count: 702977 Ack count: 700000 Pos: 692977 Neg: 7023
2018-10-19 20:32:38.508 INFO  ClientConnection:1237 | [192.168.96.1:52276 -> 192.168.96.9:6650] Connection closed
2018-10-19 20:32:38.509 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Schedule reconnection in 0.1 s
2018-10-19 20:32:38.509 INFO  ClientConnection:195 | [192.168.96.1:52276 -> 192.168.96.9:6650] Destroyed connection
2018-10-19 20:32:38.610 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Getting connection from pool
2018-10-19 20:32:38.647 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-19 20:32:38.647 INFO  ClientConnection:287 | [192.168.96.1:52306 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-19 20:32:39.592 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Created producer on broker [192.168.96.1:52306 -> 192.168.96.9:6650] 
Send count: 750954 Ack count: 750000 Pos: 740000 Neg: 10000
Send count: 805568 Ack count: 800000 Pos: 790000 Neg: 10000
Send count: 853879 Ack count: 850000 Pos: 840000 Neg: 10000
Send count: 904020 Ack count: 900000 Pos: 890000 Neg: 10000
Send count: 953552 Ack count: 950000 Pos: 940000 Neg: 10000
Send count: 1000000 Ack count: 1000000 Pos: 990000 Neg: 10000
2018-10-19 20:32:47.320 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Closed producer
2018-10-19 20:32:47.466 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.96.9:6650
2018-10-19 20:32:47.468 INFO  ClientConnection:285 | [192.168.96.1:52406 -> 192.168.96.9:6650] Connected to broker
2018-10-19 20:32:47.477 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, reader-63a5b2, 0] Getting connection from pool
2018-10-19 20:32:47.496 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-19 20:32:47.496 INFO  ClientConnection:287 | [192.168.96.1:52410 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-19 20:32:47.562 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, reader-63a5b2, 0] Created consumer on broker [192.168.96.1:52410 -> 192.168.96.9:6650]
-------------------------------------------------
READ PHASE
-------------------------------------------------
Last confirmed entry: [1, 387]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 72]
Received: 100000 Curr Entry: [0, 148]
Received: 150000 Curr Entry: [0, 207]
Received: 200000 Curr Entry: [0, 272]
Received: 250000 Curr Entry: [0, 335]
Received: 300000 Curr Entry: [0, 400]
Received: 350000 Curr Entry: [0, 466]
Received: 400000 Curr Entry: [0, 534]
Received: 450000 Curr Entry: [0, 611]
Received: 500000 Curr Entry: [0, 688]
Received: 550000 Curr Entry: [0, 747]
Received: 600000 Curr Entry: [0, 815]
Received: 650000 Curr Entry: [0, 913]
Received: 700000 Curr Entry: [1, 7]
Received: 750000 Curr Entry: [1, 83]
Received: 800000 Curr Entry: [1, 157]
Received: 850000 Curr Entry: [1, 227]
Received: 900000 Curr Entry: [1, 287]
Received: 950000 Curr Entry: [1, 343]
Read phase complete with message (1,387,469,-1)
2018-10-19 20:33:11.943 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, reader-63a5b2, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 1000000
Final ack count: 1000000
Final positive ack count: 990000
Final negative ack count: 10000
Messages received: 990000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0

可以看到,在消息传输过程中,识别 pulsar2 为 owner 后,很快将其关闭。客户端出现了连接失败的问题,并通过 proxy 重新连接到新的 owner pulsar1。重连成功后,继续发送消息。

由于 producer 受到限制,通常只会传输 1 万条消息,而这 1 万条消息全部接收失败(超时)。这样接收成功的消息就有 99 万条,无重复消息,均被有序读取。

我们共进行了 5 次测试,测试结果如下:

19:59:08: 
19:59:08: Test Run #1 on topic kill-broker-test_1  ------------
20:00:41: Results --------------------------------------------
20:00:41: Final send count: 1000000
20:00:41: Final ack count: 1000000
20:00:41: Final positive ack count: 990000
20:00:41: Final negative ack count: 10000
20:00:41: Messages received: 990000
20:00:41: Acked messages missing: 0
20:00:41: Non-acked messages received: 0
20:00:41: Out-of-order: 0
20:00:41: Duplicates: 0
20:00:41: ----------------------------------------------------
20:02:54: 
20:02:54: Test Run #2 on topic kill-broker-test_2  ------------
20:03:54: Results --------------------------------------------
20:03:54: Final send count: 1000000
20:03:54: Final ack count: 1000000
20:03:54: Final positive ack count: 1000000
20:03:54: Final negative ack count: 0
20:03:54: Messages received: 1000000
20:03:54: Acked messages missing: 0
20:03:54: Non-acked messages received: 0
20:03:54: Out-of-order: 0
20:03:54: Duplicates: 0
20:03:54: ----------------------------------------------------
20:06:06: 
20:06:06: Test Run #3 on topic kill-broker-test_3  ------------
20:07:31: Results --------------------------------------------
20:07:31: Final send count: 1000000
20:07:31: Final ack count: 1000000
20:07:31: Final positive ack count: 990000
20:07:31: Final negative ack count: 10000
20:07:31: Messages received: 991065
20:07:31: Acked messages missing: 0
20:07:31: Non-acked messages received: 1065
20:07:31: Out-of-order: 0
20:07:31: Duplicates: 0
20:07:31: ----------------------------------------------------
20:09:42: 
20:09:42: Test Run #4 on topic kill-broker-test_4  ------------
20:11:18: Results --------------------------------------------
20:11:18: Final send count: 1000000
20:11:18: Final ack count: 1000000
20:11:18: Final positive ack count: 990000
20:11:18: Final negative ack count: 10000
20:11:18: Messages received: 990000
20:11:18: Acked messages missing: 0
20:11:18: Non-acked messages received: 0
20:11:18: Out-of-order: 0
20:11:18: Duplicates: 0
20:11:18: ----------------------------------------------------
20:13:36: 
20:13:36: Test Run #5 on topic kill-broker-test_5  ------------
20:15:09: Results --------------------------------------------
20:15:09: Final send count: 1000000
20:15:09: Final ack count: 1000000
20:15:09: Final positive ack count: 990000
20:15:09: Final negative ack count: 10000
20:15:09: Messages received: 990000
20:15:09: Acked messages missing: 0
20:15:09: Non-acked messages received: 0
20:15:09: Out-of-order: 0
20:15:09: Duplicates: 0
20:15:09: ----------------------------------------------------

可以看出没有消息丢失。第 3 次运行时,读取了 1065 条未 ack 的消息。如果重试发送消息引起了消息重复,这种情况就很正常。启用去重就可以解决这一问题,我们会在后面的场景中进行测试。

正如预期,broker 故障不会导致消息丢失。

测试场景 2 - 关闭当前 Ledger Ensemble 中的 Bookie

参数配置 2-2-1 3-3。

我们采用冗余最小的配置来测试 Pulsar 的灵活性,E=2、Qw=2、Qa=1,3 个 broker,3 个 bookie。在发送 100 万条消息期间,识别当前 ledger ensemble 中的一个 bookie,并将其关闭。然后,检查 reader 读取 100 万条消息的顺序是否正确。

在这里我做了一个特殊的假设,把当前 ledger 看作第一个 ledger。在进行过无数次测试后,我发现至少前 100 万条消息会落在第一个 ledger 的 entry 中。因此,我决定只关注第一个 ledger 中的 ensemble,它在 ZooKeeper 中的路径已知。

如果 Qw = 2,只关闭一个 bookie,不应该出现数据丢失。如果关闭了 ack 消息的 bookie,则在 AutoRecovery 复制数据前,仍有一个 entry 副本,而出现数据丢失的情况是两个副本同时丢失。

单次运行的完整输出如下。

$ python pulsar-test.py kill-bookie kill-bookie-test 5 1000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic kill-bookie-test_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-19 23:46:10.939 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.224.9:6650
2018-10-19 23:46:12.877 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-19 23:46:12.877 INFO  ClientConnection:287 | [192.168.224.1:50588 -> 192.168.224.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-19 23:46:13.397 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, ] Created producer on broker [192.168.224.1:50588 -> 192.168.224.9:6650] 
Send count: 52575 Ack count: 50000 Pos: 50000 Neg: 0
Send count: 100560 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 156455 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 203545 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 255199 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 305732 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 357709 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 407634 Ack count: 400000 Pos: 400000 Neg: 0
-------------------------------------------------
bookie1 is in the current ledger ensemble, killing bookie1!!!!!!
-------------------------------------------------
Send count: 451809 Ack count: 450000 Pos: 450000 Neg: 0
Send count: 501234 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 551348 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 601729 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 650566 Ack count: 650000 Pos: 650000 Neg: 0
-------------------------------------------------
bookie1 KILLED!
-------------------------------------------------
Send count: 701961 Ack count: 700000 Pos: 700000 Neg: 0
Send count: 751294 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 802939 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 850390 Ack count: 850000 Pos: 850000 Neg: 0
Send count: 903739 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 953286 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1000000 Ack count: 1000000 Pos: 1000000 Neg: 0
2018-10-19 23:46:47.281 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, cluster-1-2-0] Closed producer
-------------------------------------------------
READ PHASE
-------------------------------------------------
2018-10-19 23:46:57.499 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.224.9:6650
2018-10-19 23:46:57.500 INFO  ClientConnection:285 | [192.168.224.1:50756 -> 192.168.224.9:6650] Connected to broker
2018-10-19 23:46:57.508 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, reader-4d4064, 0] Getting connection from pool
2018-10-19 23:46:57.566 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar2:6650
2018-10-19 23:46:57.566 INFO  ClientConnection:287 | [192.168.224.1:50760 -> 192.168.224.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar2:6650
2018-10-19 23:46:57.750 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, reader-4d4064, 0] Created consumer on broker [192.168.224.1:50760 -> 192.168.224.9:6650] 
LCE. broker pulsar2 lac_line "1:475",
Last confirmed entry: [1, 475]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 70]
Received: 100000 Curr Entry: [0, 165]
Received: 150000 Curr Entry: [0, 270]
Received: 200000 Curr Entry: [0, 339]
Received: 250000 Curr Entry: [0, 403]
Received: 300000 Curr Entry: [0, 462]
Received: 350000 Curr Entry: [0, 519]
Received: 400000 Curr Entry: [0, 579]
Received: 450000 Curr Entry: [0, 658]
Received: 500000 Curr Entry: [0, 717]
Received: 550000 Curr Entry: [0, 782]
Received: 600000 Curr Entry: [0, 848]
Received: 650000 Curr Entry: [0, 942]
Received: 700000 Curr Entry: [1, 57]
Received: 750000 Curr Entry: [1, 119]
Received: 800000 Curr Entry: [1, 195]
Received: 850000 Curr Entry: [1, 267]
Received: 900000 Curr Entry: [1, 357]
Received: 950000 Curr Entry: [1, 418]
Received: 1000000 Curr Entry: [1, 475]
Read phase complete with message (1,475,324,-1)
2018-10-19 23:47:51.236 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, reader-4d4064, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 1000000
Final ack count: 1000000
Final positive ack count: 1000000
Final negative ack count: 0
Messages received: 1000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0
----------------------------------------------------

在消息传输期间关闭了 1 个 bookie,但消息传输并未受到影响,reader 按顺序成功读取了 100 万条消息,并且没有出现消息重复。

5 次测试的运行结果如下。

23:52:20 Start test
23:54:49: 
23:54:49: Test Run #1 on topic kill-bookie-test_1  ------------
23:56:38: Results --------------------------------------------
23:56:38: Final send count: 1000000
23:56:38: Final ack count: 1000000
23:56:38: Final positive ack count: 1000000
23:56:38: Final negative ack count: 0
23:56:38: Messages received: 1000000
23:56:38: Acked messages missing: 0
23:56:38: Non-acked messages received: 0
23:56:38: Out-of-order: 0
23:56:38: Duplicates: 0
23:56:38: ----------------------------------------------------
23:58:54: 
23:58:54: Test Run #2 on topic kill-bookie-test_2  ------------
00:00:50: Results --------------------------------------------
00:00:50: Final send count: 1000000
00:00:50: Final ack count: 1000000
00:00:50: Final positive ack count: 1000000
00:00:50: Final negative ack count: 0
00:00:50: Messages received: 1000000
00:00:50: Acked messages missing: 0
00:00:50: Non-acked messages received: 0
00:00:50: Out-of-order: 0
00:00:50: Duplicates: 0
00:00:50: ----------------------------------------------------
00:03:12: 
00:03:12: Test Run #3 on topic kill-bookie-test_3  ------------
00:05:01: Results --------------------------------------------
00:05:01: Final send count: 1000000
00:05:01: Final ack count: 1000000
00:05:01: Final positive ack count: 1000000
00:05:01: Final negative ack count: 0
00:05:01: Messages received: 1000000
00:05:01: Acked messages missing: 0
00:05:01: Non-acked messages received: 0
00:05:01: Out-of-order: 0
00:05:01: Duplicates: 0
00:05:01: ----------------------------------------------------
00:07:15: 
00:07:15: Test Run #4 on topic kill-bookie-test_4  ------------
00:09:04: Results --------------------------------------------
00:09:04: Final send count: 1000000
00:09:04: Final ack count: 1000000
00:09:04: Final positive ack count: 1000000
00:09:04: Final negative ack count: 0
00:09:04: Messages received: 1000000
00:09:04: Acked messages missing: 0
00:09:04: Non-acked messages received: 0
00:09:04: Out-of-order: 0
00:09:04: Duplicates: 0
00:09:04: ----------------------------------------------------
00:11:23: 
00:11:23: Test Run #5 on topic kill-bookie-test_5  ------------
00:13:23: Results --------------------------------------------
00:13:23: Final send count: 1000000
00:13:23: Final ack count: 1000000
00:13:23: Final positive ack count: 1000000
00:13:23: Final negative ack count: 0
00:13:23: Messages received: 1000000
00:13:23: Acked messages missing: 0
00:13:23: Non-acked messages received: 0
00:13:23: Out-of-order: 0
00:13:23: Duplicates: 0
00:13:23: ----------------------------------------------------

没有出现消息丢失、重复和乱序。

测试场景 3 - 隔离 Topic Owner Broker 与 ZooKeeper

参数配置 2-2-1 3-3。

在这次测试中,我们只隔离 topic owner broker 和 ZooKeeper。我们预测在隔离 ZooKeeper 前,broker owner 会持续接收写入;隔离生效后,broker owner 自动重启,另一个 broker 接管 broker owner 角色。

预计测试结果与关闭 bookie 的结果类似。

单次运行的完整输出如下。

$ python pulsar-test.py isolate-broker-from-zk iso-broker 1 2000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic iso-broker_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-20 08:22:43.961 INFO  ConnectionPool:63 | Created connection for pulsar://172.22.0.9:6650
2018-10-20 08:22:43.963 INFO  ClientConnection:285 | [172.22.0.1:46968 -> 172.22.0.9:6650] Connected to broker
2018-10-20 08:22:44.664 INFO  BatchMessageContainer:41 | { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://vanlightly/cluster-1/ns1/iso-broker_1] [producerName_ = ] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 0] [averageBatchSize = 0]} BatchMessageContainer constructed
2018-10-20 08:22:44.664 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, ] Getting connection from pool
2018-10-20 08:22:45.254 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 08:22:45.254 INFO  ClientConnection:287 | [172.22.0.1:46972 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 08:22:46.329 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, ] Created producer on broker [172.22.0.1:46972 -> 172.22.0.9:6650] 
Send count: 57639 Ack count: 50000 Pos: 50000 Neg: 0
pulsar1 is the topic owner, isolating pulsar1 from zookeepr!!!!!!
Send count: 100918 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 151902 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 200658 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 251287 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 306076 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 353796 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 403469 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 455195 Ack count: 450000 Pos: 450000 Neg: 0
-------------------------------------------------
pulsar1 ISOLATED!
-------------------------------------------------
Send count: 506817 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 557051 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 601668 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 652490 Ack count: 650000 Pos: 650000 Neg: 0
Send count: 709134 Ack count: 700000 Pos: 700000 Neg: 0
Send count: 757860 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 805959 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 858283 Ack count: 850000 Pos: 850000 Neg: 0
Send count: 907911 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 951597 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1002732 Ack count: 1000000 Pos: 1000000 Neg: 0
Send count: 1050689 Ack count: 1050000 Pos: 1050000 Neg: 0
Send count: 1104743 Ack count: 1100000 Pos: 1100000 Neg: 0
Send count: 1157332 Ack count: 1150000 Pos: 1150000 Neg: 0
Send count: 1202361 Ack count: 1200000 Pos: 1200000 Neg: 0
Send count: 1256378 Ack count: 1250000 Pos: 1250000 Neg: 0
Send count: 1301586 Ack count: 1300000 Pos: 1300000 Neg: 0
Send count: 1352458 Ack count: 1350000 Pos: 1350000 Neg: 0
Send count: 1407455 Ack count: 1400000 Pos: 1400000 Neg: 0
2018-10-20 08:23:40.624 INFO  ClientConnection:1237 | [172.22.0.1:46972 -> 172.22.0.9:6650] Connection closed
2018-10-20 08:23:40.624 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Schedule reconnection in 0.1 s
2018-10-20 08:23:40.624 INFO  ClientConnection:195 | [172.22.0.1:46972 -> 172.22.0.9:6650] Destroyed connection
2018-10-20 08:23:40.724 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Getting connection from pool
2018-10-20 08:23:41.103 INFO  ConnectionPool:53 | Deleting stale connection from pool for pulsar://pulsar1:6650 use_count: -1 @ 0
2018-10-20 08:23:41.103 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 08:23:41.104 INFO  ClientConnection:287 | [172.22.0.1:47050 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 08:23:42.139 INFO  ClientConnection:1237 | [172.22.0.1:47050 -> 172.22.0.9:6650] Connection closed
2018-10-20 08:23:42.139 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Schedule reconnection in 0.186 s
2018-10-20 08:23:42.139 INFO  ClientConnection:195 | [172.22.0.1:47050 -> 172.22.0.9:6650] Destroyed connection
2018-10-20 08:23:42.325 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Getting connection from pool
2018-10-20 08:23:42.750 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-20 08:23:42.751 INFO  ClientConnection:287 | [172.22.0.1:47054 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-20 08:23:43.632 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Created producer on broker [172.22.0.1:47054 -> 172.22.0.9:6650] 
Send count: 1459178 Ack count: 1450000 Pos: 1450000 Neg: 0
Send count: 1505703 Ack count: 1500000 Pos: 1500000 Neg: 0
Send count: 1551367 Ack count: 1550000 Pos: 1550000 Neg: 0
Send count: 1601042 Ack count: 1600000 Pos: 1600000 Neg: 0
Send count: 1651310 Ack count: 1650000 Pos: 1650000 Neg: 0
Send count: 1707836 Ack count: 1700000 Pos: 1700000 Neg: 0
Send count: 1751159 Ack count: 1750000 Pos: 1750000 Neg: 0
Send count: 1802089 Ack count: 1800000 Pos: 1800000 Neg: 0
Send count: 1851082 Ack count: 1850000 Pos: 1850000 Neg: 0
Send count: 1906930 Ack count: 1900000 Pos: 1900000 Neg: 0
Send count: 1951920 Ack count: 1950000 Pos: 1950000 Neg: 0
Send count: 2000000 Ack count: 2000000 Pos: 2000000 Neg: 0
2018-10-20 08:23:57.505 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Closed producer
-------------------------------------------------
READ PHASE
-------------------------------------------------
2018-10-20 08:24:07.620 INFO  ConnectionPool:63 | Created connection for pulsar://172.22.0.9:6650
2018-10-20 08:24:07.621 INFO  ClientConnection:285 | [172.22.0.1:47156 -> 172.22.0.9:6650] Connected to broker
2018-10-20 08:24:07.635 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, reader-c1ba0b, 0] Getting connection from pool
2018-10-20 08:24:07.665 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-20 08:24:07.665 INFO  ClientConnection:287 | [172.22.0.1:47160 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-20 08:24:07.749 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, reader-c1ba0b, 0] Created consumer on broker [172.22.0.1:47160 -> 172.22.0.9:6650] 
Last confirmed entry: [1, 658]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 61]
Received: 100000 Curr Entry: [0, 142]
Received: 150000 Curr Entry: [0, 212]
Received: 200000 Curr Entry: [0, 288]
Received: 250000 Curr Entry: [0, 349]
Received: 300000 Curr Entry: [0, 420]
Received: 350000 Curr Entry: [0, 487]
Received: 400000 Curr Entry: [0, 554]
Received: 450000 Curr Entry: [0, 612]
Received: 500000 Curr Entry: [0, 675]
Received: 550000 Curr Entry: [0, 731]
Received: 600000 Curr Entry: [0, 790]
Received: 650000 Curr Entry: [0, 847]
Received: 700000 Curr Entry: [0, 904]
Received: 750000 Curr Entry: [0, 961]
Received: 800000 Curr Entry: [0, 1022]
Received: 850000 Curr Entry: [0, 1080]
Received: 900000 Curr Entry: [0, 1136]
Received: 950000 Curr Entry: [0, 1194]
Received: 1000000 Curr Entry: [0, 1252]
Received: 1050000 Curr Entry: [0, 1310]
Received: 1100000 Curr Entry: [0, 1370]
Received: 1150000 Curr Entry: [0, 1428]
Received: 1200000 Curr Entry: [0, 1487]
Received: 1250000 Curr Entry: [0, 1545]
Received: 1300000 Curr Entry: [0, 1602]
Received: 1350000 Curr Entry: [0, 1660]
Received: 1400000 Curr Entry: [0, 1721]
Received: 1450000 Curr Entry: [1, 1]
Received: 1500000 Curr Entry: [1, 60]
Received: 1550000 Curr Entry: [1, 124]
Received: 1600000 Curr Entry: [1, 186]
Received: 1650000 Curr Entry: [1, 247]
Received: 1700000 Curr Entry: [1, 303]
Received: 1750000 Curr Entry: [1, 361]
Received: 1800000 Curr Entry: [1, 420]
Received: 1850000 Curr Entry: [1, 478]
Received: 1900000 Curr Entry: [1, 539]
Received: 1950000 Curr Entry: [1, 598]
Received: 2000000 Curr Entry: [1, 658]
Read phase complete with message (1,658,138,-1)
2018-10-20 08:24:44.361 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, reader-c1ba0b, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 2000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0
----------------------------------------------------

在本次测试中,我们发送了 200 万条消息,因为隔离 owner broker 再看到效果需要一段时间。大概在发送 140 万条消息时出现故障并迅速重新连接。Reader 有序读取全部消息,并且没有消息重复。

5 次测试的运行结果如下。

08:33:38 Start test
08:35:55: 
08:35:55: Test Run #1 on topic iso-broker_1  ------------
08:38:10: Results --------------------------------------------
08:38:10: Final send count: 2000000
08:38:10: Final ack count: 2000000
08:38:10: Final positive ack count: 2000000
08:38:10: Final negative ack count: 0
08:38:10: Messages received: 2000000
08:38:10: Acked messages missing: 0
08:38:10: Non-acked messages received: 0
08:38:10: Out-of-order: 0
08:38:10: Duplicates: 0
08:38:10: ----------------------------------------------------
08:40:29: 
08:40:29: Test Run #2 on topic iso-broker_2  ------------
08:42:36: Results --------------------------------------------
08:42:36: Final send count: 2000000
08:42:36: Final ack count: 2000000
08:42:36: Final positive ack count: 2000000
08:42:36: Final negative ack count: 0
08:42:36: Messages received: 2000000
08:42:36: Acked messages missing: 0
08:42:36: Non-acked messages received: 0
08:42:36: Out-of-order: 0
08:42:36: Duplicates: 0
08:42:36: ----------------------------------------------------
08:44:55: 
08:44:55: Test Run #3 on topic iso-broker_3  ------------
08:47:08: Results --------------------------------------------
08:47:08: Final send count: 2000000
08:47:08: Final ack count: 2000000
08:47:08: Final positive ack count: 2000000
08:47:08: Final negative ack count: 0
08:47:08: Messages received: 2001000
08:47:08: Acked messages missing: 0
08:47:08: Non-acked messages received: 0
08:47:08: Out-of-order: 0
08:47:08: Duplicates: 1000
08:47:08: ----------------------------------------------------
08:49:33: 
08:49:33: Test Run #4 on topic iso-broker_4  ------------
08:51:52: Results --------------------------------------------
08:51:52: Final send count: 2000000
08:51:52: Final ack count: 2000000
08:51:52: Final positive ack count: 2000000
08:51:52: Final negative ack count: 0
08:51:52: Messages received: 2001090
08:51:52: Acked messages missing: 0
08:51:52: Non-acked messages received: 0
08:51:52: Out-of-order: 0
08:51:52: Duplicates: 1090
08:51:52: ----------------------------------------------------
08:54:18: 
08:54:18: Test Run #5 on topic iso-broker_5  ------------
08:56:41: Results --------------------------------------------
08:56:41: Final send count: 2000000
08:56:41: Final ack count: 2000000
08:56:41: Final positive ack count: 2000000
08:56:41: Final negative ack count: 0
08:56:41: Messages received: 2000000
08:56:41: Acked messages missing: 0
08:56:41: Non-acked messages received: 0
08:56:41: Out-of-order: 0
08:56:41: Duplicates: 0
08:56:41: ----------------------------------------------------

没有出现消息丢失和乱序问题,但出现了消息重复。在两次测试中,分别出现了 1000 条和 1090 条重复消息。

当已读取整数值小于等于读取过的消息时(即这是一条重复消息),reader 会在 [test-name]-duplicates.txt 中记录消息 ID、之前消费消息的整数值、当前消息的整数值。

test-output]$ cat iso-broker_duplicates.txt
run|last_msg_id|last_value|curr_msg_id|curr_value
3|(0,1643,999,-1)|1356634|(1,0,0,-1)|1355635
4|(0,1597,792,-1)|1220380|(1,0,0,-1)|1219588
4|(1,712,296,-1)|1811705|(2,0,0,-1)|1811409

第 3 次运行结果中只有一个 entry,也就是说重复消息位于连续的消息块中。整数值 1,355,635 第二次出现于 ledger 1,entry 0 的第一条消息中,消息 ID 为(1,0,0,-1)。在 ledger 0 中,共追加了 1000 条消息,在下一个 ledger 的开头同样写入了这 1000 条消息。

在第 4 次测试中也出现了类似的情况。

第一个 broker 重启后,客户端重新发送消息可能是造成消息重复的原因。Broker 向 pulsar1 的 BK ensemble 发送 1000 条消息,但是在向客户端发送 ack 前,由于与 ZooKeeper 断开,broker 出现故障。然后,客户端重新连接到新的 owner pulsar2,再次发送 1000 条消息。Pulsar2 恢复了上一个 ledger(即 pulsar1 拥有的 ledger),在关闭此 ledger 后创建新 ledger,准备接收新消息。Broker 接收到了客户端发来的 1000 条消息,并将这些消息写入新 ledger。因此出现了消息重复。

我们可以启用消息去重测试上述可能的原因。启用消息去重时,broker 将每个 producer 最后的序列号存储在哈希表中。如果收到的序列号小于所有现有序列号,则这条消息为重复的消息,不对其做任何处理。Broker 将关于 topic 的数据(producer、序列号)存储在游标中,当 broker 进行故障转移时,新 broker 将重新创建哈希表。由于哈希表会定期刷新,如果 broker 出现故障,哈希表中的最新序列号可能会丢失。如果新 broker 仅依赖于哈希表更新快照,则在 broker 进行故障转移时,很可能再次出现消息重复。为了避免这一情况的发生,Pulsar 中的新 broker owner 从 ledger 读取最新的 N 条 entry,并追加这些 entry 到哈希表,以确保不会因为故障转移导致哈希表中数据不完整。

测试场景 4 - 隔离 Topic Owner Broker 与 ZooKeeper,启用消息去重

参数配置为 2-2-1 3-3,启用消息去重。

除了启用消息去重外,测试场景 4 与测试场景 3 完全相同。本次测试用于检测消息去重是否可以防止消息重复。

在 blockade.yml 文件中,为 broker 的 environment 添加 brokerDeduplicationEnabled 参数,启用消息去重。

5 次测试的运行结果如下。

09:24:34 Start test
09:26:56: 
09:26:56: Test Run #1 on topic iso-broker_1  ------------
09:29:27: Results --------------------------------------------
09:29:27: Final send count: 2000000
09:29:27: Final ack count: 2000000
09:29:27: Final positive ack count: 2000000
09:29:27: Final negative ack count: 0
09:29:27: Messages received: 2000000
09:29:27: Acked messages missing: 0
09:29:27: Non-acked messages received: 0
09:29:27: Out-of-order: 0
09:29:27: Duplicates: 0
09:29:27: ----------------------------------------------------
09:32:48: 
09:32:48: Test Run #2 on topic iso-broker_2  ------------
09:35:06: Results --------------------------------------------
09:35:06: Final send count: 2000000
09:35:06: Final ack count: 2000000
09:35:06: Final positive ack count: 2000000
09:35:06: Final negative ack count: 0
09:35:06: Messages received: 2000000
09:35:06: Acked messages missing: 0
09:35:06: Non-acked messages received: 0
09:35:06: Out-of-order: 0
09:35:06: Duplicates: 0
09:35:06: ----------------------------------------------------
09:37:21: 
09:37:21: Test Run #3 on topic iso-broker_3  ------------
09:39:21: Results --------------------------------------------
09:39:21: Final send count: 2000000
09:39:21: Final ack count: 2000000
09:39:21: Final positive ack count: 2000000
09:39:21: Final negative ack count: 0
09:39:21: Messages received: 2000000
09:39:21: Acked messages missing: 0
09:39:21: Non-acked messages received: 0
09:39:21: Out-of-order: 0
09:39:21: Duplicates: 0
09:39:21: ----------------------------------------------------
09:41:39: 
09:41:39: Test Run #4 on topic iso-broker_4  ------------
09:43:51: Results --------------------------------------------
09:43:51: Final send count: 2000000
09:43:51: Final ack count: 2000000
09:43:51: Final positive ack count: 2000000
09:43:51: Final negative ack count: 0
09:43:51: Messages received: 2000000
09:43:51: Acked messages missing: 0
09:43:51: Non-acked messages received: 0
09:43:51: Out-of-order: 0
09:43:51: Duplicates: 0
09:43:51: ----------------------------------------------------
09:46:17: 
09:46:17: Test Run #5 on topic iso-broker_5  ------------
09:48:53: Results --------------------------------------------
09:48:53: Final send count: 2000000
09:48:53: Final ack count: 2000000
09:48:53: Final positive ack count: 2000000
09:48:53: Final negative ack count: 0
09:48:53: Messages received: 2000000
09:48:53: Acked messages missing: 0
09:48:53: Non-acked messages received: 0
09:48:53: Out-of-order: 0
09:48:53: Duplicates: 0
09:48:53: ----------------------------------------------------

从测试结果可以看出,启用消息去重可以解决测试场景 3 中出现的消息重复问题。

测试场景 5 - 隔离 Bookie 与 ZooKeeper

参数配置 2-2-1 3-3。

强调一下,bookie 不需要 ZooKeeper 进行读写,只需要在 ZooKeeper 中进行注册、垃圾回收、AutoRecovery。所以,隔离 bookie 与 ZooKeeper 应该不会产生任何影响。

首先在当前 ledger 的 ensemble 中识别出一个 bookie,在消息传输期间将其与 ZooKeeper 隔离。5 次测试结果如下。

$ cat iso-bookie_output.txt
12:03:52 Start test
12:06:10: 
12:06:10: Test Run #1 on topic iso-bookie_1  ------------
12:08:18: Results --------------------------------------------
12:08:18: Final send count: 2000000
12:08:18: Final ack count: 2000000
12:08:18: Final positive ack count: 2000000
12:08:18: Final negative ack count: 0
12:08:18: Messages received: 2000000
12:08:18: Acked messages missing: 0
12:08:18: Non-acked messages received: 0
12:08:18: Out-of-order: 0
12:08:18: Duplicates: 0
12:08:18: ----------------------------------------------------
12:10:35: 
12:10:35: Test Run #2 on topic iso-bookie_2  ------------
12:12:40: Results --------------------------------------------
12:12:40: Final send count: 2000000
12:12:40: Final ack count: 2000000
12:12:40: Final positive ack count: 2000000
12:12:40: Final negative ack count: 0
12:12:40: Messages received: 2000000
12:12:40: Acked messages missing: 0
12:12:40: Non-acked messages received: 0
12:12:40: Out-of-order: 0
12:12:40: Duplicates: 0
12:12:40: ----------------------------------------------------
12:14:55: 
12:14:55: Test Run #3 on topic iso-bookie_3  ------------
12:16:54: Results --------------------------------------------
12:16:54: Final send count: 2000000
12:16:54: Final ack count: 2000000
12:16:54: Final positive ack count: 2000000
12:16:54: Final negative ack count: 0
12:16:54: Messages received: 2000000
12:16:54: Acked messages missing: 0
12:16:54: Non-acked messages received: 0
12:16:54: Out-of-order: 0
12:16:54: Duplicates: 0
12:16:54: ----------------------------------------------------
12:19:10: 
12:19:10: Test Run #4 on topic iso-bookie_4  ------------
12:21:16: Results --------------------------------------------
12:21:16: Final send count: 2000000
12:21:16: Final ack count: 2000000
12:21:16: Final positive ack count: 2000000
12:21:16: Final negative ack count: 0
12:21:16: Messages received: 2000000
12:21:16: Acked messages missing: 0
12:21:16: Non-acked messages received: 0
12:21:16: Out-of-order: 0
12:21:16: Duplicates: 0
12:21:16: ----------------------------------------------------
12:23:36: 
12:23:36: Test Run #5 on topic iso-bookie_5  ------------
12:25:39: Results --------------------------------------------
12:25:39: Final send count: 2000000
12:25:39: Final ack count: 2000000
12:25:39: Final positive ack count: 2000000
12:25:39: Final negative ack count: 0
12:25:39: Messages received: 2000000
12:25:39: Acked messages missing: 0
12:25:39: Non-acked messages received: 0
12:25:39: Out-of-order: 0
12:25:39: Duplicates: 0
12:25:39: ----------------------------------------------------

没有出现消息丢失、重复和乱序。

测试场景 6 - 关闭多个 Bookie

参数配置 3-3-1 3-5。

除了冗余数为 3,关闭 2 个 bookie 外,测试场景 6 与测试场景 2 相同。使用 Pulsar 时,bookie 故障次数不超过 Qw-1,就不会出现数据丢失。

可以使用 kill-bookies[n] 命令关闭当前 ensemble 中任意数量的 bookie。

单次运行的完整输出如下。

$ python pulsar-test.py kill-bookies[2] kill-m-bookies 1 2000000 50000 3-3-1 3-5 false
Creating blockade cluster
Running test with config: E 3 Qw 3 Qa 1 with nodes bookie1 bookie2 bookie3 bookie4 bookie5 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 3-3-1
Test Run #1 on topic kill-m-bookies_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-20 22:41:21.932 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.176.11:6650
2018-10-20 22:41:21.932 INFO  ClientConnection:285 | [192.168.176.1:47178 -> 192.168.176.11:6650] Connected to broker
2018-10-20 22:41:22.599 INFO  BatchMessageContainer:41 | { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1] [producerName_ = ] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 0] [averageBatchSize = 0]} BatchMessageContainer constructed
2018-10-20 22:41:22.599 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, ] Getting connection from pool
2018-10-20 22:41:23.140 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 22:41:23.140 INFO  ClientConnection:287 | [192.168.176.1:47182 -> 192.168.176.11:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 22:41:23.631 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, ] Created producer on broker [192.168.176.1:47182 -> 192.168.176.11:6650] 
Send count: 53933 Ack count: 50000 Pos: 50000 Neg: 0
-------------------------------------------------
Identifing first 2 bookies in ensemble
-------------------------------------------------
Send count: 105567 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 150962 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 202992 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 251576 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 304059 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 354476 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 401795 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 451880 Ack count: 450000 Pos: 450000 Neg: 0
Send count: 500607 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 550300 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 600474 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 650469 Ack count: 650000 Pos: 650000 Neg: 0
Send count: 702927 Ack count: 700000 Pos: 700000 Neg: 0
Send count: 751328 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 805205 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 853940 Ack count: 850000 Pos: 850000 Neg: 0
-------------------------------------------------
2 BOOKIES KILLED!
-------------------------------------------------
Send count: 901111 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 950963 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1000561 Ack count: 1000000 Pos: 1000000 Neg: 0
Send count: 1053648 Ack count: 1050000 Pos: 1050000 Neg: 0
Send count: 1103723 Ack count: 1100000 Pos: 1100000 Neg: 0
Send count: 1155762 Ack count: 1150000 Pos: 1150000 Neg: 0
Send count: 1202638 Ack count: 1200000 Pos: 1200000 Neg: 0
Send count: 1253326 Ack count: 1250000 Pos: 1250000 Neg: 0
Send count: 1300821 Ack count: 1300000 Pos: 1300000 Neg: 0
Send count: 1355902 Ack count: 1350000 Pos: 1350000 Neg: 0
Send count: 1409752 Ack count: 1400000 Pos: 1400000 Neg: 0
Send count: 1451236 Ack count: 1450000 Pos: 1450000 Neg: 0
Send count: 1500961 Ack count: 1500000 Pos: 1500000 Neg: 0
Send count: 1554802 Ack count: 1550000 Pos: 1550000 Neg: 0
Send count: 1600702 Ack count: 1600000 Pos: 1600000 Neg: 0
Send count: 1652755 Ack count: 1650000 Pos: 1650000 Neg: 0
Send count: 1705403 Ack count: 1700000 Pos: 1700000 Neg: 0
Send count: 1757240 Ack count: 1750000 Pos: 1750000 Neg: 0
Send count: 1804954 Ack count: 1800000 Pos: 1800000 Neg: 0
Send count: 1850631 Ack count: 1850000 Pos: 1850000 Neg: 0
Send count: 1907237 Ack count: 1900000 Pos: 1900000 Neg: 0
Send count: 1951117 Ack count: 1950000 Pos: 1950000 Neg: 0
Send count: 2000000 Ack count: 2000000 Pos: 2000000 Neg: 0
2018-10-20 22:42:20.572 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, cluster-1-0-0] Closed producer
-------------------------------------------------
READ PHASE
-------------------------------------------------
2018-10-20 22:42:30.708 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.176.11:6650
2018-10-20 22:42:30.710 INFO  ClientConnection:285 | [192.168.176.1:47374 -> 192.168.176.11:6650] Connected to broker
2018-10-20 22:42:30.715 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, reader-b23b69, 0] Getting connection from pool
2018-10-20 22:42:31.028 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 22:42:31.028 INFO  ClientConnection:287 | [192.168.176.1:47378 -> 192.168.176.11:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 22:42:31.110 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, reader-b23b69, 0] Created consumer on broker [192.168.176.1:47378 -> 192.168.176.11:6650] 
Last confirmed entry: [0, 2546]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 62]
Received: 100000 Curr Entry: [0, 135]
Received: 150000 Curr Entry: [0, 203]
Received: 200000 Curr Entry: [0, 276]
Received: 250000 Curr Entry: [0, 341]
Received: 300000 Curr Entry: [0, 409]
Received: 350000 Curr Entry: [0, 470]
Received: 400000 Curr Entry: [0, 532]
Received: 450000 Curr Entry: [0, 599]
Received: 500000 Curr Entry: [0, 661]
Received: 550000 Curr Entry: [0, 724]
Received: 600000 Curr Entry: [0, 799]
Received: 650000 Curr Entry: [0, 866]
Received: 700000 Curr Entry: [0, 943]
Received: 750000 Curr Entry: [0, 1005]
Received: 800000 Curr Entry: [0, 1073]
Received: 850000 Curr Entry: [0, 1133]
Received: 900000 Curr Entry: [0, 1205]
Received: 950000 Curr Entry: [0, 1268]
Received: 1000000 Curr Entry: [0, 1340]
Received: 1050000 Curr Entry: [0, 1402]
Received: 1100000 Curr Entry: [0, 1460]
Received: 1150000 Curr Entry: [0, 1523]
Received: 1200000 Curr Entry: [0, 1588]
Received: 1250000 Curr Entry: [0, 1647]
Received: 1300000 Curr Entry: [0, 1705]
Received: 1350000 Curr Entry: [0, 1764]
Received: 1400000 Curr Entry: [0, 1825]
Received: 1450000 Curr Entry: [0, 1884]
Received: 1500000 Curr Entry: [0, 1941]
Received: 1550000 Curr Entry: [0, 1998]
Received: 1600000 Curr Entry: [0, 2064]
Received: 1650000 Curr Entry: [0, 2122]
Received: 1700000 Curr Entry: [0, 2184]
Received: 1750000 Curr Entry: [0, 2241]
Received: 1800000 Curr Entry: [0, 2295]
Received: 1850000 Curr Entry: [0, 2364]
Received: 1900000 Curr Entry: [0, 2425]
Received: 1950000 Curr Entry: [0, 2481]
Received: 2000000 Curr Entry: [0, 2546]
Read phase complete with message (0,2546,561,-1)
2018-10-20 22:44:15.596 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, reader-b23b69, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 2000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0
----------------------------------------------------

5 次测试的运行结果如下。

$ cat kill-m-bookies_output.txt
22:59:01 Start test
23:01:37: 
23:01:37: Test Run #1 on topic kill-m-bookies_1  ------------
23:04:49: Results --------------------------------------------
23:04:49: Final send count: 2000000
23:04:49: Final ack count: 2000000
23:04:49: Final positive ack count: 2000000
23:04:49: Final negative ack count: 0
23:04:49: Messages received: 2000000
23:04:49: Acked messages missing: 0
23:04:49: Non-acked messages received: 0
23:04:49: Out-of-order: 0
23:04:49: Duplicates: 0
23:04:49: ----------------------------------------------------
23:07:26: 
23:07:26: Test Run #2 on topic kill-m-bookies_2  ------------
23:09:23: Results --------------------------------------------
23:09:23: Final send count: 2000000
23:09:23: Final ack count: 2000000
23:09:23: Final positive ack count: 2000000
23:09:23: Final negative ack count: 0
23:09:23: Messages received: 2005107
23:09:23: Acked messages missing: 0
23:09:23: Non-acked messages received: 0
23:09:23: Out-of-order: 0
23:09:23: Duplicates: 5107
23:09:23: ----------------------------------------------------
23:11:56: 
23:11:56: Test Run #3 on topic kill-m-bookies_3  ------------
23:14:48: Results --------------------------------------------
23:14:48: Final send count: 2000000
23:14:48: Final ack count: 2000000
23:14:48: Final positive ack count: 2000000
23:14:48: Final negative ack count: 0
23:14:48: Messages received: 2000000
23:14:48: Acked messages missing: 0
23:14:48: Non-acked messages received: 0
23:14:48: Out-of-order: 0
23:14:48: Duplicates: 0
23:14:48: ----------------------------------------------------
23:17:25: 
23:17:25: Test Run #4 on topic kill-m-bookies_4  ------------
23:20:24: Results --------------------------------------------
23:20:24: Final send count: 2000000
23:20:24: Final ack count: 2000000
23:20:24: Final positive ack count: 2000000
23:20:24: Final negative ack count: 0
23:20:24: Messages received: 2000000
23:20:24: Acked messages missing: 0
23:20:24: Non-acked messages received: 0
23:20:24: Out-of-order: 0
23:20:24: Duplicates: 0
23:20:24: ----------------------------------------------------
23:23:05: 
23:23:05: Test Run #5 on topic kill-m-bookies_5  ------------
23:25:03: Results --------------------------------------------
23:25:03: Final send count: 2000000
23:25:03: Final ack count: 2000000
23:25:03: Final positive ack count: 2000000
23:25:03: Final negative ack count: 0
23:25:03: Messages received: 2000000
23:25:03: Acked messages missing: 0
23:25:03: Non-acked messages received: 0
23:25:03: Out-of-order: 0
23:25:03: Duplicates: 0
23:25:03: ----------------------------------------------------

在一次测试中出现了 5000 多条重复消息。通过检查运行的完整输出,我发现了和 producer id 相关的连接错误。

Send count: 557331 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 601885 Ack count: 600000 Pos: 600000 Neg: 0
2018-10-20 23:07:49.283 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_2, cluster-1-2-0] Schedule reconnection in 0.1 s
2018-10-20 23:07:49.319 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 648684
2018-10-20 23:07:49.319 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 649684
2018-10-20 23:07:49.320 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 650684
2018-10-20 23:07:49.320 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 651684
2018-10-20 23:07:49.321 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 652684
2018-10-20 23:07:49.321 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 653047

具体原因还不能确定,可能与 bookie 故障相关。无论是否和 bookie 故障相关,启用消息去重都可以有效解决类似的消息重复问题。

没有出现消息丢失或乱序。

如果关闭 Qw 个 bookie,会出现怎样的结果呢?虽然这样的测试设定会损坏所有消息副本,我们还是来实际测试一下。

测试场景 7 - 关闭 Qw 个 Bookie

参数配置 2-2-1 3-5。

本次测试中,我们关闭了组成当前 ledger 中 ensemble 的两个 bookie。

单次运行的完整输出如下。

$ python pulsar-test.py kill-bookies[2] kill-qw-bookies 1 2000000 50000 2-2-1 3-5 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 bookie4 bookie5 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic kill-qw-bookies_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-20 23:46:24.587 INFO  ConnectionPool:63 | Created connection for pulsar://172.20.0.11:6650
2018-10-20 23:46:24.588 INFO  ClientConnection:285 | [172.20.0.1:40032 -> 172.20.0.11:6650] Connected to broker
2018-10-20 23:46:25.620 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 23:46:25.620 INFO  ClientConnection:287 | [172.20.0.1:40036 -> 172.20.0.11:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 23:46:26.098 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/kill-qw-bookies_1, ] Created producer on broker [172.20.0.1:40036 -> 172.20.0.11:6650] 
Send count: 56001 Ack count: 50000 Pos: 50000 Neg: 0
-------------------------------------------------
Identifing first 2 bookies in ensemble
-------------------------------------------------
Send count: 101422 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 150870 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 204342 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 252210 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 304805 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 357891 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 400852 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 450256 Ack count: 450000 Pos: 450000 Neg: 0
Send count: 502910 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 551386 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 604204 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 653056 Ack count: 650000 Pos: 650000 Neg: 0
Send count: 709835 Ack count: 700000 Pos: 700000 Neg: 0
-------------------------------------------------
2 BOOKIES KILLED!
-------------------------------------------------
Send count: 753606 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 803318 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 853127 Ack count: 850000 Pos: 850000 Neg: 0
Send count: 905068 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 951923 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1004825 Ack count: 1000000 Pos: 1000000 Neg: 0
Send count: 1051416 Ack count: 1050000 Pos: 1050000 Neg: 0
Send count: 1102301 Ack count: 1100000 Pos: 1100000 Neg: 0
Send count: 1151166 Ack count: 1150000 Pos: 1150000 Neg: 0
Send count: 1205605 Ack count: 1200000 Pos: 1200000 Neg: 0
Send count: 1251999 Ack count: 1250000 Pos: 1250000 Neg: 0
Send count: 1306006 Ack count: 1300000 Pos: 1300000 Neg: 0
Send count: 1357391 Ack count: 1350000 Pos: 1350000 Neg: 0
Send count: 1406744 Ack count: 1400000 Pos: 1400000 Neg: 0
Send count: 1458207 Ack count: 1450000 Pos: 1450000 Neg: 0
Send count: 1509467 Ack count: 1500000 Pos: 1500000 Neg: 0
Send count: 1558339 Ack count: 1550000 Pos: 1550000 Neg: 0
Send count: 1603861 Ack count: 1600000 Pos: 1600000 Neg: 0
Send count: 1654170 Ack count: 1650000 Pos: 1650000 Neg: 0
Send count: 1702673 Ack count: 1700000 Pos: 1700000 Neg: 0
Send count: 1751781 Ack count: 1750000 Pos: 1750000 Neg: 0
Send count: 1803040 Ack count: 1800000 Pos: 1800000 Neg: 0
Send count: 1850616 Ack count: 1850000 Pos: 1850000 Neg: 0
Send count: 1906137 Ack count: 1900000 Pos: 1900000 Neg: 0
Send count: 1955752 Ack count: 1950000 Pos: 1950000 Neg: 0
Send count: 2000000 Ack count: 2000000 Pos: 2000000 Neg: 0
2018-10-20 23:47:28.728 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/kill-qw-bookies_1, cluster-1-0-0] Closed producer
-------------------------------------------------
READ PHASE
-------------------------------------------------
2018-10-20 23:47:38.923 INFO  ConnectionPool:63 | Created connection for pulsar://172.20.0.11:6650
2018-10-20 23:47:38.924 INFO  ClientConnection:285 | [172.20.0.1:40216 -> 172.20.0.11:6650] Connected to broker
2018-10-20 23:47:38.931 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-qw-bookies_1, reader-1b446f, 0] Getting connection from pool
2018-10-20 23:47:39.176 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 23:47:39.176 INFO  ClientConnection:287 | [172.20.0.1:40220 -> 172.20.0.11:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 23:47:39.276 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/kill-qw-bookies_1, reader-1b446f, 0] Created consumer on broker [172.20.0.1:40220 -> 172.20.0.11:6650] 
Last confirmed entry: [0, 2506]

可以继续发送消息。同时关闭两个 bookie 后,broker 将会通过新的 ensemble 创建新分片。原本有 5 个 bookie,关闭 2 个后,剩下的 3 个 bookie 将会和 E、2 个 Qw 一起负责新分片上的操作。

但是,由于 reader 会尝试从没有数据的分片读取数据,reader 会阻塞在 read_next() 上(除非重启已关闭的 bookie,且数据完整)。

我们可以在 ZooKeeper 中检查复制不足的 ledger。下面的 bash 脚本可以检测这些 ledger 的路径。

$ bash monitor-underrep.sh 
[0000]

也可以查看第一个 ledger 的详细信息。

$ bash show-ledger.sh 00/0000/L0000
...(omitted a bunch of uniteresting output)
quorumSize: 2
ensembleSize: 2
length: 0
lastEntryId: -1
state: OPEN
segment {
  ensembleMember: "172.20.0.3:3181"
  ensembleMember: "172.20.0.5:3181"
  firstEntryId: 0
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.4:3181"
  firstEntryId: 949
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.6:3181"
  firstEntryId: 959
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.6:3181"
  firstEntryId: 1340
}
digestType: CRC32C
password: ""
ackQuorumSize: 1

Reader 阻塞的第一个分片上有两个 IP 地址:172.20.0.3 和 172.20.0.5,分别对应于关闭的两个 bookie。

由于第一个分片上的数据已丢失,因此无法恢复 ledger。如果已关闭的两个 bookie 上的数据已永久丢失,我们只能跳转至 ID 为 949 的 entry 继续读取数据。

但是如果可以找回关闭的两个 bookie 中任意一个 bookie 上的数据,就可以通过 AutoRecovery 完全复制分片且数据可用。我们来模拟一下,在集群目录中使用 blockade start bookie1 命令即可启动 bookie1。

我们注意到,reader 复活,消费 65 万条消息后发生阻塞,直到达到我们设置的 60 秒超时,测试结束。

Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 66]
Received: 100000 Curr Entry: [0, 158]
Received: 150000 Curr Entry: [0, 246]
Received: 200000 Curr Entry: [0, 312]
Received: 250000 Curr Entry: [0, 382]
Received: 300000 Curr Entry: [0, 444]
Received: 350000 Curr Entry: [0, 501]
Received: 400000 Curr Entry: [0, 566]
Received: 450000 Curr Entry: [0, 635]
Received: 500000 Curr Entry: [0, 698]
Received: 550000 Curr Entry: [0, 761]
Received: 600000 Curr Entry: [0, 826]
Received: 650000 Curr Entry: [0, 895]
Read phase complete with message (0,946,478,-1)
2018-10-21 00:04:15.760 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/kill-qw-bookies_1, reader-1b446f, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 691566
Acked messages missing: 1308433
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0
----------------------------------------------------

Reader 读取的最后一条 entry ID 为 946,继续数 3 条就到第二个分片了。为什么 reader 在第一个分片的结尾处发生了阻塞?

当 Qa=1 时,同时关闭 ensemble 中的两个 bookie,最后 3 个 entry 持久化到 1 个 bookie 上。而这个 bookie 仍然处于关闭状态,因此最后 3 个 entry 不可读。

我们使用 reader.py 从 topic 读取数据。在第 65 万条消息处出现了阻塞。现在启用 bookie3:blockade start bookie3

由于对读操作应用了指数退避算法,在读取失败并恢复读取时,指数退避算法会像 TCP 一样缓慢启动,逐渐增加从 bookie 读取的批大小。reader.py 在消费到阻塞处时,跳过这一条消息后消费速度有所下降,然后迅速消费了 200 万条消息。

在 ZooKeeper 中复制不足的路径下,没再出现 ledger 0000。理论上,在 ledger 完成复制后可以再次关闭 bookie3,因为 bookie1 中有第一个分片中的所有 entry,所以应该立即启用 AutoRecovery 并重新复制 ledger 到第二个 bookie。

$ blockade kill bookie3
$ bash monitor-underrep.sh 
[]
[]
[]
[]
[]
[]
[]
[]
[]
[0000]
[0000]
[0000]
[0000]
[]
[]

[] 表示不存在 ledger 复制不足。查看此路径需要 2 秒钟(运行 Docker、zookeeper-shell 等),确认 ledger 0000 复制不足一共需要 16 秒,但复制操作只需 8 秒钟。

再次检查 ledger,第一个分片上的 ensemble 应该有所变化:

$bash show-ledger.sh 00/0000/L0000
...
quorumSize: 2
ensembleSize: 2
length: 0
lastEntryId: -1
state: OPEN
segment {
  ensembleMember: "172.20.0.6:3181"
  ensembleMember: "172.20.0.4:3181"
  firstEntryId: 0
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.4:3181"
  firstEntryId: 949
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.6:3181"
  firstEntryId: 959
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.6:3181"
  firstEntryId: 1340
}
digestType: CRC32C
password: ""
ackQuorumSize: 1

Bookie3 (172.20.0.5)不再属于第一个分片,AutoRecovery 也完成了相应的复制。

通过以上测试,我们发现:

如果不想丢消息,最多可以关闭 Qw-1 个 bookie。如果关闭 Qw 个 bookie,要保证不出现数据丢失,则需要恢复这些 bookie 中的所有数据。

如果关闭 bookie 的数量等于 Qa 的数量,AutoRecovery 不能修复 ledger 复制不足的问题。当 Qa=1,关闭 1 个 bookie 时,AutoRecovery 不能修复 ledger。

这种情况下,BookKeeper 不能恢复 ledger ,Ivan Kelly 提出,“关闭 bookie 的操作有问题。关闭操作需要确认分片的最后一个 entry。当 Qa=1 时,如果 write quorum 中有 1 个 bookie 出现故障,则无法确认分片的最后一个 entry,因为该 bookie 可能已经 ack 了写操作(严格来说,可能会 ack 写操作)。”

因此 Qa=1 的配置非常冒险,因为一旦有 1 个 bookie 出现故障,就无法修复 ledger 复制不足的问题。

测试总结

  • 确认丢失消息总数:0
  • 乱序消息总数:0
  • 所有测试中均未出现消息丢失和消息乱序。
  • 在关闭 Qw 个 bookie 的测试中,同时恢复 2 个 bookie 不会出现数据丢失。但这一场景不在上述总结范围内。

结语

通过测试可以发现,关闭 broker 或将 broker 与 ZooKeeper 隔离都不会造成消息丢失,但会读取未 ack 的消息(很常见,不可避免),如果未启用消息去重,还会出现消息重复。即使在故障转移场景中,也可以使用消息去重。下次我会进一步测试消息去重。

Qw 大于等于 2 时,关闭 1 个 bookie 或将 bookie 与 ZooKeeper 隔离也不会造成消息丢失。

关闭 ensemble 中的全部 bookie 会阻塞 reader 直到恢复 bookie。如果不能成功恢复 bookie,则会丢失所有数据。丢失全部数据副本是我们最不想看到的情况。

最小值:

  • Write Quorum(Qw)。1 个 bookie 宕机时,要确保数据不丢失,Qw 必须大于等于 2。
  • Ack Quorum(Qa)。1 个 bookie 宕机时,要通过 AutoRecovery 重新复制 ledger,Qa 必须大于等于 2。
  • 至少需要默认值 E 个 bookie 才可以创建 reader 和 consumer,因为需要使用这个值创建游标。

总之,Apache Pulsar 功能强大,在不同配置情况下也不容易出现数据丢失、重复、乱序的问题。在后续测试中,仍有一些需要注意的地方,如在 broker 故障转移后,如果正在运行的消息数量接近内部等待消息队列大小,producer 的速度会迅速下降。但如果仅考虑是否出现消息丢失和乱序,Apache Pulsar 完美无缺。

可以进一步测试的方向有以下几个:

  • 消息去重和 broker 故障转移
  • 使用速度较慢的 producer 运行较长时间(几个小时),运行中每 5 分钟随机关闭 1 个节点(broker 或 bookie)。运行结束后,检查是否出现消息丢失和乱序的问题。
  • 发送大量消息。在几个小时之内,每 5-10 分钟随机关闭 1 个 bookie。运行结束后,检查是否通过 AutoRecovery 保证复制了全部消息,没有出现消息丢失。
  • 分区 topic - 所有类型的测试
  • 在网络不稳定或网速较慢的条件下进行测试

相关阅读

译文 | Apache Pulsar 集群如何确保消息不丢_第2张图片

点击链接 ,为 Apache Pulsar 点赞!

你可能感兴趣的