Coordination error sometimes happens when trying to export partition #1173

@Selfeer

Description

We intermittently hit the following error when trying to export a partition:

Code: 999. DB::Exception: Received from localhost:9000. Coordination::Exception. Coordination::Exception: Coordination error: Operation timeout, path /clickhouse/tables/shard0/source_ee53e34b_c9f6_11f0_9209_4369e6456e8f/exports/5_default.s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f. (KEEPER_EXCEPTION)
(query: ALTER TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f EXPORT PARTITION ID '5' TO TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f)

Most recently, this happened when exporting the same partitions from two different nodes of a cluster. Setup:

[clickhouse1] CREATE TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f ON CLUSTER sharded_cluster (
    p UInt8,
    i UInt64
)  ENGINE = ReplicatedMergeTree('/clickhouse/tables/shard0/source_ee53e34b_c9f6_11f0_9209_4369e6456e8f', '{replica}') ORDER BY tuple() PARTITION BY p;

[clickhouse1] CREATE TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f ON CLUSTER sharded_cluster (
    p UInt8,
    i UInt64
)  ENGINE =
        S3(
            '[masked]:Secret(name='minio_uri')/root/data/export_part/tmp_ee53e35b_c9f6_11f0_9209_4369e6456e8f/',
            '[masked]:Secret(name='minio_root_user')',
            '[masked]:Secret(name='minio_root_password')',
            filename='s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f',
            format='Parquet',
            compression='auto',
            partition_strategy='hive'
        )
     PARTITION BY p;

[clickhouse1] INSERT INTO source_ee53e34b_c9f6_11f0_9209_4369e6456e8f (p, i) SELECT 2, rand64() FROM numbers(3);

[clickhouse1] INSERT INTO source_ee53e34b_c9f6_11f0_9209_4369e6456e8f (p, i) SELECT 3, rand64() FROM numbers(3);

[clickhouse1] INSERT INTO source_ee53e34b_c9f6_11f0_9209_4369e6456e8f (p, i) SELECT 4, rand64() FROM numbers(3);

[clickhouse1] INSERT INTO source_ee53e34b_c9f6_11f0_9209_4369e6456e8f (p, i) SELECT 5, rand64() FROM numbers(3);

[clickhouse2] INSERT INTO source_ee53e34b_c9f6_11f0_9209_4369e6456e8f (p, i) SELECT 1, rand64() FROM numbers(3);

[clickhouse2] INSERT INTO source_ee53e34b_c9f6_11f0_9209_4369e6456e8f (p, i) SELECT 2, rand64() FROM numbers(3);

[clickhouse2] INSERT INTO source_ee53e34b_c9f6_11f0_9209_4369e6456e8f (p, i) SELECT 3, rand64() FROM numbers(3);

[clickhouse2] INSERT INTO source_ee53e34b_c9f6_11f0_9209_4369e6456e8f (p, i) SELECT 4, rand64() FROM numbers(3);

[clickhouse2] INSERT INTO source_ee53e34b_c9f6_11f0_9209_4369e6456e8f (p, i) SELECT 5, rand64() FROM numbers(3);

First, we export the partitions on node 1:

ALTER TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f EXPORT PARTITION ID '1' TO TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f
ALTER TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f EXPORT PARTITION ID '2' TO TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f
ALTER TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f EXPORT PARTITION ID '3' TO TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f
ALTER TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f EXPORT PARTITION ID '4' TO TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f
ALTER TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f EXPORT PARTITION ID '5' TO TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f

Then we run the same exports on node 2, this time with export_merge_tree_partition_force_export enabled, and one of the exports fails with the error:

SET allow_experimental_export_merge_tree_part = 1; 
SET export_merge_tree_partition_force_export = 1;

ALTER TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f EXPORT PARTITION ID '1' TO TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f
ALTER TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f EXPORT PARTITION ID '2' TO TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f
ALTER TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f EXPORT PARTITION ID '3' TO TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f
ALTER TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f EXPORT PARTITION ID '4' TO TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f
ALTER TABLE source_ee53e34b_c9f6_11f0_9209_4369e6456e8f EXPORT PARTITION ID '5' TO TABLE s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f

2025.11.25 13:04:56.565544 [ 4220 ] {} <Error> TCPHandler: Code: 999. Coordination::Exception: Coordination error: Operation timeout, path /clickhouse/tables/shard0/source_ee53e34b_c9f6_11f0_9209_4369e6456e8f/exports/5_default.s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f. (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000133d959f
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000c88438e
2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000c883e40
3. DB::Exception::Exception<char const*, String const&>(int, FormatStringHelperImpl<std::type_identity<char const*>::type, std::type_identity<String const&>::type>, char const*&&, String const&) @ 0x000000000ed5172b
4. Coordination::Exception::fromPath(Coordination::Error, String const&) @ 0x000000000ed50e68
5. zkutil::ZooKeeper::existsWatch(String const&, Coordination::Stat*, std::function<void (Coordination::WatchResponse const&)>) @ 0x000000001a5901da
6. zkutil::ZooKeeper::exists(String const&, Coordination::Stat*, std::shared_ptr<Poco::Event> const&) @ 0x000000001a58ca9f
7. DB::StorageReplicatedMergeTree::exportPartitionToTable(DB::PartitionCommand const&, std::shared_ptr<DB::Context const>) @ 0x0000000018d3e67b
8. DB::MergeTreeData::alterPartition(std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::vector<DB::PartitionCommand, std::allocator<DB::PartitionCommand>> const&, std::shared_ptr<DB::Context const>) @ 0x000000001935525e
9. DB::InterpreterAlterQuery::executeToTable(DB::ASTAlterQuery const&) @ 0x0000000017f43b91
10. DB::InterpreterAlterQuery::execute() @ 0x0000000017f4058d
11. DB::executeQueryImpl(char const*, char const*, std::shared_ptr<DB::Context>, DB::QueryFlags, DB::QueryProcessingStage::Enum, std::unique_ptr<DB::ReadBuffer, std::default_delete<DB::ReadBuffer>>&, std::shared_ptr<DB::IAST>&, std::shared_ptr<DB::ImplicitTransactionControlExecutor>) @ 0x000000001840a2d2
12. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, DB::QueryFlags, DB::QueryProcessingStage::Enum) @ 0x000000001840254b
13. DB::TCPHandler::runImpl() @ 0x0000000019b3818a
14. DB::TCPHandler::run() @ 0x0000000019b5a1d9
15. Poco::Net::TCPServerConnection::start() @ 0x000000001f084fc7
16. Poco::Net::TCPServerDispatcher::run() @ 0x000000001f085459
17. Poco::PooledThread::run() @ 0x000000001f04ba87
18. Poco::ThreadImpl::runnableEntry(void*) @ 0x000000001f049e81
19. ? @ 0x0000000000094ac3
20. ? @ 0x0000000000126850

I'm not sure whether export_merge_tree_partition_force_export plays any role in this issue, because we have hit the same error on regular partition exports where no additional settings were set.
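To compare runs with and without the setting, the statement list for each round can be generated rather than typed by hand. The sketch below is only an illustration. The table names and setting names are copied from this report, while the helper `export_statements` and its parameters are hypothetical and not part of ClickHouse or of the test suite. It only builds SQL strings; running them against a node is left to whatever client is in use.

```python
# Table names copied from the report; the helper itself is a hypothetical illustration.
SOURCE = "source_ee53e34b_c9f6_11f0_9209_4369e6456e8f"
DEST = "s3_ee53e35c_c9f6_11f0_9209_4369e6456e8f"


def export_statements(partition_ids, settings=None):
    """Build the SQL for one round of exports: optional SETs, then one ALTER per partition."""
    statements = [f"SET {name} = {value}" for name, value in (settings or {}).items()]
    for pid in partition_ids:
        statements.append(
            f"ALTER TABLE {SOURCE} EXPORT PARTITION ID '{pid}' TO TABLE {DEST}"
        )
    return statements


if __name__ == "__main__":
    # Round as shown for node 1: no settings listed in the report.
    plain = export_statements(range(1, 6))
    # Round as shown for node 2: both settings from the report.
    forced = export_statements(
        range(1, 6),
        {
            "allow_experimental_export_merge_tree_part": 1,
            "export_merge_tree_partition_force_export": 1,
        },
    )
    for stmt in plain + forced:
        print(stmt + ";")
```

Running the plain and the forced rounds repeatedly against the same pair of nodes should show whether the timeout correlates with the setting at all.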

Cluster structure:

        <sharded_cluster>
            <shard>
                <replica>
                    <host>clickhouse1</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <replica>
                    <host>clickhouse2</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <replica>
                    <host>clickhouse3</host>
                    <port>9000</port>
                </replica>
            </shard>
        </sharded_cluster>

Logs:

clickhouse_server.err.log
clickhouse_server.log
keeper.log
