How to Handle Recycle Bin Data in the Doris Cluster?
Scenarios Leading to Data Deletion and Recycle Bin Storage
- During data balancing, tablets from high-load disks are copied to low-load disks. The original tablets are placed in the recycle bin without physical deletion, resulting in additional junk files.
- Operations like Delete, Drop, and Truncate only remove data logically, not physically, resulting in additional junk files.
- After data files are merged, the outdated files are not erased, which adds further to junk file accumulation.
Junk File Implications
- Too many junk files consume disk space, leaving less room for valid data and potentially causing data loss.
- These files also protect against accidental deletion, because Doris can recover data from them. Keeping too few junk files weakens this safeguard.
Viewing the Data in the Recycle Bin
- Log in to the CloudTable console.
- Create a Doris cluster.
- Connect to a Doris cluster. For details, see "Using Doris" > "Connecting to a Doris Cluster" > "Accessing a Cluster Using MySQL" in the CloudTable Service User Guide.
- Run the following command to view data in the recycle bin:
  show trash;
Figure 1 Data in the recycle bin
- Restore data in the recycle bin:
  curl -X POST http://{be_host}:{be_webserver_port}/api/restore_tablet?tablet_id={tablet_id}\&schema_hash={schema_hash}
- be_host: node address.
- be_webserver_port: port number of the node.
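The restore request can be wrapped in a small shell helper so that the URL is assembled in one place. This is only a minimal sketch: the function name restore_tablet_url and every sample value (host, port, tablet ID, schema hash) are hypothetical placeholders, so substitute the values for your own cluster. The script prints the curl command rather than running it, so it can be reviewed first.

```shell
# Build the restore_tablet URL from its four parts.
# The function name and all sample values below are illustrative assumptions.
restore_tablet_url() {
  local be_host="$1" be_webserver_port="$2" tablet_id="$3" schema_hash="$4"
  printf 'http://%s:%s/api/restore_tablet?tablet_id=%s&schema_hash=%s\n' \
    "$be_host" "$be_webserver_port" "$tablet_id" "$schema_hash"
}

# Print the curl command instead of executing it.
url="$(restore_tablet_url 192.168.0.10 8040 10003 1227636860)"
echo "curl -X POST '${url}'"
```

Wrapping the URL in single quotes keeps the shell from interpreting the & character, which is why the backslash escape from the original command is not needed here.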
Setting the Recycle Bin Duration
- The principle behind the recycle bin is that deleted data is not immediately removed from the disk. It is first moved to the recycle bin. Once the designated timeout period lapses, the data is then permanently erased from the disk.
- When setting the recycle bin duration, consider the following:
- A prolonged recycle bin duration leads to junk file buildup, occupying disk space.
- If the recycle bin timeout is extensive, calling the admin clean trash command might cause data imbalances, triggering further data balancing and junk file generation.
- A short recycle bin duration means that accidentally deleted tablets may be purged before they can be recovered in exceptional cases.
You are advised to gauge the average disk space the recycle bin occupies based on actual service usage and set a timeout period that balances space utilization with a safety window against accidental deletions. Run the following command to set the timeout:
  curl -X POST http://{be_host}:{be_webserver_port}/api/update_config?trash_file_expire_time_sec={value}\&persist=true
- be_host: node address.
- be_webserver_port: port number of the node.
- trash_file_expire_time_sec: interval for clearing the recycle bin, in seconds. The default value is 259200 (72 hours). When disk space is insufficient, the retention period of files in the trash directory is not bound by this parameter.
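Because trash_file_expire_time_sec is expressed in seconds, it can help to convert a retention window from hours before building the request. The following is a rough sketch under stated assumptions: the function names hours_to_seconds and update_trash_expire_url are hypothetical, and the host and port are sample values only. As written, it prints the curl command instead of sending it.

```shell
# Convert a retention window in hours to seconds.
hours_to_seconds() {
  echo $(( $1 * 3600 ))
}

# Build the update_config URL for trash_file_expire_time_sec.
# Function names and sample values are illustrative assumptions.
update_trash_expire_url() {
  local be_host="$1" be_webserver_port="$2" seconds="$3"
  printf 'http://%s:%s/api/update_config?trash_file_expire_time_sec=%s&persist=true\n' \
    "$be_host" "$be_webserver_port" "$seconds"
}

seconds="$(hours_to_seconds 72)"   # 72 hours matches the documented default
url="$(update_trash_expire_url 192.168.0.10 8040 "$seconds")"
echo "curl -X POST '${url}'"
```

With 72 hours as input the computed value is 259200, the documented default. A shorter window such as 24 hours gives 86400.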