Tools: Latest: Fast Fragment Scanning for Tables in GBase 8a

Tools: Latest: Fast Fragment Scanning for Tables in GBase 8a

Why Fast Fragment Detection Matters

How It Works

Deploying and Running the Script

1. Installation

2. Configuration

3. Execution

Full Script Reference

Why This Works Disk fragmentation inside a database can silently eat up storage and degrade I/O performance. In GBase 8a, the China-domestically developed MPP cluster database from GBASE, frequent inserts and deletes leave "holes" in table files over time. Spotting these fragmented tables quickly is a common operational challenge—traditional methods that scan metadata tables are slow and don't scale well across large clusters. This article introduces a lightweight, shell-based approach that uses the Linux filefrag utility to scan for fragmented tables directly from the filesystem layer, significantly reducing detection time in a gbase database environment. GBase 8a's data files can become heavily fragmented under write-intensive workloads. Fragmentation not only wastes disk space but also adds random I/O overhead during scans. Conventional approaches that query internal metadata tables struggle with performance at scale. By contrast, filefrag reads filesystem-level extent maps, making it orders of magnitude faster for bulk scanning. The script iterates through all user tables (obtained from gbase.table_distribution), locates the data directory for a specific shard suffix (e.g., n1), and runs filefrag -v against each file. It extracts the number of extents (contiguous disk regions) per file and calculates the average. If a table's average extent count exceeds the threshold (default: 2), the table is flagged as fragmented. Place the sweep.sh script under /home/gbase/sweep on a composite node. Adjust the variables at the top of the script: After the scan finishes, the /home/gbase/sweep/log/ directory will contain one file per flagged table, named database_name.table_name. These files serve as a direct to-do list for defragmentation operations. The key insight is offloading the work to the filesystem layer. GBase 8a's gbase.table_distribution system table gives the authoritative list of table names, and filefrag handles the low-level extent counting without touching the gbase database engine. The script then parallelizes the scanning using a named-pipe-based concurrency model, making it practical to run even across hundreds of tables. GBASE continues to provide robust tools for managing GBase 8a clusters, and combining OS-level utilities with database metadata is a pattern that can be extended to other operational tasks. Templates let you quickly answer FAQs or store snippets for re-use. Are you sure you want to ? It will become hidden in your post, but will still be visible via the comment's permalink. Hide child comments as well For further actions, you may consider blocking this person and/or reporting abuse

Code Block

Copy

gbase.table_distribution filefrag -v /home/gbase/sweep threads_num sh sweep.sh sh sweep.sh sh sweep.sh /home/gbase/sweep/log/ database_name.table_name #!/bin/bash # author: Bryan zhao # date: 2025-07-29 user="gbase" passwd="******" threads_num=2 # Concurrency suffix="n1" # Typically scan one shard only avg_sum=2 # Mark table if avg extents exceed this > tb.list if [ ! -d "`pwd`/log" ]; then mkdir -p "`pwd`/log" else rm -rf "`pwd`/log" mkdir -p "`pwd`/log" fi cli="${GCLUSTER_HOME}/bin/gbase -u${user} -p${passwd} -N" $cli -e "SELECT dbname,tbname FROM gbase.table_distribution where dbname not in ('information_schema','performance_schema','gbase','gctmpdb','gclusterdb')" > tb.list child() { local dbname=$1 tbname=$2 suffix=$3 avg_sum=$4 local mulu="${GBASE_BASE}/userdata/gbase/${dbname}/sys_tablespace/${tbname}_${suffix}" [ ! -d "$mulu" ] && { echo "Error: Directory $mulu not found"; return 1; } local avg_extents=$(find "$mulu" -type f -exec filefrag -v {} \; 2>/dev/null | grep "extent" | awk '{sum+=$2} END {print (NR>0)? sum/NR : 0}') if (( $(echo "$avg_extents > $avg_sum" | bc -l) )); then touch "`pwd`/log/${dbname}.${tbname}" fi } tmp_fifofile="/tmp/$$_`date +%N`.fifo" mkfifo $tmp_fifofile exec 6<>$tmp_fifofile rm -rf $tmp_fifofile for i in `seq 1 ${threads_num}` do echo >&6 done while read -r db table do read -u6 { child ${db} ${table} ${suffix} ${avg_sum} echo >&6 }& done < tb.list wait exec 6>&- #!/bin/bash # author: Bryan zhao # date: 2025-07-29 user="gbase" passwd="******" threads_num=2 # Concurrency suffix="n1" # Typically scan one shard only avg_sum=2 # Mark table if avg extents exceed this > tb.list if [ ! -d "`pwd`/log" ]; then mkdir -p "`pwd`/log" else rm -rf "`pwd`/log" mkdir -p "`pwd`/log" fi cli="${GCLUSTER_HOME}/bin/gbase -u${user} -p${passwd} -N" $cli -e "SELECT dbname,tbname FROM gbase.table_distribution where dbname not in ('information_schema','performance_schema','gbase','gctmpdb','gclusterdb')" > tb.list child() { local dbname=$1 tbname=$2 suffix=$3 avg_sum=$4 local mulu="${GBASE_BASE}/userdata/gbase/${dbname}/sys_tablespace/${tbname}_${suffix}" [ ! -d "$mulu" ] && { echo "Error: Directory $mulu not found"; return 1; } local avg_extents=$(find "$mulu" -type f -exec filefrag -v {} \; 2>/dev/null | grep "extent" | awk '{sum+=$2} END {print (NR>0)? sum/NR : 0}') if (( $(echo "$avg_extents > $avg_sum" | bc -l) )); then touch "`pwd`/log/${dbname}.${tbname}" fi } tmp_fifofile="/tmp/$$_`date +%N`.fifo" mkfifo $tmp_fifofile exec 6<>$tmp_fifofile rm -rf $tmp_fifofile for i in `seq 1 ${threads_num}` do echo >&6 done while read -r db table do read -u6 { child ${db} ${table} ${suffix} ${avg_sum} echo >&6 }& done < tb.list wait exec 6>&- #!/bin/bash # author: Bryan zhao # date: 2025-07-29 user="gbase" passwd="******" threads_num=2 # Concurrency suffix="n1" # Typically scan one shard only avg_sum=2 # Mark table if avg extents exceed this > tb.list if [ ! -d "`pwd`/log" ]; then mkdir -p "`pwd`/log" else rm -rf "`pwd`/log" mkdir -p "`pwd`/log" fi cli="${GCLUSTER_HOME}/bin/gbase -u${user} -p${passwd} -N" $cli -e "SELECT dbname,tbname FROM gbase.table_distribution where dbname not in ('information_schema','performance_schema','gbase','gctmpdb','gclusterdb')" > tb.list child() { local dbname=$1 tbname=$2 suffix=$3 avg_sum=$4 local mulu="${GBASE_BASE}/userdata/gbase/${dbname}/sys_tablespace/${tbname}_${suffix}" [ ! -d "$mulu" ] && { echo "Error: Directory $mulu not found"; return 1; } local avg_extents=$(find "$mulu" -type f -exec filefrag -v {} \; 2>/dev/null | grep "extent" | awk '{sum+=$2} END {print (NR>0)? sum/NR : 0}') if (( $(echo "$avg_extents > $avg_sum" | bc -l) )); then touch "`pwd`/log/${dbname}.${tbname}" fi } tmp_fifofile="/tmp/$$_`date +%N`.fifo" mkfifo $tmp_fifofile exec 6<>$tmp_fifofile rm -rf $tmp_fifofile for i in `seq 1 ${threads_num}` do echo >&6 done while read -r db table do read -u6 { child ${db} ${table} ${suffix} ${avg_sum} echo >&6 }& done < tb.list wait exec 6>&- gbase.table_distribution - user — database user (typically gbase) - passwd — corresponding password - threads_num — number of parallel scanning threads (default: 2) - suffix — shard suffix to scan (usually n1) - avg_sum — average extent threshold to trigger a flag (default: 2)