@Ethan-a2 Ethan-a2 commented Dec 7, 2025

Adding a CPU-side visual trace for the Hexagon backend.

(attached image: hex-itrace)

Known issue: when the run log grows too large it overflows, which prevents the core JSON file for the trace from being generated; the impact is minor.
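
As a possible mitigation (not part of this change), the log-to-JSON step could stream the run log line by line instead of reading it in one piece. A minimal sketch, assuming a newline-delimited log and a hypothetical `<name> <start_us> <end_us>` line format; the real parsing would have to match the trace format this PR emits:

```python
# Sketch only: stream a large run log line by line so the core trace JSON
# can still be written when the log grows big. The line format below is a
# hypothetical placeholder, not the actual itrace format.
import json

def parse_trace_line(line: str):
    # Assumed format: "<op_name> <start_us> <end_us>"
    parts = line.split()
    if len(parts) != 3:
        return None
    name, start, end = parts
    try:
        ts, end_us = float(start), float(end)
    except ValueError:
        return None
    # Chrome trace "complete" event: ph="X" with ts/dur in microseconds
    return {"name": name, "ph": "X", "ts": ts, "dur": end_us - ts, "pid": 0, "tid": 0}

def convert_log_to_json(log_path: str, json_path: str) -> None:
    events = []
    with open(log_path, "r", errors="replace") as log:
        for line in log:  # read line by line instead of slurping the whole log
            event = parse_trace_line(line)
            if event is not None:
                events.append(event)
    with open(json_path, "w") as out:
        json.dump({"traceEvents": events}, out)
```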

@Ethan-a2 Ethan-a2 (Author) commented Dec 7, 2025

TRACE=1 M=LFM2-1.2B-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-cli.sh -no-cnv -p "1+1=?"

  • adb shell cd /data/local/tmp/llama.cpp; ulimit -c unlimited; LD_LIBRARY_PATH=/data/local/tmp/llama.cpp/./lib ADSP_LIBRARY_PATH=/data/local/tmp/llama.cpp/./lib GGML_HEXAGON_TRACE=1 ././bin/llama-cli --no-mmap -m /data/local/tmp/llama.cpp/../gguf/LFM2-1.2B-Q4_0.gguf --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ctx-size 8192 --batch-size 128 -ctk q8_0 -ctv q8_0 -fa on -ngl 99 --device HTP0 -no-cnv -p "1+1=?"
    ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.13
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: device max workgroup size: 1024
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels.............................................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
register_backend: registered backend OpenCL (1 devices)
register_device: registered device GPUOpenCL (QUALCOMM Adreno(TM) 830)
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb40000738d7ef610
register_backend: registered backend HTP (1 devices)
register_device: registered device HTP0 (Hexagon)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
build: 7199 (dde95f4) with Android (13624864, +pgo, +bolt, +lto, +mlgo, based on r530567e) clang version 19.0.1 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device HTP0 (Hexagon) (unknown id) - 2048 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 148 tensors from /data/local/tmp/llama.cpp/../gguf/LFM2-1.2B-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = lfm2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = LFM2 1.2B
llama_model_loader: - kv 3: general.basename str = LFM2
llama_model_loader: - kv 4: general.size_label str = 1.2B
llama_model_loader: - kv 5: general.license str = other
llama_model_loader: - kv 6: general.license.name str = lfm1.0
llama_model_loader: - kv 7: general.license.link str = LICENSE
llama_model_loader: - kv 8: general.tags arr[str,4] = ["liquid", "lfm2", "edge", "text-gene...
llama_model_loader: - kv 9: general.languages arr[str,8] = ["en", "ar", "zh", "fr", "de", "ja", ...
llama_model_loader: - kv 10: lfm2.block_count u32 = 16
llama_model_loader: - kv 11: lfm2.context_length u32 = 128000
llama_model_loader: - kv 12: lfm2.embedding_length u32 = 2048
llama_model_loader: - kv 13: lfm2.feed_forward_length u32 = 8192
llama_model_loader: - kv 14: lfm2.attention.head_count u32 = 32
llama_model_loader: - kv 15: lfm2.attention.head_count_kv arr[i32,16] = [0, 0, 8, 0, 0, 8, 0, 0, 8, 0, 8, 0, ...
llama_model_loader: - kv 16: lfm2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: lfm2.vocab_size u32 = 65536
llama_model_loader: - kv 18: lfm2.shortconv.l_cache u32 = 3
llama_model_loader: - kv 19: lfm2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = lfm2
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,65536] = ["<|pad|>", "<|startoftext|>", "<|end...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,65536] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,63683] = ["Ċ Ċ", "Ċ ĊĊ", "ĊĊ Ċ", "Ċ �..
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 7
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 30: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 31: tokenizer.chat_template str = {{bos_token}}{% for message in messag...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: general.file_type u32 = 2
llama_model_loader: - type f32: 55 tensors
llama_model_loader: - type q4_0: 92 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 661.25 MiB (4.74 BPW)
load: printing all EOG tokens:
load: - 2 ('<|endoftext|>')
load: - 7 ('<|im_end|>')
load: special tokens cache size = 507
load: token to piece cache size = 0.3756 MB
print_info: arch = lfm2
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 16
print_info: n_head = 32
print_info: n_head_kv = [0, 0, 8, 0, 0, 8, 0, 0, 8, 0, 8, 0, 8, 0, 8, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 4, 0, 0, 4, 0, 0, 4, 0, 4, 0, 4, 0, 4, 0]
print_info: n_embd_k_gqa = [0, 0, 512, 0, 0, 512, 0, 0, 512, 0, 512, 0, 512, 0, 512, 0]
print_info: n_embd_v_gqa = [0, 0, 512, 0, 0, 512, 0, 0, 512, 0, 512, 0, 512, 0, 512, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 8192
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: model type = 1.2B
print_info: model params = 1.17 B
print_info: general.name = LFM2 1.2B
print_info: vocab type = BPE
print_info: n_vocab = 65536
print_info: n_merges = 63683
print_info: BOS token = 1 '<|startoftext|>'
print_info: EOS token = 7 '<|im_end|>'
print_info: EOT token = 2 '<|endoftext|>'
print_info: PAD token = 0 '<|pad|>'
print_info: LF token = 708 'Ċ'
print_info: EOG token = 2 '<|endoftext|>'
print_info: EOG token = 7 '<|im_end|>'
print_info: max token length = 30
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 16 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU model buffer size = 105.23 MiB
load_tensors: HTP0 model buffer size = 0.26 MiB
load_tensors: HTP0-REPACK model buffer size = 555.75 MiB
.....................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_seq = 8192
llama_context: n_batch = 128
llama_context: n_ubatch = 128
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.25 MiB
llama_kv_cache: HTP0 KV buffer size = 51.00 MiB
llama_kv_cache: size = 51.00 MiB ( 8192 cells, 6 layers, 1/1 seqs), K (q8_0): 25.50 MiB, V (q8_0): 25.50 MiB
llama_memory_recurrent: HTP0 RS buffer size = 0.16 MiB
llama_memory_recurrent: size = 0.16 MiB ( 1 cells, 16 layers, 1 seqs), R (f32): 0.16 MiB, S (f32): 0.00 MiB
llama_context: HTP0 compute buffer size = 14.00 MiB
llama_context: CPU compute buffer size = 32.00 MiB
llama_context: graph nodes = 549
llama_context: graph splits = 55
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 |

File rootname for itrace output: /data/local/tmp/itrace_results/itrace_output
1+1=?sampler seed: 2828806337
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 8192, n_batch = 128, n_predict = -1, n_keep = 1

+1 +1=2, but in standard arithmetic, you can't add two 1's. This is a classic example of a contradiction, often used to illustrate the importance of following rules in logic and mathematics. In this case, the correct answer is 2, but the explanation highlights the difference between two forms of addition. If we strictly adhere to standard arithmetic, it's not possible to combine two ones. However, in many logical and algebraic systems, 1+1 is defined as 2, but this is a non-standard arithmetic operation. The question itself is a play on words, emphasizing the distinction between different types of addition or the interpretation of numbers. [end of text]

common_perf_print: sampling time = 78.78 ms
common_perf_print: samplers time = 34.20 ms / 143 tokens
common_perf_print: load time = 5772.05 ms
common_perf_print: prompt eval time = 103.17 ms / 6 tokens ( 17.20 ms per token, 58.16 tokens per second)
common_perf_print: eval time = 8685.21 ms / 136 runs ( 63.86 ms per token, 15.66 tokens per second)
common_perf_print: total time = 8874.35 ms / 142 tokens
common_perf_print: unaccounted time = 7.19 ms / 0.1 % (total - sampling - prompt eval - eval) / (total)
common_perf_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 555 = 555 + 0 + 0) + 17592186043860 |
llama_memory_breakdown_print: | - Host | 202 = 105 + 51 + 46 |
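
For reference, the reported rates follow directly from the times and counts in the perf lines above; a quick check (Python, not part of the PR):

```python
# Recompute the reported throughput from the perf lines above.
prompt_ms, prompt_tokens = 103.17, 6
eval_ms, eval_runs = 8685.21, 136

print(f"prompt: {prompt_ms / prompt_tokens:.2f} ms/token, "
      f"{prompt_tokens / prompt_ms * 1000:.2f} tokens/s")  # ~17.20 ms, ~58.16 t/s
print(f"eval:   {eval_ms / eval_runs:.2f} ms/token, "
      f"{eval_runs / eval_ms * 1000:.2f} tokens/s")         # ~63.86 ms, ~15.66 t/s
```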

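The run also prints the itrace rootname (/data/local/tmp/itrace_results/itrace_output) just before generation; a minimal sketch for pulling those results off the device for local inspection (the local directory name is arbitrary, and nothing is assumed about the file format):

```python
# Sketch only: copy the itrace results directory shown in the run log
# from the device to the host. Local destination name is arbitrary.
import subprocess

REMOTE_DIR = "/data/local/tmp/itrace_results"
LOCAL_DIR = "itrace_results"

subprocess.run(["adb", "pull", REMOTE_DIR, LOCAL_DIR], check=True)
print(f"itrace output pulled into ./{LOCAL_DIR}")
```
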
@github-actions github-actions bot added the script (Script related) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 7, 2025
@Ethan-a2 Ethan-a2 changed the title Itrace debug: Adding CPU-side visual trace for hexagon Dec 7, 2025