@gkethamallax gkethamallax commented Dec 16, 2025

Log:

[gkethamallax@ws02 llama.cpp]$ ./build-posix/bin/llama-cli   -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf   -p "my cat's name is"   --device tsavorite -c 4096 --temp 0.0 --n-predict 10 --repeat-penalty 1.5   -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup   -v --verbosity 9999 --log-prefix --log-timestamps | tee test-run.log
 my cat's name is Luna.
I'm a cat person



llama_perf_sampler_print:    sampling time =      36.10 ms /    17 runs   (    2.12 ms per token,   470.86 tokens per second)
llama_perf_context_print:        load time =   52927.94 ms
llama_perf_context_print: prompt eval time =   52319.80 ms /     7 tokens ( 7474.26 ms per token,     0.13 tokens per second)
llama_perf_context_print:        eval time =   33027.61 ms /     9 runs   ( 3669.73 ms per token,     0.27 tokens per second)
llama_perf_context_print:       total time =   85996.30 ms /    16 tokens

=== GGML Perf Summary ===
  Op               Target      Runs  TSI_KERNEL-RUN          Total us            Avg us
  ADD              OPU         2024            2276           5943358           2936.44
  MUL              OPU         2070            2328           2425324           1171.65
  RMS_NORM         OPU         2070            2070           2543994           1228.98
  MUL_MAT          CPU        36323               0         692051426          19052.71
  CONT             CPU         7986               0           4303856            538.93
  RESHAPE          CPU        10072               0             19358              1.92
  VIEW             CPU        17931               0             30931              1.73
  PERMUTE          CPU        14387               0             25581              1.78
  TRANSPOSE        CPU         3100               0              7234              2.33
  GET_ROWS         CPU          413               0              2969              7.19
  SET_ROWS         CPU         7757               0             26381              3.40
  SOFT_MAX         CPU         3895               0            922000            236.71
  ROPE             CPU         7863               0            174860             22.24
  GLU              OPU         1012            1138           3327395           3287.94

OPU Profiling Results:
------------------------------------------------------------------------------------------------------------------------
Calls  Total(ms)    T/call    Self(ms)  Function
------------------------------------------------------------------------------------------------------------------------
    1    88.4130   88.4130      7.9970  [1.03e-01%] [Thread] tsi::runtime::TsavRTPosix::initialize
    1    80.2850   80.2850      2.1210  └─ [9.32e-02%] tsi::runtime::TsavRTPosix::initializeQueues
    1    75.8450   75.8450     75.8450    └─ [8.81e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
    1     2.2350    2.2350      2.2350    └─ [2.60e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
    1     0.0840    0.0840      0.0740    └─ [9.75e-05%] tsi::runtime::TsavRT::finalizeCommandList
    1     0.0100    0.0100      0.0100      └─ [1.16e-05%] tsi::runtime::executeWithTimeout
    1     0.1310    0.1310      0.1310  └─ [1.52e-04%] tsi::runtime::TsavRT::initialize
------------------------------------------------------------------------------------------------------------------------
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
------------------------------------------------------------------------------------------------------------------------
    1     5.1420    5.1420      4.5150  [5.97e-03%] [Thread] tsi::runtime::TsavRT::finalize
    1     0.6160    0.6160      0.0530  └─ [7.15e-04%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
    1     0.5630    0.5630      0.0600    └─ [6.54e-04%] tsi::runtime::TsavRT::executeSyncCommand
    1     0.4710    0.4710      0.4710      └─ [5.47e-04%] tsi::runtime::TsavRT::awaitCommandListCompletion
    1     0.0320    0.0320      0.0280      └─ [3.72e-05%] tsi::runtime::TsavRT::finalizeCommandList
    1     0.0040    0.0040      0.0040        └─ [4.64e-06%] tsi::runtime::executeWithTimeout
    2     0.0110    0.0055      0.0110  └─ [1.28e-05%] tsi::runtime::TsavRT::deallocate
------------------------------------------------------------------------------------------------------------------------
[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)
------------------------------------------------------------------------------------------------------------------------
 2196  1214.6310    0.5531    103.2940  [ 1.41%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
 4392  1109.7500    0.2527   1109.7500  └─ [ 1.29%] tsi::runtime::executeWithTimeout
 2196     1.5870  7.23e-04      1.5870  └─ [1.84e-03%] LOAD_BLOB Command Execution
 2196     0.0000    0.0000      0.0000  └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148009600[0x800...
 2196     0.0000    0.0000      0.0000  └─ [0.00e+00%] TXE 0 Idle
------------------------------------------------------------------------------------------------------------------------
[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)
------------------------------------------------------------------------------------------------------------------------
 2196   691.3580    0.3148     96.8150  [8.03e-01%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
 4392   592.8200    0.1350    592.8200  └─ [6.88e-01%] tsi::runtime::executeWithTimeout
 2196     1.7230  7.85e-04      1.7230  └─ [2.00e-03%] UNLOAD_BLOB Command Execution
 2196     0.0000    0.0000      0.0000  └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148009600[0x8...
 2196     0.0000    0.0000      0.0000  └─ [0.00e+00%] TXE 0 Idle
------------------------------------------------------------------------------------------------------------------------
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
------------------------------------------------------------------------------------------------------------------------
 2198   883.1250    0.4018     23.0660  [ 1.03%] [Thread] tsi::runtime::TsavRT::processResponses
 2198   860.0590    0.3913    860.0590  └─ [9.99e-01%] tsi::runtime::executeWithTimeout
------------------------------------------------------------------------------------------------------------------------
[Thread] OPU  (cumulative over all threads)
------------------------------------------------------------------------------------------------------------------------
    1     0.0720    0.0720      0.0460  [8.36e-05%] [Thread] OPU 
    1     0.0260    0.0260      0.0260  └─ [3.02e-05%] tsi::runtime::TsavRT::allocate
------------------------------------------------------------------------------------------------------------------------
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
------------------------------------------------------------------------------------------------------------------------
 2196    43.6810    0.0199     39.7560  [5.07e-02%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
 2196     3.9250    0.0018      3.9250  └─ [4.56e-03%] tsi::runtime::executeWithTimeout
------------------------------------------------------------------------------------------------------------------------
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
------------------------------------------------------------------------------------------------------------------------
 2198    18.2320    0.0083     18.2320  [2.12e-02%] [Thread] tsi::runtime::TsavRT::allocate
------------------------------------------------------------------------------------------------------------------------
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
------------------------------------------------------------------------------------------------------------------------
 2196    11.2810    0.0051     11.2810  [1.31e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList
------------------------------------------------------------------------------------------------------------------------
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
------------------------------------------------------------------------------------------------------------------------
 2196  1718.0890    0.7824   1718.0890  [ 1.99%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
------------------------------------------------------------------------------------------------------------------------
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
------------------------------------------------------------------------------------------------------------------------
 2196     3.9990    0.0018      3.9990  [4.64e-03%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
    - 86123.8300    0.0000  86123.8300  [100.00%] TOTAL
========================================================================================================================

Counter Metrics:
------------------------------------------------------------------------------------------------------------------------
Metric                                    Min            Max            Avg
------------------------------------------------------------------------------------------------------------------------
Queue_0_Occupancy                      0.0000         1.0000         0.9992
------------------------------------------------------------------------------------------------------------------------
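
To make the GGML Perf Summary above easier to read at a glance, here is a minimal sketch that totals the `Total us` column per target and finds the most expensive op. It is not part of llama.cpp or the Tsavorite runtime; the helper name `parse_rows` and the six-column layout are assumptions based only on the table printed in this log.

```python
from collections import defaultdict

# Rows copied verbatim from the "GGML Perf Summary" table in the log above.
# Assumed column order: Op, Target, Runs, TSI_KERNEL-RUN, Total us, Avg us.
PERF_SUMMARY = """\
ADD              OPU         2024            2276           5943358           2936.44
MUL              OPU         2070            2328           2425324           1171.65
RMS_NORM         OPU         2070            2070           2543994           1228.98
MUL_MAT          CPU        36323               0         692051426          19052.71
CONT             CPU         7986               0           4303856            538.93
RESHAPE          CPU        10072               0             19358              1.92
VIEW             CPU        17931               0             30931              1.73
PERMUTE          CPU        14387               0             25581              1.78
TRANSPOSE        CPU         3100               0              7234              2.33
GET_ROWS         CPU          413               0              2969              7.19
SET_ROWS         CPU         7757               0             26381              3.40
SOFT_MAX         CPU         3895               0            922000            236.71
ROPE             CPU         7863               0            174860             22.24
GLU              OPU         1012            1138           3327395           3287.94
"""


def parse_rows(text):
    """Parse whitespace-separated summary rows (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank lines or anything that is not a data row
        op, target, runs, kernel_runs, total_us, avg_us = parts
        rows.append({
            "op": op,
            "target": target,
            "runs": int(runs),
            "kernel_runs": int(kernel_runs),
            "total_us": int(total_us),
            "avg_us": float(avg_us),
        })
    return rows


rows = parse_rows(PERF_SUMMARY)

# Sum the "Total us" column per target (OPU vs. CPU).
totals_by_target = defaultdict(int)
for row in rows:
    totals_by_target[row["target"]] += row["total_us"]

grand_total = sum(totals_by_target.values())
for target, total in sorted(totals_by_target.items()):
    share = 100.0 * total / grand_total
    print(f"{target}: {total} us ({share:.1f}% of summed op time)")

# The single op with the largest accumulated time.
slowest = max(rows, key=lambda r: r["total_us"])
slowest_share = 100.0 * slowest["total_us"] / grand_total
print(f"slowest op: {slowest['op']} on {slowest['target']} ({slowest_share:.1f}%)")
```

With the numbers in this log, `MUL_MAT` on CPU accounts for roughly 97% of the summed op time, while the four ops that ran on the OPU (`ADD`, `MUL`, `RMS_NORM`, `GLU`) together account for about 2%. The summed `Total us` column (about 711.8 s) is larger than the 86.0 s total reported by `llama_perf_context_print`, so these shares are best read as relative weights rather than wall-clock time.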

Signed-off-by: Ganesh Kethamalla <gkethamallax@tsavoritesi.com>
@mikeuhler mikeuhler removed their request for review December 17, 2025 19:34