homelab
=======

problem: older hardware
-----------------------

- Xeon E5-2690 v2
- Tesla P100 16GB
- driver supports CUDA up to 11.8

software requirements:
----------------------

- CUDA 12.8
- AVX-512 or AVX2 (Xeon v3 or newer)

problem: reranking
------------------

- compiled llama.cpp and a GGUF model, which works OK
- unfortunately it is no longer recognised by infiniflow/ragflow

possible solution:
------------------

Install Xinference (CPU version); the GPU Docker container is too big and requires CUDA 12.8.

.. code-block:: bash

   docker run -d \
     --name xinference-cpu \
     -p 9997:9997 \
     --shm-size=8g \
     --restart unless-stopped \
     -v /home/jan/models:/models \
     xprobe/xinference:latest-cpu \
     xinference-local -H 0.0.0.0

.. code-block:: bash

   docker exec xinference-cpu xinference launch \
     --model-name qwen3-reranker-4b \
     --model-format gguf \
     --model-uri /models/Qwen3-Reranker-4B-q5_k_m.gguf \
     --model-type rerank

--------------------

Some parameters (especially engine-specific ones for custom/local models such as GGUF rerankers) are treated as extra model kwargs and must be passed after a ``--`` separator so they are not parsed as main command options:

.. code-block:: bash

   docker exec -it xinference-cpu xinference launch \
     --model-name Qwen3-Reranker-4B \
     --model-type rerank \
     -- \
     --model-format gguf \
     --model-uri /models/Qwen3-Reranker-4B-q5_k_m.gguf

Inside the container (this seems to work OK):

.. code-block:: bash

   xinference launch \
     --model-name Qwen3-Reranker-4B \
     --model-type rerank \
     -- \
     --model-format gguf \
     --model-uri /models/Qwen3-Reranker-4B-q5_k_m.gguf
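
checking the reranker:
----------------------

To verify that the model is actually up before wiring it into RAGFlow, one option is to query the Xinference REST API from the host. This is a minimal sketch, assuming the default port 9997 from the ``docker run`` above and Xinference's ``/v1/models`` and ``/v1/rerank`` endpoints; the exact request/response fields may vary between Xinference versions, and the query/document strings are made up for the test.

.. code-block:: bash

   # list the models Xinference currently has launched
   # (assumes the container from above is running on port 9997)
   curl -s http://localhost:9997/v1/models

   # send a small test rerank request to the launched model;
   # the model name must match the one used in "xinference launch"
   curl -s http://localhost:9997/v1/rerank \
     -H "Content-Type: application/json" \
     -d '{
           "model": "Qwen3-Reranker-4B",
           "query": "What hardware is in the homelab?",
           "documents": [
             "The server has a Xeon E5-2690 v2 and a Tesla P100 16GB.",
             "RAGFlow is an open-source RAG engine."
           ]
         }'

If the second call returns relevance scores for the documents, the reranker is reachable over HTTP, which is the same kind of endpoint RAGFlow's Xinference provider is expected to talk to.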