Native Embedder model selection (incl: Multilingual support) (#3835)

* WIP on embedder selection
TODO: apply splitting and query prefixes (if applicable)

* wip on upsert

* Support base model
support nomic-embed-text-v1
support multilingual-e5-small
Add prefixing for both embedding and query for RAG tasks
Add chunking prefix to all vector dbs to apply prefix when possible
Show dropdown and auto-pull on new selection

* norm translations

* move supported models to constants
handle null selection or invalid selection on dropdown
update comments

* dev

* patch text splitter maximums for now

* normalize translations

* add tests for splitter functionality

* normalize
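
To illustrate the query/embedding prefixing mentioned above, here is a minimal sketch of what task-specific prefixing could look like. It is not taken from this commit. The table name `PREFIXES`, the function `applyPrefix`, and the model keys are hypothetical. The prefix strings are the ones the E5 and Nomic model cards document ("query: " / "passage: " and "search_query: " / "search_document: "). A model with no entry, such as the default all-MiniLM-L6-v2, is assumed to need no prefix.

```typescript
type Task = "query" | "document";

interface PrefixPair {
  query: string;
  document: string;
}

// Hypothetical lookup table: the keys are illustrative, not the project's constants.
const PREFIXES: Record<string, PrefixPair> = {
  "multilingual-e5-small": { query: "query: ", document: "passage: " },
  "nomic-embed-text-v1": {
    query: "search_query: ",
    document: "search_document: ",
  },
};

// Prepend the task prefix for a model, or return the text unchanged
// when the model has no registered prefixes.
function applyPrefix(model: string, text: string, task: Task): string {
  const pair = PREFIXES[model];
  if (!pair) return text;
  return pair[task] + text;
}

// Chunks are embedded as documents; the user's search text as a query.
const chunk = applyPrefix("multilingual-e5-small", "Paris is in France.", "document");
const question = applyPrefix("multilingual-e5-small", "Where is Paris?", "query");
console.log(chunk, "|", question);
```

Because the prefix is part of the embedded string, it counts toward the model's input limit, so a text splitter would need to leave room for it when sizing chunks.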

---------

Co-authored-by: shatfield4 <seanhatfield5@gmail.com>
Timothy Carambat
2025-07-22 10:07:20 -07:00
committed by GitHub
parent 31a8ead823
commit 2c19dd09ed
44 changed files with 463 additions and 80 deletions


@@ -138,6 +138,10 @@ SIG_SALT='salt' # Please generate random string at least 32 chars long.
###########################################
######## Embedding API SELECTION ##########
###########################################
# This will be the assumed default embedding selection and model
# EMBEDDING_ENGINE='native'
# EMBEDDING_MODEL_PREF='Xenova/all-MiniLM-L6-v2'
# Only used if you are using an LLM that does not natively support embedding (openai or Azure)
# EMBEDDING_ENGINE='openai'
# OPEN_AI_KEY=sk-xxxx