feat: Add multilingual support for ocr module (#3325)

* Add multilingual support for ocr module

* Add OCR language as server var that is passed into Collector
  Support all valid tesseract language codes
  Filter and parse only valid codes with fallbacks

* persist TARGET_OCR_LANG

* update docker example env

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2026-04-25 17:15:37 +02:00 · 2025-02-28 04:31:17 +08:00
parent c928d3d0c5
commit df166eb64e
8 changed files with 229 additions and 7 deletions
--- a/docker/.env.example
+++ b/docker/.env.example
@@ -321,3 +321,8 @@ GID='1000'
 # Enable simple SSO passthrough to pre-authenticate users from a third party service.
 # See https://docs.anythingllm.com/configuration#simple-sso-passthrough for more information.
 # SIMPLE_SSO_ENABLED=1
+
+# Specify the target languages to use when OCR parses images and PDFs.
+# This is a comma-separated list of language codes given as a single string. Unsupported languages will be ignored.
+# Default is English (eng). See https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html for a list of valid language codes.
+# TARGET_OCR_LANG=eng,deu,ita,spa,fra,por,rus,nld,tur,hun,pol
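
The commit message describes filtering `TARGET_OCR_LANG` down to valid codes with a fallback. A minimal sketch of that idea is below. It is not the code from this commit: the function name `parseOcrLangs` and the short `VALID_LANGS` set are hypothetical, and lowercasing the input is an assumption made here for illustration. The real list of Tesseract codes is much longer.

```javascript
// Hypothetical subset of Tesseract language codes, for illustration only.
const VALID_LANGS = new Set([
  "eng", "deu", "ita", "spa", "fra", "por", "rus", "nld", "tur", "hun", "pol",
]);

// Parse a comma-separated string of language codes.
// Unknown codes and duplicates are dropped; if nothing valid remains,
// a copy of the fallback list (English by default) is returned.
function parseOcrLangs(raw, fallback = ["eng"]) {
  if (typeof raw !== "string" || raw.trim() === "") return [...fallback];

  const seen = new Set(); // a Set preserves insertion order and removes duplicates
  for (const part of raw.split(",")) {
    const code = part.trim().toLowerCase(); // assumption: codes are matched case-insensitively
    if (VALID_LANGS.has(code)) seen.add(code);
  }
  return seen.size > 0 ? [...seen] : [...fallback];
}
```

Under these assumptions, `parseOcrLangs(process.env.TARGET_OCR_LANG)` returns a clean array whether the variable is unset, contains unsupported codes, or repeats entries. The Tesseract command line takes multiple languages joined with `+` (for example `eng+deu`), so a caller would likely join the array that way before invoking OCR.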