* fix(collector): infer file extension from Content-Type for URLs without explicit extensions
When downloading files from URLs like https://arxiv.org/pdf/2307.10265,
the path has no recognizable file extension. The downloaded file gets
saved without an extension (or with a nonsensical one like .10265),
causing processSingleFile to reject it with 'File extension .10265
not supported for parsing'.
Fix: after downloading, check if the filename has a supported file
extension. If not, inspect the response Content-Type header and map
it to the correct extension using the existing ACCEPTED_MIMES table.
For example, a response with Content-Type: application/pdf will cause
the file to be saved with a .pdf extension, allowing it to be processed
correctly.
Fixes #4513
* small refactor
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
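A minimal sketch of the extension-inference step described in the first commit above. The MIME map is a small hypothetical stand-in for the collector's ACCEPTED_MIMES table, and `inferFilename` and `extensionOf` are illustrative names, not the functions in the codebase.

```javascript
// Hypothetical stand-in for ACCEPTED_MIMES: MIME type -> accepted extensions.
const MIME_TO_EXTENSIONS = new Map([
  ["application/pdf", [".pdf"]],
  ["text/plain", [".txt", ".md"]],
  ["text/html", [".html"]],
]);

const SUPPORTED_EXTENSIONS = new Set([...MIME_TO_EXTENSIONS.values()].flat());

// Return the lowercased extension of a filename (with the dot), or "" if none.
function extensionOf(filename) {
  const dot = filename.lastIndexOf(".");
  return dot > 0 ? filename.slice(dot).toLowerCase() : "";
}

// Pick a filename for a downloaded URL, appending an extension derived from
// the Content-Type header when the URL path does not carry a supported one.
function inferFilename(url, contentTypeHeader) {
  const pathname = new URL(url).pathname;
  const basename = pathname.split("/").filter(Boolean).pop() || "download";
  if (SUPPORTED_EXTENSIONS.has(extensionOf(basename))) return basename;

  // Strip parameters such as "; charset=utf-8" before the lookup.
  const mime = (contentTypeHeader || "").split(";")[0].trim().toLowerCase();
  const [ext] = MIME_TO_EXTENSIONS.get(mime) || [];
  return ext ? `${basename}${ext}` : basename;
}
```

With these assumptions, `inferFilename("https://arxiv.org/pdf/2307.10265", "application/pdf")` yields `2307.10265.pdf`, while a URL that already ends in a supported extension is left untouched.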
* Adjust fix path to use ESM import
* normalize fix-path imports and usage across the app
* extract path fix logic to utils for server and collector
* add helpers
* repin strip-ansi in collector
* fix log for localWhisper
lint
* refactor localWhisper to use new custom FFMPEGWrapper class
* stub tests in github actions
* add back wavefile conversion to 16khz 32f to fix docker builds
* use afterEach for cleanup in ffmpeg tests
* remove unused FFMPEG_PATH env check
* use spawnSync for ffmpeg to capture and log output
* lint
* revert removal of try/catch around validateAudioFile for more helpful error msgs
* use readFileSync instead of createReadStream for less overhead
* change import to require for fix-path and stub import in tests
* refactor to singleton to preserve ffmpeg path
dev build
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* iterate over all pages in paperless-ngx data connector
* add error handling and data validation
* refactor to handle edge cases and null values
* catch edge case to prevent infinite loop
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
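The pagination commits above can be sketched roughly as follows. This assumes each page is an object with a `results` array and a `next` URL (or null), and it injects a hypothetical `fetchPage` callback instead of calling a real API; the `maxPages` cap and the visited set are one possible way to guard against an infinite loop, not necessarily what the connector does.

```javascript
// Collect results from every page, following `next` links.
// fetchPage is injected: given a URL it resolves to a { results, next } page.
async function fetchAllDocuments(firstUrl, fetchPage, maxPages = 1000) {
  const documents = [];
  const visited = new Set();
  let url = firstUrl;

  // Stop on a missing link, a repeated link (cycle), or the page cap.
  while (url && !visited.has(url) && visited.size < maxPages) {
    visited.add(url);
    const page = await fetchPage(url);
    if (!page || !Array.isArray(page.results)) break; // malformed page

    // Skip null/undefined entries rather than failing the whole sync.
    documents.push(...page.results.filter((doc) => doc != null));
    url = typeof page.next === "string" ? page.next : null;
  }
  return documents;
}
```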
* Added bypassSSL parameter to constructor and implemented SSL bypass logic in fetchConfluenceData method
* Updated generateChunkSource function to include bypassSSL in the encrypted payload
* Updated the request body to include bypassSSL in the JSON payload sent to the backend
* Updated form submission to include bypassSSL parameter from the checkbox
* Added bypass_ssl: "Bypass SSL Certificate Validation" translation
* passed these parameters to fetchconfluencepage function for proper resync functionality
* allow ignore of SSL cert for Confluence
* add translations
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* paperless ngx data connector
* wip resync paperless ngx
* fix generateChunkSource for resyncing paperless ngx
* lint
* Refactor Paperless-NGX connector
Fix issue with date rendering in tooltip + extended width
Move tooltip details to be column for more space
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* Enhance YouTube transcript loading to include video metadata in parsed content when parseOnly is true
* extract to function
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* fix: remove unnecessary toLowerCase in URL validation
* test: enhance URL validation tests to preserve case sensitivity and format
* test: update URL validation tests to ensure domain normalization to lowercase while preserving path case
* small formatting
* fix filenames when downloading live URI
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
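For the URL validation commits above, a small sketch of why an explicit `toLowerCase` is unnecessary: the standard WHATWG `URL` parser already lowercases the scheme and host while leaving the path and query untouched. `normalizeUrl` is a hypothetical helper name, and restricting to http/https is an assumption for the example.

```javascript
// Validate a URL and return its normalized form, or null if it is invalid.
// The URL parser lowercases the scheme and hostname but preserves path case.
function normalizeUrl(input) {
  try {
    const url = new URL(String(input).trim());
    if (!["http:", "https:"].includes(url.protocol)) return null;
    return url.href;
  } catch {
    return null; // new URL() throws a TypeError on unparseable input
  }
}
```

Lowercasing the whole string instead would corrupt case-sensitive paths such as `/Docs/ReadMe.PDF`.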
* Add capability to web scraping feature for document creation to download and parse statically hosted files
* lint
* Remove unneeded comment
* Simplified process by using the keys of ACCEPTED_MIMES to validate the response content type, which unlocks all supported file types
* Add TODO comments for future implementation of asDoc.js to handle standard MS Word files in constants.js
* Return captureAs argument to be exposed by scrapeGenericUrl and passed into getPageContent | Return explicit argument of captureAs into scrapeGenericUrl in processLink fn
* Return debug log for scrapeGenericUrl
* Change conditional to a guard clause.
* Add error handling, validation, and JSDOC to getContentType helper fn
* remove unneeded comments
* Simplify URL validation by reusing module
* Rename downloadFileToHotDir to downloadURIToFile and move it up to a global module | Add URL validation to downloadURIToFile
* refactor
* add support for webp
remove unused imports
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* fix: youtube transcript collector does not work well with non-en or non-asr captions
* stub YT test in Github actions
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* Create parse endpoint in collector (#4212)
* create parse endpoint in collector
* revert cleanup temp util call
* lint
* remove unused cleanupTempDocuments function
* revert slug change
minor change for destinations
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* Add parsed files table and parse server endpoints (#4222)
* add workspace_parsed_files table + parse endpoints/models
* remove dev api parse endpoint
* remove unneeded imports
* iterate over all files + remove unneeded update function + update telemetry debounce
* Upload UI/UX context window check + frontend alert (#4230)
* prompt user to embed if exceeds prompt window + handle embed + handle cancel
* add tokenCountEstimate to workspace_parsed_files + optimizations
* use util for path locations + use safeJsonParse
* add modal for user decision on overflow of context window
* lint
* dynamic fetching of provider/model combo + inject parsed documents
* remove unneeded comments
* popup ui for attaching/removing files + warning to embed + wip fetching states on update
* remove prop drilling, fetch files/limits directly in attach files popup
* rework ux of FE + BE optimizations
* fix ux of FE + BE optimizations
* Implement bidirectional sync for parsed file states
linting
small changes and comments
* move parse support to another endpoint file
simplify calls and loading of records
* button borders
* enable default users to upload parsed files but NOT embed
* delete cascade on user/workspace/thread deletion to remove parsedFileRecord
* enable bgworker with "always" jobs and optional document sync jobs
orphan document job: Will find any broken reference files to prevent overpollution of the storage folder. This will run 10s after boot and every 12hr after
* change run timeout for orphan job to 1m to allow settling before spawning a worker
* linting and cleanup pr
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* dev build
* fix tooltip hiding during embedding overflow files
* prevent crash log from ERRNO on parse files
* unused import
* update docs link
* Migrate parsed-files to GET endpoint
patch logic for grabbing models names from utils
better handling for undetermined context windows (null instead of positive infinity)
UI placeholder for null context windows
* patch URL
---------
Co-authored-by: Sean Hatfield <seanhatfield5@gmail.com>
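A rough sketch of the context window check described in the commits above, assuming each parsed file record carries a `tokenCountEstimate` and that an undetermined context window is represented as `null`. The function name and the returned shape are hypothetical; the real check may apply a different threshold.

```javascript
// Sum token estimates for parsed files and compare against the model's
// context window. A null window means "unknown", so no verdict is given.
function checkContextOverflow(files, contextWindow) {
  const totalTokens = files.reduce(
    (sum, file) => sum + (Number(file.tokenCountEstimate) || 0),
    0
  );
  if (contextWindow === null) return { totalTokens, exceeds: null };
  return { totalTokens, exceeds: totalTokens > contextWindow };
}
```

Using `null` rather than an infinite value lets the UI distinguish "fits" from "cannot be determined" and show a placeholder for the latter.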
* feat: add support for custom table formatting in htmlToText conversion
* fix tables
* feat: improve plain text table formatting for AI readability
* fix options
* improve drupal wiki connector
* final fix
* adjust leading slash to match code
* linting
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* Enable bypass of ip limitations via ENV in collector startup
resolves #3625
connect #3626
* dev build
* bump dockerx build action
* enable runtime setting config of collector requests
* comments and linting for option passing
* unset
* unset
* update docs link
* linting and docs
* Add multilingual support for OCR module
* Add OCR language as server var that is passed into Collector
Support all valid tesseract language codes
Filter and parse only valid codes with fallbacks
* persist TARGET_OCR_LANG
* update docker example env
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
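The OCR language filtering above might look something like the sketch below. It assumes TARGET_OCR_LANG holds a comma-separated list of codes and that `eng` is the fallback; both are assumptions for illustration, and the set of valid codes is a small sample rather than the full Tesseract list. `parseOcrLanguages` is a hypothetical name.

```javascript
// Small sample of Tesseract language codes; the real list is much longer.
const VALID_OCR_LANGS = new Set(["eng", "deu", "fra", "spa", "chi_sim"]);

// Parse a comma-separated language setting, keeping only known codes.
// Falls back to the default when nothing valid remains.
function parseOcrLanguages(raw, fallback = ["eng"]) {
  if (typeof raw !== "string") return fallback;
  const codes = raw
    .split(",")
    .map((code) => code.trim().toLowerCase())
    .filter((code) => VALID_OCR_LANGS.has(code));
  const unique = [...new Set(codes)]; // drop duplicates, keep first-seen order
  return unique.length > 0 ? unique : fallback;
}
```

For example, `parseOcrLanguages(process.env.TARGET_OCR_LANG)` would return `["eng"]` when the variable is unset or contains only unrecognized codes.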