* fix(collector): infer file extension from Content-Type for URLs without explicit extensions
When downloading files from URLs like https://arxiv.org/pdf/2307.10265,
the path has no recognizable file extension. The downloaded file gets
saved without an extension (or with a nonsensical one like .10265),
causing processSingleFile to reject it with 'File extension .10265
not supported for parsing'.
Fix: after downloading, check if the filename has a supported file
extension. If not, inspect the response Content-Type header and map
it to the correct extension using the existing ACCEPTED_MIMES table.
For example, a response with Content-Type: application/pdf will cause
the file to be saved with a .pdf extension, allowing it to be processed
correctly.
Fixes #4513
* small refactor
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
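The Content-Type fallback described in the first commit above can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code: the `ACCEPTED_MIMES` entries shown here, the table's shape (MIME type mapped to a list of extensions), and the helper name `withInferredExtension` are all assumptions made for the example.

```javascript
const path = require("node:path");

// Hypothetical stand-in for the collector's ACCEPTED_MIMES table. The real
// table's exact shape and entries are an assumption here.
const ACCEPTED_MIMES = {
  "application/pdf": [".pdf"],
  "text/plain": [".txt", ".md"],
  "text/html": [".html"],
};

// Every extension the (assumed) table knows how to parse.
const SUPPORTED_EXTENSIONS = new Set(Object.values(ACCEPTED_MIMES).flat());

// Hypothetical helper: returns a filename that ends in a supported extension
// when one can be inferred from the Content-Type header.
function withInferredExtension(filename, contentType) {
  const currentExt = path.extname(filename).toLowerCase();
  if (SUPPORTED_EXTENSIONS.has(currentExt)) return filename;

  // Drop parameters such as "; charset=utf-8" before the lookup.
  const mime = String(contentType || "").split(";")[0].trim().toLowerCase();
  if (!Object.hasOwn(ACCEPTED_MIMES, mime)) return filename;

  // Append rather than replace, so a name like "2307.10265" stays intact.
  return `${filename}${ACCEPTED_MIMES[mime][0]}`;
}
```

With this sketch, `withInferredExtension("2307.10265", "application/pdf")` yields `2307.10265.pdf`, while a filename that already has a supported extension is returned unchanged.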
* Adjust fix path to use ESM import
* normalize fix-path imports and usage across the app
* extract path fix logic to utils for server and collector
* add helpers
* repin strip-ansi in collector
* fix log for localWhisper
lint
* refactor localWhisper to use new custom FFMPEGWrapper class
* stub tests in github actions
* add back wavefile conversion to 16 kHz 32f to fix Docker builds
* use afterEach for cleanup in ffmpeg tests
* remove unused FFMPEG_PATH env check
* use spawnSync for ffmpeg to capture and log output
* lint
* revert removal of try/catch around validateAudioFile for more helpful error msgs
* use readFileSync instead of createReadStream for less overhead
* change import to require for fix-path and stub import in tests
* refactor to singleton to preserve ffmpeg path
dev build
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
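The ffmpeg changes above (a singleton that keeps the resolved binary path, `spawnSync` so output can be captured and logged, and conversion to 16 kHz 32-bit float WAV) might look something like the sketch below. The class name `FfmpegRunnerSketch`, its methods, and the injectable `run` parameter are hypothetical and are not the project's `FFMPEGWrapper` API; the ffmpeg flags themselves (`-y`, `-i`, `-ar`, `-c:a pcm_f32le`) are standard ffmpeg options.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical illustration only; not the project's FFMPEGWrapper class.
class FfmpegRunnerSketch {
  static instance = null;

  // Singleton accessor: the binary path is resolved once and then reused.
  static get(binary = "ffmpeg") {
    if (!FfmpegRunnerSketch.instance) {
      FfmpegRunnerSketch.instance = new FfmpegRunnerSketch(binary);
    }
    return FfmpegRunnerSketch.instance;
  }

  constructor(binary) {
    this.binary = binary;
  }

  // Convert any input to 16 kHz, 32-bit float PCM WAV.
  // `run` defaults to spawnSync and is injectable so it can be stubbed in tests.
  toWav16k(input, output, run = spawnSync) {
    const args = ["-y", "-i", input, "-ar", "16000", "-c:a", "pcm_f32le", output];
    const result = run(this.binary, args, { encoding: "utf8" });

    // spawnSync reports spawn failures (e.g. missing binary) on `error`.
    if (result.error) throw result.error;
    if (result.status !== 0) {
      throw new Error(`ffmpeg exited with code ${result.status}: ${result.stderr}`);
    }
    return output;
  }
}
```

Because `spawnSync` blocks until ffmpeg exits, its stderr is available in full for the error message, which is the main reason to prefer it here over a streaming spawn.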
* iterate over all pages in paperless-ngx data connector
* add error handling and data validation
* refactor to handle edge cases and null values
* catch edge case to prevent infinite loop
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
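The pagination commits above (iterate over all pages, validate the data, tolerate null values, and guard against an infinite loop) could be sketched as below. This assumes each page is an object with a `results` array and a `next` URL that is `null` on the last page; `fetchAllPages`, `fetchPage`, and `maxPages` are hypothetical names used only for this example.

```javascript
// Hypothetical sketch. `fetchPage` is an injected function:
//   (url) => Promise<{ results: Array, next: string | null }>
async function fetchAllPages(firstUrl, fetchPage, maxPages = 1000) {
  const documents = [];
  const seen = new Set();
  let url = firstUrl;

  // Stop on: no next URL, a URL already visited (loop), or the page cap.
  while (url && !seen.has(url) && seen.size < maxPages) {
    seen.add(url);
    const page = await fetchPage(url);

    // Data validation: bail out on a malformed page instead of throwing.
    if (!page || !Array.isArray(page.results)) break;

    // Skip null/undefined entries rather than passing them downstream.
    documents.push(...page.results.filter((doc) => doc != null));

    url = typeof page.next === "string" ? page.next : null;
  }
  return documents;
}
```

Tracking visited URLs in a set covers the case where a server keeps returning a `next` link that points back to a page already fetched, and the page cap bounds the work even if every `next` link is new.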
* Added bypassSSL parameter to constructor and implemented SSL bypass logic in fetchConfluenceData method
* Updated generateChunkSource function to include bypassSSL in the encrypted payload
* Updated the request body to include bypassSSL in the JSON payload sent to the backend
* Updated form submission to include bypassSSL parameter from the checkbox
* Added bypass_ssl: "Bypass SSL Certificate Validation" translation
* Passed these parameters to the fetchconfluencepage function for proper resync functionality
* allow ignore of SSL cert for Confluence
* add translations
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
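As a rough illustration of what a `bypassSSL` flag can translate to in Node.js, the sketch below builds an `https.Agent` with certificate verification disabled only when the flag is set. The helper name `agentFor` is hypothetical, and whether the connector uses `https.Agent` or another HTTP client is an assumption; `rejectUnauthorized` is a real Node.js TLS option.

```javascript
const https = require("node:https");

// Hypothetical helper: returns an agent that skips certificate validation
// when bypassSSL is true, or undefined so the default agent is used.
function agentFor(bypassSSL = false) {
  if (!bypassSSL) return undefined;
  return new https.Agent({ rejectUnauthorized: false });
}
```

The returned agent would be passed as the `agent` option to `https.request` or `https.get`. Disabling validation removes protection against man-in-the-middle attacks, so it is best kept opt-in per connection, as the checkbox in the commits above does.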
* paperless ngx data connector
* wip resync paperless ngx
* fix generateChunkSource for resyncing paperless ngx
* lint
* Refactor Paperless-NGX connector
Fix issue with date rendering in tooltip + extended width
Move tooltip details to be column for more space
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
* Enhance YouTube transcript loading to include video metadata in parsed content when parseOnly is true
* extract to function
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* fix: remove unnecessary toLowerCase in URL validation
* test: enhance URL validation tests to preserve case sensitivity and format
* test: update URL validation tests to ensure domain normalization to lowercase while preserving path case
* small formatting
* fix filenames when downloading live URI
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
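The behaviour targeted by the URL validation commits above (domain normalized to lowercase, path case preserved) falls out of the WHATWG `URL` parser, which lowercases the scheme and host but leaves the path and query untouched. A minimal sketch, assuming only http and https links are accepted; `normalizeUrl` is a hypothetical name and not the project's validation module:

```javascript
// Hypothetical helper: returns a normalized URL string, or null when the
// input is not a valid http(s) URL.
function normalizeUrl(input) {
  let parsed;
  try {
    parsed = new URL(String(input).trim());
  } catch {
    return null; // not parseable as an absolute URL
  }
  if (!["http:", "https:"].includes(parsed.protocol)) return null;

  // href has a lowercased scheme and host; path and query keep their case.
  return parsed.href;
}
```

Calling `toLowerCase()` on the whole string would also lowercase the path, which breaks links on case-sensitive servers; letting the parser normalize only the host avoids that.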
* Add capability to web scraping feature for document creation to download and parse statically hosted files
* lint
* Remove unneeded comment
* Simplified the process by using the keys of ACCEPTED_MIMES to validate the response content type, which unlocks all supported file types
* Add TODO comments for future implementation of asDoc.js to handle standard MS Word files in constants.js
* Return captureAs argument to be exposed by scrapeGenericUrl and passed into getPageContent | Return explicit argument of captureAs into scrapeGenericUrl in processLink fn
* Return debug log for scrapeGenericUrl
* Change conditional to a guard clause.
* Add error handling, validation, and JSDOC to getContentType helper fn
* remove unneeded comments
* Simplify URL validation by reusing module
* Rename downloadFileToHotDir to downloadURIToFile and move it up to a global module | Add URL validation to downloadURIToFile
* refactor
* add support for webp
remove unused imports
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
* Add HTTP request logging middleware for development mode
- Introduced httpLogger middleware to log HTTP requests and responses.
- Enabled logging only in development mode to assist with debugging.
* Update httpLogger middleware to disable time logging by default
* Add httpLogger middleware for development mode in collector service
* Refactor httpLogger middleware to rename timeLogs parameter to enableTimestamps for clarity
* Mount the HTTP Logger only in development mode and only when the environment flag is enabled.
* Update .env.example to clarify HTTP Logger configuration comments
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
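A development-only request logger along the lines of the commits above might be structured as in the sketch below. It is written as a plain Express-style `(req, res, next)` middleware so it has no dependencies. The `enableTimestamps` option comes from the commit messages; the `log` option, the `shouldMountHttpLogger` helper, and the `ENABLE_HTTP_LOGGER` variable name are assumptions for illustration.

```javascript
// Hypothetical sketch of an Express-style logging middleware factory.
function httpLogger({ enableTimestamps = false, log = console.log } = {}) {
  return function httpLoggerMiddleware(req, res, next) {
    // Log once the response has been sent so the status code is known.
    res.on("finish", () => {
      const prefix = enableTimestamps ? `[${new Date().toISOString()}] ` : "";
      log(`${prefix}${req.method} ${req.originalUrl || req.url} ${res.statusCode}`);
    });
    next();
  };
}

// Mount only in development AND when the flag is explicitly enabled.
// The environment variable name is an assumption for this example.
function shouldMountHttpLogger(env = process.env) {
  return env.NODE_ENV === "development" && env.ENABLE_HTTP_LOGGER === "true";
}
```

In an Express app this would be wired up as `if (shouldMountHttpLogger()) app.use(httpLogger());`, which keeps the logger out of production builds entirely.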
* fix: YouTube transcript collector does not work well with non-English or non-ASR captions
* stub YT test in Github actions
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>