8b60bc57e6 wip: use our internal LLM + switch from JSON to Markdown (lebaudantoine, 2024-12-16)
Simplify the LLM's job. Instead of requesting JSON output with a single key,
make sure the LLM doesn't output any extra information.

By simplifying the LLM's job, we make sure its output can always be parsed.
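
To illustrate the difference (the function names and the JSON key below are
made up, not the actual code):

```python
import json

# Before: the model had to wrap its answer in JSON with a single key.
# Any stray prose around the JSON breaks json.loads().
def parse_json_answer(raw: str) -> str:
    return json.loads(raw)["answer"]  # raises on any extra output

# After: the model is told to output nothing but the answer itself,
# so "parsing" is just stripping whitespace.
def parse_plain_answer(raw: str) -> str:
    return raw.strip()
```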

I ran a quick test with the Translate prompt. After a bunch of tests, adding
an instruction to output only the translated text seems to be enough.

I did some light prompt engineering, using ChatGPT and Claude to generate
a proper system prompt. It works reasonably well, but there is definitely
room for improvement.
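
For reference, the kind of system prompt I mean (an illustrative sketch only,
not the exact prompt from this commit):

```python
# Illustrative only -- the actual prompt may differ.
TRANSLATE_SYSTEM_PROMPT = """\
You are a professional translator. Translate the user's text into {language}.
Output only the translated text. Do not add explanations, notes, quotes,
or anything else before or after the translation.
"""
```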

I haven't searched yet for open-source prompts we could reuse from a prompt
library. A perfect translation seems to be a difficult job for an 8B model.

Please note I haven't updated the other prompts yet; let's discuss it first.
I ran my experiment with our internal LLM, which is optimized for throughput
rather than latency (there is a trade-off). I'll try fine-tuning a few of its
parameters to see if I can reduce latency.
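
If our internal deployment runs vLLM (an assumption on my part, and the
values below are illustrative), these are the kind of knobs that trade
throughput for latency:

```python
# Assumption: the internal LLM is served with vLLM; values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model name
    max_num_seqs=8,               # smaller batches -> lower per-request latency
    max_num_batched_tokens=2048,  # caps tokens scheduled per engine step
    gpu_memory_utilization=0.90,
)
outputs = llm.generate(
    ["Translate to French: Hello"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```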

For 880 tokens (based on ChatGPT's online token counter), it takes roughly
17 s, vs ~40 s for Albert CNRS 70B.
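
Back-of-the-envelope, assuming those 880 tokens are the generated output:
880 / 17 ≈ 52 tokens/s for the internal model vs 880 / 40 ≈ 22 tokens/s for
the 70B.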

For 180 tokens it takes roughly 3 s. Without a proper UX (e.g. a nicer loading
animation, streaming tokens), it feels like a decade. However, asking ChatGPT
to do the same job takes about the same amount of time, from submitting the
request to the last token being generated.
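
Streaming would hide most of that perceived latency. A minimal sketch,
assuming our internal LLM exposes an OpenAI-compatible endpoint (URL, key,
and model name are placeholders):

```python
from openai import OpenAI

# Placeholders: swap in the real endpoint, key, and model name.
client = OpenAI(base_url="http://internal-llm.example/v1", api_key="placeholder")

stream = client.chat.completions.create(
    model="internal-8b",
    messages=[{"role": "user", "content": "Translate to French: Good morning"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # delta can be None on role/finish chunks
        print(delta, end="", flush=True)
print()
```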