
Improve response quality with Gemini. Improve evaluation harness #1150


Conversation

@debanjum (Member) commented Apr 4, 2025

Improve Gemini usage

  • Allow the text tool to give the agent the ability to terminate research
  • Set a default context window for Gemini 2 flash models
    Double the context window for small commercial models to 120K
  • Default the temperature of Gemini models to 1.0 to reduce repetition (see the sketch below)
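
A minimal sketch of the temperature change, with hypothetical names rather than the actual Khoj code, assuming the default only applies when no temperature is configured:

```python
# Hypothetical helper: fall back to 1.0, the AI Studio default for
# non-thinking Gemini models, when no temperature is configured.
DEFAULT_GEMINI_TEMPERATURE = 1.0

def resolve_temperature(model_name: str, configured: float | None) -> float:
    if configured is not None:
        return configured
    if model_name.startswith("gemini"):
        return DEFAULT_GEMINI_TEMPERATURE
    return 0.2  # illustrative default for other model families
```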

Improve evaluation harness

  • Add more knobs to control the eval workflow
    • Allow running the eval with any chat model served over an OpenAI-compatible API
    • Control random sampling from the eval set
    • Toggle automatic web page reading
  • Use embedded Postgres instead of a Postgres server for the eval workflow
  • Use Gemini 2.0 Flash as the evaluator. Set a seed for the evaluator to reduce decision variance (see the sketch below)
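
A minimal sketch of how the evaluator could be called through an OpenAI-compatible endpoint with a fixed seed. The client is the official openai Python package; the env var names and prompt are illustrative, not the actual eval harness:

```python
import os
from openai import OpenAI

# Any OpenAI-compatible endpoint works; env var names are illustrative.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.getenv("OPENAI_BASE_URL"),
)

def grade(question: str, answer: str) -> str:
    response = client.chat.completions.create(
        model=os.getenv("KHOJ_EVAL_MODEL", "gemini-2.0-flash"),  # hypothetical env var
        seed=42,           # fixed seed to reduce evaluator decision variance
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": f"Grade this answer.\nQ: {question}\nA: {answer}",
        }],
    )
    return response.choices[0].message.content
```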

@debanjum debanjum force-pushed the reduce-repetition-by-gemini-and-add-more-knobs-to-evals-workflow branch from 6be6107 to 9156a92 on April 4, 2025 14:29
debanjum added 8 commits April 4, 2025 20:11
…iance.

The Gemini 2.0 Flash model is cheaper and better than Gemini 1.5 Pro.
- Control auto-read of web pages via the eval workflow. Prefix the env var with KHOJ_.
  Default to false, since that is the setting that will be used in prod
  going forward.
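
A minimal sketch of the env var toggle; the exact variable name is hypothetical, only the KHOJ_ prefix and the false default come from the commit:

```python
import os

def auto_read_webpage_enabled() -> bool:
    # Hypothetical variable name; defaults to False when unset.
    return os.getenv("KHOJ_AUTO_READ_WEBPAGE", "false").lower() in ("1", "true", "yes")
```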

- Set the OpenAI API key via an input param in manual eval workflow runs.
  - Simplify evaluating other chat models available over an OpenAI-compatible
    API via the eval workflow.
  - Mask the input API key as a secret in the workflow.
  - Discard unnecessary null-setting of env vars.
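
A minimal sketch of masking the key in workflow logs. `::add-mask::` is a standard GitHub Actions workflow command; how the dispatch input reaches the env var is an assumption here:

```python
import os

# Assumes the workflow passes the dispatch input to this step as an env var.
api_key = os.environ.get("OPENAI_API_KEY", "")
if api_key:
    # GitHub Actions redacts any later occurrence of this value in the logs.
    print(f"::add-mask::{api_key}")
```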

- Control randomization of samples in the eval workflow.
  If randomization is turned off, the workflow takes the first SAMPLE_SIZE
  items from the eval dataset instead of a random collection of
  SAMPLE_SIZE items.
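
A minimal sketch of the sampling knob; names are illustrative:

```python
import random

def select_samples(dataset: list, sample_size: int, randomize: bool) -> list:
    # Random sample when randomization is on; first SAMPLE_SIZE items otherwise.
    if randomize:
        return random.sample(dataset, min(sample_size, len(dataset)))
    return dataset[:sample_size]
```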
This is the default temperature for non-thinking Gemini models on AI
Studio. See if using it alleviates the repetition problem.
The queries field name in the first example isn't wrapped in double
quotes; the rest are.
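
An illustration of the inconsistency being fixed; the example content is made up, only the quoting of the queries field name matters:

```python
# Before: field name unquoted in the first few-shot example.
example_before = '{queries: ["best pizza nearby"]}'
# After: quoted consistently with the other examples.
example_after = '{"queries": ["best pizza nearby"]}'
```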
Make the research planner consistently select the tool before the query,
as the model should tune its query for the selected tool. It already has
space in the scratchpad to think about which tool to use.
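
A minimal sketch of the field ordering, using a hypothetical pydantic model rather than the actual planner schema; listing tool before query makes the model commit to a tool first, then write a query tuned to it:

```python
from pydantic import BaseModel

class PlannerStep(BaseModel):
    scratchpad: str  # room to reason about which tool to use
    tool: str        # selected tool, decided before the query is written
    query: str       # query phrased for the selected tool
```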
We'd previously restricted the research planner to tools in the schema's
enum. This enum enforcement prevented the model from terminating research
by setting the tool field to empty.

Fix the issue by adding a text tool to the research tools enum and telling
the model to use it to terminate research and start its response instead.
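
A minimal sketch of the fix, with hypothetical tool names; the point is that a text tool in the enum gives the model a schema-legal way to stop researching:

```python
from enum import Enum

class ResearchTool(str, Enum):
    NOTES = "notes"
    ONLINE = "online"
    WEBPAGE = "webpage"
    CODE = "code"
    TEXT = "text"  # selecting this terminates research and starts the response

def should_terminate(selected_tool: ResearchTool) -> bool:
    return selected_tool == ResearchTool.TEXT
```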
…odels

Previously, Gemini 2 Flash and Flash Lite were using a 10K context window
by default, as no defaults had been added for them.

Increase the default context window for small commercial models from 60K
to 120K, since they are cheaper and faster than their pro-model equivalents
at 60K context.
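
A minimal sketch of per-model context window defaults; the numbers come from the PR description, while the structure and names are assumptions:

```python
DEFAULT_CONTEXT_WINDOW = {
    "gemini-2.0-flash": 120_000,       # previously fell back to 10K with no entry
    "gemini-2.0-flash-lite": 120_000,  # small commercial default doubled from 60K
}
FALLBACK_CONTEXT_WINDOW = 10_000

def context_window_for(model_name: str) -> int:
    return DEFAULT_CONTEXT_WINDOW.get(model_name, FALLBACK_CONTEXT_WINDOW)
```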
@debanjum debanjum force-pushed the reduce-repetition-by-gemini-and-add-more-knobs-to-evals-workflow branch from 9156a92 to 7f18bc0 on April 4, 2025 14:41
@debanjum debanjum added the upgrade New feature or request label Apr 4, 2025
@debanjum debanjum changed the title from "Improve research quality with Gemini. Improve evaluation harness" to "Improve response quality with Gemini. Improve evaluation harness" on Apr 4, 2025
@debanjum debanjum merged commit 751215a into master Apr 4, 2025
10 checks passed
@debanjum debanjum deleted the reduce-repetition-by-gemini-and-add-more-knobs-to-evals-workflow branch April 4, 2025 14:48