Talking to a 4-Year-Old: A Multilingual Benchmark for Children's AI Companions
A 2,312-prompt, 23-language benchmark for child–AI conversations that evaluates four production models and validates the LLM-as-judge pipeline with five independent judges (Cohen's κ up to 0.71).
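For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Cohen's κ could be computed for every pair of judges. It assumes that agreement is measured between pairs of judges on categorical labels; the summary does not say whether the benchmark does this or compares each judge against a reference. The function names, judge names, and labels are hypothetical and are not taken from the benchmark.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def pairwise_kappas(ratings):
    """Kappa for every unordered pair of judges in a {judge: labels} mapping."""
    return {
        (i, j): cohen_kappa(ratings[i], ratings[j])
        for i, j in combinations(sorted(ratings), 2)
    }


# Hypothetical pass/fail labels (1 = acceptable, 0 = not) from five judges.
example_ratings = {
    "judge_a": [1, 1, 1, 0, 0, 1],
    "judge_b": [1, 1, 0, 0, 0, 1],
    "judge_c": [1, 0, 1, 0, 1, 1],
    "judge_d": [1, 1, 1, 0, 0, 0],
    "judge_e": [0, 1, 1, 0, 0, 1],
}
kappas = pairwise_kappas(example_ratings)
best_pair = max(kappas, key=kappas.get)
```

With five judges there are ten unordered pairs, so a phrase such as "κ up to 0.71" would correspond to the maximum over `kappas.values()` under this assumed setup.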