GPT-5.5, Claude Opus 4.7, Kimi K2.6 and DeepSeek V4 benchmarks compared | Deep Research