code evaluation

I tested 9 code review tools to see which is best!

This video evaluates the top AI code review bots to see how well they handle modern full-stack application code, specifically ...

29:48

4,175 views

4 days ago

The AI Daily Brief: Artificial Intelligence News

Autoresearch, Agent Loops and the Future of Work

Andrej Karpathy's Autoresearch demonstrates autonomous agent loops: agents edit training code, run fixed five-minute ...

21:05

35,113 views

4 days ago

Vinh Nguyen

https://arxiv.org/pdf/2603.03823 SWE-CI: Evaluating Long-Term Code Maintainability via Continuous Integration Agents The ...

7:55

SWE-CI: Beyond Writing Code

70 views

6 days ago

The AI Automators

Anthropic Just Changed How Agents Call Tools. I Stole It for My Qwen3.5 Agent

Get ALL of our systems & join hundreds of AI builders in our community ...

17:41

77,801 views

7 days ago

AI Research Roundup

SWE-CI: New Benchmark for LLM Code Maintenance

In this AI Research Roundup episode, Alex discusses the paper: 'SWE-CI: Evaluating Agent Capabilities in Maintaining ...

4:33

49 views

6 days ago

IBM Technology

AI code security: Codex agents & crypto mining

Visit Mixture of Experts podcast page to get more AI content → https://ibm.biz/BdpqsM Can your AI agent hack its own evaluation?

49:32

2,500 views

1 day ago

DATA SCIENCE LOVERS

$25 Cost to Review your Code by "Code Review" | Launched by Anthropic-Claude #claude #anthropic

Anthropic has launched "Claude Code Review", a sophisticated tool designed to automate the evaluation of GitHub pull requests ...

2:45

28 views

1 day ago

Software Testing Automation

AI Generated Playwright Testing Code And Where It Fell Short

I'll show you exactly where AI-generated code fell short, and what human code review caught that the tests never would. What we ...

17:51

538 views

2 days ago

Python India

Evals First, Code Later: A Practical Guide to Evaluations, Rerankers & Caches - Saksham Aggarwal

Many retrieval-augmented generation (RAG) and code-search pipelines rely on ad-hoc checks and break when deployed at scale ...

29:49

3 views

6 hours ago

Intercom

Kesha Mykhailov: Teaching an LLM to Review Code Like a Senior Engineer | The Future of Frontend

AI is not a junior engineer. It's the most senior engineer without context. At The Future of Frontend in Dublin, Kesha Mykhailov from ...

37:03

218 views

6 days ago

Alex Hitt, The Great Discovery Pro

AutoResearch AI System: Karpathy’s Approach to Continuous Code Testing

The approach replaces sporadic maintenance with ongoing AI driven code evaluation that steadily refines performance and ...

5:23

177 views

2 days ago

Skill:RE

How to Evaluate AI: The 4-Step Framework for Reliable LLMs | Eval.QA | Learn AI Evaluation

Stop relying on a single metric to judge your AI. Most AI teams face a massive "evaluation blind spot." Your model might score ...

9:34

4 views

5 days ago

thecodertherapist

Your LLM is Lying to You — Here's How to Test It (RAGAs Framework)

Your LLM passed the demo. It failed production. Here's how to fix that. Most teams ship RAG pipelines with zero evaluation — no ...

5:49

134 views

7 days ago

Copilot Interview

I Interview 20 Engineers a Week — This AI Tool Scores Every Answer Instantly

... 0:21 — Real-time scoring: 5/10 with breakdown 0:30 — One-click evaluation report 0:38 — Screenshot code review 0:43 — The ...

1:28

7 views

7 days ago

Vinh Nguyen

[Podcast] The AI That Hacked Its Own Test

https://www.anthropic.com/engineering/eval-awareness-browsecomp Eval Awareness in Claude Opus 4.6 BrowseComp ...

36:49

2 views

7 days ago

PropShop Discount Hub

Tradeify Discount Code INVEST – 33% OFF Futures Prop Firm

Looking for a working Tradeify discount code? You can use code INVEST to receive 33% OFF your Tradeify futures trading ...

0:23

9 views

5 days ago

Microsoft Reactor

Model Mondays - AI Developer Experiences

Want to plan, build, evaluate, customize, and deploy your agentic AI solutions right from your IDE? The AI Toolkit accelerates ...

57:28

662 views

Streamed 5 days ago

The TWIML AI Podcast with Sam Charrington

Agent Swarms and Knowledge Graphs for Autonomous Software Development [Siddhant Pardeshi] - 763

In this episode, Sid Pardeshi, co-founder and CTO of Blitzy, joins us to discuss building autonomous development systems able to ...

1:15:45

766 views

3 days ago

Alex Hitt, The Great Discovery Pro

AutoResearch Explained: Autonomous AI Engineering With Deterministic Evaluation Loops

Autoresearch AI Experiment Framework https://github.com/karpathy/autoresearch AutoResearch at Home Distributed Agent ...

5:00

32 views

2 days ago

Mark P

9:29

Chap 04 - 05 evaluation

1 view

7 days ago
