ViewTube

652 results

Vinh Nguyen
SWE-CI: Beyond Writing Code
https://arxiv.org/pdf/2603.03823 SWE-CI: Evaluating Long-Term Code Maintainability via Continuous Integration Agents The ...
7:55 · 70 views · 6 days ago

Vinh Nguyen
The AI That Hacked Its Own Test
https://www.anthropic.com/engineering/eval-awareness-browsecomp Eval Awareness in Claude Opus 4.6 BrowseComp ...
6:41 · 57 views · 7 days ago

AI Research Roundup
SWE-CI: New Benchmark for LLM Code Maintenance
In this AI Research Roundup episode, Alex discusses the paper: 'SWE-CI: Evaluating Agent Capabilities in Maintaining ...
4:33 · 49 views · 6 days ago

The AI Automators
Anthropic Just Changed How Agents Call Tools. I Stole It for My Qwen3.5 Agent
Get ALL of our systems & join hundreds of AI builders in our community ...
17:41 · 77,770 views · 7 days ago

Software Testing Automation
AI Generated Playwright Testing Code And Where It Fell Short
I'll show you exactly where AI-generated code fell short, and what human code review caught that the tests never would. What we ...
17:51 · 541 views · 2 days ago

thecodertherapist
Your LLM is Lying to You — Here's How to Test It (RAGAs Framework)
Your LLM passed the demo. It failed production. Here's how to fix that. Most teams ship RAG pipelines with zero evaluation — no ...
5:49 · 134 views · 7 days ago

Skill:RE
How to Evaluate AI: The 4-Step Framework for Reliable LLMs | Eval.QA | Learn AI Evaluation
Stop relying on a single metric to judge your AI. Most AI teams face a massive "evaluation blind spot." Your model might score ...
9:34 · 4 views · 5 days ago

Mark P
Chap 04 - 05 evaluation
9:29 · 1 view · 7 days ago

AI Research Roundup
Training Better LLM Coding Critics with Rubrics
In this AI Research Roundup episode, Alex discusses the paper: 'A Rubric-Supervised Critic from Sparse Real-World Outcomes' ...
5:12 · 8 views · 7 days ago

Alex Hitt, The Great Discovery Pro
AutoResearch AI System: Karpathy’s Approach to Continuous Code Testing
The approach replaces sporadic maintenance with ongoing AI-driven code evaluation that steadily refines performance and ...
5:23 · 169 views · 2 days ago

Alex Hitt, The Great Discovery Pro
AutoResearch Explained: Autonomous AI Engineering With Deterministic Evaluation Loops
Autoresearch AI Experiment Framework https://github.com/karpathy/autoresearch AutoResearch at Home Distributed Agent ...
5:00 · 31 views · 2 days ago

QC Workflow
Can AI Answer CWI Exam Questions Correctly? (Surprising Results)
Could AI pass the CWI exam? I tested ChatGPT against ASME B31.3 code with interesting results. In this video, I test artificial ...
9:17 · 34 views · 4 days ago

Professor Py: AI Engineering
Build Golden Dataset for LLM Evals: Label Studio in Python
Your metrics aren't real until your labels are — build a labeling workflow that compounds into a reliable golden set for LLM ...
8:11 · 0 views · 7 days ago

Shankar Ganesan
Evaluation Orders for Syntax Directed Definitions
Evaluation Orders for Syntax Directed Definitions in Compiler Design, Evaluation Orders for Syntax Directed Definitions in ...
13:19 · 0 views · 6 days ago

AI Daily
Claude Artifacts Explained: Game-Changer or Overhyped?
Artifacts, projects, computer use, MCP, Claude Code. In this episode of AI Daily, we review Claude, Anthropic's powerful AI.
12:40 · 88 views · 7 days ago

AI Research Roundup
ZeroDayBench: Evaluating LLMs on Zero-Day Security
In this AI Research Roundup episode, Alex discusses the paper: 'ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day ...
4:54 · 7 views · 7 days ago

Alex Followell | AI Automation
Claude Code Just Got Scary Good at Building Skills!
Claude Code Skills 2.0 adds auto-evaluations and self-improvement to the skill creator. Anthropic's tests showed 100% accuracy ...
13:22 · 124 views · 1 day ago

Prompt Engineering
Opus just got caught ...
Anthropic just published a paper showing Claude Opus 4.6 figured out it was being tested on BrowseComp, found the encrypted ...
12:15 · 5,931 views · 3 days ago

Engineer Mindset
Machine Learning 05 | Introduction to Python for Machine Learning | Libraries | Design & Analysis
In this video you'll learn the fundamentals of Python for Machine Learning, important Python libraries used in ML, and how ...
7:53 · 5 views · 6 days ago

Revele
Demystifying EM Medical Coding - Revele
This educational presentation breaks down the simplified approach to Evaluation and Management (EM) coding under new ...
7:51 · 8 views · 21 hours ago