ViewTube

652 results

Vinh Nguyen
SWE-CI: Beyond Writing Code
https://arxiv.org/pdf/2603.03823 SWE-CI: Evaluating Long-Term Code Maintainability via Continuous Integration Agents The ...
7:55 · 70 views · 6 days ago

Vinh Nguyen
The AI That Hacked Its Own Test
https://www.anthropic.com/engineering/eval-awareness-browsecomp Eval Awareness in Claude Opus 4.6 BrowseComp ...
6:41 · 57 views · 7 days ago

AI Research Roundup
SWE-CI: New Benchmark for LLM Code Maintenance
In this AI Research Roundup episode, Alex discusses the paper: 'SWE-CI: Evaluating Agent Capabilities in Maintaining ...
4:33 · 49 views · 6 days ago

The AI Automators
Anthropic Just Changed How Agents Call Tools. I Stole It for My Qwen3.5 Agent
Get ALL of our systems & join hundreds of AI builders in our community ...
17:41 · 77,770 views · 7 days ago

Software Testing Automation
AI Generated Playwright Testing Code And Where It Fell Short
I'll show you exactly where AI-generated code fell short, and what human code review caught that the tests never would. What we ...
17:51 · 541 views · 2 days ago

thecodertherapist
Your LLM is Lying to You — Here's How to Test It (RAGAs Framework)
Your LLM passed the demo. It failed production. Here's how to fix that. Most teams ship RAG pipelines with zero evaluation — no ...
5:49 · 134 views · 7 days ago

Skill:RE
How to Evaluate AI: The 4-Step Framework for Reliable LLMs | Eval.QA | Learn AI Evaluation
Stop relying on a single metric to judge your AI. Most AI teams face a massive "evaluation blind spot." Your model might score ...
9:34 · 4 views · 5 days ago

Mark P
Chap 04 - 05 evaluation
9:29 · 1 view · 7 days ago

AI Research Roundup
Training Better LLM Coding Critics with Rubrics
In this AI Research Roundup episode, Alex discusses the paper: 'A Rubric-Supervised Critic from Sparse Real-World Outcomes' ...
5:12 · 8 views · 7 days ago

Alex Hitt, The Great Discovery Pro
AutoResearch AI System: Karpathy’s Approach to Continuous Code Testing
The approach replaces sporadic maintenance with ongoing AI-driven code evaluation that steadily refines performance and ...
5:23 · 169 views · 2 days ago

Alex Hitt, The Great Discovery Pro
AutoResearch Explained: Autonomous AI Engineering With Deterministic Evaluation Loops
Autoresearch AI Experiment Framework https://github.com/karpathy/autoresearch AutoResearch at Home Distributed Agent ...
5:00 · 31 views · 2 days ago

QC Workflow
Can AI Answer CWI Exam Questions Correctly? (Surprising Results)
Could AI pass the CWI exam? I tested ChatGPT against ASME B31.3 code with interesting results. In this video, I test artificial ...
9:17 · 34 views · 4 days ago

Professor Py: AI Engineering
Build Golden Dataset for LLM Evals: Label Studio in Python
Your metrics aren't real until your labels are — build a labeling workflow that compounds into a reliable golden set for LLM ...
8:11 · 0 views · 7 days ago

Shankar Ganesan
Evaluation Orders for Syntax Directed Definitions
Evaluation Orders for Syntax Directed Definitions in Compiler Design, Evaluation Orders for Syntax Directed Definitions in ...
13:19 · 0 views · 6 days ago

AI Daily
Claude Artifacts Explained: Game-Changer or Overhyped?
Artifacts, projects, computer use, MCP, Claude Code. In this episode of AI Daily, we review Claude, Anthropic's powerful AI.
12:40 · 88 views · 7 days ago

AI Research Roundup
ZeroDayBench: Evaluating LLMs on Zero-Day Security
In this AI Research Roundup episode, Alex discusses the paper: 'ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day ...
4:54 · 7 views · 7 days ago

Alex Followell | AI Automation
Claude Code Just Got Scary Good at Building Skills!
Claude Code Skills 2.0 adds auto-evaluations and self-improvement to the skill creator. Anthropic's tests showed 100% accuracy ...
13:22 · 124 views · 1 day ago

Prompt Engineering
Opus just got caught ...
Anthropic just published a paper showing Claude Opus 4.6 figured out it was being tested on BrowseComp, found the encrypted ...
12:15 · 5,931 views · 3 days ago

Engineer Mindset
Machine Learning 05 | Introduction to Python for Machine Learning | Libraries | Design & Analysis
In this video you'll learn the fundamentals of Python for Machine Learning, important Python libraries used in ML, and how ...
7:53 · 5 views · 6 days ago

Revele
Demystifying EM Medical Coding - Revele
This educational presentation breaks down the simplified approach to Evaluation and Management (EM) coding under new ...
7:51 · 8 views · 21 hours ago