Thoughtworks Research
Thoughtworks Research aims to advance the field of AI reliability by developing new theories, algorithms, methodologies, and tools. Our open-source libraries, accelerators, and research publications will empower you to build more reliable AI systems.
Q4 2025
This article introduces a Lie algebra framework for evaluating LLM-generated summaries, offering a mathematically grounded way to assess how well summaries preserve semantic structure and meaning.
Q2 2025
Uncertainty is inherent in AI systems. This article explores how models can quantify and manage uncertainty — including both statistical (aleatoric) and epistemic sources — using tools like Bayesian neural networks and dropout to improve reliability and trust.
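As a minimal sketch of the dropout-based idea mentioned above (Monte Carlo dropout), the snippet below keeps dropout active at inference time and treats the spread of repeated stochastic predictions as an epistemic uncertainty estimate. The toy linear model and data are illustrative, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(16, 1))   # weights of a toy linear "layer"
x = rng.normal(size=(1, 16))   # one input example

def mc_dropout_predict(x, W, p=0.5, n_samples=100):
    """Run many stochastic forward passes with dropout left on."""
    preds = []
    for _ in range(n_samples):
        mask = rng.random(W.shape) > p                  # drop weights at random
        preds.append((x @ (W * mask) / (1 - p)).item()) # inverted-dropout scaling
    preds = np.array(preds)
    # mean = point prediction, std = epistemic uncertainty estimate
    return preds.mean(), preds.std()

mean, std = mc_dropout_predict(x, W)
```

A larger `std` signals inputs the model is less sure about; in practice the same loop is wrapped around a real network with its dropout layers enabled.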
Drawing on recent keynotes from AI’s top minds, this article outlines the emerging frontiers in the field — from exploring alternatives to transformer models and evolving hardware, to agentic and physical AI, open-source momentum, and AI ‘factories.’
Q1 2025
Exploring how Semantic Entropy, a meaning-based uncertainty metric, offers a more reliable way to evaluate LLMs — especially for detecting confabulations — than traditional lexical or token-based measures.
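The core move behind semantic entropy can be sketched in a few lines: sample several answers, group them into meaning clusters, then compute entropy over clusters rather than over raw strings. The `cluster_fn` below is a stand-in exact-match on normalised text; the actual method uses bidirectional entailment to decide whether two answers mean the same thing.

```python
import math
from collections import Counter

def semantic_entropy(answers, cluster_fn=lambda a: a.strip().lower()):
    """Entropy over meaning clusters of sampled answers."""
    clusters = Counter(cluster_fn(a) for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

# Lexically different but semantically identical answers collapse into
# one cluster, so a confidently repeated answer scores low entropy.
consistent = ["Paris", "paris", "Paris "]      # one meaning cluster: low entropy
confabulated = ["Paris", "Lyon", "Marseille"]  # three clusters: high entropy
low = semantic_entropy(consistent)
high = semantic_entropy(confabulated)
```

High semantic entropy flags questions where the model's sampled answers disagree in meaning, a useful confabulation signal that token-level entropy misses.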
Q4 2024
Understanding benchmarks, evals, and tests in the context of LLMs — arguing that benchmarks compare models, evals probe real-world behavior, and tests validate system reliability.
Q4 2023
Examines structural and conceptual uncertainties in LLMs, offering methods to better predict model behavior and improve the reliability of generated responses.
Q3 2023
This article presents a surprisingly effective embedding-based method to estimate the importance of individual tokens in LLM prompts — a lightweight proxy for attribution that compares favorably to more complex techniques.
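One simple way to realise an embedding-based importance proxy is leave-one-out ablation: embed the full prompt, re-embed it with each token removed, and score a token by how far its removal moves the embedding. The hash-based `embed` below is a self-contained stand-in for a real sentence-embedding model, so only the ablation loop, not the embedder, reflects the technique.

```python
import hashlib
import numpy as np

def embed(tokens, dim=64):
    """Toy bag-of-hashed-words embedding, L2-normalised (stand-in only)."""
    v = np.zeros(dim)
    for t in tokens:
        h = int(hashlib.md5(t.encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def token_importance(tokens):
    """Score each token by the cosine distance its removal causes."""
    full = embed(tokens)
    scores = []
    for i in range(len(tokens)):
        ablated = embed(tokens[:i] + tokens[i + 1:])
        scores.append(1.0 - float(full @ ablated))  # cosine distance
    return scores

scores = token_importance("summarise the quarterly revenue report".split())
```

Because it needs only embedding calls, this proxy avoids the gradients or repeated model generations that heavier attribution methods require.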
Q3 2021
A gentle introduction to machine teaching, a paradigm that shifts focus from making models smarter to empowering experts to teach them more efficiently — easing the bottleneck of domain expertise in AI workflows.
Explains how probabilistic machine learning and weak supervision enable subject matter experts to collaboratively label data using heuristics, enhancing model performance through iterative refinement.
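The weak-supervision workflow can be sketched as experts writing small labelling heuristics ("labelling functions") whose noisy votes are combined into training labels. The combiner below is a plain majority vote for brevity; probabilistic systems instead learn each function's accuracy and correlations. The example functions and texts are hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a heuristic stays silent when it does not apply

def lf_keyword_spam(text):
    return "spam" if "win money" in text else ABSTAIN

def lf_many_exclamations(text):
    return "spam" if text.count("!") >= 3 else ABSTAIN

def lf_greeting(text):
    return "ham" if text.startswith("hi ") else ABSTAIN

def majority_label(text, lfs):
    """Combine non-abstaining votes by majority (probabilistic in practice)."""
    votes = [v for lf in lfs if (v := lf(text)) is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

lfs = [lf_keyword_spam, lf_many_exclamations, lf_greeting]
label = majority_label("win money now!!!", lfs)  # both spam heuristics fire
```

Iterative refinement then amounts to inspecting where the heuristics disagree or abstain and adding or tightening functions, which is far cheaper for experts than hand-labelling each example.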