
Turning up the heat: Min-p sampling for creative and coherent LLM outputs

Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. Popular sampling methods like top-p (nucleus sampling) often struggle to balance quality and diversity, especially at higher temperatures, which can lead to incoherent or repetitive outputs.

 

We propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model's confidence by using the top token's probability as a scaling factor. Our experiments on benchmarks including GPQA, GSM8K, and AlpacaEval Creative Writing show that min-p sampling improves both the quality and diversity of generated text across different model families (Mistral and Llama 3) and model sizes (1B to 123B parameters), especially at higher temperatures.
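To make the mechanism concrete, here is a minimal sketch of the min-p rule in PyTorch: tokens whose probability falls below min_p times the top token's probability are filtered out before sampling. The function name, the placeholder vocabulary size, and applying temperature before the filter are illustrative assumptions, not the reference implementation in any particular framework.

```python
import torch

def min_p_filter(logits: torch.Tensor, min_p: float = 0.1) -> torch.Tensor:
    """Keep only tokens whose probability is at least min_p times the
    probability of the most likely token; mask out everything else."""
    probs = torch.softmax(logits, dim=-1)
    p_max = probs.max(dim=-1, keepdim=True).values
    threshold = min_p * p_max                      # cutoff scales with model confidence
    return logits.masked_fill(probs < threshold, float("-inf"))

# Example: apply temperature, filter with min-p, then sample the next token.
logits = torch.randn(1, 32_000)                    # placeholder logits over a 32k vocabulary
temperature = 1.5
filtered = min_p_filter(logits / temperature, min_p=0.1)
next_token = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)
```

When the model is confident (a high top-token probability), the cutoff rises and few tokens survive; when it is uncertain, the cutoff falls and more candidates stay in play, which is what lets min-p stay coherent at high temperatures.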

 

Human evaluations further show a clear preference for min-p sampling, in both text quality and creativity. Min-p sampling has been adopted by popular open-source LLM frameworks, including Hugging Face Transformers, vLLM, and many others, highlighting its considerable impact on improving text generation quality.

 

Research submission here.

P-less Sampling: A robust hyperparameter-free approach for LLM decoding

LLMs traditionally generate tokens by using one or more hyperparameters to filter a subset of tokens for sampling, as in methods like top-p and top-k. However, there is little principled guidance on choosing reliable hyperparameter values beyond empirical tuning or practitioner convention.

 

To address this, we ground P-less in probability and statistics rather than arbitrary hyperparameter values, and propose it as a hyperparameter-free, information-theoretic method for decoding LLM outputs reliably.

 

Experiments show our method achieves competitive or best generation quality compared with other decoding methods on math and logical-reasoning datasets such as GSM8K and creative-writing datasets such as Alpaca. These results provide empirical evidence that P-less is a reliable approach to LLM output generation.

 

Research submission here.

Beyond “I am sorry, I can’t”: dissecting large language model refusal

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. 

 

Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. 
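To make the first two stages concrete, the sketch below shows one plausible implementation: a difference-of-means refusal direction, cosine similarity against SAE decoder vectors to collect candidate features, and a greedy pass that drops features whose removal still flips refusal to compliance. The scoring choices and the flips_to_compliance helper are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Stage 1 (sketch): a refusal-mediating direction taken as the difference of
    mean residual-stream activations on harmful vs. harmless prompts (an assumed
    construction; the paper only states that such a direction is found)."""
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return d / d.norm()

def candidate_features(sae_decoder: torch.Tensor, direction: torch.Tensor, top_n: int = 50) -> list[int]:
    """Collect SAE features whose decoder vectors lie near the refusal direction."""
    sims = torch.nn.functional.cosine_similarity(sae_decoder, direction.unsqueeze(0), dim=-1)
    return sims.topk(top_n).indices.tolist()

def greedy_filter(features: list[int], flips_to_compliance) -> list[int]:
    """Stage 2 (sketch): greedily drop features whose removal from the ablation
    set still flips the model from refusal to compliance.
    `flips_to_compliance(feature_set) -> bool` is a hypothetical helper that
    ablates the given SAE features and checks the model's response."""
    minimal = list(features)
    for f in list(features):
        trial = [x for x in minimal if x != f]
        if trial and flips_to_compliance(trial):
            minimal = trial
    return minimal
```

Stage (3), the factorization machine over the surviving features, would then be fit on which feature combinations actually produce the flip.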

 

This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.

 

Research submission here.

Distribution-aware feature selection for SAEs

TopK Sparse Autoencoders (SAEs) break down neural activations into understandable features, but they're inefficient because they reconstruct each token using a fixed number of the most active features. A more advanced version, BatchTopK, improves this by selecting the most active features across an entire batch of tokens. However, this can lead to an "activation lottery," where a few very high-magnitude features dominate the selection process, potentially at the expense of other more informative features that have a lower magnitude.

 

We have developed a new approach called Sampled-SAE. This technique works by first scoring all the potential features in a batch of data. It then creates a smaller, curated "candidate pool" of the best features, which it selects from. The size of this pool is controlled by a new parameter, l. By adjusting l, researchers can find a balance between using globally important features and using more specific, rare ones. For example, a small l forces the model to use only the most globally consistent features, while a large l allows for a wider variety of fine-grained features. This makes Sampled-SAE a more flexible, tunable approach that can be adjusted based on the specific trade-offs needed for a given task, such as prioritizing a model's performance over its interpretability.
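To illustrate the selection step, the sketch below scores features over the whole batch, keeps a candidate pool of size l, and then performs ordinary per-token top-k selection inside that pool. The scoring rule (summed positive activation mass) and the tensor shapes are assumptions; the paper's exact pool-construction and sampling scheme may differ.

```python
import torch

def sampled_sae_select(pre_acts: torch.Tensor, k: int, pool_size: int) -> torch.Tensor:
    """Sketch of pool-restricted feature selection.

    pre_acts:  (num_tokens, num_features) SAE encoder pre-activations for a batch.
    pool_size: the parameter `l` from the text; how many globally scored
               features remain eligible for per-token selection.
    Returns a sparse activation tensor where each token keeps at most k
    features, all drawn from the batch-level candidate pool.
    """
    # Score every feature over the batch (assumed rule: summed positive activation mass).
    feature_scores = pre_acts.clamp(min=0).sum(dim=0)
    pool = feature_scores.topk(pool_size).indices          # candidate pool of size l

    # Mask out features outside the pool, then do ordinary per-token top-k.
    masked = torch.full_like(pre_acts, float("-inf"))
    masked[:, pool] = pre_acts[:, pool]
    topk_vals, topk_idx = masked.topk(k, dim=-1)

    sparse = torch.zeros_like(pre_acts)
    sparse.scatter_(1, topk_idx, topk_vals.clamp(min=0))
    return sparse
```

A small pool_size recovers something close to a globally consistent feature set, while a large pool_size approaches per-token TopK behaviour, which is the trade-off the text describes.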

Research submission here.

Towards transparent AI grading: Entropy as a signal for human-AI disagreement

Automated grading systems can quickly score short-answer questions, but they don’t always signal when a decision is uncertain or controversial. This work proposes a method called “semantic entropy” that measures how much GPT-4’s explanations for the same student response differ, especially in cases where human graders disagree. The method groups similar explanations and calculates how diverse these groups are, rather than looking only at the final scores. Three research questions are addressed:

 

  1. Does semantic entropy align with human grader disagreement? 

  2. Does it generalize across academic subjects? 

  3. Is it sensitive to structural task features such as source dependency?

 

Experiments on the ASAP-SAS dataset show that semantic entropy correlates with rater disagreement, varies meaningfully across subjects, and increases in tasks requiring interpretive reasoning. These results underscore semantic entropy’s potential as a domain- and task-sensitive signal for triaging ambiguous or contentious grading cases in educational settings.
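To make the idea concrete, one plausible way to compute such a semantic entropy score is sketched below: embed the GPT-4 explanations for a response, group near-duplicates by similarity, and take the entropy of the resulting cluster-size distribution. The embedding model, similarity threshold, and greedy grouping rule are assumptions, not necessarily the paper's setup.

```python
import math
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def semantic_entropy(explanations: list[str], threshold: float = 0.8) -> float:
    """Sketch: entropy of explanation clusters formed by greedy similarity grouping."""
    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
    emb = model.encode(explanations, convert_to_tensor=True)

    clusters: list[list[int]] = []
    for i in range(len(explanations)):
        for cluster in clusters:
            if cos_sim(emb[i], emb[cluster[0]]).item() >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])

    # Entropy of the cluster-size distribution: higher = more semantically diverse explanations.
    n = len(explanations)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
```

A high score means GPT-4's explanations scatter across many distinct groups, which is the signal the work proposes for flagging ambiguous or contentious grading cases.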

 

Research submission here.