knowledge-distillation-kernel-v1

Version: 1.0.0

Knowledge distillation kernel — temperature-scaled KL divergence + cross-entropy

References

Hinton et al. (2015) Distilling the Knowledge in a Neural Network
Ba & Caruana (2014) Do Deep Nets Really Need to be Deep?

Dependencies

Dependency Graph

graph LR
    knowledge_distillation_kernel_v1["knowledge-distillation-kernel-v1"] --> softmax_kernel_v1["softmax-kernel-v1"]
    knowledge_distillation_kernel_v1["knowledge-distillation-kernel-v1"] --> cross_entropy_kernel_v1["cross-entropy-kernel-v1"]

Equations

kd_loss

$$ L_KD = alpha * KL(softmax(z_t/T) || softmax(z_s/T)) * T^2 + (1-alpha) * CE(y, z_s) $$

Domain: $z_t, z_s in R^V, T > 0, alpha in [0,1]$

Codomain: $L_KD in [0, +inf)$

Invariants:

$L_KD >= 0 (non-negativity from KL and CE non-negativity)$
$alpha=0 => L_KD = CE(y, z_s) (pure hard label)$
$alpha=1 => L_KD = T^2 * KL(teacher || student) (pure soft label)$

kl_divergence

$$ KL(P || Q) = sum_i P(i) * \log(P(i) / Q(i)) $$

Domain: $P, Q valid probability distributions over V classes$

Codomain: $KL in [0, +inf)$

Invariants:

$KL(P || Q) >= 0 (Gibbs inequality)$
$KL(P || P) = 0 (identity)$

temperature_softmax

$$ softmax(z/T)_i = \exp(z_i/T) / sum_j \exp(z_j/T) $$

Domain: $z in R^V, T > 0$

Codomain: $softmax in (0, 1)^V, sum = 1$

Invariants:

$All outputs strictly positive$
$Outputs sum to 1$
$T -> inf => uniform distribution$
$T -> 0 => one-hot on argmax$

Proof Obligations

#	Type	Property	Formal
1	invariant	KL non-negativity	$KL(P \|\| Q) >= 0 for all valid P, Q$
2	bound	Temperature scaling produces valid distribution	$softmax(z/T)_i > 0 and sum_i softmax(z/T)_i = 1 for T > 0$
3	invariant	Alpha interpolation bound	$alpha=0 => L_KD = CE; alpha=1 => L_KD = T^2 * KL$
4	equivalence	Gradient correctness	$analytical gradient matches numerical gradient within 1e-4$
5	invariant	T^2 gradient compensation	$gradient magnitude approximately constant across T in [1, 10]$
6	equivalence	SIMD matches scalar within ULP

Kernel Phases

teacher_softmax: Compute softmax(z_t / T) — teacher soft targets — output is valid probability distribution
student_softmax: Compute softmax(z_s / T) — student soft predictions — output is valid probability distribution
kl_divergence: Compute KL(teacher || student) — result >= 0
cross_entropy: Compute CE(y, z_s) — hard label loss — result >= 0
combine: Combine: alpha * T^2 * KL + (1-alpha) * CE — result >= 0

Falsification Tests

ID	Rule	Prediction	If Fails
FALSIFY-KD-001	KL non-negativity	KL(teacher \|\| student) >= 0 for all batches	Log-domain computation error or softmax numerical instability
FALSIFY-KD-002	Temperature boundary	softmax(z/T) approaches uniform as T -> inf	Overflow in exp(z/T) for small T or large z
FALSIFY-KD-003	Alpha boundary conditions	alpha=0 => KD loss equals CE loss exactly	Alpha interpolation not applied correctly
FALSIFY-KD-004	Gradient correctness	Analytical gradient matches finite-difference within 1e-4	Derivative of KL or CE computed incorrectly
FALSIFY-KD-005	Distillation value	albor-distill avg benchmark > albor-base avg benchmark	Teacher logits corrupted, T too high/low, or alpha miscalibrated

Kani Harnesses

ID	Obligation	Bound	Strategy
KANI-KD-001	KD-INV-001	8	stub_float
KANI-KD-002	KD-INV-002	8	stub_float

QA Gate

Knowledge Distillation Contract (F-KD-001)

KD loss correctness for Albor distillation pipeline

Checks: kl_non_negativity, temperature_validity, alpha_interpolation, gradient_correctness

Pass criteria: All 5 falsification tests pass + 2 Kani harnesses verify

Keyboard shortcuts

Albor LLM Specification