Case Study: Code Feature Extraction for Defect Prediction

Extract 8-dimensional feature vectors from code commits for defect prediction, based on D'Ambros et al. (2012) benchmark methodology.

Quick Start

use aprender::synthetic::code_features::{
    CodeFeatureExtractor, CommitFeatures, CommitDiff
};

let extractor = CodeFeatureExtractor::new();

let diff = CommitDiff::new()
    .with_files_changed(3)
    .with_lines_added(150)
    .with_lines_deleted(50)
    .with_timestamp(1700000000)
    .with_message("fix: resolve memory leak");

let features = extractor.extract(&diff);

// 8-dimensional feature vector
let vector = features.to_vec();
assert_eq!(vector.len(), 8);

The 8-Dimensional Feature Vector

CommitFeatures contains standardized metrics for ML pipelines:

Index	Field	Type	Description
0	`defect_category`	u8	Predicted defect type (0-4)
1	`files_changed`	f32	Number of modified files
2	`lines_added`	f32	Lines of code added
3	`lines_deleted`	f32	Lines of code removed
4	`complexity_delta`	f32	Estimated complexity change
5	`timestamp`	f64	Unix timestamp
6	`hour_of_day`	u8	Hour (0-23 UTC)
7	`day_of_week`	u8	Day (0=Sunday, 6=Saturday)

Defect Classification

The extractor automatically classifies commits based on message keywords:

Category	Value	Keywords
Clean/Unknown	0	(no matches)
Bug Fix	1	fix, bug, error, crash, fault, defect, problem, wrong, broken, fail
Security	2	security, vulnerability, cve, exploit, injection, xss, csrf, auth
Performance	3	performance, perf, optimize, speed, fast, slow, memory, cache
Refactoring	4	refactor, clean, rename, move, reorganize, restructure, simplify

Priority Order

Security > Bug > Performance > Refactor > Clean

// Message contains both "security" and "bug"
let diff = CommitDiff::new()
    .with_message("fix security vulnerability bug");

let features = extractor.extract(&diff);
assert_eq!(features.defect_category, 2);  // Security takes priority

Complexity Estimation

Complexity delta is estimated from line changes:

complexity_delta = (lines_added - lines_deleted) / complexity_factor

Default complexity_factor = 10.0 (approximately 10 lines per complexity point).

let extractor = CodeFeatureExtractor::new()
    .with_complexity_factor(10.0);

let diff = CommitDiff::new()
    .with_lines_added(100)
    .with_lines_deleted(20);

let features = extractor.extract(&diff);
// (100 - 20) / 10 = 8.0
assert!((features.complexity_delta - 8.0).abs() < f32::EPSILON);

Time-Based Features

Extracts temporal patterns from Unix timestamps:

// 1700000000 = Tuesday, November 14, 2023 22:13:20 UTC
let diff = CommitDiff::new()
    .with_timestamp(1700000000);

let features = extractor.extract(&diff);
assert_eq!(features.hour_of_day, 22);   // 10 PM UTC
assert_eq!(features.day_of_week, 2);    // Tuesday

Why time matters for defect prediction:

Late-night commits (hour 22-4) correlate with higher defect rates
Friday commits show higher bug introduction rates
These patterns help ML models learn temporal risk factors

Batch Processing

Extract features from multiple commits efficiently:

let diffs = vec![
    CommitDiff::new()
        .with_files_changed(1)
        .with_message("feat: add login"),
    CommitDiff::new()
        .with_files_changed(5)
        .with_message("fix: null pointer crash"),
    CommitDiff::new()
        .with_files_changed(2)
        .with_message("refactor: clean utils"),
];

let features = extractor.extract_batch(&diffs);
assert_eq!(features.len(), 3);
assert_eq!(features[1].defect_category, 1);  // Bug fix

Feature Normalization

Normalize features for ML pipelines using dataset statistics:

use aprender::synthetic::code_features::FeatureStats;

// Collect statistics from training data
let all_features = extractor.extract_batch(&training_diffs);
let stats = FeatureStats::from_features(&all_features);

// Normalize new features to [0, 1]
let normalized = extractor.normalize(&features, &stats);

FeatureStats

pub struct FeatureStats {
    pub files_changed_max: f32,
    pub lines_added_max: f32,
    pub lines_deleted_max: f32,
    pub complexity_max: f32,
}

Derived Metrics

Churn

Total lines modified (useful for change-proneness analysis):

let features = CommitFeatures {
    lines_added: 100.0,
    lines_deleted: 50.0,
    ..Default::default()
};

let churn = features.churn();        // 150.0
let net = features.net_change();     // 50.0

Fix Detection

Check if commit is a bug fix:

if features.is_fix() {
    println!("This commit fixes a bug");
}

Custom Keywords

Extend keyword sets for domain-specific classification:

let mut extractor = CodeFeatureExtractor::new();

// Add custom bug keywords
extractor.add_bug_keywords(&["glitch", "oops", "typo"]);

// Add custom security keywords
extractor.add_security_keywords(&["hack", "breach", "leak"]);

Integration with aprender-shell

The aprender-shell CLI includes an analyze command:

# Analyze recent commits
aprender-shell analyze

# Output:
# Commit Analysis (last 10 commits):
#   abc123: [BUG] fix: resolve null pointer (churn: 45)
#   def456: [CLEAN] feat: add dashboard (churn: 230)
#   ghi789: [PERF] optimize: cache queries (churn: 12)

ML Pipeline Example

Train a defect predictor using extracted features:

use aprender::classification::LogisticRegression;

// Extract features from historical commits
let features: Vec<Vec<f32>> = commits
    .iter()
    .map(|c| extractor.extract(c).to_vec())
    .collect();

// Labels: 1 = introduced defect, 0 = clean
let labels: Vec<f32> = commits
    .iter()
    .map(|c| if c.had_defect { 1.0 } else { 0.0 })
    .collect();

// Train classifier
let mut model = LogisticRegression::default();
model.fit(&features, &labels)?;

// Predict defect probability for new commit
let new_features = extractor.extract(&new_commit).to_vec();
let defect_prob = model.predict_proba(&[new_features])?;

Use Cases

1. CI/CD Risk Scoring

Flag high-risk commits before merge:

fn risk_score(features: &CommitFeatures) -> f32 {
    let mut score = 0.0;

    // Large changes are riskier
    if features.files_changed > 10.0 { score += 0.2; }
    if features.churn() > 500.0 { score += 0.3; }

    // Late-night commits
    if features.hour_of_day >= 22 || features.hour_of_day <= 4 {
        score += 0.15;
    }

    // Friday commits
    if features.day_of_week == 5 { score += 0.1; }

    // Bug fixes might introduce new bugs
    if features.is_fix() { score += 0.1; }

    score.min(1.0)
}

2. Developer Analytics

Track individual developer patterns:

let dev_commits: Vec<CommitFeatures> = /* ... */;

let avg_churn = dev_commits.iter()
    .map(|f| f.churn())
    .sum::<f32>() / dev_commits.len() as f32;

let fix_rate = dev_commits.iter()
    .filter(|f| f.is_fix())
    .count() as f32 / dev_commits.len() as f32;

println!("Avg churn: {:.0} lines, Fix rate: {:.1}%",
    avg_churn, fix_rate * 100.0);

3. Technical Debt Tracking

Monitor complexity growth over time:

let weekly_delta: f32 = week_commits
    .iter()
    .map(|f| f.complexity_delta)
    .sum();

if weekly_delta > 50.0 {
    println!("Warning: Significant complexity increase this week");
}

Performance

Operation	Complexity	Throughput
Single extraction	O(m)	~1M commits/sec
Batch extraction	O(n*m)	~500K commits/sec
Normalization	O(1)	~10M/sec

Where m = message length, n = batch size.

References

D'Ambros et al. (2012). "Evaluating Defect Prediction Approaches: A Benchmark and an Extensive Comparison"
Mockus & Votta (2000). "Identifying Reasons for Software Changes Using Historic Databases"
Hassan (2009). "Predicting Faults Using the Complexity of Code Changes"

EXTREME TDD - The Aprender Guide to Zero-Defect Machine Learning