Case Study: Code-Aware EDA (Easy Data Augmentation)
Syntax-aware data augmentation for source code, preserving semantic validity while generating diverse training samples.
Quick Start
use aprender::synthetic::code_eda::{CodeEda, CodeEdaConfig, CodeLanguage};
use aprender::synthetic::{SyntheticGenerator, SyntheticConfig};
// Configure for Rust code
let config = CodeEdaConfig::default()
.with_language(CodeLanguage::Rust)
.with_rename_prob(0.15)
.with_comment_prob(0.1);
let generator = CodeEda::new(config);
// Augment code samples
let seeds = vec![
"let x = 42;\nprintln!(\"{}\", x);".to_string(),
];
let synth_config = SyntheticConfig::default()
.with_augmentation_ratio(2.0)
.with_quality_threshold(0.3)
.with_seed(42);
let augmented = generator.generate(&seeds, &synth_config)?;
Why Code-Specific Augmentation?
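Traditional EDA (Wei & Zou, 2019) was designed for natural language. Applied to code, its random edits can produce programs that no longer parse, so Code EDA replaces each text operation with a syntax-preserving counterpart: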
| Text EDA | Code EDA |
|---|---|
| Random word swap | Preserves syntax |
| Synonym replacement | Variable renaming |
| Random deletion | Dead code removal |
| Random insertion | Comment insertion |
Key difference: code has structure. x = 1; y = 2; can become y = 2; x = 1; only if the statements are independent, that is, neither one reads a value the other writes.
Augmentation Operations
1. Variable Renaming (VR)
Replace identifiers with semantic synonyms:
// Original
let x = calculate();
let i = 0;
let buf = Vec::new();
// Augmented
let value = calculate(); // x → value
let index = 0; // i → index
let buffer = Vec::new(); // buf → buffer
Built-in synonym mappings:
| Original | Alternatives |
|---|---|
| x | value, val |
| y | result, res |
| i | index, idx |
| j | inner, jdx |
| n | count, num |
| tmp | temp, scratch |
| buf | buffer, data |
| len | length, size |
| err | error, e |
Reserved keywords are never renamed:
- Rust: let, mut, fn, impl, struct, enum, trait, etc.
- Python: def, class, import, if, for, while, etc.
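The renaming rule can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `rename_identifiers`, `resolve`, the shortened `RUST_KEYWORDS` list, and the `pick` parameter (a stand-in for a seeded random draw) are all hypothetical names. It renames every matching identifier, whereas the real operation applies per-identifier with the configured rename probability, and it does not skip string literals.

```rust
use std::collections::HashMap;

/// Shortened keyword list for illustration only (assumption: the real list is longer).
const RUST_KEYWORDS: &[&str] = &["let", "mut", "fn", "impl", "struct", "enum", "trait"];

/// Map one identifier to a synonym, leaving keywords and unknown names untouched.
fn resolve(ident: &str, synonyms: &HashMap<&str, Vec<&str>>, pick: usize) -> String {
    if RUST_KEYWORDS.contains(&ident) {
        return ident.to_string();
    }
    match synonyms.get(ident) {
        Some(alts) if !alts.is_empty() => alts[pick % alts.len()].to_string(),
        _ => ident.to_string(),
    }
}

/// Rename whole identifiers only, so `x` is replaced but `max` is not.
pub fn rename_identifiers(
    code: &str,
    synonyms: &HashMap<&str, Vec<&str>>,
    pick: usize,
) -> String {
    let mut out = String::with_capacity(code.len());
    let mut ident = String::new();
    for ch in code.chars() {
        if ch.is_alphanumeric() || ch == '_' {
            // Still inside an identifier (or number): keep collecting.
            ident.push(ch);
        } else {
            if !ident.is_empty() {
                out.push_str(&resolve(&ident, synonyms, pick));
                ident.clear();
            }
            out.push(ch);
        }
    }
    if !ident.is_empty() {
        out.push_str(&resolve(&ident, synonyms, pick));
    }
    out
}
```

Matching on whole identifiers rather than substrings is what keeps names such as max or index intact when x is in the synonym table.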
2. Comment Insertion (CI)
Add language-appropriate comments:
// Rust
let x = 42;
// TODO: review ← inserted
let y = x + 1;
# Python
x = 42
# NOTE: temp ← inserted
y = x + 1
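As a rough sketch of how the comment prefix can depend on the language, the following standalone code inserts a comment line after a given line index. `Lang`, `comment_prefix`, and `insert_comment` are hypothetical names for illustration; the library chooses the position and comment text itself, using its seeded generator and the configured comment probability.

```rust
/// Two of the languages listed in this document, redeclared so the sketch stands alone.
#[derive(Clone, Copy)]
pub enum Lang {
    Rust,
    Python,
}

/// Line-comment prefix for each language.
fn comment_prefix(lang: Lang) -> &'static str {
    match lang {
        Lang::Rust => "//",
        Lang::Python => "#",
    }
}

/// Insert `text` as a comment line directly after line `after_line` (0-based).
/// Out-of-range positions append the comment at the end.
pub fn insert_comment(code: &str, lang: Lang, after_line: usize, text: &str) -> String {
    let mut lines: Vec<String> = code.lines().map(str::to_string).collect();
    let at = (after_line + 1).min(lines.len());
    lines.insert(at, format!("{} {}", comment_prefix(lang), text));
    lines.join("\n")
}
```

Because a comment on its own line never changes how the surrounding statements parse, this operation is safe for any line position.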
3. Statement Reorder (SR)
Swap adjacent independent statements:
// Original
let a = 1;
let b = 2;
let c = 3;
// Augmented (swap a,b)
let b = 2;
let a = 1;
let c = 3;
Delimiter detection:
- Rust: semicolons (;)
- Python: newlines (\n)
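One conservative way to decide independence for semicolon-delimited code is sketched below. It is an assumption-laden illustration, not the library's algorithm: `bound_name`, `mentions`, `independent`, and `swap_adjacent` are hypothetical helpers, only simple let bindings are considered swappable, and splitting on ; ignores semicolons inside string literals or nested blocks.

```rust
/// Name bound by a simple `let name = ...` or `let mut name = ...` statement, if any.
fn bound_name(stmt: &str) -> Option<&str> {
    let rest = stmt.trim().strip_prefix("let ")?;
    let rest = rest.trim_start();
    let rest = rest.strip_prefix("mut ").unwrap_or(rest);
    let end = rest
        .find(|c: char| !(c.is_alphanumeric() || c == '_'))
        .unwrap_or(rest.len());
    if end == 0 { None } else { Some(&rest[..end]) }
}

/// True if `name` occurs in `stmt` as a whole identifier.
fn mentions(stmt: &str, name: &str) -> bool {
    stmt.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .any(|tok| tok == name)
}

/// Conservative check: both are `let` bindings and neither mentions the other's name.
fn independent(a: &str, b: &str) -> bool {
    match (bound_name(a), bound_name(b)) {
        (Some(na), Some(nb)) => !mentions(b, na) && !mentions(a, nb),
        _ => false,
    }
}

/// Swap statements `i` and `i + 1`; returns None when the swap is out of range
/// or the statements do not look independent.
pub fn swap_adjacent(code: &str, i: usize) -> Option<String> {
    let mut stmts: Vec<&str> = code
        .split(';')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect();
    if i + 1 >= stmts.len() || !independent(stmts[i], stmts[i + 1]) {
        return None;
    }
    stmts.swap(i, i + 1);
    let lines: Vec<String> = stmts.iter().map(|s| format!("{};", s)).collect();
    Some(lines.join("\n"))
}
```

Refusing the swap whenever independence cannot be shown trades some diversity for the guarantee that a dependent pair such as let x = 1; let y = x + 1; is never reordered.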
4. Dead Code Removal (DCR)
Remove comments and collapse whitespace:
// Original
let x = 1; // important value
let y = 2; /* temp */
// Augmented
let x = 1;
let y = 2;
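A minimal sketch of comment stripping for C-style syntax follows. `strip_comments` is a hypothetical name, and the sketch makes simplifying assumptions: it does not recognise string literals (so a // inside a string would be cut) and it does not handle Rust's nested block comments.

```rust
/// Remove `//` line comments and `/* ... */` block comments, then trim
/// trailing whitespace and drop lines that became empty.
pub fn strip_comments(code: &str) -> String {
    let mut out = String::with_capacity(code.len());
    let mut chars = code.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '/' && chars.peek() == Some(&'/') {
            // Line comment: skip up to, but not including, the newline.
            while let Some(&n) = chars.peek() {
                if n == '\n' {
                    break;
                }
                chars.next();
            }
        } else if c == '/' && chars.peek() == Some(&'*') {
            chars.next(); // consume '*'
            let mut prev = '\0';
            // Block comment: skip until the closing "*/".
            while let Some(n) = chars.next() {
                if prev == '*' && n == '/' {
                    break;
                }
                prev = n;
            }
        } else {
            out.push(c);
        }
    }
    out.lines()
        .map(str::trim_end)
        .filter(|line| !line.trim().is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}
```

Applied to the original snippet above, this yields the two bare let statements shown as the augmented output.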
Configuration
CodeEdaConfig
CodeEdaConfig::default()
.with_rename_prob(0.15) // Variable rename probability
.with_comment_prob(0.1) // Comment insertion probability
.with_reorder_prob(0.05) // Statement reorder probability
.with_remove_prob(0.1) // Dead code removal probability
.with_num_augments(4) // Augmentations per input
.with_min_tokens(5) // Skip short code
.with_language(CodeLanguage::Rust)
Supported Languages
pub enum CodeLanguage {
Rust, // Full syntax awareness
Python, // Full syntax awareness
Generic, // Language-agnostic operations only
}
Quality Metrics
Token Overlap
Approximates semantic preservation as the Jaccard similarity of the two token sets (shared tokens divided by all distinct tokens):
let generator = CodeEda::new(CodeEdaConfig::default());
let original = "let x = 42;";
let augmented = "let value = 42;";
let overlap = generator.token_overlap(original, augmented);
// Shared tokens: let, =, 42, ; (4 of 6 distinct tokens, so Jaccard ≈ 0.67)
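For reference, Jaccard similarity over token sets can be computed as below. This is a standalone sketch: `tokenize` and `jaccard` are hypothetical names, and the tokenizer (identifier or number runs, plus one token per punctuation character) is an assumption, so the library's own tokenizer may produce slightly different values.

```rust
use std::collections::HashSet;

/// Split code into identifier/number runs and single punctuation characters.
pub fn tokenize(code: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    for ch in code.chars() {
        if ch.is_alphanumeric() || ch == '_' {
            word.push(ch);
        } else {
            if !word.is_empty() {
                tokens.push(std::mem::take(&mut word));
            }
            if !ch.is_whitespace() {
                tokens.push(ch.to_string());
            }
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }
    tokens
}

/// |A ∩ B| / |A ∪ B| over the two token sets; two empty inputs count as identical.
pub fn jaccard(a: &str, b: &str) -> f64 {
    let sa: HashSet<String> = tokenize(a).into_iter().collect();
    let sb: HashSet<String> = tokenize(b).into_iter().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 1.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}
```

Under this tokenizer, "let x = 42;" and "let value = 42;" share 4 of 6 distinct tokens, which gives 4/6 ≈ 0.67.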
Quality Score
Penalizes extremes (too similar or too different):
| Overlap | Quality | Interpretation |
|---|---|---|
| > 0.95 | 0.5 | Too similar, little augmentation |
| 0.3-0.95 | overlap | Good augmentation |
| < 0.3 | 0.3 | Too different, likely corrupted |
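The table translates directly into a piecewise function. In this sketch, `quality_score` is a hypothetical free function, and treating the boundaries 0.3 and 0.95 as part of the middle band is an assumption, since the table does not say which side they belong to.

```rust
/// Map token overlap to a quality score following the table above.
pub fn quality_score(overlap: f64) -> f64 {
    if overlap > 0.95 {
        0.5 // too similar: little augmentation happened
    } else if overlap < 0.3 {
        0.3 // too different: likely corrupted
    } else {
        overlap // good augmentation: score equals overlap
    }
}
```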
Diversity Score
Measures batch diversity (inverse of average pairwise overlap):
let batch = vec![
"let x = 1;".to_string(),
"fn foo() {}".to_string(),
];
let diversity = generator.diversity_score(&batch);
// diversity > 0.5 (different code patterns)
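One plausible reading of "inverse of average pairwise overlap" is 1 minus the mean pairwise Jaccard overlap, sketched below. That reading, the whitespace tokenization, the value returned for batches with fewer than two samples, and the names `overlap` and `diversity_score` as free functions are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Jaccard overlap over whitespace-separated tokens (simplified tokenizer).
fn overlap(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.split_whitespace().collect();
    let sb: HashSet<&str> = b.split_whitespace().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        1.0
    } else {
        sa.intersection(&sb).count() as f64 / union as f64
    }
}

/// 1 - mean pairwise overlap; returns 0.0 when there are no pairs to compare.
pub fn diversity_score(batch: &[String]) -> f64 {
    let mut total = 0.0;
    let mut pairs = 0usize;
    for i in 0..batch.len() {
        for j in (i + 1)..batch.len() {
            total += overlap(&batch[i], &batch[j]);
            pairs += 1;
        }
    }
    if pairs == 0 {
        0.0
    } else {
        1.0 - total / pairs as f64
    }
}
```

The pairwise loop is quadratic in batch size, which is worth keeping in mind when scoring very large batches.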
Integration with aprender-shell
The aprender-shell CLI supports CodeEDA for shell command augmentation:
# Train with code-aware augmentation
aprender-shell augment --use-code-eda
# View augmentation statistics
aprender-shell stats --augmented
Use Cases
1. Defect Prediction Training
Augment labeled commit diffs to improve classifier robustness:
let buggy_code = vec![
"if (x = null) return;".to_string(), // Assignment instead of comparison
];
let augmented = generator.generate(&buggy_code, &synth_config)?; // synth_config: SyntheticConfig, as in Quick Start
// Train classifier on original + augmented samples
2. Code Clone Detection
Generate synthetic near-clones for contrastive learning:
let original = "fn add(a: i32, b: i32) -> i32 { a + b }";
// Generate variations with same semantics
let clones = generator.generate(&[original.to_string()], &synth_config)?; // synth_config: SyntheticConfig, as in Quick Start
3. Code Completion Training
Augment training data for autocomplete models:
let completions = vec![
"git commit -m 'fix bug'".to_string(),
"cargo build --release".to_string(),
];
// 2x training data with variations
let augmented = generator.generate(&completions, &SyntheticConfig::default()
.with_augmentation_ratio(2.0))?;
Deterministic Generation
CodeEDA uses a seeded PRNG for reproducibility:
let generator = CodeEda::new(CodeEdaConfig::default());
let aug1 = generator.augment("let x = 1;", 42);
let aug2 = generator.augment("let x = 1;", 42);
assert_eq!(aug1, aug2); // Same seed = same output
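To illustrate why a seeded generator makes augmentation reproducible, here is a small standalone sketch using SplitMix64, a well-known generator swapped in for illustration; the document does not say which PRNG CodeEDA uses. `plan` is a hypothetical helper that draws once per operation and compares the draw with that operation's probability.

```rust
/// SplitMix64: a tiny deterministic generator (illustrative choice, not the library's).
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1), built from the top 53 bits.
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Decide which operations fire: one draw per probability, in order.
pub fn plan(seed: u64, probs: &[f64]) -> Vec<bool> {
    let mut rng = SplitMix64::new(seed);
    probs.iter().map(|&p| rng.next_f64() < p).collect()
}
```

Since every decision is a pure function of the seed and the order of draws, calling plan twice with the same seed and probabilities always returns the same vector.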
Custom Synonyms
Extend the synonym dictionary:
use aprender::synthetic::code_eda::VariableSynonyms;
let mut synonyms = VariableSynonyms::new();
synonyms.add_synonym(
"conn".to_string(),
vec!["connection".to_string(), "db".to_string()],
);
synonyms.add_synonym(
"ctx".to_string(),
vec!["context".to_string(), "cx".to_string()],
);
Performance
CodeEDA is designed for batch augmentation efficiency:
| Operation | Complexity | Notes |
|---|---|---|
| Tokenization | O(n) | Single pass, no regex |
| Variable rename | O(n) | HashMap lookup |
| Comment insertion | O(n) | Single pass |
| Statement reorder | O(n) | Split + swap |
| Quality score | O(n) | Token set operations |
Typical throughput: 50,000+ augmentations/second on modern hardware.
References
- Wei & Zou (2019). "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks"
- D'Ambros et al. (2012). "Evaluating Defect Prediction Approaches" (defect prediction context)
- Synthetic Data Generation - General EDA for text
See Also
- CodeFeatureExtractor - 8-dimensional commit feature extraction
- Shell Completion - AI-powered shell autocomplete
- Shell Completion Benchmarks - Sub-10ms latency verification