Case Study: Code-Aware EDA (Easy Data Augmentation)

Syntax-aware data augmentation for source code, preserving semantic validity while generating diverse training samples.

Quick Start

use aprender::synthetic::code_eda::{CodeEda, CodeEdaConfig, CodeLanguage};
use aprender::synthetic::{SyntheticGenerator, SyntheticConfig};

// Configure for Rust code
let config = CodeEdaConfig::default()
    .with_language(CodeLanguage::Rust)
    .with_rename_prob(0.15)
    .with_comment_prob(0.1);

let generator = CodeEda::new(config);

// Augment code samples
let seeds = vec![
    "let x = 42;\nprintln!(\"{}\", x);".to_string(),
];

let synth_config = SyntheticConfig::default()
    .with_augmentation_ratio(2.0)
    .with_quality_threshold(0.3)
    .with_seed(42);

let augmented = generator.generate(&seeds, &synth_config)?;

Why Code-Specific Augmentation?

Traditional EDA (Wei & Zou, 2019) works on natural language but fails on code:

Text EDA	Code EDA
Random word swap	Preserves syntax
Synonym replacement	Variable renaming
Random deletion	Dead code removal
Random insertion	Comment insertion

Key difference: Code has structure. x = 1; y = 2; can become y = 2; x = 1; only if statements are independent.

Augmentation Operations

1. Variable Renaming (VR)

Replace identifiers with semantic synonyms:

// Original
let x = calculate();
let i = 0;
let buf = Vec::new();

// Augmented
let value = calculate();  // x → value
let index = 0;            // i → index
let buffer = Vec::new();  // buf → buffer

Built-in synonym mappings:

Original	Alternatives
`x`	`value`, `val`
`y`	`result`, `res`
`i`	`index`, `idx`
`j`	`inner`, `jdx`
`n`	`count`, `num`
`tmp`	`temp`, `scratch`
`buf`	`buffer`, `data`
`len`	`length`, `size`
`err`	`error`, `e`

Reserved keywords are never renamed:

Rust: let, mut, fn, impl, struct, enum, trait, etc.
Python: def, class, import, if, for, while, etc.

2. Comment Insertion (CI)

Add language-appropriate comments:

// Rust
let x = 42;
// TODO: review    ← inserted
let y = x + 1;

# Python
x = 42
# NOTE: temp       ← inserted
y = x + 1

3. Statement Reorder (SR)

Swap adjacent independent statements:

// Original
let a = 1;
let b = 2;
let c = 3;

// Augmented (swap a,b)
let b = 2;
let a = 1;
let c = 3;

Delimiter detection:

Rust: semicolons (;)
Python: newlines (\n)

4. Dead Code Removal (DCR)

Remove comments and collapse whitespace:

// Original
let x = 1;  // important value
let y = 2;  /* temp */

// Augmented
let x = 1;
let y = 2;

Configuration

`CodeEdaConfig`

CodeEdaConfig::default()
    .with_rename_prob(0.15)      // Variable rename probability
    .with_comment_prob(0.1)      // Comment insertion probability
    .with_reorder_prob(0.05)     // Statement reorder probability
    .with_remove_prob(0.1)       // Dead code removal probability
    .with_num_augments(4)        // Augmentations per input
    .with_min_tokens(5)          // Skip short code
    .with_language(CodeLanguage::Rust)

Supported Languages

pub enum CodeLanguage {
    Rust,    // Full syntax awareness
    Python,  // Full syntax awareness
    Generic, // Language-agnostic operations only
}

Quality Metrics

Token Overlap

Measures semantic preservation via Jaccard similarity:

let generator = CodeEda::new(CodeEdaConfig::default());

let original = "let x = 42;";
let augmented = "let value = 42;";

let overlap = generator.token_overlap(original, augmented);
// overlap ≈ 0.75 (shared: let, =, 42, ;)

Quality Score

Penalizes extremes (too similar or too different):

Overlap	Quality	Interpretation
> 0.95	0.5	Too similar, little augmentation
0.3-0.95	overlap	Good augmentation
< 0.3	0.3	Too different, likely corrupted

Diversity Score

Measures batch diversity (inverse of average pairwise overlap):

let batch = vec![
    "let x = 1;".to_string(),
    "fn foo() {}".to_string(),
];

let diversity = generator.diversity_score(&batch);
// diversity > 0.5 (different code patterns)

Integration with aprender-shell

The aprender-shell CLI supports CodeEDA for shell command augmentation:

# Train with code-aware augmentation
aprender-shell augment --use-code-eda

# View augmentation statistics
aprender-shell stats --augmented

Use Cases

1. Defect Prediction Training

Augment labeled commit diffs to improve classifier robustness:

let buggy_code = vec![
    "if (x = null) return;".to_string(),  // Assignment instead of comparison
];

let augmented = generator.generate(&buggy_code, &config)?;
// Train classifier on original + augmented samples

2. Code Clone Detection

Generate synthetic near-clones for contrastive learning:

let original = "fn add(a: i32, b: i32) -> i32 { a + b }";

// Generate variations with same semantics
let clones = generator.generate(&[original.to_string()], &config)?;

3. Code Completion Training

Augment training data for autocomplete models:

let completions = vec![
    "git commit -m 'fix bug'".to_string(),
    "cargo build --release".to_string(),
];

// 2x training data with variations
let augmented = generator.generate(&completions, &SyntheticConfig::default()
    .with_augmentation_ratio(2.0))?;

Deterministic Generation

CodeEDA uses a seeded PRNG for reproducibility:

let generator = CodeEda::new(CodeEdaConfig::default());

let aug1 = generator.augment("let x = 1;", 42);
let aug2 = generator.augment("let x = 1;", 42);

assert_eq!(aug1, aug2);  // Same seed = same output

Custom Synonyms

Extend the synonym dictionary:

use aprender::synthetic::code_eda::VariableSynonyms;

let mut synonyms = VariableSynonyms::new();
synonyms.add_synonym(
    "conn".to_string(),
    vec!["connection".to_string(), "db".to_string()],
);
synonyms.add_synonym(
    "ctx".to_string(),
    vec!["context".to_string(), "cx".to_string()],
);

Performance

CodeEDA is designed for batch augmentation efficiency:

Operation	Complexity	Notes
Tokenization	O(n)	Single pass, no regex
Variable rename	O(n)	HashMap lookup
Comment insertion	O(n)	Single pass
Statement reorder	O(n)	Split + swap
Quality score	O(n)	Token set operations

Typical throughput: 50,000+ augmentations/second on modern hardware.

References

Wei & Zou (2019). "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks"
D'Ambros et al. (2012). "Evaluating Defect Prediction Approaches" (defect prediction context)
Synthetic Data Generation - General EDA for text

EXTREME TDD - The Aprender Guide to Zero-Defect Machine Learning