Case Study: Synthetic Data Generation for ML

Synthetic data generation augments training datasets when labeled data is scarce. This example walks through aprender's synthetic data module: text augmentation with EDA, template-based generation, weak supervision, and caching of generated data.

Running the Example

cargo run --example synthetic_data_generation

Techniques Demonstrated

1. EDA (Easy Data Augmentation)

EDA applies simple token-level transformations (word swaps, deletions, and insertions) to each seed to generate variations:

use aprender::synthetic::eda::{EdaConfig, EdaGenerator};
use aprender::synthetic::{SyntheticConfig, SyntheticGenerator};

let generator = EdaGenerator::new(EdaConfig::default());

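// Seed commands to augment (the same three listed in the output below)
let seeds = vec![
    "git commit -m 'fix bug'".to_string(),
    "cargo build --release".to_string(),
    "docker run nginx".to_string(),
];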

let config = SyntheticConfig::default()
    .with_augmentation_ratio(2.0)  // 2x original data
    .with_quality_threshold(0.3)
    .with_seed(42);

let augmented = generator.generate(&seeds, &config)?;

Output:

Original commands (3):
  git commit -m 'fix bug'
  cargo build --release
  docker run nginx

Augmented commands (6):
  git commit -m 'fix bug' (quality: 1.00)
  git -m commit 'fix bug' (quality: 0.67)
  cargo build --release (quality: 1.00)
  cargo --release build (quality: 0.67)
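
To make the kind of transformation concrete, here is a minimal standard-library sketch of one EDA operation, random word swap. It does not use aprender and is not its implementation: `Lcg` and `random_swap` are hypothetical names, and the tiny generator only stands in for a seeded RNG so that the sketch stays deterministic.

```rust
/// Tiny linear congruential generator (hypothetical helper), used only so
/// the sketch is deterministic without external crates.
struct Lcg(u64);

impl Lcg {
    /// Return a pseudo-random index in `0..bound` (`bound` must be non-zero).
    fn next_index(&mut self, bound: usize) -> usize {
        // Multiplier and increment from Knuth's MMIX generator.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % bound
    }
}

/// Swap two randomly chosen whitespace-separated tokens.
/// Inputs with fewer than two tokens come back unchanged.
fn random_swap(text: &str, rng: &mut Lcg) -> String {
    let mut tokens: Vec<&str> = text.split_whitespace().collect();
    if tokens.len() < 2 {
        return tokens.join(" ");
    }
    let i = rng.next_index(tokens.len());
    // Offset by 1..len so the second index always differs from the first.
    let j = (i + 1 + rng.next_index(tokens.len() - 1)) % tokens.len();
    tokens.swap(i, j);
    tokens.join(" ")
}
```

Because the sketch splits on whitespace, a quoted argument such as 'fix bug' is treated as two tokens; a real augmenter for shell commands would likely want a quote-aware tokenizer.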

2. Template-Based Generation

Generate structured commands from templates with variable slots:

use aprender::synthetic::template::{Template, TemplateGenerator};

let git_template = Template::new("git {action} {args}")
    .with_slot("action", &["commit", "push", "pull", "checkout"])
    .with_slot("args", &["-m 'update'", "--all", "main"]);

let cargo_template = Template::new("cargo {cmd} {flags}")
    .with_slot("cmd", &["build", "test", "run", "check"])
    .with_slot("flags", &["--release", "--all-features", ""]);

let generator = TemplateGenerator::new()
    .with_template(git_template)
    .with_template(cargo_template);

// Total combinations = 4*3 + 4*3 = 24
println!("Possible combinations: {}", generator.total_combinations());
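
The combination count in the comment above is the product of slot sizes per template, summed over templates. The following sketch shows that arithmetic and the expansion itself using only the standard library; `SlotTemplate`, `to_owned_vec`, and `git_template` are hypothetical stand-ins, not aprender types.

```rust
/// A template with named slots, each holding its candidate values.
/// (Hypothetical stand-in; not aprender's `Template`.)
struct SlotTemplate {
    pattern: String,
    slots: Vec<(String, Vec<String>)>,
}

impl SlotTemplate {
    /// Number of distinct fillings: the product of the slot sizes.
    fn combinations(&self) -> usize {
        self.slots.iter().map(|(_, values)| values.len()).product()
    }

    /// Expand every filling by substituting `{name}` placeholders,
    /// one slot at a time (a Cartesian product over the slots).
    fn expand(&self) -> Vec<String> {
        let mut results = vec![self.pattern.clone()];
        for (name, values) in &self.slots {
            let placeholder = format!("{{{}}}", name);
            let mut next = Vec::with_capacity(results.len() * values.len());
            for partial in &results {
                for value in values {
                    next.push(partial.replace(&placeholder, value));
                }
            }
            results = next;
        }
        results
    }
}

/// Convert string literals into owned `String`s.
fn to_owned_vec(items: &[&str]) -> Vec<String> {
    items.iter().map(|s| s.to_string()).collect()
}

/// The git template from the text above, rebuilt with the stand-in type.
fn git_template() -> SlotTemplate {
    SlotTemplate {
        pattern: "git {action} {args}".to_string(),
        slots: vec![
            ("action".to_string(), to_owned_vec(&["commit", "push", "pull", "checkout"])),
            ("args".to_string(), to_owned_vec(&["-m 'update'", "--all", "main"])),
        ],
    }
}
```

Here `git_template().combinations()` is 4 * 3 = 12, which is the git half of the 24 total. Note that an empty slot value, such as the "" flag in the cargo template, would leave a trailing space in this naive substitution.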

3. Weak Supervision

Label unlabeled data using heuristic labeling functions:

use aprender::synthetic::weak_supervision::{
    WeakSupervisionGenerator, WeakSupervisionConfig,
    AggregationStrategy, KeywordLF, LabelVote,
};

let mut generator = WeakSupervisionGenerator::<String>::new()
    .with_config(
        WeakSupervisionConfig::new()
            .with_aggregation(AggregationStrategy::MajorityVote)
            .with_min_votes(1)
            .with_min_confidence(0.5),
    );

// Add domain-specific labeling functions
generator.add_lf(Box::new(KeywordLF::new(
    "version_control",
    &["git", "svn", "commit", "push"],
    LabelVote::Positive,
)));

generator.add_lf(Box::new(KeywordLF::new(
    "dangerous",
    &["rm -rf", "sudo rm", "format"],
    LabelVote::Negative,
)));

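// Unlabeled samples (the same four listed in the output below)
let samples = vec![
    "git push origin main".to_string(),
    "rm -rf /tmp/cache".to_string(),
    "cargo test --all".to_string(),
    "echo hello world".to_string(),
];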

let labeled = generator.generate(&samples, &config)?;

Output:

Labeled samples:
  [SAFE] (conf: 0.75) git push origin main
  [UNSAFE] (conf: 0.80) rm -rf /tmp/cache
  [SAFE] (conf: 0.65) cargo test --all
  [UNKNOWN] (conf: 0.20) echo hello world
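
As an illustration of majority-vote aggregation over keyword labeling functions, here is a standard-library sketch. `Vote`, `KeywordRule`, and `majority_vote` are hypothetical names, and the confidence is simply the winner's share of the votes cast, which is an assumption; it is not meant to reproduce the confidence values printed above.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Vote {
    Positive,
    Negative,
    Abstain,
}

/// A labeling function that votes when any keyword occurs in the sample.
/// (Hypothetical stand-in; not aprender's `KeywordLF`.)
struct KeywordRule {
    keywords: Vec<&'static str>,
    vote: Vote,
}

impl KeywordRule {
    fn apply(&self, sample: &str) -> Vote {
        if self.keywords.iter().any(|k| sample.contains(*k)) {
            self.vote
        } else {
            Vote::Abstain
        }
    }
}

/// Majority vote over the rules that did not abstain.
/// Returns the winning vote and its share of the votes cast, or `None`
/// when fewer than `min_votes` rules fired, no rule fired, or the vote ties.
fn majority_vote(rules: &[KeywordRule], sample: &str, min_votes: usize) -> Option<(Vote, f64)> {
    let votes: Vec<Vote> = rules
        .iter()
        .map(|rule| rule.apply(sample))
        .filter(|vote| *vote != Vote::Abstain)
        .collect();
    if votes.is_empty() || votes.len() < min_votes {
        return None;
    }
    let positive = votes.iter().filter(|v| **v == Vote::Positive).count();
    let negative = votes.len() - positive;
    if positive == negative {
        return None;
    }
    let (label, winning) = if positive > negative {
        (Vote::Positive, positive)
    } else {
        (Vote::Negative, negative)
    };
    Some((label, winning as f64 / votes.len() as f64))
}
```

With the two keyword lists from the example, "git push origin main" receives a single positive vote and "echo hello world" receives none, so the sketch returns `None` for the latter; raising `min_votes` to 2 would reject both.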

4. Caching for Efficiency

Cache generated data to avoid redundant computation:

use aprender::synthetic::cache::SyntheticCache;

let mut cache = SyntheticCache::<String>::new(100_000); // 100KB cache
let generator = EdaGenerator::new(EdaConfig::default());

// First call - cache miss, runs generation
let result1 = cache.get_or_generate(&seeds, &config, &generator)?;

// Second call - cache hit, returns cached result
let result2 = cache.get_or_generate(&seeds, &config, &generator)?;

println!("Hit rate: {:.1}%", cache.stats().hit_rate() * 100.0);
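
The get-or-generate pattern can be sketched with a `HashMap` keyed by a hash of the inputs. This is only an illustration under stated assumptions: `MemoCache` is a hypothetical type, the key covers just the seeds and an RNG seed, and there is no size limit or eviction, unlike the capacity argument passed to `SyntheticCache` above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Minimal memoizing cache with hit/miss counters (hypothetical type).
struct MemoCache {
    entries: HashMap<u64, Vec<String>>,
    hits: u64,
    misses: u64,
}

impl MemoCache {
    fn new() -> Self {
        MemoCache { entries: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Derive the cache key from everything that affects the output:
    /// here, the seed texts and the RNG seed.
    fn key(seeds: &[String], rng_seed: u64) -> u64 {
        let mut hasher = DefaultHasher::new();
        seeds.hash(&mut hasher);
        rng_seed.hash(&mut hasher);
        hasher.finish()
    }

    /// Return the cached result, or run `generate` once and store it.
    /// Hash collisions are ignored in this sketch.
    fn get_or_generate<F>(&mut self, seeds: &[String], rng_seed: u64, generate: F) -> Vec<String>
    where
        F: FnOnce(&[String]) -> Vec<String>,
    {
        let key = Self::key(seeds, rng_seed);
        if let Some(cached) = self.entries.get(&key) {
            self.hits += 1;
            return cached.clone();
        }
        self.misses += 1;
        let generated = generate(seeds);
        self.entries.insert(key, generated.clone());
        generated
    }

    /// Fraction of lookups served from the cache (0.0 before any lookup).
    fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 {
            0.0
        } else {
            self.hits as f64 / total as f64
        }
    }
}
```

Two identical calls produce one miss and one hit, so the hit rate is 0.5. Any configuration field that changes the generated output needs to be part of the key, otherwise stale results would be returned.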

Quality Metrics

Diversity Score

Measures how diverse the generated samples are:

let diversity = generator.diversity_score(&augmented);
// Returns value between 0.0 (identical) and 1.0 (completely diverse)
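
The source does not say how the diversity score is computed. One plausible definition, shown here purely as an assumption, is the mean pairwise Jaccard distance between the word sets of the samples; `jaccard_distance` and `diversity` are hypothetical names.

```rust
use std::collections::HashSet;

/// Jaccard distance between the word sets of two strings:
/// 1 - |intersection| / |union|, or 0.0 when both are empty.
fn jaccard_distance(a: &str, b: &str) -> f64 {
    let words_a: HashSet<&str> = a.split_whitespace().collect();
    let words_b: HashSet<&str> = b.split_whitespace().collect();
    let union = words_a.union(&words_b).count();
    if union == 0 {
        return 0.0;
    }
    let intersection = words_a.intersection(&words_b).count();
    1.0 - intersection as f64 / union as f64
}

/// Mean Jaccard distance over all unordered pairs of samples.
/// Returns 0.0 when there are fewer than two samples.
fn diversity(samples: &[&str]) -> f64 {
    let mut total = 0.0;
    let mut pairs = 0usize;
    for i in 0..samples.len() {
        for j in (i + 1)..samples.len() {
            total += jaccard_distance(samples[i], samples[j]);
            pairs += 1;
        }
    }
    if pairs == 0 {
        0.0
    } else {
        total / pairs as f64
    }
}
```

A set-based measure like this ignores word order, so a sample and its word-swapped variant count as identical; an order-sensitive metric would be needed to credit reordering as diversity.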

Quality Score

Measures how well generated samples preserve semantic meaning:

let quality = generator.quality_score(&generated_sample, &original_seed);
// Returns value between 0.0 (unrelated) and 1.0 (identical)
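
As a rough illustration of scoring a candidate against its seed and then applying a quality threshold, the sketch below uses position-wise token agreement. This metric is an assumption chosen for simplicity, it does not reproduce the quality values printed in the EDA output, and `positional_quality` and `filter_by_quality` are hypothetical names.

```rust
/// Fraction of token positions at which candidate and seed agree,
/// normalised by the longer token count. Two empty strings score 1.0.
fn positional_quality(candidate: &str, seed: &str) -> f64 {
    let cand: Vec<&str> = candidate.split_whitespace().collect();
    let orig: Vec<&str> = seed.split_whitespace().collect();
    let longest = cand.len().max(orig.len());
    if longest == 0 {
        return 1.0;
    }
    let matches = cand.iter().zip(orig.iter()).filter(|(a, b)| a == b).count();
    matches as f64 / longest as f64
}

/// Keep only the candidates whose score against the seed meets the threshold.
fn filter_by_quality<'a>(candidates: &[&'a str], seed: &str, threshold: f64) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|candidate| positional_quality(candidate, seed) >= threshold)
        .collect()
}
```

With a threshold of 0.3, as in the configuration used throughout this example, an unchanged copy of the seed is kept while a candidate sharing no positions with it is dropped.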

Use Cases

| Technique        | Best For            | Example                      |
|------------------|---------------------|------------------------------|
| EDA              | Text classification | Sentiment analysis training  |
| Templates        | Structured data     | Command generation           |
| Weak Supervision | Unlabeled data      | Auto-labeling datasets       |
| Caching          | Repeated generation | Batch augmentation pipelines |

Configuration Reference

`SyntheticConfig`

SyntheticConfig::default()
    .with_augmentation_ratio(2.0)   // Generate 2x original
    .with_quality_threshold(0.3)    // Minimum quality score
    .with_seed(42)                  // Reproducible randomness

`EdaConfig`

EdaConfig::default()
    .with_swap_probability(0.1)     // Word swap chance
    .with_delete_probability(0.1)   // Word deletion chance
    .with_insert_probability(0.1)   // Word insertion chance

`WeakSupervisionConfig`

WeakSupervisionConfig::new()
    .with_aggregation(AggregationStrategy::MajorityVote)
    .with_min_votes(2)              // Need 2+ LFs to agree
    .with_min_confidence(0.5)       // 50% confidence threshold
