Add: windows mvp - transparent bugs not fixed

2026-02-12 22:58:33 +08:00
commit 61825f647d
147 changed files with 28498 additions and 0 deletions
--- a/skills/m10-performance/SKILL.md
+++ b/skills/m10-performance/SKILL.md
@@ -0,0 +1,157 @@
+---
+name: m10-performance
+description: "CRITICAL: Use for performance optimization. Triggers: performance, optimization, benchmark, profiling, flamegraph, criterion, slow, fast, allocation, cache, SIMD, make it faster, 性能优化, 基准测试"
+user-invocable: false
+---
+
+# Performance Optimization
+
+> **Layer 2: Design Choices**
+
+## Core Question
+
+**What's the bottleneck, and is optimization worth it?**
+
+Before optimizing:
+- Have you measured? (Don't guess)
+- What's the acceptable performance?
+- Will optimization add complexity?
+
+---
+
+## Performance Decision → Implementation
+
+| Goal | Design Choice | Implementation |
+|------|---------------|----------------|
+| Reduce allocations | Pre-allocate, reuse | `with_capacity`, object pools |
+| Improve cache | Contiguous data | `Vec`, `SmallVec` |
+| Parallelize | Data parallelism | `rayon`, threads |
+| Avoid copies | Zero-copy | References, `Cow<T>` |
+| Reduce indirection | Inline data | `smallvec`, arrays |
+
+---
+
+## Thinking Prompt
+
+Before optimizing:
+
+1. **Have you measured?**
+   - Profile first → flamegraph, perf
+   - Benchmark → criterion, cargo bench
+   - Identify actual hotspots
+
+2. **What's the priority?**
+   - Algorithm (10x-1000x improvement)
+   - Data structure (2x-10x)
+   - Allocation (2x-5x)
+   - Cache (1.5x-3x)
+
+3. **What's the trade-off?**
+   - Complexity vs speed
+   - Memory vs CPU
+   - Latency vs throughput
+
+---
+
+## Trace Up ↑
+
+To domain constraints (Layer 3):
+
+```
+"How fast does this need to be?"
+    ↑ Ask: What's the performance SLA?
+    ↑ Check: domain-* (latency requirements)
+    ↑ Check: Business requirements (acceptable response time)
+```
+
+| Question | Trace To | Ask |
+|----------|----------|-----|
+| Latency requirements | domain-* | What's acceptable response time? |
+| Throughput needs | domain-* | How many requests per second? |
+| Memory constraints | domain-* | What's the memory budget? |
+
+---
+
+## Trace Down ↓
+
+To implementation (Layer 1):
+
+```
+"Need to reduce allocations"
+    ↓ m01-ownership: Use references, avoid clone
+    ↓ m02-resource: Pre-allocate with_capacity
+
+"Need to parallelize"
+    ↓ m07-concurrency: Choose rayon or threads
+    ↓ m07-concurrency: Consider async for I/O-bound
+
+"Need cache efficiency"
+    ↓ Data layout: Prefer Vec over HashMap when possible
+    ↓ Access patterns: Sequential over random access
+```
+
+---
+
+## Quick Reference
+
+| Tool | Purpose |
+|------|---------|
+| `cargo bench` | Micro-benchmarks |
+| `criterion` | Statistical benchmarks |
+| `perf` / `flamegraph` | CPU profiling |
+| `heaptrack` | Allocation tracking |
+| `valgrind` / `cachegrind` | Cache analysis |
+
+## Optimization Priority
+
+```
+1. Algorithm choice     (10x - 1000x)
+2. Data structure       (2x - 10x)
+3. Allocation reduction (2x - 5x)
+4. Cache optimization   (1.5x - 3x)
+5. SIMD/Parallelism     (2x - 8x)
+```
+
+## Common Techniques
+
+| Technique | When | How |
+|-----------|------|-----|
+| Pre-allocation | Known size | `Vec::with_capacity(n)` |
+| Avoid cloning | Hot paths | Use references or `Cow<T>` |
+| Batch operations | Many small ops | Collect then process |
+| SmallVec | Usually small | `smallvec::SmallVec<[T; N]>` |
+| Inline buffers | Fixed-size data | Arrays over Vec |
+
+---
+
+## Common Mistakes
+
+| Mistake | Why Wrong | Better |
+|---------|-----------|--------|
+| Optimize without profiling | Wrong target | Profile first |
+| Benchmark in debug mode | Unoptimized code gives misleading numbers | Build with `--release` (`cargo bench` already optimizes) |
+| Use LinkedList | Cache unfriendly | `Vec` or `VecDeque` |
+| Hidden `.clone()` | Unnecessary allocs | Use references |
+| Premature optimization | Wasted effort | Make it work first |
+
+---
+
+## Anti-Patterns
+
+| Anti-Pattern | Why Bad | Better |
+|--------------|---------|--------|
+| Clone to avoid lifetimes | Performance cost | Proper ownership |
+| Box everything | Indirection cost | Stack when possible |
+| HashMap for small sets | Overhead | Vec with linear search |
+| String concat in loop | Repeated reallocation and copying (O(n^2) if each step builds a new `String`) | `String::with_capacity` + `push_str`, or `join` |
+
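
As a rough illustration of the "HashMap for small sets" row above, here is a minimal sketch of a `Vec`-backed lookup. `SmallMap` and its methods are hypothetical names used only for this example, not part of the skill. Whether it actually beats `HashMap` depends on the entry count and key type, so benchmark before adopting it.

```rust
/// Hypothetical Vec-backed map for a handful of entries (illustration only).
pub struct SmallMap<K, V> {
    entries: Vec<(K, V)>,
}

impl<K: PartialEq, V> SmallMap<K, V> {
    pub fn new() -> Self {
        Self { entries: Vec::new() }
    }

    /// Inserts or overwrites; returns the previous value if the key existed.
    pub fn insert(&mut self, key: K, value: V) -> Option<V> {
        for (k, v) in self.entries.iter_mut() {
            if *k == key {
                return Some(std::mem::replace(v, value));
            }
        }
        self.entries.push((key, value));
        None
    }

    /// Linear scan over contiguous memory: O(n), but n is assumed to be tiny.
    pub fn get(&self, key: &K) -> Option<&V> {
        self.entries.iter().find(|(k, _)| k == key).map(|(_, v)| v)
    }
}
```

Calling `insert("a", 1)` and then `get(&"a")` returns `Some(&1)`. No hashing is involved, which is where the saving for very small collections would come from.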
+---
+
+## Related Skills
+
+| When | See |
+|------|-----|
+| Reducing clones | m01-ownership |
+| Concurrency options | m07-concurrency |
+| Smart pointer choice | m02-resource |
+| Domain requirements | domain-* |
--- a/skills/m10-performance/patterns/optimization-guide.md
+++ b/skills/m10-performance/patterns/optimization-guide.md
@@ -0,0 +1,365 @@
+# Rust Performance Optimization Guide
+
+## Profiling First
+
+### Tools
+```bash
+# CPU profiling
+cargo install flamegraph
+cargo flamegraph --bin myapp
+
+# Memory profiling
+cargo install cargo-instruments  # macOS
+heaptrack ./target/release/myapp  # Linux
+
+# Benchmarking
+cargo bench  # with criterion
+
+# Cache analysis
+valgrind --tool=cachegrind ./target/release/myapp
+```
+
+### Criterion Benchmarks
+```rust
+use criterion::{criterion_group, criterion_main, Criterion};
+
+fn benchmark_parse(c: &mut Criterion) {
+    let input = "test data".repeat(1000);
+
+    c.bench_function("parse_v1", |b| {
+        b.iter(|| parse_v1(std::hint::black_box(&input)))
+    });
+
+    c.bench_function("parse_v2", |b| {
+        b.iter(|| parse_v2(std::hint::black_box(&input)))
+    });
+}
+
+criterion_group!(benches, benchmark_parse);
+criterion_main!(benches);
+```
+
+---
+
+## Common Optimizations
+
+### 1. Avoid Unnecessary Allocations
+
+```rust
+// BAD: allocates on every call
+fn to_uppercase(s: &str) -> String {
+    s.to_uppercase()
+}
+
+// GOOD: return Cow, allocate only if needed
+use std::borrow::Cow;
+
+fn to_uppercase(s: &str) -> Cow<'_, str> {
+    if s.chars().all(|c| c.is_uppercase()) {
+        Cow::Borrowed(s)
+    } else {
+        Cow::Owned(s.to_uppercase())
+    }
+}
+```
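
The same borrow-or-allocate shape applies beyond case conversion. A minimal sketch, assuming a hypothetical helper `expand_tabs` (not part of the guide) that only allocates when the input really needs changing:

```rust
use std::borrow::Cow;

/// Hypothetical helper: replaces each tab with four spaces, allocating only
/// when the input actually contains a tab.
pub fn expand_tabs(s: &str) -> Cow<'_, str> {
    if s.contains('\t') {
        // Slow path: build a new String with the replacements applied.
        Cow::Owned(s.replace('\t', "    "))
    } else {
        // Fast path: hand back the caller's slice untouched.
        Cow::Borrowed(s)
    }
}
```

Callers can treat the result as a `&str` through deref, and pay for `into_owned()` only if they need a `String`.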
+
+### 2. Reuse Allocations
+
+```rust
+// BAD: creates new Vec each iteration
+for item in items {
+    let mut buffer = Vec::new();
+    process(&mut buffer, item);
+}
+
+// GOOD: reuse buffer
+let mut buffer = Vec::new();
+for item in items {
+    buffer.clear();
+    process(&mut buffer, item);
+}
+```
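
The snippet above leaves `process` and `items` undefined. A self-contained sketch of the same reuse pattern might look like this, where `process` and `process_all` are hypothetical stand-ins for the real per-item work:

```rust
/// Hypothetical per-item work: appends the ASCII-uppercased bytes of `item`.
fn process(buffer: &mut Vec<u8>, item: &str) {
    buffer.extend(item.bytes().map(|b| b.to_ascii_uppercase()));
}

/// Processes every item with one reused buffer. Returns the total number of
/// bytes produced and the buffer's final capacity.
pub fn process_all(items: &[&str]) -> (usize, usize) {
    let mut buffer = Vec::new();
    let mut total = 0;
    for item in items {
        buffer.clear(); // drops the contents but keeps the allocation
        process(&mut buffer, item);
        total += buffer.len();
    }
    (total, buffer.capacity())
}
```

Because `clear` keeps the capacity, the buffer only grows when an item is larger than anything seen so far.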
+
+### 3. Use Appropriate Collections
+
+| Need | Collection | Notes |
+|------|------------|-------|
+| Sequential access | `Vec<T>` | Best cache locality |
+| Random access by key | `HashMap<K, V>` | O(1) average lookup |
+| Ordered keys | `BTreeMap<K, V>` | O(log n) lookup |
+| Small sets (roughly <20 items) | `Vec<T>` + linear search | Lower overhead |
+| FIFO queue | `VecDeque<T>` | Amortized O(1) push/pop at both ends |
+
+### 4. Pre-allocate Capacity
+
+```rust
+// BAD: many reallocations
+let mut v = Vec::new();
+for i in 0..10000 {
+    v.push(i);
+}
+
+// GOOD: single allocation
+let mut v = Vec::with_capacity(10000);
+for i in 0..10000 {
+    v.push(i);
+}
+```
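
One way to see the difference is to count how often the capacity changes while pushing. This is a minimal sketch; `count_regrowths` is a hypothetical helper, and a capacity change is used here as a proxy for a reallocation:

```rust
/// Pushes `n` integers into `v` and counts how many times the capacity
/// changed, which approximates the number of reallocations.
pub fn count_regrowths(mut v: Vec<u32>, n: u32) -> usize {
    let mut regrowths = 0;
    let mut cap = v.capacity();
    for i in 0..n {
        v.push(i);
        if v.capacity() != cap {
            regrowths += 1;
            cap = v.capacity();
        }
    }
    regrowths
}
```

With `Vec::with_capacity(10_000)` the count stays at zero for 10,000 pushes. With `Vec::new()` the vector has to grow several times, and the exact number depends on the standard library's growth strategy.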
+
+---
+
+## String Optimization
+
+### Avoid String Concatenation in Loops
+
+```rust
+// BAD: rebuilds the whole string each iteration, O(n²) bytes copied
+let mut result = String::new();
+for s in strings {
+    result = format!("{result}{s}");
+}
+
+// GOOD: O(n) with push_str
+let mut result = String::new();
+for s in strings {
+    result.push_str(&s);
+}
+
+// BETTER: pre-calculate capacity
+let total_len: usize = strings.iter().map(|s| s.len()).sum();
+let mut result = String::with_capacity(total_len);
+for s in strings {
+    result.push_str(&s);
+}
+
+// BEST: use join for simple cases
+let result = strings.join("");
+```
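
The loops above each consume `strings`, so they are alternatives rather than one program. A self-contained sketch of the pre-sized variant that borrows its input instead (the name `concat_prealloc` is hypothetical):

```rust
/// Concatenates `parts` into one String with a single up-front allocation.
pub fn concat_prealloc(parts: &[String]) -> String {
    // First pass: total byte length, so the buffer never has to grow.
    let total_len: usize = parts.iter().map(|s| s.len()).sum();
    let mut result = String::with_capacity(total_len);
    // Second pass: append each part in place.
    for s in parts {
        result.push_str(s);
    }
    result
}
```

For an empty separator the standard library's `parts.concat()` produces the same string, so a hand-written helper like this is mainly useful when the loop also does other work.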
+
+### Use &str When Possible
+
+```rust
+// BAD: forces the caller to hand over (and often allocate) an owned String
+fn greet(name: String) {
+    println!("Hello, {}", name);
+}
+
+// GOOD: borrows, no allocation
+fn greet(name: &str) {
+    println!("Hello, {}", name);
+}
+
+// Works with both:
+greet("world");                    // &str
+greet(&String::from("world"));     // &String coerces to &str
+```
+
+---
+
+## Iterator Optimization
+
+### Use Iterators Over Indexing
+
+```rust
+// BAD: indexing may add a bounds check per access (not always optimized away)
+let mut sum = 0;
+for i in 0..vec.len() {
+    sum += vec[i];
+}
+
+// GOOD: no per-element bounds checks
+let sum: i32 = vec.iter().sum();
+
+// GOOD: when index needed
+for (i, item) in vec.iter().enumerate() {
+    // ...
+}
+```
+
+### Lazy Evaluation
+
+```rust
+// Iterators are lazy - computation happens at collect
+let result: Vec<_> = data
+    .iter()
+    .filter(|x| x.is_valid())
+    .map(|x| x.process())
+    .take(10)  // stop after 10 items
+    .collect();
+```
+
+### Avoid Collecting When Not Needed
+
+```rust
+// BAD: unnecessary intermediate allocation
+let filtered: Vec<_> = items.iter().filter(|x| x.valid).collect();
+let count = filtered.len();
+
+// GOOD: no allocation
+let count = items.iter().filter(|x| x.valid).count();
+```
+
+---
+
+## Parallelism with Rayon
+
+```rust
+use rayon::prelude::*;
+
+// Sequential (i64: the squares and their sum overflow i32)
+let sum: i64 = (0..1_000_000i64).map(|x| x * x).sum();
+
+// Parallel (automatic work stealing)
+let sum: i64 = (0..1_000_000i64).into_par_iter().map(|x| x * x).sum();
+
+// Parallel with custom chunk size
+let results: Vec<_> = data
+    .par_chunks(1000)
+    .map(|chunk| process_chunk(chunk))
+    .collect();
+```
+
+---
+
+## Memory Layout
+
+### Use Appropriate Integer Sizes
+
+```rust
+// If values are small, use smaller types
+struct Item {
+    count: u8,      // 0-255, not u64
+    flags: u8,      // small enum
+    id: u32,        // if 4 billion is enough
+}
+```
+
+### Pack Structs Efficiently
+
+```rust
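+// BAD: 24 bytes due to padding when declaration order is kept (#[repr(C)])
+#[repr(C)]
+struct Bad {
+    a: u8,   // 1 byte + 7 padding
+    b: u64,  // 8 bytes
+    c: u8,   // 1 byte + 7 padding
+}
+
+// GOOD: 16 bytes. The default repr(Rust) may already reorder fields this way.
+// #[repr(packed)] also removes padding but misaligns fields; avoid it unless required.
+#[repr(C)]
+struct Good {
+    b: u64,  // 8 bytes
+    a: u8,   // 1 byte
+    c: u8,   // 1 byte + 6 padding
+}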
+```
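
Rust's default representation is free to reorder fields, so the padding described above is only guaranteed under `#[repr(C)]`. A minimal sketch for checking this with `size_of`, assuming a typical 64-bit target where `u64` is 8-byte aligned (all type names here are hypothetical):

```rust
use std::mem::size_of;

// #[repr(C)] keeps fields in declaration order, so the padding is visible.
#[repr(C)]
pub struct DeclOrder {
    a: u8,
    b: u64,
    c: u8,
}

// Same fields, largest first: the two u8 fields share one padded tail.
#[repr(C)]
pub struct Sorted {
    b: u64,
    a: u8,
    c: u8,
}

// Default repr(Rust): the compiler may pick a tighter order on its own.
pub struct DefaultRepr {
    a: u8,
    b: u64,
    c: u8,
}

/// Returns the sizes of (DeclOrder, Sorted, DefaultRepr) in bytes.
pub fn sizes() -> (usize, usize, usize) {
    (size_of::<DeclOrder>(), size_of::<Sorted>(), size_of::<DefaultRepr>())
}
```

On such a target the first two sizes are 24 and 16. The third is left to the compiler, which is why manual reordering mostly matters for `#[repr(C)]` types such as FFI structs.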
+
+### Box Large Values
+
+```rust
+// An enum is as large as its largest variant
+enum Message {
+    Quit,
+    Data([u8; 10000]),  // every Message value takes 10000+ bytes
+}
+
+// Better: box the large variant
+enum Message {
+    Quit,
+    Data(Box<[u8; 10000]>),  // Message is now about pointer-sized
+}
+```
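
The size difference can be checked directly. This is a sketch with hypothetical type names (`Inline`, `Boxed`) standing in for the two `Message` definitions above:

```rust
use std::mem::size_of;

/// Large payload stored inline: every value is as big as the payload.
pub enum Inline {
    Quit,
    Data([u8; 10_000]),
}

/// Large payload behind a Box: the enum itself only holds a pointer.
pub enum Boxed {
    Quit,
    Data(Box<[u8; 10_000]>),
}

/// Returns the sizes of (Inline, Boxed) in bytes.
pub fn enum_sizes() -> (usize, usize) {
    (size_of::<Inline>(), size_of::<Boxed>())
}
```

The trade-off is one heap allocation and one pointer hop per `Data` value, in exchange for cheap moves and compact storage of `Quit`.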
+
+---
+
+## Async Performance
+
+### Avoid Blocking in Async
+
+```rust
+use std::time::Duration;
+
+// BAD: blocks the executor
+async fn bad() {
+    std::thread::sleep(Duration::from_secs(1));  // blocking!
+    std::fs::read_to_string("file.txt").unwrap();  // blocking!
+}
+
+// GOOD: use async versions
+async fn good() {
+    tokio::time::sleep(Duration::from_secs(1)).await;
+    tokio::fs::read_to_string("file.txt").await.unwrap();
+}
+
+// For blocking or CPU-heavy work: move it off the async worker threads
+async fn compute() -> i32 {
+    tokio::task::spawn_blocking(|| {
+        heavy_computation()
+    }).await.unwrap()
+}
+```
+
+### Buffer Async I/O
+
+```rust
+use tokio::fs::File;
+use tokio::io::{AsyncBufReadExt, AsyncReadExt, BufReader};
+
+// BAD: many small reads
+async fn bad(mut file: File) {
+    let mut byte = [0u8];
+    while file.read(&mut byte).await.unwrap() > 0 {
+        process(byte[0]);
+    }
+}
+
+// GOOD: buffered reading
+async fn good(file: File) {
+    let reader = BufReader::new(file);
+    let mut lines = reader.lines();
+    while let Some(line) = lines.next_line().await.unwrap() {
+        process(&line);
+    }
+}
+```
+
+---
+
+## Release Build Optimization
+
+### Cargo.toml Settings
+
+```toml
+[profile.release]
+lto = true           # Link-time optimization
+codegen-units = 1    # Single codegen unit (slower compile, faster code)
+panic = "abort"      # Smaller binary, no unwinding
+strip = true         # Strip symbols (turn off when profiling; flamegraphs need them)
+
+[profile.release-fast]
+inherits = "release"
+opt-level = 3        # Maximum optimization (already the release default)
+
+[profile.release-small]
+inherits = "release"
+opt-level = "s"      # Optimize for size
+```
+
+### Compile-Time Assertions
+
+```rust
+// Zero runtime cost
+const _: () = assert!(std::mem::size_of::<MyStruct>() <= 64);
+```
+
+---
+
+## Checklist
+
+Before optimizing:
+- [ ] Profile to find actual bottlenecks
+- [ ] Have benchmarks to measure improvement
+- [ ] Consider if optimization is worth complexity
+
+Common wins:
+- [ ] Reduce allocations (Cow, reuse buffers)
+- [ ] Use appropriate collections
+- [ ] Pre-allocate with_capacity
+- [ ] Use iterators instead of indexing
+- [ ] Enable LTO for release builds
+- [ ] Use rayon for parallel workloads