Add: windows mvp - transparent bugs not fixed

This commit is contained in:
DaZuo0122
2026-02-12 22:58:33 +08:00
commit 61825f647d
147 changed files with 28498 additions and 0 deletions

@@ -0,0 +1,157 @@
---
name: m10-performance
description: "CRITICAL: Use for performance optimization. Triggers: performance, optimization, benchmark, profiling, flamegraph, criterion, slow, fast, allocation, cache, SIMD, make it faster, 性能优化, 基准测试"
user-invocable: false
---
# Performance Optimization
> **Layer 2: Design Choices**
## Core Question
**What's the bottleneck, and is optimization worth it?**
Before optimizing:
- Have you measured? (Don't guess)
- What's the acceptable performance?
- Will optimization add complexity?
---
## Performance Decision → Implementation
| Goal | Design Choice | Implementation |
|------|---------------|----------------|
| Reduce allocations | Pre-allocate, reuse | `with_capacity`, object pools |
| Improve cache locality | Contiguous data | `Vec`, `SmallVec` |
| Parallelize | Data parallelism | `rayon`, threads |
| Avoid copies | Zero-copy | References, `Cow<T>` |
| Reduce indirection | Inline data | `smallvec`, arrays |
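Two rows from the table, sketched in code (a minimal illustration; the helper names are made up, not from any specific codebase):

```rust
use std::borrow::Cow;

// Pre-allocate, reuse: a single allocation when the size is known up front.
fn squares(n: usize) -> Vec<usize> {
    let mut v = Vec::with_capacity(n);
    for i in 0..n {
        v.push(i * i);
    }
    v
}

// Zero-copy with Cow: borrow unchanged input, allocate only when modifying.
fn trimmed(s: &str) -> Cow<'_, str> {
    if s.starts_with(' ') || s.ends_with(' ') {
        Cow::Owned(s.trim().to_string())
    } else {
        Cow::Borrowed(s)
    }
}
```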
---
## Thinking Prompt
Before optimizing:
1. **Have you measured?**
- Profile first → flamegraph, perf
- Benchmark → criterion, cargo bench
- Identify actual hotspots
2. **What's the priority?**
- Algorithm (10x-1000x improvement)
- Data structure (2x-10x)
- Allocation (2x-5x)
- Cache (1.5x-3x)
3. **What's the trade-off?**
- Complexity vs speed
- Memory vs CPU
- Latency vs throughput
---
## Trace Up ↑
To domain constraints (Layer 3):
```
"How fast does this need to be?"
  ↑ Ask: What's the performance SLA?
  ↑ Check: domain-* (latency requirements)
  ↑ Check: Business requirements (acceptable response time)
```
| Question | Trace To | Ask |
|----------|----------|-----|
| Latency requirements | domain-* | What's acceptable response time? |
| Throughput needs | domain-* | How many requests per second? |
| Memory constraints | domain-* | What's the memory budget? |
---
## Trace Down ↓
To implementation (Layer 1):
```
"Need to reduce allocations"
  ↓ m01-ownership: Use references, avoid clone
  ↓ m02-resource: Pre-allocate with_capacity
"Need to parallelize"
  ↓ m07-concurrency: Choose rayon or threads
  ↓ m07-concurrency: Consider async for I/O-bound
"Need cache efficiency"
  ↓ Data layout: Prefer Vec over HashMap when possible
  ↓ Access patterns: Sequential over random access
```
---
## Quick Reference
| Tool | Purpose |
|------|---------|
| `cargo bench` | Micro-benchmarks |
| `criterion` | Statistical benchmarks |
| `perf` / `flamegraph` | CPU profiling |
| `heaptrack` | Allocation tracking |
| `valgrind` / `cachegrind` | Cache analysis |
## Optimization Priority
```
1. Algorithm choice (10x - 1000x)
2. Data structure (2x - 10x)
3. Allocation reduction (2x - 5x)
4. Cache optimization (1.5x - 3x)
5. SIMD/Parallelism (2x - 8x)
```
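The top of this list dominates: an algorithm change buys more than any micro-tuning. A sketch of the gap, using duplicate detection as an illustrative example (not from the source):

```rust
use std::collections::HashSet;

// O(n^2): compares every pair of elements.
fn has_duplicate_quadratic(items: &[u32]) -> bool {
    for i in 0..items.len() {
        for j in (i + 1)..items.len() {
            if items[i] == items[j] {
                return true;
            }
        }
    }
    false
}

// O(n): one pass with a set; `insert` returns false on a repeat.
fn has_duplicate_linear(items: &[u32]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    !items.iter().all(|x| seen.insert(x))
}
```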
## Common Techniques
| Technique | When | How |
|-----------|------|-----|
| Pre-allocation | Known size | `Vec::with_capacity(n)` |
| Avoid cloning | Hot paths | Use references or `Cow<T>` |
| Batch operations | Many small ops | Collect then process |
| SmallVec | Usually small | `smallvec::SmallVec<[T; N]>` |
| Inline buffers | Fixed-size data | Arrays over Vec |
---
## Common Mistakes
| Mistake | Why Wrong | Better |
|---------|-----------|--------|
| Optimize without profiling | Wrong target | Profile first |
| Benchmark in debug mode | Meaningless | Always `--release` |
| Use LinkedList | Cache unfriendly | `Vec` or `VecDeque` |
| Hidden `.clone()` | Unnecessary allocs | Use references |
| Premature optimization | Wasted effort | Make it work first |
---
## Anti-Patterns
| Anti-Pattern | Why Bad | Better |
|--------------|---------|--------|
| Clone to avoid lifetimes | Performance cost | Proper ownership |
| Box everything | Indirection cost | Stack when possible |
| HashMap for small sets | Overhead | Vec with linear search |
| String concat in loop | O(n^2) | `String::with_capacity` + `push_str`, or `join` |
---
## Related Skills
| When | See |
|------|-----|
| Reducing clones | m01-ownership |
| Concurrency options | m07-concurrency |
| Smart pointer choice | m02-resource |
| Domain requirements | domain-* |

@@ -0,0 +1,365 @@
# Rust Performance Optimization Guide
## Profiling First
### Tools
```bash
# CPU profiling
cargo install flamegraph
cargo flamegraph --bin myapp
# Memory profiling
cargo install cargo-instruments # macOS
heaptrack ./target/release/myapp # Linux
# Benchmarking
cargo bench # with criterion
# Cache analysis
valgrind --tool=cachegrind ./target/release/myapp
```
### Criterion Benchmarks
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_parse(c: &mut Criterion) {
    let input = "test data".repeat(1000);
    c.bench_function("parse_v1", |b| {
        // black_box keeps the compiler from optimizing the call away
        b.iter(|| parse_v1(black_box(&input)))
    });
    c.bench_function("parse_v2", |b| {
        b.iter(|| parse_v2(black_box(&input)))
    });
}

criterion_group!(benches, benchmark_parse);
criterion_main!(benches);
```
---
## Common Optimizations
### 1. Avoid Unnecessary Allocations
```rust
// BAD: allocates on every call
fn to_uppercase(s: &str) -> String {
    s.to_uppercase()
}

// GOOD: return Cow, allocate only if needed
use std::borrow::Cow;
fn to_uppercase(s: &str) -> Cow<'_, str> {
    // fast path: no lowercase chars means nothing to change
    // (checking `any(is_lowercase)` also handles digits and punctuation,
    // which `all(is_uppercase)` would misclassify)
    if !s.chars().any(|c| c.is_lowercase()) {
        Cow::Borrowed(s)
    } else {
        Cow::Owned(s.to_uppercase())
    }
}
```
### 2. Reuse Allocations
```rust
// BAD: creates a new Vec each iteration
for item in items {
    let mut buffer = Vec::new();
    process(&mut buffer, item);
}

// GOOD: reuse one buffer
let mut buffer = Vec::new();
for item in items {
    buffer.clear();
    process(&mut buffer, item);
}
```
### 3. Use Appropriate Collections
| Need | Collection | Notes |
|------|------------|-------|
| Sequential access | `Vec<T>` | Best cache locality |
| Random access by key | `HashMap<K, V>` | O(1) lookup |
| Ordered keys | `BTreeMap<K, V>` | O(log n) lookup |
| Small sets (<20) | `Vec<T>` + linear search | Lower overhead |
| FIFO queue | `VecDeque<T>` | O(1) push/pop both ends |
### 4. Pre-allocate Capacity
```rust
// BAD: many reallocations as the Vec grows
let mut v = Vec::new();
for i in 0..10000 {
    v.push(i);
}

// GOOD: single allocation
let mut v = Vec::with_capacity(10000);
for i in 0..10000 {
    v.push(i);
}
```
---
## String Optimization
### Avoid String Concatenation in Loops
```rust
// BAD: O(n²) allocations
let mut result = String::new();
for s in strings {
    result = result + &s;
}

// GOOD: O(n) with push_str
let mut result = String::new();
for s in strings {
    result.push_str(&s);
}

// BETTER: pre-calculate capacity
let total_len: usize = strings.iter().map(|s| s.len()).sum();
let mut result = String::with_capacity(total_len);
for s in strings {
    result.push_str(&s);
}

// BEST: use join for simple cases
let result = strings.join("");
```
### Use &str When Possible
```rust
// BAD: requires an owned allocation
fn greet(name: String) {
    println!("Hello, {}", name);
}

// GOOD: borrows, no allocation
fn greet(name: &str) {
    println!("Hello, {}", name);
}

// Works with both:
greet("world");                // &str
greet(&String::from("world")); // &String coerces to &str
```
---
## Iterator Optimization
### Use Iterators Over Indexing
```rust
// BAD: bounds checking on each access
let mut sum = 0;
for i in 0..vec.len() {
    sum += vec[i];
}

// GOOD: no bounds checking
let sum: i32 = vec.iter().sum();

// GOOD: when the index is needed
for (i, item) in vec.iter().enumerate() {
    // ...
}
```
### Lazy Evaluation
```rust
// Iterators are lazy - computation happens at collect
let result: Vec<_> = data
    .iter()
    .filter(|x| x.is_valid())
    .map(|x| x.process())
    .take(10) // stop after 10 items
    .collect();
```
### Avoid Collecting When Not Needed
```rust
// BAD: unnecessary intermediate allocation
let filtered: Vec<_> = items.iter().filter(|x| x.valid).collect();
let count = filtered.len();
// GOOD: no allocation
let count = items.iter().filter(|x| x.valid).count();
```
---
## Parallelism with Rayon
```rust
use rayon::prelude::*;

// Sequential (i64: the sum of squares up to 10^6 overflows i32)
let sum: i64 = (0..1_000_000i64).map(|x| x * x).sum();

// Parallel (automatic work stealing)
let sum: i64 = (0..1_000_000i64).into_par_iter().map(|x| x * x).sum();

// Parallel with custom chunk size
let results: Vec<_> = data
    .par_chunks(1000)
    .map(|chunk| process_chunk(chunk))
    .collect();
```
---
## Memory Layout
### Use Appropriate Integer Sizes
```rust
// If values are small, use smaller types
struct Item {
    count: u8, // 0-255, not u64
    flags: u8, // small enum
    id: u32,   // if 4 billion is enough
}
```
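The saving is easy to verify with `std::mem::size_of` (`WideItem` is a hypothetical all-`u64` version added here for comparison):

```rust
use std::mem::size_of;

// Two u8 fields plus a u32 pack into 8 bytes (4-byte alignment).
struct Item {
    count: u8,
    flags: u8,
    id: u32,
}

// Hypothetical all-u64 version: 24 bytes for the same data.
struct WideItem {
    count: u64,
    flags: u64,
    id: u64,
}

fn sizes() -> (usize, usize) {
    (size_of::<Item>(), size_of::<WideItem>())
}
```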
### Pack Structs Efficiently
```rust
// NOTE: the default repr(Rust) may reorder fields automatically,
// so these sizes are guaranteed only for #[repr(C)] types
// (which is where field order matters: FFI, file formats).

// BAD: 24 bytes under #[repr(C)] due to padding
#[repr(C)]
struct Bad {
    a: u8,  // 1 byte + 7 padding
    b: u64, // 8 bytes
    c: u8,  // 1 byte + 7 padding
}

// GOOD: 16 bytes, largest fields first
// (#[repr(packed)] removes padding entirely, but taking references
// to its fields is unsafe due to unaligned access)
#[repr(C)]
struct Good {
    b: u64, // 8 bytes
    a: u8,  // 1 byte
    c: u8,  // 1 byte + 6 padding
}
```
### Box Large Values
```rust
// Large enum variants waste space
enum Message {
    Quit,
    Data([u8; 10000]), // all variants are 10000+ bytes
}

// Better: box the large variant
enum Message {
    Quit,
    Data(Box<[u8; 10000]>), // variants are pointer-sized
}
```
---
## Async Performance
### Avoid Blocking in Async
```rust
use std::time::Duration;

// BAD: blocks the executor thread
async fn bad() {
    std::thread::sleep(Duration::from_secs(1)); // blocking!
    std::fs::read_to_string("file.txt").unwrap(); // blocking!
}

// GOOD: use async versions
async fn good() {
    tokio::time::sleep(Duration::from_secs(1)).await;
    tokio::fs::read_to_string("file.txt").await.unwrap();
}

// For CPU work: spawn_blocking moves it off the async worker threads
async fn compute() -> i32 {
    tokio::task::spawn_blocking(|| {
        heavy_computation()
    }).await.unwrap()
}
```
### Buffer Async I/O
```rust
use tokio::fs::File;
use tokio::io::{AsyncBufReadExt, AsyncReadExt, BufReader};

// BAD: many small reads, one syscall each
async fn bad(mut file: File) {
    let mut byte = [0u8];
    while file.read(&mut byte).await.unwrap() > 0 {
        process(byte[0]);
    }
}

// GOOD: buffered reading
async fn good(file: File) {
    let reader = BufReader::new(file);
    let mut lines = reader.lines();
    while let Some(line) = lines.next_line().await.unwrap() {
        process(&line);
    }
}
```
---
## Release Build Optimization
### Cargo.toml Settings
```toml
[profile.release]
lto = true           # Link-time optimization
codegen-units = 1    # Single codegen unit (slower compile, faster code)
panic = "abort"      # Smaller binary, no unwinding
strip = true         # Strip symbols

[profile.release-fast]
inherits = "release"
opt-level = 3        # Maximum optimization

[profile.release-small]
inherits = "release"
opt-level = "s"      # Optimize for size
```
### Compile-Time Assertions
```rust
// Zero runtime cost
const _: () = assert!(std::mem::size_of::<MyStruct>() <= 64);
```
---
## Checklist
Before optimizing:
- [ ] Profile to find actual bottlenecks
- [ ] Have benchmarks to measure improvement
- [ ] Consider if optimization is worth complexity
Common wins:
- [ ] Reduce allocations (Cow, reuse buffers)
- [ ] Use appropriate collections
- [ ] Pre-allocate with_capacity
- [ ] Use iterators instead of indexing
- [ ] Enable LTO for release builds
- [ ] Use rayon for parallel workloads