I presented a type of frozen neural network a long time ago. I just re-blogged about it:
https://ko-fi.com/post/Frozen-Neural-Networks-Fixed-Filter-Bank-Architec-Z8Z31KRNGP
Anyway, I asked ChatGPT 5 for its thoughts about it, and it replied:
Here’s a practical, modernized variant of the frozen-weights / learned-activations idea, with architecture patterns, training tips, a code sketch, an evaluation plan, and suggested extensions. I’ll keep it concrete so you could implement or present it without extra back-and-forth.
Modernized Frozen Network (overview)
The core idea: keep most linear mixing fixed (fast structured transforms or fixed random orthogonal maps) and learn a small, expressive set of parameters placed in (a) parameterized activation functions, (b) small adapters, and (c) light normalization/gating modules. This hybrid keeps compute and memory low while restoring much of the adaptability of fully learned nets.
Key design goals:
- High throughput inference (O(n log n) mixing where possible).
- Small trainable parameter count (so fewer gradient steps, less overfitting).
- Rich per-channel nonlinear adaptation (so fixed mixing isn’t a bottleneck).
- Compatibility with modern building blocks (CNNs, MLPs, Transformers).
Architecture blueprint
- Input → Fixed Mixing Layer(s)
  - Use structured orthogonal transforms (Walsh–Hadamard, DCT, or FFT), or fixed random orthogonal matrices (stored implicitly); see the sketch after this list.
  - Optionally interleave simple channel shuffles or permutations.
  - Compute cost: O(n log n) if using WHT/FFT.
- Channel/Feature-wise Parameterized Activation (PAct)
  - Replace the simple ReLU with an expressive but low-dimensional parametric function per channel.
  - Examples: two-slope leaky ReLU with learnable slopes and bias; piecewise linear with 3 segments; monotone cubic splines with a few control points; or gated linear units with a learned gate.
  - Trainable parameters are per-channel and modest in number (e.g., 2–8 scalars per channel).
- Small Trainable Adapters (optional)
  - Insert tiny trainable modules between blocks: e.g., LoRA-style low-rank matrices, 1×1 convs, or depthwise separable bottleneck MLPs.
  - Keep their rank/bottleneck small (k ≪ dimension).
  - These give extra flexibility when activation parameters alone aren’t enough.
- Light Normalization & Gates
  - Per-channel scale and bias (like LayerNorm/BatchNorm but lighter), optionally learned.
  - Global learned gating scalars to enable or attenuate blocks during training.
- Readout / Head
  - Small trainable head (e.g., a 2-layer MLP or a linear classifier). This is where most task-specific parameters live.
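One minimal sketch of the "fixed random orthogonal matrices (stored implicitly)" option; the FixedOrthogonalMixing module and seed handling below are illustrative choices, not something from the original post. The matrix is generated once from a seed and registered as a buffer, so it is saved with the model but never updated by the optimizer.

import torch
import torch.nn as nn

class FixedOrthogonalMixing(nn.Module):
    # Frozen random orthogonal mixing: generated once from a seed and stored
    # as a buffer, so it ships with the model but is never trained.
    def __init__(self, dim, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        # QR decomposition of a random Gaussian matrix yields an orthogonal Q.
        q, _ = torch.linalg.qr(torch.randn(dim, dim, generator=g))
        self.register_buffer("weight", q)

    def forward(self, x):
        # x: (B, dim)
        return x @ self.weight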
Concrete module: Parameterized Activation (PAct) patterns
A) Bi-slope PReLU variant (very cheap)
f(x) = { a·x + c if x ≥ 0 ; b·x + c if x < 0 }
Trainable per-channel: a, b, c (3 scalars/channel)
B) Piecewise linear with 3 knots (medium expressivity)
Knots at {t₁, t₂} with slopes s₀,s₁,s₂ — learned per-channel (5 scalars/channel).
C) Monotone cubic spline (higher expressivity)
Use 3–5 control points, constrain monotonicity if useful; still only ~5–10 scalars/channel.
D) Gated linear unit (GLU) with learned bias
f(x) = x · sigmoid(α·x + β) ; α,β per-channel (2 scalars/channel)
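A minimal sketch of pattern D as a per-channel module, assuming the same (B, C, ...) tensor layout used by the other modules in this post; the GatedPAct name is illustrative.

import torch
import torch.nn as nn

class GatedPAct(nn.Module):
    # Pattern D: f(x) = x * sigmoid(alpha * x + beta), with alpha/beta per channel.
    def __init__(self, channels, init_alpha=1.0, init_beta=0.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((channels,), init_alpha))
        self.beta = nn.Parameter(torch.full((channels,), init_beta))

    def forward(self, x):
        # x: (B, C, ...); broadcast per-channel parameters over trailing dims
        shape = (1, -1) + (1,) * (x.ndim - 2)
        return x * torch.sigmoid(self.alpha.view(shape) * x + self.beta.view(shape))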
Why learn activations?
- They change the transfer characteristics of each feature channel, effectively re-weighting or routing information coming from the fixed mixing.
- Because activations are nonlinear and per-channel, they can create a surprisingly expressive representation with far fewer weights.
- They are easier to regularize and interpret (e.g., inspect slopes and thresholds).
Hybridize with adapters
If activations alone underperform:
- Add small LoRA adapters on top of the fixed mixing: learn ΔW ≈ U·V where U ∈ R^{d×r}, V ∈ R^{r×d}, with r small (4–32); see the sketch below.
- Or add trainable 1×1 convs (depthwise separable for efficiency) between transform blocks.
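A minimal sketch of the first option; LoRAAdapter here is an illustrative stand-alone version of the U·V correction that also appears inside FrozenBlock in the implementation sketch further down.

import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    # Low-rank residual correction: x -> x + (x @ U) @ V, with rank r << dim.
    def __init__(self, dim, rank=8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(dim, rank) * 0.01)
        self.V = nn.Parameter(torch.zeros(rank, dim))  # zero init: identity mapping at start

    def forward(self, x):
        # x: (B, dim)
        return x + (x @ self.U) @ self.V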
Training recipe
- Optimizer: AdamW (or Adam) with a moderate LR for adapters/heads (e.g., 1e-3) and a lower LR for activation params (e.g., 1e-4; they can be sensitive). A parameter-group setup is sketched after this list.
- Weight decay: small for activation params (0–1e-6), normal for adapters.
- Learning schedule: cosine or linear warmup + decay.
- Normalization: prefer LayerNorm-like normalization that isn’t brittle when combined with fixed transforms.
- Initialization: set activation params to replicate the identity/nonlinearity at the start (e.g., a=1, c=0, and b=0 for a ReLU-like start or b≈0.01 for a leaky start).
- Regularization: small dropout in heads, L2 on adapter weights; optionally a sparsity penalty on activation-parameter deltas to encourage minimal change from the frozen baseline.
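A minimal sketch of how the split learning rates might be wired up. The name-based split on "pact" is only one way to do it and assumes PAct modules are stored under attributes named pact, as in the implementation sketch below.

import torch

def build_optimizer(model, num_steps):
    # Group parameters: PAct params get a lower LR and no weight decay;
    # adapters and the head get the standard settings. The frozen transform
    # has no parameters, so it never enters the optimizer at all.
    pact_params, other_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (pact_params if "pact" in name else other_params).append(p)
    optimizer = torch.optim.AdamW([
        {"params": pact_params, "lr": 1e-4, "weight_decay": 0.0},
        {"params": other_params, "lr": 1e-3, "weight_decay": 1e-4},
    ])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
    return optimizer, scheduler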
Implementation sketch (PyTorch-like)
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiSlopePAct(nn.Module):
    # Per-channel learnable slopes and bias (pattern A above).
    def __init__(self, channels, init_pos=1.0, init_neg=0.01, init_bias=0.0):
        super().__init__()
        self.a = nn.Parameter(torch.full((channels,), init_pos))   # slope for x >= 0
        self.b = nn.Parameter(torch.full((channels,), init_neg))   # slope for x < 0
        self.c = nn.Parameter(torch.full((channels,), init_bias))  # per-channel bias

    def forward(self, x):
        # x shape: (B, C, ...); broadcast per-channel params over trailing dims
        a = self.a.view(1, -1, *([1] * (x.ndim - 2)))
        b = self.b.view(1, -1, *([1] * (x.ndim - 2)))
        c = self.c.view(1, -1, *([1] * (x.ndim - 2)))
        pos = F.relu(x)
        neg = -F.relu(-x)
        return a * pos + b * neg + c

# Example frozen mixing using the Walsh-Hadamard transform (fast)
def hadamard_transform(x):
    # Iterative fast Walsh-Hadamard transform over the last dimension.
    # Assumes the last dim size is a power of 2; O(n log n) per vector.
    # For production use an optimized library or a custom CUDA kernel.
    n = x.shape[-1]
    batch_shape = x.shape[:-1]
    h = 1
    while h < n:
        y = x.reshape(*batch_shape, n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        x = torch.cat([a + b, a - b], dim=-1).reshape(*batch_shape, n)
        h *= 2
    return x / n ** 0.5  # orthonormal scaling keeps activation magnitudes stable

class FrozenBlock(nn.Module):
    def __init__(self, channels, use_adapter=False, adapter_rank=8):
        super().__init__()
        self.channels = channels
        self.pact = BiSlopePAct(channels)
        self.use_adapter = use_adapter
        if use_adapter:
            r = adapter_rank
            self.U = nn.Parameter(torch.randn(channels, r) * 0.01)
            self.V = nn.Parameter(torch.randn(r, channels) * 0.01)
        # the fixed transform itself has no parameters

    def forward(self, x):
        # x: [B, C]
        x = hadamard_transform(x)  # fixed mixing: no trainable parameters
        x = self.pact(x)
        if self.use_adapter:
            delta = x @ (self.U @ self.V)  # small low-rank correction
            x = x + delta
        return x
Notes: use an efficient fast-transform library (or a custom CUDA kernel) for speed; treat hadamard_transform as deterministic and non-trainable.
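For completeness, a minimal usage sketch building on the classes above; the FrozenNet wrapper, depth, and dimensions are illustrative choices, not part of the original post.

class FrozenNet(nn.Module):
    # Stack of frozen-mixing blocks followed by a small trainable linear head.
    def __init__(self, dim=256, depth=4, num_classes=10, use_adapter=True):
        super().__init__()
        self.blocks = nn.ModuleList(
            [FrozenBlock(dim, use_adapter=use_adapter) for _ in range(depth)]
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # x: (B, dim); dim must be a power of 2 for the Walsh-Hadamard mixing
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)

model = FrozenNet()
logits = model(torch.randn(8, 256))  # -> shape (8, 10)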
Practical tips & pitfalls
- Dimension alignment: fast transforms such as the WHT typically need power-of-two sizes. Use small fixed linear projections to resize feature dims, or pad/truncate (a padding helper is sketched after this list).
- Expressivity vs. simplicity: start with 2–3 scalar parameters per channel. If you see systematic failure modes, increase toward piecewise-linear splines.
- Optimization stability: activation-parameter gradients can be noisy; reduce the LR and use gradient clipping.
- Initialization matters: initialize activation params to an identity-like mapping so early training isn’t chaotic.
- Batch statistics: fixed transforms can change activation distributions; normalize (LayerNorm) before/after the PAct if needed.
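A minimal sketch of the pad option for dimension alignment; pad_to_pow2 is an illustrative helper name.

import torch
import torch.nn.functional as F

def pad_to_pow2(x):
    # Zero-pad the last dimension up to the next power of two so fast
    # transforms like the WHT can be applied directly.
    n = x.shape[-1]
    target = 1 << (n - 1).bit_length()  # smallest power of two >= n
    return F.pad(x, (0, target - n)) if target != n else x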
Where this shines (use cases)
- Edge devices / microcontrollers: low memory and fast structured ops fit well.
- Few-shot / small-data regimes: fewer trainable params → less overfitting.
- Applications with known transforms: audio (FFT), images (DCT), time series, where the fixed mixing can encode domain priors.
- Ensembling / interpretability: the fixed part is analyzable, and the learned activation shapes give interpretable modulation.
Evaluation and ablations (how to show results)
Suggested experiments:
- Image classification (CIFAR-10/CIFAR-100): compare a baseline MLP/CNN vs. Frozen + PAct vs. Frozen + adapters.
- Audio classification (Speech Commands): test FFT-based frozen mixing.
- Ablations:
  - PAct only vs. PAct + adapters.
  - Different PAct complexity (2-parameter vs. 5-parameter vs. spline).
  - Different transforms: WHT vs. random orthogonal vs. identity.
  - Parameter-budget curves (plot accuracy vs. number of trainable params); a counting helper is sketched after this list.
- Metrics: accuracy/F1, inference latency, memory footprint (model size), and training steps to convergence.
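For the parameter-budget curves, a small helper along these lines (illustrative, assuming a PyTorch nn.Module) reports trainable vs. frozen sizes:

def count_params(model):
    # Separate trainable parameters from frozen ones and from buffers
    # (e.g., a fixed orthogonal matrix stored via register_buffer).
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    buffers = sum(b.numel() for b in model.buffers())
    return {"trainable": trainable, "frozen": frozen, "buffers": buffers}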
Extensions & research directions
- Transformer-bodied variant: replace attention mixing with fixed orthogonal projections (or frozen attention weights) plus learned per-head activations/gates; combine with LoRA on the key/value projections.
- Learned permutations: keep the base transform fixed but learn low-cost permutations between layers to tune the mixing topology.
- Meta-learned activation priors: learn the initialization of activation parameters across tasks (MAML-style) so the frozen backbone adapts faster.
- Sparsity and compression: quantize activation params or enforce sparse adapters for ultra-small deployments.
- Theoretical analysis: study the representational capacity of frozen mixing + PAct relative to low-rank learned matrices.
Presentation-friendly summary (3 bullets)
- Use fast, fixed linear transforms as the network backbone to get O(n log n) mixing and a tiny memory footprint.
- Learn per-channel, expressive activation-function parameters and (optionally) tiny adapters to recover task adaptability.
- This hybrid achieves strong efficiency, interpretability, and competitive performance in many small-data or constrained settings.