A Frozen Neural Network

I presented a type of frozen neural network a long time ago. I just re-blogged about it:

https://ko-fi.com/post/Frozen-Neural-Networks-Fixed-Filter-Bank-Architec-Z8Z31KRNGP

Anyway, I asked ChatGPT5 for its thoughts on it, and it replied:

Here’s a practical, modernized variant of the frozen-weights / learned-activations idea, with architecture patterns, training tips, a code sketch, an evaluation plan, and suggested extensions. I’ll keep it concrete so you could implement or present it without extra back-and-forth.

Modernized Frozen Network (overview)

The core idea: keep most linear mixing fixed (fast structured transforms or fixed random orthogonal maps) and learn a small, expressive set of parameters placed in (a) parameterized activation functions, (b) small adapters, and (c) light normalization/gating modules. This hybrid keeps compute and memory low while restoring much of the adaptability of fully learned nets.

Key design goals:

  • High throughput inference (O(n log n) mixing where possible).

  • Small trainable parameter count (so fewer gradient steps, less overfitting).

  • Rich per-channel nonlinear adaptation (so fixed mixing isn’t a bottleneck).

  • Compatibility with modern building blocks (CNNs, MLPs, Transformers).

Architecture blueprint

  1. Input → Fixed Mixing Layer(s)

    • Use structured orthogonal transforms (Walsh–Hadamard, DCT, or FFT), or fixed random orthogonal matrices (stored implicitly).

    • Optionally interleave simple channel shuffles or permutations.

    • Compute cost: O(n log n) if using WHT/FFT.

  2. Channel/Feature-wise Parameterized Activation (PAct)

    • Replace simple ReLU with an expressive but low-dimensional parametric function per channel.

    • Examples: two-slope leaky ReLU with learnable slopes and bias; piecewise linear with 3 segments; monotone cubic splines with a few control points; or gated linear units with a learned gate.

    • Trainable parameters are per-channel and modest in number (e.g., 2–8 scalars/channel).

  3. Small Trainable Adapters (optional)

    • Insert tiny trainable modules between blocks: e.g., LoRA-style low-rank matrices, 1×1 convs, or depthwise separable bottleneck MLPs.

    • Keep their rank/bottleneck small (k ≪ dimension).

    • These give extra flexibility when activation parameters alone aren’t enough.

  4. Light Normalization & Gates

    • Per-channel scale and bias (like LayerNorm/BatchNorm but light), optionally learned.

    • Global learned gating scalars to enable or attenuate blocks during training.

  5. Readout / Head

    • Small trainable head (e.g., 2-layer MLP or a linear classifier). This is where most task-specific parameters live.
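
As a rough illustration of step 1, here is a minimal pure-Python sketch of one way a "fixed random orthogonal map stored implicitly" could be realized: a seeded ±1 sign flip followed by an orthonormal Walsh–Hadamard transform. The sign-flip construction and the names fwht, make_signs and fixed_mix are assumptions made for this example, not something specified above.

```python
import random

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform, O(n log n).

    len(values) must be a power of 2.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5  # makes the transform orthonormal
    return [v * scale for v in out]

def make_signs(n, seed):
    """Fixed +/-1 pattern regenerated from a seed, so no matrix is stored."""
    rng = random.Random(seed)
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(n)]

def fixed_mix(values, seed=0):
    """Sign flip followed by WHT: one frozen, orthogonal mixing layer."""
    signs = make_signs(len(values), seed)
    return fwht([s * v for s, v in zip(signs, values)])
```

Because both the sign flip and the scaled transform are orthogonal, the layer preserves vector length, and the only state it needs is the integer seed.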

Concrete module: Parameterized Activation (PAct) patterns

A) Bi-slope PReLU variant (very cheap)
f(x) = { a·x + c if x ≥ 0 ; b·x + c if x < 0 }
Trainable per-channel: a, b, c (3 scalars/channel)

B) Piecewise linear with 3 segments (medium expressivity)
Knots at {t₁, t₂} with slopes s₀, s₁, s₂, all learned per-channel (5 scalars/channel).

C) Monotone cubic spline (higher expressivity)
Use 3–5 control points, constrain monotonicity if useful; still only ~5–10 scalars/channel.

D) Self-gated unit (Swish-style, a simplified GLU) with learned scale and bias
f(x) = x · sigmoid(α·x + β) ; α,β per-channel (2 scalars/channel)
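
Patterns B and D are only described in words above, so here is a small scalar sketch of both for a single channel. It assumes the piecewise-linear function is continuous and that its left segment lies on the line s0·x (so no separate intercept is needed). The function names are hypothetical.

```python
import math

def piecewise_linear(x, t1, t2, s0, s1, s2):
    """Continuous 3-segment function with knots t1 < t2 and slopes s0, s1, s2.

    Assumption: the left segment is s0 * x; the other two segments are
    shifted so the function stays continuous at both knots.
    """
    if t1 >= t2:
        raise ValueError("knots must satisfy t1 < t2")
    y1 = s0 * t1                  # value at the first knot
    y2 = y1 + s1 * (t2 - t1)      # value at the second knot
    if x < t1:
        return s0 * x
    if x < t2:
        return y1 + s1 * (x - t1)
    return y2 + s2 * (x - t2)

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def self_gate(x, alpha, beta):
    """Pattern D: x * sigmoid(alpha * x + beta)."""
    return x * sigmoid(alpha * x + beta)
```

With alpha = 0 and beta = 0 the gate reduces to x / 2, and large positive beta pushes it toward the identity, which gives a feel for what the two scalars control.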

Why learn activations?

  • They change the transfer characteristics of each feature channel, effectively re-weighting or routing information coming from the fixed mixing.

  • Because activations are nonlinear and per-channel, they can create a surprisingly expressive representation with far fewer weights.

  • Easier to regularize and interpret (e.g., inspect slopes, thresholds).

Hybridize with adapters

If activations alone underperform:

  • Add small LoRA adapters on top of fixed mixing: learn ΔW ≈ U·V where U∈R^{d×r}, V∈R^{r×d}, r small (4–32).

  • Or add 1×1 trainable convs (depthwise separable for efficiency) between transform blocks.
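
To make the low-rank correction concrete, here is a pure-Python sketch that applies x + (x·U)·V to one row vector. It multiplies by U first, so the d×d product U·V is never formed. The helper names are made up for this example, and matrices are plain lists of rows.

```python
def row_times_matrix(x, m):
    """Row vector x (length p) times matrix m (p x q, stored as a list of rows)."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]

def apply_adapter(x, u, v):
    """Return x + (x U) V for U of shape d x r and V of shape r x d."""
    low = row_times_matrix(x, u)       # project down to r numbers
    delta = row_times_matrix(low, v)   # project back up to d numbers
    return [xi + di for xi, di in zip(x, delta)]

def adapter_params(d, r):
    """Trainable scalars in U and V together."""
    return 2 * d * r
```

For d = 1024 and r = 8 this is 16,384 trainable scalars, against 1,048,576 for a full d×d matrix.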

Training recipe

  • Optimizer: AdamW (or Adam) with moderate LR for adapters/heads (e.g., 1e-3), and lower LR for activation params (1e-4 — they can be sensitive).

  • Weight decay: small for activation params (0–1e-6), normal for adapters.

  • Learning schedule: cosine or linear warmup + decay.

  • Normalization: prefer LayerNorm-style normalization over BatchNorm, which can be brittle with fixed transforms.

  • Initialization: set activation params to reproduce a standard nonlinearity at the start (e.g., a=1, b=0, c=0 gives plain ReLU; a small b such as 0.01 gives leaky ReLU; a=b=1 gives the identity).

  • Regularization: small dropout in heads, L2 on adapter weights; optionally sparsity penalty on activation parameter deltas to encourage minimal change from frozen baseline.
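
The "linear warmup + cosine decay" schedule can be written as a small function of the step count. This is a sketch under assumed conventions (warmup reaches the base rate on its last step, decay ends at min_lr). The function name and the step counts in the usage note are illustrative only.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine
```

With the two rates suggested above, the same schedule would be evaluated twice per step, for example lr_at(step, 1000, 1e-3, warmup_steps=100) for adapters and heads and lr_at(step, 1000, 1e-4, warmup_steps=100) for activation parameters.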

Implementation sketch (PyTorch-like)

import torch
import torch.nn as nn
import torch.nn.functional as F

class BiSlopePAct(nn.Module):
    # per-channel learnable slopes and bias
    def __init__(self, channels, init_pos=1.0, init_neg=0.01, init_bias=0.0):
        super().__init__()
        self.a = nn.Parameter(torch.full((channels,), init_pos))  # slope for x>=0 (not constrained to stay >=0)
        self.b = nn.Parameter(torch.full((channels,), init_neg))  # slope for x<0
        self.c = nn.Parameter(torch.full((channels,), init_bias))

    def forward(self, x):
        # x shape: (B, C, ...)
        a = self.a.view(1, -1, *([1] * (x.ndim - 2)))
        b = self.b.view(1, -1, *([1] * (x.ndim - 2)))
        c = self.c.view(1, -1, *([1] * (x.ndim - 2)))
        pos = F.relu(x)
        neg = -F.relu(-x)
        return a * pos + b * neg + c

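# Example frozen mixing using a fast Walsh-Hadamard transform (WHT)
def hadamard_transform(x):
    # Iterative O(n log n) WHT over the last dim (size must be a power of 2).
    # torch.linalg provides no Hadamard routine, so the butterfly is written out here.
    # For production, use an optimized kernel or library.
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last dim must be a power of 2"
    h, y = 1, x
    while h < n:
        y = y.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        y = torch.stack((y[..., 0, :] + y[..., 1, :], y[..., 0, :] - y[..., 1, :]), dim=-2)
        h *= 2
    return y.reshape(x.shape) / n ** 0.5  # orthonormal scaling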

class FrozenBlock(nn.Module):
    def __init__(self, channels, use_adapter=False, adapter_rank=8):
        super().__init__()
        self.channels = channels
        self.pact = BiSlopePAct(channels)
        self.use_adapter = use_adapter
        if use_adapter:
            r = adapter_rank
            self.U = nn.Parameter(torch.randn(channels, r) * 0.01)
            self.V = nn.Parameter(torch.randn(r, channels) * 0.01)
        # fixed transform has no params

    def forward(self, x):
        # x: [B, C]
        x = hadamard_transform(x)  # fixed mixing: no trainable parameters (gradients still pass through)
        x = self.pact(x)
        if self.use_adapter:
            delta = (x @ self.U) @ self.V  # low-rank correction; avoids forming the C x C product
            x = x + delta
        return x

Notes: use efficient fast transform libraries or implement in CUDA for speed; treat hadamard as deterministic, non-trainable.

Practical tips & pitfalls

  • Dimension alignment: Fast transforms typically need power-of-two sizes (WHT). Use small linear projections (fixed) to resize feature dims or pad/truncate.

  • Expressivity vs. simplicity: Start with 2–3 scalar parameters per channel. If you see systematic failure modes, increase toward piecewise linear splines.

  • Optimization stability: Activation parameter gradients can be noisy; reduce LR, use gradient clipping.

  • Initialization matters: Initialize activation params to identity-like mapping so early training isn’t chaotic.

  • Batch statistics: Fixed transforms can change activation distributions—normalize (LayerNorm) before/after PAct if needed.
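
For the dimension-alignment tip, zero-padding up to the next power of two is probably the simplest option. A minimal sketch, with hypothetical helper names:

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (n must be at least 1)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1 << (n - 1).bit_length()

def pad_to_power_of_two(values):
    """Zero-pad a feature vector so a WHT can be applied to it."""
    target = next_power_of_two(len(values))
    return list(values) + [0.0] * (target - len(values))
```

A 100-dimensional feature vector would be padded to 128 entries. Whether to pad or to truncate down to 64 is a judgment call that depends on how much information the trailing features carry.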

Where this shines (use cases)

  • Edge devices / microcontrollers: low memory, fast structured ops fit well.

  • Few-shot / small-data: fewer trainable params → less overfitting.

  • Applications with known transforms: audio (FFT), images (DCT), time-series — the fixed mixing can encode domain priors.

  • Ensembling / interpretability: fixed part is analyzable; learned activation shapes give interpretable modulation.

Evaluation and ablations (how to show results)

Suggested experiments:

  1. Image classification (CIFAR-10/CIFAR-100): compare baseline MLP/CNN vs. Frozen + PAct, Frozen + adapters.

  2. Audio classification (Speech Commands dataset): test FFT-based frozen mixing.

  3. Ablations:

    • PAct only vs. PAct + adapters.

    • Different PAct complexity (2-parameter vs 5-parameter vs spline).

    • Different transforms: WHT vs random orthogonal vs identity.

    • Parameter budget curves (plot accuracy vs. #trainable params).

Metrics:

  • Accuracy / F1, inference latency, memory footprint (model size), training steps to convergence.
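
For the parameter-budget curves, the x-axis values can be worked out before any training is run. The sketch below counts trainable scalars for a few configurations. The layout (equal-width blocks, a linear head with bias, a dense MLP baseline with biases) and the sizes in the usage note are assumptions chosen for illustration.

```python
def frozen_net_params(width, depth, pact_scalars, adapter_rank=0, num_classes=10):
    """Trainable scalars in a frozen-mixing network with a linear head."""
    pact = depth * width * pact_scalars           # per-channel activation params
    adapters = depth * 2 * width * adapter_rank   # U and V in each block
    head = width * num_classes + num_classes      # linear classifier with bias
    return pact + adapters + head

def dense_mlp_params(width, depth, num_classes=10):
    """Trainable scalars in a same-shape MLP with fully learned layers."""
    hidden = depth * (width * width + width)      # weights plus biases
    head = width * num_classes + num_classes
    return hidden + head
```

With width 1024 and depth 4, a 3-scalar PAct gives 22,538 trainable scalars, adding rank-8 adapters brings that to 88,074, and the dense baseline has 4,208,650.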

Extensions & research directions

  • Transformer-bodied variant: replace attention mixing with fixed orthogonal projections (or frozen attention weights) + learned per-head activations/gates. Combine with LoRA on key/value projections.

  • Learned permutations: keep base transform fixed but learn low-cost permutations between layers to tune mixing topology.

  • Meta-learn activation priors: learn initialization of activation parameters across tasks (MAML-style) so frozen backbone adapts faster.

  • Sparsity and compression: quantize activation params or enforce sparse adapters for ultra-small deployments.

  • Theoretical analysis: study representational capacity of frozen mixing + PAct relative to low-rank learned matrices.

Presentation-friendly summary (3 bullets)

  • Use fast, fixed linear transforms as the network backbone to get O(n log n) mixing and a tiny memory footprint.

  • Learn per-channel, expressive activation function parameters and (optionally) tiny adapters to recover task adaptability.

  • The hybrid targets efficiency and interpretability, with the aim of staying competitive in small-data or constrained settings.


Of course, as a hobbyist I just get ignored. If a trillion dollars has been misallocated to the wrong sort of hardware in datacenters, well, I did pipe up at the time.

You post a lot. Are you just spamming chatgpt responses?

Well, there are no written rules or guidance about AI-assisted comments. The ideas do not originate with AI; often I just use AI to express more clearly what I mean, using more standard terminology than I typically use. Sometimes AI does chip in other ideas and viewpoints and points to prior research in a particular area.

And AI sometimes points out weaknesses in my opinions.

I learned my lesson from Bitcoin and not seeing where that would inevitably lead.

Do you think I will make the same mistake again by refusing AI?

Anyway you need not worry too much, I am going the paid digital content route.

Whatever you see here is about the last unpaid content I will provide, unless it should prove that in 3 to 6 months I haven’t managed to bring in a cent from digital content.

Then you might see me around again.

I wonder though if I will be allowed to provide links to my paid content here? Anyway Pinterest is the correct way to link to paid content, so it doesn’t matter too much.

It depends on how spammy it is.