A Frozen Neural Network

I presented a type of frozen neural network a long time ago. I just re-blogged about it:

https://ko-fi.com/post/Frozen-Neural-Networks-Fixed-Filter-Bank-Architec-Z8Z31KRNGP

Anyway, I asked ChatGPT-5 for its thoughts on it, and it replied:

Here’s a practical, modernized variant of the frozen-weights / learned-activations idea, with architecture patterns, training tips, a code sketch, an evaluation plan, and suggested extensions. I’ll keep it concrete so you can implement or present it without extra back-and-forth.

Modernized Frozen Network (overview)

The core idea: keep most linear mixing fixed (fast structured transforms or fixed random orthogonal maps) and learn a small, expressive set of parameters placed in (a) parameterized activation functions, (b) small adapters, and (c) light normalization/gating modules. This hybrid keeps compute and memory low while restoring much of the adaptability of fully learned nets.

Key design goals:

  • High throughput inference (O(n log n) mixing where possible).

  • Small trainable parameter count (so fewer gradient steps, less overfitting).

  • Rich per-channel nonlinear adaptation (so fixed mixing isn’t a bottleneck).

  • Compatibility with modern building blocks (CNNs, MLPs, Transformers).

Architecture blueprint

  1. Input → Fixed Mixing Layer(s)

    • Use structured orthogonal transforms (Walsh–Hadamard, DCT, or FFT), or fixed random orthogonal matrices (stored implicitly).

    • Optionally interleave simple channel shuffles or permutations.

    • Compute cost: O(n log n) if using WHT/FFT.

  2. Channel/Feature-wise Parameterized Activation (PAct)

    • Replace simple ReLU with an expressive but low-dimensional parametric function per channel.

    • Examples: two-slope leaky ReLU with learnable slopes and bias; piecewise linear with 3 segments; monotone cubic splines with a few control points; or gated linear units with a learned gate.

    • Trainable parameters are per-channel and modest in number (e.g., 2–8 scalars/channel).

  3. Small Trainable Adapters (optional)

    • Insert tiny trainable modules between blocks: e.g., LoRA-style low-rank matrices, 1×1 convs, or depthwise separable bottleneck MLPs.

    • Keep their rank/bottleneck small (k ≪ dimension).

    • These give extra flexibility when activation parameters alone aren’t enough.

  4. Light Normalization & Gates

    • Per-channel scale and bias (like LayerNorm/BatchNorm but light), optionally learned.

    • Global learned gating scalars to enable or attenuate blocks during training.

  5. Readout / Head

    • Small trainable head (e.g., 2-layer MLP or a linear classifier). This is where most task-specific parameters live.

Concrete module: Parameterized Activation (PAct) patterns

A) Bi-slope PReLU variant (very cheap)
f(x) = { a·x + c if x ≥ 0 ; b·x + c if x < 0 }
Trainable per-channel: a, b, c (3 scalars/channel)

B) Piecewise linear with 3 knots (medium expressivity)
Knots at {t₁, t₂} with slopes s₀,s₁,s₂ — learned per-channel (5 scalars/channel).

C) Monotone cubic spline (higher expressivity)
Use 3–5 control points, constrain monotonicity if useful; still only ~5–10 scalars/channel.

D) Gated linear unit (GLU) with learned bias
f(x) = x · sigmoid(α·x + β) ; α,β per-channel (2 scalars/channel)
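
As a minimal sketch in the same PyTorch setting as the implementation below, pattern D can be written as a per-channel module; the class name GatedPAct and the initial values are illustrative assumptions, not part of the original description.

import torch
import torch.nn as nn

class GatedPAct(nn.Module):
    # Pattern D: f(x) = x * sigmoid(alpha*x + beta), alpha and beta learned per channel.
    def __init__(self, channels, init_alpha=1.0, init_beta=0.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((channels,), init_alpha))
        self.beta = nn.Parameter(torch.full((channels,), init_beta))

    def forward(self, x):
        # x shape: (B, C, ...); broadcast the per-channel parameters over trailing dims
        shape = (1, -1) + (1,) * (x.ndim - 2)
        return x * torch.sigmoid(self.alpha.view(shape) * x + self.beta.view(shape))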

Why learn activations?

  • They change the transfer characteristics of each feature channel, effectively re-weighting or routing information coming from the fixed mixing.

  • Because activations are nonlinear and per-channel, they can create a surprisingly expressive representation with far fewer weights.

  • Easier to regularize and interpret (e.g., inspect slopes, thresholds).

Hybridize with adapters

If activations alone underperform:

  • Add small LoRA adapters on top of fixed mixing: learn ΔW ≈ U·V where U∈R^{d×r}, V∈R^{r×d}, r small (4–32).

  • Or add 1×1 trainable convs (depthwise separable for efficiency) between transform blocks.
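
A hedged sketch of the 1×1-conv option for (B, C, H, W) feature maps; the module name ConvAdapter, the bottleneck rank, and the zero-initialized up-projection are assumptions, not from the original post.

import torch
import torch.nn as nn

class ConvAdapter(nn.Module):
    # Tiny 1x1 bottleneck adapter: project down to a small rank, apply a
    # nonlinearity, project back up, and add the result residually.
    def __init__(self, channels, rank=8):
        super().__init__()
        self.down = nn.Conv2d(channels, rank, kernel_size=1, bias=False)
        self.up = nn.Conv2d(rank, channels, kernel_size=1, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter contributes zero at initialization

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))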

Training recipe

  • Optimizer: AdamW (or Adam) with a moderate LR for adapters/heads (e.g., 1e-3) and a lower LR for activation params (e.g., 1e-4; they can be sensitive). See the parameter-group sketch after this list.

  • Weight decay: small for activation params (0–1e-6), normal for adapters.

  • Learning schedule: cosine or linear warmup + decay.

  • Normalization: prefer LayerNorm-style normalization; BatchNorm statistics can be brittle when the mixing is a fixed transform.

  • Initialization: set activation params so training starts from a standard nonlinearity (e.g., a=1, b=0.01, c=0 so the bi-slope PAct begins as an ordinary leaky ReLU).

  • Regularization: small dropout in heads, L2 on adapter weights; optionally sparsity penalty on activation parameter deltas to encourage minimal change from frozen baseline.
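
A minimal sketch of the two-learning-rate setup from the optimizer and weight-decay bullets above; the "pact" name filter and the 1e-4 weight decay for adapters/heads are assumed conventions, not prescriptions from the original post.

import torch

def build_optimizer(model, lr_main=1e-3, lr_act=1e-4):
    # Two parameter groups: normal LR and weight decay for adapters/heads,
    # lower LR and no weight decay for activation (PAct) parameters.
    act_params, main_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (act_params if "pact" in name else main_params).append(p)
    return torch.optim.AdamW([
        {"params": main_params, "lr": lr_main, "weight_decay": 1e-4},
        {"params": act_params, "lr": lr_act, "weight_decay": 0.0},
    ])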

Implementation sketch (PyTorch-like)

import torch
import torch.nn as nn
import torch.nn.functional as F

class BiSlopePAct(nn.Module):
    # per-channel learnable slopes and bias
    def __init__(self, channels, init_pos=1.0, init_neg=0.01, init_bias=0.0):
        super().__init__()
        self.a = nn.Parameter(torch.full((channels,), init_pos))  # slope for x >= 0
        self.b = nn.Parameter(torch.full((channels,), init_neg))  # slope for x<0
        self.c = nn.Parameter(torch.full((channels,), init_bias))

    def forward(self, x):
        # x shape: (B, C, ...)
        a = self.a.view(1, -1, *([1] * (x.ndim - 2)))
        b = self.b.view(1, -1, *([1] * (x.ndim - 2)))
        c = self.c.view(1, -1, *([1] * (x.ndim - 2)))
        pos = F.relu(x)
        neg = -F.relu(-x)
        return a * pos + b * neg + c

# Example frozen mixing using Walsh-Hadamard (fast, parameter-free)
def hadamard_transform(x):
    # Iterative fast WHT over the last dim (size must be a power of 2),
    # scaled by 1/sqrt(n) to keep it orthonormal. For production, use an
    # optimized/CUDA implementation.
    n, shape, h = x.shape[-1], x.shape, 1
    while h < n:
        y = x.reshape(*shape[:-1], n // (2 * h), 2, h)
        x = torch.stack((y[..., 0, :] + y[..., 1, :],
                         y[..., 0, :] - y[..., 1, :]), dim=-2).reshape(shape)
        h *= 2
    return x / n ** 0.5

class FrozenBlock(nn.Module):
    def __init__(self, channels, use_adapter=False, adapter_rank=8):
        super().__init__()
        self.channels = channels
        self.pact = BiSlopePAct(channels)
        self.use_adapter = use_adapter
        if use_adapter:
            r = adapter_rank
            self.U = nn.Parameter(torch.randn(channels, r) * 0.01)
            self.V = nn.Parameter(torch.randn(r, channels) * 0.01)
        # fixed transform has no params

    def forward(self, x):
        # x: [B, C]
        x = hadamard_transform(x)  # fixed, no grad if implemented via fixed kernel
        x = self.pact(x)
        if self.use_adapter:
            delta = x @ (self.U @ self.V)  # small correction
            x = x + delta
        return x

Notes: use an efficient fast-transform library or a CUDA implementation for speed; treat the Hadamard transform as deterministic and non-trainable.
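
To tie the pieces together, here is a hedged end-to-end usage sketch that stacks FrozenBlocks behind light normalization and a small trainable head, as in the blueprint; the FrozenNet class name, dimensions, and depth are illustrative assumptions.

class FrozenNet(nn.Module):
    # Illustrative stack: frozen-mixing blocks, light normalization, trainable linear head.
    def __init__(self, channels=256, depth=4, num_classes=10, use_adapter=True):
        super().__init__()
        self.blocks = nn.ModuleList(
            [FrozenBlock(channels, use_adapter=use_adapter) for _ in range(depth)]
        )
        self.norm = nn.LayerNorm(channels)
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):
        # x: [B, channels]; channels must be a power of 2 for the WHT
        for block in self.blocks:
            x = block(x)
        return self.head(self.norm(x))

# quick smoke test
model = FrozenNet()
logits = model(torch.randn(8, 256))
print(logits.shape)  # torch.Size([8, 10])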

Practical tips & pitfalls

  • Dimension alignment: Fast transforms (WHT) typically need power-of-two sizes. Use small fixed linear projections to resize feature dims, or pad/truncate (see the padding sketch after this list).

  • Expressivity vs. simplicity: Start with 2–3 scalar parameters per channel. If you see systematic failure modes, increase toward piecewise linear splines.

  • Optimization stability: Activation parameter gradients can be noisy; reduce LR, use gradient clipping.

  • Initialization matters: Initialize activation params to identity-like mapping so early training isn’t chaotic.

  • Batch statistics: Fixed transforms can change activation distributions—normalize (LayerNorm) before/after PAct if needed.
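
For the dimension-alignment tip above, a minimal assumed helper (not from the original post) that zero-pads the last dimension up to the next power of two before the fast transform:

import torch
import torch.nn.functional as F

def pad_to_pow2(x):
    # Zero-pad the last dim to the next power of two so the WHT applies.
    n = x.shape[-1]
    target = 1 << (n - 1).bit_length()  # smallest power of two >= n
    return F.pad(x, (0, target - n)) if target != n else x

print(pad_to_pow2(torch.randn(8, 100)).shape)  # torch.Size([8, 128])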

Where this shines (use cases)

  • Edge devices / microcontrollers: low memory, fast structured ops fit well.

  • Few-shot / small-data: fewer trainable params → less overfitting.

  • Applications with known transforms: audio (FFT), images (DCT), time-series — the fixed mixing can encode domain priors.

  • Ensembling / interpretability: fixed part is analyzable; learned activation shapes give interpretable modulation.

Evaluation and ablations (how to show results)

Suggested experiments:

  1. Image classification (CIFAR-10/CIFAR-100): compare baseline MLP/CNN vs. Frozen + PAct, Frozen + adapters.

  2. Audio classification (Speech commands): test FFT-based frozen mixing.

  3. Ablations:

    • PAct only vs. PAct + adapters.

    • Different PAct complexity (2-parameter vs 5-parameter vs spline).

    • Different transforms: WHT vs random orthogonal vs identity.

    • Parameter budget curves (plot accuracy vs. #trainable params).

Metrics:

  • Accuracy / F1, inference latency, memory footprint (model size), training steps to convergence.
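
A small assumed helper (the report name and the 100-run timing loop are illustrative) for collecting the footprint metrics above: trainable-parameter count, approximate model size, and a rough CPU latency estimate.

import time
import torch

def report(model, example_input, runs=100):
    # Trainable-parameter count, approximate model size, and average forward latency.
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            model(example_input)
        latency_ms = (time.perf_counter() - start) / runs * 1e3
    return {"trainable_params": n_trainable, "size_MB": size_mb, "latency_ms": latency_ms}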

Extensions & research directions

  • Transformer-bodied variant: replace attention mixing with fixed orthogonal projections (or frozen attention weights) + learned per-head activations/gates. Combine with LoRA on key/value projections.

  • Learned permutations: keep base transform fixed but learn low-cost permutations between layers to tune mixing topology.

  • Meta-learn activation priors: learn initialization of activation parameters across tasks (MAML-style) so frozen backbone adapts faster.

  • Sparsity and compression: quantize activation params or enforce sparse adapters for ultra-small deployments.

  • Theoretical analysis: study representational capacity of frozen mixing + PAct relative to low-rank learned matrices.

Presentation-friendly summary (3 bullets)

  • Use fast, fixed linear transforms as the network backbone to get O(n log n) mixing and a tiny memory footprint.

  • Learn per-channel, expressive activation function parameters and (optionally) tiny adapters to recover task adaptability.

  • That hybrid achieves strong efficiency, interpretability, and competitive performance in many small-data or constrained settings.



Of course, as a hobbyist I just get ignored. If a trillion dollars has been misallocated to the wrong sort of hardware in datacenters, well, I did pipe up at the time.


You post a lot. Are you just spamming chatgpt responses?


Well, there are no written rules or guidance about AI-assisted comments. The ideas don't originate with AI; often I just use AI to express what I mean more clearly, using more standard terminology than I typically would. Sometimes AI does chip in other ideas and viewpoints, and points to prior research in a particular area.

And AI sometimes points out weaknesses in my opinions.

I learned my lesson from bitcoin and not seeing where that would inevitably lead.

Do you think I will make the same mistake again by refusing AI?

Anyway, you need not worry too much; I am going the paid digital content route.

Whatever you see here is about the last unpaid content I will provide, unless it turns out that in 3 to 6 months I haven't managed to bring in a cent from digital content.

Then you might see me around again.

I wonder, though, whether I will be allowed to post links to my paid content here. Anyway, Pinterest is the correct way to link to paid content, so it doesn't matter too much.


It depends on how spammy it is.
