
Revision as of 29 Nov 2017, 16:44, by Martijho (Talk | contribs)


Evolved paths through a Modular Super Neural Network

Research question

How would different evolutionary algorithms influence outcomes when training a PathNet structure on multiple tasks? Which evolutionary strategies make the most sense in the scheme of training an SNN? Can evolutionary strategies easily be implemented as a search technique for a PathNet structure?


Different evolutionary algorithms would probably not change the PathNet results significantly for a limited number of tasks but might prove fruitful for a search for an optimal path in a saturated PathNet. Here, the search domain consists of pre-trained modules, hopefully with a memetic separation for each layer/module. This would ensure good transferability between tasks, and in the end, simplify the search and training of the task-specific softmax layer given the new task.
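The binary-tournament scheme from the original PathNet paper is one natural baseline for such a search. A minimal sketch, assuming a path is a list of active module indices per layer and `evaluate_path` is a hypothetical fitness hook onto the network:

```python
import random

LAYERS, MODULES_PER_LAYER, MAX_ACTIVE = 3, 10, 3

def random_path():
    # One path = MAX_ACTIVE distinct module indices per layer
    return [random.sample(range(MODULES_PER_LAYER), MAX_ACTIVE)
            for _ in range(LAYERS)]

def mutate(path, rate=1 / (LAYERS * MAX_ACTIVE)):
    # Independently reassign each module slot with small probability
    # (duplicates within a layer are tolerated in this sketch)
    return [[random.randrange(MODULES_PER_LAYER) if random.random() < rate else m
             for m in layer]
            for layer in path]

def tournament_search(evaluate_path, population_size=64, generations=100):
    """Binary-tournament evolution of paths, as in the PathNet paper:
    the loser of each random pairing is overwritten by a mutated copy
    of the winner."""
    population = [random_path() for _ in range(population_size)]
    for _ in range(generations):
        i, j = random.sample(range(population_size), 2)
        if evaluate_path(population[i]) >= evaluate_path(population[j]):
            winner, loser = i, j
        else:
            winner, loser = j, i
        population[loser] = mutate(population[winner])
    return max(population, key=evaluate_path)
```

Other evolutionary strategies (e.g. a generational GA or ES) would slot in by replacing the pairing-and-overwrite step while keeping the same path representation.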

Gradual Learning in Super Neural Networks

Research question

Can the modularity of the SNN help show what level of transferability exists between modules used in the different tasks in the curriculum? How large is the reduction in training needed to learn a new task when a saturated PathNet is provided, compared to learning de novo?


By testing which modules are used in which optimal paths, this study might show reuse of some modules across multiple tasks, which would indicate the value of curriculum design. A high level of reuse might even point towards the possibility of one-shot learning in a saturated SNN.

Suggested Experiment

Train an RL agent on some simple toy environment like LunarLander from the OpenAI Gym. This requires some rework of the reward signal from the environment, to synthesize rewards for subtasks in the curriculum. Rewards in early subtasks might be clear-cut values (1 if the sub-goal is reached, 0 if not).

Read up on curriculum design techniques

Then create a sequence of sub-tasks gradually increasing in complexity, and search for an optimal path through the PathNet for each sub-task. This implementation would use some version of temporal-difference learning (e.g. Q-learning), and each path would represent an approximation of a value function.
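The reward rework could be done with a small environment wrapper. A sketch assuming a Gym-style `reset`/`step` API; `sub_goal_reached` is a hypothetical, task-specific predicate on the observation:

```python
class SubTaskRewardWrapper:
    """Wrap a Gym-style environment and replace its reward with a
    clear-cut sub-task signal: 1 when the sub-goal predicate fires,
    else 0. `sub_goal_reached(observation)` is task-specific and
    hypothetical here."""

    def __init__(self, env, sub_goal_reached):
        self.env = env
        self.sub_goal_reached = sub_goal_reached

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, _true_reward, done, info = self.env.step(action)
        reward = 1.0 if self.sub_goal_reached(obs) else 0.0
        # End the episode once the sub-goal is reached, so the next
        # sub-task in the curriculum can take over.
        return obs, reward, done or reward > 0, info
```

One wrapper per sub-task, applied in sequence, would give the curriculum its series of faked reward signals.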

Capacity Increase

Research question

Can we estimate the decline in needed capacity for each new sub-task learned from the curriculum? How "much" capacity is needed to learn a new meme?


Previous studies show a decline in needed capacity for each new sub-task (see the Progressive Neural Networks paper, Rusu et al.). If a metric can be defined for measuring the capacity change, we expect the results to confirm this.

Search for the first path?

Research question

Is there anything to gain from performing a proper search for the first path, versus just picking a random path and training its weights? In a two-task system, what is the difference between picking a first path and a PNN?


I think performance will have the same asymptote, but it will be reached in fewer training iterations. The only thing this path selection might influence is that the modules in PathNet could end up with more interconnected dependencies. Maybe the layers are more "independent" when the weights are updated as part of multiple paths? This might be important for transferability when learning future tasks.

Suggested experiment

Perform multiple small multi-task learning scenarios. Two tasks should be enough, but it is necessary to show that modules are reused in each scenario. Test both picking a path and the full search for a path, and compare convergence times for the second task.

Run multiple executions of a first-task search for a path in a binary MNIST classification problem, up to 99% classification accuracy on a test set (as in the original PathNet paper). Log the training counter for each optimal path and look at the average number of training iterations each path has received (so far: around 12?).

12 x 50 = 600 => 600 backpropagations with batch size 16, or 600 x 16 = 9600 training examples shown

Then run multiple iterations where random paths of the same average size as in the original experiment are trained for 600 iterations, and compare the classification accuracy of each path.

  • Training counter for each module, and the average for each path
  • Path size (number of modules in the path): connected to capacity?
  • Reuse of modules (transferability)
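These metrics could be computed from a per-module training counter. A sketch, assuming paths are lists of module indices per layer and `training_counter` maps `(layer, module)` to training units (1 unit = 50 mini-batches, as above):

```python
def path_metrics(path, path2, training_counter):
    """Compute the logged metrics for a pair of found paths:
    average training per module of the first path, its size, and the
    number of modules it shares with the second path (reuse)."""
    modules = [(l, m) for l, layer in enumerate(path) for m in layer]
    # Average training units over the modules in the path
    avg_training = sum(training_counter.get(lm, 0) for lm in modules) / len(modules)
    # Path size as a crude proxy for the capacity the task occupies
    path_size = len(modules)
    # Layer-wise module reuse between the two tasks' paths
    reuse = sum(1 for l, layer in enumerate(path2)
                for m in layer if m in path[l])
    return avg_training, path_size, reuse
```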

Implemented experiment

Problem: Binary MNIST classification (same as DeepMind's experiments, but without the salt-and-pepper noise)

500 Search + Search:

  • Search for path 1 and evaluate the optimal path found
  • Search for path 2 and evaluate the optimal path found
  • Each found path is evaluated on the test set
  • For each path, save: the path itself, its evaluated fitness, the number of generations used to reach it, and the average training each module received within the path (1 = 50 x minibatch)
  • Also store the number of reused modules from task 1 to task 2
  • Generate path 1's and path 2's training plots and write them to PDF

500 Pick + Search:

  • Generate a random path
  • Train it for the same number of iterations as the average training of the first path from "search + search" with the same iteration index
  • Evaluate the random path on the test set
  • Search for path 2 and evaluate the optimal path found
  • Store the same metrics as in search + search
  • Generate path 1's and path 2's training plots and write them to PDF

Last, write the log to file.
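Both protocols can be driven from one loop. A sketch with hypothetical hooks (`search_path`, `train_path`, `evaluate`) standing in for the actual PathNet implementation:

```python
import random

def random_path(layers=3, n_modules=10, k=3):
    # A path activates k of the n modules in each layer
    return [random.sample(range(n_modules), k) for _ in range(layers)]

def module_reuse(path_a, path_b):
    # Number of modules shared between two paths, layer by layer
    return sum(len(set(a) & set(b)) for a, b in zip(path_a, path_b))

def run_trials(search_path, train_path, evaluate, n_trials=500):
    """Driver for the two protocols above; the hooks are hypothetical:
      search_path(task)            -> (path, n_generations, avg_training)
      train_path(path, iterations) -> trains a fixed path in place
      evaluate(path)               -> test-set fitness
    """
    log = []
    for _ in range(n_trials):
        # --- Search + Search ---
        p1, gens1, avg_train1 = search_path(task=1)
        p2, gens2, _ = search_path(task=2)
        ss = {"fitness": (evaluate(p1), evaluate(p2)),
              "generations": (gens1, gens2),
              "reuse": module_reuse(p1, p2)}
        # --- Pick + Search: random first path, same training budget ---
        r1 = random_path()
        train_path(r1, iterations=avg_train1)
        q2, gens_q, _ = search_path(task=2)
        ps = {"fitness": (evaluate(r1), evaluate(q2)),
              "generations": (None, gens_q),
              "reuse": module_reuse(r1, q2)}
        log.append({"s+s": ss, "p+s": ps})
    return log
```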


Module reuse plot: distribution of module reuse in S+S and P+S searches, alongside the distribution for a random module selection.
Average training: average training each module in a path undergoes for S+S and P+S searches, plotted over the amount of module reuse.
Reuse by layer: amount of module reuse for each layer in S+S and P+S searches, alongside the amount of reuse when randomly selecting modules.
First run:
Iterations: 600
Population size: 64
Acc threshold: 98%
Tasks: [3, 4] then [1, 2]

As we can see from the first plot, there is no indication of a significant difference in module reuse between the search+search and pick+search training schemes. When comparing the results with a random selection of modules (green bar), it is apparent that for these tasks the PathNet prefers to train new modules for each task rather than reuse knowledge in pre-trained modules. This differs from our hypothesis that end-to-end training causes confounded interfaces between layers, but we could argue that these results are caused by too little training, too simple training data, or too much available capacity. Any of these would make the distance in parameter space between initialized parameters and "good enough" parameters rather small, so the gain from reusing modules is not large enough to justify the reuse.
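The random-selection baseline can also be checked analytically: two independent uniform picks of k out of n modules share k²/n modules per layer in expectation, i.e. 3 · 3²/10 = 2.7 modules for three layers of ten modules with three active. A Monte-Carlo sketch (the parameter values are assumptions matching the setup above):

```python
import random

def random_reuse_baseline(layers=3, n_modules=10, k=3, trials=20000):
    """Monte-Carlo estimate of total module reuse when both tasks pick
    their k active modules per layer uniformly at random.
    Expected value: layers * k * k / n_modules."""
    total = 0
    for _ in range(trials):
        for _ in range(layers):
            a = set(random.sample(range(n_modules), k))
            b = set(random.sample(range(n_modules), k))
            total += len(a & b)
    return total / trials
```

If the measured S+S and P+S reuse sits below this value, the network is actively avoiding pre-trained modules rather than ignoring them.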

In the second plot, we see a trend that supports the claim that the training scenario used is too simple. For the pick+search results, the amount of module reuse increases with the average training for each path, while it stays relatively constant for the search+search experiment. This could mean that, in order to reach the classification-accuracy threshold for the second task after end-to-end training on the first task, paths with a higher amount of module reuse need more training to adapt to the layer outputs.

The last plot shows something unexpected. The results for search+search and pick+search indicate the same as the first plot: no significant difference in module reuse. But here, the reuse is shown for each layer in the models. For the first layer, there is a significant reduction in the number of reused modules. This is the opposite of what we would expect based on the results in "How transferable are features in deep neural networks?" (Yosinski et al.), where the first layers tended to be the most general and most easily reusable.

  • Initial reaction: This is a property of using MLPs and would disappear if the first layers in PathNet were replaced by convolutional modules.
    • Fully connected NNs are poor at generalizing to image data since, unlike convolutional layers, they have no built-in translation invariance. Each neuron has to specialize to its corresponding image pixels, and is therefore highly position- and task-specific.
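A toy numpy demonstration of the translation argument (illustrative only, not the PathNet convolutional variant): a dense layer keys on pixel position, while a small convolution followed by global max pooling responds identically when the feature moves.

```python
import numpy as np

rng = np.random.default_rng(0)
image = np.zeros((8, 8))
image[2, 2] = 1.0                    # a "feature" at one position
shifted = np.roll(image, 3, axis=1)  # same feature, translated

# Dense layer: one weight per pixel, so the output is position-bound
w_dense = rng.normal(size=8 * 8)
def dense(x):
    return float(w_dense @ x.ravel())

# 3x3 convolution + global max pooling: invariant to this translation
w_conv = rng.normal(size=(3, 3))
def conv_maxpool(x):
    out = [np.sum(w_conv * x[i:i + 3, j:j + 3])
           for i in range(6) for j in range(6)]
    return float(max(out))

print(dense(image) != dense(shifted))                          # True
print(np.isclose(conv_maxpool(image), conv_maxpool(shifted)))  # True
```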