Martijho-PathNet-thesis

From Robin

Notes
Are the experiments replicable? What must be done to get the same results?
Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems
Find all changes made to original implementation
Background -> figure of neuron to have y^ as output
More info in NNvsDNN plot
Look through GAs in background to check if more is needed
Needed figure list
shallow neural net showing connections between neurons
Some visualization of a genetic algorithm. Preferably tournament search?
Visualization of why training is separated from evaluation in exp2?


Opening

Abstract

  • What is all this about?
  • Why should I read this thesis?
  • Is it any good?
  • What's new?

Acknowledgements

  • Who is your advisor?
  • Did anyone help you?
  • Who funded this work?
  • What's the name of your favorite pet?


Introduction

3 to 5 pages

More on multi-task learning. More on transfer learning.

How is it that a human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we achieve the same effect in our artificial minds?
Biology and nature have always been imitated in art and the sciences, and over the years the imitations have grown increasingly accurate. Artificial Intelligence and its sub-field Machine Learning are among these areas. The field's rapid growth in popularity in recent years has yielded multiple advances[CITATION NEEDED]. While some of these advances go into technologies such as self-driving cars, others work towards the ultimate goal of Artificial General Intelligence (AGI): a system capable not only of human-level performance in one field, but also of generalizing across a vast number of domains.
In the quest for an artificial general intelligence agent, there may be disagreement on which sub-fields of AI matter most for the endeavor, but improving on current learning systems is considered a good start\cite{mlroadmap}.
This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and on PathNet, a structure developed to take advantage of the gains this technique offers.


Raise problem: catastrophic forgetting.

Multiple solutions (PNN, PN, EWC)

  • Large structures (PNN, PN)
  • Limited in the number of tasks it can retain (EWC)

Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and more limited use of capacity will increase the number of tasks a structure can learn.

Where do I start?

A question DeepMind left unanswered is how different GAs influence task learning and module reuse. Exploration vs exploitation\ref{theoretic background on topic}

Why this?

Broad answers first, specify later. We know PN works. Would it work better with different algorithms? Logical next step from the original paper's "unit of evolution"

Problem/hypothesis

  • What does modular PN training do with the knowledge?
    • More/less accuracy?
    • More/less transferability?

Test by learning end-to-end first, then with PN search. Difference in performance or reuse?

  • Can we make reuse easier by shifting the focus of the search algorithm?
    • PN original: Naive search. Does higher exploitation improve module selection?

How to answer?

  • Set up simple multitask scenarios and try.
    • 2 tasks, where the first is learned end-to-end vs with PN
    • List algorithms with different selection pressure and try on multiple tasks.


Thesis outline

Theoretical Background
What is discussed and why. What does the thesis build on?
Implementation
Datasets
Programming language
Packages
Code structure
Experiment 1
What do I attempt to answer and how?
Experiment 2
What do I attempt to answer and how?
Conclusion


Theoretical Background

15 to 20 pages

Machine Learning

  • Supervised learning
  • Based on the structure of the brain
  • image of dendrite vs artificial neuron
  • weights/parameters and a bias: the weighted sum (x1*w1 + x2*w2 + ... + bias) is passed through an activation function
    • activation not discussed in depth here
    • Using two: rectified linear unit (ReLU) and softmax, which scales the output to sum to 1 so it can be used as a probability estimate
    • softmax is the generalization of binary logistic regression to multiple classes.
    • regression/classification
  • feedforward
  • image of connections
  • loss function (cost/error) calculates the difference between the predicted output and the target output
  • ref to experiments: cross-entropy, not going into detail. Well suited to softmax activation (ref http://cs231n.github.io/linear-classify/)
  • goal is to minimize the cross-entropy function for the dataset [X, Y].
  • Driving force is backpropagation, which calculates a gradient for every weight in the neural network, and the optimization algorithm, which updates the weights accordingly
  • Many optimization algorithms, most common is gradient descent.
  • Not diving into details here. Using stochastic gradient descent (SGD) and Adaptive Moment Estimation (Adam) [\ref{sgd}\ref{adam}]
  • NNs' function approximation properties are used in reinforcement learning and regression, but here they are used for classification.
  • final layer is a softmax output that estimates the probability of each class label: it outputs a vector of values in [0, 1], and the index of the largest value is selected as the label.
  • image classification is done based on input pixel values
  • NNs are bad at this, as the class manifold of images can be highly complex (ref transition between binary and quinary MNIST)
  • convolutional operations.
  • takes an image as input and performs a convolutional operation between the image and a kernel of weights.
  • outputs what is called a feature map. As with NNs, each pixel here is a simple combination of multiplications and summing
  • but each pixel in feature map contains info about the local spatial area the kernel covered.
  • control this spatial area with kernel size and stride (jumps made by kernel).
  • a conv layer's channels specify the number of kernels run over the image. One output channel for each kernel.
  • normal to stack layers of conv operations in a network to generalize to the images given. Each layer contains an abstraction level and outputs a feature map.
  • called Convolutional Neural Network (CNN) \ref{exp1.b exp2}
  • For each level the feature map is reduced in spatial image dimensions but increased in channels.
  • usually for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns the features of the image.
  • The convolutional operations in this case can be called feature extraction.
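The notes above on the weighted sum, ReLU, softmax, cross-entropy and the convolution with kernel size and stride can be illustrated with a minimal sketch in plain Python. It is not the thesis implementation (which uses Keras). The function names, the assumption of square images and square kernels, and the example numbers are all illustrative choices.

```python
import math

def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

def neuron(inputs, weights, bias):
    """Weighted sum x1*w1 + x2*w2 + ... + bias, followed by ReLU."""
    return relu(sum(x * w for x, w in zip(inputs, weights)) + bias)

def softmax(logits):
    """Scale a vector so that its entries are positive and sum to 1."""
    largest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, target_index):
    """Cross-entropy loss for one sample with a one-hot target."""
    return -math.log(probabilities[target_index])

def conv2d(image, kernel, stride=1):
    """Convolution without padding (assumes square image and square kernel)."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    feature_map = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        feature_map.append(row)
    return feature_map

# Classification: the index of the largest softmax value is the predicted label.
probs = softmax([2.0, 1.0, 0.1])
predicted_label = probs.index(max(probs))
loss = cross_entropy(probs, target_index=0)

# Convolution: a 4x4 image, a 2x2 kernel and stride 2 give a 2x2 feature map.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
kernel = [[1.0, 0.0], [0.0, -1.0]]
feature_map = conv2d(image, kernel, stride=2)
```

Each entry of `feature_map` depends only on the 2x2 patch the kernel covered, which is the local spatial information the notes refer to. A larger stride or kernel shrinks the map further.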


Deep Learning

insert from essay

  • multiple network architectures fall under DNN. In recent years a multitude with different applicability has been used commercially and in research
  • Architectures depend on input type, problems they are applied to and resource limitations.

Transfer learning

  • Training DNNs takes time. Transfer learning as a method of reusing models for different tasks.
  • Train model on one set of data for one task, reuse the trained weights as starting point for training on another task
  • usually randomly initialized weights as starting point.
  • It is shown that reusing weights in similar tasks and training weights on new data yields better results in some cases.
  • Reduces needed training data.
  • Can pretrain some model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without immense computational resources needed.
  • TL FROM ESSAY
    • What is it?
    • Why do it?
    • How to do it?
    • TL in CNNs
      • Who has done it?
      • Results?
      • Gabor approximation
  • Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks.
  • However problem arises when sequentially trying to learn multiple tasks in the same neural network.
    • Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks.
    • Catastrophic forgetting and solutions:
      • EWC
      • PNN
  • Curriculum Learning / Gradual learning
    • Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks.
      • ref to motivation behind task ordering in exp2
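The two effects outlined above, the benefit of starting from trained weights and the overwriting of earlier optimization when a second task is trained in the same parameters, can be shown on a toy problem. The sketch below uses a two-parameter linear model and plain gradient descent rather than a DNN. The tasks, learning rate and step counts are made-up values chosen only for illustration.

```python
def mse(params, data):
    """Mean squared error of the linear model y = w * x + b."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(params, data, steps, lr=0.05):
    """Full-batch gradient descent on the mean squared error."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
task_a = [(x, 2.0 * x + 1.0) for x in xs]  # first task
task_b = [(x, 2.2 * x + 0.5) for x in xs]  # similar second task

# Train on task A, then compare two starting points for a short run on task B.
pretrained = train((0.0, 0.0), task_a, steps=200)
from_scratch = train((0.0, 0.0), task_b, steps=5)
fine_tuned = train(pretrained, task_b, steps=5)

# Keep training the same parameters on task B and re-check task A.
loss_a_before = mse(pretrained, task_a)
overwritten = train(pretrained, task_b, steps=200)
loss_a_after = mse(overwritten, task_a)
```

With the same five training steps, `fine_tuned` reaches a lower loss on task B than `from_scratch`, while `loss_a_after` is higher than `loss_a_before`: nothing protects the task A solution once the shared parameters are updated for task B.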


Evolutionary algorithms

  • What is it? Where does it come from?
  • Exploration vs Exploitation
    • ref experiments (formulated in the context of this trade-off)
  • Terms used in the evolutionary programming context
    • Population
    • Genotype and genome
    • Fitness-function
    • selection
    • recombination
    • generation
    • mutation
    • population diversity and convergence
  • Some types
    • GA
    • Evolutionary searches
    • short. Straight into tournament search
  • Tournament search
    • How it works, what are the steps?
    • Selection pressure (in larger context of EAs and then tournament search)
    • ref to search
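Tournament selection and its selection pressure can be sketched as follows. This is one common formulation, assumed here rather than taken from the thesis: k contestants are drawn at random, ranked by fitness, and the i-th best is chosen with probability p * (1 - p)^i, with the last contestant taking the remaining probability. The toy population and fitness function are illustrative.

```python
import random

def tournament_select(population, fitness, k, p, rng):
    """Draw k contestants and return the i-th best with probability p * (1 - p) ** i."""
    contestants = rng.sample(population, k)
    ranked = sorted(contestants, key=fitness, reverse=True)
    for individual in ranked[:-1]:
        if rng.random() < p:
            return individual
    return ranked[-1]  # the last contestant takes the remaining probability

def mean_selected_fitness(population, fitness, k, p, rng, draws=2000):
    """Average fitness of the selected individuals, a rough measure of selection pressure."""
    picks = [tournament_select(population, fitness, k, p, rng) for _ in range(draws)]
    return sum(fitness(individual) for individual in picks) / draws

def fitness(individual):
    """Toy fitness: the individual's own value."""
    return individual

population = list(range(20))
rng = random.Random(42)
low_pressure = mean_selected_fitness(population, fitness, k=2, p=1.0, rng=rng)
high_pressure = mean_selected_fitness(population, fitness, k=8, p=1.0, rng=rng)
```

A larger tournament size k, or a p closer to 1, shifts selection towards the fittest individuals (exploitation). A smaller k or lower p leaves more room for weaker individuals to be selected (exploration).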

PathNet

Rework essay section

Search

  • Tournament k=2, p=1
  • fitness evaluation updates the weights
  • training is done during search.
  • locking modules when optimal path is found
    • Why locking?
  • Modules are reinitialized after search if they are not locked.
    • Why?
  • Search lasts for a set duration (accuracy threshold or generation limit)
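The search steps listed above can be sketched as a loop. This is a toy stand-in and not the thesis code: the module bookkeeping, the `usefulness` table that replaces real training accuracy, the mutation rule and all constants are hypothetical. Only the overall flow follows the notes: binary tournament with k=2 and p=1, training during evaluation, stopping on an accuracy threshold or a generation limit, then locking the best path and reinitializing the unlocked modules.

```python
import random

LAYERS, WIDTH, MAX_ACTIVE = 3, 6, 2  # hypothetical PathNet dimensions

def random_path(rng):
    """One genotype: for every layer, the indices of the active modules."""
    return [sorted(rng.sample(range(WIDTH), MAX_ACTIVE)) for _ in range(LAYERS)]

def mutate(path, rng, rate=0.2):
    """Hypothetical mutation: swap each gene for a random module with probability rate."""
    return [sorted({rng.randrange(WIDTH) if rng.random() < rate else gene for gene in layer})
            for layer in path]

def train_and_evaluate(path, modules, usefulness):
    """Stand-in for training a path and returning its accuracy as fitness."""
    scores = []
    for layer, active in enumerate(path):
        for index in active:
            module = modules[(layer, index)]
            if not module["locked"]:
                module["updates"] += 1  # fitness evaluation also trains unlocked modules
            scores.append(usefulness[(layer, index)])
    return sum(scores) / len(scores)

def search(modules, usefulness, rng, population_size=8, generation_limit=100, threshold=0.95):
    """Binary tournament search over paths; returns the best path and its fitness."""
    population = [random_path(rng) for _ in range(population_size)]
    best_path, best_fitness = population[0], -1.0
    for _ in range(generation_limit):  # stop on generation limit
        first, second = rng.sample(range(population_size), 2)  # tournament size k=2
        fit_first = train_and_evaluate(population[first], modules, usefulness)
        fit_second = train_and_evaluate(population[second], modules, usefulness)
        if fit_first >= fit_second:  # p=1: the fitter path always wins
            winner, loser, fitness = first, second, fit_first
        else:
            winner, loser, fitness = second, first, fit_second
        if fitness > best_fitness:
            best_path, best_fitness = population[winner], fitness
        population[loser] = mutate(population[winner], rng)  # loser becomes mutated winner
        if best_fitness >= threshold:  # stop on accuracy threshold
            break
    return best_path, best_fitness

def finish_task(best_path, modules):
    """Lock the modules on the best path and reinitialize every unlocked module."""
    on_path = {(layer, index) for layer, active in enumerate(best_path) for index in active}
    for key, module in modules.items():
        if key in on_path:
            module["locked"] = True
        elif not module["locked"]:
            module["updates"] = 0  # stand-in for re-randomizing the weights

rng = random.Random(0)
modules = {(layer, index): {"updates": 0, "locked": False}
           for layer in range(LAYERS) for index in range(WIDTH)}
usefulness = {key: rng.random() for key in modules}
best_path, best_fitness = search(modules, usefulness, rng)
finish_task(best_path, modules)
```

Running `search` again for a second task on the same `modules` dictionary would leave the locked modules untouched while still allowing new paths to pass through them, which is how reuse between tasks can arise in this sketch.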

Structure

  • Layers of modules
  • Module is a Neural Network
  • Reduced sum between layers (adding module outputs together)
  • Task unique layer at the end of each path (each path in a search has the same end layer)
  • Each path has a max number of possible modules from each layer
    • Limiting the possible capacity in the network (explain capacity)
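The structure described above can be sketched with toy modules. The class and method names below are hypothetical and do not mirror the thesis code. Each module is reduced to a single weight with a ReLU instead of a full neural network, and the task-unique end layer is any callable.

```python
import random

class Module:
    """Stand-in for a small neural network: one weight followed by a ReLU."""
    def __init__(self, weight):
        self.weight = weight

    def __call__(self, values):
        return [max(0.0, self.weight * v) for v in values]

class PathNet:
    """Layers of modules with a task-unique layer at the end of each path."""
    def __init__(self, depth, width, max_active, rng):
        self.max_active = max_active
        self.layers = [[Module(rng.uniform(-1.0, 1.0)) for _ in range(width)]
                       for _ in range(depth)]
        self.task_heads = {}  # one end layer per task

    def random_path(self, rng):
        """Pick between 1 and max_active distinct modules in every layer."""
        width = len(self.layers[0])
        return [rng.sample(range(width), rng.randint(1, self.max_active))
                for _ in self.layers]

    def forward(self, values, path, task):
        for layer, active in zip(self.layers, path):
            outputs = [layer[index](values) for index in active]
            values = [sum(column) for column in zip(*outputs)]  # reduced sum between layers
        return self.task_heads[task](values)

rng = random.Random(0)
net = PathNet(depth=3, width=10, max_active=3, rng=rng)
net.task_heads["task_a"] = sum  # toy end layer: sums the final features
path = net.random_path(rng)
output = net.forward([1.0, -0.5, 2.0], path, "task_a")
```

The `max_active` limit caps how many modules a single path can use per layer, so one task can only occupy a bounded share of the network.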

Monte Carlo probability approximation

Implementation

5 to 15 pages

EDIT NOTE: Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5.

Python implementation

  • Why Python?
    • Problems:
      • Not quick to run
    • Pros:
      • Quick to prototype in
      • Generally good to debug
      • Multiple good tools for machine learning
        • \cite{tensorflow}
        • \cite{keras}
        • Why are these good?
      • Other packages
        • Matplotlib (visualization)
        • NumPy (numerical computation)
        • Pickle (data logging)
  • code structure
    • Object oriented
      • Easily parameterizable for ease of prototyping pathnet structures
    • Class structure:
      • Modules
      • Layers
      • PathNet
        • Functionality for
          • Building random paths
          • Creating keras models
          • static methods for creating pathnet structures
          • reset backend session
      • Tasks
      • Search
      • Plot generating
  • Training on gpu
    • Quicker in general for ML
    • This implementation does a lot on the CPU
      • Other implementations could take advantage of customizing layers and models in keras.
  • Notable differences in implementation
    • Keras implementation
    • Path fitness not negative error but accuracy
    • exp 2: fitness calculated before evaluation (not same step)
    • Not added any noise to training data
  • Implementation problems
    • TensorFlow sessions not made for using multiple graphs
      • Resetting backend session after a number of models are made
    • By default, tensorflow-gpu allocates all the GPU memory it can
      • Limiting memory allocation to grow only when needed
    • TensorFlow session does not free allocated memory before the Python thread is done.
      • Run all experiments through threads.
  • Code available on GitHub


Datasets

MNIST

SVHN

The sample distribution over the classes follows Benford's law, which can be expected from a natural dataset such as this.
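For reference, Benford's law gives the expected share of leading digit d (1 to 9) as log10(1 + 1/d). The short sketch below computes these shares so they can be compared with the SVHN class counts. How the digit 0 class fits into the comparison is not covered by the law and is left open here.

```python
import math

def benford_probability(digit):
    """Expected share of leading digit `digit` (1 to 9) under Benford's law."""
    return math.log10(1 + 1 / digit)

expected = {digit: benford_probability(digit) for digit in range(1, 10)}

for digit, share in expected.items():
    print(f"{digit}: {share:.3f}")
```

The shares decrease monotonically from about 0.301 for the digit 1 and sum to 1 over the digits 1 to 9.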

  • Data type
  • Use cases and citations
  • How does the data look?
  • set sizes and class distributions
  • state of the art and human level performance

Search implementation

  • functions. callback to theoretical background and GA buzzwords
  • parameterization

Experiment 1: Search versus Selection

35 to 45 pages / 2

Experiment 2: Selection Pressure

35 to 45 pages / 2

Discussion

5 pages

Are your results satisfactory? Can they be improved? Is there a need for improvement? Are other approaches worth trying out? Will some restriction be lifted? Will you save the world with your Nifty Gadget?

Discussion

Discussion of the accuracy and relevance of the results; comparison with other researchers' results.

Common errors

Too far reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around this.

Conclusion

Consequences of the achieved results, for example for new research, theory and applications.

Common errors

The conclusions are too far reaching with respect to the achieved results; the conclusions do not correspond with the purpose

Ending
