Robin wiki — contributions by Martijho (feed retrieved 2022-09-26, MediaWiki 1.15.3)

MasterProjects — 2018-04-30
<hr />
<div>== Status and supervision ==<br />
* [[ProgressH2017|Progress reports, Autumn 2017]]<br />
* [[ProgressH2016|Progress reports, Autumn 2016]]<br />
* [[ProgressH2015|Progress reports, Autumn 2015]]<br />
* [[ProgressH2014|Progress reports, Autumn 2014]]<br />
<br />
== Individual project pages ==<br />
<br />
* [http://robin.wiki.ifi.uio.no/Bruker:Martijho Martin (Thesis topic: Modular Transfer Learning in a Super Neural Network)]<br />
<br />
* [http://robin.wiki.ifi.uio.no/Bruker:Eirisu Eirik (Thesis topic: road detection)]<br />
* [http://robin.wiki.ifi.uio.no/Bruker:mathiact Mathias (Thesis topic: Algorithm visualisation through augmented reality)]<br />
* [[PreviousMasterProjects|Previous Master Projects]]<br />
<br />
<br />
(Make links to your own personal pages above. You are free to put whatever content you find useful on these pages. Use them to collect information for your own sake, and to have information available for supervisors or other potential collaborators in case it becomes relevant. For inspiration, take a look at the pages of previous students.)<br />
<br />
Note: The structure of these pages is borrowed from the [http://bmimaster.wiki.ifi.uio.no/Main_Page BMI group's pages]<br />
<br />
== Team project pages ==<br />
<br />
* [http://robin.wiki.ifi.uio.no/User:cadCam CadCam]<br />
<br />
<br />
<br />
== Resources ==<br />
* [[project planning tips]]<br />
* [[thesis writing tips]]<br />
* [[version control]]<br />
* [[graphical tools]]<br />
* [[computing resources]]<br />
* [[exchange abroad]]<br />
* [[robotics simulators]]<br />
* [[Machine learning tips]]<br />
* [http://www.mn.uio.no/for-ansatte/arbeidsstotte/studieadministrasjon/eksamen/karaktersetting/ grading (karaktersetting)]</div>

Bruker:Martijho — 2018-04-30
<hr />
<div>= Current draft =<br />
<br />
The current draft of the thesis can be found by following this link:<br />
:[https://www.dropbox.com/s/063csu4cwxalo5v/Current%20master%20thesis%20draft%2030.4.pdf?dl=0 Current thesis draft]<br />
Updated as of 30.04.2018<br />
<br />
<br />
; Presentations<br />
: [https://docs.google.com/presentation/d/1I3ObuFmTDMSaoST_G589VSaOIcRpst7osPGR-XxvyOs/edit?usp=sharing Presentation of PathNet and research questions]<br />
: [https://docs.google.com/presentation/d/1QdQnJfcUNPkWDMSZ2dnVgq2qeHI5mQBge0E9WFBAgtM/edit?usp=sharing Transfer learning in SNNs: (tl;dr) + First-path-experiments]<br />
: [https://docs.google.com/presentation/d/1Mh5z-AoWE9t0YtXMfJMIm7WHXWDVJkMEqZQvoOmUgtI/edit?usp=sharing Transfer learning in SNNs: Search-experiments]<br />
: [https://docs.google.com/presentation/d/1jqZsRLzdY9ylh7p3nDj029sNSkalh1YmIEXOkBnhKok/edit?usp=sharing ML for the cool kids]<br />
<br />
= Thesis structure and notes = <br />
A separate page describing the outline and section structure of the thesis. <br />
: [[Martijho-PathNet-thesis|Thesis structure and outlines]]<br />
<br />
= Experiments = <br />
A separate page describing research questions and the experiments proposed to answer them. <br />
: [[Martijho-PathNet-Experiments|Experiments]]<br />
<br />
<br />
= Terms to use in thesis = <br />
*; Plastic Neural Network<br />
: An NN that changes its topology or connectivity according to a learning algorithm<br />
*; In silico<br />
: Performed by computer<br />
*; Modular Super Neural Network<br />
: A DNN consisting of modules of smaller NNs<br />
*; Task-specific meme<br />
: The smallest concise unit of knowledge required to perform some task<br />
: Example: Task = pick something up. Meme = the ability to bend the index finger<br />
*; Memetics<br />
: The study of information in an analogy to Darwinian evolution<br />
*; Transferability<br />
: The ability to transfer/reuse knowledge between tasks<br />
*; Saturation in PathNet<br />
: Most or all modules are trained and locked to backpropagation<br />
*; Embedded transfer learning<br />
: Knowledge transfer capability is incorporated into the machine learning structure (PathNet, PNNs)<br />
*; Catastrophic forgetting<br />
: Forgetting previously learned tasks when fine-tuning parameters<br />
*; Evolved sub-models<br />
: Using GAs to evolve paths through a larger set of parameters (PathNet functionality)<br />
<br />
= Thoughts on Thesis =<br />
- Is the search for the first path unnecessary? The search explores permutations of parameters from the network at the same time as those parameters are being trained for the first time. In other words: does the search provide a significant increase in transferability, or any measurable increase in performance, over just picking a random path and training it for a set number of iterations? <br />
<br />
- When training on a saturated PathNet, it might be quicker to preprocess the data for each path (view it as feature extraction), since there is no backpropagation except in the final task-specific softmax layer.<br />
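<br />
The feature-extraction idea above can be sketched as follows. This is a minimal, hypothetical illustration (the <code>frozen_path</code> function and caching scheme are stand-ins, not the thesis implementation): since a frozen path produces the same output for the same input, its activations can be cached once and reused across epochs, leaving only the softmax layer to train.<br />

```python
# Sketch of the preprocessing idea: a frozen path's output never changes,
# so compute it once per sample and cache it as extracted features.
# frozen_path and the toy data below are hypothetical stand-ins.

call_count = 0

def frozen_path(x):
    """Stand-in for a forward pass through a frozen path (no backprop)."""
    global call_count
    call_count += 1
    return [xi * 2.0 for xi in x]  # pretend feature extraction

feature_cache = {}

def features(sample_id, x):
    """Compute the frozen-path features once per sample, then reuse them."""
    if sample_id not in feature_cache:
        feature_cache[sample_id] = frozen_path(x)
    return feature_cache[sample_id]

# Two "epochs" over the same data only run the frozen path once per sample.
data = {0: [1.0, 2.0], 1: [3.0, 4.0]}
for _ in range(2):
    for sid, x in data.items():
        feats = features(sid, x)  # only the softmax layer trains on feats

print(call_count)  # → 2 (once per unique sample, not per epoch)
```

In a real run the cached features would be fed to the task-specific softmax layer, so each epoch after the first skips the full forward pass entirely.<br />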
<br />
- When training on a curriculum, decreasing the batch size for each increase in task difficulty might make sense. <br />
Easy examples have little "nuance" between datapoints, so a large batch size might increase convergence speed. <br />
Conversely, complex tasks later in the curriculum might have a lot of "detail" that will be drowned out if the batch size is kept constant.<br />
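<br />
One way to realise this is a simple decaying schedule over the curriculum position. The base size, decay factor, and floor below are illustrative choices, not values from the thesis:<br />

```python
# Hypothetical batch-size schedule for a curriculum: halve the batch size
# for each step up in task difficulty, with a floor so it never collapses.

def curriculum_batch_size(task_index, base=512, decay=0.5, floor=32):
    """Batch size for the task at position task_index in the curriculum."""
    return max(floor, int(base * decay ** task_index))

sizes = [curriculum_batch_size(i) for i in range(6)]
print(sizes)  # → [512, 256, 128, 64, 32, 32]
```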
<br />
== Thesis problem specification == <br />
The thesis studies the behaviour of super neural networks when saturated with subtasks from the same domain, such as in a curriculum learning scenario.<br />
It includes research questions such as: <br />
* Can we estimate the decline in needed capacity for each new sub-task learned from the curriculum? <br />
* Could a PathNet saturated with optimized paths for tasks from a curriculum provide one/few-shot learning?<br />
** What would, in that case, constitute a "saturated PathNet"? <br />
** Is there a learning advantage to be had from this kind of learning? <br />
* Is there a measurable increase in performance by searching over optimal "first paths" instead of just training a selected segment of the PathNet?<br />
<br />
= PathNet Implementation = <br />
The PathNet is implemented using Keras with a TensorFlow backend, in an object-oriented structure with a high level of modularity. <br />
<br />
PathNet layers are represented as subclasses of Layer. Currently only DenseLayer is implemented. <br />
These contain all modules in the layer, as well as functionality for providing a log of layer information (used when saving the PathNet to disk), merging selected modules from the layer into a new model, <br />
and temporarily storing the layer's weights and loading them back (used during backend session resets). <br />
Task objects contain the task's unique softmax layer and a potential optimal path, as well as functionality for providing a log (again, for saving the PathNet to disk) and for applying the unique layer to a new model. <br />
<br />
A PathSearch class contains all implemented search algorithms (currently a tournament search and a simple evolutionary search). This class uses a provided PathNet object, which supplies<br />
paths (genotypes) and models (for fitness evaluation). The search methods return an optimal path alongside a history structure that is used in the Analytics class, <br />
where test results are stored and plotted.<br />
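<br />
The tournament search can be sketched roughly as below. This is a hypothetical, self-contained illustration: the fitness and mutation functions are stand-ins for training and evaluating the corresponding Keras model, and the constants only mirror the small structure described in this page.<br />

```python
import random

# Sketch of a tournament search over PathNet paths. A path (genotype)
# picks one module per layer; fitness() and mutate() are stand-ins for
# training/evaluating the path's model.

LAYERS, MODULES = 3, 10  # mirrors the small 3-layer, 10-20 module structure

def random_path():
    return [random.randrange(MODULES) for _ in range(LAYERS)]

def fitness(path):
    """Stand-in for evaluating the path's model; real code would train it."""
    return -sum(path)  # pretend lower module indices perform better

def mutate(path, rate=0.3):
    return [random.randrange(MODULES) if random.random() < rate else m
            for m in path]

def tournament_search(generations=50, seed=0):
    random.seed(seed)
    a, b = random_path(), random_path()
    history = []
    for _ in range(generations):
        # The winner stays; the loser is replaced by a mutated copy of it.
        winner = a if fitness(a) >= fitness(b) else b
        a, b = winner, mutate(winner)
        history.append(fitness(winner))
    return winner, history

best, history = tournament_search()
assert history[-1] >= history[0]  # the winner's fitness never degrades
```

Because the loser is always replaced by a mutation of the winner, the best fitness seen is monotonically non-decreasing, which is the property the history structure lets the Analytics class plot.<br />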
<br />
=== PathNet structure === <br />
A small structure to reduce computational requirements: <br />
: (3 layers of 10-20 modules, each a small affine MLP)<br />
<br />
=== Test scenario ===<br />
The scenario must be fairly quick to provide one episode, and have small input dimensionality to reduce the necessary capacity of the PathNet structure and the computational time. <br />
It must also be easy to divide into subtasks.<br />
* OpenAI gym? <br />
** LunarLander: <br />
*** Hover<br />
*** Land safely <br />
*** Land in goal<br />
*** Land in goal quickly <br />
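<br />
Splitting LunarLander into the subtasks above could be done with reward shaping around a shared environment. The sketch below is a hypothetical illustration: to stay self-contained it uses a duck-typed environment stub rather than OpenAI Gym, and the subtask reward functions are illustrative guesses, not a worked-out curriculum.<br />

```python
# Sketch: one wrapper per subtask, each replacing the reward signal.
# StubLander stands in for gym.make("LunarLander-v2"); state here is a
# simplified (x, y, vy) triple.

def hover_reward(state):
    x, y, vy = state
    return -abs(vy)            # penalize vertical speed: just stay airborne

def land_in_goal_reward(state):
    x, y, vy = state
    landed = y <= 0.0
    return (10.0 - abs(x)) if landed else 0.0   # reward landing near x = 0

class SubtaskEnv:
    """Wraps any env whose step() returns (state, reward, done),
    replacing the reward with the current subtask's shaped reward."""
    def __init__(self, env, reward_fn):
        self.env, self.reward_fn = env, reward_fn

    def step(self, action):
        state, _, done = self.env.step(action)
        return state, self.reward_fn(state), done

class StubLander:
    """Trivial stand-in: falls straight down from y = 1 at vy = -0.5."""
    def __init__(self):
        self.y = 1.0
    def step(self, action):
        self.y -= 0.5
        return (0.0, self.y, -0.5), 0.0, self.y <= 0.0

env = SubtaskEnv(StubLander(), land_in_goal_reward)
state, reward, done = env.step(None)   # y = 0.5: not landed, reward 0.0
state, reward, done = env.step(None)   # y = 0.0: landed at x = 0, reward 10.0
```

The same PathNet could then be trained against the same underlying environment four times, swapping only the reward function as the curriculum advances.<br />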
<br />
<br />
<br />
= Who cites PathNet? =<br />
''[https://arxiv.org/pdf/1703.10371.pdf Born to Learn]''<br />
EPANN - Evolved Plastic Artificial Neural Networks.<br />
Mentions PathNet as an example where evolution was used to train a network on multiple tasks: "While these results were only possible through significant computational resources, they demonstrate the potential of combining evolution and deep learning approaches."<br />
<br />
''[https://arxiv.org/pdf/1706.00046.pdf Learning time-efficient deep architectures with budgeted super networks]''<br />
Mentions PathNet as a predecessor in the super neural network family<br />
<br />
''[https://arxiv.org/pdf/1708.07902.pdf Deep Learning for video game playing]''<br />
Reviews recent deep learning advances in the context of how they have been applied to play different types of video games.<br />
<br />
''[http://ceur-ws.org/Vol-1958/IOTSTREAMING2.pdf Evolutive deep models for online learning on data streams with no storage]''<br />
PathNet is proposed alongside PNNs as a way to deal with changing environments. Both PathNet and progressive networks are noted to show good results on sequences of tasks and to be a good alternative to fine-tuning for accelerating learning. <br />
<br />
''[https://openreview.net/pdf?id=H1XLbXEtg Online multi-task learning using active sampling]'' <br />
Cites Progressive Neural Networks for multitask learning<br />
<br />
''[http://juxi.net/workshop/deep-learning-rss-2017/papers/Xu.pdf Hierarchical Task Generalization with Neural Programs]''<br />
Mentions PathNet as a way of reusing weights<br />
<br />
''[https://arxiv.org/pdf/1702.02217.pdf Multitask Evolution with Cartesian Genetic Programming]'' <br />
Mentions PathNet in a list of systems that use evolution as tool in multitasking</div>Martijhohttps://robin.wiki.ifi.uio.no/Bruker:MartijhoBruker:Martijho2018-04-26T10:46:24Z<p>Martijho: </p>
<hr />
<div>= Current draft =<br />
<br />
Current draft of the thesis can be found by following this link<br />
:[https://www.dropbox.com/s/mkbbjl7hou708jl/Current%20master%20thesis%20draft%2026.4.pdf?dl=0 Current thesis draft]<br />
Updated as of 26.04.2018<br />
<br />
<br />
; Presentations<br />
: [https://docs.google.com/presentation/d/1I3ObuFmTDMSaoST_G589VSaOIcRpst7osPGR-XxvyOs/edit?usp=sharing| Presentation of PathNet and research questions]<br />
: [https://docs.google.com/presentation/d/1QdQnJfcUNPkWDMSZ2dnVgq2qeHI5mQBge0E9WFBAgtM/edit?usp=sharing| Transfer learning in SNNs: (tl;dr) + First-path-experiments]<br />
: [https://docs.google.com/presentation/d/1Mh5z-AoWE9t0YtXMfJMIm7WHXWDVJkMEqZQvoOmUgtI/edit?usp=sharing| Transfer learning in SNNs: Search-experiments]<br />
: [https://docs.google.com/presentation/d/1jqZsRLzdY9ylh7p3nDj029sNSkalh1YmIEXOkBnhKok/edit?usp=sharing| ML for the cool kids]<br />
<br />
= Thesis structure and notes = <br />
A seperate page describing outline and section structure of the thesis. <br />
: [[Martijho-PathNet-thesis|Thesis structure and outlines]]<br />
<br />
= Experiments = <br />
A separate page describing research questions and the experiments proposed to answer them. <br />
: [[Martijho-PathNet-Experiments|Experiments]]<br />
<br />
<br />
= Terms to use in thesis = <br />
*; Plastic Neural Network<br />
: NN that change topology or connectivity according to learning algorithm<br />
*; In-silico<br />
: performed by computer<br />
*; Modular Super Neural Network<br />
: DNN consisting of modules of smaller NNs,<br />
*; Task-specific meme<br />
: Smallest concise unit of knowledge required to perform some task<br />
: Example - Task = Pick something up. Meme = Ability to bend index finger <br />
*; Memetics<br />
: Study of information in an analogy to Darwinian evolution.<br />
*; Transferability<br />
: Ability to transfer/reuse knowledge between task<br />
*; Saturation in PathNet<br />
: Most or all modules are trained and locked to backpropagation <br />
*; Embedded transfer learning<br />
: Knowledge transfer capability is incorporated into the machine learning structure (PathNet, PNNs)<br />
*; Catastrophic forgetting<br />
: Forgetting previously known task when fine-tuning parameters<br />
*; Evolved sub-models<br />
: Using GAs to evolve paths through a larger set of parameters (PathNet functionality)<br />
<br />
= Thoughts on Thesis =<br />
- Search for the first path is unnecessary? The search is over good permutations of parameters from the network at the same time<br />
the parameters are trained for the first time. In other words: does the search provide a significant increase in transferability or any measurable increase in performance over just picking a random path and training it for a set amount of iterations? <br />
<br />
- When training on a saturated PathNet, it might be quicker to preprocess the data for each path (view it as feature extraction) since there is no backpropagation except for <br />
in the final task-specific softmax layer<br />
<br />
- When training on a curriculum and decrease in batch size for each increase in the task difficulty might make sense. <br />
Easy examples have little "nuance" in-between datapoints so large batch size might increase convergence speed. <br />
Equivalently, complex tasks later on in the curriculum might have a lot of "detail" which will be drowned out if the batch size is kept constant.<br />
<br />
== Thesis problem specification == <br />
Studying the behaviour of super neural networks when saturated with subtasks from the same domain such as in a curriculum learning scenario.<br />
Include research questions such as <br />
* Can we estimate the decline in needed capacity for each new sub-task learned from the curriculum? <br />
* Could a PathNet saturated with optimized paths for tasks from a curriculum provide one/few-shot learning?<br />
** What would, in that case, constitute a "saturated PathNet"? <br />
** Is there a learning advantage to be had from this kind of learning? <br />
* Is there a measurable increase in performance by searching over optimal "first paths" instead of just training a selected segment of the PathNet?<br />
<br />
= PathNet Implementation = <br />
The pathnet is implementet using Keras with a tensorflow backend, in a object oriented structure with a high level of mudularity. <br />
<br />
Pathnet layers are represented as subclasses of Layer. Currently only DenseLayer is implemented. <br />
These contain all modules in the layer and functionality for providing a log of layer-information (used for saving pathnet to disc), merging selected modules from the layer with a new model, <br />
temporarily storing weights in the layer and loading them back (used during backend session reset). <br />
Task-objects contain the unique softmax layer, a potential optimal path as well as functionality for providig log (again: saving pathnet to disc), applying unique layer to a new model. <br />
<br />
A PathSearch class contain all implemented search algorithms (currently tournament and a simple evolutionary search are implemented). This class use a provided pathnet object which provides<br />
paths (genotypes) and models (for fitness evaluation). The search metods returns a optimal path along side a history-structure that are used in the Analytics class. <br />
Here, test results are stored and plotted.<br />
<br />
=== PathNet structure === <br />
Small structure to reduce computational requirements. <br />
: (3 layers 10-20 modules of small affine MLPs)<br />
<br />
=== Test scenario ===<br />
Must be fairly quick to provide one episode. Small input dimentionality to reduce necessary capacity of PathNet structure and computational time. <br />
The scenario must also be easy to divide into subtasks.<br />
* OpenAI gym? <br />
** LunarLander: <br />
*** Hover<br />
*** Land safely <br />
*** Land in goal<br />
*** Land in goal quickly <br />
<br />
<br />
<br />
= Who cites PathNet? =<br />
''[https://arxiv.org/pdf/1703.10371.pdf Born to Learn]''<br />
EPANN - Evolved Plastic Artificial Neural Networks<br />
Mentions Pathnet as an example of where evolution where<br />
used to train a network on multiple tasks. "While these<br />
results were only possible through significant computational <br />
resources, they demonstrate the potential of combining <br />
evolution and deep learning approaches.<br />
<br />
''[https://arxiv.org/pdf/1706.00046.pdf Learning time-efficient deep architectures with budgeted super networks]''<br />
Mentions PathNet as a predecessor in the super neural network family<br />
<br />
'' [https://arxiv.org/pdf/1708.07902.pdf Deep Learning for video game playing]''<br />
Reviewing recent deep learning advances in the context <br />
of how they have been applied to play different types of video games<br />
<br />
''[http://ceur-ws.org/Vol-1958/IOTSTREAMING2.pdf Evolutive deep models for online learning on data streams with no storage]''<br />
Pathnet is proposed alongside PNNS as a way to deal with changing environments. It is mentioned that both PathNet and progressive networks show good results on sequences of tasks and are a good alternative to fine-tuning to accelerate learning. <br />
<br />
''[https://openreview.net/pdf?id=H1XLbXEtg Online multi-task learning using active sampling]'' <br />
Cites Progressive Neural Networks for multitask learning<br />
<br />
''[http://juxi.net/workshop/deep-learning-rss-2017/papers/Xu.pdf Hierarchical Task Generalization with Neural Programs]''<br />
Mentions PathNet as way of reusing weights<br />
<br />
''[https://arxiv.org/pdf/1702.02217.pdf Multitask Evolution with Cartesian Genetic Programming]'' <br />
Mentions PathNet in a list of systems that use evolution as tool in multitasking</div>Martijhohttps://robin.wiki.ifi.uio.no/Bruker:MartijhoBruker:Martijho2018-04-26T10:46:14Z<p>Martijho: </p>
<hr />
<div>= Current draft =<br />
<br />
Current draft of the thesis can be found by following this link<br />
:[https://www.dropbox.com/s/mkbbjl7hou708jl/Current%20master%20thesis%20draft%2026.4.pdf?dl=0 Current thesis draft]<br />
Updated as of 24.04.2018<br />
<br />
<br />
; Presentations<br />
: [https://docs.google.com/presentation/d/1I3ObuFmTDMSaoST_G589VSaOIcRpst7osPGR-XxvyOs/edit?usp=sharing| Presentation of PathNet and research questions]<br />
: [https://docs.google.com/presentation/d/1QdQnJfcUNPkWDMSZ2dnVgq2qeHI5mQBge0E9WFBAgtM/edit?usp=sharing| Transfer learning in SNNs: (tl;dr) + First-path-experiments]<br />
: [https://docs.google.com/presentation/d/1Mh5z-AoWE9t0YtXMfJMIm7WHXWDVJkMEqZQvoOmUgtI/edit?usp=sharing| Transfer learning in SNNs: Search-experiments]<br />
: [https://docs.google.com/presentation/d/1jqZsRLzdY9ylh7p3nDj029sNSkalh1YmIEXOkBnhKok/edit?usp=sharing| ML for the cool kids]<br />
<br />
= Thesis structure and notes = <br />
A seperate page describing outline and section structure of the thesis. <br />
: [[Martijho-PathNet-thesis|Thesis structure and outlines]]<br />
<br />
= Experiments = <br />
A separate page describing research questions and the experiments proposed to answer them. <br />
: [[Martijho-PathNet-Experiments|Experiments]]<br />
<br />
<br />
= Terms to use in thesis = <br />
*; Plastic Neural Network<br />
: NN that change topology or connectivity according to learning algorithm<br />
*; In-silico<br />
: performed by computer<br />
*; Modular Super Neural Network<br />
: DNN consisting of modules of smaller NNs,<br />
*; Task-specific meme<br />
: Smallest concise unit of knowledge required to perform some task<br />
: Example - Task = Pick something up. Meme = Ability to bend index finger <br />
*; Memetics<br />
: Study of information in an analogy to Darwinian evolution.<br />
*; Transferability<br />
: Ability to transfer/reuse knowledge between task<br />
*; Saturation in PathNet<br />
: Most or all modules are trained and locked to backpropagation <br />
*; Embedded transfer learning<br />
: Knowledge transfer capability is incorporated into the machine learning structure (PathNet, PNNs)<br />
*; Catastrophic forgetting<br />
: Forgetting previously known task when fine-tuning parameters<br />
*; Evolved sub-models<br />
: Using GAs to evolve paths through a larger set of parameters (PathNet functionality)<br />
<br />
= Thoughts on Thesis =<br />
- Search for the first path is unnecessary? The search is over good permutations of parameters from the network at the same time<br />
the parameters are trained for the first time. In other words: does the search provide a significant increase in transferability or any measurable increase in performance over just picking a random path and training it for a set amount of iterations? <br />
<br />
- When training on a saturated PathNet, it might be quicker to preprocess the data for each path (view it as feature extraction) since there is no backpropagation except for <br />
in the final task-specific softmax layer<br />
<br />
- When training on a curriculum and decrease in batch size for each increase in the task difficulty might make sense. <br />
Easy examples have little "nuance" in-between datapoints so large batch size might increase convergence speed. <br />
Equivalently, complex tasks later on in the curriculum might have a lot of "detail" which will be drowned out if the batch size is kept constant.<br />
<br />
== Thesis problem specification == <br />
Studying the behaviour of super neural networks when saturated with subtasks from the same domain such as in a curriculum learning scenario.<br />
Include research questions such as <br />
* Can we estimate the decline in needed capacity for each new sub-task learned from the curriculum? <br />
* Could a PathNet saturated with optimized paths for tasks from a curriculum provide one/few-shot learning?<br />
** What would, in that case, constitute a "saturated PathNet"? <br />
** Is there a learning advantage to be had from this kind of learning? <br />
* Is there a measurable increase in performance by searching over optimal "first paths" instead of just training a selected segment of the PathNet?<br />
<br />
= PathNet Implementation = <br />
The pathnet is implementet using Keras with a tensorflow backend, in a object oriented structure with a high level of mudularity. <br />
<br />
Pathnet layers are represented as subclasses of Layer. Currently only DenseLayer is implemented. <br />
These contain all modules in the layer and functionality for providing a log of layer-information (used for saving pathnet to disc), merging selected modules from the layer with a new model, <br />
temporarily storing weights in the layer and loading them back (used during backend session reset). <br />
Task-objects contain the unique softmax layer, a potential optimal path as well as functionality for providig log (again: saving pathnet to disc), applying unique layer to a new model. <br />
<br />
A PathSearch class contain all implemented search algorithms (currently tournament and a simple evolutionary search are implemented). This class use a provided pathnet object which provides<br />
paths (genotypes) and models (for fitness evaluation). The search metods returns a optimal path along side a history-structure that are used in the Analytics class. <br />
Here, test results are stored and plotted.<br />
<br />
=== PathNet structure === <br />
Small structure to reduce computational requirements. <br />
: (3 layers 10-20 modules of small affine MLPs)<br />
<br />
=== Test scenario ===<br />
Must be fairly quick to provide one episode. Small input dimentionality to reduce necessary capacity of PathNet structure and computational time. <br />
The scenario must also be easy to divide into subtasks.<br />
* OpenAI gym? <br />
** LunarLander: <br />
*** Hover<br />
*** Land safely <br />
*** Land in goal<br />
*** Land in goal quickly <br />
<br />
<br />
<br />
= Who cites PathNet? =<br />
''[https://arxiv.org/pdf/1703.10371.pdf Born to Learn]''<br />
EPANN - Evolved Plastic Artificial Neural Networks<br />
Mentions Pathnet as an example of where evolution where<br />
used to train a network on multiple tasks. "While these<br />
results were only possible through significant computational <br />
resources, they demonstrate the potential of combining <br />
evolution and deep learning approaches.<br />
<br />
''[https://arxiv.org/pdf/1706.00046.pdf Learning time-efficient deep architectures with budgeted super networks]''<br />
Mentions PathNet as a predecessor in the super neural network family<br />
<br />
'' [https://arxiv.org/pdf/1708.07902.pdf Deep Learning for video game playing]''<br />
Reviewing recent deep learning advances in the context <br />
of how they have been applied to play different types of video games<br />
<br />
''[http://ceur-ws.org/Vol-1958/IOTSTREAMING2.pdf Evolutive deep models for online learning on data streams with no storage]''<br />
Pathnet is proposed alongside PNNS as a way to deal with changing environments. It is mentioned that both PathNet and progressive networks show good results on sequences of tasks and are a good alternative to fine-tuning to accelerate learning. <br />
<br />
''[https://openreview.net/pdf?id=H1XLbXEtg Online multi-task learning using active sampling]'' <br />
Cites Progressive Neural Networks for multitask learning<br />
<br />
''[http://juxi.net/workshop/deep-learning-rss-2017/papers/Xu.pdf Hierarchical Task Generalization with Neural Programs]''<br />
Mentions PathNet as way of reusing weights<br />
<br />
''[https://arxiv.org/pdf/1702.02217.pdf Multitask Evolution with Cartesian Genetic Programming]'' <br />
Mentions PathNet in a list of systems that use evolution as tool in multitasking</div>Martijhohttps://robin.wiki.ifi.uio.no/Bruker:MartijhoBruker:Martijho2018-04-24T17:17:59Z<p>Martijho: </p>
<hr />
<div>= Current draft =<br />
<br />
Current draft of the thesis can be found by following this link<br />
:[https://www.dropbox.com/s/curk3r125ldlxzv/Master_thesis%2024.april.pdf?dl=0 CurrentMasterThesis.pdf]<br />
Updated as of 24.04.2018<br />
<br />
<br />
; Presentations<br />
: [https://docs.google.com/presentation/d/1I3ObuFmTDMSaoST_G589VSaOIcRpst7osPGR-XxvyOs/edit?usp=sharing| Presentation of PathNet and research questions]<br />
: [https://docs.google.com/presentation/d/1QdQnJfcUNPkWDMSZ2dnVgq2qeHI5mQBge0E9WFBAgtM/edit?usp=sharing| Transfer learning in SNNs: (tl;dr) + First-path-experiments]<br />
: [https://docs.google.com/presentation/d/1Mh5z-AoWE9t0YtXMfJMIm7WHXWDVJkMEqZQvoOmUgtI/edit?usp=sharing| Transfer learning in SNNs: Search-experiments]<br />
: [https://docs.google.com/presentation/d/1jqZsRLzdY9ylh7p3nDj029sNSkalh1YmIEXOkBnhKok/edit?usp=sharing| ML for the cool kids]<br />
<br />
= Thesis structure and notes = <br />
A seperate page describing outline and section structure of the thesis. <br />
: [[Martijho-PathNet-thesis|Thesis structure and outlines]]<br />
<br />
= Experiments = <br />
A separate page describing research questions and the experiments proposed to answer them. <br />
: [[Martijho-PathNet-Experiments|Experiments]]<br />
<br />
<br />
= Terms to use in thesis = <br />
*; Plastic Neural Network<br />
: NN that change topology or connectivity according to learning algorithm<br />
*; In-silico<br />
: performed by computer<br />
*; Modular Super Neural Network<br />
: DNN consisting of modules of smaller NNs,<br />
*; Task-specific meme<br />
: Smallest concise unit of knowledge required to perform some task<br />
: Example - Task = Pick something up. Meme = Ability to bend index finger <br />
*; Memetics<br />
: Study of information in an analogy to Darwinian evolution.<br />
*; Transferability<br />
: Ability to transfer/reuse knowledge between task<br />
*; Saturation in PathNet<br />
: Most or all modules are trained and locked to backpropagation <br />
*; Embedded transfer learning<br />
: Knowledge transfer capability is incorporated into the machine learning structure (PathNet, PNNs)<br />
*; Catastrophic forgetting<br />
: Forgetting previously known task when fine-tuning parameters<br />
*; Evolved sub-models<br />
: Using GAs to evolve paths through a larger set of parameters (PathNet functionality)<br />
<br />
= Thoughts on Thesis =<br />
- Search for the first path is unnecessary? The search is over good permutations of parameters from the network at the same time<br />
the parameters are trained for the first time. In other words: does the search provide a significant increase in transferability or any measurable increase in performance over just picking a random path and training it for a set amount of iterations? <br />
<br />
- When training on a saturated PathNet, it might be quicker to preprocess the data for each path (view it as feature extraction) since there is no backpropagation except for <br />
in the final task-specific softmax layer<br />
<br />
- When training on a curriculum and decrease in batch size for each increase in the task difficulty might make sense. <br />
Easy examples have little "nuance" in-between datapoints so large batch size might increase convergence speed. <br />
Equivalently, complex tasks later on in the curriculum might have a lot of "detail" which will be drowned out if the batch size is kept constant.<br />
<br />
== Thesis problem specification == <br />
Studying the behaviour of super neural networks when saturated with subtasks from the same domain such as in a curriculum learning scenario.<br />
Include research questions such as <br />
* Can we estimate the decline in needed capacity for each new sub-task learned from the curriculum? <br />
* Could a PathNet saturated with optimized paths for tasks from a curriculum provide one/few-shot learning?<br />
** What would, in that case, constitute a "saturated PathNet"? <br />
** Is there a learning advantage to be had from this kind of learning? <br />
* Is there a measurable increase in performance by searching over optimal "first paths" instead of just training a selected segment of the PathNet?<br />
<br />
= PathNet Implementation = <br />
The pathnet is implementet using Keras with a tensorflow backend, in a object oriented structure with a high level of mudularity. <br />
<br />
Pathnet layers are represented as subclasses of Layer. Currently only DenseLayer is implemented. <br />
These contain all modules in the layer and functionality for providing a log of layer-information (used for saving pathnet to disc), merging selected modules from the layer with a new model, <br />
temporarily storing weights in the layer and loading them back (used during backend session reset). <br />
Task-objects contain the unique softmax layer, a potential optimal path as well as functionality for providig log (again: saving pathnet to disc), applying unique layer to a new model. <br />
<br />
A PathSearch class contain all implemented search algorithms (currently tournament and a simple evolutionary search are implemented). This class use a provided pathnet object which provides<br />
paths (genotypes) and models (for fitness evaluation). The search metods returns a optimal path along side a history-structure that are used in the Analytics class. <br />
Here, test results are stored and plotted.<br />
<br />
=== PathNet structure === <br />
Small structure to reduce computational requirements. <br />
: (3 layers 10-20 modules of small affine MLPs)<br />
<br />
=== Test scenario ===<br />
Must be fairly quick to provide one episode. Small input dimentionality to reduce necessary capacity of PathNet structure and computational time. <br />
The scenario must also be easy to divide into subtasks.<br />
* OpenAI gym? <br />
** LunarLander: <br />
*** Hover<br />
*** Land safely <br />
*** Land in goal<br />
*** Land in goal quickly <br />
<br />
<br />
<br />
= Who cites PathNet? =<br />
''[https://arxiv.org/pdf/1703.10371.pdf Born to Learn]''<br />
EPANN - Evolved Plastic Artificial Neural Networks<br />
Mentions PathNet as an example where evolution was<br />
used to train a network on multiple tasks: "While these<br />
results were only possible through significant computational <br />
resources, they demonstrate the potential of combining <br />
evolution and deep learning approaches."<br />
<br />
''[https://arxiv.org/pdf/1706.00046.pdf Learning time-efficient deep architectures with budgeted super networks]''<br />
Mentions PathNet as a predecessor in the super neural network family<br />
<br />
''[https://arxiv.org/pdf/1708.07902.pdf Deep Learning for video game playing]''<br />
Reviews recent deep learning advances in the context <br />
of how they have been applied to playing different types of video games<br />
<br />
''[http://ceur-ws.org/Vol-1958/IOTSTREAMING2.pdf Evolutive deep models for online learning on data streams with no storage]''<br />
PathNet is proposed alongside PNNs as a way to deal with changing environments. It is mentioned that both PathNet and progressive networks show good results on sequences of tasks and are a good alternative to fine-tuning for accelerating learning. <br />
<br />
''[https://openreview.net/pdf?id=H1XLbXEtg Online multi-task learning using active sampling]'' <br />
Cites Progressive Neural Networks for multitask learning<br />
<br />
''[http://juxi.net/workshop/deep-learning-rss-2017/papers/Xu.pdf Hierarchical Task Generalization with Neural Programs]''<br />
Mentions PathNet as a way of reusing weights<br />
<br />
''[https://arxiv.org/pdf/1702.02217.pdf Multitask Evolution with Cartesian Genetic Programming]'' <br />
Mentions PathNet in a list of systems that use evolution as a tool for multitasking</div>Martijhohttps://robin.wiki.ifi.uio.no/Bruker:MartijhoBruker:Martijho2018-04-24T17:17:34Z<p>Martijho: </p>
<hr />
<div><br />
= Current draft =<br />
<br />
Current draft of the thesis can be found by following this link<br />
[https://www.dropbox.com/s/curk3r125ldlxzv/Master_thesis%2024.april.pdf?dl=0 CurrentMasterThesis.pdf]<br />
Updated as of 24.04.2018<br />
<br />
<br />
; Presentations<br />
: [https://docs.google.com/presentation/d/1I3ObuFmTDMSaoST_G589VSaOIcRpst7osPGR-XxvyOs/edit?usp=sharing| Presentation of PathNet and research questions]<br />
: [https://docs.google.com/presentation/d/1QdQnJfcUNPkWDMSZ2dnVgq2qeHI5mQBge0E9WFBAgtM/edit?usp=sharing| Transfer learning in SNNs: (tl;dr) + First-path-experiments]<br />
: [https://docs.google.com/presentation/d/1Mh5z-AoWE9t0YtXMfJMIm7WHXWDVJkMEqZQvoOmUgtI/edit?usp=sharing| Transfer learning in SNNs: Search-experiments]<br />
: [https://docs.google.com/presentation/d/1jqZsRLzdY9ylh7p3nDj029sNSkalh1YmIEXOkBnhKok/edit?usp=sharing| ML for the cool kids]<br />
<br />
= Thesis structure and notes = <br />
A separate page describing the outline and section structure of the thesis. <br />
: [[Martijho-PathNet-thesis|Thesis structure and outlines]]<br />
<br />
= Experiments = <br />
A separate page describing research questions and the experiments proposed to answer them. <br />
: [[Martijho-PathNet-Experiments|Experiments]]<br />
<br />
<br />
= Terms to use in thesis = <br />
*; Plastic Neural Network<br />
: An NN that changes its topology or connectivity according to a learning algorithm<br />
*; In-silico<br />
: performed by computer<br />
*; Modular Super Neural Network<br />
: A DNN consisting of modules of smaller NNs<br />
*; Task-specific meme<br />
: Smallest concise unit of knowledge required to perform some task<br />
: Example - Task = Pick something up. Meme = Ability to bend index finger <br />
*; Memetics<br />
: Study of information in an analogy to Darwinian evolution.<br />
*; Transferability<br />
: The ability to transfer/reuse knowledge between tasks<br />
*; Saturation in PathNet<br />
: Most or all modules are trained and locked against further backpropagation <br />
*; Embedded transfer learning<br />
: Knowledge transfer capability is incorporated into the machine learning structure (PathNet, PNNs)<br />
*; Catastrophic forgetting<br />
: Forgetting previously known tasks when fine-tuning parameters<br />
*; Evolved sub-models<br />
: Using GAs to evolve paths through a larger set of parameters (PathNet functionality)<br />
<br />
= Thoughts on Thesis =<br />
- Is the search for the first path unnecessary? The search is over good permutations of parameters from the network at the same time as<br />
the parameters are trained for the first time. In other words: does the search provide a significant increase in transferability, or any measurable increase in performance, over just picking a random path and training it for a set number of iterations? <br />
<br />
- When training on a saturated PathNet, it might be quicker to preprocess the data for each path (viewing it as feature extraction), since there is no backpropagation except in <br />
the final task-specific softmax layer<br />
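The preprocessing idea can be sketched in plain Python: run the data through the locked path once, cache the features, and fit only the task-specific softmax on them. The weights below are toy values standing in for a locked path:

```python
def frozen_path(x, W):
    """A fixed linear map standing in for the locked modules along one path."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cache_path_features(inputs, W):
    # run every example through the frozen path once; the cached features can
    # then be reused each epoch while only the softmax layer is trained
    return [frozen_path(x, W) for x in inputs]

# toy locked path: 4-d inputs -> 3-d features (weights are illustrative)
W = [[1.0, 0.0, 0.0, 0.0],   # each row holds one feature's weights
     [0.0, 1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0]]

features = cache_path_features([[1, 0, 0, 0], [0, 1, 0, 0]], W)
```

The saving comes from replacing a full forward pass per epoch with a single cached lookup, at the cost of storing the features.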
<br />
- When training on a curriculum, a decrease in batch size for each increase in task difficulty might make sense. <br />
Easy examples have little "nuance" between datapoints, so a large batch size might increase convergence speed. <br />
Equivalently, complex tasks later in the curriculum might have a lot of "detail" which will be drowned out if the batch size is kept constant.<br />
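As one concrete (assumed) schedule for the batch-size idea above, the batch size could simply be halved for each step up in curriculum difficulty:

```python
def batch_size_for_task(task_index, base=128, floor=16, decay=2):
    """Halve the batch size for each step up in curriculum difficulty,
    down to a floor, so later, more 'detailed' tasks get finer gradients."""
    return max(floor, base // (decay ** task_index))
```

The base, floor and decay values here are placeholders; whether such a schedule actually helps would itself be an experiment.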
<br />
== Thesis problem specification == <br />
Studying the behaviour of super neural networks when saturated with subtasks from the same domain, such as in a curriculum learning scenario.<br />
This includes research questions such as: <br />
* Can we estimate the decline in needed capacity for each new sub-task learned from the curriculum? <br />
* Could a PathNet saturated with optimized paths for tasks from a curriculum provide one/few-shot learning?<br />
** What would, in that case, constitute a "saturated PathNet"? <br />
** Is there a learning advantage to be had from this kind of learning? <br />
* Is there a measurable increase in performance by searching over optimal "first paths" instead of just training a selected segment of the PathNet?<br />
</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_38_(2017)Progress for week 38 (2017)2018-04-19T13:16:45Z<p>Martijho: </p>
<hr />
<div>== Martin ==<br />
=== Budget ===<br />
* Finish the essay<br />
<br />
=== Accounting ===<br />
* Finished the essay [https://www.dropbox.com/s/02n6orwh5ayocr9/Thesis%20essay%20-%20Martin%20J.%20Hovin.pdf?dl=0]<br />
* Had a meeting with Arjun from Telenor Research<br />
<br />
== Jonas ==<br />
=== Budget ===<br />
* Finish the essay<br />
<br />
=== Accounting ===<br />
* Finished the essay<br />
* Arranged to meet Ole Jakob<br />
<br />
== Kim ==<br />
=== Accounting ===<br />
* Finished the essay<br />
* Baked a cake<br />
<br />
=== Budget - Next week? ===<br />
* Familiarize myself with simulators<br />
* Choose a robot model<br />
<br />
<br />
<br />
== Student template (copy this for your entry) ==<br />
=== Budget ===<br />
* Todo 1<br />
* Todo 2<br />
<br />
=== Accounting ===<br />
* Done 1<br />
* Done 2<br />
<br />
== Example student (for your entry, copy from above) ==<br />
=== Budget ===<br />
* I will read the Wikipedia article on evolutionary algorithms<br />
* I will install LaTeX on my laptop<br />
<br />
=== Accounting ===<br />
* I read the Wikipedia article on evolutionary algorithms, and also followed it further to pages on genetic algorithms, evolutionary strategies, and multi-objective evolutionary algorithms.<br />
* I tried to install LaTeX on my laptop, but it didn't work (the makefile failed)</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_16_(2018)Progress for week 16 (2018)2018-04-19T13:15:38Z<p>Martijho: </p>
<hr />
<div>== Vetle Bu Solgård ==<br />
=== Budget ===<br />
* Start DDPG and TRPO implementation<br />
* Make a decision and implement on where to validate algorithms<br />
* Extend REINFORCE with baseline<br />
* Read up on typical state representations and reward signals for locomotion tasks<br />
* Get an overview of potential different tasks to explore with Dyret (balance, movement speed, etc.)<br />
<br />
=== Accounting ===<br />
*<br />
<br />
== Martin Hovin ==<br />
=== Budget ===<br />
* Write results, discussion and conclusion in Relearning <br />
* Write the introduction<br />
* Write the ending of the thesis <br />
<br />
=== Accounting ===<br />
* Wrote the first draft of Relearning<br />
** Is the ending too short? <br />
** Does the first half tie in with the last? <br />
* Wrote a short intro<br />
** Is more needed?<br />
** Does the problem statement need to be made more concrete?<br />
* Started on the ending<br />
** The sections are now: <br />
:: Result summary<br />
:: Discussion summary<br />
:: Thesis conclusion<br />
:: Future work</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_14_(2018)Progress for week 14 (2018)2018-04-10T17:18:13Z<p>Martijho: /* Accounting */</p>
<hr />
<div>== Vetle Bu Solgård ==<br />
=== Budget ===<br />
* Finish RL environment implementation<br />
* Continue research of algorithms to implement<br />
* Start implementing chosen algo(s)<br />
<br />
=== Accounting ===<br />
* Environment semi finished<br />
* REINFORCE simplest candidate. Will continue with TRPO and DDPG.<br />
* Implementation of REINFORCE started.<br />
<br />
== Martin Hovin ==<br />
=== Budget ===<br />
* Update experiment 2 with new results<br />
* Add a section on ANOVA to the theoretical background<br />
* Write about the diversity metric in the background<br />
<br />
=== Accounting ===<br />
* Updated experiment 2 with new results<br />
* Updated the results with a Mann-Whitney test<br />
* Ran algorithms 3a and 3b with tournament size range 2-10 <br />
** Mann-Whitney test of the difference<br />
** Wrote an addendum to the chapter <br />
* Wrote about ANOVA in the theoretical background<br />
* Removed ANOVA from the theoretical background<br />
* Wrote about Mann-Whitney and the Bonferroni correction in the theoretical background<br />
* Wrote about diversity in the background<br />
<br />
* My progress on the thesis so far can be found [https://www.dropbox.com/s/amb429kvkum1yut/current_master_thesis.pdf?dl=0 here]</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_15_(2018)Progress for week 15 (2018)2018-04-03T09:44:19Z<p>Martijho: </p>
<hr />
<div>== Vetle Bu Solgård ==<br />
=== Budget ===<br />
* <br />
<br />
=== Accounting ===<br />
*<br />
<br />
== Martin Hovin ==<br />
=== Budget ===<br />
* Finish Relearn<br />
* Write about results<br />
* Write the discussion<br />
* Write the conclusion<br />
* Add any new parts to the theoretical background</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_16_(2018)Progress for week 16 (2018)2018-04-03T09:43:16Z<p>Martijho: </p>
<hr />
<div>== Vetle Bu Solgård ==<br />
=== Budget ===<br />
* <br />
<br />
=== Accounting ===<br />
*<br />
<br />
== Martin Hovin ==<br />
=== Budget ===<br />
* Improve the text<br />
* Write the introduction<br />
* Write the acknowledgements<br />
* Write the abstract<br />
=== Accounting ===<br />
*</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_17_(2018)Progress for week 17 (2018)2018-04-03T09:42:42Z<p>Martijho: </p>
<hr />
<div>== Vetle Bu Solgård ==<br />
=== Budget ===<br />
* <br />
<br />
=== Accounting ===<br />
*<br />
<br />
== Martin Hovin ==<br />
=== Budget ===<br />
* Improve the text<br />
* Write the introduction<br />
* Write the acknowledgements and abstract<br />
=== Accounting ===<br />
*</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_18_(2018)Progress for week 18 (2018)2018-04-03T09:42:04Z<p>Martijho: Ny side: == Vetle Bu Solgård == === Budget === * === Accounting === * == Martin Hovin == === Budget === * Forbedre tekst * Levere === Accounting === *</p>
<hr />
<div>== Vetle Bu Solgård ==<br />
=== Budget ===<br />
* <br />
<br />
=== Accounting ===<br />
*<br />
<br />
== Martin Hovin ==<br />
=== Budget ===<br />
* Improve the text<br />
* Submit <br />
<br />
=== Accounting ===<br />
*</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_17_(2018)Progress for week 17 (2018)2018-04-03T09:41:36Z<p>Martijho: Ny side: == Vetle Bu Solgård == === Budget === * === Accounting === * == Martin Hovin == === Budget === * === Accounting === *</p>
<hr />
<div>== Vetle Bu Solgård ==<br />
=== Budget ===<br />
* <br />
<br />
=== Accounting ===<br />
*<br />
<br />
== Martin Hovin ==<br />
=== Budget ===<br />
* <br />
=== Accounting ===<br />
*</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_16_(2018)Progress for week 16 (2018)2018-04-03T09:41:25Z<p>Martijho: Ny side: == Vetle Bu Solgård == === Budget === * === Accounting === * == Martin Hovin == === Budget === * === Accounting === *</p>
<hr />
<div>== Vetle Bu Solgård ==<br />
=== Budget ===<br />
* <br />
<br />
=== Accounting ===<br />
*<br />
<br />
== Martin Hovin ==<br />
=== Budget ===<br />
* <br />
=== Accounting ===<br />
*</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_15_(2018)Progress for week 15 (2018)2018-04-03T09:41:11Z<p>Martijho: Ny side: == Vetle Bu Solgård == === Budget === * === Accounting === * == Martin Hovin == === Budget === *</p>
<hr />
<div>== Vetle Bu Solgård ==<br />
=== Budget ===<br />
* <br />
<br />
=== Accounting ===<br />
*<br />
<br />
== Martin Hovin ==<br />
=== Budget ===<br />
*</div>Martijhohttps://robin.wiki.ifi.uio.no/ProgressH2017ProgressH20172018-04-03T09:40:12Z<p>Martijho: </p>
<hr />
<div>==== Progress ====<br />
* [[Progress for week 38 (2017)]]<br />
* [[Progress for week 39 (2017)]]<br />
* [[Progress for week 40 & 41 (2017)]]<br />
* [[Progress for week 42 (2017)]]<br />
* [[Progress for week 43 (2017)]]<br />
* [[Progress for week 44 (2017)]]<br />
* [[Progress for week 45 (2017)]]<br />
* [[Progress for week 46 (2017)]]<br />
* [[Progress for week 47 (2017)]]<br />
* [[Progress for week 4 (2018)]]<br />
* [[Progress for week 5 (2018)]]<br />
* [[Progress for week 6 (2018)]]<br />
* [[Progress for week 7 (2018)]]<br />
* [[Progress for week 8 (2018)]]<br />
* [[Progress for week 9 (2018)]]<br />
* [[Progress for week 10 (2018)]]<br />
* [[Progress for week 11 (2018)]]<br />
* [[Progress for week 12 (2018)]]<br />
* [[Progress for week 13 (2018)]]<br />
* [[Progress for week 14 (2018)]]<br />
* [[Progress for week 15 (2018)]]<br />
* [[Progress for week 16 (2018)]]<br />
* [[Progress for week 17 (2018)]]<br />
* [[Progress for week 18 (2018)]]<br />
<br />
==== Weekly status meetings ====<br />
* [https://docs.google.com/spreadsheets/d/1PooW29tk3VEfJyKGOmfUXtRnrpCywh7xnMK4to941Pg/edit?usp=sharing| Participation at meetings]<br />
<br />
==== How to use this page ====<br />
This page is meant for master students in the ROBIN group to track progress. We believe it is very useful for you to have such an overview, and not least to think concretely about progress in this way right from the start. It also helps supervisors provide better and more to-the-point supervision.<br />
<br />
We have divided progress for each student into two parts, which we refer to as budgeting and accounting. Budgeting is simply the plans written up ahead, while accounting is the list of what was actually done, filled in afterwards. Some students also distinguish between visible and invisible progress. The point of this is to become more aware of important progress that is not directly measurable. As an example, writing a page of text is very visible progress, while determining how to approach a task is just as important; but since it does not produce visible results, one can easily feel that nothing has been done or gained.<br />
<br />
At the start of every week, you should first go into the entry for the previous week and fill in the accounting part (ideally you would already have done this at the end of the previous week). Then, you go into the coming week and write the budget for this week (this can of course also be done the week before if you prefer).</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_13_(2018)Progress for week 13 (2018)2018-04-02T17:16:48Z<p>Martijho: </p>
<hr />
<div>== Vetle Bu Solgård ==<br />
=== Budget ===<br />
* <br />
<br />
=== Accounting ===<br />
*<br />
<br />
== Martin Hovin ==<br />
=== Budget ===<br />
* Write the theoretical background<br />
* Write the implementation <br />
* Finish the search experiments<br />
<br />
=== Accounting ===<br />
* Wrote the theoretical background<br />
* Wrote the implementation<br />
* The search experiments are finished<br />
* Data usage is finished<br />
* Wrote the intro, hypothesis and implementation of the relearning experiments<br />
* Ran relearning experiments for 50 and 150 generations.</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-28T18:26:31Z<p>Martijho: /* Theoretical Background */</p>
<hr />
<div>; Notes<br />
: Are the experiments replicable? What needs to be done to get the same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to the original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
: Look through GAs in background to check if more is needed<br />
<br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: <s>Some visualization of a genetic algorithm. Preferably tournament search?</s><br />
:: Visualization of why training is separated from evaluation in exp2?<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
:::::<b>3 to 5 pages</b><br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it that a human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations have grown increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The field's rapid growth in popularity in recent years has yielded multiple advances[CITATION NEEDED]. While some of these advances are building technologies such as self-driving cars, others are focused on working towards the ultimate goal of reaching ''Artificial General Intelligence'' (AGI): a system capable of not only human-level performance in one field, but of generalizing across a vast number of domains. <br />
<br />
:In the quest for an artificial general intelligence agent, while there might be disagreement on which sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and on PathNet, a structure developed to take advantage of the gains from this technique. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in the number of tasks it can retain (EWC)<br />
<br />
Optimize the reuse of knowledge while still providing valid solutions to tasks. More reuse and limited use of capacity will increase the number of tasks a structure can learn. <br />
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. Would it work better with different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What does modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning end-to-end first, then with PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: naive search. Would higher exploitation improve module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try them. <br />
** Two tasks, where the first is learned end-to-end vs. with PN<br />
** List algorithms with different selection pressure and try them on multiple tasks.<br />
<br />
<br />
== Thesis outline == <br />
; Theoretical Background <br />
: What is discussed and why. What does the thesis build on?<br />
; Implementation<br />
: Datasets<br />
: Programming language<br />
:: Packages<br />
: Code structure<br />
; Experiment 1<br />
: What do I attempt to answer, and how?<br />
; Experiment 2<br />
: What do I attempt to answer, and how?<br />
; Conclusion<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
:::::<b>15 to 20 pages</b><br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 + x2*w2 ... + bias), summed, then an activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function (cost/error) calculates the difference between expected output and target output<br />
* ref to experiments; cross-entropy not covered in detail. Well suited to softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* The driving force is backpropagation and the optimization algorithm, which is used to calculate a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms; the most common is gradient descent. <br />
* Not diving into details here. Using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs' function-estimation properties are used in reinforcement learning and regression, but here used for classification purposes. <br />
* final layer is a softmax output estimating the probability of each class label; it therefore outputs a vector of values in [0, 1], where the index of the largest value is selected as the label. <br />
* image classification is done based on input pixel values<br />
* NNs are bad at this, as an image class manifold can be highly complex (ref transition between binary and quinary MNIST)<br />
* convolutional operations.<br />
* takes an image as input and performs a convolution between the image and a kernel of weights. <br />
* outputs what is called a feature map. As with an NN, each pixel here is a simple combination of multiplications and summations,<br />
* but each pixel in the feature map contains information about the local spatial area the kernel covered. <br />
* control this spatial area with the kernel size and stride (the jumps made by the kernel). <br />
* a conv layer's channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of convolution operations in a network to generalize to the given images. Each layer contains an abstraction level and outputs a feature map. <br />
* called a Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level, the feature map is reduced in spatial dimensions but increased in channels. <br />
* usually, for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
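The forward-pass and convolution bullets above can be sketched in a few lines of NumPy. This is a toy illustration of the operations described, not the thesis implementation; all names are chosen for the example.<br />

```python
import numpy as np

def relu(x):
    # Rectified linear unit: elementwise max(0, x)
    return np.maximum(0.0, x)

def softmax(z):
    # Scales the outputs to sum to 1 so they can be read as class probabilities
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(p, target_index):
    # Loss between the predicted distribution p and a one-hot target
    return -np.log(p[target_index])

def conv2d(image, kernel, stride=1):
    # Naive single-channel convolution: each output pixel summarizes the
    # local spatial area covered by the kernel
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = (patch * kernel).sum()
    return out

# One artificial neuron: weighted sum of inputs plus a bias, then an activation
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
pre_activation = x @ w + 0.3  # x1*w1 + x2*w2 + ... + bias
probs = softmax(np.array([pre_activation, 0.0]))
loss = cross_entropy(probs, target_index=0)
```

Note how the kernel size and stride control the output's spatial dimensions, as in the bullets above.<br />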
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures fall under the DNN umbrella. In recent years a multitude with different applicability have been used commercially and in research<br />
* Architectures depend on the input type, the problems they are applied to, and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning as a method of reusing models for different tasks. <br />
* Train a model on one set of data for one task, reuse the trained weights as the starting point for training<br />
* usually, randomly initialized weights are the starting point. <br />
* It has been shown that reusing weights from similar tasks and training them on new data yields better results in some cases. <br />
* Reduces the training data needed. <br />
* <b>Can pretrain a model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without needing immense computational resources.</b><br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who has done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However, a problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
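The weight-reuse and freezing mechanism sketched in the bullets above can be illustrated without any ML framework. This is a minimal toy sketch of the idea (the `Layer`, `transfer`, and `update_step` names are invented for the example), not the thesis code:<br />

```python
import random

class Layer:
    """A toy 'layer' holding a single weight, to illustrate freezing."""
    def __init__(self, weight=None):
        self.weight = random.random() if weight is None else weight
        self.trainable = True

def transfer(source_layers, n_reused):
    # Reuse the first n_reused trained layers as the starting point for a
    # new task and freeze them; the remaining layers start from random weights.
    new_layers = []
    for i, layer in enumerate(source_layers):
        if i < n_reused:
            reused = Layer(weight=layer.weight)
            reused.trainable = False  # frozen: protected from updates
            new_layers.append(reused)
        else:
            new_layers.append(Layer())  # randomly initialized
    return new_layers

def update_step(layers, gradient=0.1):
    # Backpropagation stand-in: only trainable layers change, so the
    # transferred knowledge in the frozen part is not overwritten
    # (no catastrophic forgetting there).
    for layer in layers:
        if layer.trainable:
            layer.weight -= gradient
```

Freezing is what separates plain fine-tuning (everything trainable, old knowledge overwritten) from the locked-module schemes discussed below.<br />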
<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
<br />
== PathNet == <br />
Rework essay section<br />
<br />
=== Search ===<br />
* Tournament k=2, p=1<br />
* fitness evaulation updates the weights<br />
* training is done during search. <br />
* locking modules when optimal path is found<br />
** Why loocking?<br />
* Modules are reinitialized after search if they are not locked. <br />
** Why?<br />
* Search lasts for a set duration (accuracy threshold or generation limit)<br />
<br />
=== Structure ===<br />
* Layers of modules<br />
* Module is a Neural Network<br />
* Reduced sum between layers (adding module outputs together)<br />
* Task unique layer at the end of each path (each path in a search have the same end layer)<br />
* Each path has a max number of possible modules from each layer <br />
** Limiting the possible capacity in the network (explain capacity)<br />
<br />
== Monte Carlo probability approximation == <br />
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
:::::<b>5 til 15 sider</b><br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (math stuffs)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable for ease of prototyping pathnet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation do lots on CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Noteable differences in implementation <br />
** Keras implementasjon<br />
** Path fitness not negative error but accuracy<br />
** exp 2: fitness calculated before evaluation (not same step)<br />
** Not added any noise to training data<br />
* Implementation problems<br />
** Tensorflow sessions not made for using multiple graphs<br />
*** Resetting backend session after a number of models are made<br />
** Tensorflow-gpus default is using all gpu memory it can <br />
*** Limiting data allocation to scale when needed<br />
** Tensorflow session does not free allocated memory before python thread is done. <br />
*** Run all experiments through treads. <br />
* Code available on github<br />
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution on each class follows Benfords law, which can be expected from a natural dataset such as this.<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitute. Given a random selection of samples from this set, this percentage should approximately be the probability for selection each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* functions. callback to theoretical background and GA buzzwords<br />
* parameterization<br />
<br />
= Experiment 1: Search versus Selection =<br />
:::::<b>35 til 45 sider / 2</b><br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
:::::<b>35 til 45 sider / 2</b><br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
:::::<b>5 sider</b><br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers results. <br />
\subsection{Common errors}<br />
Too far reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around this. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far reaching with respect to the achieved results; the conclusions do not correspond with the purpose<br />
<br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-27T13:33:09Z<p>Martijho: </p>
<hr />
<div>; Notes<br />
: Experiments repicable? What to do to get same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
: Look through GAs in background to check if more is needed<br />
<br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: <s>Some visualization of a genetic algorithm. Preferably tournament search?</s><br />
:: Visualization of why training is separated from evaluation in exp2?<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
:::::<b>3 til 5 sider</b><br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it a human brain is seemingly capable of learning an endless amount of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations are growing increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The fields rapid growth in popularity the later years have yielded multiple advances[CITATION NEEDED] . While some of the advances are building such technologies as self driving cars, others are focused on working towards a ultimate goal of reaching ''Artificial General Intelligence''(AGI). A system capable of not only human-level performance in one field but able to generalize across a vast number of domains. <br />
<br />
:In the quest for a artificial general intelligence agent, while there might be disagreement on what sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and a structure developed to take advantage of the gain this technique called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in number of tasks it can retains(EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase amount of task a structure can learn. <br />
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What do modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Higher exploitation improve on module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<br />
== Thesis outline == <br />
; Theoretical Background <br />
: What is discussed and why. What does the thesis build on?<br />
; Implementation<br />
: Datasets<br />
: Programming language<br />
:: Packages<br />
: Code structure<br />
; Experiment 1<br />
: What do i attempt to answer and how?<br />
; Experiment 2<br />
: What do i attempt to answer and how?<br />
; Conclusion<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
:::::<b>15 til 20 sider</b><br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function(cost/error) calculate calculates the difference between expected output and target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is the backpropagation and the optimization algorithm which is used to calculates a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs function estimation properties used in reinforcement learning, regression but here used for classification purpose. <br />
* final layer softmax-output to estimate the probability of class label, therefore, outputs vector of values [0, 1] where index of largest value selected as label. <br />
* image classification is done based on input pixel values<br />
* NNs bad at this as images class manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* inputs image and performs convolutional operation on image and a kernel of weights. <br />
* outputs what is called feature map. as with NN each pixes here is simple combination of multiplications and summing<br />
* but each pixel in feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with kernel size and stride (jumps made by kernel). <br />
* convlayers channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of convoperations in a network to generalize to the images given. Each layer contain a abstractation level and outputs a feature map. <br />
* called Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level feature map is reduced in spatial image dimentions but increased in channels. <br />
* usually for image classification, feature map is flattened at some point and ran through a fully connected classification layer which learns the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures that fall in DNN. Later years multitude with different applicability have been used commercially and in research<br />
* Architectures depend on input type, problems they are applied to and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training in DNNS take time. Transfer learning as method of reusing models for different tasks. <br />
* Train model on one set of data for one task, reuse the trained weights as starting point for training<br />
* usually randomly initialized weights as starting point. <br />
* It is shown that reusing weights in similar tasks and training weights on new data yields better results in some cases. <br />
* Reduces needed training data. <br />
* <b>Can pretrain some model on f.eks image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without immense computational resources needed.</b><br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
<br />
== PathNet == <br />
Rework essay section<br />
<br />
=== Search ===<br />
* Tournament k=2, p=1<br />
* fitness evaulation updates the weights<br />
* training is done during search. <br />
* locking modules when optimal path is found<br />
** Why loocking?<br />
* Modules are reinitialized after search if they are not locked. <br />
** Why?<br />
* Search lasts for a set duration (accuracy threshold or generation limit)<br />
<br />
=== Structure ===<br />
* Layers of modules<br />
* Module is a Neural Network<br />
* Reduced sum between layers (adding module outputs together)<br />
* Task unique layer at the end of each path (each path in a search have the same end layer)<br />
* Each path has a max number of possible modules from each layer <br />
** Limiting the possible capacity in the network (explain capacity)<br />
<br />
<br />
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
:::::<b>5 til 15 sider</b><br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** NumPy (numerical computation)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterized, for quick prototyping of PathNet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating Keras models<br />
***** Static methods for creating PathNet structures<br />
***** Resetting the backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation does a lot of work on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in Keras. <br />
* Notable differences in implementation <br />
** Keras implementation<br />
** Path fitness not negative error but accuracy<br />
** exp 2: fitness calculated before evaluation (not same step)<br />
** Not added any noise to training data<br />
* Implementation problems<br />
** TensorFlow sessions are not made for using multiple graphs<br />
*** Resetting the backend session after a number of models are made<br />
** TensorFlow-GPU by default allocates all the GPU memory it can <br />
*** Limit memory allocation to grow only as needed<br />
** TensorFlow sessions do not free allocated memory until the Python thread finishes. <br />
*** Run all experiments in separate threads. <br />
* Code available on github<br />
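The class layout listed above can be sketched, framework-free, roughly as follows (all names and signatures are illustrative assumptions; the real classes wrap Keras layers and also handle model building, task definitions, search and plotting):<br />

```python
import random

class Module:
    """One small neural network; in the real code, a stack of Keras layers."""
    def __init__(self, name):
        self.name = name
        self.locked = False          # locked modules keep their trained weights

class Layer:
    """A row of interchangeable modules."""
    def __init__(self, index, width):
        self.modules = [Module(f"L{index}M{m}") for m in range(width)]

class PathNet:
    """Holds the module grid and builds path genotypes from it."""
    def __init__(self, depth, width, max_modules_per_layer):
        self.layers = [Layer(i, width) for i in range(depth)]
        self.max_modules = max_modules_per_layer

    def random_path(self, rng=random):
        """One genotype: for each layer, up to max_modules module indices."""
        return [sorted(rng.sample(range(len(layer.modules)),
                                  rng.randint(1, self.max_modules)))
                for layer in self.layers]
```

Keeping the structure parameterized like this (depth, width, modules per layer) is what makes it easy to prototype different PathNet configurations.<br />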
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The distribution of samples across classes roughly follows Benford's law, as can be expected for naturally occurring numbers such as house numbers.<br />
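For reference, Benford's law predicts that the leading digit d occurs with frequency log10(1 + 1/d); strictly it covers only digits 1 through 9, since 0 never appears as a leading digit:<br />

```python
import math

def benford_expected():
    """Expected frequency of each leading digit d under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}
```

The cropped SVHN classes only loosely track these values (e.g. class 1 sits at 17.0% against Benford's 30.1%), since the crops include non-leading digits as well; the qualitative skew toward small digits is what is meant here.<br />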
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitute. Given a random selection of samples from this set, this percentage should approximately be the probability for selection each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* Functions: call back to the theoretical background and GA terminology<br />
* parameterization<br />
<br />
= Experiment 1: Search versus Selection =<br />
:::::<b>35 to 45 pages / 2</b><br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
:::::<b>35 to 45 pages / 2</b><br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
:::::<b>5 pages</b><br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers results. <br />
=== Common errors ===<br />
Too far reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around this. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far reaching with respect to the achieved results; the conclusions do not correspond with the purpose.<br />
<br />
= Ending =</div>Martijho
<hr />
<div>; Notes<br />
: Experiments repicable? What to do to get same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
: Look through GAs in background to check if more is needed<br />
<br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: Some visualization of a genetic algorithm. Preferably tournament search?<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
:::::<b>3 til 5 sider</b><br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it a human brain is seemingly capable of learning an endless amount of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations are growing increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The fields rapid growth in popularity the later years have yielded multiple advances[CITATION NEEDED] . While some of the advances are building such technologies as self driving cars, others are focused on working towards a ultimate goal of reaching ''Artificial General Intelligence''(AGI). A system capable of not only human-level performance in one field but able to generalize across a vast number of domains. <br />
<br />
:In the quest for a artificial general intelligence agent, while there might be disagreement on what sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and a structure developed to take advantage of the gain this technique called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in number of tasks it can retains(EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase amount of task a structure can learn. <br />
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What do modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Higher exploitation improve on module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<br />
== Thesis outline == <br />
; Theoretical Background <br />
: What is discussed and why. What does the thesis build on?<br />
; Implementation<br />
: Datasets<br />
: Programming language<br />
:: Packages<br />
: Code structure<br />
; Experiment 1<br />
: What do i attempt to answer and how?<br />
; Experiment 2<br />
: What do i attempt to answer and how?<br />
; Conclusion<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
:::::<b>15 to 20 pages</b><br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 + x2*w2 + ... + bias) summed, then an activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function (cost/error) calculates the difference between the network output and the target output<br />
* ref to experiments: cross-entropy, not going into detail. Well suited to softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is backpropagation and the optimization algorithm, which calculates a gradient for all weights in the neural network and updates the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here; using stochastic gradient descent (SGD) and Adaptive Moment Estimation (Adam) [\ref{sgd}\ref{adam}]<br />
* NNs' function estimation properties are used in reinforcement learning and regression, but here for classification purposes. <br />
* final layer is a softmax output estimating class-label probabilities; it therefore outputs a vector of values in [0, 1], where the index of the largest value is selected as the label. <br />
* image classification is done based on input pixel values<br />
* plain NNs struggle with this, as an image class manifold can be highly complex (ref transition between binary and quinary MNIST)<br />
* convolutional operations.<br />
* takes an image as input and performs a convolution between the image and a kernel of weights. <br />
* outputs what is called a feature map. As in an NN, each pixel here is a simple combination of multiplications and summations,<br />
* but each pixel in the feature map contains info about the local spatial area the kernel covered. <br />
* this spatial area is controlled by the kernel size and stride (the jumps made by the kernel). <br />
* a conv layer's channels specify the number of kernels run over the image; one output channel for each kernel. <br />
* it is normal to stack layers of conv operations in a network to generalize to the given images. Each layer contains an abstraction level and outputs a feature map. <br />
* called a Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level, the feature map is reduced in spatial dimensions but increased in channels. <br />
* usually, for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns to classify from the extracted features. <br />
* The convolutional operations in this case can be called feature extraction. <br />
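The weighted-sum, activation and cross-entropy steps listed above can be sketched as follows. This is a minimal NumPy illustration, not the thesis code; the layer sizes and random weights are made up for the example:<br />

```python
import numpy as np

def relu(z):
    # rectified linear unit: elementwise max(0, z)
    return np.maximum(0.0, z)

def softmax(z):
    # scales outputs to sum to 1 so they can be read as class probabilities
    e = np.exp(z - np.max(z))  # subtracting the max improves numerical stability
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    h = relu(W1 @ x + b1)        # hidden layer: weighted sum plus bias, then ReLU
    return softmax(W2 @ h + b2)  # output layer: softmax over class scores

def cross_entropy(p, target_index):
    # loss between the predicted distribution p and a one-hot target
    return -np.log(p[target_index])

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # a made-up input vector
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # made-up layer sizes
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

p = forward(x, W1, b1, W2, b2)  # a probability vector; sums to 1 up to float error
loss = cross_entropy(p, 1)      # non-negative, 0 only for a perfect prediction
```

The index of the largest entry of p would be taken as the predicted class label.<br />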
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures fall under DNNs. In later years a multitude with different applicability have been used commercially and in research<br />
* Architectures depend on the input type, the problems they are applied to, and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning is a method of reusing models for different tasks. <br />
* Train a model on one dataset for one task, then reuse the trained weights as the starting point for training on another<br />
* usually, randomly initialized weights are the starting point. <br />
* It is shown that reusing weights in similar tasks and training weights on new data yields better results in some cases. <br />
* Reduces needed training data. <br />
* <b>Can pretrain a model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without needing immense computational resources.</b><br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who has done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However, a problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
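The core weight-reuse idea above can be sketched in a few lines. This is a toy NumPy illustration under assumed two-layer networks (the sizes and the names init_network/transfer are made up for the example): the feature-extractor weights are copied from the source task, while the classification head is freshly initialized because it is specific to the source task's classes:<br />

```python
import numpy as np

def init_network(n_in, n_hidden, n_out, rng):
    # random initialization: the usual starting point when there is no source task
    return {"W1": rng.normal(size=(n_hidden, n_in)),
            "W2": rng.normal(size=(n_out, n_hidden))}

def transfer(source_net, n_out_new, rng):
    # reuse the (trained) feature-extractor weights W1 from the source task,
    # but give the new task a freshly initialized classification head W2
    return {"W1": source_net["W1"].copy(),
            "W2": rng.normal(size=(n_out_new, source_net["W1"].shape[0]))}

rng = np.random.default_rng(1)
source = init_network(n_in=784, n_hidden=32, n_out=10, rng=rng)  # e.g. an MNIST-sized task
target = transfer(source, n_out_new=5, rng=rng)                  # e.g. a 5-class target task
```

Training would then continue from target's weights instead of from scratch, which is where the reduction in needed training data comes from.<br />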
<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** kept short; straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
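Tournament selection, with the k and p parameters used later for PathNet, can be sketched as below. This is a generic illustration of the mechanism, not the thesis code; larger k and p closer to 1 mean stronger selection pressure (more exploitation):<br />

```python
import random

def tournament_select(population, fitness, k=2, p=1.0, rng=random):
    # sample k individuals uniformly at random (the "tournament")
    contestants = rng.sample(range(len(population)), k)
    # sort contestants by fitness, best first
    contestants.sort(key=lambda i: fitness[i], reverse=True)
    # pick the fittest contestant with probability p,
    # otherwise a random contestant (weaker selection pressure)
    if rng.random() < p:
        return population[contestants[0]]
    return population[rng.choice(contestants)]

pop = ["a", "b", "c", "d"]
fit = [0.1, 0.9, 0.4, 0.2]
winner = tournament_select(pop, fit, k=2, p=1.0)  # one of the two sampled, the fitter
```

With k equal to the population size and p = 1 this degenerates to always picking the globally fittest individual; with k = 2 and p = 1 it is the binary tournament PathNet uses.<br />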
<br />
== PathNet == <br />
Rework essay section<br />
<br />
=== Search ===<br />
* Tournament k=2, p=1<br />
* fitness evaluation updates the weights<br />
* training is done during search. <br />
* locking modules when optimal path is found<br />
** Why locking?<br />
* Modules are reinitialized after search if they are not locked. <br />
** Why?<br />
* Search lasts for a set duration (accuracy threshold or generation limit)<br />
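The loop described above (binary tournament with k=2, p=1; the winner overwrites the loser with a mutated copy; training happens as part of evaluation) can be sketched as follows. The population size, mutation rate and evaluate_path are illustrative stand-ins, not the original hyperparameters; in PathNet, evaluating a path also trains the weights along it:<br />

```python
import random

def random_path(n_layers, n_modules, modules_per_layer):
    # a path selects `modules_per_layer` module indices from each layer
    return [random.sample(range(n_modules), modules_per_layer)
            for _ in range(n_layers)]

def mutate(path, n_modules, rate=0.1):
    # each module index is perturbed by a small integer with a small probability
    return [[(m + random.randint(-2, 2)) % n_modules if random.random() < rate else m
             for m in layer]
            for layer in path]

def search(evaluate_path, n_layers=3, n_modules=10, modules_per_layer=3,
           generations=50):
    population = [random_path(n_layers, n_modules, modules_per_layer)
                  for _ in range(8)]
    for _ in range(generations):
        # binary tournament (k=2, p=1): compare two random paths...
        i, j = random.sample(range(len(population)), 2)
        if evaluate_path(population[i]) < evaluate_path(population[j]):
            i, j = j, i  # i is now the winner
        # ...and the winner overwrites the loser with a mutated copy of itself
        population[j] = mutate(population[i], n_modules)
    return max(population, key=evaluate_path)

# demo with a toy fitness: prefer paths that reuse module 0 often
random.seed(0)
best = search(lambda path: sum(m == 0 for layer in path for m in layer))
```

After the search terminates, the modules on the best path would be locked and the remaining modules reinitialized, as in the bullets above.<br />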
<br />
=== Structure ===<br />
* Layers of modules<br />
* Module is a Neural Network<br />
* Reduced sum between layers (adding module outputs together)<br />
* Task-unique layer at the end of each path (each path in a search has the same end layer)<br />
* Each path has a max number of possible modules from each layer <br />
** Limiting the possible capacity in the network (explain capacity)<br />
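The reduced sum between layers can be sketched as follows. This is a toy illustration where each module is reduced to a single weight matrix (real modules are small networks); the point is only that the outputs of a path's active modules in a layer are summed before the next layer:<br />

```python
import numpy as np

def layer_forward(x, layer_modules, active):
    # sum ("reduce") the outputs of the modules the path activates in this layer
    return sum(layer_modules[i] @ x for i in active)

rng = np.random.default_rng(0)
modules = [rng.normal(size=(4, 4)) for _ in range(6)]  # one layer with 6 modules
x = rng.normal(size=4)
out = layer_forward(x, modules, active=[0, 2, 5])      # a path using 3 of the 6 modules
```

Because the module outputs are summed, the result has the same shape regardless of how many modules the path activates, up to the per-layer maximum.<br />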
<br />
<br />
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
:::::<b>5 to 15 pages</b><br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (math)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable for quick prototyping of PathNet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation does a lot on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Notable differences in implementation <br />
** Keras implementation<br />
** Path fitness not negative error but accuracy<br />
** exp 2: fitness calculated before evaluation (not same step)<br />
** Not added any noise to training data<br />
* Implementation problems<br />
** Tensorflow sessions not made for using multiple graphs<br />
*** Resetting backend session after a number of models are made<br />
** TensorFlow-GPU's default is to use all the GPU memory it can <br />
*** Limiting data allocation to scale when needed<br />
** A TensorFlow session does not free allocated memory before the Python thread is done. <br />
*** Run all experiments through threads. <br />
* Code available on github<br />
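The thread-per-experiment workaround from the last bullet group can be sketched as below. The TensorFlow specifics are omitted; this only shows the pattern the notes describe, where each experiment runs in its own thread that is joined before the next one starts, so memory tied to the finished thread can be released:<br />

```python
import threading

def run_in_thread(experiment, *args):
    # run one experiment in its own thread and wait for it to finish;
    # per the notes above, backend memory is only released once the
    # owning thread is done
    result = {}

    def target():
        result["value"] = experiment(*args)

    t = threading.Thread(target=target)
    t.start()
    t.join()  # block until the experiment completes before starting the next
    return result["value"]

# usage with a stand-in experiment function
result_demo = run_in_thread(lambda a, b: a + b, 2, 3)
```

In the actual implementation, `experiment` would build the PathNet models and run a search, with the backend session reset between runs.<br />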
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution over the classes roughly follows Benford's law, which can be expected from a natural dataset such as this.<br />
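Benford's law gives the expected frequency of each leading digit d as log10(1 + 1/d), so 1 should be by far the most common digit and frequencies should fall off towards 9, as in the table below. The match to SVHN is only approximate, since SVHN also contains the digit 0, which Benford's law does not cover:<br />

```python
import math

def benford(d):
    # Benford's law: expected frequency of leading digit d (d in 1..9)
    # in many naturally occurring numeric datasets
    return math.log10(1 + 1 / d)

expected = {d: benford(d) for d in range(1, 10)}
# the nine frequencies form a complete distribution over the digits 1..9
```

For example, benford(1) is about 0.301, so roughly 30% of leading digits are expected to be 1.<br />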
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitute. Given a random selection of samples from this set, this percentage should approximately be the probability for selection each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* functions; callback to the theoretical background and GA buzzwords<br />
* parameterization<br />
<br />
= Experiment 1: Search versus Selection =<br />
:::::<b>35 to 45 pages / 2</b><br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
:::::<b>35 to 45 pages / 2</b><br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
:::::<b>5 pages</b><br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
=== Common errors ===<br />
Too far-reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around this. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far-reaching with respect to the achieved results; the conclusions do not correspond with the purpose. <br />
<br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-26T11:57:51Z<p>Martijho: /* Implementation */</p>
<hr />
<div>; Notes<br />
: Experiments replicable? What to do to get the same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
: Look through GAs in background to check if more is needed<br />
<br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
:: Some visualization of a genetic algorithm. Preferably tournament search?<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
:::::<b>3 til 5 sider</b><br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it that the human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we incorporate the same effect into our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations have grown increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one such area. The field's rapid growth in popularity in recent years has yielded multiple advances[CITATION NEEDED]. While some of these advances are building technologies such as self-driving cars, others are focused on working towards the ultimate goal of reaching ''Artificial General Intelligence'' (AGI): a system capable of not only human-level performance in one field, but of generalizing across a vast number of domains. <br />
<br />
:In the quest for an artificial general intelligence agent, while there might be disagreement on which sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and on PathNet, a structure developed to take advantage of the gains this technique offers. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in the number of tasks it can retain (EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase the number of tasks a structure can learn. <br />
<br />
:Where do I start?<br />
A question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
<br />
= Experiment 1: Search versus Selection =<br />
:::::<b>35 til 45 sider / 2</b><br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
:::::<b>35 til 45 sider / 2</b><br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
:::::<b>5 sider</b><br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers results. <br />
\subsection{Common errors}<br />
Too far reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around this. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far reaching with respect to the achieved results; the conclusions do not correspond with the purpose<br />
<br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-26T09:29:53Z<p>Martijho: </p>
<hr />
; Notes<br />
: Are the experiments replicable? What must be done to reproduce the same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
: Look through GAs in background to check if more is needed<br />
<br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
:: Some visualization of a genetic algorithm. Preferably tournament search?<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
:::::<b>3 to 5 pages</b><br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it that the human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we incorporate the same effect into our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations have grown increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The field's rapid growth in popularity in recent years has yielded multiple advances[CITATION NEEDED]. While some of these advances are building technologies such as self-driving cars, others are focused on working towards the ultimate goal of reaching ''Artificial General Intelligence'' (AGI): a system capable of not only human-level performance in one field, but of generalizing across a vast number of domains. <br />
<br />
:In the quest for an artificial general intelligence agent, while there might be disagreement on which sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and on PathNet, a structure developed to take advantage of the gains from this technique. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in the number of tasks it can retain (EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase the number of tasks a structure can learn. <br />
<br />
:where do i start?<br />
A question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What does modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: naive search. Would higher exploitation improve module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** Two tasks, where the first is learned end-to-end vs with PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<br />
== Thesis outline == <br />
; Theoretical Background <br />
: What is discussed and why. What does the thesis build on?<br />
; Implementation<br />
: Datasets<br />
: Programming language<br />
:: Packages<br />
: Code structure<br />
; Experiment 1<br />
: What do I attempt to answer, and how?<br />
; Experiment 2<br />
: What do I attempt to answer, and how?<br />
; Conclusion<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why hasn't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
:::::<b>15 to 20 pages</b><br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function (cost/error) calculates the difference between the predicted output and the target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is backpropagation and the optimization algorithm, which is used to calculate a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs' function estimation properties are used in reinforcement learning and regression, but here they are used for classification purposes. <br />
* final layer uses a softmax output to estimate the probability of each class label; it therefore outputs a vector of values in [0, 1], where the index of the largest value is selected as the label. <br />
* image classification is done based on input pixel values<br />
* NNs are bad at this, as an image class manifold can be highly complex (ref transition between binary and quinary MNIST)<br />
* convolutional operations.<br />
* takes an image as input and performs a convolutional operation between the image and a kernel of weights. <br />
* outputs what is called a feature map. As with an NN, each pixel here is a simple combination of multiplications and sums,<br />
* but each pixel in the feature map contains info about the local spatial area the kernel covered. <br />
* this spatial area is controlled with the kernel size and stride (the jumps made by the kernel). <br />
* a conv layer's channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of conv operations in a network to generalize to the given images. Each layer contains an abstraction level and outputs a feature map. <br />
* called a Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level the feature map is reduced in spatial dimensions but increased in channels. <br />
* usually for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
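The forward-pass bullets above can be sketched in a few lines of plain Python (a minimal illustration only, not the thesis implementation, which uses Keras):<br />

```python
import math

def neuron(x, w, bias):
    # weighted sum plus bias: x1*w1 + x2*w2 + ... + bias
    return sum(xi * wi for xi, wi in zip(x, w)) + bias

def relu(z):
    # rectified linear unit activation
    return max(0.0, z)

def softmax(z):
    # scales the outputs to sum to 1 so they can be read as probabilities
    shifted = [v - max(z) for v in z]  # shift for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(predicted, target_index):
    # loss between the predicted distribution and a one-hot target
    return -math.log(predicted[target_index])
```

The index of the largest softmax value is the predicted label, and training minimizes the cross-entropy over the dataset.<br />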
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures fall under DNNs. In recent years, a multitude with different applicability have been used commercially and in research<br />
* Architectures depend on the input type, the problems they are applied to, and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning is a method of reusing models for different tasks. <br />
* Train model on one set of data for one task, reuse the trained weights as starting point for training<br />
* usually randomly initialized weights as starting point. <br />
* It has been shown that reusing weights from similar tasks and training them on new data yields better results in some cases. <br />
* Reduces needed training data. <br />
* <b>Can pretrain a model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without the need for immense computational resources.</b><br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However, a problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
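The tournament mechanics and the role of selection pressure sketched above can be illustrated like this (a hypothetical sketch; function names are illustrative, not from the thesis code):<br />

```python
import random

def tournament_select(population, fitness, k):
    # draw k random individuals and return the fittest;
    # larger k means higher selection pressure (the winner is more
    # likely to be among the globally best), while smaller k preserves
    # more diversity and exploration
    contestants = random.sample(population, k)
    return max(contestants, key=fitness)
```

With k equal to the population size, selection is purely exploitative; with k = 1 selection is uniformly random and the pressure disappears.<br />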
<br />
== PathNet == <br />
Rework essay section<br />
<br />
=== Search ===<br />
* Tournament k=2, p=1<br />
* fitness evaluation updates the weights<br />
* training is done during search. <br />
* locking modules when optimal path is found<br />
** Why locking?<br />
* Modules are reinitialized after search if they are not locked. <br />
** Why?<br />
* Search lasts for a set duration (accuracy threshold or generation limit)<br />
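Put together, the search loop described above can be sketched roughly as follows (a hedged sketch: `train_and_eval` and `mutate` are hypothetical stand-ins for the real training step and genotype mutation):<br />

```python
import random

def pathnet_search(paths, train_and_eval, mutate,
                   generations=100, target=0.99):
    # k=2, p=1 tournament: evaluating a path also trains it; the loser
    # is overwritten by a mutated copy of the winner; the search stops
    # at an accuracy threshold or at a generation limit
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        i, j = random.sample(range(len(paths)), 2)
        fit_i = train_and_eval(paths[i])
        fit_j = train_and_eval(paths[j])
        winner, loser = (i, j) if fit_i >= fit_j else (j, i)
        if max(fit_i, fit_j) > best_fit:
            best, best_fit = list(paths[winner]), max(fit_i, fit_j)
        if best_fit >= target:  # accuracy threshold reached
            break
        paths[loser] = mutate(paths[winner])
    return best, best_fit
```

After the search, the winning path's modules would be locked and the remaining modules reinitialized, as described above.<br />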
<br />
=== Structure ===<br />
* Layers of modules<br />
* Module is a Neural Network<br />
* Reduced sum between layers (adding module outputs together)<br />
* Task-unique layer at the end of each path (each path in a search has the same end layer)<br />
* Each path has a max number of possible modules from each layer <br />
** Limiting the possible capacity in the network (explain capacity)<br />
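Building a random path under these constraints can be sketched as follows (parameter names are hypothetical; the limits follow the structure described above):<br />

```python
import random

def random_path(n_layers=3, n_modules=10, max_active=4):
    # for each layer, pick between 1 and max_active distinct modules;
    # the selected modules' outputs are summed before the next layer
    return [sorted(random.sample(range(n_modules),
                                 random.randint(1, max_active)))
            for _ in range(n_layers)]
```

Capping `max_active` per layer is what limits the capacity any single path can use, leaving the rest of the network free for later tasks.<br />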
<br />
<br />
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
:::::<b>5 to 15 pages</b><br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and the experiment implementations. Build up a base that can be built on in chapters 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** NumPy (numerical math)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable for ease of prototyping pathnet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on GPU<br />
** Generally quicker for ML<br />
** This implementation does a lot on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Notable differences in implementation <br />
** Keras implementation<br />
** Path fitness not negative error but accuracy<br />
** exp 2: fitness calculated before evaluation (not same step)<br />
** No noise added to the training data<br />
* Implementation problems<br />
** Tensorflow sessions are not made for using multiple graphs<br />
*** Resetting the backend session after a number of models are made<br />
** Tensorflow-gpu's default is to use all the GPU memory it can <br />
*** Limiting data allocation to scale when needed<br />
** A Tensorflow session does not free allocated memory before the python thread is done. <br />
*** Run all experiments through threads. <br />
* Code available on github<br />
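The last workaround can be sketched in a framework-free form (a simplified sketch: `run_experiment` is a hypothetical stand-in for a full training run; in the real code the Keras/TensorFlow session lives entirely inside the worker, so its allocations are tied to the worker's lifetime):<br />

```python
import threading
import queue

def run_experiment(config, results):
    # stand-in for a training run; the TensorFlow session would be
    # created (and torn down) entirely inside this function
    results.put({"name": config["name"], "accuracy": 0.0})

def run_isolated(config):
    # run a single experiment in its own worker thread and wait for it,
    # so framework state is created and released per experiment
    results = queue.Queue()
    worker = threading.Thread(target=run_experiment, args=(config, results))
    worker.start()
    worker.join()
    return results.get()
```
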
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution across the classes roughly follows Benford's law, which can be expected from a natural dataset such as this.<br />
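Benford's law gives the expected frequency log10(1 + 1/d) for a leading digit d in {1, ..., 9}; since SVHN counts every digit of a house number (including the class 0, which has no leading-digit frequency), the match is only approximate. A quick check against the class counts used in this thesis:<br />

```python
import math

def benford(d):
    # expected frequency of leading digit d under Benford's law
    return math.log10(1 + 1 / d)

# SVHN class counts for digits 1-9 (from the cropped set used here)
svhn_counts = {1: 90560, 2: 74740, 3: 60765, 4: 50633, 5: 53490,
               6: 41582, 7: 43997, 8: 35358, 9: 34456}
total = sum(svhn_counts.values())
for d in sorted(svhn_counts):
    observed = svhn_counts[d] / total
    print(f"digit {d}: observed {observed:.3f}, Benford {benford(d):.3f}")
```
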
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitutes. Given a random selection of samples from this set, this percentage should approximately be the probability of selecting each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* functions. callback to theoretical background and GA buzzwords<br />
* parameterization<br />
<br />
<br />
= Experiment 1: Search versus Selection =<br />
:::::<b>35 to 45 pages / 2</b><br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
:::::<b>35 to 45 pages / 2</b><br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
:::::<b>5 pages</b><br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
=== Common errors ===<br />
Too far reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around this. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far reaching with respect to the achieved results; the conclusions do not correspond with the purpose<br />
<br />
= Ending =
<hr />
<div>; Notes<br />
: Experiments repicable? What to do to get same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
: Look through GAs in background to check if more is needed<br />
<br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
:: Some visualization of a genetic algorithm. Preferably tournament search?<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it a human brain is seemingly capable of learning an endless amount of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations are growing increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The fields rapid growth in popularity the later years have yielded multiple advances[CITATION NEEDED] . While some of the advances are building such technologies as self driving cars, others are focused on working towards a ultimate goal of reaching ''Artificial General Intelligence''(AGI). A system capable of not only human-level performance in one field but able to generalize across a vast number of domains. <br />
<br />
:In the quest for a artificial general intelligence agent, while there might be disagreement on what sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and a structure developed to take advantage of the gain this technique called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in number of tasks it can retains(EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase amount of task a structure can learn. <br />
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What do modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Higher exploitation improve on module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<br />
== Thesis outline == <br />
; Theoretical Background <br />
: What is discussed and why. What does the thesis build on?<br />
; Implementation<br />
: Datasets<br />
: Programming language<br />
:: Packages<br />
: Code structure<br />
; Experiment 1<br />
: What do i attempt to answer and how?<br />
; Experiment 2<br />
: What do i attempt to answer and how?<br />
; Conclusion<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function(cost/error) calculate calculates the difference between expected output and target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is backpropagation and the optimization algorithm, which are used to calculate a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs' function-estimation properties are used in reinforcement learning and regression, but here used for classification purposes. <br />
* final layer softmax-output estimates the probability of each class label; it therefore outputs a vector of values in [0, 1], where the index of the largest value is selected as the label. <br />
* image classification is done based on input pixel values<br />
* NNs are bad at this, as an image class's manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* inputs image and performs convolutional operation on image and a kernel of weights. <br />
* outputs what is called a feature map. As with an NN, each pixel here is a simple combination of multiplications and summing<br />
* but each pixel in the feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with kernel size and stride (jumps made by the kernel). <br />
* a conv layer's channel count specifies the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of conv operations in a network to generalize to the images given. Each layer contains an abstraction level and outputs a feature map. <br />
* called Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level, the feature map is reduced in spatial dimensions but increased in channels. <br />
* usually for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
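The weighted-sum, activation, softmax and loss bullets above can be made concrete with a minimal sketch (NumPy; all values, shapes and variable names are illustrative, not taken from the thesis code):<br />

```python
import numpy as np

def relu(z):
    # rectified linear unit: max(0, z) elementwise
    return np.maximum(0.0, z)

def softmax(z):
    # scales outputs to sum to 1 so they can be read as class probabilities
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# one layer of artificial neurons: weighted sum of inputs plus a bias
x = np.array([0.5, -1.2, 3.0])          # input features (e.g. pixel values)
W = np.array([[0.1, 0.4, -0.2],
              [0.7, -0.3, 0.5]])        # weights: 2 output neurons, 3 inputs
b = np.array([0.05, -0.1])              # biases

hidden = relu(W @ x + b)                # x1*w1 + x2*w2 + ... + bias, then activation
probs = softmax(hidden)                 # final-layer output in [0, 1], summing to 1

label = int(np.argmax(probs))           # index of the largest value selected as label
target = np.array([1.0, 0.0])          # one-hot target
loss = -np.sum(target * np.log(probs))  # cross-entropy between prediction and target

# one (stochastic) gradient-descent step would then update every weight:
#   W -= learning_rate * dloss_dW   (gradient obtained via backpropagation)
```

The sketch stops before backpropagation itself; in the thesis this is handled by the optimizer (SGD or Adam).<br />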
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
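A single convolutional operation from the notes above can be sketched in NumPy (the image and kernel values are made up for illustration):<br />

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid convolution: slide the kernel over the image; each output
    pixel is a weighted sum over the local spatial area the kernel covers."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # multiply and sum, as in a neuron
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[1., 0.], [0., -1.]])           # 2x2 kernel of weights
feature_map = conv2d(image, kernel)                # 3x3 feature map

# stacking several kernels gives one output channel per kernel; stacking
# layers of such operations gives a Convolutional Neural Network (CNN)
```

Kernel size and stride control the spatial area each output pixel summarizes, exactly as described in the bullets.<br />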
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures fall under DNNs. In recent years a multitude of architectures with different applicability have been used commercially and in research<br />
* Architectures depend on input type, problems they are applied to and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning is a method of reusing models for different tasks. <br />
* Train model on one set of data for one task, reuse the trained weights as starting point for training<br />
* usually randomly initialized weights as starting point. <br />
* It has been shown that reusing weights from similar tasks and training them on new data yields better results in some cases. <br />
* Reduces needed training data. <br />
* <b>Can pretrain a model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without needing immense computational resources.</b><br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However, a problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
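The weight-reuse idea behind transfer learning can be sketched with a toy weight dictionary (layer names and shapes are illustrative only, not from the thesis code):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# weights "trained" on task A (e.g. a large image dataset)
task_a_weights = {
    "feature_layer_1": rng.standard_normal((8, 4)),
    "feature_layer_2": rng.standard_normal((4, 4)),
    "classifier_head": rng.standard_normal((4, 10)),
}

def init_for_task_b(pretrained, n_classes_b):
    """Transfer learning: reuse the trained feature layers as the starting
    point for task B instead of the usual random initialization; only the
    task-specific classifier head is re-initialized."""
    weights = {k: v.copy() for k, v in pretrained.items()
               if k.startswith("feature_layer")}
    weights["classifier_head"] = rng.standard_normal((4, n_classes_b))
    return weights

task_b_weights = init_for_task_b(task_a_weights, n_classes_b=5)
# the feature layers start identical to task A's; training on the (smaller)
# task-B data then adapts them, typically needing less data and compute
```

Allowing backpropagation through the shared feature layers for both tasks is exactly what causes the catastrophic forgetting discussed above.<br />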
<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
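A generic tournament-selection step, with tournament size k and win probability p as the knobs controlling selection pressure (a sketch; the population and fitness values are arbitrary stand-ins):<br />

```python
import random

def tournament_select(population, fitness, k=2, p=1.0, rng=random):
    """Sample k individuals; with probability p pick the fittest contestant,
    otherwise pick a random one. Larger k and p mean higher selection
    pressure (more exploitation); smaller values mean more exploration."""
    contestants = rng.sample(range(len(population)), k)
    if rng.random() < p:
        winner = max(contestants, key=lambda i: fitness[i])
    else:
        winner = rng.choice(contestants)
    return population[winner]

population = ["geno_a", "geno_b", "geno_c", "geno_d"]
fitness = [0.1, 0.9, 0.4, 0.7]
chosen = tournament_select(population, fitness, k=2, p=1.0)
```

With k equal to the population size and p = 1 this degenerates to always picking the global best, which is the high-pressure extreme of the trade-off.<br />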
<br />
== PathNet == <br />
Rework essay section<br />
<br />
=== Search ===<br />
* Tournament k=2, p=1<br />
* fitness evaluation updates the weights<br />
* training is done during search. <br />
* locking modules when optimal path is found<br />
** Why locking?<br />
* Modules are reinitialized after search if they are not locked. <br />
** Why?<br />
* Search lasts for a set duration (accuracy threshold or generation limit)<br />
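The search steps above can be written as a short loop (the `train_and_evaluate` and `mutate` functions below are toy stand-ins, not the real implementation):<br />

```python
import random

def pathnet_search(paths, train_and_evaluate, mutate,
                   accuracy_threshold=0.95, generation_limit=100):
    """PathNet tournament search, k=2, p=1: each generation two random
    paths are trained and evaluated (fitness evaluation updates the
    weights), and the loser's genotype is overwritten by a mutated copy
    of the winner's. The search stops at an accuracy threshold or a
    generation limit; the winning path's modules would then be locked."""
    for _ in range(generation_limit):
        i, j = random.sample(range(len(paths)), 2)
        fit_i = train_and_evaluate(paths[i])   # training happens during search
        fit_j = train_and_evaluate(paths[j])
        winner, loser = (i, j) if fit_i >= fit_j else (j, i)
        paths[loser] = mutate(list(paths[winner]))
        if max(fit_i, fit_j) >= accuracy_threshold:
            break
    return paths[winner]

# toy stand-ins: "fitness" is the mean gene value; mutation nudges each gene
toy_paths = [[random.random() for _ in range(3)] for _ in range(8)]
best = pathnet_search(
    toy_paths,
    train_and_evaluate=lambda path: sum(path) / len(path),
    mutate=lambda path: [g + random.uniform(-0.1, 0.1) for g in path],
    accuracy_threshold=0.99,
)
```

Modules not on the locked winning path are reinitialized after the search, as noted above.<br />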
<br />
=== Structure ===<br />
* Layers of modules<br />
* Module is a Neural Network<br />
* Reduced sum between layers (adding module outputs together)<br />
* Task-unique layer at the end of each path (each path in a search has the same end layer)<br />
* Each path has a max number of possible modules from each layer <br />
** Limiting the possible capacity in the network (explain capacity)<br />
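The layers-of-modules structure with a reduced sum between layers can be sketched as follows (each "module" here is a toy linear map standing in for a small neural network):<br />

```python
import numpy as np

def pathnet_forward(x, layers, path):
    """A path selects up to a fixed number of modules per layer; the
    selected modules' outputs are added together (the reduced sum)
    before being fed to the next layer."""
    for layer_modules, active in zip(layers, path):
        x = sum(layer_modules[m](x) for m in active)
    return x

# two layers of three modules each; every module is itself a small network
# (here just a fixed random linear map, for illustration)
rng = np.random.default_rng(1)
layers = [[(lambda W: (lambda x: W @ x))(rng.standard_normal((4, 4)))
           for _ in range(3)] for _ in range(2)]

path = [[0, 2], [1]]                       # genotype: module indices per layer
out = pathnet_forward(np.ones(4), layers, path)
# a task-unique final layer (shared by all paths in one search) would follow;
# capping how many modules a path may use per layer limits its capacity
```

The per-layer module cap is what limits the capacity available to any one path, as noted in the bullet above.<br />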
<br />
<br />
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (math stuffs)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable for ease of prototyping pathnet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation does a lot on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Notable differences in implementation <br />
** Keras implementation<br />
** Path fitness is not negative error but accuracy<br />
** exp 2: fitness calculated before evaluation (not the same step)<br />
** No noise added to the training data<br />
* Implementation problems<br />
** Tensorflow sessions not made for using multiple graphs<br />
*** Resetting backend session after a number of models are made<br />
** TensorFlow-GPU's default is to use all the GPU memory it can <br />
*** Limiting memory allocation to grow as needed<br />
** A TensorFlow session does not free allocated memory before the Python thread is done. <br />
*** Run all experiments through threads. <br />
* Code available on github<br />
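The workaround of running each experiment in its own worker can be sketched with the standard library (the experiment function below is a stand-in; the real code would build a PathNet, run the search, and log the results):<br />

```python
import threading

def run_experiment_in_thread(experiment, *args):
    """Run one experiment in its own thread and block until it finishes.
    Resources tied to the run are released when the worker ends, which is
    why each experiment gets a fresh thread rather than reusing one."""
    result = {}

    def worker():
        result["value"] = experiment(*args)

    t = threading.Thread(target=worker)
    t.start()
    t.join()  # wait for the experiment to complete before starting the next
    return result["value"]

# stand-in experiment; in the thesis code this would be a full PathNet search
outcome = run_experiment_in_thread(lambda n: n * n, 7)
```

Combined with resetting the Keras backend session after a number of models are built, this keeps memory usage bounded across many runs.<br />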
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both the training set and validation set, along with the portion of the whole sets each class constitutes. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution on each class follows Benford's law, which can be expected from a natural dataset such as this.<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitutes. Given a random selection of samples from this set, this percentage should approximately be the probability of selecting each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
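The class-distribution tables above can be reproduced from raw labels in a few lines (the labels here are a toy list, not MNIST or SVHN data):<br />

```python
from collections import Counter

def class_distribution(labels):
    """Count samples per class and the percentage of the whole set each
    class constitutes (the quantity tabulated for MNIST and SVHN)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, 100.0 * n / total) for cls, n in sorted(counts.items())}

toy_labels = [0, 1, 1, 2, 1, 0, 2, 1]
dist = class_distribution(toy_labels)
# class 1 appears 4 times, i.e. 50% of this toy set; on a balanced set like
# MNIST every class sits near 10%, while SVHN's digits follow Benford's law
```

Given a random sample from the set, each percentage approximates the probability of drawing that class.<br />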
<br />
== Search implementation == <br />
* functions. callback to theoretical background and GA buzzwords<br />
* parameterization<br />
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
=== Common errors ===<br />
Too far-reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around it. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far-reaching with respect to the achieved results; the conclusions do not correspond with the purpose. <br />
<br />
= Ending =</div>
<hr />
<div>; Notes<br />
: Experiments repicable? What to do to get same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
: Look through GAs in background to check if more is needed<br />
<br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
:: Some visualization of a genetic algorithm. Preferably tournament search?<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it a human brain is seemingly capable of learning an endless amount of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations are growing increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The fields rapid growth in popularity the later years have yielded multiple advances[CITATION NEEDED] . While some of the advances are building such technologies as self driving cars, others are focused on working towards a ultimate goal of reaching ''Artificial General Intelligence''(AGI). A system capable of not only human-level performance in one field but able to generalize across a vast number of domains. <br />
<br />
:In the quest for a artificial general intelligence agent, while there might be disagreement on what sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and a structure developed to take advantage of the gain this technique called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in number of tasks it can retains(EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase amount of task a structure can learn. <br />
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What do modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Higher exploitation improve on module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function(cost/error) calculate calculates the difference between expected output and target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is the backpropagation and the optimization algorithm which is used to calculates a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs function estimation properties used in reinforcement learning, regression but here used for classification purpose. <br />
* final layer softmax-output to estimate the probability of class label, therefore, outputs vector of values [0, 1] where index of largest value selected as label. <br />
* image classification is done based on input pixel values<br />
* NNs bad at this as images class manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* inputs image and performs convolutional operation on image and a kernel of weights. <br />
* outputs what is called feature map. as with NN each pixes here is simple combination of multiplications and summing<br />
* but each pixel in feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with kernel size and stride (jumps made by kernel). <br />
* convlayers channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of convoperations in a network to generalize to the images given. Each layer contain a abstractation level and outputs a feature map. <br />
* called Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level feature map is reduced in spatial image dimentions but increased in channels. <br />
* usually for image classification, feature map is flattened at some point and ran through a fully connected classification layer which learns the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures that fall in DNN. Later years multitude with different applicability have been used commercially and in research<br />
* Architectures depend on input type, problems they are applied to and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training in DNNS take time. Transfer learning as method of reusing models for different tasks. <br />
* Train model on one set of data for one task, reuse the trained weights as starting point for training<br />
* usually randomly initialized weights as starting point. <br />
* It is shown that reusing weights in similar tasks and training weights on new data yields better results in some cases. <br />
* Reduces needed training data. <br />
* <b>Can pretrain some model on f.eks image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without immense computational resources needed.</b><br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
<br />
== PathNet == <br />
Rework essay section<br />
<br />
=== Search ===<br />
* Tournament k=2, p=1<br />
* fitness evaulation updates the weights<br />
* training is done during search. <br />
* locking modules when optimal path is found<br />
** Why loocking?<br />
* Modules are reinitialized after search if they are not locked. <br />
** Why?<br />
* Search lasts for a set duration (accuracy threshold or generation limit)<br />
<br />
=== Structure ===<br />
* Layers of modules<br />
* Module is a Neural Network<br />
* Reduced sum between layers (adding module outputs together)<br />
* Task unique layer at the end of each path (each path in a search have the same end layer)<br />
* Each path has a max number of possible modules from each layer <br />
** Limiting the possible capacity in the network (explain capacity)<br />
<br />
<br />
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (math stuffs)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable for ease of prototyping pathnet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Taks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation do lots on CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Noteable differences in implementation <br />
** Keras implementasjon<br />
** Path fitness not negative error but accuracy<br />
** exp 2: fitness calculated before evaluation (not same step)<br />
** Not added any noise to training data<br />
* Implementation problems<br />
** Tensorflow sessions not made for using multiple graphs<br />
*** Resetting backend session after a number of models are made<br />
** Tensorflow-gpus default is using all gpu memory it can <br />
*** Limiting data allocation to scale when needed<br />
** Tensorflow session does not free allocated memory before python thread is done. <br />
*** Run all experiments through treads. <br />
* Code available on github<br />
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution on each class follows Benfords law, which can be expected from a natural dataset such as this.<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitute. Given a random selection of samples from this set, this percentage should approximately be the probability for selection each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* functions. callback to theoretical background and GA buzzwords<br />
* parameterization<br />
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
=== Common errors ===<br />
Too far-reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around it. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far-reaching with respect to the achieved results; the conclusions do not correspond with the purpose.<br />
<br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-24T16:30:50Z<p>Martijho: </p>
<hr />
<div>; Notes<br />
: Experiments replicable? What to do to get the same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
: Look through GAs in background to check if more is needed<br />
<br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
:: Some visualization of a genetic algorithm. Preferably tournament search?<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it that the human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we incorporate the same effect into our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations have grown increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The field's rapid growth in popularity in recent years has yielded multiple advances [CITATION NEEDED]. While some of these advances are building technologies such as self-driving cars, others are focused on working towards the ultimate goal of ''Artificial General Intelligence'' (AGI): a system capable not only of human-level performance in one field, but of generalizing across a vast number of domains. <br />
<br />
:In the quest for an artificial general intelligence agent, while there might be disagreement on which sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and on PathNet, a structure developed to take advantage of the gains from this technique. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in the number of tasks it can retain (EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and lower capacity consumption will increase the number of tasks a structure can learn. <br />
<br />
:where do i start?<br />
A question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. Would it work better with different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What does modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning end-to-end first, then with a PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: naive search. Does higher exploitation improve module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** Two tasks where the first is learned end-to-end vs. with PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why hasn't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated by means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function (cost/error) calculates the difference between the network's output and the target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* The driving force is backpropagation together with the optimization algorithm, which is used to calculate a gradient for all weights in the neural network and update them accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs function estimation properties used in reinforcement learning, regression but here used for classification purpose. <br />
* final layer softmax-output to estimate the probability of class label, therefore, outputs vector of values [0, 1] where index of largest value selected as label. <br />
* image classification is done based on input pixel values<br />
* plain NNs are bad at this, as an image class manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* inputs image and performs convolutional operation on image and a kernel of weights. <br />
* outputs what is called a feature map. As with a NN, each pixel here is a simple combination of multiplications and summing<br />
* but each pixel in feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with kernel size and stride (jumps made by kernel). <br />
* a conv layer's channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of conv operations in a network to generalize to the given images. Each layer contains an abstraction level and outputs a feature map. <br />
* called Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level, the feature map is reduced in spatial dimensions but increased in channels. <br />
* usually for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
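The core computations listed above can be sketched in a few lines of plain Python (illustration only): a single neuron's weighted sum plus bias followed by a ReLU, the softmax scaling a vector into probabilities that sum to 1, and cross-entropy as the negative log-probability of the true class.<br />

```python
import math

def neuron(inputs, weights, bias):
    # x1*w1 + x2*w2 + ... + bias, then a ReLU activation
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, z)

def softmax(logits):
    m = max(logits)                      # subtract the max for stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [v / s for v in exps]         # sums to 1: a probability estimate

def cross_entropy(probs, target_index):
    # loss: negative log-probability assigned to the true class
    return -math.log(probs[target_index])

p = softmax([2.0, 1.0, 0.1])
predicted = p.index(max(p))              # index of the largest value is the label
```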
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layers where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* Multiple network architectures fall under DNNs. In recent years a multitude with different applicability have been used commercially and in research<br />
* Architectures depend on the input type, the problems they are applied to, and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning is a method of reusing models for different tasks. <br />
* Train a model on one set of data for one task, then reuse the trained weights as the starting point for training on another<br />
* usually, randomly initialized weights are the starting point. <br />
* It has been shown that reusing weights from similar tasks and training them on new data yields better results in some cases. <br />
* Reduces needed training data. <br />
* <b>Can pretrain a model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without needing immense computational resources.</b><br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However, a problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
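The core transfer learning mechanic above can be sketched independently of any framework (illustrative, not the thesis code): copy the pretrained weights into the new model and freeze everything except a task-specific head, so weight updates only reach the remaining layers.<br />

```python
import random

def random_init(layer_sizes):
    # Fresh model: randomly initialized weight vectors per (hypothetical) layer
    return {name: [random.gauss(0.0, 0.1) for _ in range(n)]
            for name, n in layer_sizes.items()}

def transfer(pretrained, trainable=("classifier",)):
    # Reuse pretrained weights as the starting point; every layer not
    # listed as trainable is frozen (excluded from weight updates).
    weights = {name: list(w) for name, w in pretrained.items()}
    frozen = {name for name in weights if name not in trainable}
    return weights, frozen

source = random_init({"conv1": 9, "conv2": 9, "classifier": 10})
weights, frozen = transfer(source)
```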
<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
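The tournament step outlined above can be sketched as follows, in the style of the binary tournament from the original PathNet paper; the function names and the fitness function are placeholders, not the thesis implementation:<br />

```python
import random

def mutate(genotype, n_modules, rate=0.1):
    # Each gene (module index) is resampled with a small probability
    return [random.randrange(n_modules) if random.random() < rate else g
            for g in genotype]

def tournament_step(population, fitness, n_modules):
    # Pick two genotypes at random; the fitter one overwrites the
    # loser with a mutated copy of itself (selection + mutation).
    i, j = random.sample(range(len(population)), 2)
    winner, loser = (i, j) if fitness(population[i]) >= fitness(population[j]) else (j, i)
    population[loser] = mutate(list(population[winner]), n_modules)
    return winner
```

Selection pressure can then be raised by sampling more than two competitors per tournament, which is the knob varied in experiment 2.<br />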
<br />
== PathNet == <br />
Rework essay section<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and the experiment implementations. Build up a base that can be built on in chapters 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (math stuffs)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable for ease of prototyping pathnet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation does a lot on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Noteable differences in implementation <br />
** Keras implementation<br />
** Path fitness not negative error but accuracy<br />
** exp 2: fitness calculated before evaluation (not same step)<br />
** Not added any noise to training data<br />
* Implementation problems<br />
** TensorFlow sessions are not made for juggling multiple graphs<br />
*** Resetting the backend session after a number of models have been made<br />
** TensorFlow-GPU's default is to claim all the GPU memory it can <br />
*** Limiting memory allocation so it grows only when needed<br />
** A TensorFlow session does not free allocated memory before its Python thread is done. <br />
*** Run all experiments in separate threads. <br />
* Code available on github<br />
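The thread workaround in the bullets above can be sketched generically (the experiment function is a placeholder): since the session's memory is only released when its Python thread finishes, each experiment is wrapped in its own short-lived thread.<br />

```python
import threading

def run_in_thread(experiment, *args):
    # TensorFlow frees session memory only once its Python thread
    # terminates, so each experiment runs in a short-lived thread.
    result = {}
    def target():
        result["value"] = experiment(*args)
    thread = threading.Thread(target=target)
    thread.start()
    thread.join()  # block until the experiment (and its memory) is done
    return result["value"]

outcome = run_in_thread(lambda a, b: a + b, 2, 3)
```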
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples in each class of both the training set and the validation set, along with the portion of each set the class constitutes. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-24T12:29:17Z<p>Martijho: /* Theoretical Background */</p>
<hr />
<div>; Notes<br />
: Experiments repicable? What to do to get same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it a human brain is seemingly capable of learning an endless amount of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations are growing increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The fields rapid growth in popularity the later years have yielded multiple advances[CITATION NEEDED] . While some of the advances are building such technologies as self driving cars, others are focused on working towards a ultimate goal of reaching ''Artificial General Intelligence''(AGI). A system capable of not only human-level performance in one field but able to generalize across a vast number of domains. <br />
<br />
:In the quest for a artificial general intelligence agent, while there might be disagreement on what sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and a structure developed to take advantage of the gain this technique called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in number of tasks it can retains(EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase amount of task a structure can learn. <br />
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What do modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Higher exploitation improve on module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function(cost/error) calculate calculates the difference between expected output and target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is the backpropagation and the optimization algorithm which is used to calculates a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs function estimation properties used in reinforcement learning, regression but here used for classification purpose. <br />
* final layer softmax-output to estimate the probability of class label, therefore, outputs vector of values [0, 1] where index of largest value selected as label. <br />
* image classification is done based on input pixel values<br />
* NNs bad at this as images class manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* inputs image and performs convolutional operation on image and a kernel of weights. <br />
* outputs what is called feature map. as with NN each pixes here is simple combination of multiplications and summing<br />
* but each pixel in feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with kernel size and stride (jumps made by kernel). <br />
* convlayers channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of convoperations in a network to generalize to the images given. Each layer contain a abstractation level and outputs a feature map. <br />
* called Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level feature map is reduced in spatial image dimentions but increased in channels. <br />
* usually for image classification, feature map is flattened at some point and ran through a fully connected classification layer which learns the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures that fall in DNN. Later years multitude with different applicability have been used commercially and in research<br />
* Architectures depend on input type, problems they are applied to and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training in DNNS take time. Transfer learning as method of reusing models for different tasks. <br />
* Train model on one set of data for one task, reuse the trained weights as starting point for training<br />
* usually randomly initialized weights as starting point. <br />
* It is shown that reusing weights in similar tasks and training weights on new data yields better results in some cases. <br />
* Reduces needed training data. <br />
* <b>Can pretrain some model on f.eks image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without immense computational resources needed.</b><br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying the learning process by learning simpler tasks first and building on the parameters reached for those tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
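The reuse mechanics above can be sketched in a few lines of NumPy. This is an illustrative toy (all names and sizes are hypothetical), not the thesis implementation, which works with Keras models:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def init_net(n_in, n_hidden, n_out, rng):
    """Randomly initialized two-layer network: the usual starting point."""
    return {
        "W1": rng.normal(0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def transfer(source_net, n_out_new, rng):
    """Reuse the trained first layer as the starting point for a new task;
    only the task-specific output layer is re-initialized."""
    n_hidden = source_net["W1"].shape[1]
    return {
        "W1": source_net["W1"].copy(),  # transferred: freeze or fine-tune
        "b1": source_net["b1"].copy(),
        "W2": rng.normal(0, 0.1, (n_hidden, n_out_new)),  # fresh head
        "b2": np.zeros(n_out_new),
    }

# Pretend net_a was trained on task A; net_b starts from its features.
net_a = init_net(784, 32, 2, rng)
net_b = transfer(net_a, 5, rng)
assert np.array_equal(net_a["W1"], net_b["W1"])  # shared starting point
```

In a Keras setting the same idea amounts to copying (and optionally freezing) the pretrained layers and attaching a freshly initialized output layer for the new task.<br />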
<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** Keep this short; go straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
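As a concrete reference for the steps, here is a minimal, hypothetical sketch of a binary tournament step on a toy fitness function. In PathNet the genotype would be a path of module indices and the fitness the accuracy reached when training along that path:<br />

```python
import random

def tournament_step(population, fitness, mutate, rng):
    """One binary-tournament step: two genotypes compete and a mutated
    copy of the winner overwrites the loser in place."""
    i, j = rng.sample(range(len(population)), 2)
    winner, loser = (i, j) if fitness(population[i]) >= fitness(population[j]) else (j, i)
    population[loser] = mutate(list(population[winner]), rng)

def mutate(genome, rng):
    # Point mutation: re-draw one gene at random.
    genome[rng.randrange(len(genome))] = rng.randrange(10)
    return genome

rng = random.Random(1)
# Toy genotypes: four integer genes; toy fitness: their sum (higher is better).
population = [[rng.randrange(10) for _ in range(4)] for _ in range(8)]
for _ in range(200):
    tournament_step(population, sum, mutate, rng)
best = max(population, key=sum)
```

Selection pressure can then be varied by, for example, letting more than two candidates compete per tournament, which connects this sketch to the exploration-versus-exploitation trade-off above.<br />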
<br />
== PathNet == <br />
Rework essay section<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Slow to execute <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally easy to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** NumPy (numerical computation)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterized, for ease of prototyping PathNet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating Keras models<br />
***** Static methods for creating PathNet structures<br />
***** Resetting the backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on GPU<br />
** Generally quicker for ML<br />
** This implementation does a lot of its work on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in Keras. <br />
* Notable differences in this implementation <br />
** Keras implementation<br />
** Path fitness is accuracy rather than negative error<br />
** Exp 2: fitness is calculated before evaluation (not in the same step)<br />
** No noise added to the training data<br />
* Implementation problems<br />
** TensorFlow sessions are not designed for working with multiple graphs<br />
*** Resetting the backend session after a number of models have been created<br />
** TensorFlow on GPU defaults to allocating all the GPU memory it can <br />
*** Limiting allocation so that it scales only as needed<br />
** A TensorFlow session does not free allocated memory before the Python thread is done. <br />
*** Run all experiments through threads. <br />
* Code available on GitHub<br />
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution over the classes approximately follows Benford's law, as can be expected from a naturally occurring dataset such as this.<br />
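For reference, Benford's law predicts that leading digit d occurs with frequency log10(1 + 1/d). Since the SVHN counts include every digit position, not just leading digits, only a Benford-like trend (digit 1 most frequent, frequency falling with digit value) should be expected. A quick sketch of the predicted frequencies:<br />

```python
import math

def benford(d):
    """Predicted frequency of leading digit d (1-9) under Benford's law."""
    return math.log10(1 + 1 / d)

predicted = {d: benford(d) for d in range(1, 10)}
# The nine frequencies form a proper distribution and decrease with d.
assert abs(sum(predicted.values()) - 1.0) < 1e-12
```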
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitute. Given a random selection of samples from this set, this percentage should approximately be the probability for selection each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* Functions; call back to the theoretical background and GA terminology<br />
* parameterization<br />
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
=== Common errors ===<br />
Too far-reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around it. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far-reaching with respect to the achieved results; the conclusions do not correspond with the purpose. <br />
<br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-24T12:11:50Z<p>Martijho: /* Transfer learning */</p>
<hr />
<div>; Notes<br />
: Are the experiments replicable? What must be done to get the same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it that the human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations have grown increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The field's rapid growth in popularity in recent years has yielded multiple advances[CITATION NEEDED]. While some of these advances are building technologies such as self-driving cars, others are focused on working towards the ultimate goal of reaching ''Artificial General Intelligence'' (AGI): a system capable not only of human-level performance in one field, but of generalizing across a vast number of domains. <br />
<br />
:In the quest for an artificial general intelligence agent, while there might be disagreement on which sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and on a structure developed to take advantage of the gains from this technique, called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in the number of tasks it can retain (EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity consumption will increase the number of tasks a structure can learn. <br />
<br />
:where do i start?<br />
A question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. Would it work better with different search algorithms?<br />
A logical next step from the original paper's "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What does modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning end-to-end first, then with a PN search. <br />
Is there a difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** Original PN: naive search. Would higher exploitation improve module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** Two tasks, where the first is learned end-to-end vs. with PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: the rectified linear unit (ReLU), and softmax, which scales the output to sum to 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* A loss function (cost/error) calculates the difference between the network's output and the target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* The driving force is backpropagation together with the optimization algorithm, which is used to calculate a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* A NN's function-estimation properties are used in reinforcement learning and regression, but here for classification purposes. <br />
* The final layer has a softmax output to estimate the probability of each class label; it therefore outputs a vector of values in [0, 1], where the index of the largest value is selected as the label. <br />
* image classification is done based on input pixel values<br />
* Plain NNs are bad at this, as an image class manifold can be highly complex (ref transition between binary and quinary MNIST)<br />
* convolutional operations.<br />
* Takes an image as input and performs a convolutional operation between the image and a kernel of weights. <br />
* Outputs what is called a feature map. As in a NN, each pixel here is a simple combination of multiplications and sums, <br />
* but each pixel in the feature map contains information about the local spatial area the kernel covered. <br />
* Control this spatial area with the kernel size and stride (the jumps made by the kernel). <br />
* A conv layer's channel count specifies the number of kernels run over the image. One output channel for each kernel. <br />
* Normal to stack layers of conv operations in a network to generalize to the given images. Each layer represents an abstraction level and outputs a feature map. <br />
* Called a Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level, the feature map is reduced in spatial dimensions but increased in channels. <br />
* Usually, for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns from the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
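The pieces listed above (weighted sum plus bias, ReLU, softmax, cross-entropy, and the convolution of an image with a kernel) can be sketched directly in NumPy. This is a toy illustration with made-up sizes, not the Keras implementation used later:<br />

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()           # scales the output to sum to 1

def cross_entropy(p, y_onehot):
    """Loss between predicted probabilities p and a one-hot target."""
    return -np.sum(y_onehot * np.log(p + 1e-12))

def forward(x, W1, b1, W2, b2):
    h = relu(x @ W1 + b1)        # x1*w1 + x2*w2 + ... + bias, then activation
    return softmax(h @ W2 + b2)  # output usable as class probabilities

def conv2d(img, kernel, stride=1):
    """'Valid' convolution of one single-channel image with one kernel;
    each output pixel is a multiply-and-sum over the area the kernel covers."""
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = img[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # toy input
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # hidden layer
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # 3-class output layer
p = forward(x, W1, b1, W2, b2)
label = int(np.argmax(p))                       # index of the largest value
loss = cross_entropy(p, np.eye(3)[1])           # target class 1

fmap = conv2d(rng.normal(size=(8, 8)), rng.normal(size=(3, 3)), stride=1)
```

Note how the feature map shrinks in the spatial dimensions (an 8x8 image and a 3x3 kernel give a 6x6 map), while stacking more kernels would grow the channel count.<br />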
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* Multiple network architectures fall under the DNN umbrella. In recent years a multitude of architectures with different areas of applicability have been used both commercially and in research<br />
* Architectures depend on the input type, the problems they are applied to, and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning is a method for reusing models across different tasks. <br />
* Train a model on one set of data for one task, then reuse the trained weights as the starting point for training<br />
* The usual starting point is randomly initialized weights. <br />
* It has been shown that reusing weights from similar tasks and training them on new data yields better results in some cases. <br />
* Reduces the amount of training data needed. <br />
* <b>A model can be pretrained on, e.g., image data, providing a pretrained model that can be reused and adapted quickly to smaller datasets without requiring immense computational resources.</b><br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who has done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However, a problem arises when trying to learn multiple tasks sequentially in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
*** <b>PathNet</b><br />
**** Super Neural Network<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying the learning process by learning simpler tasks first and building on the parameters reached for those tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
* Super Neural Networks<br />
** What are they?<br />
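The reuse mechanics above can be sketched in a few lines of NumPy. This is an illustrative toy (all names and sizes are hypothetical), not the thesis implementation, which works with Keras models:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def init_net(n_in, n_hidden, n_out, rng):
    """Randomly initialized two-layer network: the usual starting point."""
    return {
        "W1": rng.normal(0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def transfer(source_net, n_out_new, rng):
    """Reuse the trained first layer as the starting point for a new task;
    only the task-specific output layer is re-initialized."""
    n_hidden = source_net["W1"].shape[1]
    return {
        "W1": source_net["W1"].copy(),  # transferred: freeze or fine-tune
        "b1": source_net["b1"].copy(),
        "W2": rng.normal(0, 0.1, (n_hidden, n_out_new)),  # fresh head
        "b2": np.zeros(n_out_new),
    }

# Pretend net_a was trained on task A; net_b starts from its features.
net_a = init_net(784, 32, 2, rng)
net_b = transfer(net_a, 5, rng)
assert np.array_equal(net_a["W1"], net_b["W1"])  # shared starting point
```

In a Keras setting the same idea amounts to copying (and optionally freezing) the pretrained layers and attaching a freshly initialized output layer for the new task.<br />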
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** Keep this short; go straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
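As a concrete reference for the steps, here is a minimal, hypothetical sketch of a binary tournament step on a toy fitness function. In PathNet the genotype would be a path of module indices and the fitness the accuracy reached when training along that path:<br />

```python
import random

def tournament_step(population, fitness, mutate, rng):
    """One binary-tournament step: two genotypes compete and a mutated
    copy of the winner overwrites the loser in place."""
    i, j = rng.sample(range(len(population)), 2)
    winner, loser = (i, j) if fitness(population[i]) >= fitness(population[j]) else (j, i)
    population[loser] = mutate(list(population[winner]), rng)

def mutate(genome, rng):
    # Point mutation: re-draw one gene at random.
    genome[rng.randrange(len(genome))] = rng.randrange(10)
    return genome

rng = random.Random(1)
# Toy genotypes: four integer genes; toy fitness: their sum (higher is better).
population = [[rng.randrange(10) for _ in range(4)] for _ in range(8)]
for _ in range(200):
    tournament_step(population, sum, mutate, rng)
best = max(population, key=sum)
```

Selection pressure can then be varied by, for example, letting more than two candidates compete per tournament, which connects this sketch to the exploration-versus-exploitation trade-off above.<br />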
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Slow to execute <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally easy to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** NumPy (numerical computation)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterized, for ease of prototyping PathNet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating Keras models<br />
***** Static methods for creating PathNet structures<br />
***** Resetting the backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on GPU<br />
** Generally quicker for ML<br />
** This implementation does a lot of its work on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in Keras. <br />
* Notable differences in this implementation <br />
** Keras implementation<br />
** Path fitness is accuracy rather than negative error<br />
** Exp 2: fitness is calculated before evaluation (not in the same step)<br />
** No noise added to the training data<br />
* Implementation problems<br />
** TensorFlow sessions are not designed for working with multiple graphs<br />
*** Resetting the backend session after a number of models have been created<br />
** TensorFlow on GPU defaults to allocating all the GPU memory it can <br />
*** Limiting allocation so that it scales only as needed<br />
** A TensorFlow session does not free allocated memory before the Python thread is done. <br />
*** Run all experiments through threads. <br />
* Code available on GitHub<br />
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution over the classes approximately follows Benford's law, as can be expected from a naturally occurring dataset such as this.<br />
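For reference, Benford's law predicts that leading digit d occurs with frequency log10(1 + 1/d). Since the SVHN counts include every digit position, not just leading digits, only a Benford-like trend (digit 1 most frequent, frequency falling with digit value) should be expected. A quick sketch of the predicted frequencies:<br />

```python
import math

def benford(d):
    """Predicted frequency of leading digit d (1-9) under Benford's law."""
    return math.log10(1 + 1 / d)

predicted = {d: benford(d) for d in range(1, 10)}
# The nine frequencies form a proper distribution and decrease with d.
assert abs(sum(predicted.values()) - 1.0) < 1e-12
```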
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitute. Given a random selection of samples from this set, this percentage should approximately be the probability for selection each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* Functions; call back to the theoretical background and GA terminology<br />
* parameterization<br />
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
=== Common errors ===<br />
Too far-reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around it. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far-reaching with respect to the achieved results; the conclusions do not correspond with the purpose. <br />
<br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-23T14:40:45Z<p>Martijho: /* Transfer learning */</p>
<hr />
<div>; Notes<br />
: Experiments repicable? What to do to get same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it a human brain is seemingly capable of learning an endless amount of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations are growing increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The fields rapid growth in popularity the later years have yielded multiple advances[CITATION NEEDED] . While some of the advances are building such technologies as self driving cars, others are focused on working towards a ultimate goal of reaching ''Artificial General Intelligence''(AGI). A system capable of not only human-level performance in one field but able to generalize across a vast number of domains. <br />
<br />
:In the quest for a artificial general intelligence agent, while there might be disagreement on what sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and a structure developed to take advantage of the gain this technique called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in number of tasks it can retains(EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase amount of task a structure can learn. <br />
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What do modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Higher exploitation improve on module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias: inputs are weighted and summed (x1*w1 + x2*w2 + ... + bias), then passed through an activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function (cost/error) calculates the difference between predicted output and target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* The driving force is backpropagation together with the optimization algorithm, which is used to calculate a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs' function-estimation properties are used in reinforcement learning and regression, but are here used for classification purposes. <br />
* final layer is a softmax output estimating the probability of each class label; the network therefore outputs a vector of values in [0, 1] where the index of the largest value is selected as the label. <br />
* image classification is done based on input pixel values<br />
* plain NNs are bad at this, as an image class manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* takes an image as input and performs a convolution operation between the image and a kernel of weights. <br />
* outputs what is called a feature map. As in an NN, each pixel here is a simple combination of multiplications and sums<br />
* but each pixel in the feature map contains info about the local spatial area the kernel covered. <br />
* this spatial area is controlled with kernel size and stride (the jumps made by the kernel). <br />
* a conv layer's channel count specifies the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of convolution operations in a network to generalize to the given images. Each layer represents an abstraction level and outputs a feature map. <br />
* called a Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level the feature map is reduced in spatial dimensions but increased in channels. <br />
* usually, for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns from the extracted features. <br />
* The convolutional operations in this case can be called feature extraction. <br />
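The bullet points above can be condensed into a small numerical sketch, written in plain NumPy for illustration only (all shapes and values here are invented, not taken from the experiments): a weighted sum plus bias through ReLU, a softmax output scaled to sum to 1, and the cross-entropy loss between prediction and target.<br />

```python
import numpy as np

def relu(z):
    # Rectified linear unit: max(0, z) elementwise
    return np.maximum(0.0, z)

def softmax(z):
    # Scale outputs to [0, 1] with sum 1, usable as probability estimates
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, target_index):
    # Loss between the predicted distribution and the target label
    return -np.log(p[target_index])

# One hidden layer: weighted sum of inputs plus bias, then activation
x = np.array([0.5, -1.0, 2.0])   # input
W1 = np.ones((4, 3)) * 0.1       # weights/parameters
b1 = np.zeros(4)                 # bias
h = relu(W1 @ x + b1)            # x1*w1 + x2*w2 + ... + bias, then ReLU

# Output layer with softmax: index of the largest value is the predicted label
W2 = np.ones((3, 4)) * 0.1
p = softmax(W2 @ h)
label = int(np.argmax(p))
loss = cross_entropy(p, target_index=0)
```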
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures fall under DNNs. In later years a multitude with different applicability have been used commercially and in research<br />
* Architectures depend on input type, problems they are applied to and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning is a method for reusing models across different tasks. <br />
* Train a model on one set of data for one task, then reuse the trained weights as the starting point for training<br />
* usually, randomly initialized weights are the starting point. <br />
* It has been shown that reusing weights from similar tasks and training them on new data yields better results in some cases. <br />
* Reduces the amount of training data needed. <br />
* <b>Can pretrain a model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without immense computational resources.</b><br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However, a problem arises when trying to sequentially learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
*** <b>PathNet</b><br />
* Curriculum Learning / Gradual learning<br />
** Simplifying the learning process by learning simpler tasks first and building on the parameters reached for those tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
* Super Neural Networks<br />
** What are they?<br />
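The freezing mechanic behind transfer learning and the catastrophic-forgetting solutions above can be sketched framework-free; the two-parameter model and the gradient step below are hypothetical stand-ins for illustration, not the thesis implementation:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these weights were trained on task A (the source task)
pretrained_hidden = rng.normal(size=(8, 4))

# New model for task B: reuse the trained hidden weights instead of random init
params = {
    "hidden": pretrained_hidden.copy(),   # transferred knowledge
    "head": rng.normal(size=(3, 8)),      # new, randomly initialized classifier
}
frozen = {"hidden"}  # freezing blocks backpropagation from overwriting task A knowledge

def sgd_step(params, grads, lr=0.1):
    # Update only the trainable (non-frozen) parameters
    for name, g in grads.items():
        if name not in frozen:
            params[name] -= lr * g

# One illustrative update: the frozen hidden layer must remain untouched
grads = {name: np.ones_like(w) for name, w in params.items()}
sgd_step(params, grads)
```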
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** keep it short; go straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
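As a concrete reference for the terms above, a minimal tournament-selection step (illustrative only, not the thesis code): k individuals are sampled from the population and the fittest of them wins, so the tournament size k directly sets the selection pressure.<br />

```python
import random

def tournament_select(population, fitness, k=2, rng=random):
    # Sample k individuals uniformly; the one with the highest fitness wins.
    # k controls selection pressure: k=1 is random drift (pure exploration),
    # larger k increasingly favors the current best (exploitation).
    contestants = rng.sample(population, k)
    return max(contestants, key=fitness)

population = list(range(10))               # genotypes (here just integers)
fitness = lambda genome: -abs(genome - 7)  # toy fitness: closer to 7 is better

rng = random.Random(42)
winners = [tournament_select(population, fitness, k=3, rng=rng)
           for _ in range(200)]
# With k=3 the winners cluster around the optimum at 7
```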
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** NumPy (numerical computation)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable, for ease of prototyping PathNet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation does a lot on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Notable differences in implementation <br />
** Keras implementation<br />
** Path fitness is accuracy, not negative error<br />
** exp 2: fitness calculated before evaluation (not in the same step)<br />
** No noise added to the training data<br />
* Implementation problems<br />
** TensorFlow sessions are not made for using multiple graphs<br />
*** Resetting the backend session after a number of models are made<br />
** TensorFlow-GPU's default is to use all the GPU memory it can <br />
*** Limiting memory allocation to scale when needed<br />
** A TensorFlow session does not free allocated memory before the Python thread is done. <br />
*** Run all experiments through threads. <br />
* Code available on github<br />
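A hypothetical skeleton of the class structure listed above; every name and parameter here is illustrative, not taken from the actual repository. A PathNet holds a grid of layers times modules, and a path genotype selects up to a few modules per layer:<br />

```python
import random

class PathNet:
    """Grid of `depth` layers x `modules_per_layer` modules;
    a path is a genotype selecting modules per layer."""

    def __init__(self, depth=3, modules_per_layer=10, max_active=3, rng=None):
        self.depth = depth
        self.modules_per_layer = modules_per_layer
        self.max_active = max_active       # max modules used per layer
        self.rng = rng or random.Random()

    def random_path(self):
        # Genotype: for each layer, a sorted list of distinct module indices
        return [
            sorted(self.rng.sample(range(self.modules_per_layer),
                                   self.rng.randint(1, self.max_active)))
            for _ in range(self.depth)
        ]

pn = PathNet(depth=3, modules_per_layer=10, max_active=3,
             rng=random.Random(0))
path = pn.random_path()  # one genotype: a list of module choices per layer
```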
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution over the classes approximately follows Benford's law for leading digits, which can be expected from a naturally collected dataset such as this.<br />
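Benford's law predicts a leading-digit frequency of log10(1 + 1/d) for digit d; a quick check of the shape of that distribution (the SVHN counts above follow it only qualitatively, since the crops contain digits from every position of a house number, not just the leading one):<br />

```python
import math

def benford(d):
    # Expected leading-digit frequency under Benford's law
    return math.log10(1 + 1 / d)

expected = {d: benford(d) for d in range(1, 10)}
# benford(1) ~ 0.301 down to benford(9) ~ 0.046: monotonically decreasing,
# the same qualitative trend as the SVHN class distribution above.
```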
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitute. Given a random selection of samples from this set, this percentage should approximately be the probability for selection each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* functions; call back to theoretical background and GA terminology<br />
* parameterization<br />
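The search can be summarized as a binary tournament over paths, sketched here with hypothetical function names standing in for the real training and evaluation code: two random paths compete, and the loser's genotype is overwritten by a mutated copy of the winner's.<br />

```python
import random

def binary_tournament_step(paths, evaluate, mutate, rng=random):
    # Pick two competitors; the loser's genotype is overwritten by a
    # mutated copy of the winner's (the original PathNet-style step).
    i, j = rng.sample(range(len(paths)), 2)
    winner, loser = (i, j) if evaluate(paths[i]) >= evaluate(paths[j]) else (j, i)
    paths[loser] = mutate(list(paths[winner]), rng)
    return winner

# Toy stand-ins: a "path" is a list of module indices, and "fitness"
# favors low indices (a real run would train and validate the path).
def evaluate(path):
    return -sum(path)

def mutate(genotype, rng, n_modules=10, rate=0.3):
    # Each gene is resampled with probability `rate`
    return [rng.randrange(n_modules) if rng.random() < rate else g
            for g in genotype]

rng = random.Random(0)
paths = [[rng.randrange(10) for _ in range(3)] for _ in range(8)]
initial_best = max(evaluate(p) for p in paths)
for _ in range(100):
    binary_tournament_step(paths, evaluate, mutate, rng=rng)
best_fitness = max(evaluate(p) for p in paths)
```

Because the winner of each tournament is never overwritten, the best fitness in the population is monotonically non-decreasing under this scheme.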
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers results. <br />
=== Common errors ===<br />
Too far reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around this. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far reaching with respect to the achieved results; the conclusions do not correspond with the purpose. <br />
<br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-23T11:18:19Z<p>Martijho: /* Deep Learning */</p>
<hr />
<div>; Notes<br />
: Experiments repicable? What to do to get same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it a human brain is seemingly capable of learning an endless amount of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations are growing increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The fields rapid growth in popularity the later years have yielded multiple advances[CITATION NEEDED] . While some of the advances are building such technologies as self driving cars, others are focused on working towards a ultimate goal of reaching ''Artificial General Intelligence''(AGI). A system capable of not only human-level performance in one field but able to generalize across a vast number of domains. <br />
<br />
:In the quest for a artificial general intelligence agent, while there might be disagreement on what sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and a structure developed to take advantage of the gain this technique called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in number of tasks it can retains(EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase amount of task a structure can learn. <br />
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What do modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Higher exploitation improve on module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function(cost/error) calculate calculates the difference between expected output and target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is the backpropagation and the optimization algorithm which is used to calculates a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs function estimation properties used in reinforcement learning, regression but here used for classification purpose. <br />
* final layer softmax-output to estimate the probability of class label, therefore, outputs vector of values [0, 1] where index of largest value selected as label. <br />
* image classification is done based on input pixel values<br />
* NNs bad at this as images class manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* inputs image and performs convolutional operation on image and a kernel of weights. <br />
* outputs what is called feature map. as with NN each pixes here is simple combination of multiplications and summing<br />
* but each pixel in feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with kernel size and stride (jumps made by kernel). <br />
* convlayers channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of convoperations in a network to generalize to the images given. Each layer contain a abstractation level and outputs a feature map. <br />
* called Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level feature map is reduced in spatial image dimentions but increased in channels. <br />
* usually for image classification, feature map is flattened at some point and ran through a fully connected classification layer which learns the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures that fall in DNN. Later years multitude with different applicability have been used commercially and in research<br />
* Architectures depend on input type, problems they are applied to and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training in DNNS take time. Transfer learning as method of reusing models for different tasks. <br />
* Train model on one set of data for one task, reuse the trained weights as starting point for training<br />
* usually randomly initialized weights as starting point. <br />
* It is shown that reusing weights in similar tasks and training weights on new data yields better results in some cases. <br />
* Reduces needed training data. <br />
* <b>Can pretrain some model on f.eks image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without immense computational resources needed.</b><br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
*** PathNet<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
* Super Neural Networks<br />
** What are they?<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (math stuffs)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable for ease of prototyping pathnet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Taks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation do lots on CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Noteable differences in implementation <br />
** Keras implementasjon<br />
** Path fitness not negative error but accuracy<br />
** exp 2: fitness calculated before evaluation (not same step)<br />
** Not added any noise to training data<br />
* Implementation problems<br />
** Tensorflow sessions not made for using multiple graphs<br />
*** Resetting backend session after a number of models are made<br />
** Tensorflow-gpus default is using all gpu memory it can <br />
*** Limiting data allocation to scale when needed<br />
** Tensorflow session does not free allocated memory before python thread is done. <br />
*** Run all experiments through treads. <br />
* Code available on github<br />
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
= Ending =</div>
Martijho
https://robin.wiki.ifi.uio.no/Martijho-PathNet-thesis
Martijho-PathNet-thesis
2018-03-23T10:32:24Z
<p>Martijho: </p>
<hr />
<div>; Notes<br />
: Experiments replicable? What to do to get the same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
; Needed figure list<br />
<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it that a human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations have grown increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The field's rapid growth in popularity in recent years has yielded multiple advances[CITATION NEEDED]. While some of these advances are building technologies such as self-driving cars, others are focused on working towards the ultimate goal of reaching ''Artificial General Intelligence'' (AGI): a system capable of not only human-level performance in one field, but of generalizing across a vast number of domains. <br />
<br />
:In the quest for an artificial general intelligence agent, there might be disagreement about which sub-fields of AI are the most important for this endeavor, but improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and on PathNet, a structure developed to take advantage of the gains from this technique. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in the number of tasks it can retain (EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and less capacity spent per task will increase the number of tasks a structure can learn. <br />
<br />
:where do i start?<br />
A question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. Would it work better with different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What does modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Would higher exploitation improve module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why hasn't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated by means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function (cost/error) calculates the difference between the predicted output and the target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is backpropagation and the optimization algorithm, which is used to calculate a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NN's function-estimation properties are used in reinforcement learning and regression, but here for classification purposes. <br />
* final layer softmax-output to estimate the probability of each class label; therefore, outputs a vector of values in [0, 1] where the index of the largest value is selected as the label. <br />
* image classification is done based on input pixel values<br />
* NNs are bad at this, as an image class manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* inputs image and performs convolutional operation on image and a kernel of weights. <br />
* outputs what is called a feature map. as with NN, each pixel here is a simple combination of multiplications and sums<br />
* but each pixel in feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with kernel size and stride (jumps made by kernel). <br />
* a conv layer's channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of conv operations in a network to generalize to the images given. Each layer contains an abstraction level and outputs a feature map. <br />
* called Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level the feature map is reduced in spatial image dimensions but increased in channels. <br />
* usually for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns from the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
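The operations listed in these notes can be illustrated with a minimal sketch in plain Python (no framework; all function names and values are illustrative, not the thesis implementation): the weighted sum with bias and ReLU activation, the softmax output, the cross-entropy loss, and a single-channel convolution.<br />

```python
import math

def relu(z):
    # Rectified linear unit: max(0, z)
    return max(0.0, z)

def neuron(inputs, weights, bias):
    # x1*w1 + x2*w2 + ... + bias, summed, then the activation function
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(z)

def softmax(logits):
    # Scales the outputs to sum to 1 so they can be read as probabilities
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    # Loss between the predicted distribution and the one-hot target
    return -math.log(probs[target_index])

def conv2d(image, kernel):
    # 'Valid' convolution with stride 1: each output pixel is a sum of
    # products over the local spatial area covered by the kernel
    # (the kernel is not flipped, following deep-learning convention)
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]
```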
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layers where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures fall under DNNs. In recent years a multitude with different applicability have been used commercially and in research<br />
* Architectures depend on input type, problems they are applied to and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning as a method of reusing models for different tasks. <br />
* Train model on one set of data for one task, reuse the trained weights as starting point for training<br />
* usually randomly initialized weights as starting point. <br />
* It has been shown that reusing weights from similar tasks and training the weights on new data yields better results in some cases. <br />
* Reduces needed training data. <br />
* Can pretrain a model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without needing immense computational resources. <br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However, a problem arises when trying to learn multiple tasks sequentially in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
*** PathNet<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
* Super Neural Networks<br />
** What are they?<br />
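Of the catastrophic-forgetting remedies listed above, EWC is the most compactly described: when learning task B after task A, a quadratic penalty, weighted by the Fisher information \(F_i\), anchors each parameter to the value it had after task A. As a sketch from memory of the EWC paper's loss (\(\lambda\) sets how important the old task is; check against the original before citing):<br />

```latex
\mathcal{L}(\theta) = \mathcal{L}_B(\theta)
    + \sum_i \frac{\lambda}{2} F_i \left(\theta_i - \theta^{*}_{A,i}\right)^2
```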
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
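The tournament-search steps asked about above could be sketched roughly as follows; `tournament_step`, `mutate`, and all parameter names are hypothetical placeholders, not the actual implementation. The tournament size is the knob that sets the selection pressure: larger tournaments mean higher exploitation.<br />

```python
import random

def mutate(genotype, n_modules, rate):
    # Each gene is resampled with probability `rate`
    return [random.randrange(n_modules) if random.random() < rate else g
            for g in genotype]

def tournament_step(population, fitness_fn, n_modules,
                    tournament_size=2, mutation_rate=0.1):
    # Sample contestants, evaluate their fitness, and overwrite each
    # loser with a mutated copy of the winner's genotype
    contestants = random.sample(range(len(population)), tournament_size)
    winner = max(contestants, key=lambda i: fitness_fn(population[i]))
    for i in contestants:
        if i != winner:
            population[i] = mutate(population[winner], n_modules, mutation_rate)
    return winner
```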
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (math stuffs)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable for ease of prototyping pathnet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation does a lot on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Noteable differences in implementation <br />
** Keras implementation<br />
** Path fitness not negative error but accuracy<br />
** exp 2: fitness calculated before evaluation (not same step)<br />
** Not added any noise to training data<br />
* Implementation problems<br />
** Tensorflow sessions not made for using multiple graphs<br />
*** Resetting backend session after a number of models are made<br />
** TensorFlow-GPU's default is to use all the GPU memory it can <br />
*** Limiting data allocation to scale when needed<br />
** A TensorFlow session does not free allocated memory before the Python thread is done. <br />
*** Run all experiments through threads. <br />
* Code available on github<br />
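The last workaround above, running each experiment in its own thread so that session memory is released on thread exit, could look roughly like this (a sketch; the runner name and shape are illustrative, not the thesis code):<br />

```python
import threading

def run_experiments(experiments):
    # Run each experiment function inside its own thread, sequentially.
    # The point is the per-thread teardown: memory tied to the backend
    # session is only freed once the thread that created it has finished.
    results = []

    def worker(fn):
        results.append(fn())

    for fn in experiments:
        t = threading.Thread(target=worker, args=(fn,))
        t.start()
        t.join()  # wait for each experiment before starting the next
    return results
```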
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples over the classes in both the training set and the validation set, along with the portion of each set that each class constitutes. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution over the classes roughly follows Benford's law, which is to be expected from a naturally collected dataset such as this.<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples over the classes in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitutes. Given a random selection of samples from this set, this percentage is approximately the probability of selecting each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
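Benford's law gives the expected frequency of each leading digit d as log10(1 + 1/d); digit 0 never occurs as a leading digit and so falls outside the law. A quick comparison against the class counts in the table above (plain Python; the counts are copied from the table) shows the qualitative pattern, with lower digits more frequent, though the fit is rough rather than exact:<br />

```python
import math

# Benford's law: P(d) = log10(1 + 1/d) for leading digits d = 1..9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Observed class counts in the cropped SVHN set (from the table above)
counts = {0: 45550, 1: 90560, 2: 74740, 3: 60765, 4: 50633,
          5: 53490, 6: 41582, 7: 43997, 8: 35358, 9: 34456}
total = sum(counts.values())

for d in range(1, 10):
    print(f"digit {d}: observed {counts[d] / total:.3f}, "
          f"Benford {benford[d]:.3f}")
```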
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* Functions: call back to the theoretical background and GA terminology<br />
* parameterization<br />
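The search operates on path genotypes. As a hypothetical sketch of the parameterization, a random path could be built by picking, for each layer, up to a maximum number of active modules (the function name, arguments, and limits are illustrative, not the thesis implementation):<br />

```python
import random

def random_path(n_layers, n_modules, max_active):
    # One genotype: for each layer, a sorted selection of between 1 and
    # `max_active` of that layer's `n_modules` modules
    return [sorted(random.sample(range(n_modules),
                                 random.randint(1, max_active)))
            for _ in range(n_layers)]
```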
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers results. <br />
\subsection{Common errors}<br />
Too far reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around this. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far reaching with respect to the achieved results; the conclusions do not correspond with the purpose<br />
<br />
= Ending =</div>
<hr />
<div>; Notes<br />
: Experiments repicable? What to do to get same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: <s>More info in NNvsDNN plot</s><br />
; Needed figure list<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it a human brain is seemingly capable of learning an endless amount of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations are growing increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The fields rapid growth in popularity the later years have yielded multiple advances[CITATION NEEDED] . While some of the advances are building such technologies as self driving cars, others are focused on working towards a ultimate goal of reaching ''Artificial General Intelligence''(AGI). A system capable of not only human-level performance in one field but able to generalize across a vast number of domains. <br />
<br />
:In the quest for a artificial general intelligence agent, while there might be disagreement on what sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and a structure developed to take advantage of the gain this technique called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in number of tasks it can retains(EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase amount of task a structure can learn. <br />
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What do modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Higher exploitation improve on module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function(cost/error) calculate calculates the difference between expected output and target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is the backpropagation and the optimization algorithm which is used to calculates a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs function estimation properties used in reinforcement learning, regression but here used for classification purpose. <br />
* final layer softmax-output to estimate the probability of class label, therefore, outputs vector of values [0, 1] where index of largest value selected as label. <br />
* image classification is done based on input pixel values<br />
* NNs bad at this as images class manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* inputs image and performs convolutional operation on image and a kernel of weights. <br />
* outputs what is called feature map. as with NN each pixes here is simple combination of multiplications and summing<br />
* but each pixel in feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with kernel size and stride (jumps made by kernel). <br />
* convlayers channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of convoperations in a network to generalize to the images given. Each layer contain a abstractation level and outputs a feature map. <br />
* called Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level feature map is reduced in spatial image dimentions but increased in channels. <br />
* usually for image classification, feature map is flattened at some point and ran through a fully connected classification layer which learns the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
</div>
<div>; Notes<br />
: Experiments replicable? What must be done to get the same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: {{strikethrough|More info in NNvsDNN plot}}<br />
; Needed figure list<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it that the human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, and over the years the imitations have grown increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The field's rapid growth in popularity in recent years has yielded multiple advances[CITATION NEEDED]. While some of these advances are building technologies such as self-driving cars, others are focused on working towards the ultimate goal of ''Artificial General Intelligence'' (AGI): a system capable of not only human-level performance in one field, but of generalizing across a vast number of domains. <br />
<br />
:In the quest for an artificial general intelligence agent, while there might be disagreement about which sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and on PathNet, a structure developed to take advantage of the gains this technique provides. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in the number of tasks it can retain (EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase the number of tasks a structure can learn. <br />
<br />
:where do i start?<br />
A question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What does modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: naive search. Would higher exploitation improve module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax, which scales outputs to sum to 1 so they can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function (cost/error) calculates the difference between predicted output and target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is backpropagation and the optimization algorithm, which is used to calculate a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms exist; the most common is gradient descent. <br />
* Not diving into details here. Using stochastic gradient descent (SGD) and Adaptive Moment Estimation (Adam) [\ref{sgd}\ref{adam}]<br />
* NNs' function estimation properties are used in reinforcement learning and regression, but here for classification purposes. <br />
* final layer's softmax output estimates the probability of each class label, and therefore outputs a vector of values in [0, 1] where the index of the largest value is selected as the label. <br />
* image classification is done based on input pixel values<br />
* NNs bad at this, as an image class's manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* inputs an image and performs a convolutional operation on the image with a kernel of weights. <br />
* outputs what is called a feature map. As with NNs, each pixel here is a simple combination of multiplications and sums<br />
* but each pixel in the feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with kernel size and stride (the jumps made by the kernel). <br />
* a conv layer's channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of conv operations in a network to generalize to the given images. Each layer contains an abstraction level and outputs a feature map. <br />
* called Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level the feature map is reduced in spatial image dimensions but increased in channels. <br />
* usually for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns to classify from the image features. <br />
* The convolutional operations in this case can be called feature extraction. <br />
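A minimal NumPy sketch of the pieces above (weighted sum plus bias, ReLU activation, softmax output, and cross-entropy loss); all layer sizes and values are illustrative only, not the configuration used in the thesis:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Rectified linear unit: max(0, z) elementwise
    return np.maximum(0.0, z)

def softmax(z):
    # Scale outputs to sum to 1 so they can be read as class probabilities
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, target_index):
    # Negative log-probability assigned to the correct class
    return -np.log(p[target_index])

# One hidden layer: weighted sum plus bias, then activation
x = rng.normal(size=4)                      # input features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

hidden = relu(W1 @ x + b1)                  # x1*w1 + x2*w2 + ... + bias, then ReLU
probs = softmax(W2 @ hidden + b2)           # vector of values in [0, 1], summing to 1
loss = cross_entropy(probs, target_index=0)
predicted_label = int(np.argmax(probs))     # index of the largest value
```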
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures fall under DNNs. In recent years a multitude with different applicability have been used commercially and in research<br />
* Architectures depend on input type, the problems they are applied to, and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning is a method of reusing models for different tasks. <br />
* Train a model on one dataset for one task, then reuse the trained weights as the starting point for further training<br />
* usually randomly initialized weights are the starting point. <br />
* It has been shown that reusing weights from similar tasks and training them on new data yields better results in some cases. <br />
* Reduces the training data needed. <br />
* Can pretrain a model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without requiring immense computational resources. <br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However, a problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
*** PathNet<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying the learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
* Super Neural Networks<br />
** What are they?<br />
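The weight-reuse idea above could be sketched without any framework: obtain parameters trained on task A, then use them, rather than a fresh random draw, as the starting point for task B. A toy NumPy illustration; the shapes, the two-task setup, and all names are hypothetical placeholders, not the PathNet mechanism itself:<br />

```python
import numpy as np

rng = np.random.default_rng(1)

def init_layers(sizes, rng):
    # Random initialization: the usual starting point for training
    return [rng.normal(scale=0.1, size=(m, n))
            for n, m in zip(sizes[:-1], sizes[1:])]

# Pretend these weights were already trained on task A (784 inputs, 10 classes)
task_a_weights = init_layers([784, 64, 10], rng)

# Transfer: reuse the early (feature-extraction) layers for task B,
# and only re-initialize the task-specific output layer
task_b_weights = [w.copy() for w in task_a_weights[:-1]]
task_b_weights.append(rng.normal(scale=0.1, size=(5, 64)))  # new 5-class head

# Training on task B would now start from task A's shared parameters
```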
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
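A minimal sketch of the binary tournament loop over PathNet-style genotypes (lists of module indices per layer). The fitness function and all sizes here are placeholders; in the real search, fitness would be accuracy after training the path:<br />

```python
import random

random.seed(42)

LAYERS, MODULES_PER_LAYER, MODULES_IN_PATH = 3, 10, 3

def random_path():
    # Genotype: which modules are active in each layer
    return [random.sample(range(MODULES_PER_LAYER), MODULES_IN_PATH)
            for _ in range(LAYERS)]

def fitness(path):
    # Placeholder only: the real fitness is the path's training accuracy
    return sum(sum(layer) for layer in path)

def mutate(path, rate=0.2):
    # Point mutation: occasionally swap a module index for a random one
    return [[random.randrange(MODULES_PER_LAYER) if random.random() < rate else m
             for m in layer] for layer in path]

population = [random_path() for _ in range(8)]

for _ in range(20):  # generations
    # Binary tournament: pick two genotypes; the loser is overwritten by a
    # mutated copy of the winner (larger tournaments -> higher selection pressure)
    i, j = random.sample(range(len(population)), 2)
    if fitness(population[i]) < fitness(population[j]):
        i, j = j, i
    population[j] = mutate(population[i])

best = max(population, key=fitness)
```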
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (math stuffs)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable for ease of prototyping pathnet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation does a lot on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Notable differences in implementation <br />
** Keras implementation<br />
** Path fitness is not negative error but accuracy<br />
** exp 2: fitness calculated before evaluation (not in the same step)<br />
** No noise added to the training data<br />
* Implementation problems<br />
** Tensorflow sessions are not made for using multiple graphs<br />
*** Resetting the backend session after a number of models are made<br />
** Tensorflow-gpu's default is to use all the GPU memory it can <br />
*** Limiting data allocation to scale when needed<br />
** A Tensorflow session does not free allocated memory before the python thread is done. <br />
*** Run all experiments through threads. <br />
* Code available on github<br />
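A rough sketch of how the class structure listed above might fit together. The class names follow the list; the bodies are stand-ins, since the real classes wrap Keras layers and models:<br />

```python
import random

random.seed(0)

class Module:
    """One trainable block (e.g. a small dense or conv block) inside a layer."""
    def __init__(self, layer_index, module_index):
        self.layer_index = layer_index
        self.module_index = module_index

class Layer:
    """A set of parallel modules a path can pick from."""
    def __init__(self, layer_index, width):
        self.modules = [Module(layer_index, m) for m in range(width)]

class PathNet:
    """Holds the layers and builds random paths through them."""
    def __init__(self, depth=3, width=10, modules_per_layer=3):
        self.layers = [Layer(i, width) for i in range(depth)]
        self.modules_per_layer = modules_per_layer

    def random_path(self):
        # One genotype: the active module indices in each layer
        return [random.sample(range(len(layer.modules)), self.modules_per_layer)
                for layer in self.layers]

pathnet = PathNet()
path = pathnet.random_path()
```

In the actual implementation a path like this would then be turned into a trainable Keras model; that step is omitted here.<br />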
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The distribution of samples across the classes roughly follows Benford's law, as can be expected from a naturally collected dataset such as this.<br />
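Benford's law predicts that leading digit d occurs with probability log10(1 + 1/d); note that it only covers digits 1–9, so the class for digit 0 falls outside it. The expected frequencies can be computed for comparison with the class counts:<br />

```python
import math

# Expected leading-digit frequencies under Benford's law: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
```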
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitute. Given a random selection of samples from this set, this percentage should approximately be the probability for selection each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* functions. callback to theoretical background and GA buzzwords<br />
* parameterization<br />
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
\subsection{Common errors}<br />
Too far reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around this. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far reaching with respect to the achieved results; the conclusions do not correspond with the purpose<br />
<br />
= Ending =</div>
<hr />
<div>; Notes<br />
: Experiments repicable? What to do to get same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
: {{S|More info in NNvsDNN plot}}<br />
; Needed figure list<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it a human brain is seemingly capable of learning an endless amount of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations are growing increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The fields rapid growth in popularity the later years have yielded multiple advances[CITATION NEEDED] . While some of the advances are building such technologies as self driving cars, others are focused on working towards a ultimate goal of reaching ''Artificial General Intelligence''(AGI). A system capable of not only human-level performance in one field but able to generalize across a vast number of domains. <br />
<br />
:In the quest for a artificial general intelligence agent, while there might be disagreement on what sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and a structure developed to take advantage of the gain this technique called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in number of tasks it can retains(EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase amount of task a structure can learn. <br />
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What do modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Higher exploitation improve on module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated by means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function (cost/error) calculates the difference between predicted output and target output<br />
* ref to experiments; cross-entropy not covered in detail. Well suited to softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* The driving force is backpropagation and the optimization algorithm, which is used to calculate a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs' function-estimation properties are used in reinforcement learning and regression, but here for classification purposes. <br />
* final layer softmax output estimates the probability of each class label; therefore, it outputs a vector of values in [0, 1] where the index of the largest value is selected as the label. <br />
* image classification is done based on input pixel values<br />
* NNs are bad at this, as an image class manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* inputs an image and performs a convolutional operation on the image with a kernel of weights. <br />
* outputs what is called a feature map. As with an NN, each pixel here is a simple combination of multiplications and summing<br />
* but each pixel in the feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with the kernel size and stride (jumps made by the kernel). <br />
* a conv layer's channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of conv operations in a network to generalize to the given images. Each layer contains an abstraction level and outputs a feature map. <br />
* called Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level the feature map is reduced in spatial image dimensions but increased in channels. <br />
* usually for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
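The weighted sum, ReLU, and softmax described above can be sketched in a few lines of plain Python (a minimal illustration only; the actual experiments use Keras layers, and the variable names here are made up for the example):<br />

```python
import math

def relu(x):
    # Rectified linear unit: max(0, v) per element
    return [max(0.0, v) for v in x]

def softmax(x):
    # Scales outputs to sum to 1, so they can be used as a probability estimate
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def neuron(inputs, weights, bias):
    # x1*w1 + x2*w2 + ... + bias, before the activation function is applied
    return sum(x * w for x, w in zip(inputs, weights)) + bias

z = neuron([1.0, 2.0], [0.5, -0.25], bias=0.1)
probs = softmax([z, 0.0])  # index of the largest value would be the label
```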
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layers where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures fall under DNNs. In recent years a multitude with different applicability have been used commercially and in research<br />
* Architectures depend on input type, the problems they are applied to, and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning is a method of reusing models for different tasks. <br />
* Train a model on one set of data for one task, then reuse the trained weights as the starting point for further training<br />
* usually, randomly initialized weights are the starting point. <br />
* It has been shown that reusing weights from similar tasks and training them on new data yields better results in some cases. <br />
* Reduces the amount of training data needed. <br />
* Can pretrain a model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without needing immense computational resources. <br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who has done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However, a problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
*** PathNet<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
* Super Neural Networks<br />
** What are they?<br />
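The weight-reuse idea above can be illustrated with a toy sketch (purely hypothetical names and a dict standing in for a real network; in the thesis this is done with Keras models): feature-extractor weights are copied from a trained source model, the task-specific head is re-initialized, and frozen layers are skipped during updates, which is also how catastrophic forgetting is avoided for the reused part.<br />

```python
import random

def make_model(n_feature, n_head):
    # A "model" is just a dict of layer weight lists in this toy example
    return {"features": [random.random() for _ in range(n_feature)],
            "head": [random.random() for _ in range(n_head)]}

def transfer(source, n_head, freeze_features=True):
    # Reuse trained feature weights as the starting point for a new task;
    # only the task-specific head is re-initialized (and later trained)
    return {"features": list(source["features"]),
            "head": [0.0] * n_head,
            "frozen": ["features"] if freeze_features else []}

def train_step(model, lr=0.1):
    # Dummy update: nudge every weight in the non-frozen layers
    for name in ("features", "head"):
        if name in model.get("frozen", []):
            continue  # frozen layers keep their transferred weights
        model[name] = [w + lr for w in model[name]]

source = make_model(4, 2)   # stands in for a model trained on task 1
target = transfer(source, n_head=3)
train_step(target)          # only the new head changes
```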
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** keep it short; go straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Cons: <br />
*** Slow to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (numerical operations)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable for quick prototyping of PathNet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation does a lot on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Notable differences in implementation <br />
** Keras implementation<br />
** Path fitness is accuracy, not negative error<br />
** exp 2: fitness calculated before evaluation (not in the same step)<br />
** No noise added to the training data<br />
* Implementation problems<br />
** Tensorflow sessions are not made for using multiple graphs<br />
*** Resetting the backend session after a number of models are made<br />
** Tensorflow-gpu's default is to use all the GPU memory it can <br />
*** Limiting memory allocation to scale when needed<br />
** A Tensorflow session does not free allocated memory before the Python thread is done. <br />
*** Run all experiments through threads. <br />
* Code available on github<br />
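The thread workaround above follows a pattern like this sketch (a dummy computation stands in for building and training a Keras model, and the function names are made up for the example): each experiment runs in its own thread and is joined before the next starts, so memory allocated by the previous run can be released first.<br />

```python
import threading

def run_experiment(config, results):
    # In the real code this body would build and train a Keras model; the
    # TensorFlow session (and its GPU memory) can then be released when the
    # thread finishes. A dummy computation stands in for training here.
    results[config["name"]] = config["epochs"] * 2

def run_all(configs):
    results = {}
    for config in configs:
        # One thread per experiment; join before starting the next one
        t = threading.Thread(target=run_experiment, args=(config, results))
        t.start()
        t.join()
    return results

out = run_all([{"name": "exp1", "epochs": 3}, {"name": "exp2", "epochs": 5}])
```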
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution on each class follows Benford's law, which can be expected from a natural dataset such as this.<br />
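Benford's law gives the expected leading-digit frequencies, which can be computed directly; note that the law covers digits 1-9 only, so the digit-0 class falls outside it, and the SVHN counts track the predicted ordering (1 most common, frequencies falling towards 9) only approximately:<br />

```python
import math

def benford(d):
    # Expected frequency of leading digit d (1..9) under Benford's law:
    # P(d) = log10(1 + 1/d)
    return math.log10(1 + 1 / d)

# Percentages for each leading digit, e.g. digit 1 -> ~30.1%, digit 9 -> ~4.6%
expected = {d: round(100 * benford(d), 1) for d in range(1, 10)}
```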
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitute. Given a random selection of samples from this set, this percentage should approximately be the probability for selection each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* functions; call back to the theoretical background and GA terminology<br />
* parameterization<br />
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
=== Common errors ===<br />
Too far reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around this. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far reaching with respect to the achieved results; the conclusions do not correspond with the purpose. <br />
<br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Progress_for_week_12_(2018)Progress for week 12 (2018)2018-03-23T10:28:36Z<p>Martijho: /* Accounting */</p>
<hr />
<div>== Vetle Bu Solgård ==<br />
=== Budget ===<br />
* <br />
<br />
=== Accounting ===<br />
*<br />
<br />
== Martin Hovin ==<br />
=== Budget ===<br />
* Rewrite the data-usage experiment. <br />
** Same setup as the other two. <br />
** Anything more that should be visualized?<br />
** Run more trials?<br />
<br />
* Set up a big-picture plan for what is to be written going forward<br />
<br />
=== Accounting ===<br />
* Set up a content plan for what to include in the thesis<br />
:: [[Martijho-PathNet-thesis|Thesis structure and outlines]]<br />
* Working on the Theoretical Background; aiming to have it finished by the start of next week<br />
* Still running experiments</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-22T18:06:55Z<p>Martijho: </p>
<hr />
<div>; Notes<br />
: Are the experiments replicable? What is needed to get the same results? <br />
: Conclusion/end-of-thesis/"what could be better?" section: simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to the original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
<br />
: Needed figure list<br />
:: shallow neural net showing connections between neurons<br />
:: Convolutional operation<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it that a human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, and over the years the imitations have grown increasingly better. Artificial Intelligence and its sub-field of Machine Learning are among these areas. The field's rapid growth in popularity in recent years has yielded multiple advances[CITATION NEEDED]. While some of these advances are building such technologies as self-driving cars, others are focused on working towards an ultimate goal of reaching ''Artificial General Intelligence'' (AGI): a system capable of not only human-level performance in one field, but of generalizing across a vast number of domains. <br />
<br />
:In the quest for an artificial general intelligence agent, while there might be disagreement on which sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and on a structure developed to take advantage of the gains of this technique, called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in the number of tasks it can retain (EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase the number of tasks a structure can learn. <br />
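One simple way to quantify the reuse discussed here (an illustrative metric for this outline, not one taken from the PathNet paper) is the fraction of a new task's chosen path that passes through modules already trained and frozen by earlier tasks:<br />

```python
def reuse_fraction(new_path, frozen_modules):
    # new_path: the chosen module index per layer, e.g. [2, 0, 5]
    # frozen_modules: per-layer sets of module indices locked by earlier tasks
    reused = sum(1 for layer, module in enumerate(new_path)
                 if module in frozen_modules[layer])
    return reused / len(new_path)

# Reuses the frozen module in layers 0 and 2, but not layer 1 -> 2/3
frac = reuse_fraction([2, 0, 5], [{2, 3}, {1}, {5, 6}])
```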
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What do modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Higher exploitation improve on module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated be means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU) and softmax which scales output to have a sum of 1 so it can be used as a probability estimate<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function(cost/error) calculate calculates the difference between expected output and target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is the backpropagation and the optimization algorithm which is used to calculates a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs function estimation properties used in reinforcement learning, regression but here used for classification purpose. <br />
* final layer softmax-output to estimate the probability of class label, therefore, outputs vector of values [0, 1] where index of largest value selected as label. <br />
* image classification is done based on input pixel values<br />
* NNs bad at this as images class manifold can be highly complex (ref transition between binary and quinary mnist)<br />
* convolutional operations.<br />
* inputs image and performs convolutional operation on image and a kernel of weights. <br />
* outputs what is called feature map. as with NN each pixes here is simple combination of multiplications and summing<br />
* but each pixel in feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with kernel size and stride (jumps made by kernel). <br />
* convlayers channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of convoperations in a network to generalize to the images given. Each layer contain a abstractation level and outputs a feature map. <br />
* called Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level feature map is reduced in spatial image dimentions but increased in channels. <br />
* usually for image classification, feature map is flattened at some point and ran through a fully connected classification layer which learns the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures that fall in DNN. Later years multitude with different applicability have been used commercially and in research<br />
* Architectures depend on input type, problems they are applied to and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training in DNNS take time. Transfer learning as method of reusing models for different tasks. <br />
* Train model on one set of data for one task, reuse the trained weights as starting point for training<br />
* usually randomly initialized weights as starting point. <br />
* It is shown that reusing weights in similar tasks and training weights on new data yields better results in some cases. <br />
* Reduces needed training data. <br />
* Can pretrain some model on f.eks image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without immense computational resources needed. <br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However problem arises when sequentially trying to learn multiple tasks in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
*** PathNet<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying learning process by learning simpler tasks first and building on the parameters reached for these tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
* Super Neural Networks<br />
** What are they?<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (math stuffs)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterizable for ease of prototyping pathnet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Taks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation do lots on CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Noteable differences in implementation <br />
** Keras implementasjon<br />
** Path fitness not negative error but accuracy<br />
** exp 2: fitness calculated before evaluation (not same step)<br />
** Not added any noise to training data<br />
* Implementation problems<br />
** Tensorflow sessions not made for using multiple graphs<br />
*** Resetting backend session after a number of models are made<br />
** Tensorflow-gpus default is using all gpu memory it can <br />
*** Limiting data allocation to scale when needed<br />
** Tensorflow session does not free allocated memory before python thread is done. <br />
*** Run all experiments through treads. <br />
* Code available on github<br />
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution on each class follows Benfords law, which can be expected from a natural dataset such as this.<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitute. Given a random selection of samples from this set, this percentage should approximately be the probability for selection each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* functions; call back to the theoretical background and GA terminology<br />
* parameterization<br />
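The tournament search referenced above can be sketched as a binary tournament over paths, as in the original PathNet paper: two genotypes are sampled, both are trained and evaluated, and the winner's genotype overwrites the loser's with mutation. Here `train_and_evaluate` and `mutate` are hypothetical stand-ins for the thesis implementation, not its actual API:<br />

```python
import random

def mutate(path, rate=0.1, num_modules=10):
    """Independently re-draw each module index with a small probability."""
    return [random.randrange(num_modules) if random.random() < rate else m
            for m in path]

def tournament_search(population, train_and_evaluate, generations=100):
    """Binary tournament over path genotypes.

    Each generation, two genotypes are picked at random, both are trained
    and evaluated, and a mutated copy of the winner overwrites the loser.
    """
    for _ in range(generations):
        a, b = random.sample(range(len(population)), 2)
        fit_a = train_and_evaluate(population[a])
        fit_b = train_and_evaluate(population[b])
        winner, loser = (a, b) if fit_a >= fit_b else (b, a)
        population[loser] = mutate(population[winner])
    # A real implementation would cache fitnesses instead of re-evaluating.
    return max(population, key=train_and_evaluate)
```

Selection pressure here comes only from the pairwise comparison; the algorithms compared in Experiment 2 would vary exactly this step.<br />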
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
=== Common errors ===<br />
Too far-reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around this. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far-reaching with respect to the achieved results; the conclusions do not correspond with the purpose. <br />
<br />
= Ending =</div>
<hr />
<div>; Notes<br />
: Experiments replicable? What is needed to reproduce the same results? <br />
: Conclusion/end of thesis/"what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to the original implementation<br />
: Background -> figure of neuron to have y^ as output<br />
<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it that a human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we incorporate the same effect into our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, and over the years the imitations have grown steadily better. Artificial Intelligence, with its sub-field of Machine Learning, is one of these areas. The field's rapid growth in popularity in recent years has yielded multiple advances [CITATION NEEDED]. While some of these advances build technologies such as self-driving cars, others are focused on working towards the ultimate goal of ''Artificial General Intelligence'' (AGI): a system capable of not only human-level performance in one field, but of generalizing across a vast number of domains. <br />
<br />
:In the quest for an artificial general intelligence agent, there may be disagreement about which sub-fields of AI are most important for this endeavor, but improving on current learning systems is considered a good start \cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a sub-field of machine learning called transfer learning, and on PathNet, a structure developed to take advantage of the gains this technique offers. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in the number of tasks it can retain (EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and lower capacity consumption will increase the number of tasks a structure can learn. <br />
<br />
:where do I start?<br />
A question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
Broad answers first, specify later. <br />
We know PN works. Would it work better with different algorithms?<br />
Logical next step from the original paper's "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What does modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning end-to-end first, then with a PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: naive search. Would higher exploitation improve module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks, where the first is learned end-to-end vs with PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why hasn't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated by means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
* Supervised learning<br />
* Based on the structure of the brain <br />
* image of dendrite vs artificial neuron <br />
* weights/parameters and a bias (x1*w1 +x2*w2 ... + bias) summed then activation function <br />
** activation not discussed in depth here<br />
** Using two: rectified linear unit (ReLU), and softmax, which scales outputs to sum to 1 so they can be used as probability estimates<br />
** softmax is the generalization of binary logistic regression to multiple classes. <br />
** regression/classification<br />
* feedforward<br />
* image of connections<br />
* loss function (cost/error) calculates the difference between predicted output and target output<br />
* ref to experiments cross-entropy not going in detail. well suited with softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
* goal is to minimize the cross-entropy function for the dataset [X, Y]. <br />
* Driving force is backpropagation and the optimization algorithm, which is used to calculate a gradient for all weights in the neural network and update the weights accordingly<br />
* Many optimization algorithms, most common is gradient descent. <br />
* Not diving into details here. using stochastic gradient descent (SGD) and Adaptive Moment Estimation (ADAM) [\ref{sgd}\ref{adam}]<br />
* NNs function estimation properties used in reinforcement learning, regression but here used for classification purpose. <br />
* final layer uses softmax output to estimate the probability of each class label; therefore, it outputs a vector of values in [0, 1] where the index of the largest value is selected as the label. <br />
* image classification is done based on input pixel values<br />
* NNs bad at this, as image class manifolds can be highly complex (ref transition between binary and quinary MNIST)<br />
* convolutional operations.<br />
* inputs image and performs convolutional operation on image and a kernel of weights. <br />
* outputs what is called a feature map. As with NNs, each pixel here is a simple combination of multiplications and sums<br />
* but each pixel in feature map contains info about the local spatial area the kernel covered. <br />
* control this spatial area with kernel size and stride (jumps made by kernel). <br />
* convlayers channels specify the number of kernels run over the image. One output channel for each kernel. <br />
* normal to stack layers of conv operations in a network to generalize to the given images. Each layer contains an abstraction level and outputs a feature map. <br />
* called a Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
* For each level, the feature map is reduced in spatial dimensions but increased in channels. <br />
* usually for image classification, the feature map is flattened at some point and run through a fully connected classification layer which learns from the features of the image. <br />
* The convolutional operations in this case can be called feature extraction. <br />
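The forward pass sketched in the bullets above (weighted sum plus bias, ReLU activation, softmax output, cross-entropy loss) can be written out in a few lines of NumPy. This is a minimal illustration with toy dimensions, not the thesis implementation:<br />

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    # Subtract the max for numerical stability; the output sums to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def forward(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a softmax output layer."""
    hidden = relu(w1 @ x + b1)          # weighted sums plus bias, then activation
    return softmax(w2 @ hidden + b2)

def cross_entropy(probs, target_index):
    """Loss between the predicted distribution and the true class."""
    return -np.log(probs[target_index])

# Tiny example: 4 inputs, 3 hidden units, 2 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
probs = forward(x, rng.normal(size=(3, 4)), np.zeros(3),
                rng.normal(size=(2, 3)), np.zeros(2))
label = int(np.argmax(probs))  # index of the largest value is the prediction
```

The softmax/cross-entropy pairing is why the two fit well together: the loss is simply the negative log-probability assigned to the correct class.<br />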
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures fall under DNNs. In recent years, a multitude with different applicability have been used commercially and in research<br />
* Architectures depend on input type, the problems they are applied to, and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning is a method of reusing models for different tasks. <br />
* Train a model on one set of data for one task, then reuse the trained weights as the starting point for training on another. <br />
* usually, randomly initialized weights are the starting point. <br />
* It has been shown that reusing weights from similar tasks and training them on new data yields better results in some cases. <br />
* Reduces the amount of training data needed. <br />
* Can pretrain a model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets, without the need for immense computational resources. <br />
* TL FROM ESSAY<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Overlap in motivation behind transfer learning and multitask learning: Being able to share knowledge between tasks. <br />
* However, a problem arises when trying to learn multiple tasks sequentially in the same neural network. <br />
** Optimization done for one dataset is overwritten if backpropagation is allowed to pass through the same layers for both tasks. <br />
** Catastrophic forgetting and solutions: <br />
*** EWC<br />
*** PNN<br />
*** PathNet<br />
* Curriculum Learning / Gradual learning<br />
** Simplifying the learning process by learning simpler tasks first and building on the parameters reached for those tasks. <br />
*** ref to motivation behind task ordering in exp2<br />
* Super Neural Networks<br />
** What are they?<br />
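The weight-reuse idea in the transfer-learning bullets above can be illustrated with a toy sketch (a hypothetical single logistic unit, not the thesis code): pretrain on one task, then start training on a related task from the pretrained weights instead of a random initialization.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def train(w, X, y, lr=0.5, steps=200):
    """Plain gradient descent on the logistic loss; returns trained weights."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # sigmoid output
        w = w - lr * X.T @ (p - y) / len(y)   # gradient step
    return w

def accuracy(w, X, y):
    return float(np.mean(((X @ w) > 0) == (y > 0.5)))

# Task A and a related task B: same inputs, slightly different decision rule
X = rng.normal(size=(200, 3))
y_a = (X @ np.array([1.0, -1.0, 0.5]) > 0).astype(float)
y_b = (X @ np.array([1.0, -1.0, 0.8]) > 0).astype(float)

# Pretrain on task A, then transfer: fine-tune on task B for only a few steps
w_pre = train(np.zeros(3), X, y_a)
w_transfer = train(w_pre.copy(), X, y_b, steps=20)
```

Starting from `w_pre` typically needs far fewer steps on task B than a random start would; in a deep network the same idea is applied per layer (reuse the early feature-extraction layers, retrain the classifier head).<br />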
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** Keep this short; go straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
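As a concrete reference for the bullets above, a minimal sketch of a binary tournament step in the PathNet style; the genotype encoding and the fitness function are toy stand-ins for the real path-training loop.<br />

```python
import random

random.seed(1)

L, M, N = 3, 10, 3  # layers, modules per layer, active modules per layer

def random_path():
    # A genotype: for each layer, which N of the M modules are active
    return [random.sample(range(M), N) for _ in range(L)]

def mutate(path, rate=1.0 / (L * N)):
    # Independently re-draw each module index with a small probability
    return [[random.randrange(M) if random.random() < rate else m
             for m in layer] for layer in path]

def fitness(path):
    # Toy stand-in: in PathNet this would be accuracy after training the
    # model defined by the path; here, lower module indices score higher.
    return -sum(sum(layer) for layer in path)

def tournament_step(population):
    """Binary tournament: pick two, overwrite the loser with a mutated winner."""
    i, j = random.sample(range(len(population)), 2)
    if fitness(population[i]) < fitness(population[j]):
        i, j = j, i  # make i the winner
    population[j] = mutate(population[i])

population = [random_path() for _ in range(8)]
for _ in range(200):
    tournament_step(population)
best = max(population, key=fitness)
```

Selection pressure can be raised by comparing more than two genotypes per tournament: with tournament size k, weak genotypes survive a step with probability that shrinks as k grows.<br />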
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and the experiment chapters. Build up a base that chapters 4 and 5 can build on. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally easy to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** NumPy (numerical computation)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterized for quick prototyping of PathNet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating Keras models<br />
***** Static methods for creating PathNet structures<br />
***** Resetting the backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generation<br />
* Training on GPU<br />
** Generally quicker for ML<br />
** This implementation does a lot on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in Keras. <br />
* Notable differences in implementation <br />
** Keras implementation<br />
** Path fitness is accuracy, not negative error<br />
** Exp 2: fitness calculated before evaluation (not in the same step)<br />
** No noise added to the training data<br />
* Implementation problems<br />
** TensorFlow sessions are not made for working with multiple graphs<br />
*** Reset the backend session after a number of models have been made<br />
** TensorFlow-GPU defaults to using all the GPU memory it can <br />
*** Limit memory allocation so it scales as needed<br />
** A TensorFlow session does not free allocated memory before the Python thread is done. <br />
*** Run all experiments in threads. <br />
* Code available on GitHub<br />
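The thread workaround in the bullets above can be sketched as follows (hypothetical experiment runner; the TensorFlow calls are omitted): each experiment runs inside its own thread, so backend resources tied to that thread are released when it finishes.<br />

```python
import queue
import threading

def run_experiment(name, results):
    # In the real setup this would build and train a PathNet model inside
    # its own TensorFlow session; memory held by the session is released
    # only once this thread finishes.
    results.put((name, "done"))

results = queue.Queue()
threads = [threading.Thread(target=run_experiment, args=(f"exp{i}", results))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
collected = dict(results.get_nowait() for _ in range(3))
```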
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The distribution of samples across classes roughly follows Benford's law, as can be expected for digits drawn from a natural source such as house numbers.<br />
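Benford's law predicts that leading digit d occurs with frequency log10(1 + 1/d). A quick sketch for checking a class distribution against it, using the SVHN per-class sample counts from this section:<br />

```python
import math

# Benford's law: expected frequency of leading digit d, for d = 1..9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# SVHN sample counts for digit classes 1-9 (class 0 excluded: house
# numbers do not start with 0, so Benford's law only covers 1-9)
counts = {1: 90560, 2: 74740, 3: 60765, 4: 50633, 5: 53490,
          6: 41582, 7: 43997, 8: 35358, 9: 34456}
total = sum(counts.values())
observed = {d: n / total for d, n in counts.items()}
```

The match is only qualitative here (digit 1 is the most common class and 9 the rarest), since SVHN crops contain digits from all positions in a house number, not just the leading one.<br />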
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitute. Given a random selection of samples from this set, this percentage should approximately be the probability for selection each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* Functions; refer back to the theoretical background and GA terminology<br />
* parameterization<br />
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
=== Common errors ===<br />
Too far-reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around it. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far-reaching with respect to the achieved results; the conclusions do not correspond with the purpose. <br />
<br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-22T13:07:59Z<p>Martijho: /* Deep Learning */</p>
<hr />
<div>; Notes<br />
: Experiments repicable? What to do to get same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it a human brain is seemingly capable of learning an endless amount of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations are growing increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The fields rapid growth in popularity the later years have yielded multiple advances[CITATION NEEDED] . While some of the advances are building such technologies as self driving cars, others are focused on working towards a ultimate goal of reaching ''Artificial General Intelligence''(AGI). A system capable of not only human-level performance in one field but able to generalize across a vast number of domains. <br />
<br />
:In the quest for a artificial general intelligence agent, while there might be disagreement on what sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and a structure developed to take advantage of the gain this technique called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in number of tasks it can retains(EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase amount of task a structure can learn. <br />
<br />
:where do i start?<br />
Question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
We know PN works. would it work better for different algorithms?<br />
logical next step from original paper "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What do modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning in end-to-end first then PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: Naive search. Higher exploitation improve on module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where first are end to end vs PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why haven't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with what the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis, formulated by means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
Supervised learning<br />
Based on the structure of the brain <br />
image of dendrite vs artificial neuron <br />
weights/parameters and a bias (x1*w1 + x2*w2 + ... + bias), summed, then an activation function <br />
- activation functions are not discussed in depth here<br />
- Using two: the rectified linear unit (ReLU), and softmax, which scales the output to sum to 1 so it can be used as a probability estimate<br />
- softmax is the generalization of binary logistic regression to multiple classes. <br />
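The weighted sum and the two activations above can be sketched in a few lines of plain Python (a minimal illustration only; the actual thesis code uses Keras):<br />

```python
import math

def relu(x):
    # Rectified linear unit: keeps positive values, zeroes out negatives.
    return max(0.0, x)

def softmax(values):
    # Scales a vector of scores so they sum to 1 and can be read as
    # class-probability estimates.
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def neuron(inputs, weights, bias, activation=relu):
    # One artificial neuron: x1*w1 + x2*w2 + ... + bias, then activation.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)
```
<br />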
- regression/classification<br />
feedforward<br />
image of connections<br />
the loss function (cost/error) calculates the difference between the network output and the target output<br />
ref to experiments: using cross-entropy, not going into detail. Well suited to softmax activation (ref http://cs231n.github.io/linear-classify/)<br />
goal is to minimize the cross-entropy over the dataset [X, Y]. <br />
The driving force is backpropagation together with the optimization algorithm, which is used to calculate a gradient for all weights in the neural network and update the weights accordingly<br />
Many optimization algorithms exist; the most common is gradient descent. <br />
Not diving into details here; using stochastic gradient descent (SGD) and Adaptive Moment Estimation (Adam) [\ref{sgd}\ref{adam}]<br />
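A single SGD update for a softmax classifier with cross-entropy loss can be sketched as follows (a plain-Python illustration with no bias term; it uses the fact that the gradient of the cross-entropy with respect to each logit reduces to p - y):<br />

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(weights, x, y_onehot, lr=0.1):
    # weights: one weight vector per class.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    p = softmax(logits)
    # For softmax + cross-entropy, d(loss)/d(logit_k) = p_k - y_k,
    # so each weight moves against (p_k - y_k) * x_i.
    return [[w_i - lr * (p_k - y_k) * x_i for w_i, x_i in zip(w, x)]
            for w, p_k, y_k in zip(weights, p, y_onehot)]
```
<br />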
NNs' function-approximation properties are used in reinforcement learning and regression, but here for classification. <br />
the final layer has a softmax output to estimate the probability of each class label, and therefore outputs a vector of values in [0, 1] where the index of the largest value is selected as the label. <br />
image classification is done based on input pixel values<br />
plain NNs struggle at this, as image class manifolds can be highly complex (ref transition between binary and quinary MNIST)<br />
convolutional operations:<br />
take an image as input and perform a convolution between the image and a kernel of weights. <br />
the output is called a feature map. As in an NN, each output pixel is a simple combination of multiplications and summing,<br />
but each pixel in the feature map contains information about the local spatial area the kernel covered. <br />
this spatial area is controlled by the kernel size and stride (the jumps made by the kernel). <br />
a conv layer's channels specify the number of kernels run over the image; one output channel per kernel. <br />
it is normal to stack layers of convolutional operations in a network to generalize to the given images. Each layer represents an abstraction level and outputs a feature map. <br />
this is called a Convolutional Neural Network (CNN) \ref{exp1.b exp2}<br />
At each level the feature map is reduced in spatial dimensions but increased in channels. <br />
for image classification, the feature map is usually flattened at some point and run through a fully connected classification layer which classifies based on the extracted features. <br />
The convolutional operations in this case can be called feature extraction. <br />
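A single-channel convolution with kernel size and stride can be sketched like this (a minimal "valid" convolution as used in deep learning, i.e. cross-correlation without padding; real implementations are vectorized):<br />

```python
def conv2d(image, kernel, stride=1):
    # Slides the kernel over the image; each output pixel is the sum of
    # elementwise products over the local spatial area the kernel covers.
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    acc += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        feature_map.append(row)
    return feature_map
```
A larger stride shrinks the spatial dimensions of the feature map, and one such map is produced per kernel (channel).<br />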
<br />
<!-- <br />
Intro about ML from the thesis<br />
\subsection{MLP and NN modeling as function approx}<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layer where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multiple layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
--><br />
<br />
== Deep Learning == <br />
insert from essay<br />
* multiple network architectures fall under DNNs. In later years a multitude with different applicability have been used commercially and in research<br />
* Architectures depend on the input type, the problems they are applied to, and resource limitations. <br />
<br />
=== Transfer learning === <br />
* Training DNNs takes time. Transfer learning is a method of reusing models for different tasks. <br />
* Train a model on one set of data for one task, then reuse the trained weights as the starting point for further training<br />
* usually, randomly initialized weights are the starting point. <br />
* It has been shown that reusing weights from similar tasks and training them on new data yields better results in some cases. <br />
* Reduces the amount of training data needed. <br />
* Can pretrain a model on e.g. image data and provide a pretrained model that can be reused and adapted quickly to smaller datasets without needing immense computational resources. <br />
* DEFINITION FROM ESSAY<br />
:We can define transfer learning as trying to learn a target conditional probability distribution \(P(Y_t|X_t)\) within a domain \(\mathcal{D}_t\), based on information gained from learning a source task \(\mathcal{T}_s\) in the source domain \(\mathcal{D}_s\) where \(\mathcal{D}_s \neq \mathcal{D}_t\) and \(\mathcal{T}_s \neq \mathcal{T}_t\). A domain \(\mathcal{D}\) would then, in a typical classification example, be given as \(\mathcal{D} = \{X, P(X)\}\) where \(X = x_1,x_2, \dotsc ,x_n\) are sampled from the feature space \(\mathcal{X}\) and \(P(X)\) is a probability distribution over that space. The task \(\mathcal{T}\) in that domain would then consist of a label space \(\mathcal{Y}\) and the conditional probability distribution \(P(Y|X)\) which usually is approximated during training on a set of \(x_i, y_i\) pairs where \(x_i \in \mathcal{X}\) and \(y_i \in \mathcal{Y}\).<br />
* Multiple techniques <br />
** <br />
* Transfer learning<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Multi-task Learning<br />
** Curriculum Learning<br />
*** ref to motivation behind task ordering in exp2<br />
* Catastrophic forgetting<br />
*** EWC<br />
*** PNN<br />
*** PathNet<br />
* Super Neural Networks<br />
** What are they?<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
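One step of the kind of tournament selection listed above can be sketched as follows (a hypothetical minimal version; in PathNet the genotypes are paths, fitness comes from training accuracy, and the winner's copy would additionally be mutated):<br />

```python
import random

def tournament_step(population, fitness, k=2, rng=random):
    # Sample k genotypes at random, evaluate them, and let the fittest
    # overwrite the rest. A larger k raises the selection pressure
    # (more exploitation, less exploration).
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitness(population[i]))
    for i in contenders:
        if i != winner:
            # Copy of the winner; a real search would mutate this copy.
            population[i] = list(population[winner])
    return population
```
<br />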
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally good to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (math stuffs)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Easily parameterized for quick prototyping of PathNet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on GPU<br />
** Quicker in general for ML<br />
** This implementation does a lot on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in Keras. <br />
* Notable differences in implementation <br />
** Keras implementation<br />
** Path fitness is accuracy rather than negative error<br />
** exp 2: fitness calculated before evaluation (not in the same step)<br />
** No noise added to the training data<br />
* Implementation problems<br />
** TensorFlow sessions are not made for using multiple graphs<br />
*** Resetting the backend session after a number of models have been made<br />
** TensorFlow-GPU by default claims all the GPU memory it can <br />
*** Limiting memory allocation to scale as needed<br />
** A TensorFlow session does not free allocated memory before the Python thread is done. <br />
*** Run all experiments through threads. <br />
* Code available on github<br />
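As an illustration of the parameterizable path building mentioned above, a random path genotype could look like this (the sizes are hypothetical placeholders, not the configuration used in the experiments):<br />

```python
import random

def random_path(num_layers=3, modules_per_layer=10, active_per_layer=3, rng=random):
    # A path genotype: for every layer of the PathNet, pick which modules
    # are active. A Keras model would then be built from exactly these
    # modules and trained, leaving the rest of the network untouched.
    return [sorted(rng.sample(range(modules_per_layer), active_per_layer))
            for _ in range(num_layers)]
```
<br />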
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in both training-set and validation-set along with the portion of the whole sets each class constitute. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.3\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5951 & 9.9\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
<br />
=== SVHN === <br />
The sample distribution over the classes follows Benford's law, which is to be expected from a natural dataset such as this.<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples on each class in the cropped SVHN set used in this thesis, along with the portion of the whole set each class constitutes. Given a random selection of samples from this set, this percentage should approximately be the probability of selecting each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
<br />
== Search implementation == <br />
* functions. callback to theoretical background and GA buzzwords<br />
* parameterization<br />
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
=== Common errors ===<br />
Too far-reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around it. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far-reaching with respect to the achieved results; the conclusions do not correspond with the purpose.<br />
<br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-22T11:45:36Z<p>Martijho: /* Machine Learning */</p>
<hr />
<div>; Notes<br />
: Experiments replicable? What to do to get the same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it that a human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations have grown increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The field's rapid growth in popularity in later years has yielded multiple advances[CITATION NEEDED]. While some of these advances are building technologies such as self-driving cars, others are focused on working towards the ultimate goal of reaching ''Artificial General Intelligence'' (AGI): a system capable of not only human-level performance in one field, but of generalizing across a vast number of domains. <br />
<br />
:In the quest for an artificial general intelligence agent, while there might be disagreement on which sub-fields of AI are most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a subfield of machine learning called transfer learning, and on PathNet, a structure developed to take advantage of the gains from this technique. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PathNet, EWC)<br />
* Large structures (PNN, PathNet)<br />
* Limited in the number of tasks it can retain (EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase the number of tasks a structure can learn. <br />
<br />
:where do i start?<br />
The question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:why this? <br />
broad answers first, specify later. <br />
= Ending =</div>Martijhohttps://robin.wiki.ifi.uio.no/Martijho-PathNet-thesisMartijho-PathNet-thesis2018-03-21T15:27:09Z<p>Martijho: /* Machine Learning */</p>
<hr />
<div>; Notes<br />
: Experiments replicable? What to do to get the same results? <br />
: Conclusion/end of thesis/ "what could be better?" section: Simplify experiment 2 with fewer algorithms and harder problems<br />
: Find all changes made to original implementation<br />
<br />
<br />
<br />
= Opening = <br />
== Abstract == <br />
* What is all this about? <br />
* Why should I read this thesis? <br />
* Is it any good? <br />
* What's new? <br />
<br />
== Acknowledgements == <br />
* Who is your advisor? <br />
* Did anyone help you? <br />
* Who funded this work? <br />
* What's the name of your favorite pet?<br />
<br />
<br />
= Introduction =<br />
More on multi task learning<br />
More on transfer learning<br />
<br />
:How is it that the human brain is seemingly capable of learning an endless number of tasks? Is it truly endless? Could we incorporate the same effect in our artificial minds?<br />
<br />
:Biology and nature have always been imitated in art and the sciences, but over the years the imitations have grown increasingly better. Artificial Intelligence and its sub-field of Machine Learning is one of these areas. The field's rapid growth in popularity in recent years has yielded multiple advances[CITATION NEEDED]. While some of these advances are building technologies such as self-driving cars, others are focused on working towards the ultimate goal of reaching ''Artificial General Intelligence'' (AGI): a system capable not only of human-level performance in one field, but of generalizing across a vast number of domains. <br />
<br />
:In the quest for an artificial general intelligence agent, while there might be disagreement on which sub-fields of AI are the most important for this endeavor, improving on current learning systems is considered a good start\cite{mlroadmap}. <br />
<br />
:This thesis will attempt to shed light on a sub-field of machine learning called transfer learning, and on a structure developed to take advantage of the gains from this technique, called PathNet. <br />
<br />
<br />
== Raise problem: catastrophic forgetting. ==<br />
Multiple solutions (PNN, PN, EWC)<br />
* Large structures (PNN, PN)<br />
* Limited in the number of tasks it can retain (EWC)<br />
<br />
Optimize reuse of knowledge while still providing valid solutions to tasks. More reuse and limited capacity use will increase the number of tasks a structure can learn. <br />
<br />
:Where do I start?<br />
A question DeepMind left unanswered is how different GAs influence task learning and module reuse. <br />
Exploration vs exploitation\ref{theoretic background on topic}<br />
<br />
:Why this? <br />
Broad answers first, specify later. <br />
We know PN works. Would it work better with different algorithms?<br />
Logical next step from the original paper's "unit of evolution"<br />
<br />
== Problem/hypothesis ==<br />
* What does modular PN training do with the knowledge? <br />
** More/less accuracy?<br />
** More/less transferability? <br />
Test by learning end-to-end first, then by PN search. <br />
Difference in performance or reuse?<br />
<br />
* Can we make reuse easier by shifting focus of search algorithm?<br />
** PN original: naive search. Would higher exploitation improve module selection?<br />
<br />
== How to answer? == <br />
* Set up simple multitask scenarios and try. <br />
** 2 tasks where the first is learned end-to-end vs with PN<br />
** List algorithms with different selection pressure and try on multiple tasks.<br />
<br />
<!-- <br />
What is the use of a Nifty Gadget? <br />
What is the problem? <br />
How can it be solved? <br />
What are the previous approaches? <br />
What is your approach? <br />
Why do it this way? <br />
What are your results? <br />
Why is this better? <br />
Is this a new approach? <br />
Why hasn't anyone done it before? <br />
or<br />
Why do you reiterate previous work? <br />
What is your contribution to the field of Nifty Gadgets? <br />
<br />
\section{What should this chapter contain?}<br />
Presentation of the problem or phenomenon to be addressed, the situation where the problem or phenomenon occurs, and references to earlier relevant research. <br />
\subsection{Common errors}<br />
Problem is not properly specified or formulated; insufficient references to earlier work. <br />
<br />
\section{Purpose}<br />
What can be gained by more knowledge about the problem or phenomenon. <br />
\subsection{Common errors}<br />
The purpose is not mentioned, not connected to earlier research, or not in line with the actual contents of the thesis. <br />
<br />
\section{Problem/Hypothesis} <br />
Questions that need to be answered to reach <br />
the goal and/or hypothesis formulated by means of <br />
underlying theories. <br />
\subsection{Common errors}<br />
Missing problem description; deficiencies in the connections between questions; badly formulated <br />
hypothesis. <br />
<br />
\section{Method} <br />
Choice of an adequate method with respect to the <br />
purpose and problem/hypothesis. <br />
<br />
\subsection{Common errors}<br />
An inappropriate method is used, for example due to lack of knowledge about different methods; <br />
erroneous use of chosen method. <br />
--><br />
<br />
= Theoretical Background =<br />
== Machine Learning == <br />
<br />
<br />
Intro about ML from the thesis<br />
=== MLP and NN modeling as function approximation ===<br />
Inspired by the structure of the brain, the Neural Network (NN) consists of one or more layers, where each layer is made up of perceptrons<br />
* What is a perceptron? How is it connected to input, output? <br />
* How is training done? Input against target<br />
* Multi-layer perceptron (MLP) as an artificial Neural Network (ANN).<br />
** Ref binary MNIST classification in exp 1<br />
* Backpropagation and optimizers (SGD and Adam)<br />
** ref binary MNIST/Quinary MNIST/exp2<br />
* Regression/function approximation (ReLU activation)<br />
* Classification (Softmax and probability approximation)<br />
** ref experiments<br />
* Image classification<br />
** ref experiments<br />
* Convolutional Neural Networks (CNN)<br />
** ref transition binary-quinary exp1 and exp2<br />
* Deep Learning and Deep neural networks (DNN)<br />
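As a concrete anchor for the list above, here is a minimal sketch (illustrative only, not the thesis implementation) of a fully connected layer with a ReLU activation and a softmax output that turns logits into class probabilities, applied to one flattened 28x28 MNIST-shaped input:<br />

```python
import numpy as np

def perceptron_layer(x, W, b):
    # One fully connected layer: weighted sum of the inputs plus a bias,
    # passed through a ReLU activation (as used in hidden layers).
    return np.maximum(0.0, x @ W + b)

def softmax(logits):
    # Softmax turns raw outputs into a probability distribution over
    # classes, as used for the MNIST classification output layer.
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 784))                  # one flattened 28x28 image
W1, b1 = rng.normal(size=(784, 32)) * 0.01, np.zeros(32)
W2, b2 = rng.normal(size=(32, 10)) * 0.01, np.zeros(10)

hidden = perceptron_layer(x, W1, b1)
probs = softmax(hidden @ W2 + b2)              # shape (1, 10), rows sum to 1
```

Training would adjust W1, b1, W2 and b2 by backpropagating a loss through these operations, which is what optimizers such as SGD and Adam do at different levels of sophistication.<br />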
<br />
== Deep Learning == <br />
* Feature extraction<br />
** Bigger black box<br />
* Network designs<br />
* Transfer learning<br />
** What is it? <br />
** Why do it?<br />
** How do it?<br />
** TL in CNNs<br />
*** Who have done it? <br />
*** Results?<br />
*** Gabor approximation<br />
* Multi-task Learning<br />
** Curriculum Learning<br />
*** ref to motivation behind task ordering in exp2<br />
* Catastrophic forgetting<br />
*** EWC<br />
*** PNN<br />
*** PathNet<br />
* Super Neural Networks<br />
** What are they?<br />
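Transfer learning can be illustrated without any deep-learning framework. The toy sketch below (all names and the task are invented for illustration) freezes a stand-in "pre-trained" feature extractor and trains only a new output layer on a target task; reusing frozen, already-trained parameters is the same basic pattern PathNet exploits at the module level:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained feature extractor: in real transfer learning
# these weights would come from a network trained on a source task.
W_frozen = rng.normal(size=(2, 16))

def features(x):
    # Frozen: these weights are never updated on the target task.
    return np.tanh(x @ W_frozen)

# Toy target task: the label is 1 when x0 + x1 > 0.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, lr = np.zeros(16), 0.0, 0.1              # only the new head is trained

def predict(X):
    return 1.0 / (1.0 + np.exp(-(features(X) @ w + b)))

losses = []
for _ in range(500):
    p = predict(X)
    losses.append(-np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9)))
    grad = p - y                               # dLoss/dlogit for cross-entropy
    w -= lr * features(X).T @ grad / len(X)
    b -= lr * grad.mean()

head_accuracy = ((predict(X) > 0.5) == y).mean()
```

The point of the sketch is the asymmetry: gradient descent only touches the head (w, b), so the frozen features carry whatever the source task taught them, for better or worse.<br />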
<br />
<br />
== Evolutionary algorithms == <br />
* What is it? Where does it come from?<br />
* Exploration vs Exploitation<br />
** ref experiments (formulated in the context of this trade-off)<br />
* Terms used in the evolutionary programming context<br />
** Population<br />
** Genotype and genome<br />
** Fitness-function<br />
** selection<br />
** recombination<br />
** generation <br />
** mutation<br />
** population diversity and convergence<br />
* Some types<br />
** GA<br />
** Evolutionary searches<br />
** short. Straight into tournament search<br />
* Tournament search<br />
** How it works, what are the steps?<br />
** Selection pressure (in larger context of EAs and then tournament search)<br />
** ref to search<br />
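The tournament-search steps above can be sketched in a few lines of Python (the population and fitness values are hypothetical). The tournament size k is the selection-pressure knob: k = 1 gives pure exploration, while k close to the population size gives strong exploitation:<br />

```python
import random

def tournament_select(population, fitness, k, rng=random):
    # Draw k distinct contestants uniformly and return the index of the
    # fittest one. Larger k means higher selection pressure (exploitation);
    # k = 1 degenerates to uniform random selection (pure exploration).
    contestants = rng.sample(range(len(population)), k)
    return max(contestants, key=lambda i: fitness[i])

population = ["path-%d" % i for i in range(8)]       # hypothetical genotypes
fitness = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6]   # hypothetical fitnesses

rng = random.Random(42)
best = tournament_select(population, fitness, k=len(population), rng=rng)
# With k equal to the population size, the fittest individual (index 1,
# fitness 0.9) always wins; with k = 1 the choice is uniformly random.
```

In the PathNet setting an individual is a path and its fitness is the path's accuracy.<br />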
<br />
<!--<br />
x What is the required background knowledge? <br />
x Where can I find it? <br />
\section{Various approaches to Nifty Gadgets} <br />
x What is the relevant prior work? <br />
x Where can I find it? <br />
Why should it be done differently? <br />
x Has anyone attempted your approach previously? <br />
x Where is that work reported? <br />
\Section{Nifty Gadgets my way}<br />
What is the outline of your way? <br />
Have you published it before? <br />
--><br />
<br />
= Implementation =<br />
EDIT NOTE: <br />
Limit overlap in implementation details between this chapter and experimentation implementation. Build up a base that can be built on in chapter 4 and 5. <br />
<br />
== Python implementation ==<br />
* why python? <br />
** Problems: <br />
*** Not quick to run <br />
** Pros: <br />
*** Quick to prototype in <br />
*** Generally easy to debug<br />
*** Multiple good tools for machine learning<br />
**** \cite{tensorflow}<br />
**** \cite{keras}<br />
**** Why are these good?<br />
*** Other packages<br />
**** Matplotlib (visualization)<br />
**** Numpy (math stuffs)<br />
**** Pickle (data logging)<br />
* code structure<br />
** Object oriented<br />
*** Parameterizable for easy prototyping of PathNet structures<br />
** Class structure: <br />
*** Modules<br />
*** Layers<br />
*** PathNet<br />
**** Functionality for<br />
***** Building random paths<br />
***** Creating keras models<br />
***** static methods for creating pathnet structures<br />
***** reset backend session<br />
*** Tasks<br />
*** Search<br />
*** Plot generating<br />
* Training on gpu<br />
** Quicker in general for ML<br />
** This implementation does a lot on the CPU<br />
*** Other implementations could take advantage of customizing layers and models in keras. <br />
* Notable differences in implementation <br />
** Keras implementation<br />
** Path fitness is accuracy, not negative error<br />
** Exp 2: fitness calculated before evaluation (not in the same step)<br />
** No noise added to the training data<br />
* Implementation problems<br />
** TensorFlow sessions are not made for working with multiple graphs<br />
*** Resetting the backend session after a number of models are made<br />
** TensorFlow-GPU's default is to use all the GPU memory it can <br />
*** Limiting memory allocation to grow only as needed<br />
** A TensorFlow session does not free allocated memory before the Python thread is done. <br />
*** Run all experiments through threads. <br />
* Code available on github<br />
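To complement the class list above, the central genotype can be sketched directly (parameter names are hypothetical, not the thesis code): a path activates, for each layer, a small subset of that layer's modules:<br />

```python
import random

def random_path(n_layers, modules_per_layer, max_active, rng=random):
    # A path genotype: for each layer, the sorted indices of the modules
    # that are active in that layer (at most max_active per layer).
    return [sorted(rng.sample(range(modules_per_layer),
                              rng.randint(1, max_active)))
            for _ in range(n_layers)]

rng = random.Random(0)
path = random_path(n_layers=3, modules_per_layer=10, max_active=4, rng=rng)
# 'path' is a list of three layers, each holding 1-4 distinct module indices.
```

Building a trainable Keras model from such a genotype then amounts to wiring up only the listed modules and summing their outputs layer by layer.<br />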
<br />
<br />
== Datasets == <br />
<br />
=== MNIST ===<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples over the classes in both the training set and the validation set, along with the portion of each set that each class constitutes. }<br />
\label{table:MNIST class distribution}<br />
\begin{tabular}{ccccc}<br />
& \multicolumn{2}{c}{Training data} & \multicolumn{2}{c}{Validation data} \\<br />
Class number (Digit) & Number of samples & \% of whole set & Number of samples & \% of whole set \\<br />
0 & 5923 & 9.9\% & 980 & 9.8\% \\<br />
1 & 6742 & 11.2\% & 1135 & 11.3\% \\<br />
2 & 5958 & 9.9\% & 1032 & 10.3\% \\<br />
3 & 6131 & 10.2\% & 1010 & 10.1\% \\<br />
4 & 5842 & 9.7\% & 982 & 9.8\% \\<br />
5 & 5421 & 9.0\% & 892 & 8.9\% \\<br />
6 & 5918 & 9.9\% & 958 & 9.6\% \\<br />
7 & 6265 & 10.4\% & 1028 & 10.3\% \\<br />
8 & 5851 & 9.8\% & 974 & 9.7\% \\<br />
9 & 5949 & 9.9\% & 1009 & 10.1\% <br />
\end{tabular}<br />
\end{table}<br />
--><br />
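MNIST's class balance is easy to verify from the standard per-class training counts (60&nbsp;000 images in total); a quick sketch:<br />

```python
# Per-class sample counts in the standard MNIST training set (digits 0-9).
train_counts = [5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949]

total = sum(train_counts)                              # 60000
shares = [round(100 * c / total, 1) for c in train_counts]
# Digit 1 is slightly over-represented (11.2%) and digit 5 slightly
# under-represented (9.0%); otherwise the classes are close to balanced.
```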
<br />
=== SVHN === <br />
The sample distribution over the classes roughly follows Benford's law, which can be expected from a naturally occurring dataset such as this.<br />
<!--<br />
\begin{table}[h]<br />
\centering<br />
\caption{Distribution of samples over the classes in the cropped SVHN set used in this thesis, along with the portion of the whole set that each class constitutes. Given a random selection of samples from this set, this percentage is approximately the probability of selecting each class}<br />
\label{table:SVHN class distribution}<br />
\begin{tabular}{ccc}<br />
Class number (Digit) & Number of samples & \% of whole dataset \\<br />
0 & 45550 & 8.6\% \\<br />
1 & 90560 & 17.0\% \\<br />
2 & 74740 & 14.1\% \\<br />
3 & 60765 & 11.5\% \\<br />
4 & 50633 & 9.5\% \\<br />
5 & 53490 & 10.1\% \\<br />
6 & 41582 & 7.8\% \\<br />
7 & 43997 & 8.3\% \\<br />
8 & 35358 & 6.7\% \\<br />
9 & 34456 & 6.5\% <br />
\end{tabular}<br />
\end{table} <br />
--><br />
<br />
* Data type<br />
* Use cases and citations<br />
* How does the data look?<br />
* set sizes and class distributions<br />
* state of the art and human level performance<br />
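Benford's law makes the expected proportions explicit: a leading digit d occurs with probability log10(1 + 1/d). A short sketch, with the SVHN class shares of the cropped set entered by hand for comparison:<br />

```python
import math

# Benford's law: expected proportion of leading digit d is log10(1 + 1/d).
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
# benford[1] ~ 0.301, benford[9] ~ 0.046; the proportions sum to exactly 1,
# since the product of (1 + 1/d) for d = 1..9 telescopes to 10.

# Class shares of digits 1-9 in the cropped SVHN set (as fractions).
svhn = {1: 0.170, 2: 0.141, 3: 0.115, 4: 0.095, 5: 0.101,
        6: 0.078, 7: 0.083, 8: 0.067, 9: 0.065}
# Digit 1 is by far the most common, as Benford's law predicts, but the
# decay is flatter than log10(1 + 1/d): house-number digits are not all
# leading digits, which dilutes the effect.
```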
<br />
== Search implementation == <br />
* Functions: call back to the theoretical background and GA terms<br />
* parameterization<br />
<br />
<br />
= Experiment 1: Search versus Selection =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Experiment 2: Selection Pressure =<br />
<br />
<!--<br />
X Did you actually build it? <br />
X How can you test it? <br />
X How did you test it? <br />
X Why did you test it this way? <br />
Are the results satisfactory? <br />
X Why should you (not) test it more? <br />
X What compensations had to be made to interpret the results? <br />
Why did you succeed/fail? <br />
<br />
\section{Result} <br />
Answers to the forwarded questions by means of the achieved results. <br />
\subsection{Common errors}<br />
The results are not properly connected to the problem; blurry presentation; the results are inter-mixed with discussion. <br />
--><br />
<br />
= Discussion =<br />
Are your results satisfactory? <br />
Can they be improved? <br />
Is there a need for improvement? <br />
Are other approaches worth trying out? <br />
Will some restriction be lifted? <br />
Will you save the world with your Nifty Gadget? <br />
<br />
== Discussion == <br />
Discussion of the accuracy and relevance of the results; comparison with other researchers' results. <br />
=== Common errors ===<br />
Too far-reaching conclusions; guesswork not supported by the data; introduction of a new problem and a discussion around it. <br />
<br />
== Conclusion == <br />
Consequences of the achieved results, for example for new research, theory and applications. <br />
<br />
=== Common errors ===<br />
The conclusions are too far-reaching with respect to the achieved results; the conclusions do not correspond with the purpose. <br />
<br />
= Ending =</div>Martijho