\section{Policies}
For each environment, there exists a corresponding policy.
\input{02_product-documentation/04_implementation/02_policies/01_start_policy}
\subsection{Start policy}
The start policy handles the takeoff of the drone.
However, since takeoff and landing are very similar tasks and the start policy generalizes quite well, it can also be used for the landing task.
\subsubsection{Hyperparameters}
\begin{lstlisting}[language=python]
num_episodes = 10000 # ensures training isn't stopped too early
num_steps_per_episode = 2000
# ensures that episodes in which the policy reaches its
# target remain in the replay buffer; 10000 was too small,
# as only the last 5 episodes were kept in the replay buffer
replay_buffer_capacity = 100000
# simple tasks with a large learning rate require a large
# batch size to prevent the loss from collapsing to 0
batch_size = 1024
# largest possible value; a larger learning rate causes the
# policy to not solve the task
critic_learning_rate = 3e-4
actor_learning_rate = 3e-4
alpha_learning_rate = 3e-4
# simple task -> future rewards can be discounted heavily,
# i.e. gamma can be quite low
# fine-tuned through trial and error
gamma = 0.9
# simple task -> few neurons
# the z velocity requires a second layer
actor_fc_layer_params = (16, 16)
critic_joint_fc_layer_params = actor_fc_layer_params
# the following hyperparameters are taken from utils.py
target_update_tau = 0.005
target_update_period = 1
\end{lstlisting}
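The following sketch shows how these hyperparameters could be wired into a SAC agent.
TF-Agents as the framework, the environment object \lstinline[language=python]{tf_env}, and the network classes are assumptions based on the parameter names above, not necessarily the exact setup used.
\begin{lstlisting}[language=python]
import tensorflow as tf
from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent
from tf_agents.networks import actor_distribution_network

# tf_env is a hypothetical TF-Agents environment wrapping the
# start/landing task
observation_spec = tf_env.observation_spec()
action_spec = tf_env.action_spec()

# actor network: maps observations to an action distribution
actor_net = actor_distribution_network.ActorDistributionNetwork(
    observation_spec,
    action_spec,
    fc_layer_params=actor_fc_layer_params)

# critic network: joint layers over (observation, action) pairs
critic_net = critic_network.CriticNetwork(
    (observation_spec, action_spec),
    joint_fc_layer_params=critic_joint_fc_layer_params)

agent = sac_agent.SacAgent(
    tf_env.time_step_spec(),
    action_spec,
    actor_network=actor_net,
    critic_network=critic_net,
    actor_optimizer=tf.keras.optimizers.Adam(
        learning_rate=actor_learning_rate),
    critic_optimizer=tf.keras.optimizers.Adam(
        learning_rate=critic_learning_rate),
    alpha_optimizer=tf.keras.optimizers.Adam(
        learning_rate=alpha_learning_rate),
    target_update_tau=target_update_tau,
    target_update_period=target_update_period,
    gamma=gamma)
agent.initialize()
\end{lstlisting}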
The parameters \lstinline[language=python]{target_update_tau} and \lstinline[language=python]{target_update_period} are explained in
\autoref{sec:implementation/networks-target-networks}.
They only affect the \hyperref[sec:implementation/critic-network]{critic network}.
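Concretely, assuming the common soft (Polyak) update scheme that these parameter names suggest, every \lstinline[language=python]{target_update_period} training steps the target network weights $\theta'$ are moved towards the current critic weights $\theta$:
\begin{equation*}
    \theta' \leftarrow \tau \theta + (1 - \tau) \theta'
\end{equation*}
With $\tau = 0.005$ and a period of $1$, the target network therefore tracks the critic slowly and smoothly.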
\subsubsection{Learning curve}
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{implementation/learning_curves/start_policy.png}
\caption{Learning curve of the start policy}
\end{figure}
In the opaque, smoothed curve, it is clearly visible how the policy approaches the optimum, which lies approximately at $-1 \times 10^4$.
The partially transparent, unsmoothed curve already shows good results at 70k steps, but while testing the policy, I noticed that it still performs poorly in some cases.
Therefore, the smoothed curve is the better indicator of whether the policy successfully solves the task.
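For reference, the smoothed curve can be approximated from the raw episode returns with a TensorBoard-style exponential moving average; the following is a sketch under that assumption, not necessarily the exact smoothing applied by the plotting tool.
\begin{lstlisting}[language=python]
def smooth(values, weight=0.97):
    # TensorBoard-style exponential moving average: each point
    # is a weighted mix of the previous smoothed value and the
    # current raw value
    smoothed = []
    last = values[0]
    for value in values:
        last = weight * last + (1 - weight) * value
        smoothed.append(last)
    return smoothed
\end{lstlisting}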
\subsubsection{Demo}
A demonstration video of the drone starting is available \href{https://cloud.joos.io/index.php/s/QRpNi2WnzRt6XrE}{here}.
A demonstration video of the drone landing is available \href{https://cloud.joos.io/index.php/s/tRC7DETDXFo8mPf}{here}.