# Publications

### Submitted Papers

1. McAleer, S., Lanier, J., Baldi, P., & Fox, R.. (2021). XDO: A Double Oracle Algorithm for Extensive Form Games.

Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm for two-player zero-sum games that has empirically found approximate Nash equilibria in large games. Although PSRO is guaranteed to converge to a Nash equilibrium, it may take an exponential number of iterations as the number of infostates grows. We propose XDO, a new extensive-form double oracle algorithm that is guaranteed to converge to an approximate Nash equilibrium linearly in the number of infostates. Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate. We also introduce Neural XDO (NXDO), where the best response is learned through deep RL. In tabular experiments on Leduc poker, we find that XDO achieves an approximate Nash equilibrium in a number of iterations 1-2 orders of magnitude smaller than PSRO. In experiments on a modified Leduc poker game, we show that tabular XDO achieves over 11x lower exploitability than CFR and over 82x lower exploitability than PSRO and XFP in the same amount of time. We also show that NXDO beats PSRO and is competitive with NFSP on a large no-limit poker game.

### Journal Articles

1. Agostinelli, F.*, McAleer, S.*, Shmakov, A.*, & Baldi, P. (2019). Solving the Rubik’s cube with deep reinforcement learning and search. In Nature Machine Intelligence.

The Rubik’s cube is a prototypical combinatorial puzzle that has a large state space with a single goal state. The goal state is unlikely to be accessed using sequences of randomly generated moves, posing unique challenges for machine learning. We solve the Rubik’s cube with DeepCubeA, a deep reinforcement learning approach that learns how to solve increasingly difficult states in reverse from the goal state without any specific domain knowledge. DeepCubeA solves 100% of all test configurations, finding a shortest path to the goal state 60.3% of the time. DeepCubeA generalizes to other combinatorial puzzles and is able to solve the 15 puzzle, 24 puzzle, 35 puzzle, 48 puzzle, Lights Out and Sokoban, finding a shortest path in the majority of verifiable cases.

### Conference and Workshop Papers

1. McAleer, S.*, Lanier, J.*, Fox, R., & Baldi, P. (2020). Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games. In Conference on Neural Information Processing Systems (NeurIPS)

Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable general method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of $10^{50}$. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots.

1. Khadka, S., Majumdar, S., Miret, S., McAleer, S., & Tumer, K. (2019). Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination. In International Conference on Machine Learning (ICML).

Many cooperative multiagent reinforcement learning environments provide agents with a sparse team-based reward as well as a dense agent-specific reward that incentivizes learning basic skills. Training policies solely on the team-based reward is often difficult due to its sparsity. Also, relying solely on the agent-specific reward is sub-optimal because it usually does not capture the team coordination objective. A common approach is to use reward shaping to construct a proxy reward. However, this requires manual design and tuning for each environment. We introduce Multiagent Evolutionary Reinforcement Learning (MERL), a split-level training platform that handles the two objectives separately through two optimization processes. An evolutionary algorithm maximizes the sparse team-based objective through neuroevolution. Concurrently, a gradient-based optimizer trains policies to maximize dense agent-specific rewards. These gradient-based policies are periodically copied into the evolutionary population. This enables the evolutionary algorithm to use skills learned via the agent-specific rewards toward optimizing the global objective. Results demonstrate that MERL significantly outperforms state-of-the-art methods such as MADDPG on a number of difficult coordination benchmarks.

2. Shmakov, A., Lanier, J., McAleer, S., Achar, R., Lopes, C., & Baldi, P. (2019). ColosseumRL: A Framework for Multiagent Reinforcement Learning in N-Player Games. In AAAI Spring Symposium Series: Challenges and Opportunities for Multi-Agent Reinforcement Learning.

Much of recent success in multiagent reinforcement learning has been in two-player zero-sum games. In these games, algorithms such as fictitious self-play and minimax tree search can converge to an approximate Nash equilibrium. While playing a Nash equilibrium strategy in a two-player zero-sum game is optimal, in an n-player general sum game, it becomes a much less informative solution concept. Despite the lack of a satisfying solution concept, n-player games form the vast majority of real-world multiagent situations. In this paper we present a new framework for research in reinforcement learning in n-player games. We hope that by analyzing behavior learned by agents in these environments the community can better understand this important research area and move toward meaningful solution concepts and research directions. The implementation and additional information about this framework can be found at this https URL.

3. McAleer, S.*, Agostinelli, F.*, Shmakov, A.*, & Baldi, P. (2018). Solving the Rubik's Cube with Approximate Policy Iteration. In International Conference on Learning Representations (ICLR).

Recently, Approximate Policy Iteration (API) algorithms have achieved super-human proficiency in two-player zero-sum games such as Go, Chess, and Shogi without human data. These API algorithms iterate between two policies: a slow policy (tree search), and a fast policy (a neural network). In these two-player games, a reward is always received at the end of the game. However, the Rubik’s Cube has only a single solved state, and episodes are not guaranteed to terminate. This poses a major problem for these API algorithms since they rely on the reward received at the end of the game. We introduce Autodidactic Iteration: an API algorithm that overcomes the problem of sparse rewards by training on a distribution of states that allows the reward to propagate from the goal state to states farther away. Autodidactic Iteration is able to learn how to solve the Rubik’s Cube and the 15-puzzle without relying on human data. Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves—less than or equal to solvers that employ human domain knowledge.

4. Shao, S., McAleer, S., Yan, R., & Baldi, P. (2018). Highly Accurate Machine Fault Diagnosis Using Deep Transfer Learning. In IEEE Transactions on Industrial Informatics 15.

We develop a novel deep learning framework to achieve highly accurate machine fault diagnosis using transfer learning to enable and accelerate the training of deep neural network. Compared with existing methods, the proposed method is faster to train and more accurate. First, original sensor data are converted to images by conducting a Wavelet transformation to obtain time-frequency distributions. Next, a pretrained network is used to extract lower level features. The labeled time-frequency images are then used to fine-tune the higher levels of the neural network architecture. This paper creates a machine fault diagnosis pipeline and experiments are carried out to verify the effectiveness and generalization of the pipeline on three main mechanical datasets including induction motors, gearboxes, and bearings with sizes of 6000, 9000, and 5000 time series samples, respectively. We achieve state-of-the-art results on each dataset, with most datasets showing test accuracy near 100%, and in the gearbox dataset, we achieve significant improvement from 94.8% to 99.64%. We created a repository including these datasets located at mlmechanics.ics.uci.edu.

### Arxiv Papers

1. Anker, A., Baldi, P., Barwick, SW., Bergman, D., Bernhoff, H., Besson, DZ., Bingefors, N., Botner, O., Chen, P., Chen, Y., García-Fernández, D., Gaswint, G., Glaser, C., Hallgren, A., Hanson, JC., Huang, JJ., Klein, SR., Kleinfelder, SA., Kuo, C-Y., Lahmann, R., Latif, U., Liu, T., Lyu, Y., McAleer, S., Nam, J., Novikov, A., Nelles, A., Paul, MP., Persichilli, C., Plaisier, I., Shiao, JY., Tatar, J., van Vliet, A., Wang, S-H., Wang, Y-H., & Welling, C. (2020). White Paper: ARIANNA-200 high energy neutrino telescope.

The proposed ARIANNA-200 neutrino detector, located at sea-level on the Ross Ice Shelf, Antarctica, consists of 200 autonomous and independent detector stations separated by 1 kilometer in a uniform triangular mesh, and serves as a pathfinder mission for the future IceCube-Gen2 project. The primary science mission of ARIANNA-200 is to search for sources of neutrinos with energies greater than 10^ 17 eV, complementing the reach of IceCube. An ARIANNA observation of a neutrino source would provide strong insight into the enigmatic sources of cosmic rays. ARIANNA observes the radio emission from high energy neutrino interactions in the Antarctic ice. Among radio based concepts under current investigation, ARIANNA-200 would uniquely survey the vast majority of the southern sky at any instant in time, and an important region of the northern sky, by virtue of its location on the surface of the Ross Ice Shelf in Antarctica. The broad sky coverage is specific to the Moore's Bay site, and makes ARIANNA-200 ideally suited to contribute to the multi-messenger thrust by the US National Science Foundation, Windows on the Universe-Multi-Messenger Astrophysics, providing capabilities to observe explosive sources from unknown directions. The ARIANNA architecture is designed to measure the angular direction to within 3 degrees for every neutrino candidate, which too plays an important role in the pursuit of multi-messenger observations of astrophysical sources.

2. Lanier, J., McAleer, S., & Baldi, P. (2019). Curiosity-Driven Multi-Criteria Hindsight Experience Replay.

Dealing with sparse rewards is a longstanding challenge in reinforcement learning. The recent use of hindsight methods have achieved success on a variety of sparse-reward tasks, but they fail on complex tasks such as stacking multiple blocks with a robot arm in simulation. Curiosity-driven exploration using the prediction error of a learned dynamics model as an intrinsic reward has been shown to be effective for exploring a number of sparse-reward environments. We present a method that combines hindsight with curiosity-driven exploration and curriculum learning in order to solve the challenging sparse-reward block stacking task. We are the first to stack more than two blocks using only sparse reward without human demonstrations.