References
Abadi, Martín, Paul Barham, Jianmin Chen, et
al. 2016. “TensorFlow: A System for
Large-Scale Machine Learning.” 12th USENIX
Symposium on Operating Systems
Design and Implementation (OSDI
16), 265–83. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.
Abdel-Hamid, Ossama, Abdel-Rahman Mohamed, Hui Jiang, Li Deng, Gerald
Penn, and Dong Yu. 2014. “Convolutional Neural Networks for Speech
Recognition.” IEEE/ACM Transactions
on Audio, Speech, and Language
Processing 22 (10): 1533–45. https://doi.org/10.1109/taslp.2014.2339736.
Ahmed, Amr, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and
Alexander J Smola. 2012. “Scalable Inference in Latent Variable
Models.” Proceedings of the Fifth
ACM International Conference on
Web Search and Data
Mining, 123–32. https://doi.org/10.1145/2124295.2124312.
Akiba, T., S. Sano, T. Yanase, T. Ohta, and M. Koyama. 2019.
“Optuna: A Next-Generation Hyperparameter
Optimization Framework.” Proceedings of the 25th
ACM SIGKDD International
Conference on Knowledge Discovery
& Data Mining. https://doi.org/10.1145/3292500.3330701.
Alayrac, Jean-Baptiste, Jeff Donahue, Pauline Luc,
et al. 2022. “Flamingo: A Visual Language Model for
Few-Shot Learning.” ArXiv:2204.14198. https://arxiv.org/abs/2204.14198.
Alsallakh, Bilal, Narine Kokhlikyan, Vivek Miglani, Jun Yuan, and Orion
Reblitz-Richardson. 2020. “Mind the PAD –
CNNs Can Develop Blind Spots.”
ArXiv:2010.02178. https://arxiv.org/abs/2010.02178.
Anil, Rohan, Andrew M Dai, Orhan Firat, et
al. 2023. “PaLM 2 Technical
Report.” ArXiv:2305.10403. https://arxiv.org/abs/2305.10403.
Anil, Rohan, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer.
2020. “Scalable Second-Order Optimization for Deep
Learning.” ArXiv:2002.09018. https://arxiv.org/abs/2002.09018.
Aronszajn, Nachman. 1950. “Theory of Reproducing Kernels.” Transactions of
the American Mathematical
Society 68 (3): 337–404. https://doi.org/10.1090/s0002-9947-1950-0051437-7.
Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016.
“Layer Normalization.”
ArXiv:1607.06450. https://arxiv.org/abs/1607.06450.
Baevski, Alexei, and Michael Auli. 2018. “Adaptive Input
Representations for Neural Language Modeling.”
International Conference on Learning
Representations. https://openreview.net/forum?id=ByxZX20qFQ.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. “Neural
Machine Translation by Jointly Learning to Align and Translate.”
International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1409.0473.
Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, et
al. 2022. “Constitutional AI: Harmlessness
from AI Feedback.”
ArXiv:2212.08073. https://arxiv.org/abs/2212.08073.
Baptista, R., and M. Poloczek. 2018. “Bayesian
Optimization of Combinatorial Structures.” Proceedings of the
35th International Conference on
Machine Learning. https://proceedings.mlr.press/v80/baptista18a.html.
Bardenet, R., M. Brendel, B. Kégl, and M. Sebag. 2013.
“Collaborative Hyperparameter Tuning.” Proceedings of
the 30th International Conference on
Machine Learning (ICML’13).
https://proceedings.mlr.press/v28/bardenet13.html.
Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. 2006.
“SURF: Speeded up Robust
Features.” European Conference on
Computer Vision, 404–17. https://doi.org/10.1007/11744023_32.
Bellman, R. 1966. “Dynamic Programming.” Science
153: 34–37. https://doi.org/10.1126/science.153.3731.34.
Bellman, Richard. 1952. “On the Theory of Dynamic
Programming.” Proceedings of the National
Academy of Sciences 38 (8): 716–19. https://doi.org/10.1073/pnas.38.8.716.
Bellman, Richard. 1957a. “A Markovian Decision
Process.” Journal of Mathematics and
Mechanics 6 (5): 679–84. http://www.jstor.org/stable/24900506.
Bellman, Richard. 1957b. Dynamic Programming.
Princeton University Press. https://doi.org/10.1515/9781400835386.
Beltagy, Iz, Matthew E Peters, and Arman Cohan. 2020. “Longformer:
The Long-Document Transformer.”
ArXiv:2004.05150. https://arxiv.org/abs/2004.05150.
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin.
2003. “A Neural Probabilistic Language Model.”
Journal of Machine Learning
Research 3 (Feb): 1137–55. https://jmlr.org/papers/v3/bengio03a.html.
Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. 1994.
“Learning Long-Term Dependencies with Gradient Descent Is
Difficult.” IEEE Transactions on
Neural Networks 5 (2): 157–66. https://doi.org/10.1109/72.279181.
Bergstra, James, and Yoshua Bengio. 2012. “Random Search for
Hyper-Parameter Optimization.” Journal of Machine Learning
Research 13: 281–305.
Bergstra, James, Olivier Breuleux, Frédéric Bastien, et al. 2010.
“Theano: A CPU and GPU Math Compiler in
Python.” Proc. 9th Python in
Science Conference 1: 3–10. https://www.iro.umontreal.ca/~lisa/pointeurs/theano_scipy2010.pdf.
Beutel, Alex, Kenton Murray, Christos Faloutsos, and Alexander J Smola.
2014. “CoBaFi: Collaborative
Bayesian Filtering.” Proceedings of the 23rd
International Conference on World
Wide Web, 97–108. https://doi.org/10.1145/2566486.2567980.
Bishop, Chris M. 1995. “Training with Noise Is Equivalent to
Tikhonov Regularization.” Neural
Computation 7 (1): 108–16. https://doi.org/10.1162/neco.1995.7.1.108.
Bishop, Christopher M. 2006. Pattern Recognition and
Machine Learning. Springer. https://doi.org/10.1007/978-0-387-45528-0.
Black, Fischer, and Myron Scholes. 1973. “The Pricing of Options
and Corporate Liabilities.” Journal of Political
Economy 81: 637–54. https://doi.org/10.1086/260062.
Bodla, Navaneeth, Bharat Singh, Rama Chellappa, and Larry S Davis. 2017.
“Soft-NMS: Improving Object Detection with One Line of
Code.” Proceedings of the IEEE
International Conference on
Computer Vision, 5561–69. https://doi.org/10.1109/iccv.2017.593.
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov.
2017. “Enriching Word Vectors with Subword Information.”
Transactions of the Association for
Computational Linguistics 5: 135–46. https://doi.org/10.1162/tacl_a_00051.
Bollobás, B. 1999. Linear Analysis. Cambridge
University Press. https://doi.org/10.1017/CBO9780511626296.
Bommasani, Rishi, Drew A Hudson, Ehsan Adeli, et
al. 2021. “On the Opportunities and Risks of Foundation
Models.” ArXiv:2108.07258. https://arxiv.org/abs/2108.07258.
Bottou, Léon. 2010. “Large-Scale Machine Learning with Stochastic
Gradient Descent.” In Proceedings of
COMPSTAT’2010. Springer. https://doi.org/10.1201/b11429-4.
Bottou, Léon, and Yann Le Cun. 1988. “SN: A Simulator
for Connectionist Models.” Proceedings of
NeuroNimes 88 (Nimes, France), 371–82. http://leon.bottou.org/papers/bottou-lecun-88.
Boucheron, Stéphane, Olivier Bousquet, and Gábor Lugosi. 2005.
“Theory of Classification: A Survey of Some Recent
Advances.” ESAIM: Probability and
Statistics 9: 323–75. https://doi.org/10.1051/ps:2005018.
Bowman, Samuel R, Gabor Angeli, Christopher Potts, and Christopher D
Manning. 2015. “A Large Annotated Corpus for Learning Natural
Language Inference.” Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing, 632–42. https://arxiv.org/abs/1508.05326.
Boyd, Stephen, and Lieven Vandenberghe. 2004. Convex
Optimization. Cambridge University
Press. https://web.stanford.edu/~boyd/cvxbook/.
Bradley, Ralph Allan, and Milton E Terry. 1952. “Rank Analysis of
Incomplete Block Designs: I. The Method of
Paired Comparisons.” Biometrika 39 (3/4): 324–45. https://doi.org/10.1093/biomet/39.3-4.324.
Brown, Noam, and Tuomas Sandholm. 2017. “Libratus: The Superhuman
AI for No-Limit Poker.” IJCAI,
5226–28. https://doi.org/10.24963/ijcai.2017/772.
Brown, Peter F, John Cocke, Stephen A Della Pietra, et al. 1988.
“A Statistical Approach to Language Translation.”
COLING Budapest 1988 Volume
1: International Conference on
Computational Linguistics.
Brown, Peter F, John Cocke, Stephen A Della Pietra, et al. 1990.
“A Statistical Approach to Machine Translation.”
Computational Linguistics 16 (2):
79–85. https://doi.org/10.1162/coli.1990.16.2.79.
Brown, Tom, Benjamin Mann, Nick Ryder, et
al. 2020. “Language Models Are Few-Shot Learners.”
Advances in Neural Information
Processing Systems 33: 1877–901. https://arxiv.org/abs/2005.14165.
Buslaev, Alexander, Vladimir I Iglovikov, Eugene Khvedchenya, Alex
Parinov, Mikhail Druzhinin, and Alexandr A Kalinin. 2020.
“Albumentations: Fast and Flexible Image
Augmentations.” Information 11 (2): 125. https://doi.org/10.3390/info11020125.
Campbell, Murray, A Joseph Hoane Jr, and Feng-hsiung Hsu. 2002.
“Deep Blue.” Artificial Intelligence
134 (1-2): 57–83. https://doi.org/10.1016/s0004-3702(01)00129-1.
Canny, John. 1987. “A Computational Approach to Edge
Detection.” In Readings in Computer
Vision. Elsevier. https://doi.org/10.1109/tpami.1986.4767851.
Cer, Daniel, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia
Specia. 2017. “SemEval-2017 Task 1:
Semantic Textual Similarity Multilingual and Crosslingual Focused
Evaluation.” Proceedings of the 11th
International Workshop on
Semantic Evaluation
(SemEval-2017), 1–14. https://doi.org/10.18653/v1/s17-2001.
Chan, William, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. 2015.
“Listen, Attend and Spell.”
ArXiv:1508.01211. https://arxiv.org/abs/1508.01211.
Chen, Lili, Kevin Lu, Aravind Rajeswaran, et al. 2021. “Decision
Transformer: Reinforcement Learning via Sequence Modeling.”
Advances in Neural Information
Processing Systems 34: 15084–97. https://arxiv.org/abs/2106.01345.
Chen, Tianqi, Mu Li, Yutian Li, et al. 2015. “MXNet:
A Flexible and Efficient Machine Learning Library for Heterogeneous
Distributed Systems.” ArXiv:1512.01274. https://arxiv.org/abs/1512.01274.
Cheng, Jianpeng, Li Dong, and Mirella Lapata. 2016. “Long
Short-Term Memory-Networks for Machine Reading.” Proceedings
of the 2016 Conference on Empirical
Methods in Natural Language
Processing, 551–61. https://doi.org/10.18653/v1/d16-1053.
Chetlur, Sharan, Cliff Woolley, Philippe Vandermersch, et al. 2014.
“cuDNN: Efficient Primitives for Deep
Learning.” ArXiv:1410.0759. https://arxiv.org/abs/1410.0759.
Cho, Kyunghyun, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua
Bengio. 2014. “On the Properties of Neural Machine Translation:
Encoder–Decoder Approaches.”
ArXiv:1409.1259. https://arxiv.org/abs/1409.1259.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, et al. 2014.
“Learning Phrase Representations Using RNN
Encoder-Decoder for Statistical Machine Translation.”
Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing, 1724–34. https://arxiv.org/abs/1406.1078.
Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin,
et al. 2022. “PaLM: Scaling Language Modeling
with Pathways.” ArXiv:2204.02311. https://arxiv.org/abs/2204.02311.
Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio.
2014. “Empirical Evaluation of Gated Recurrent Neural Networks on
Sequence Modeling.” ArXiv:1412.3555. https://arxiv.org/abs/1412.3555.
Clark, Kevin, Minh-Thang Luong, Quoc V Le, and Christopher D Manning.
2020. “ELECTRA: Pre-Training Text Encoders as
Discriminators Rather Than Generators.”
International Conference on
Learning Representations. https://openreview.net/forum?id=r1xMH1BtvB.
Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray
Kavukcuoglu, and Pavel Kuksa. 2011. “Natural Language Processing
(Almost) from Scratch.” Journal of
Machine Learning Research
12: 2493–537. https://jmlr.org/papers/v12/collobert11a.html.
Cordonnier, Jean-Baptiste, Andreas Loukas, and Martin Jaggi. 2020.
“On the Relationship Between Self-Attention and Convolutional
Layers.” International Conference
on Learning Representations. https://openreview.net/forum?id=HJlnC1rKPB.
Cover, Thomas M, and Joy A Thomas. 1999. Elements of Information
Theory. John Wiley &
Sons. https://doi.org/10.1002/047174882X.
Csiszár, Imre. 2008. “Axiomatic Characterizations of Information
Measures.” Entropy 10 (3): 261–73. https://doi.org/10.3390/e10030261.
Cybenko, George. 1989. “Approximation by Superpositions of a
Sigmoidal Function.” Mathematics of Control,
Signals and Systems 2 (4): 303–14. https://doi.org/10.1007/bf02551274.
Dalal, Navneet, and Bill Triggs. 2005. “Histograms of Oriented
Gradients for Human Detection.” 2005 IEEE
Computer Society Conference on
Computer Vision and Pattern
Recognition (CVPR’05) 1: 886–93. https://doi.org/10.1109/cvpr.2005.177.
De Cock, Dean. 2011. “Ames, Iowa: Alternative to the
Boston Housing Data as an End of Semester Regression
Project.” Journal of Statistics
Education 19 (3). https://doi.org/10.1080/10691898.2011.11889627.
Dean, Jeffrey, Greg S Corrado, Rajat Monga, et
al. 2012. “Large Scale Distributed Deep Networks.”
Proceedings of the 25th International
Conference on Neural Information
Processing Systems, Volume
1, 1223–31. https://doi.org/10.5555/2999134.2999271.
DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, et al. 2007.
“Dynamo: Amazon’s Highly Available Key-Value
Store.” ACM SIGOPS
Operating Systems Review 41:
205–20. https://doi.org/10.1145/1323293.1294281.
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
2009. “ImageNet: A Large-Scale Hierarchical Image
Database.” 2009 IEEE Conference on
Computer Vision and Pattern
Recognition, 248–55. https://doi.org/10.1109/cvpr.2009.5206848.
Der Kiureghian, Armen, and Ove Ditlevsen. 2009. “Aleatory or
Epistemic? Does It Matter?” Structural
Safety 31 (2): 105–12. https://doi.org/10.1016/j.strusafe.2008.06.020.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018.
“BERT: Pre-Training of Deep
Bidirectional Transformers for Language Understanding.”
ArXiv:1810.04805. https://arxiv.org/abs/1810.04805.
Dinh, Laurent, David Krueger, and Yoshua Bengio. 2014.
“NICE: Non-Linear Independent Components
Estimation.” ArXiv:1410.8516. https://arxiv.org/abs/1410.8516.
Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. 2017.
“Density Estimation Using Real NVP.”
International Conference on
Learning Representations. https://openreview.net/forum?id=HkpbnH9lx.
Doersch, Carl, Abhinav Gupta, and Alexei A Efros. 2015.
“Unsupervised Visual Representation Learning by Context
Prediction.” Proceedings of the IEEE
International Conference on
Computer Vision, 1422–30. https://doi.org/10.1109/iccv.2015.167.
Dosovitskiy, Alexey, Lucas Beyer, Alexander
Kolesnikov, et al. 2021. “An Image Is Worth 16 x 16 Words:
Transformers for Image Recognition at Scale.”
International Conference on
Learning Representations. https://openreview.net/forum?id=YicbFdNTTy.
Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive
Subgradient Methods for Online Learning and Stochastic
Optimization.” Journal of Machine
Learning Research 12: 2121–59. https://jmlr.org/papers/v12/duchi11a.html.
Dumoulin, Vincent, and Francesco Visin. 2016. “A Guide to
Convolution Arithmetic for Deep Learning.”
ArXiv:1603.07285. https://arxiv.org/abs/1603.07285.
Dwivedi, Vijay Prakash, and Xavier Bresson. 2020. “A
Generalization of Transformer Networks to Graphs.”
ArXiv:2012.09699. https://arxiv.org/abs/2012.09699.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer
Reingold, and Aaron Leon Roth. 2015. “Preserving Statistical
Validity in Adaptive Data Analysis.” Proceedings of the 47th
Annual ACM Symposium on
Theory of Computing, 117–26. https://doi.org/10.1145/2746539.2746580.
Elman, Jeffrey L. 1990. “Finding Structure in Time.”
Cognitive Science 14 (2): 179–211. https://doi.org/10.1016/0364-0213(90)90002-E.
Elsken, T., J. H. Metzen, and F. Hutter. 2018. “Neural
Architecture Search: A Survey.” ArXiv:1808.05377
[stat.ML]. https://arxiv.org/abs/1808.05377.
Fechner, Gustav Theodor. 1860. Elemente der
Psychophysik. Vol. 2. Breitkopf und
Härtel.
Fedus, William, Barret Zoph, and Noam Shazeer. 2022. “Switch
Transformers: Scaling to Trillion Parameter Models with Simple and
Efficient Sparsity.” Journal of
Machine Learning Research 23
(120): 1–39. https://arxiv.org/abs/2101.03961.
Fernando, Randima. 2004. GPU Gems:
Programming Techniques, Tips, and
Tricks for Real-Time
Graphics. Addison-Wesley.
Feurer, M., and F. Hutter. 2019. “Hyperparameter
Optimization.” In Automated Machine
Learning: Methods, Systems,
Challenges. Springer. https://doi.org/10.1007/978-3-030-05318-5_1.
Feurer, M., B. Letham, F. Hutter, and E. Bakshy. 2022. “Practical
Transfer Learning for Bayesian Optimization.”
ArXiv:1802.02219 [stat.ML]. https://arxiv.org/abs/1802.02219.
Field, David J. 1987. “Relations Between the Statistics of Natural
Images and the Response Properties of Cortical Cells.”
JOSA A 4 (12): 2379–94. https://doi.org/10.1364/josaa.4.002379.
Fisher, R A. 1925. Statistical Methods for
Research Workers. Oliver &
Boyd.
Flammarion, Nicolas, and Francis Bach. 2015. “From Averaging to
Acceleration, There Is Only a Step-Size.”
Conference on Learning
Theory, 658–95. https://proceedings.mlr.press/v40/Flammarion15.html.
Forrester, Alexander IJ, András Sóbester, and Andy J Keane. 2007.
“Multi-Fidelity Optimization via Surrogate Modelling.”
Proceedings of the Royal Society
A: Mathematical, Physical and
Engineering Sciences 463 (2088): 3251–69.
https://doi.org/10.1098/rspa.2007.1900.
Franceschi, L., M. Donini, P. Frasconi, and M. Pontil. 2017.
“Forward and Reverse Gradient-Based Hyperparameter
Optimization.” Proceedings of the 34th
International Conference on
Machine Learning (ICML’17).
https://proceedings.mlr.press/v70/franceschi17a.html.
Frankle, Jonathan, and Michael Carbin. 2019. “The Lottery Ticket
Hypothesis: Finding Sparse, Trainable Neural Networks.”
International Conference on Learning Representations. https://arxiv.org/abs/1803.03635.
Frazier, Peter I. 2018. “A Tutorial on Bayesian
Optimization.” ArXiv:1807.02811. https://arxiv.org/abs/1807.02811.
Freund, Yoav, and Robert E Schapire. 1996. “Experiments with a New
Boosting Algorithm.” Proceedings of the
International Conference on
Machine Learning 96: 148–56. https://dl.acm.org/doi/10.5555/3091696.3091715.
Friedman, Jerome H. 1987. “Exploratory Projection Pursuit.”
Journal of the American Statistical
Association 82 (397): 249–66. https://doi.org/10.1080/01621459.1987.10478427.
Frostig, Roy, Matthew James Johnson, and Chris Leary. 2018.
“Compiling Machine Learning Programs via High-Level
Tracing.” In Proceedings of Systems for Machine
Learning. https://mlsys.org/Conferences/2018/doc/19.pdf.
Fukushima, Kunihiko. 1982. “Neocognitron: A Self-Organizing Neural
Network Model for a Mechanism of Visual Pattern Recognition.” In
Competition and Cooperation in Neural
Nets. Springer. https://doi.org/10.1007/978-3-642-46466-9_18.
Gardner, Jacob, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and
Andrew G Wilson. 2018. “GPyTorch: Blackbox
Matrix–Matrix Gaussian Process Inference with
GPU Acceleration.” Advances in
Neural Information Processing
Systems 31. https://doi.org/10.5555/3327345.3327419.
Garg, Saurabh, Sivaraman Balakrishnan, Zico Kolter, and Zachary Lipton.
2021. “RATT: Leveraging Unlabeled Data to Guarantee
Generalization.” International
Conference on Machine
Learning, 3598–609. https://proceedings.mlr.press/v139/garg21a.html.
Gatys, Leon A, Alexander S Ecker, and Matthias Bethge. 2016.
“Image Style Transfer Using Convolutional Neural Networks.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 2414–23. https://doi.org/10.1109/cvpr.2016.265.
Gauss, Carl Friedrich. 1809. “Theoria Motus Corporum
Coelestium.” In Werke. Königlich
Preussische Akademie der
Wissenschaften. https://doi.org/10.1007/978-3-642-92478-1.
Gibbs, Josiah Willard. 1902. Elementary Principles of
Statistical Mechanics. Scribner’s.
Ginibre, Jean. 1965. “Statistical Ensembles of Complex,
Quaternion, and Real Matrices.” Journal of
Mathematical Physics 6 (3): 440–49. https://doi.org/10.1063/1.1704292.
Girshick, Ross. 2015. “Fast
R-CNN.” Proceedings of the
IEEE International Conference on
Computer Vision, 1440–48. https://doi.org/10.1109/iccv.2015.169.
Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014.
“Rich Feature Hierarchies for Accurate Object Detection and
Semantic Segmentation.” Proceedings of the IEEE
Conference on Computer Vision and
Pattern Recognition, 580–87. https://doi.org/10.1109/cvpr.2014.81.
Glorot, Xavier, and Yoshua Bengio. 2010. “Understanding the
Difficulty of Training Deep Feedforward Neural Networks.”
Proceedings of the 13th International
Conference on Artificial
Intelligence and Statistics, 249–56. https://proceedings.mlr.press/v9/glorot10a.html.
Goh, Gabriel. 2017. “Why Momentum Really Works.”
Distill. http://distill.pub/2017/momentum.
Goldberg, David, David Nichols, Brian M Oki, and Douglas Terry. 1992.
“Using Collaborative Filtering to Weave an Information
Tapestry.” Communications of the ACM 35
(12): 61–71. https://doi.org/10.1145/138859.138867.
Golub, Gene H, and Charles F Van Loan. 1996. Matrix
Computations. Johns Hopkins
University Press. https://jhupbooks.press.jhu.edu/title/matrix-computations.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep
Learning. MIT Press.
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, et al. 2014.
“Generative Adversarial Nets.” Advances in
Neural Information Processing
Systems, 2672–80. https://doi.org/10.5555/2969033.2969125.
Gotmare, Akhilesh, Nitish Shirish Keskar, Caiming Xiong, and Richard
Socher. 2018. “A Closer Look at Deep Learning Heuristics: Learning
Rate Restarts, Warmup and Distillation.”
ArXiv:1810.13243. https://arxiv.org/abs/1810.13243.
Goyal, Ankit, Alexey Bochkovskiy, Jia Deng, and Vladlen Koltun. 2021.
“Non-Deep Networks.”
ArXiv:2110.07641. https://arxiv.org/abs/2110.07641.
Graves, Alex. 2013. “Generating Sequences with Recurrent Neural
Networks.” ArXiv:1308.0850. https://arxiv.org/abs/1308.0850.
Graves, Alex, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst
Bunke, and Jürgen Schmidhuber. 2008. “A Novel Connectionist System
for Unconstrained Handwriting Recognition.” IEEE
Transactions on Pattern Analysis
and Machine Intelligence 31 (5): 855–68.
https://doi.org/10.1109/tpami.2008.137.
Graves, Alex, and Jürgen Schmidhuber. 2005. “Framewise Phoneme
Classification with Bidirectional LSTM and Other Neural
Network Architectures.” Neural Networks 18 (5-6):
602–10. https://doi.org/10.1016/j.neunet.2005.06.042.
Griewank, Andreas. 1989. “On Automatic Differentiation.” In
Mathematical Programming: Recent
Developments and Applications. Kluwer. https://doi.org/10.1007/bfb0092220.
Gulati, Anmol, James Qin, Chung-Cheng Chiu, et
al. 2020. “Conformer: Convolution-Augmented Transformer for
Speech Recognition.” Proc. Interspeech
2020, 5036–40. https://doi.org/10.21437/interspeech.2020-3015.
Gunawardana, Asela, and Guy Shani. 2015. “Evaluating Recommender
Systems.” In Recommender Systems
Handbook. Springer. https://doi.org/10.1007/978-1-4899-7637-6_8.
Guo, Huifeng, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He.
2017. “DeepFM: A Factorization-Machine Based Neural Network for
CTR Prediction.” Proceedings of the 26th
International Joint Conference on
Artificial Intelligence, 1725–31. https://doi.org/10.24963/ijcai.2017/239.
Guyon, Isabelle, Steve Gunn, Masoud Nikravesh, and Lotfi A Zadeh. 2008.
Feature Extraction: Foundations and
Applications. Springer. https://doi.org/10.1007/978-3-540-35488-8.
Hadjis, Stefan, Ce Zhang, Ioannis Mitliagkas, Dan Iter, and Christopher
Ré. 2016. “Omnivore: An Optimizer for Multi-Device Deep Learning
on CPUs and GPUs.”
ArXiv:1606.04487. https://arxiv.org/abs/1606.04487.
Hartley, Richard I, and Fredrik Kahl. 2009. “Global Optimization
Through Rotation Space Search.” International
Journal of Computer Vision
82 (1): 64–79. https://doi.org/10.1007/s11263-008-0186-9.
Hartley, Richard, and Andrew Zisserman. 2000. Multiple
View Geometry in Computer
Vision. Cambridge University
Press. https://doi.org/10.1017/cbo9780511811685.
He, Kaiming, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and
Ross Girshick. 2022. “Masked Autoencoders Are Scalable Vision
Learners.” Proceedings of the
IEEE/CVF Conference on
Computer Vision and Pattern
Recognition, 16000–16009. https://doi.org/10.1109/cvpr52688.2022.01553.
He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017.
“Mask R-CNN.” Proceedings of
the IEEE International Conference
on Computer Vision, 2961–69. https://doi.org/10.1109/iccv.2017.322.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015.
“Delving Deep into Rectifiers: Surpassing Human-Level Performance
on ImageNet Classification.”
Proceedings of the IEEE International
Conference on Computer
Vision, 1026–34. https://doi.org/10.1109/iccv.2015.123.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016.
“Deep Residual Learning for Image Recognition.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 770–78. https://doi.org/10.1109/cvpr.2016.90.
He, Xiangnan, and Tat-Seng Chua. 2017. “Neural Factorization
Machines for Sparse Predictive Analytics.” Proceedings of the
40th International ACM SIGIR
Conference on Research and
Development in Information
Retrieval, 355–64. https://doi.org/10.1145/3077136.3080777.
He, Xiangnan, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and
Tat-Seng Chua. 2017. “Neural Collaborative Filtering.”
Proceedings of the 26th International
Conference on World Wide
Web, 173–82. https://doi.org/10.1145/3038912.3052569.
Hebb, Donald Olding. 1949. The Organization of
Behavior. Wiley.
Hendrycks, Dan, and Kevin Gimpel. 2016. “Gaussian Error Linear
Units (GELUs).”
ArXiv:1606.08415. https://arxiv.org/abs/1606.08415.
Hennessy, John L, and David A Patterson. 2011. Computer
Architecture: A Quantitative
Approach. Elsevier. https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-383872-8.
Herlocker, Jonathan L, Joseph A Konstan, Al Borchers, and John Riedl.
1999. “An Algorithmic Framework for Performing Collaborative
Filtering.” 22nd Annual
International ACM Conference on
Research and Development in
Information Retrieval, SIGIR
1999, 230–37. https://doi.org/10.1145/312624.312682.
Hidasi, Balázs, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos
Tikk. 2015. “Session-Based Recommendations with Recurrent Neural
Networks.” ArXiv:1511.06939. https://arxiv.org/abs/1511.06939.
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising
Diffusion Probabilistic Models.” Advances in
Neural Information Processing
Systems 33: 6840–51. https://arxiv.org/abs/2006.11239.
Hochreiter, Sepp, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber.
2001. “Gradient Flow in Recurrent Nets: The Difficulty of Learning
Long-Term Dependencies.” In A Field
Guide to Dynamical Recurrent
Neural Networks. IEEE
Press.
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term
Memory.” Neural Computation 9 (8): 1735–80.
https://doi.org/10.1162/neco.1997.9.8.1735.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur
Mensch, et al. 2022. “Training Compute-Optimal Large
Language Models.” ArXiv:2203.15556. https://arxiv.org/abs/2203.15556.
Howard, Andrew, Mark Sandler, Grace Chu, et al. 2019. “Searching
for MobileNetV3.”
Proceedings of the IEEE/CVF
International Conference on
Computer Vision, 1314–24. https://doi.org/10.1109/iccv.2019.00140.
Hoyer, Patrik O, Dominik Janzing, Joris M Mooij, Jonas Peters, and
Bernhard Schölkopf. 2009. “Nonlinear Causal Discovery with
Additive Noise Models.” Advances in Neural
Information Processing
Systems, 689–96. https://doi.org/10.5555/2981780.2981826.
Hu, Jie, Li Shen, and Gang Sun. 2018. “Squeeze-and-Excitation
Networks.” Proceedings of the IEEE
Conference on Computer Vision and
Pattern Recognition, 7132–41. https://doi.org/10.1109/cvpr.2018.00745.
Hu, Yifan, Yehuda Koren, and Chris Volinsky. 2008. “Collaborative
Filtering for Implicit Feedback Datasets.” 2008 8th
IEEE International Conference on
Data Mining, 263–72. https://doi.org/10.1109/icdm.2008.22.
Hu, Zhiqiang, Roy Ka-Wei Lee, Charu C. Aggarwal, and Aston Zhang. 2022.
“Text Style Transfer: A Review and Experimental
Evaluation.” ACM SIGKDD Explorations
Newsletter 24 (1). https://doi.org/10.1145/3544903.3544906.
Huang, Cheng-Zhi Anna, Ashish Vaswani, Jakob Uszkoreit, et al. 2018.
“Music Transformer: Generating Music with Long-Term
Structure.” International
Conference on Learning
Representations. https://arxiv.org/abs/1809.04281.
Huang, Gao, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger.
2017. “Densely Connected Convolutional Networks.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 4700–4708. https://doi.org/10.1109/cvpr.2017.243.
Huang, Zhiheng, Wei Xu, and Kai Yu. 2015. “Bidirectional
LSTM–CRF Models for Sequence Tagging.”
ArXiv:1508.01991. https://arxiv.org/abs/1508.01991.
Hubel, David H, and Torsten N Wiesel. 1959. “Receptive Fields of
Single Neurones in the Cat’s Striate Cortex.” Journal of
Physiology 148 (3): 574–91. https://doi.org/10.1113/jphysiol.1959.sp006308.
Hubel, David H, and Torsten N Wiesel. 1962. “Receptive Fields,
Binocular Interaction and Functional Architecture in the Cat’s Visual
Cortex.” Journal of Physiology 160 (1):
106–54. https://doi.org/10.1113/jphysiol.1962.sp006837.
Hubel, David H, and Torsten N Wiesel. 1968. “Receptive Fields and
Functional Architecture of Monkey Striate Cortex.” Journal of
Physiology 195 (1): 215–43. https://doi.org/10.1113/jphysiol.1968.sp008455.
Hutter, F., H. Hoos, and K. Leyton-Brown. 2011. “Sequential
Model-Based Optimization for General Algorithm Configuration.”
Proceedings of the Fifth International
Conference on Learning and
Intelligent Optimization
(LION’11). https://doi.org/10.1007/978-3-642-25566-3_40.
Hutter, F., L. Kotthoff, and J. Vanschoren, eds. 2019. Automated
Machine Learning: Methods,
Systems, Challenges. Springer. https://doi.org/10.1007/978-3-030-05318-5.
Ioffe, Sergey. 2017. “Batch Renormalization: Towards Reducing
Minibatch Dependence in Batch-Normalized Models.” Advances in
Neural Information Processing
Systems, 1945–53. https://proceedings.mlr.press/v70/ioffe17a.html.
Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization:
Accelerating Deep Network Training by Reducing Internal Covariate
Shift.” International Conference on Machine Learning,
448–56. https://arxiv.org/abs/1502.03167.
Izmailov, Pavel, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and
Andrew Gordon Wilson. 2018. “Averaging Weights Leads to Wider
Optima and Better Generalization.” Uncertainty in Artificial
Intelligence, 876–85. https://arxiv.org/abs/1803.05407.
Jacot, Arthur, Franck Gabriel, and Clément Hongler. 2018. “Neural
Tangent Kernel: Convergence and Generalization in Neural
Networks.” Advances in Neural
Information Processing
Systems 31. https://arxiv.org/abs/1806.07572.
Jaeger, Herbert. 2002. Tutorial on Training Recurrent Neural
Networks, Covering BPTT, RTRL,
EKF and the “Echo State Network”
Approach. GMD-Forschungszentrum
Informationstechnik Bonn.
Jamieson, K., and A. Talwalkar. 2016. “Non-Stochastic Best Arm
Identification and Hyperparameter Optimization.” Proceedings
of the 19th International Conference on
Artificial Intelligence and
Statistics. https://proceedings.mlr.press/v51/jamieson16.html.
Jenatton, R., C. Archambeau, J. González, and M. Seeger. 2017.
“Bayesian Optimization with Tree-Structured Dependencies.” Proceedings of the 34th
International Conference on
Machine Learning (ICML’17).
https://proceedings.mlr.press/v70/jenatton17a.html.
Jia, Xianyan, Shutao Song, Wei He, et al.
2018. “Highly Scalable Deep Learning Training System with
Mixed-Precision: Training ImageNet in Four
Minutes.” ArXiv:1807.11205. https://arxiv.org/abs/1807.11205.
Jia, Yangqing, Evan Shelhamer, Jeff Donahue, et al. 2014. “Caffe:
Convolutional Architecture for Fast Feature Embedding.”
Proceedings of the 22nd ACM International
Conference on Multimedia, 675–78. https://doi.org/10.1145/2647868.2654889.
Joshi, Mandar, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer,
and Omer Levy. 2020. “SpanBERT: Improving
Pre-Training by Representing and Predicting Spans.”
Transactions of the Association for
Computational Linguistics 8: 64–77. https://arxiv.org/abs/1907.10529.
Jouppi, Norman P, Cliff Young, Nishant Patil, et
al. 2017. “In-Datacenter Performance Analysis of a Tensor
Processing Unit.” 2017 ACM/IEEE 44th
Annual International Symposium on
Computer Architecture
(ISCA), 1–12. https://doi.org/10.1145/3140659.3080246.
Kalchbrenner, Nal, Edward Grefenstette, and Phil Blunsom. 2014. “A
Convolutional Neural Network for Modelling Sentences.”
ArXiv:1404.2188. https://arxiv.org/abs/1404.2188.
Kalman, Barry L, and Stan C Kwasny. 1992. “Why Tanh: Choosing a
Sigmoidal Function.” Proceedings of the
International Joint Conference on
Neural Networks (IJCNN), 578–81. https://doi.org/10.1109/ijcnn.1992.227257.
Kaplan, Jared, Sam McCandlish, Tom Henighan, et al. 2020. “Scaling
Laws for Neural Language Models.”
ArXiv:2001.08361. https://arxiv.org/abs/2001.08361.
Karnin, Z., T. Koren, and O. Somekh. 2013. “Almost Optimal
Exploration in Multi-Armed Bandits.” Proceedings of the 30th
International Conference on
Machine Learning (ICML’13).
https://proceedings.mlr.press/v28/karnin13.html.
Karras, Tero, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018.
“Progressive Growing of GANs for Improved Quality,
Stability, and Variation.” International Conference on
Learning Representations. https://arxiv.org/abs/1710.10196.
Kim, Jaeyoung, Mostafa El-Khamy, and Jungwon Lee. 2017. “Residual
LSTM: Design of a Deep Recurrent Architecture for Distant
Speech Recognition.” ArXiv:1701.03360. https://arxiv.org/abs/1701.03360.
Kim, Yoon. 2014. “Convolutional Neural Networks for Sentence
Classification.” ArXiv:1408.5882. https://arxiv.org/abs/1408.5882.
Kimeldorf, G. S., and G. Wahba. 1971. “Some Results on
Tchebycheffian Spline Functions.” Journal of
Mathematical Analysis and Applications 33: 82–95.
https://doi.org/10.1016/0022-247x(71)90184-3.
Kingma, Diederik P, and Jimmy Ba. 2015. “Adam: A Method for
Stochastic Optimization.” International Conference on
Learning Representations. https://arxiv.org/abs/1412.6980.
Kingma, Diederik P., and Max Welling. 2014. “Auto-Encoding
Variational Bayes.” International
Conference on Learning
Representations (ICLR). https://arxiv.org/abs/1312.6114.
Kipf, Thomas N, and Max Welling. 2017. “Semi-Supervised
Classification with Graph Convolutional Networks.”
International Conference on Learning Representations. https://arxiv.org/abs/1609.02907.
Kojima, Takeshi, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and
Yusuke Iwasawa. 2022. “Large Language Models Are Zero-Shot
Reasoners.” ArXiv:2205.11916. https://arxiv.org/abs/2205.11916.
Koller, Daphne, and Nir Friedman. 2009. Probabilistic
Graphical Models: Principles and
Techniques. MIT Press. https://doi.org/10.7551/mitpress/7432.001.0001.
Kolmogorov, Andrey. 1933. “Sulla determinazione empirica di una
legge di distribuzione.” Giornale dell’Istituto Italiano degli
Attuari 4: 83–91.
Kolter, Zico. 2008. “Linear Algebra Review and Reference.”
Available online:
http://cs229.stanford.edu/section/cs229-linalg.pdf.
Koren, Yehuda, Robert Bell, and Chris Volinsky. 2009. “Matrix
Factorization Techniques for Recommender Systems.”
Computer 42 (8): 30–37. https://doi.org/10.1109/mc.2009.263.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012.
“ImageNet Classification with Deep Convolutional
Neural Networks.” Advances in Neural
Information Processing
Systems, 1097–105. https://doi.org/10.5555/2999134.2999257.
Kung, Sun Yuan. 1988. VLSI Array
Processors. Prentice Hall.
Kuzovkin, Ilya, Raul Vicente, Mathilde Petton, et al. 2018.
“Activations of Deep Convolutional Neural Networks Are Aligned
with Gamma Band Activity of Human Visual Cortex.”
Communications Biology 1 (1): 1–12. https://doi.org/10.1038/s42003-018-0110-y.
Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush
Sharma, and Radu Soricut. 2019. “ALBERT: A Lite
BERT for Self-Supervised Learning of Language
Representations.” ArXiv:1909.11942. https://arxiv.org/abs/1909.11942.
Lavin, Andrew, and Scott Gray. 2016. “Fast Algorithms for
Convolutional Neural Networks.” Proceedings of the
IEEE Conference on Computer
Vision and Pattern
Recognition, 4013–21. https://doi.org/10.1109/cvpr.2016.435.
Le, Quoc V. 2013. “Building High-Level Features Using Large Scale
Unsupervised Learning.” Proceedings of the IEEE
International Conference on
Acoustics, Speech and Signal
Processing, 8595–98. https://doi.org/10.1109/icassp.2013.6639343.
LeCun, Yann, and Yoshua Bengio. 1995.
“Convolutional Networks for Images, Speech, and Time
Series.” In The Handbook of Brain
Theory and Neural Networks.
MIT Press. http://yann.lecun.com/exdb/publis/pdf/lecun-bengio-95a.pdf.
LeCun, Yann, Bernhard Boser, John S Denker, et al. 1989.
“Backpropagation Applied to Handwritten Zip Code
Recognition.” Neural Computation 1 (4):
541–51. https://doi.org/10.1162/neco.1989.1.4.541.
LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998.
“Gradient-Based Learning Applied to Document Recognition.”
Proceedings of the IEEE 86 (11): 2278–324. https://doi.org/10.1109/5.726791.
LeCun, Yann, Leon Bottou, G Orr, and Klaus-Robert Muller. 1998.
“Efficient Backprop.” In Neural Networks:
Tricks of the Trade. Springer. https://doi.org/10.1007/3-540-49430-8_2.
LeCun, Yann, LD Jackel, Leon Bottou, et al.
1995. “Comparison of Learning Algorithms for Handwritten Digit
Recognition.” International
Conference on Artificial Neural
Networks, 53–60.
Legendre, Adrien Marie. 1805. Mémoire sur les
opérations trigonométriques, dont les résultats
dépendent de la figure de la terre. F. Didot.
Lewis, Mike, Yinhan Liu, Naman Goyal, et al. 2019.
“BART: Denoising Sequence-to-Sequence Pre-Training
for Natural Language Generation, Translation, and Comprehension.”
ArXiv:1910.13461. https://arxiv.org/abs/1910.13461.
Lewkowycz, Aitor, Anders Andreassen, David Dohan,
et al. 2022. “Solving Quantitative Reasoning Problems with
Language Models.” ArXiv:2206.14858. https://arxiv.org/abs/2206.14858.
Li, L., K. Jamieson, A. Rostamizadeh, et al. 2018. “Massively
Parallel Hyperparameter Tuning.”
ArXiv:1810.05934. https://arxiv.org/abs/1810.05934.
Li, Mu. 2017. “Scaling Distributed
Machine Learning with System and
Algorithm Co-Design.” PhD thesis,
Carnegie Mellon University. https://www.cs.cmu.edu/~muli/file/mu-thesis.pdf.
Li, Mu, David G Andersen, Jun Woo Park, et al. 2014. “Scaling
Distributed Machine Learning with the Parameter Server.” 11th USENIX
Symposium on Operating Systems
Design and Implementation (OSDI
14), 583–98. https://doi.org/10.1145/2640087.2644155.
Li, Mu, Tong Zhang, Yuqiang Chen, and Alexander J Smola. 2014.
“Efficient Mini-Batch Training for Stochastic
Optimization.” Proceedings of the 20th ACM
SIGKDD International Conference
on Knowledge Discovery and Data
Mining, 661–70. https://doi.org/10.1145/2623330.2623612.
Liaw, R., E. Liang, R. Nishihara, P. Moritz, J. Gonzalez, and I. Stoica.
2018. “Tune: A Research Platform for Distributed
Model Selection and Training.”
ArXiv:1807.05118. https://arxiv.org/abs/1807.05118.
Lin, Min, Qiang Chen, and Shuicheng Yan. 2013. “Network in
Network.” ArXiv:1312.4400. https://arxiv.org/abs/1312.4400.
Lin, Tsung-Yi, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár.
2017. “Focal Loss for Dense Object Detection.”
Proceedings of the IEEE International
Conference on Computer
Vision, 2980–88. https://doi.org/10.1109/iccv.2017.324.
Lin, Yuanqing, F Lv, S Zhu, et al. 2010.
“ImageNet Classification: Fast Descriptor Coding and
Large-Scale SVM Training.”
Large Scale Visual Recognition Challenge. https://doi.org/10.1109/cvpr.2010.5539970.
Lin, Zhouhan, Minwei Feng, Cicero Nogueira dos Santos, et al. 2017.
“A Structured Self-Attentive Sentence Embedding.”
ArXiv:1703.03130. https://arxiv.org/abs/1703.03130.
Lipton, Zachary C, John Berkowitz, and Charles Elkan. 2015. “A
Critical Review of Recurrent Neural Networks for Sequence
Learning.” ArXiv:1506.00019. https://arxiv.org/abs/1506.00019.
Lipton, Zachary C, David C Kale, Charles Elkan, and Randall Wetzel.
2016. “Learning to Diagnose with LSTM Recurrent
Neural Networks.” International
Conference on Learning
Representations (ICLR). https://arxiv.org/abs/1511.03677.
Lipton, Zachary C, and Jacob Steinhardt. 2018. “Troubling Trends
in Machine Learning Scholarship.” Queue
17 (1): 45–77. https://doi.org/10.1145/3317287.3328534.
Liu, Dong C, and Jorge Nocedal. 1989. “On the Limited Memory
BFGS Method for Large Scale Optimization.”
Mathematical Programming 45 (1): 503–28. https://doi.org/10.1007/bf01589116.
Liu, Hanxiao, Karen Simonyan, and Yiming Yang. 2018.
“DARTS: Differentiable Architecture Search.”
ArXiv:1806.09055. https://arxiv.org/abs/1806.09055.
Liu, Wei, Dragomir Anguelov, Dumitru Erhan, et al. 2016.
“SSD: Single Shot Multibox Detector.”
European Conference on Computer
Vision, 21–37. https://doi.org/10.1007/978-3-319-46448-0_2.
Liu, Yinhan, Myle Ott, Naman Goyal, et al. 2019.
“RoBERTa: A Robustly Optimized BERT
Pretraining Approach.” ArXiv:1907.11692. https://arxiv.org/abs/1907.11692.
Liu, Ze, Yutong Lin, Yue Cao, et al. 2021. “Swin Transformer:
Hierarchical Vision Transformer Using Shifted Windows.”
Proceedings of the IEEE/CVF
International Conference on
Computer Vision, 10012–22. https://doi.org/10.1109/iccv48922.2021.00986.
Liu, Zhuang, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor
Darrell, and Saining Xie. 2022. “A ConvNet for the
2020s.” ArXiv:2201.03545. https://arxiv.org/abs/2201.03545.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. 2015. “Fully
Convolutional Networks for Semantic Segmentation.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 3431–40. https://doi.org/10.1109/cvpr.2015.7298965.
Loshchilov, Ilya, and Frank Hutter. 2016. “SGDR:
Stochastic Gradient Descent with Warm Restarts.”
ArXiv:1608.03983. https://arxiv.org/abs/1608.03983.
Lowe, David G. 2004. “Distinctive Image Features from
Scale-Invariant Keypoints.” International
Journal of Computer Vision
60 (2): 91–110. https://doi.org/10.1023/b:visi.0000029664.99615.94.
Luo, Ping, Xinjiang Wang, Wenqi Shao, and Zhanglin Peng. 2018.
“Towards Understanding Regularization in Batch
Normalization.” ArXiv:1809.00846. https://arxiv.org/abs/1809.00846.
Maas, Andrew L, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng,
and Christopher Potts. 2011. “Learning Word Vectors for Sentiment
Analysis.” Proceedings of the 49th Annual
Meeting of the Association for
Computational Linguistics: Human
Language Technologies, Volume
1, 142–50. https://aclanthology.org/P11-1015.
Mack, Yue-Pok, and Bernard W Silverman. 1982. “Weak and Strong
Uniform Consistency of Kernel Regression Estimates.”
Zeitschrift für Wahrscheinlichkeitstheorie
und verwandte Gebiete 61 (3): 405–15. https://doi.org/10.1007/bf00539840.
MacKay, David JC. 2003. Information Theory,
Inference and Learning
Algorithms. Cambridge University
Press. https://www.inference.org.uk/mackay/itila/book.html.
Maclaurin, D., D. Duvenaud, and R. Adams. 2015. “Gradient-Based
Hyperparameter Optimization Through Reversible Learning.”
Proceedings of the 32nd International
Conference on Machine Learning
(ICML’15). https://proceedings.mlr.press/v37/maclaurin15.html.
Mangasarian, O. L. 1965. “Linear and Nonlinear Separation of
Patterns by Linear Programming.” Operations Research
13: 444–52. https://doi.org/10.1287/opre.13.3.444.
Mangram, Myles E. 2013. “A Simplified Perspective of the
Markowitz Portfolio Theory.” Global
Journal of Business Research
7 (1): 59–70. https://scholarworks.iu.edu/journals/index.php/jiuspa/article/view/4517.
Matthews, Alexander G de G, Mark Rowland, Jiri Hron, Richard E Turner,
and Zoubin Ghahramani. 2018. “Gaussian Process Behaviour in Wide
Deep Neural Networks.” ArXiv:1804.11271. https://arxiv.org/abs/1804.11271.
McCann, Bryan, James Bradbury, Caiming Xiong, and Richard Socher. 2017.
“Learned in Translation: Contextualized Word
Vectors.” Advances in Neural
Information Processing
Systems, 6294–305. https://doi.org/10.5555/3294996.3295037.
McCulloch, Warren S, and Walter Pitts. 1943. “A Logical Calculus
of the Ideas Immanent in Nervous Activity.” Bulletin of
Mathematical Biophysics 5 (4): 115–33. https://doi.org/10.1016/s0092-8240(05)80006-0.
McMahan, H Brendan, Gary Holt, David Sculley, et
al. 2013. “Ad Click Prediction: A View from the
Trenches.” Proceedings of the 19th ACM SIGKDD
International Conference on
Knowledge Discovery and Data
Mining, 1222–30. https://doi.org/10.1145/2487575.2488200.
Mead, Carver, and Lynn Conway. 1980. Introduction to
VLSI Systems. Addison-Wesley.
Merity, Stephen, Caiming Xiong, James Bradbury, and Richard Socher.
2016. “Pointer Sentinel Mixture Models.”
ArXiv:1609.07843. https://arxiv.org/abs/1609.07843.
Micchelli, Charles A. 1984. “Interpolation of Scattered Data:
Distance Matrices and Conditionally Positive Definite Functions.”
In Approximation Theory and Spline
Functions. Springer. https://doi.org/10.1007/bf01893414.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013.
“Efficient Estimation of Word Representations in Vector
Space.” ArXiv:1301.3781. https://arxiv.org/abs/1301.3781.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean.
2013. “Distributed Representations of Words and Phrases and Their
Compositionality.” Advances in Neural
Information Processing
Systems, 3111–19. https://doi.org/10.5555/2999792.2999959.
Miller, George A. 1995. “WordNet: A Lexical Database
for English.” Communications of the
ACM 38 (11): 39–41. https://doi.org/10.1145/219717.219748.
Mirhoseini, Azalia, Hieu Pham, Quoc V Le, et al. 2017. “Device
Placement Optimization with Reinforcement Learning.”
Proceedings of the 34th International
Conference on Machine
Learning, 2430–39. https://proceedings.mlr.press/v70/mirhoseini17a.html.
Mnih, Volodymyr, Nicolas Heess, Alex Graves, et
al. 2014. “Recurrent Models of Visual Attention.”
Advances in Neural Information
Processing Systems, 2204–12. https://doi.org/10.5555/2969033.2969073.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, et al. 2013.
“Playing Atari with Deep Reinforcement
Learning.” ArXiv:1312.5602. https://arxiv.org/abs/1312.5602.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver,
et al. 2015. “Human-Level Control Through Deep
Reinforcement Learning.” Nature 518 (7540):
529–33. https://doi.org/10.1038/nature14236.
Moon, Taesup, Alex Smola, Yi Chang, and Zhaohui Zheng. 2010.
“Intervalrank: Isotonic Regression with Listwise and Pairwise
Constraints.” Proceedings of the 3rd ACM
International Conference on Web
Search and Data Mining,
151–60. https://doi.org/10.1145/1718487.1718520.
Morey, Richard D, Rink Hoekstra, Jeffrey N Rouder, Michael D Lee, and
Eric-Jan Wagenmakers. 2016. “The Fallacy of Placing Confidence in
Confidence Intervals.” Psychonomic Bulletin
& Review 23 (1): 103–23. https://doi.org/10.3758/s13423-015-0947-8.
Morozov, Vladimir Alekseevich. 1984. Methods for
Solving Incorrectly Posed
Problems. Springer.
Nadaraya, Elizbar A. 1964. “On Estimating Regression.”
Theory of Probability & Its
Applications 9 (1): 141–42. https://doi.org/10.1137/1109020.
Nair, Vinod, and Geoffrey E Hinton. 2010. “Rectified Linear Units
Improve Restricted Boltzmann Machines.”
ICML, 807–14. https://dl.acm.org/doi/10.5555/3104322.3104425.
Nakkiran, Preetum, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak,
and Ilya Sutskever. 2021. “Deep Double Descent: Where Bigger
Models and More Data Hurt.” Journal of
Statistical Mechanics: Theory and
Experiment 2021 (12): 124003. https://doi.org/10.1088/1742-5468/ac3a74.
Naor, Moni, and Omer Reingold. 1999. “On the Construction of
Pseudorandom Permutations: Luby–Rackoff
Revisited.” Journal of Cryptology 12 (1):
29–66. https://doi.org/10.1007/s001459900037.
Neal, Radford M. 1996. Bayesian Learning for
Neural Networks. Springer. https://doi.org/10.1007/978-1-4612-0745-0.
Nesterov, Yu. 2018. Lectures on Convex
Optimization. Springer. https://doi.org/10.1007/978-3-319-91578-4.
Nesterov, Yu, and J-Ph Vial. 2008. “Confidence Level Solutions for
Stochastic Programming.” Automatica 44 (6): 1559–68. https://doi.org/10.1016/j.automatica.2008.01.017.
Neyman, Jerzy. 1937. “Outline of a Theory of Statistical
Estimation Based on the Classical Theory of Probability.”
Philosophical Transactions of the Royal
Society of London. Series
A, Mathematical and Physical
Sciences 236 (767): 333–80. https://doi.org/10.1098/rsta.1937.0005.
Norelli, Antonio, Marco Fumero, Valentino Maiorca, Luca Moschella,
Emanuele Rodolà, and Francesco Locatello. 2022.
“ASIF: Coupled Data Turns Unimodal Models to
Multimodal Without Training.”
ArXiv:2210.01738. https://arxiv.org/abs/2210.01738.
Novak, Roman, Lechao Xiao, Jaehoon Lee, et al. 2018. “Bayesian
Deep Convolutional Networks with Many Channels Are Gaussian
Processes.” ArXiv:1810.05148. https://arxiv.org/abs/1810.05148.
Novikoff, A. B. J. 1962. “On Convergence Proofs for
Perceptrons.” Proceedings of the Symposium on
the Mathematical Theory of
Automata, 615–22. https://cs.nyu.edu/~mohri/pub/nov62.pdf.
Olshausen, Bruno A, and David J Field. 1996. “Emergence of
Simple-Cell Receptive Field Properties by Learning a Sparse Code for
Natural Images.” Nature 381 (6583): 607–9. https://doi.org/10.1038/381607a0.
Ong, Cheng Soon, Alexander Smola, and Robert Williamson. 2005.
“Learning the Kernel with Hyperkernels.”
Journal of Machine Learning
Research 6: 1043–71. https://jmlr.org/papers/v6/ong05a.html.
Ouyang, Long, Jeff Wu, Xu Jiang, et al.
2022. “Training Language Models to Follow Instructions with Human
Feedback.” ArXiv:2203.02155. https://arxiv.org/abs/2203.02155.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002.
“BLEU: A Method for Automatic Evaluation of Machine
Translation.” Proceedings of the 40th Annual
Meeting of the Association for
Computational Linguistics, 311–18. https://doi.org/10.3115/1073083.1073135.
Parikh, Ankur P, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit.
2016. “A Decomposable Attention Model for Natural Language
Inference.” ArXiv:1606.01933. https://arxiv.org/abs/1606.01933.
Park, Taesung, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019.
“Semantic Image Synthesis with Spatially-Adaptive
Normalization.” Proceedings of the IEEE
Conference on Computer Vision and
Pattern Recognition, 2337–46. https://doi.org/10.1109/cvpr.2019.00244.
Parzen, Emanuel. 1957. “On Consistent Estimates of the Spectrum of
a Stationary Time Series.” Annals of
Mathematical Statistics 28: 329–48. https://doi.org/10.1214/aoms/1177706962.
Paszke, Adam, Sam Gross, Francisco Massa, et
al. 2019. “PyTorch: An Imperative Style,
High-Performance Deep Learning Library.” Advances in
Neural Information Processing
Systems 32: 8026–37. https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
Paulus, Romain, Caiming Xiong, and Richard Socher. 2017. “A Deep
Reinforced Model for Abstractive Summarization.”
ArXiv:1705.04304. https://arxiv.org/abs/1705.04304.
Penedo, Guilherme, Quentin Malartic, Daniel Hesslow, et al. 2023.
“The RefinedWeb Dataset for
Falcon LLM: Outperforming Curated Corpora with
Web Data, and Web Data Only.”
ArXiv:2306.01116. https://arxiv.org/abs/2306.01116.
Pennington, Jeffrey, Samuel Schoenholz, and Surya Ganguli. 2017.
“Resurrecting the Sigmoid in Deep Learning Through Dynamical
Isometry: Theory and Practice.” Advances in
Neural Information Processing
Systems, 4785–95. https://proceedings.neurips.cc/paper/2017/hash/a9fc2d3b0c721b5b3f0b3f11bb24a75c-Abstract.html.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014.
“GloVe: Global Vectors for Word
Representation.” Proceedings of the 2014
Conference on Empirical Methods
in Natural Language Processing
(EMNLP), 1532–43. https://doi.org/10.3115/v1/d14-1162.
Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf. 2017.
Elements of Causal Inference:
Foundations and Learning
Algorithms. MIT Press. https://doi.org/10.7551/mitpress/11283.001.0001.
Peters, Matthew, Waleed Ammar, Chandra Bhagavatula, and Russell Power.
2017. “Semi-Supervised Sequence Tagging with Bidirectional
Language Models.” Proceedings of the 55th Annual
Meeting of the Association for
Computational Linguistics, Volume
1, 1756–65. https://doi.org/10.18653/v1/p17-1161.
Peters, Matthew, Mark Neumann, Mohit Iyyer, et al. 2018. “Deep
Contextualized Word Representations.” Proceedings of the 2018
Conference of the North American
Chapter of the Association for
Computational Linguistics: Human
Language Technologies, Volume
1, 2227–37. https://doi.org/10.18653/v1/n18-1202.
Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2008. The
Matrix Cookbook. Technical University of Denmark. https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.
Pleiss, Geoff, Danlu Chen, Gao Huang, Tongcheng Li, Laurens Van Der
Maaten, and Kilian Q Weinberger. 2017. “Memory-Efficient
Implementation of DenseNets.”
ArXiv:1707.06990. https://arxiv.org/abs/1707.06990.
Polyak, Boris T. 1964. “Some Methods of Speeding up the
Convergence of Iteration Methods.” USSR
Computational Mathematics and
Mathematical Physics 4 (5): 1–17. https://doi.org/10.1016/0041-5553(64)90137-5.
Prakash, Aaditya, Sadid A Hasan, Kathy Lee, et al. 2016. “Neural
Paraphrase Generation with Stacked Residual LSTM
Networks.” ArXiv:1610.03098. https://arxiv.org/abs/1610.03098.
Qin, Chengwei, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro
Yasunaga, and Diyi Yang. 2023. “Is
ChatGPT a General-Purpose Natural Language
Processing Task Solver?” ArXiv:2302.06476.
https://arxiv.org/abs/2302.06476.
Quadrana, Massimo, Paolo Cremonesi, and Dietmar Jannach. 2018.
“Sequence-Aware Recommender Systems.” ACM
Computing Surveys 51 (4): 66. https://doi.org/10.1145/3190616.
Quinlan, J Ross. 1993. C4.5: Programs for
Machine Learning. Elsevier. https://doi.org/10.1016/c2009-0-27846-9.
Rabiner, Lawrence, and Biing-Hwang Juang. 1993. Fundamentals of
Speech Recognition.
Prentice-Hall.
Radford, Alec, Jong Wook Kim, Chris Hallacy, et
al. 2021. “Learning Transferable Visual Models from Natural
Language Supervision.” International
Conference on Machine Learning, 8748–63. https://proceedings.mlr.press/v139/radford21a.html.
Radford, Alec, Luke Metz, and Soumith Chintala. 2015.
“Unsupervised Representation Learning with Deep Convolutional
Generative Adversarial Networks.”
ArXiv:1511.06434. https://arxiv.org/abs/1511.06434.
Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
2018. “Improving Language Understanding by Generative
Pre-Training.” OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and
Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask
Learners.” OpenAI Blog 1 (8):
9. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Radosavovic, Ilija, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr
Dollár. 2019. “On Network Design Spaces for Visual
Recognition.” Proceedings of the
IEEE/CVF International
Conference on Computer
Vision, 1882–90. https://doi.org/10.1109/iccv.2019.00052.
Radosavovic, Ilija, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and
Piotr Dollár. 2020. “Designing Network Design Spaces.”
Proceedings of the IEEE/CVF
Conference on Computer Vision and
Pattern Recognition, 10428–36. https://doi.org/10.1109/cvpr42600.2020.01044.
Rae, Jack W, Sebastian Borgeaud, Trevor Cai, et
al. 2021. “Scaling Language Models: Methods, Analysis &
Insights from Training Gopher.”
ArXiv:2112.11446. https://arxiv.org/abs/2112.11446.
Raffel, Colin, Noam Shazeer, Adam Roberts, et al. 2020. “Exploring
the Limits of Transfer Learning with a Unified Text-to-Text
Transformer.” Journal of Machine
Learning Research 21: 1–67. https://arxiv.org/abs/1910.10683.
Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang.
2016. “SQuAD:
100,000+ Questions for Machine Comprehension of Text.”
ArXiv:1606.05250. https://arxiv.org/abs/1606.05250.
Ramachandran, Prajit, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm
Levskaya, and Jon Shlens. 2019. “Stand-Alone Self-Attention in
Vision Models.” Advances in Neural
Information Processing
Systems 32. https://arxiv.org/abs/1906.05909.
Ramachandran, Prajit, Barret Zoph, and Quoc V Le. 2017. “Searching
for Activation Functions.”
ArXiv:1710.05941. https://arxiv.org/abs/1710.05941.
Ramesh, Aditya, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark
Chen. 2022. “Hierarchical Text-Conditional Image Generation with
CLIP Latents.” ArXiv:2204.06125. https://arxiv.org/abs/2204.06125.
Ramón y Cajal, Santiago, and L. Azoulay. 1894. Les
nouvelles idées sur la structure du système
nerveux chez l’homme et chez les
vertébrés. Paris: C.
Reinwald & Cie.
Ranzato, Marc-Aurelio, Y-Lan Boureau, Sumit Chopra, and Yann LeCun.
2007. “A Unified Energy-Based Framework for Unsupervised
Learning.” Artificial Intelligence and
Statistics, 371–79. https://proceedings.mlr.press/v2/ranzato07a.html.
Rasmussen, Carl Edward, and Christopher KI Williams. 2006. Gaussian
Processes for Machine
Learning. MIT Press. https://gaussianprocess.org/gpml/.
Reddi, Sashank J, Satyen Kale, and Sanjiv Kumar. 2019. “On the
Convergence of Adam and Beyond.”
ArXiv:1904.09237. https://arxiv.org/abs/1904.09237.
Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016.
“You Only Look Once: Unified, Real-Time Object Detection.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 779–88. https://doi.org/10.1109/cvpr.2016.91.
Redmon, Joseph, and Ali Farhadi. 2018. “YOLOv3: An
Incremental Improvement.” ArXiv:1804.02767.
https://arxiv.org/abs/1804.02767.
Reed, Scott, and Nando De Freitas. 2015. “Neural
Programmer-Interpreters.” ArXiv:1511.06279.
https://arxiv.org/abs/1511.06279.
Reed, Scott, Konrad Zolna, Emilio Parisotto, et
al. 2022. “A Generalist Agent.”
ArXiv:2205.06175. https://arxiv.org/abs/2205.06175.
Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. 2015.
“Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks.” Advances in
Neural Information Processing
Systems, 91–99. https://doi.org/10.5555/2969239.2969250.
Rendle, Steffen. 2010. “Factorization Machines.” 2010
IEEE International Conference on
Data Mining, 995–1000. https://doi.org/10.1109/icdm.2010.127.
Rendle, Steffen, Christoph Freudenthaler, Zeno Gantner, and Lars
Schmidt-Thieme. 2009. “BPR: Bayesian
Personalized Ranking from Implicit Feedback.” Proceedings of
the 25th Conference on Uncertainty in
Artificial Intelligence, 452–61. https://arxiv.org/abs/1205.2618.
Revels, Jarrett, Miles Lubin, and Theodore Papamarkou. 2016.
“Forward-Mode Automatic Differentiation in
Julia.” ArXiv:1607.07892. https://arxiv.org/abs/1607.07892.
Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2014.
“Stochastic Backpropagation and Approximate Inference in Deep
Generative Models.” International
Conference on Machine
Learning, 1278–86. https://proceedings.mlr.press/v32/rezende14.html.
Riesenhuber, Maximilian, and Tomaso Poggio. 1999. “Hierarchical
Models of Object Recognition in Cortex.” Nature
Neuroscience 2 (11): 1019–25. https://doi.org/10.1038/14819.
Rockafellar, R. T. 1970. Convex Analysis.
Princeton University Press. https://doi.org/10.1515/9781400873173.
Rolnick, David, Andreas Veit, Serge Belongie, and Nir Shavit. 2017.
“Deep Learning Is Robust to Massive Label Noise.”
ArXiv:1705.10694. https://arxiv.org/abs/1705.10694.
Rudin, W. 1973. Functional Analysis. McGraw-Hill.
Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986.
“Learning Representations by Back-Propagating Errors.”
Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.
Russakovsky, Olga, Jia Deng, Zhiheng Huang, Alexander C. Berg, and Li
Fei-Fei. 2013. “Detecting Avocados to Zucchinis: What Have We
Done, and Where Are We Going?” International
Conference on Computer Vision
(ICCV). https://doi.org/10.1109/iccv.2013.258.
Russakovsky, Olga, Jia Deng, Hao Su, et al.
2015. “ImageNet Large Scale Visual Recognition
Challenge.” International Journal of
Computer Vision 115 (3): 211–52. https://doi.org/10.1007/s11263-015-0816-y.
Russell, Stuart J, and Peter Norvig. 2016. Artificial
Intelligence: A Modern
Approach. Pearson Education
Limited.
Saharia, Chitwan, William Chan, Saurabh Saxena, et
al. 2022. “Photorealistic Text-to-Image Diffusion Models
with Deep Language Understanding.”
ArXiv:2205.11487. https://arxiv.org/abs/2205.11487.
Salinas, D., M. Seeger, A. Klein, V. Perrone, M. Wistuba, and C.
Archambeau. 2022. “Syne Tune: A Library for Large
Scale Hyperparameter Tuning and Reproducible Research.” First
Conference on Automated Machine
Learning. https://proceedings.mlr.press/v188/salinas22a.html.
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019.
“DistilBERT, a Distilled Version of
BERT: Smaller, Faster, Cheaper and Lighter.”
ArXiv:1910.01108. https://arxiv.org/abs/1910.01108.
Sanh, Victor, Albert Webson, Colin Raffel, et
al. 2021. “Multitask Prompted Training Enables Zero-Shot
Task Generalization.” ArXiv:2110.08207. https://arxiv.org/abs/2110.08207.
Santurkar, Shibani, Dimitris Tsipras, Andrew Ilyas, and Aleksander
Madry. 2018. “How Does Batch Normalization Help
Optimization?” Advances in Neural
Information Processing
Systems, 2483–93. https://doi.org/10.5555/3327345.3327508.
Sarwar, Badrul Munir, George Karypis, Joseph A Konstan, and John Riedl.
2001. “Item-Based Collaborative Filtering Recommendation
Algorithms.” Proceedings of the 10th International
Conference on World Wide
Web, 285–95. https://doi.org/10.1145/371920.372071.
Scao, Teven Le, Angela Fan, Christopher Akiki, et
al. 2022. “BLOOM: A
176B-Parameter Open-Access Multilingual Language
Model.” ArXiv:2211.05100. https://arxiv.org/abs/2211.05100.
Schein, Andrew I, Alexandrin Popescul, Lyle H Ungar, and David M
Pennock. 2002. “Methods and Metrics for Cold-Start
Recommendations.” Proceedings of the 25th Annual
International ACM SIGIR
Conference on Research and
Development in Information
Retrieval, 253–60. https://doi.org/10.1145/564376.564421.
Schölkopf, Bernhard, Chris Burges, and Vladimir Vapnik. 1996.
“Incorporating Invariances in Support Vector Learning
Machines.” International Conference
on Artificial Neural
Networks, 47–52. https://doi.org/10.1007/3-540-61510-5_12.
Schölkopf, Bernhard, and Alexander J Smola. 2002. Learning with
Kernels: Support Vector
Machines, Regularization,
Optimization, and Beyond.
MIT Press. https://doi.org/10.7551/mitpress/4175.001.0001.
Schölkopf, B., R. Herbrich, and A. J. Smola. 2001. “A Generalized
Representer Theorem.” In Proceedings of the
Annual Conference on
Computational Learning
Theory, edited by D. P. Helmbold and B. Williamson.
Springer-Verlag. https://doi.org/10.1007/3-540-44581-1_27.
Schuhmann, Christoph, Romain Beaumont, Richard
Vencu, et al. 2022. “LAION-5B: An
Open Large-Scale Dataset for Training Next Generation Image-Text
Models.” ArXiv:2210.08402. https://arxiv.org/abs/2210.08402.
Schuster, Mike, and Kuldip K Paliwal. 1997. “Bidirectional
Recurrent Neural Networks.” IEEE Transactions on
Signal Processing 45 (11): 2673–81. https://doi.org/10.1109/78.650093.
Sedhain, Suvash, Aditya Krishna Menon, Scott Sanner, and Lexing Xie.
2015. “AutoRec: Autoencoders Meet Collaborative
Filtering.” Proceedings of the 24th
International Conference on World
Wide Web, 111–12. https://doi.org/10.1145/2740908.2742726.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural
Machine Translation of Rare Words with Subword Units.”
Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics, 1715–25. https://doi.org/10.18653/v1/P16-1162.
Sergeev, Alexander, and Mike Del Balso. 2018. “Horovod: Fast and
Easy Distributed Deep Learning in
TensorFlow.”
ArXiv:1802.05799. https://arxiv.org/abs/1802.05799.
Shannon, Claude Elwood. 1948. “A Mathematical Theory of
Communication.” The Bell System
Technical Journal 27 (3): 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
Shao, Huajie, Shuochao Yao, Dachun Sun, et al. 2020.
“ControlVAE: Controllable Variational
Autoencoder.” Proceedings of the 37th
International Conference on
Machine Learning. https://proceedings.mlr.press/v119/shao20b.html.
Shaw, Peter, Jakob Uszkoreit, and Ashish Vaswani. 2018.
“Self-Attention with Relative Position Representations.”
ArXiv:1803.02155. https://arxiv.org/abs/1803.02155.
Shoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared
Casper, and Bryan Catanzaro. 2019. “Megatron-LM:
Training Multi-Billion Parameter Language Models Using Model
Parallelism.” ArXiv:1909.08053. https://arxiv.org/abs/1909.08053.
Silver, David, Aja Huang, Chris J Maddison, et
al. 2016. “Mastering the Game of Go with Deep
Neural Networks and Tree Search.” Nature 529 (7587):
484–89. https://doi.org/10.1038/nature16961.
Silverman, B. W. 1986. Density Estimation for
Statistics and Data
Analysis. Chapman & Hall.
Simard, Patrice Y, Yann A LeCun, John S Denker, and Bernard Victorri.
1998. “Transformation Invariance in Pattern Recognition – Tangent
Distance and Tangent Propagation.” In Neural
Networks: Tricks of the
Trade. Springer. https://doi.org/10.1007/978-3-642-35289-8_17.
Simonyan, Karen, and Andrew Zisserman. 2015. “Very Deep
Convolutional Networks for Large-Scale Image Recognition.”
International Conference on Learning Representations. https://arxiv.org/abs/1409.1556.
Sindhwani, Vikas, Tara N Sainath, and Sanjiv Kumar. 2015.
“Structured Transforms for Small-Footprint Deep Learning.”
ArXiv:1510.01722. https://arxiv.org/abs/1510.01722.
Sivic, Josef, and Andrew Zisserman. 2003. “Video
Google: A Text Retrieval Approach to Object Matching in
Videos.” Proceedings of the IEEE
International Conference on
Computer Vision 3: 1470–77. https://doi.org/10.1109/iccv.2003.1238663.
Smith, Shaden, Mostofa Patwary, Brandon Norick, et
al. 2022. “Using DeepSpeed and
Megatron to Train Megatron-Turing
NLG 530B, a Large-Scale Generative Language
Model.” ArXiv:2201.11990. https://arxiv.org/abs/2201.11990.
Smola, Alexander, and Shravan Narayanamurthy. 2010. “An
Architecture for Parallel Topic Models.” Proceedings of the
VLDB Endowment 3 (1-2): 703–10. https://doi.org/10.14778/1920841.1920931.
Snoek, J., H. Larochelle, and R. Adams. 2012. “Practical
Bayesian Optimization of Machine Learning
Algorithms.” Advances in Neural
Information Processing Systems
25, 2951–59. https://doi.org/10.5555/2999325.2999464.
Sohl-Dickstein, Jascha, Eric Weiss, Niru Maheswaranathan, and Surya
Ganguli. 2015. “Deep Unsupervised Learning Using Nonequilibrium
Thermodynamics.” International
Conference on Machine
Learning, 2256–65. https://proceedings.mlr.press/v37/sohl-dickstein15.html.
Song, Yang, and Stefano Ermon. 2019. “Generative Modeling by
Estimating Gradients of the Data Distribution.” Advances in
Neural Information Processing
Systems 32. https://arxiv.org/abs/1907.05600.
Song, Yang, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar,
Stefano Ermon, and Ben Poole. 2021. “Score-Based Generative
Modeling Through Stochastic Differential Equations.”
International Conference on
Learning Representations. https://doi.org/10.52202/075280-1645.
Speelpenning, Bert. 1980. “Compiling Fast Partial Derivatives of
Functions Given by Algorithms.” PhD thesis, University of
Illinois at Urbana-Champaign. https://doi.org/10.2172/5254402.
Srivastava, Aarohi, Abhinav Rastogi, Abhishek Rao,
et al. 2022. “Beyond the Imitation Game: Quantifying and
Extrapolating the Capabilities of Language Models.”
ArXiv:2206.04615. https://arxiv.org/abs/2206.04615.
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever,
and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent
Neural Networks from Overfitting.” Journal of
Machine Learning Research 15
(1): 1929–58. https://doi.org/10.5555/2627435.2670313.
Srivastava, Rupesh Kumar, Klaus Greff, and Jürgen Schmidhuber. 2015.
“Highway Networks.” ArXiv:1505.00387.
https://arxiv.org/abs/1505.00387.
Strang, Gilbert. 1993. Introduction to Linear
Algebra. Wellesley–Cambridge
Press. https://math.mit.edu/~gs/linearalgebra/.
Su, Xiaoyuan, and Taghi M Khoshgoftaar. 2009. “A Survey of
Collaborative Filtering Techniques.” Advances in
Artificial Intelligence 2009. https://doi.org/10.1155/2009/421425.
Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. 2015.
“End-to-End Memory Networks.” Advances in
Neural Information Processing
Systems, 2440–48. https://doi.org/10.5555/2969239.2969426.
Sutskever, Ilya, James Martens, George Dahl, and Geoffrey Hinton. 2013.
“On the Importance of Initialization and Momentum in Deep
Learning.” International Conference
on Machine Learning, 1139–47. https://proceedings.mlr.press/v28/sutskever13.html.
Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. “Sequence to
Sequence Learning with Neural Networks.” Advances in
Neural Information Processing
Systems, 3104–12. https://doi.org/10.5555/2969033.2969173.
Szegedy, Christian, Sergey Ioffe, Vincent Vanhoucke, and Alexander A
Alemi. 2017. “Inception-V4,
Inception-ResNet and the Impact
of Residual Connections on Learning.” 31st AAAI
Conference on Artificial
Intelligence. https://doi.org/10.1609/aaai.v31i1.11231.
Szegedy, Christian, Wei Liu, Yangqing Jia, et al. 2015. “Going
Deeper with Convolutions.” Proceedings of the
IEEE Conference on Computer
Vision and Pattern
Recognition, 1–9. https://doi.org/10.1109/cvpr.2015.7298594.
Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
Zbigniew Wojna. 2016. “Rethinking the Inception
Architecture for Computer Vision.” Proceedings of the
IEEE Conference on Computer
Vision and Pattern
Recognition, 2818–26. https://doi.org/10.1109/cvpr.2016.308.
Tallec, Corentin, and Yann Ollivier. 2017. “Unbiasing Truncated
Backpropagation Through Time.”
ArXiv:1705.08209. https://arxiv.org/abs/1705.08209.
Tan, Mingxing, and Quoc Le. 2019. “EfficientNet:
Rethinking Model Scaling for Convolutional Neural Networks.”
International Conference on
Machine Learning, 6105–14. https://proceedings.mlr.press/v97/tan19a.html.
Tang, Jiaxi, and Ke Wang. 2018. “Personalized Top-N Sequential
Recommendation via Convolutional Sequence Embedding.”
Proceedings of the Eleventh ACM
International Conference on Web
Search and Data Mining,
565–73. https://doi.org/10.1145/3159652.3159656.
Taskar, Ben, Carlos Guestrin, and Daphne Koller. 2004. “Max-Margin
Markov Networks.” Advances in
Neural Information Processing
Systems 16: 25.
Tay, Yi, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020.
“Efficient Transformers: A Survey.”
ArXiv:2009.06732. https://arxiv.org/abs/2009.06732.
Taylor, Ross, Marcin Kardas, Guillem Cucurull, et al. 2022.
“Galactica: A Large Language Model for Science.”
ArXiv:2211.09085. https://arxiv.org/abs/2211.09085.
Teye, Mattias, Hossein Azizpour, and Kevin Smith. 2018. “Bayesian
Uncertainty Estimation for Batch Normalized Deep Networks.”
ArXiv:1802.06455. https://arxiv.org/abs/1802.06455.
Thomee, Bart, David A Shamma, Gerald Friedland, et al. 2016.
“YFCC100M: The New Data in Multimedia Research.”
Communications of the ACM 59 (2): 64–73. https://doi.org/10.1145/2812802.
Tieleman, Tijmen, and Geoffrey Hinton. 2012. “Divide the Gradient
by a Running Average of Its Recent Magnitude.” In
COURSERA: Neural Networks for
Machine Learning, Lecture 6.5: RMSProp. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
Tikhonov, A. N., and V. Y. Arsenin. 1977. Solutions of
Ill-Posed Problems.
W.H. Winston. https://doi.org/10.1137/1.9780898719741.
Tolstikhin, Ilya O, Neil Houlsby, Alexander
Kolesnikov, et al. 2021. “MLP-Mixer: An
All-MLP Architecture for Vision.” Advances in
Neural Information Processing
Systems 34. https://arxiv.org/abs/2105.01601.
Torralba, Antonio, Rob Fergus, and William T Freeman. 2008. “80
Million Tiny Images: A Large Data Set for Nonparametric Object and Scene
Recognition.” IEEE Transactions on
Pattern Analysis and Machine
Intelligence 30 (11): 1958–70. https://doi.org/10.1109/tpami.2008.128.
Töscher, Andreas, Michael Jahrer, and Robert M Bell. 2009. The
BigChaos Solution to the Netflix Grand Prize. https://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf.
Touvron, Hugo, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
Sablayrolles, and Hervé Jégou. 2021. “Training Data-Efficient
Image Transformers & Distillation Through Attention.”
International Conference on
Machine Learning, 10347–57. https://proceedings.mlr.press/v139/touvron21a.html.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, et
al. 2023a. “LLaMA: Open and
Efficient Foundation Language Models.”
ArXiv:2302.13971. https://arxiv.org/abs/2302.13971.
Touvron, Hugo, Louis Martin, Kevin Stone, et
al. 2023b. “LLaMA 2: Open
Foundation and Fine-Tuned Chat Models.”
ArXiv:2307.09288. https://arxiv.org/abs/2307.09288.
Tsoumakas, Grigorios, and Ioannis Katakis. 2007. “Multi-Label
Classification: An Overview.” International
Journal of Data Warehousing and
Mining 3 (3): 1–13. https://doi.org/10.4018/jdwm.2007070101.
Turing, Alan. 1950. “Computing Machinery and Intelligence.”
Mind 59 (236): 433–60. https://doi.org/10.1093/mind/lix.236.433.
Uijlings, Jasper RR, Koen EA Van De Sande, Theo Gevers, and Arnold WM
Smeulders. 2013. “Selective Search for Object Recognition.”
International Journal of Computer
Vision 104 (2): 154–71. https://doi.org/10.1007/s11263-013-0620-5.
Vapnik, V. 1995. The Nature of Statistical
Learning Theory. Springer.
Vapnik, V. 1998. Statistical Learning Theory. John Wiley & Sons.
Vapnik, V. N., and A. Y. Chervonenkis. 1974. “Ordered Risk
Minimization.” Automation and Remote Control 35:
1226–35, 1403–12.
Vapnik, V., and A. Chervonenkis. 1964. “A Note on One Class of
Perceptrons.” Automation and Remote Control 25.
Vapnik, V., and A. Chervonenkis. 1968. “Uniform Convergence of
Frequencies of Occurrence of Events to Their Probabilities.”
Dokl. Akad. Nauk SSSR 181: 915–18.
Vapnik, V., and A. Chervonenkis. 1971. “On the Uniform Convergence
of Relative Frequencies of Events to Their Probabilities.”
Theory of Probability and Its Applications 16 (2): 264–81.
Vapnik, V., and A. Chervonenkis. 1981. “The Necessary and
Sufficient Conditions for the Uniform Convergence of Averages to Their
Expected Values.” Teoriya Veroyatnostei i Ee Primeneniya
26 (3): 543–64.
Vapnik, V., and A. Chervonenkis. 1991. “The Necessary and
Sufficient Conditions for Consistency in the Empirical Risk Minimization
Method.” Pattern Recognition and
Image Analysis 1 (3): 283–305.
Vapnik, Vladimir. 1992. “Principles of Risk Minimization for
Learning Theory.” Advances in Neural
Information Processing
Systems, 831–38. https://doi.org/10.5555/2986916.2987019.
Vapnik, Vladimir, Esther Levin, and Yann Le Cun. 1994. “Measuring
the VC-Dimension of a Learning Machine.” Neural
Computation 6 (5): 851–76. https://doi.org/10.1162/neco.1994.6.5.851.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017.
“Attention Is All You Need.” Advances in
Neural Information Processing
Systems, 5998–6008. https://doi.org/10.5555/3295222.3295349.
Wahba, Grace. 1990. Spline Models for
Observational Data. SIAM. https://doi.org/10.1137/1.9781611970128.
Waibel, Alex, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and
Kevin J Lang. 1989. “Phoneme Recognition Using Time-Delay Neural
Networks.” IEEE Transactions on
Acoustics, Speech, and Signal
Processing 37 (3): 328–39. https://doi.org/10.1016/b978-0-08-051584-7.50037-1.
Wang, Haotao, Aston Zhang, Shuai Zheng, Xingjian Shi, Mu Li, and
Zhangyang Wang. 2022. “Removing Batch Normalization Boosts
Adversarial Training.” International
Conference on Machine
Learning, 23433–45. https://openreview.net/forum?id=2J8bBfGCPi.
Wang, Leyuan, Mu Li, Edo Liberty, and Alex J Smola. 2018. “Optimal
Message Scheduling for Aggregation.” Networks 2 (3):
2–3. https://arxiv.org/abs/1710.09465.
Wang, Qiang, Bei Li, Tong Xiao, et al. 2019. “Learning Deep
Transformer Models for Machine Translation.” Proceedings of
the 57th Annual Meeting of the
Association for Computational
Linguistics, 1810–22. https://doi.org/10.18653/v1/p19-1176.
Wang, Xuezhi, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny
Zhou. 2023. “Self-Consistency Improves Chain of Thought Reasoning
in Language Models.” International
Conference on Learning
Representations. https://openreview.net/forum?id=1PL1NIMMrw.
Wang, Yangzihao, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel,
and John D Owens. 2016. “Gunrock: A High-Performance Graph
Processing Library on the GPU.” ACM
SIGPLAN Notices 51: 11. https://doi.org/10.1145/2688500.2688538.
Warstadt, Alex, Amanpreet Singh, and Samuel R Bowman. 2019.
“Neural Network Acceptability Judgments.”
Transactions of the Association for
Computational Linguistics 7: 625–41. https://doi.org/10.1162/tacl_a_00290.
Wasserman, Larry. 2013. All of Statistics:
A Concise Course in
Statistical Inference. Springer. https://link.springer.com/book/10.1007/978-0-387-21736-9.
Watkins, Christopher JCH, and Peter Dayan. 1992.
“Q-Learning.” Machine Learning 8
(3–4): 279–92. https://doi.org/10.1007/bf00992698.
Watson, Geoffrey S. 1964. “Smooth Regression Analysis.”
Sankhyā: The Indian Journal
of Statistics, Series A,
359–72. https://doi.org/10.1007/bf02868765.
Wei, Jason, Maarten Bosma, Vincent Y Zhao, et al. 2021. “Finetuned
Language Models Are Zero-Shot Learners.”
ArXiv:2109.01652. https://arxiv.org/abs/2109.01652.
Wei, Jason, Yi Tay, Rishi Bommasani, et al.
2022a. “Emergent Abilities of Large Language Models.”
ArXiv:2206.07682. https://arxiv.org/abs/2206.07682.
Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022b. “Chain of
Thought Prompting Elicits Reasoning in Large Language Models.”
ArXiv:2201.11903. https://arxiv.org/abs/2201.11903.
Welling, Max, and Yee W Teh. 2011. “Bayesian Learning via
Stochastic Gradient Langevin Dynamics.”
Proceedings of the 28th International
Conference on Machine Learning
(ICML-11), 681–88. https://dl.acm.org/doi/10.5555/3104482.3104568.
Wengert, Robert Edwin. 1964. “A Simple Automatic Derivative
Evaluation Program.” Communications of the
ACM 7 (8): 463–64. https://doi.org/10.1145/355588.365726.
Werbos, Paul J. 1990. “Backpropagation Through Time: What It Does
and How to Do It.” Proceedings of the IEEE
78 (10): 1550–60. https://doi.org/10.1109/5.58337.
Wigner, Eugene P. 1958. “On the Distribution of the Roots of
Certain Symmetric Matrices.” Annals of Mathematics,
325–27. https://doi.org/10.2307/1970079.
Wilson, Andrew G, and Pavel Izmailov. 2020. “Bayesian Deep
Learning and a Probabilistic Perspective of Generalization.”
Advances in Neural Information
Processing Systems 33: 4697–708. https://arxiv.org/abs/2002.08791.
Wistuba, M., A. Rawat, and T. Pedapati. 2019. “A Survey on Neural
Architecture Search.” ArXiv:1905.01392. https://arxiv.org/abs/1905.01392.
Wistuba, M., N. Schilling, and L. Schmidt-Thieme. 2018. “Scalable
Gaussian Process-Based Transfer Surrogates for
Hyperparameter Optimization.” Machine
Learning 108: 43–78. https://doi.org/10.1007/s10994-017-5684-y.
Wolpert, David H, and William G Macready. 1995. No Free Lunch
Theorems for Search. Technical Report
SFI-TR-95-02-010, Santa Fe
Institute. https://www.santafe.edu/research/results/working-papers/no-free-lunch-theorems-for-search.
Wood, Frank, Jan Gasthaus, Cédric Archambeau, Lancelot James, and Yee
Whye Teh. 2011. “The Sequence Memoizer.” Communications
of the ACM 54 (2): 91–98. https://doi.org/10.1145/1897816.1897842.
Wu, Bichen, Alvin Wan, Xiangyu Yue, et al. 2018. “Shift: A Zero
Flop, Zero Parameter Alternative to Spatial Convolutions.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 9127–35. https://doi.org/10.1109/cvpr.2018.00951.
Wu, Yonghui, Mike Schuster, Zhifeng Chen, et
al. 2016. “Google’s Neural Machine Translation System:
Bridging the Gap Between Human and Machine Translation.”
ArXiv:1609.08144. https://arxiv.org/abs/1609.08144.
Xiao, Han, Kashif Rasul, and Roland Vollgraf. 2017.
“Fashion-MNIST: A Novel Image Dataset for
Benchmarking Machine Learning Algorithms.”
ArXiv:1708.07747. https://arxiv.org/abs/1708.07747.
Xiao, Lechao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz,
and Jeffrey Pennington. 2018. “Dynamical Isometry and a Mean Field
Theory of CNNs: How to Train 10,000-Layer Vanilla
Convolutional Neural Networks.” International
Conference on Machine
Learning, 5393–402. https://proceedings.mlr.press/v80/xiao18a.html.
Xie, Saining, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He.
2017. “Aggregated Residual Transformations for Deep Neural
Networks.” Proceedings of the IEEE
Conference on Computer Vision and
Pattern Recognition, 1492–500. https://doi.org/10.1109/cvpr.2017.634.
Xiong, Ruibin, Yunchang Yang, Di He, et al. 2020. “On Layer
Normalization in the Transformer Architecture.”
International Conference on
Machine Learning, 10524–33. https://proceedings.mlr.press/v119/xiong20b.html.
Xiong, Wayne, Lingfeng Wu, Fil Alleva, Jasha Droppo, Xuedong Huang, and
Andreas Stolcke. 2018. “The Microsoft 2017
Conversational Speech Recognition System.” 2018
IEEE International Conference on
Acoustics, Speech and Signal
Processing (ICASSP), 5934–38. https://doi.org/10.1109/TASLP.2018.2876459.
Yamaguchi, Kouichi, Kenji Sakamoto, Toshio Akabane, and Yoshiji
Fujimoto. 1990. “A Neural Network for Speaker-Independent Isolated
Word Recognition.” First International
Conference on Spoken Language
Processing. https://doi.org/10.21437/icslp.1990-282.
Yang, Zichao, Zhiting Hu, Yuntian Deng, Chris Dyer, and Alex Smola.
2016. “Neural Machine Translation with Recurrent Attention
Modeling.” ArXiv:1607.05108. https://arxiv.org/abs/1607.05108.
Yang, Zichao, Marcin Moczulski, Misha Denil, et al. 2015. “Deep
Fried Convnets.” Proceedings of the IEEE
International Conference on
Computer Vision, 1476–83. https://doi.org/10.1109/iccv.2015.173.
Ye, Mao, Peifeng Yin, Wang-Chien Lee, and Dik-Lun Lee. 2011.
“Exploiting Geographical Influence for Collaborative
Point-of-Interest Recommendation.” Proceedings of the 34th
International ACM SIGIR
Conference on Research and
Development in Information
Retrieval, 325–34. https://doi.org/10.1145/2009916.2009962.
You, Yang, Igor Gitman, and Boris Ginsburg. 2017. “Large Batch
Training of Convolutional Networks.”
ArXiv:1708.03888. https://arxiv.org/abs/1708.03888.
Yu, Jiahui, Yuanzhong Xu, Jing Yu Koh, et al. 2022. “Scaling
Autoregressive Models for Content-Rich Text-to-Image Generation.”
ArXiv:2206.10789. https://arxiv.org/abs/2206.10789.
Zaheer, Manzil, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv
Kumar. 2018. “Adaptive Methods for Nonconvex Optimization.”
Advances in Neural Information
Processing Systems, 9793–803. https://proceedings.neurips.cc/paper/2018/hash/90365351ccc7437a1309dc64e4db32a3-Abstract.html.
Zeiler, Matthew D. 2012. “ADADELTA: An Adaptive
Learning Rate Method.” ArXiv:1212.5701. https://arxiv.org/abs/1212.5701.
Zeiler, Matthew D, and Rob Fergus. 2013. “Stochastic Pooling for
Regularization of Deep Convolutional Neural Networks.”
ArXiv:1301.3557. https://arxiv.org/abs/1301.3557.
Zhang, Aston, Yi Tay, Shuai Zhang, et al. 2021. “Beyond
Fully-Connected Layers with Quaternions: Parameterization of
Hypercomplex Multiplications with 1/n Parameters.”
International Conference on
Learning Representations. https://openreview.net/forum?id=rcQdycl0zyk.
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol
Vinyals. 2021. “Understanding Deep Learning (Still) Requires
Rethinking Generalization.” Communications of the
ACM 64 (3): 107–15. https://doi.org/10.1145/3446776.
Zhang, Shuai, Lina Yao, Aixin Sun, and Yi Tay. 2019. “Deep
Learning Based Recommender System: A Survey and New
Perspectives.” ACM Computing
Surveys 52 (1): 5. https://doi.org/10.1145/3285029.
Zhang, Susan, Stephen Roller, Naman Goyal, et
al. 2022. “OPT: Open Pre-Trained Transformer
Language Models.” ArXiv:2205.01068. https://arxiv.org/abs/2205.01068.
Zhang, Wei, Jun Tanida, Kazuyoshi Itoh, and Yoshiki Ichioka. 1988.
“Shift-Invariant Pattern Recognition Neural Network and Its
Optical Architecture.” Proceedings of Annual
Conference of the Japan Society
of Applied Physics.
Zhang, Yifu, Peize Sun, Yi Jiang, et al. 2021.
“ByteTrack: Multi-Object Tracking by Associating
Every Detection Box.” ArXiv:2110.06864. https://arxiv.org/abs/2110.06864.
Zhang, Zhuosheng, Aston Zhang, Mu Li, and Alex Smola. 2023a.
“Automatic Chain of Thought Prompting in Large Language
Models.” International Conference
on Learning Representations. https://openreview.net/forum?id=5NTt8GFjUHkr.
Zhang, Zhuosheng, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex
Smola. 2023b. “Multimodal Chain-of-Thought Reasoning in Language
Models.” ArXiv:2302.00923. https://arxiv.org/abs/2302.00923.
Zhao, Zhong-Qiu, Peng Zheng, Shou-tao Xu, and Xindong Wu. 2019.
“Object Detection with Deep Learning: A Review.”
IEEE Transactions on Neural
Networks and Learning
Systems 30 (11): 3212–32. https://doi.org/10.1109/tnnls.2018.2876865.
Zhou, Denny, Nathanael Schärli, Le Hou, et al. 2023.
“Least-to-Most Prompting Enables Complex Reasoning in Large
Language Models.” International
Conference on Learning
Representations. https://openreview.net/forum?id=WZH7099tgfM.
Zhu, Jun-Yan, Taesung Park, Phillip Isola, and Alexei A Efros. 2017.
“Unpaired Image-to-Image Translation Using Cycle-Consistent
Adversarial Networks.” Proceedings of the IEEE
International Conference on
Computer Vision, 2223–32. https://doi.org/10.1109/iccv.2017.244.
Zhu, Yukun, Ryan Kiros, Rich Zemel, et al. 2015. “Aligning Books
and Movies: Towards Story-Like Visual Explanations by Watching Movies
and Reading Books.” Proceedings of the IEEE
International Conference on
Computer Vision, 19–27. https://doi.org/10.1109/iccv.2015.11.
Zoph, Barret, and Quoc V Le. 2016. “Neural Architecture Search
with Reinforcement Learning.”
ArXiv:1611.01578. https://arxiv.org/abs/1611.01578.