References
Abadi, Martín, Paul Barham, Jianmin Chen, et
al. 2016. “TensorFlow: A System for
Large-Scale Machine Learning.” 12th USENIX
Symposium on Operating Systems
Design and Implementation (OSDI
16), 265–83. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.
Abdel-Hamid, Ossama, Abdel-Rahman Mohamed, Hui Jiang, Li Deng, Gerald
Penn, and Dong Yu. 2014. “Convolutional Neural Networks for Speech
Recognition.” IEEE/ACM Transactions
on Audio, Speech, and Language
Processing 22 (10): 1533–45. https://doi.org/10.1109/taslp.2014.2339736.
Ahmed, Amr, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and
Alexander J Smola. 2012. “Scalable Inference in Latent Variable
Models.” Proceedings of the Fifth
ACM International Conference on
Web Search and Data
Mining, 123–32. https://doi.org/10.1145/2124295.2124312.
Akiba, T., S. Sano, T. Yanase, T. Ohta, and M. Koyama. 2019.
“Optuna: A Next-Generation Hyperparameter
Optimization Framework.” Proceedings of the 25th
ACM SIGKDD International
Conference on Knowledge Discovery
& Data Mining. https://doi.org/10.1145/3292500.3330701.
Alayrac, Jean-Baptiste, Jeff Donahue, Pauline Luc,
et al. 2022. “Flamingo: A Visual Language Model for
Few-Shot Learning.” ArXiv:2204.14198. https://arxiv.org/abs/2204.14198.
Alsallakh, Bilal, Narine Kokhlikyan, Vivek Miglani, Jun Yuan, and Orion
Reblitz-Richardson. 2020. “Mind the PAD –
CNNs Can Develop Blind Spots.”
ArXiv:2010.02178. https://arxiv.org/abs/2010.02178.
Anil, Rohan, Andrew M Dai, Orhan Firat, et
al. 2023. “PaLM 2 Technical
Report.” ArXiv:2305.10403. https://arxiv.org/abs/2305.10403.
Anil, Rohan, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer.
2020. “Scalable Second-Order Optimization for Deep
Learning.” ArXiv:2002.09018. https://arxiv.org/abs/2002.09018.
Aronszajn, Nachman. 1950. “Theory of Reproducing Kernels.” Transactions of
the American Mathematical
Society 68 (3): 337–404. https://doi.org/10.1090/s0002-9947-1950-0051437-7.
Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016.
“Layer Normalization.”
ArXiv:1607.06450. https://arxiv.org/abs/1607.06450.
Baevski, Alexei, and Michael Auli. 2018. “Adaptive Input
Representations for Neural Language Modeling.”
International Conference on Learning
Representations. https://openreview.net/forum?id=ByxZX20qFQ.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. “Neural
Machine Translation by Jointly Learning to Align and Translate.”
International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1409.0473.
Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, et
al. 2022. “Constitutional AI: Harmlessness
from AI Feedback.”
ArXiv:2212.08073. https://arxiv.org/abs/2212.08073.
Baptista, R., and M. Poloczek. 2018. “Bayesian
Optimization of Combinatorial Structures.” Proceedings of the
35th International Conference on
Machine Learning. https://proceedings.mlr.press/v80/baptista18a.html.
Bardenet, R., M. Brendel, B. Kégl, and M. Sebag. 2013.
“Collaborative Hyperparameter Tuning.” Proceedings of
the 30th International Conference on
Machine Learning (ICML’13).
https://proceedings.mlr.press/v28/bardenet13.html.
Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. 2006.
“SURF: Speeded up Robust
Features.” European Conference on
Computer Vision, 404–17. https://doi.org/10.1007/11744023_32.
Bellman, R. 1966. “Dynamic Programming.” Science
153: 34–37. https://doi.org/10.1126/science.153.3731.34.
Bellman, Richard. 1952. “On the Theory of Dynamic
Programming.” Proceedings of the National
Academy of Sciences 38 (8): 716–19. https://doi.org/10.1073/pnas.38.8.716.
Bellman, Richard. 1957a. “A Markovian Decision
Process.” Journal of Mathematics and
Mechanics 6 (5): 679–84. http://www.jstor.org/stable/24900506.
Bellman, Richard. 1957b. Dynamic Programming.
Princeton University Press. https://doi.org/10.1515/9781400835386.
Beltagy, Iz, Matthew E Peters, and Arman Cohan. 2020. “Longformer:
The Long-Document Transformer.”
ArXiv:2004.05150. https://arxiv.org/abs/2004.05150.
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin.
2003. “A Neural Probabilistic Language Model.”
Journal of Machine Learning
Research 3 (Feb): 1137–55. https://jmlr.org/papers/v3/bengio03a.html.
Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. 1994.
“Learning Long-Term Dependencies with Gradient Descent Is
Difficult.” IEEE Transactions on
Neural Networks 5 (2): 157–66. https://doi.org/10.1109/72.279181.
Bergstra, James, and Yoshua Bengio. 2012. “Random Search for
Hyper-Parameter Optimization.” Journal of Machine Learning
Research 13: 281–305.
Bergstra, James, Olivier Breuleux, Frédéric Bastien, et al. 2010.
“Theano: A CPU and GPU Math Compiler in
Python.” Proc. 9th Python in
Science Conference 1: 3–10. https://www.iro.umontreal.ca/~lisa/pointeurs/theano_scipy2010.pdf.
Beutel, Alex, Kenton Murray, Christos Faloutsos, and Alexander J Smola.
2014. “CoBaFi: Collaborative
Bayesian Filtering.” Proceedings of the 23rd
International Conference on World
Wide Web, 97–108. https://doi.org/10.1145/2566486.2567980.
Bishop, Chris M. 1995. “Training with Noise Is Equivalent to
Tikhonov Regularization.” Neural
Computation 7 (1): 108–16. https://doi.org/10.1162/neco.1995.7.1.108.
Bishop, Christopher M. 2006. Pattern Recognition and
Machine Learning. Springer. https://doi.org/10.1007/978-0-387-45528-0.
Black, Fischer, and Myron Scholes. 1973. “The Pricing of Options
and Corporate Liabilities.” Journal of Political
Economy 81: 637–54. https://doi.org/10.1086/260062.
Bodla, Navaneeth, Bharat Singh, Rama Chellappa, and Larry S Davis. 2017.
“Soft-NMS: Improving Object Detection with One Line of
Code.” Proceedings of the IEEE
International Conference on
Computer Vision, 5561–69. https://doi.org/10.1109/iccv.2017.593.
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov.
2017. “Enriching Word Vectors with Subword Information.”
Transactions of the Association for
Computational Linguistics 5: 135–46. https://doi.org/10.1162/tacl_a_00051.
Bollobás, B. 1999. Linear Analysis. Cambridge
University Press. https://doi.org/10.1017/CBO9780511626296.
Bommasani, Rishi, Drew A Hudson, Ehsan Adeli, et
al. 2021. “On the Opportunities and Risks of Foundation
Models.” ArXiv:2108.07258. https://arxiv.org/abs/2108.07258.
Bottou, Léon. 2010. “Large-Scale Machine Learning with Stochastic
Gradient Descent.” In Proceedings of
COMPSTAT’2010. Springer. https://doi.org/10.1201/b11429-4.
Bottou, Léon, and Yann Le Cun. 1988. “SN: A Simulator
for Connectionist Models.” Proceedings of
NeuroNimes 88 (Nimes, France), 371–82. http://leon.bottou.org/papers/bottou-lecun-88.
Boucheron, Stéphane, Olivier Bousquet, and Gábor Lugosi. 2005.
“Theory of Classification: A Survey of Some Recent
Advances.” ESAIM: Probability and
Statistics 9: 323–75. https://doi.org/10.1051/ps:2005018.
Bowman, Samuel R, Gabor Angeli, Christopher Potts, and Christopher D
Manning. 2015. “A Large Annotated Corpus for Learning Natural
Language Inference.” Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing, 632–42. https://arxiv.org/abs/1508.05326.
Boyd, Stephen, and Lieven Vandenberghe. 2004. Convex
Optimization. Cambridge University
Press. https://web.stanford.edu/~boyd/cvxbook/.
Bradley, Ralph Allan, and Milton E Terry. 1952. “Rank Analysis of
Incomplete Block Designs: I. The Method of
Paired Comparisons.” Biometrika 39 (3/4): 324–45. https://doi.org/10.1093/biomet/39.3-4.324.
Brown, Noam, and Tuomas Sandholm. 2017. “Libratus: The Superhuman
AI for No-Limit Poker.” IJCAI,
5226–28. https://doi.org/10.24963/ijcai.2017/772.
Brown, Peter F, John Cocke, Stephen A Della Pietra, et al. 1988.
“A Statistical Approach to Language Translation.”
COLING Budapest 1988 Volume
1: International Conference on
Computational Linguistics.
Brown, Peter F, John Cocke, Stephen A Della Pietra, et al. 1990.
“A Statistical Approach to Machine Translation.”
Computational Linguistics 16 (2):
79–85. https://doi.org/10.1162/coli.1990.16.2.79.
Brown, Tom, Benjamin Mann, Nick Ryder, et
al. 2020. “Language Models Are Few-Shot Learners.”
Advances in Neural Information
Processing Systems 33: 1877–901. https://arxiv.org/abs/2005.14165.
Buslaev, Alexander, Vladimir I Iglovikov, Eugene Khvedchenya, Alex
Parinov, Mikhail Druzhinin, and Alexandr A Kalinin. 2020.
“Albumentations: Fast and Flexible Image
Augmentations.” Information 11 (2): 125. https://doi.org/10.3390/info11020125.
Campbell, Murray, A Joseph Hoane Jr, and Feng-hsiung Hsu. 2002.
“Deep Blue.” Artificial Intelligence
134 (1-2): 57–83. https://doi.org/10.1016/s0004-3702(01)00129-1.
Canny, John. 1987. “A Computational Approach to Edge
Detection.” In Readings in Computer
Vision. Elsevier. https://doi.org/10.1109/tpami.1986.4767851.
Cer, Daniel, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia
Specia. 2017. “SemEval-2017 Task 1:
Semantic Textual Similarity Multilingual and Crosslingual Focused
Evaluation.” Proceedings of the 11th
International Workshop on
Semantic Evaluation
(SemEval-2017), 1–14. https://doi.org/10.18653/v1/s17-2001.
Chan, William, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. 2015.
“Listen, Attend and Spell.”
ArXiv:1508.01211. https://arxiv.org/abs/1508.01211.
Chen, Lili, Kevin Lu, Aravind Rajeswaran, et al. 2021. “Decision
Transformer: Reinforcement Learning via Sequence Modeling.”
Advances in Neural Information
Processing Systems 34: 15084–97. https://arxiv.org/abs/2106.01345.
Chen, Tianqi, Mu Li, Yutian Li, et al. 2015. “MXNet:
A Flexible and Efficient Machine Learning Library for Heterogeneous
Distributed Systems.” ArXiv:1512.01274. https://arxiv.org/abs/1512.01274.
Cheng, Jianpeng, Li Dong, and Mirella Lapata. 2016. “Long
Short-Term Memory-Networks for Machine Reading.” Proceedings
of the 2016 Conference on Empirical
Methods in Natural Language
Processing, 551–61. https://doi.org/10.18653/v1/d16-1053.
Chetlur, Sharan, Cliff Woolley, Philippe Vandermersch, et al. 2014.
“cuDNN: Efficient Primitives for Deep
Learning.” ArXiv:1410.0759. https://arxiv.org/abs/1410.0759.
Cho, Kyunghyun, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua
Bengio. 2014. “On the Properties of Neural Machine Translation:
Encoder–Decoder Approaches.”
ArXiv:1409.1259. https://arxiv.org/abs/1409.1259.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, et al. 2014.
“Learning Phrase Representations Using RNN
Encoder-Decoder for Statistical Machine Translation.”
Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing, 1724–34. https://arxiv.org/abs/1406.1078.
Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin,
et al. 2022. “PaLM: Scaling Language Modeling
with Pathways.” ArXiv:2204.02311. https://arxiv.org/abs/2204.02311.
Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio.
2014. “Empirical Evaluation of Gated Recurrent Neural Networks on
Sequence Modeling.” ArXiv:1412.3555. https://arxiv.org/abs/1412.3555.
Clark, Kevin, Minh-Thang Luong, Quoc V Le, and Christopher D Manning.
2020. “ELECTRA: Pre-Training Text Encoders as
Discriminators Rather Than Generators.”
International Conference on
Learning Representations. https://openreview.net/forum?id=r1xMH1BtvB.
Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray
Kavukcuoglu, and Pavel Kuksa. 2011. “Natural Language Processing
(Almost) from Scratch.” Journal of
Machine Learning Research
12: 2493–537. https://jmlr.org/papers/v12/collobert11a.html.
Cordonnier, Jean-Baptiste, Andreas Loukas, and Martin Jaggi. 2020.
“On the Relationship Between Self-Attention and Convolutional
Layers.” International Conference
on Learning Representations. https://openreview.net/forum?id=HJlnC1rKPB.
Cover, Thomas M, and Joy A Thomas. 1999. Elements of Information
Theory. John Wiley &
Sons. https://doi.org/10.1002/047174882X.
Csiszár, Imre. 2008. “Axiomatic Characterizations of Information
Measures.” Entropy 10 (3): 261–73. https://doi.org/10.3390/e10030261.
Cybenko, George. 1989. “Approximation by Superpositions of a
Sigmoidal Function.” Mathematics of Control,
Signals and Systems 2 (4): 303–14. https://doi.org/10.1007/bf02551274.
Dalal, Navneet, and Bill Triggs. 2005. “Histograms of Oriented
Gradients for Human Detection.” 2005 IEEE
Computer Society Conference on
Computer Vision and Pattern
Recognition (CVPR’05) 1: 886–93. https://doi.org/10.1109/cvpr.2005.177.
De Cock, Dean. 2011. “Ames, Iowa: Alternative to the
Boston Housing Data as an End of Semester Regression
Project.” Journal of Statistics
Education 19 (3). https://doi.org/10.1080/10691898.2011.11889627.
Dean, Jeffrey, Greg S Corrado, Rajat Monga, et
al. 2012. “Large Scale Distributed Deep Networks.”
Proceedings of the 25th International
Conference on Neural Information
Processing Systems, Volume
1, 1223–31. https://doi.org/10.5555/2999134.2999271.
DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, et al. 2007.
“Dynamo: Amazon’s Highly Available Key-Value
Store.” ACM SIGOPS
Operating Systems Review 41:
205–20. https://doi.org/10.1145/1323293.1294281.
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
2009. “ImageNet: A Large-Scale Hierarchical Image
Database.” 2009 IEEE Conference on
Computer Vision and Pattern
Recognition, 248–55. https://doi.org/10.1109/cvpr.2009.5206848.
Der Kiureghian, Armen, and Ove Ditlevsen. 2009. “Aleatory or
Epistemic? Does It Matter?” Structural
Safety 31 (2): 105–12. https://doi.org/10.1016/j.strusafe.2008.06.020.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018.
“BERT: Pre-Training of Deep
Bidirectional Transformers for Language Understanding.”
ArXiv:1810.04805. https://arxiv.org/abs/1810.04805.
Dinh, Laurent, David Krueger, and Yoshua Bengio. 2014.
“NICE: Non-Linear Independent Components
Estimation.” ArXiv:1410.8516. https://arxiv.org/abs/1410.8516.
Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. 2017.
“Density Estimation Using Real NVP.”
International Conference on
Learning Representations. https://openreview.net/forum?id=HkpbnH9lx.
Doersch, Carl, Abhinav Gupta, and Alexei A Efros. 2015.
“Unsupervised Visual Representation Learning by Context
Prediction.” Proceedings of the IEEE
International Conference on
Computer Vision, 1422–30. https://doi.org/10.1109/iccv.2015.167.
Dosovitskiy, Alexey, Lucas Beyer, Alexander
Kolesnikov, et al. 2021. “An Image Is Worth 16 x 16 Words:
Transformers for Image Recognition at Scale.”
International Conference on
Learning Representations. https://openreview.net/forum?id=YicbFdNTTy.
Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive
Subgradient Methods for Online Learning and Stochastic
Optimization.” Journal of Machine
Learning Research 12: 2121–59. https://jmlr.org/papers/v12/duchi11a.html.
Dumoulin, Vincent, and Francesco Visin. 2016. “A Guide to
Convolution Arithmetic for Deep Learning.”
ArXiv:1603.07285. https://arxiv.org/abs/1603.07285.
Dwivedi, Vijay Prakash, and Xavier Bresson. 2020. “A
Generalization of Transformer Networks to Graphs.”
ArXiv:2012.09699. https://arxiv.org/abs/2012.09699.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer
Reingold, and Aaron Leon Roth. 2015. “Preserving Statistical
Validity in Adaptive Data Analysis.” Proceedings of the 47th
Annual ACM Symposium on
Theory of Computing, 117–26. https://doi.org/10.1145/2746539.2746580.
Elman, Jeffrey L. 1990. “Finding Structure in Time.”
Cognitive Science 14 (2): 179–211. https://doi.org/10.1016/0364-0213(90)90002-E.
Elsken, T., J. H. Metzen, and F. Hutter. 2018. “Neural
Architecture Search: A Survey.” ArXiv:1808.05377
[stat.ML]. https://arxiv.org/abs/1808.05377.
Fechner, Gustav Theodor. 1860. Elemente der
Psychophysik. Vol. 2. Breitkopf und
Härtel.
Fedus, William, Barret Zoph, and Noam Shazeer. 2022. “Switch
Transformers: Scaling to Trillion Parameter Models with Simple and
Efficient Sparsity.” Journal of
Machine Learning Research 23
(120): 1–39. https://arxiv.org/abs/2101.03961.
Fernando, Randima. 2004. GPU Gems:
Programming Techniques, Tips, and
Tricks for Real-Time
Graphics. Addison-Wesley.
Feurer, M., and F. Hutter. 2019. “Hyperparameter
Optimization.” In Automated Machine
Learning: Methods, Systems,
Challenges. Springer. https://doi.org/10.1007/978-3-030-05318-5_1.
Feurer, M., B. Letham, F. Hutter, and E. Bakshy. 2022. “Practical
Transfer Learning for Bayesian Optimization.”
ArXiv:1802.02219 [stat.ML]. https://arxiv.org/abs/1802.02219.
Field, David J. 1987. “Relations Between the Statistics of Natural
Images and the Response Properties of Cortical Cells.”
JOSA A 4 (12): 2379–94. https://doi.org/10.1364/josaa.4.002379.
Fisher, R A. 1925. Statistical Methods for
Research Workers. Oliver &
Boyd.
Flammarion, Nicolas, and Francis Bach. 2015. “From Averaging to
Acceleration, There Is Only a Step-Size.”
Conference on Learning
Theory, 658–95. https://proceedings.mlr.press/v40/Flammarion15.html.
Forrester, Alexander IJ, András Sóbester, and Andy J Keane. 2007.
“Multi-Fidelity Optimization via Surrogate Modelling.”
Proceedings of the Royal Society
A: Mathematical, Physical and
Engineering Sciences 463 (2088): 3251–69.
https://doi.org/10.1098/rspa.2007.1900.
Franceschi, L., M. Donini, P. Frasconi, and M. Pontil. 2017.
“Forward and Reverse Gradient-Based Hyperparameter
Optimization.” Proceedings of the 34th
International Conference on
Machine Learning (ICML’17).
https://proceedings.mlr.press/v70/franceschi17a.html.
Frankle, Jonathan, and Michael Carbin. 2019. “The Lottery Ticket
Hypothesis: Finding Sparse, Trainable Neural Networks.”
International Conference on Learning Representations. https://arxiv.org/abs/1803.03635.
Frazier, Peter I. 2018. “A Tutorial on Bayesian
Optimization.” ArXiv:1807.02811. https://arxiv.org/abs/1807.02811.
Freund, Yoav, and Robert E Schapire. 1996. “Experiments with a New
Boosting Algorithm.” Proceedings of the
International Conference on
Machine Learning 96: 148–56. https://dl.acm.org/doi/10.5555/3091696.3091715.
Friedman, Jerome H. 1987. “Exploratory Projection Pursuit.”
Journal of the American Statistical
Association 82 (397): 249–66. https://doi.org/10.1080/01621459.1987.10478427.
Frostig, Roy, Matthew James Johnson, and Chris Leary. 2018.
“Compiling Machine Learning Programs via High-Level
Tracing.” In Proceedings of Systems for Machine
Learning. https://mlsys.org/Conferences/2018/doc/19.pdf.
Fukushima, Kunihiko. 1982. “Neocognitron: A Self-Organizing Neural
Network Model for a Mechanism of Visual Pattern Recognition.” In
Competition and Cooperation in Neural
Nets. Springer. https://doi.org/10.1007/978-3-642-46466-9_18.
Gardner, Jacob, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and
Andrew G Wilson. 2018. “GPyTorch: Blackbox
Matrix–Matrix Gaussian Process Inference with
GPU Acceleration.” Advances in
Neural Information Processing
Systems 31. https://doi.org/10.5555/3327345.3327419.
Garg, Saurabh, Sivaraman Balakrishnan, Zico Kolter, and Zachary Lipton.
2021. “RATT: Leveraging Unlabeled Data to Guarantee
Generalization.” International
Conference on Machine
Learning, 3598–609. https://proceedings.mlr.press/v139/garg21a.html.
Gatys, Leon A, Alexander S Ecker, and Matthias Bethge. 2016.
“Image Style Transfer Using Convolutional Neural Networks.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 2414–23. https://doi.org/10.1109/cvpr.2016.265.
Gauss, Carl Friedrich. 1809. “Theoria Motus Corporum
Coelestium.” In Werke. Königlich
Preussische Akademie der
Wissenschaften. https://doi.org/10.1007/978-3-642-92478-1.
Gibbs, Josiah Willard. 1902. Elementary Principles of
Statistical Mechanics. Scribner’s.
Ginibre, Jean. 1965. “Statistical Ensembles of Complex,
Quaternion, and Real Matrices.” Journal of
Mathematical Physics 6 (3): 440–49. https://doi.org/10.1063/1.1704292.
Girshick, Ross. 2015. “Fast
R-CNN.” Proceedings of the
IEEE International Conference on
Computer Vision, 1440–48. https://doi.org/10.1109/iccv.2015.169.
Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014.
“Rich Feature Hierarchies for Accurate Object Detection and
Semantic Segmentation.” Proceedings of the IEEE
Conference on Computer Vision and
Pattern Recognition, 580–87. https://doi.org/10.1109/cvpr.2014.81.
Glorot, Xavier, and Yoshua Bengio. 2010. “Understanding the
Difficulty of Training Deep Feedforward Neural Networks.”
Proceedings of the 13th International
Conference on Artificial
Intelligence and Statistics, 249–56. https://proceedings.mlr.press/v9/glorot10a.html.
Goh, Gabriel. 2017. “Why Momentum Really Works.”
Distill. http://distill.pub/2017/momentum.
Goldberg, David, David Nichols, Brian M Oki, and Douglas Terry. 1992.
“Using Collaborative Filtering to Weave an Information
Tapestry.” Communications of the ACM 35
(12): 61–71. https://doi.org/10.1145/138859.138867.
Golub, Gene H, and Charles F Van Loan. 1996. Matrix
Computations. Johns Hopkins
University Press. https://jhupbooks.press.jhu.edu/title/matrix-computations.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep
Learning. MIT Press.
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, et al. 2014.
“Generative Adversarial Nets.” Advances in
Neural Information Processing
Systems, 2672–80. https://doi.org/10.5555/2969033.2969125.
Gotmare, Akhilesh, Nitish Shirish Keskar, Caiming Xiong, and Richard
Socher. 2018. “A Closer Look at Deep Learning Heuristics: Learning
Rate Restarts, Warmup and Distillation.”
ArXiv:1810.13243. https://arxiv.org/abs/1810.13243.
Goyal, Ankit, Alexey Bochkovskiy, Jia Deng, and Vladlen Koltun. 2021.
“Non-Deep Networks.”
ArXiv:2110.07641. https://arxiv.org/abs/2110.07641.
Graves, Alex. 2013. “Generating Sequences with Recurrent Neural
Networks.” ArXiv:1308.0850. https://arxiv.org/abs/1308.0850.
Graves, Alex, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst
Bunke, and Jürgen Schmidhuber. 2008. “A Novel Connectionist System
for Unconstrained Handwriting Recognition.” IEEE
Transactions on Pattern Analysis
and Machine Intelligence 31 (5): 855–68.
https://doi.org/10.1109/tpami.2008.137.
Graves, Alex, and Jürgen Schmidhuber. 2005. “Framewise Phoneme
Classification with Bidirectional LSTM and Other Neural
Network Architectures.” Neural Networks 18 (5-6):
602–10. https://doi.org/10.1016/j.neunet.2005.06.042.
Griewank, Andreas. 1989. “On Automatic Differentiation.” In
Mathematical Programming: Recent
Developments and Applications. Kluwer. https://doi.org/10.1007/bfb0092220.
Gulati, Anmol, James Qin, Chung-Cheng Chiu, et
al. 2020. “Conformer: Convolution-Augmented Transformer for
Speech Recognition.” Proc. Interspeech
2020, 5036–40. https://doi.org/10.21437/interspeech.2020-3015.
Gunawardana, Asela, and Guy Shani. 2015. “Evaluating Recommender
Systems.” In Recommender Systems
Handbook. Springer. https://doi.org/10.1007/978-1-4899-7637-6_8.
Guo, Huifeng, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He.
2017. “DeepFM: A Factorization-Machine Based Neural Network for
CTR Prediction.” Proceedings of the 26th
International Joint Conference on
Artificial Intelligence, 1725–31. https://doi.org/10.24963/ijcai.2017/239.
Guyon, Isabelle, Steve Gunn, Masoud Nikravesh, and Lotfi A Zadeh. 2008.
Feature Extraction: Foundations and
Applications. Springer. https://doi.org/10.1007/978-3-540-35488-8.
Hadjis, Stefan, Ce Zhang, Ioannis Mitliagkas, Dan Iter, and Christopher
Ré. 2016. “Omnivore: An Optimizer for Multi-Device Deep Learning
on CPUs and GPUs.”
ArXiv:1606.04487. https://arxiv.org/abs/1606.04487.
Hartley, Richard I, and Fredrik Kahl. 2009. “Global Optimization
Through Rotation Space Search.” International
Journal of Computer Vision
82 (1): 64–79. https://doi.org/10.1007/s11263-008-0186-9.
Hartley, Richard, and Andrew Zisserman. 2000. Multiple
View Geometry in Computer
Vision. Cambridge University
Press. https://doi.org/10.1017/cbo9780511811685.
He, Kaiming, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and
Ross Girshick. 2022. “Masked Autoencoders Are Scalable Vision
Learners.” Proceedings of the
IEEE/CVF Conference on
Computer Vision and Pattern
Recognition, 16000–16009. https://doi.org/10.1109/cvpr52688.2022.01553.
He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017.
“Mask R-CNN.” Proceedings of
the IEEE International Conference
on Computer Vision, 2961–69. https://doi.org/10.1109/iccv.2017.322.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015.
“Delving Deep into Rectifiers: Surpassing Human-Level Performance
on ImageNet Classification.”
Proceedings of the IEEE International
Conference on Computer
Vision, 1026–34. https://doi.org/10.1109/iccv.2015.123.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016.
“Deep Residual Learning for Image Recognition.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 770–78. https://doi.org/10.1109/cvpr.2016.90.
He, Xiangnan, and Tat-Seng Chua. 2017. “Neural Factorization
Machines for Sparse Predictive Analytics.” Proceedings of the
40th International ACM SIGIR
Conference on Research and
Development in Information
Retrieval, 355–64. https://doi.org/10.1145/3077136.3080777.
He, Xiangnan, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and
Tat-Seng Chua. 2017. “Neural Collaborative Filtering.”
Proceedings of the 26th International
Conference on World Wide
Web, 173–82. https://doi.org/10.1145/3038912.3052569.
Hebb, Donald Olding. 1949. The Organization of
Behavior. Wiley.
Hendrycks, Dan, and Kevin Gimpel. 2016. “Gaussian Error Linear
Units (GELUs).”
ArXiv:1606.08415. https://arxiv.org/abs/1606.08415.
Hennessy, John L, and David A Patterson. 2011. Computer
Architecture: A Quantitative
Approach. Elsevier. https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-383872-8.
Herlocker, Jonathan L, Joseph A Konstan, Al Borchers, and John Riedl.
1999. “An Algorithmic Framework for Performing Collaborative
Filtering.” 22nd Annual
International ACM Conference on
Research and Development in
Information Retrieval, SIGIR
1999, 230–37. https://doi.org/10.1145/312624.312682.
Hidasi, Balázs, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos
Tikk. 2015. “Session-Based Recommendations with Recurrent Neural
Networks.” ArXiv:1511.06939. https://arxiv.org/abs/1511.06939.
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising
Diffusion Probabilistic Models.” Advances in
Neural Information Processing
Systems 33: 6840–51. https://arxiv.org/abs/2006.11239.
Hochreiter, Sepp, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber.
2001. “Gradient Flow in Recurrent Nets: The Difficulty of Learning
Long-Term Dependencies.” In A Field
Guide to Dynamical Recurrent
Neural Networks. IEEE
Press.
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term
Memory.” Neural Computation 9 (8): 1735–80.
https://doi.org/10.1162/neco.1997.9.8.1735.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur
Mensch, et al. 2022. “Training Compute-Optimal Large
Language Models.” ArXiv:2203.15556. https://arxiv.org/abs/2203.15556.
Howard, Andrew, Mark Sandler, Grace Chu, et al. 2019. “Searching
for MobileNetV3.”
Proceedings of the IEEE/CVF
International Conference on
Computer Vision, 1314–24. https://doi.org/10.1109/iccv.2019.00140.
Hoyer, Patrik O, Dominik Janzing, Joris M Mooij, Jonas Peters, and
Bernhard Schölkopf. 2009. “Nonlinear Causal Discovery with
Additive Noise Models.” Advances in Neural
Information Processing
Systems, 689–96. https://doi.org/10.5555/2981780.2981826.
Hu, Jie, Li Shen, and Gang Sun. 2018. “Squeeze-and-Excitation
Networks.” Proceedings of the IEEE
Conference on Computer Vision and
Pattern Recognition, 7132–41. https://doi.org/10.1109/cvpr.2018.00745.
Hu, Yifan, Yehuda Koren, and Chris Volinsky. 2008. “Collaborative
Filtering for Implicit Feedback Datasets.” 2008 8th
IEEE International Conference on
Data Mining, 263–72. https://doi.org/10.1109/icdm.2008.22.
Hu, Zhiqiang, Roy Ka-Wei Lee, Charu C. Aggarwal, and Aston Zhang. 2022.
“Text Style Transfer: A Review and Experimental
Evaluation.” ACM SIGKDD Explorations
Newsletter 24 (1). https://doi.org/10.1145/3544903.3544906.
Huang, Cheng-Zhi Anna, Ashish Vaswani, Jakob Uszkoreit, et al. 2018.
“Music Transformer: Generating Music with Long-Term
Structure.” International
Conference on Learning
Representations. https://arxiv.org/abs/1809.04281.
Huang, Gao, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger.
2017. “Densely Connected Convolutional Networks.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 4700–4708. https://doi.org/10.1109/cvpr.2017.243.
Huang, Zhiheng, Wei Xu, and Kai Yu. 2015. “Bidirectional
LSTM–CRF Models for Sequence Tagging.”
ArXiv:1508.01991. https://arxiv.org/abs/1508.01991.
Hubel, David H, and Torsten N Wiesel. 1959. “Receptive Fields of
Single Neurones in the Cat’s Striate Cortex.” Journal of
Physiology 148 (3): 574–91. https://doi.org/10.1113/jphysiol.1959.sp006308.
Hubel, David H, and Torsten N Wiesel. 1962. “Receptive Fields,
Binocular Interaction and Functional Architecture in the Cat’s Visual
Cortex.” Journal of Physiology 160 (1):
106–54. https://doi.org/10.1113/jphysiol.1962.sp006837.
Hubel, David H, and Torsten N Wiesel. 1968. “Receptive Fields and
Functional Architecture of Monkey Striate Cortex.” Journal of
Physiology 195 (1): 215–43. https://doi.org/10.1113/jphysiol.1968.sp008455.
Hutter, F., H. Hoos, and K. Leyton-Brown. 2011. “Sequential
Model-Based Optimization for General Algorithm Configuration.”
Proceedings of the Fifth International
Conference on Learning and
Intelligent Optimization
(LION’11). https://doi.org/10.1007/978-3-642-25566-3_40.
Hutter, F., L. Kotthoff, and J. Vanschoren, eds. 2019. Automated
Machine Learning: Methods,
Systems, Challenges. Springer. https://doi.org/10.1007/978-3-030-05318-5.
Ioffe, Sergey. 2017. “Batch Renormalization: Towards Reducing
Minibatch Dependence in Batch-Normalized Models.” Advances in
Neural Information Processing
Systems, 1945–53. https://proceedings.mlr.press/v70/ioffe17a.html.
Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization:
Accelerating Deep Network Training by Reducing Internal Covariate
Shift.” International Conference on Machine Learning,
448–56. https://arxiv.org/abs/1502.03167.
Izmailov, Pavel, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and
Andrew Gordon Wilson. 2018. “Averaging Weights Leads to Wider
Optima and Better Generalization.” Uncertainty in Artificial
Intelligence, 876–85. https://arxiv.org/abs/1803.05407.
Jacot, Arthur, Franck Gabriel, and Clément Hongler. 2018. “Neural
Tangent Kernel: Convergence and Generalization in Neural
Networks.” Advances in Neural
Information Processing
Systems 31. https://arxiv.org/abs/1806.07572.
Jaeger, Herbert. 2002. Tutorial on Training Recurrent Neural
Networks, Covering BPTT, RTRL,
EKF and the “Echo State Network”
Approach. GMD-Forschungszentrum
Informationstechnik Bonn.
Jamieson, K., and A. Talwalkar. 2016. “Non-Stochastic Best Arm
Identification and Hyperparameter Optimization.” Proceedings
of the 19th International Conference on
Artificial Intelligence and
Statistics. https://proceedings.mlr.press/v51/jamieson16.html.
Jenatton, R., C. Archambeau, J. González, and M. Seeger. 2017.
“Bayesian Optimization with Tree-Structured Dependencies.” Proceedings of the 34th
International Conference on
Machine Learning (ICML’17).
https://proceedings.mlr.press/v70/jenatton17a.html.
Jia, Xianyan, Shutao Song, Wei He, et al.
2018. “Highly Scalable Deep Learning Training System with
Mixed-Precision: Training ImageNet in Four
Minutes.” ArXiv:1807.11205. https://arxiv.org/abs/1807.11205.
Jia, Yangqing, Evan Shelhamer, Jeff Donahue, et al. 2014. “Caffe:
Convolutional Architecture for Fast Feature Embedding.”
Proceedings of the 22nd ACM International
Conference on Multimedia, 675–78. https://doi.org/10.1145/2647868.2654889.
Joshi, Mandar, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer,
and Omer Levy. 2020. “SpanBERT: Improving
Pre-Training by Representing and Predicting Spans.”
Transactions of the Association for
Computational Linguistics 8: 64–77. https://arxiv.org/abs/1907.10529.
Jouppi, Norman P, Cliff Young, Nishant Patil, et
al. 2017. “In-Datacenter Performance Analysis of a Tensor
Processing Unit.” 2017 ACM/IEEE 44th
Annual International Symposium on
Computer Architecture
(ISCA), 1–12. https://doi.org/10.1145/3140659.3080246.
Kalchbrenner, Nal, Edward Grefenstette, and Phil Blunsom. 2014. “A
Convolutional Neural Network for Modelling Sentences.”
ArXiv:1404.2188. https://arxiv.org/abs/1404.2188.
Kalman, Barry L, and Stan C Kwasny. 1992. “Why Tanh: Choosing a
Sigmoidal Function.” Proceedings of the
International Joint Conference on
Neural Networks (IJCNN), 578–81. https://doi.org/10.1109/ijcnn.1992.227257.
Kaplan, Jared, Sam McCandlish, Tom Henighan, et al. 2020. “Scaling
Laws for Neural Language Models.”
ArXiv:2001.08361. https://arxiv.org/abs/2001.08361.
Karnin, Z., T. Koren, and O. Somekh. 2013. “Almost Optimal
Exploration in Multi-Armed Bandits.” Proceedings of the 30th
International Conference on
Machine Learning (ICML’13).
https://proceedings.mlr.press/v28/karnin13.html.
Karras, Tero, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018.
“Progressive Growing of GANs for Improved Quality,
Stability, and Variation.” International Conference on
Learning Representations. https://arxiv.org/abs/1710.10196.
Kim, Jaeyoung, Mostafa El-Khamy, and Jungwon Lee. 2017. “Residual
LSTM: Design of a Deep Recurrent Architecture for Distant
Speech Recognition.” ArXiv:1701.03360. https://arxiv.org/abs/1701.03360.
Kim, Yoon. 2014. “Convolutional Neural Networks for Sentence
Classification.” ArXiv:1408.5882. https://arxiv.org/abs/1408.5882.
Kimeldorf, G. S., and G. Wahba. 1971. “Some Results on
Tchebycheffian Spline Functions.” Journal of
Mathematical Analysis and Applications 33: 82–95.
https://doi.org/10.1016/0022-247x(71)90184-3.
Kingma, Diederik P, and Jimmy Ba. 2015. “Adam: A Method for
Stochastic Optimization.” International Conference on
Learning Representations. https://arxiv.org/abs/1412.6980.
Kingma, Diederik P., and Max Welling. 2014. “Auto-Encoding
Variational Bayes.” International
Conference on Learning
Representations (ICLR). https://arxiv.org/abs/1312.6114.
Kipf, Thomas N, and Max Welling. 2017. “Semi-Supervised
Classification with Graph Convolutional Networks.”
International Conference on Learning Representations. https://arxiv.org/abs/1609.02907.
Kojima, Takeshi, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and
Yusuke Iwasawa. 2022. “Large Language Models Are Zero-Shot
Reasoners.” ArXiv:2205.11916. https://arxiv.org/abs/2205.11916.
Koller, Daphne, and Nir Friedman. 2009. Probabilistic
Graphical Models: Principles and
Techniques. MIT Press. https://doi.org/10.7551/mitpress/7432.001.0001.
Kolmogorov, Andrey. 1933. “Sulla determinazione empirica di una
legge di distribuzione.” Giornale dell’Istituto Italiano degli
Attuari 4: 83–91.
Kolter, Zico. 2008. “Linear Algebra Review and Reference.”
Available online:
http://cs229.stanford.edu/section/cs229-linalg.pdf.
Koren, Yehuda, Robert Bell, and Chris Volinsky. 2009. “Matrix
Factorization Techniques for Recommender Systems.”
Computer 42 (8): 30–37. https://doi.org/10.1109/mc.2009.263.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012.
“ImageNet Classification with Deep Convolutional
Neural Networks.” Advances in Neural
Information Processing
Systems, 1097–105. https://doi.org/10.5555/2999134.2999257.
Kung, Sun Yuan. 1988. VLSI Array
Processors. Prentice Hall.
Kuzovkin, Ilya, Raul Vicente, Mathilde Petton, et al. 2018.
“Activations of Deep Convolutional Neural Networks Are Aligned
with Gamma Band Activity of Human Visual Cortex.”
Communications Biology 1 (1): 1–12. https://doi.org/10.1038/s42003-018-0110-y.
Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush
Sharma, and Radu Soricut. 2019. “ALBERT: A Lite
BERT for Self-Supervised Learning of Language
Representations.” ArXiv:1909.11942. https://arxiv.org/abs/1909.11942.
Lavin, Andrew, and Scott Gray. 2016. “Fast Algorithms for
Convolutional Neural Networks.” Proceedings of the
IEEE Conference on Computer
Vision and Pattern
Recognition, 4013–21. https://doi.org/10.1109/cvpr.2016.435.
Le, Quoc V. 2013. “Building High-Level Features Using Large Scale
Unsupervised Learning.” Proceedings of the IEEE
International Conference on
Acoustics, Speech and Signal
Processing, 8595–98. https://doi.org/10.1109/icassp.2013.6639343.
LeCun, Yann, and Yoshua Bengio. 1995.
“Convolutional Networks for Images, Speech, and Time
Series.” In The Handbook of Brain
Theory and Neural Networks.
MIT Press. http://yann.lecun.com/exdb/publis/pdf/lecun-bengio-95a.pdf.
LeCun, Yann, Bernhard Boser, John S Denker, et al. 1989.
“Backpropagation Applied to Handwritten Zip Code
Recognition.” Neural Computation 1 (4):
541–51. https://doi.org/10.1162/neco.1989.1.4.541.
LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998.
“Gradient-Based Learning Applied to Document Recognition.”
Proceedings of the IEEE 86 (11): 2278–324. https://doi.org/10.1109/5.726791.
LeCun, Yann, Leon Bottou, G Orr, and Klaus-Robert Muller. 1998.
“Efficient Backprop.” In Neural Networks:
Tricks of the Trade. Springer. https://doi.org/10.1007/3-540-49430-8_2.
LeCun, Yann, LD Jackel, Leon Bottou, et al.
1995. “Comparison of Learning Algorithms for Handwritten Digit
Recognition.” International
Conference on Artificial Neural
Networks, 53–60.
Legendre, Adrien Marie. 1805. Mémoire sur les
opérations trigonométriques, dont les résultats
dépendent de la figure de la terre. F. Didot.
Lewis, Mike, Yinhan Liu, Naman Goyal, et al. 2019.
“BART: Denoising Sequence-to-Sequence Pre-Training
for Natural Language Generation, Translation, and Comprehension.”
ArXiv:1910.13461. https://arxiv.org/abs/1910.13461.
Lewkowycz, Aitor, Anders Andreassen, David Dohan,
et al. 2022. “Solving Quantitative Reasoning Problems with
Language Models.” ArXiv:2206.14858. https://arxiv.org/abs/2206.14858.
Li, L., K. Jamieson, A. Rostamizadeh, et al. 2018. “Massively
Parallel Hyperparameter Tuning.”
ArXiv:1810.05934. https://arxiv.org/abs/1810.05934.
Li, Mu. 2017. “Scaling Distributed
Machine Learning with System and
Algorithm Co-Design.” PhD thesis,
Carnegie Mellon University. https://www.cs.cmu.edu/~muli/file/mu-thesis.pdf.
Li, Mu, David G Andersen, Jun Woo Park, et al. 2014. “Scaling
Distributed Machine Learning with the Parameter Server.” 11th USENIX
Symposium on Operating Systems
Design and Implementation (OSDI
14), 583–98. https://doi.org/10.1145/2640087.2644155.
Li, Mu, Tong Zhang, Yuqiang Chen, and Alexander J Smola. 2014.
“Efficient Mini-Batch Training for Stochastic
Optimization.” Proceedings of the 20th ACM
SIGKDD International Conference
on Knowledge Discovery and Data
Mining, 661–70. https://doi.org/10.1145/2623330.2623612.
Liaw, R., E. Liang, R. Nishihara, P. Moritz, J. Gonzalez, and I. Stoica.
2018. “Tune: A Research Platform for Distributed
Model Selection and Training.”
ArXiv:1807.05118. https://arxiv.org/abs/1807.05118.
Lin, Min, Qiang Chen, and Shuicheng Yan. 2013. “Network in
Network.” ArXiv:1312.4400. https://arxiv.org/abs/1312.4400.
Lin, Tsung-Yi, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár.
2017. “Focal Loss for Dense Object Detection.”
Proceedings of the IEEE International
Conference on Computer
Vision, 2980–88. https://doi.org/10.1109/iccv.2017.324.
Lin, Yuanqing, F Lv, S Zhu, et al. 2010.
“ImageNet Classification: Fast Descriptor Coding and
Large-Scale SVM Training.”
Large Scale Visual Recognition Challenge. https://doi.org/10.1109/cvpr.2010.5539970.
Lin, Zhouhan, Minwei Feng, Cicero Nogueira dos Santos, et al. 2017.
“A Structured Self-Attentive Sentence Embedding.”
ArXiv:1703.03130. https://arxiv.org/abs/1703.03130.
Lipton, Zachary C, John Berkowitz, and Charles Elkan. 2015. “A
Critical Review of Recurrent Neural Networks for Sequence
Learning.” ArXiv:1506.00019. https://arxiv.org/abs/1506.00019.
Lipton, Zachary C, David C Kale, Charles Elkan, and Randall Wetzel.
2016. “Learning to Diagnose with LSTM Recurrent
Neural Networks.” International
Conference on Learning
Representations (ICLR). https://arxiv.org/abs/1511.03677.
Lipton, Zachary C, and Jacob Steinhardt. 2018. “Troubling Trends
in Machine Learning Scholarship.” Queue
17 (1): 45–77. https://doi.org/10.1145/3317287.3328534.
Liu, Dong C, and Jorge Nocedal. 1989. “On the Limited Memory
BFGS Method for Large Scale Optimization.”
Mathematical Programming 45 (1): 503–28. https://doi.org/10.1007/bf01589116.
Liu, Hanxiao, Karen Simonyan, and Yiming Yang. 2018.
“DARTS: Differentiable Architecture Search.”
ArXiv:1806.09055. https://arxiv.org/abs/1806.09055.
Liu, Wei, Dragomir Anguelov, Dumitru Erhan, et al. 2016.
“SSD: Single Shot Multibox Detector.”
European Conference on Computer
Vision, 21–37. https://doi.org/10.1007/978-3-319-46448-0_2.
Liu, Yinhan, Myle Ott, Naman Goyal, et al. 2019.
“RoBERTa: A Robustly Optimized BERT
Pretraining Approach.” ArXiv:1907.11692. https://arxiv.org/abs/1907.11692.
Liu, Ze, Yutong Lin, Yue Cao, et al. 2021. “Swin Transformer:
Hierarchical Vision Transformer Using Shifted Windows.”
Proceedings of the IEEE/CVF
International Conference on
Computer Vision, 10012–22. https://doi.org/10.1109/iccv48922.2021.00986.
Liu, Zhuang, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor
Darrell, and Saining Xie. 2022. “A ConvNet for the
2020s.” ArXiv:2201.03545. https://arxiv.org/abs/2201.03545.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. 2015. “Fully
Convolutional Networks for Semantic Segmentation.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 3431–40. https://doi.org/10.1109/cvpr.2015.7298965.
Loshchilov, Ilya, and Frank Hutter. 2016. “SGDR:
Stochastic Gradient Descent with Warm Restarts.”
ArXiv:1608.03983. https://arxiv.org/abs/1608.03983.
Lowe, David G. 2004. “Distinctive Image Features from
Scale-Invariant Keypoints.” International
Journal of Computer Vision
60 (2): 91–110. https://doi.org/10.1023/b:visi.0000029664.99615.94.
Luo, Ping, Xinjiang Wang, Wenqi Shao, and Zhanglin Peng. 2018.
“Towards Understanding Regularization in Batch
Normalization.” ArXiv:1809.00846. https://arxiv.org/abs/1809.00846.
Maas, Andrew L, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng,
and Christopher Potts. 2011. “Learning Word Vectors for Sentiment
Analysis.” Proceedings of the 49th Annual
Meeting of the Association for
Computational Linguistics: Human
Language Technologies, Volume
1, 142–50. https://aclanthology.org/P11-1015.
Mack, Yue-Pok, and Bernard W Silverman. 1982. “Weak and Strong
Uniform Consistency of Kernel Regression Estimates.”
Zeitschrift für Wahrscheinlichkeitstheorie
und verwandte Gebiete 61 (3): 405–15. https://doi.org/10.1007/bf00539840.
MacKay, David JC. 2003. Information Theory,
Inference and Learning
Algorithms. Cambridge University
Press. https://www.inference.org.uk/mackay/itila/book.html.
Maclaurin, D., D. Duvenaud, and R. Adams. 2015. “Gradient-Based
Hyperparameter Optimization Through Reversible Learning.”
Proceedings of the 32nd International
Conference on Machine Learning
(ICML’15). https://proceedings.mlr.press/v37/maclaurin15.html.
Mangasarian, O. L. 1965. “Linear and Nonlinear Separation of
Patterns by Linear Programming.” Operations Research
13: 444–52. https://doi.org/10.1287/opre.13.3.444.
Mangram, Myles E. 2013. “A Simplified Perspective of the
Markowitz Portfolio Theory.” Global
Journal of Business Research
7 (1): 59–70. https://scholarworks.iu.edu/journals/index.php/jiuspa/article/view/4517.
Matthews, Alexander G de G, Mark Rowland, Jiri Hron, Richard E Turner,
and Zoubin Ghahramani. 2018. “Gaussian Process Behaviour in Wide
Deep Neural Networks.” ArXiv:1804.11271. https://arxiv.org/abs/1804.11271.
McCann, Bryan, James Bradbury, Caiming Xiong, and Richard Socher. 2017.
“Learned in Translation: Contextualized Word
Vectors.” Advances in Neural
Information Processing
Systems, 6294–305. https://doi.org/10.5555/3294996.3295037.
McCulloch, Warren S, and Walter Pitts. 1943. “A Logical Calculus
of the Ideas Immanent in Nervous Activity.” Bulletin of
Mathematical Biophysics 5 (4): 115–33. https://doi.org/10.1016/s0092-8240(05)80006-0.
McMahan, H Brendan, Gary Holt, David Sculley, et
al. 2013. “Ad Click Prediction: A View from the
Trenches.” Proceedings of the 19th ACM SIGKDD
International Conference on
Knowledge Discovery and Data
Mining, 1222–30. https://doi.org/10.1145/2487575.2488200.
Mead, Carver, and Lynn Conway. 1980. Introduction to
VLSI Systems. Addison-Wesley.
Merity, Stephen, Caiming Xiong, James Bradbury, and Richard Socher.
2016. “Pointer Sentinel Mixture Models.”
ArXiv:1609.07843. https://arxiv.org/abs/1609.07843.
Micchelli, Charles A. 1984. “Interpolation of Scattered Data:
Distance Matrices and Conditionally Positive Definite Functions.”
In Approximation Theory and Spline
Functions. Springer. https://doi.org/10.1007/bf01893414.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013.
“Efficient Estimation of Word Representations in Vector
Space.” ArXiv:1301.3781. https://arxiv.org/abs/1301.3781.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean.
2013. “Distributed Representations of Words and Phrases and Their
Compositionality.” Advances in Neural
Information Processing
Systems, 3111–19. https://doi.org/10.5555/2999792.2999959.
Miller, George A. 1995. “WordNet: A Lexical Database
for English.” Communications of the
ACM 38 (11): 39–41. https://doi.org/10.1145/219717.219748.
Mirhoseini, Azalia, Hieu Pham, Quoc V Le, et al. 2017. “Device
Placement Optimization with Reinforcement Learning.”
Proceedings of the 34th International
Conference on Machine
Learning, 2430–39. https://proceedings.mlr.press/v70/mirhoseini17a.html.
Mnih, Volodymyr, Nicolas Heess, Alex Graves, et
al. 2014. “Recurrent Models of Visual Attention.”
Advances in Neural Information
Processing Systems, 2204–12. https://doi.org/10.5555/2969033.2969073.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, et al. 2013.
“Playing Atari with Deep Reinforcement
Learning.” ArXiv:1312.5602. https://arxiv.org/abs/1312.5602.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver,
et al. 2015. “Human-Level Control Through Deep
Reinforcement Learning.” Nature 518 (7540):
529–33. https://doi.org/10.1038/nature14236.
Moon, Taesup, Alex Smola, Yi Chang, and Zhaohui Zheng. 2010.
“Intervalrank: Isotonic Regression with Listwise and Pairwise
Constraints.” Proceedings of the 3rd ACM
International Conference on Web
Search and Data Mining,
151–60. https://doi.org/10.1145/1718487.1718520.
Morey, Richard D, Rink Hoekstra, Jeffrey N Rouder, Michael D Lee, and
Eric-Jan Wagenmakers. 2016. “The Fallacy of Placing Confidence in
Confidence Intervals.” Psychonomic Bulletin
& Review 23 (1): 103–23. https://doi.org/10.3758/s13423-015-0947-8.
Morozov, Vladimir Alekseevich. 1984. Methods for
Solving Incorrectly Posed
Problems. Springer.
Nadaraya, Elizbar A. 1964. “On Estimating Regression.”
Theory of Probability & Its
Applications 9 (1): 141–42. https://doi.org/10.1137/1109020.
Nair, Vinod, and Geoffrey E Hinton. 2010. “Rectified Linear Units
Improve Restricted Boltzmann Machines.”
ICML, 807–14. https://dl.acm.org/doi/10.5555/3104322.3104425.
Nakkiran, Preetum, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak,
and Ilya Sutskever. 2021. “Deep Double Descent: Where Bigger
Models and More Data Hurt.” Journal of
Statistical Mechanics: Theory and
Experiment 2021 (12): 124003. https://doi.org/10.1088/1742-5468/ac3a74.
Naor, Moni, and Omer Reingold. 1999. “On the Construction of
Pseudorandom Permutations: Luby–Rackoff
Revisited.” Journal of Cryptology 12 (1):
29–66. https://doi.org/10.1007/s001459900037.
Neal, Radford M. 1996. Bayesian Learning for
Neural Networks. Springer. https://doi.org/10.1007/978-1-4612-0745-0.
Nesterov, Yu. 2018. Lectures on Convex
Optimization. Springer. https://doi.org/10.1007/978-3-319-91578-4.
Nesterov, Yu, and J-Ph Vial. 2008. “Confidence Level Solutions for
Stochastic Programming.” Automatica 44 (6): 1559–68. https://doi.org/10.1016/j.automatica.2008.01.017.
Neyman, Jerzy. 1937. “Outline of a Theory of Statistical
Estimation Based on the Classical Theory of Probability.”
Philosophical Transactions of the Royal
Society of London. Series
A, Mathematical and Physical
Sciences 236 (767): 333–80. https://doi.org/10.1098/rsta.1937.0005.
Norelli, Antonio, Marco Fumero, Valentino Maiorca, Luca Moschella,
Emanuele Rodolà, and Francesco Locatello. 2022.
“ASIF: Coupled Data Turns Unimodal Models to
Multimodal Without Training.”
ArXiv:2210.01738. https://arxiv.org/abs/2210.01738.
Novak, Roman, Lechao Xiao, Jaehoon Lee, et al. 2018. “Bayesian
Deep Convolutional Networks with Many Channels Are Gaussian
Processes.” ArXiv:1810.05148. https://arxiv.org/abs/1810.05148.
Novikoff, A. B. J. 1962. “On Convergence Proofs for
Perceptrons.” Proceedings of the Symposium on
the Mathematical Theory of
Automata, 615–22. https://cs.nyu.edu/~mohri/pub/nov62.pdf.
Olshausen, Bruno A, and David J Field. 1996. “Emergence of
Simple-Cell Receptive Field Properties by Learning a Sparse Code for
Natural Images.” Nature 381 (6583): 607–9. https://doi.org/10.1038/381607a0.
Ong, Cheng Soon, Alexander Smola, and Robert Williamson. 2005.
“Learning the Kernel with Hyperkernels.”
Journal of Machine Learning
Research 6: 1043–71. https://jmlr.org/papers/v6/ong05a.html.
Ouyang, Long, Jeff Wu, Xu Jiang, et al.
2022. “Training Language Models to Follow Instructions with Human
Feedback.” ArXiv:2203.02155. https://arxiv.org/abs/2203.02155.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002.
“BLEU: A Method for Automatic Evaluation of Machine
Translation.” Proceedings of the 40th Annual
Meeting of the Association for
Computational Linguistics, 311–18. https://doi.org/10.3115/1073083.1073135.
Parikh, Ankur P, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit.
2016. “A Decomposable Attention Model for Natural Language
Inference.” ArXiv:1606.01933. https://arxiv.org/abs/1606.01933.
Park, Taesung, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019.
“Semantic Image Synthesis with Spatially-Adaptive
Normalization.” Proceedings of the IEEE
Conference on Computer Vision and
Pattern Recognition, 2337–46. https://doi.org/10.1109/cvpr.2019.00244.
Parzen, Emanuel. 1957. “On Consistent Estimates of the Spectrum of
a Stationary Time Series.” Annals of
Mathematical Statistics 28: 329–48. https://doi.org/10.1214/aoms/1177706962.
Paszke, Adam, Sam Gross, Francisco Massa, et
al. 2019. “PyTorch: An Imperative Style,
High-Performance Deep Learning Library.” Advances in
Neural Information Processing
Systems 32: 8026–37. https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
Paulus, Romain, Caiming Xiong, and Richard Socher. 2017. “A Deep
Reinforced Model for Abstractive Summarization.”
ArXiv:1705.04304. https://arxiv.org/abs/1705.04304.
Penedo, Guilherme, Quentin Malartic, Daniel Hesslow, et al. 2023.
“The RefinedWeb Dataset for
Falcon LLM: Outperforming Curated Corpora with
Web Data, and Web Data Only.”
ArXiv:2306.01116. https://arxiv.org/abs/2306.01116.
Pennington, Jeffrey, Samuel Schoenholz, and Surya Ganguli. 2017.
“Resurrecting the Sigmoid in Deep Learning Through Dynamical
Isometry: Theory and Practice.” Advances in
Neural Information Processing
Systems, 4785–95. https://proceedings.neurips.cc/paper/2017/hash/a9fc2d3b0c721b5b3f0b3f11bb24a75c-Abstract.html.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014.
“GloVe: Global Vectors for Word
Representation.” Proceedings of the 2014
Conference on Empirical Methods
in Natural Language Processing
(EMNLP), 1532–43. https://doi.org/10.3115/v1/d14-1162.
Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf. 2017.
Elements of Causal Inference:
Foundations and Learning
Algorithms. MIT Press. https://doi.org/10.7551/mitpress/11283.001.0001.
Peters, Matthew, Waleed Ammar, Chandra Bhagavatula, and Russell Power.
2017. “Semi-Supervised Sequence Tagging with Bidirectional
Language Models.” Proceedings of the 55th Annual
Meeting of the Association for
Computational Linguistics, Volume
1, 1756–65. https://doi.org/10.18653/v1/p17-1161.
Peters, Matthew, Mark Neumann, Mohit Iyyer, et al. 2018. “Deep
Contextualized Word Representations.” Proceedings of the 2018
Conference of the North American
Chapter of the Association for
Computational Linguistics: Human
Language Technologies, Volume
1, 2227–37. https://doi.org/10.18653/v1/n18-1202.
Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2008. The
Matrix Cookbook. Technical University of Denmark. https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.
Pleiss, Geoff, Danlu Chen, Gao Huang, Tongcheng Li, Laurens Van Der
Maaten, and Kilian Q Weinberger. 2017. “Memory-Efficient
Implementation of DenseNets.”
ArXiv:1707.06990. https://arxiv.org/abs/1707.06990.
Polyak, Boris T. 1964. “Some Methods of Speeding up the
Convergence of Iteration Methods.” USSR
Computational Mathematics and
Mathematical Physics 4 (5): 1–17. https://doi.org/10.1016/0041-5553(64)90137-5.
Prakash, Aaditya, Sadid A Hasan, Kathy Lee, et al. 2016. “Neural
Paraphrase Generation with Stacked Residual LSTM
Networks.” ArXiv:1610.03098. https://arxiv.org/abs/1610.03098.
Qin, Chengwei, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro
Yasunaga, and Diyi Yang. 2023. “Is
ChatGPT a General-Purpose Natural Language
Processing Task Solver?” ArXiv:2302.06476.
https://arxiv.org/abs/2302.06476.
Quadrana, Massimo, Paolo Cremonesi, and Dietmar Jannach. 2018.
“Sequence-Aware Recommender Systems.” ACM
Computing Surveys 51 (4): 66. https://doi.org/10.1145/3190616.
Quinlan, J Ross. 1993. C4.5: Programs for
Machine Learning. Elsevier. https://doi.org/10.1016/c2009-0-27846-9.
Rabiner, Lawrence, and Biing-Hwang Juang. 1993. Fundamentals of
Speech Recognition.
Prentice-Hall.
Radford, Alec, Jong Wook Kim, Chris Hallacy, et
al. 2021. “Learning Transferable Visual Models from Natural
Language Supervision.” International
Conference on Machine Learning, 8748–63. https://proceedings.mlr.press/v139/radford21a.html.
Radford, Alec, Luke Metz, and Soumith Chintala. 2015.
“Unsupervised Representation Learning with Deep Convolutional
Generative Adversarial Networks.”
ArXiv:1511.06434. https://arxiv.org/abs/1511.06434.
Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
2018. “Improving Language Understanding by Generative
Pre-Training.” OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and
Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask
Learners.” OpenAI Blog 1 (8):
9. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Radosavovic, Ilija, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr
Dollár. 2019. “On Network Design Spaces for Visual
Recognition.” Proceedings of the
IEEE/CVF International
Conference on Computer
Vision, 1882–90. https://doi.org/10.1109/iccv.2019.00052.
Radosavovic, Ilija, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and
Piotr Dollár. 2020. “Designing Network Design Spaces.”
Proceedings of the IEEE/CVF
Conference on Computer Vision and
Pattern Recognition, 10428–36. https://doi.org/10.1109/cvpr42600.2020.01044.
Rae, Jack W, Sebastian Borgeaud, Trevor Cai, et
al. 2021. “Scaling Language Models: Methods, Analysis &
Insights from Training Gopher.”
ArXiv:2112.11446. https://arxiv.org/abs/2112.11446.
Raffel, Colin, Noam Shazeer, Adam Roberts, et al. 2020. “Exploring
the Limits of Transfer Learning with a Unified Text-to-Text
Transformer.” Journal of Machine
Learning Research 21: 1–67. https://arxiv.org/abs/1910.10683.
Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang.
2016. “SQuAD:
100,000+ Questions for Machine Comprehension of Text.”
ArXiv:1606.05250. https://arxiv.org/abs/1606.05250.
Ramachandran, Prajit, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm
Levskaya, and Jon Shlens. 2019. “Stand-Alone Self-Attention in
Vision Models.” Advances in Neural
Information Processing
Systems 32. https://arxiv.org/abs/1906.05909.
Ramachandran, Prajit, Barret Zoph, and Quoc V Le. 2017. “Searching
for Activation Functions.”
ArXiv:1710.05941. https://arxiv.org/abs/1710.05941.
Ramesh, Aditya, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark
Chen. 2022. “Hierarchical Text-Conditional Image Generation with
CLIP Latents.” ArXiv:2204.06125. https://arxiv.org/abs/2204.06125.
Ramón y Cajal, Santiago, and L. Azoulay. 1894. Les
nouvelles idées sur la structure du système
nerveux chez l’homme et chez les
vertébrés. Paris: C.
Reinwald & Cie.
Ranzato, Marc-Aurelio, Y-Lan Boureau, Sumit Chopra, and Yann LeCun.
2007. “A Unified Energy-Based Framework for Unsupervised
Learning.” Artificial Intelligence and
Statistics, 371–79. https://proceedings.mlr.press/v2/ranzato07a.html.
Rasmussen, Carl Edward, and Christopher KI Williams. 2006. Gaussian
Processes for Machine
Learning. MIT Press. https://gaussianprocess.org/gpml/.
Reddi, Sashank J, Satyen Kale, and Sanjiv Kumar. 2019. “On the
Convergence of Adam and Beyond.”
ArXiv:1904.09237. https://arxiv.org/abs/1904.09237.
Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016.
“You Only Look Once: Unified, Real-Time Object Detection.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 779–88. https://doi.org/10.1109/cvpr.2016.91.
Redmon, Joseph, and Ali Farhadi. 2018. “YOLOv3: An
Incremental Improvement.” ArXiv:1804.02767.
https://arxiv.org/abs/1804.02767.
Reed, Scott, and Nando De Freitas. 2015. “Neural
Programmer-Interpreters.” ArXiv:1511.06279.
https://arxiv.org/abs/1511.06279.
Reed, Scott, Konrad Zolna, Emilio Parisotto, et
al. 2022. “A Generalist Agent.”
ArXiv:2205.06175. https://arxiv.org/abs/2205.06175.
Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. 2015.
“Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks.” Advances in
Neural Information Processing
Systems, 91–99. https://doi.org/10.5555/2969239.2969250.
Rendle, Steffen. 2010. “Factorization Machines.” 2010
IEEE International Conference on
Data Mining, 995–1000. https://doi.org/10.1109/icdm.2010.127.
Rendle, Steffen, Christoph Freudenthaler, Zeno Gantner, and Lars
Schmidt-Thieme. 2009. “BPR: Bayesian
Personalized Ranking from Implicit Feedback.” Proceedings of
the 25th Conference on Uncertainty in
Artificial Intelligence, 452–61. https://arxiv.org/abs/1205.2618.
Revels, Jarrett, Miles Lubin, and Theodore Papamarkou. 2016.
“Forward-Mode Automatic Differentiation in
Julia.” ArXiv:1607.07892. https://arxiv.org/abs/1607.07892.
Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2014.
“Stochastic Backpropagation and Approximate Inference in Deep
Generative Models.” International
Conference on Machine
Learning, 1278–86. https://proceedings.mlr.press/v32/rezende14.html.
Riesenhuber, Maximilian, and Tomaso Poggio. 1999. “Hierarchical
Models of Object Recognition in Cortex.” Nature
Neuroscience 2 (11): 1019–25. https://doi.org/10.1038/14819.
Rockafellar, R. T. 1970. Convex Analysis.
Princeton University Press. https://doi.org/10.1515/9781400873173.
Rolnick, David, Andreas Veit, Serge Belongie, and Nir Shavit. 2017.
“Deep Learning Is Robust to Massive Label Noise.”
ArXiv:1705.10694. https://arxiv.org/abs/1705.10694.
Rudin, W. 1973. Functional Analysis. McGraw-Hill.
Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986.
“Learning Representations by Back-Propagating Errors.”
Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.
Russakovsky, Olga, Jia Deng, Zhiheng Huang, Alexander C. Berg, and Li
Fei-Fei. 2013. “Detecting Avocados to Zucchinis: What Have We
Done, and Where Are We Going?” International
Conference on Computer Vision
(ICCV). https://doi.org/10.1109/iccv.2013.258.
Russakovsky, Olga, Jia Deng, Hao Su, et al.
2015. “ImageNet Large Scale Visual Recognition
Challenge.” International Journal of
Computer Vision 115 (3): 211–52. https://doi.org/10.1007/s11263-015-0816-y.
Russell, Stuart J, and Peter Norvig. 2016. Artificial
Intelligence: A Modern
Approach. Pearson Education
Limited.
Saharia, Chitwan, William Chan, Saurabh Saxena, et
al. 2022. “Photorealistic Text-to-Image Diffusion Models
with Deep Language Understanding.”
ArXiv:2205.11487. https://arxiv.org/abs/2205.11487.
Salinas, D., M. Seeger, A. Klein, V. Perrone, M. Wistuba, and C.
Archambeau. 2022. “Syne Tune: A Library for Large
Scale Hyperparameter Tuning and Reproducible Research.” First
Conference on Automated Machine
Learning. https://proceedings.mlr.press/v188/salinas22a.html.
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019.
“DistilBERT, a Distilled Version of
BERT: Smaller, Faster, Cheaper and Lighter.”
ArXiv:1910.01108. https://arxiv.org/abs/1910.01108.
Sanh, Victor, Albert Webson, Colin Raffel, et
al. 2021. “Multitask Prompted Training Enables Zero-Shot
Task Generalization.” ArXiv:2110.08207. https://arxiv.org/abs/2110.08207.
Santurkar, Shibani, Dimitris Tsipras, Andrew Ilyas, and Aleksander
Madry. 2018. “How Does Batch Normalization Help
Optimization?” Advances in Neural
Information Processing
Systems, 2483–93. https://doi.org/10.5555/3327345.3327508.
Sarwar, Badrul Munir, George Karypis, Joseph A Konstan, and John Riedl.
2001. “Item-Based Collaborative Filtering Recommendation
Algorithms.” Proceedings of the 10th International
Conference on World Wide
Web, 285–95. https://doi.org/10.1145/371920.372071.
Scao, Teven Le, Angela Fan, Christopher Akiki, et
al. 2022. “BLOOM: A
176B-Parameter Open-Access Multilingual Language
Model.” ArXiv:2211.05100. https://arxiv.org/abs/2211.05100.
Schein, Andrew I, Alexandrin Popescul, Lyle H Ungar, and David M
Pennock. 2002. “Methods and Metrics for Cold-Start
Recommendations.” Proceedings of the 25th Annual
International ACM SIGIR
Conference on Research and
Development in Information
Retrieval, 253–60. https://doi.org/10.1145/564376.564421.
Schölkopf, Bernhard, Chris Burges, and Vladimir Vapnik. 1996.
“Incorporating Invariances in Support Vector Learning
Machines.” International Conference
on Artificial Neural
Networks, 47–52. https://doi.org/10.1007/3-540-61510-5_12.
Schölkopf, Bernhard, and Alexander J Smola. 2002. Learning with
Kernels: Support Vector
Machines, Regularization,
Optimization, and Beyond.
MIT Press. https://doi.org/10.7551/mitpress/4175.001.0001.
Schölkopf, B., R. Herbrich, and A. J. Smola. 2001. “A Generalized
Representer Theorem.” In Proceedings of the
Annual Conference on
Computational Learning
Theory, edited by D. P. Helmbold and B. Williamson.
Springer-Verlag. https://doi.org/10.1007/3-540-44581-1_27.
Schuhmann, Christoph, Romain Beaumont, Richard
Vencu, et al. 2022. “LAION-5B: An
Open Large-Scale Dataset for Training Next Generation Image-Text
Models.” ArXiv:2210.08402. https://arxiv.org/abs/2210.08402.
Schuster, Mike, and Kuldip K Paliwal. 1997. “Bidirectional
Recurrent Neural Networks.” IEEE Transactions on
Signal Processing 45 (11): 2673–81. https://doi.org/10.1109/78.650093.
Sedhain, Suvash, Aditya Krishna Menon, Scott Sanner, and Lexing Xie.
2015. “AutoRec: Autoencoders Meet Collaborative
Filtering.” Proceedings of the 24th
International Conference on World
Wide Web, 111–12. https://doi.org/10.1145/2740908.2742726.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural
Machine Translation of Rare Words with Subword Units.”
Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics, 1715–25. https://doi.org/10.18653/v1/P16-1162.
Sergeev, Alexander, and Mike Del Balso. 2018. “Horovod: Fast and
Easy Distributed Deep Learning in
TensorFlow.”
ArXiv:1802.05799. https://arxiv.org/abs/1802.05799.
Shannon, Claude Elwood. 1948. “A Mathematical Theory of
Communication.” The Bell System
Technical Journal 27 (3): 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
Shao, Huajie, Shuochao Yao, Dachun Sun, et al. 2020.
“ControlVAE: Controllable Variational
Autoencoder.” Proceedings of the 37th
International Conference on
Machine Learning. https://proceedings.mlr.press/v119/shao20b.html.
Shaw, Peter, Jakob Uszkoreit, and Ashish Vaswani. 2018.
“Self-Attention with Relative Position Representations.”
ArXiv:1803.02155. https://arxiv.org/abs/1803.02155.
Shoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared
Casper, and Bryan Catanzaro. 2019. “Megatron-LM:
Training Multi-Billion Parameter Language Models Using Model
Parallelism.” ArXiv:1909.08053. https://arxiv.org/abs/1909.08053.
Silver, David, Aja Huang, Chris J Maddison, et
al. 2016. “Mastering the Game of Go with Deep
Neural Networks and Tree Search.” Nature 529 (7587):
484–89. https://doi.org/10.1038/nature16961.
Silverman, B. W. 1986. Density Estimation for
Statistics and Data
Analysis. Chapman & Hall.
Simard, Patrice Y, Yann A LeCun, John S Denker, and Bernard Victorri.
1998. “Transformation Invariance in Pattern Recognition – Tangent
Distance and Tangent Propagation.” In Neural
Networks: Tricks of the
Trade. Springer. https://doi.org/10.1007/978-3-642-35289-8_17.
Simonyan, Karen, and Andrew Zisserman. 2015. “Very Deep
Convolutional Networks for Large-Scale Image Recognition.”
International Conference on Learning Representations. https://arxiv.org/abs/1409.1556.
Sindhwani, Vikas, Tara N Sainath, and Sanjiv Kumar. 2015.
“Structured Transforms for Small-Footprint Deep Learning.”
ArXiv:1510.01722. https://arxiv.org/abs/1510.01722.
Sivic, Josef, and Andrew Zisserman. 2003. “Video
Google: A Text Retrieval Approach to Object Matching in
Videos.” Proceedings of the IEEE
International Conference on
Computer Vision 3: 1470–77. https://doi.org/10.1109/iccv.2003.1238663.
Smith, Shaden, Mostofa Patwary, Brandon Norick, et
al. 2022. “Using DeepSpeed and
Megatron to Train Megatron-Turing
NLG 530B, a Large-Scale Generative Language
Model.” ArXiv:2201.11990. https://arxiv.org/abs/2201.11990.
Smola, Alexander, and Shravan Narayanamurthy. 2010. “An
Architecture for Parallel Topic Models.” Proceedings of the
VLDB Endowment 3 (1-2): 703–10. https://doi.org/10.14778/1920841.1920931.
Snoek, J., H. Larochelle, and R. Adams. 2012. “Practical
Bayesian Optimization of Machine Learning
Algorithms.” Advances in Neural
Information Processing Systems
25, 2951–59. https://doi.org/10.5555/2999325.2999464.
Sohl-Dickstein, Jascha, Eric Weiss, Niru Maheswaranathan, and Surya
Ganguli. 2015. “Deep Unsupervised Learning Using Nonequilibrium
Thermodynamics.” International
Conference on Machine
Learning, 2256–65. https://proceedings.mlr.press/v37/sohl-dickstein15.html.
Song, Yang, and Stefano Ermon. 2019. “Generative Modeling by
Estimating Gradients of the Data Distribution.” Advances in
Neural Information Processing
Systems 32. https://arxiv.org/abs/1907.05600.
Song, Yang, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar,
Stefano Ermon, and Ben Poole. 2021. “Score-Based Generative
Modeling Through Stochastic Differential Equations.”
International Conference on
Learning Representations. https://doi.org/10.52202/075280-1645.
Speelpenning, Bert. 1980. “Compiling Fast Partial Derivatives of
Functions Given by Algorithms.” PhD thesis, University of
Illinois at Urbana-Champaign. https://doi.org/10.2172/5254402.
Srivastava, Aarohi, Abhinav Rastogi, Abhishek Rao,
et al. 2022. “Beyond the Imitation Game: Quantifying and
Extrapolating the Capabilities of Language Models.”
ArXiv:2206.04615. https://arxiv.org/abs/2206.04615.
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever,
and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent
Neural Networks from Overfitting.” Journal of
Machine Learning Research 15
(1): 1929–58. https://doi.org/10.5555/2627435.2670313.
Srivastava, Rupesh Kumar, Klaus Greff, and Jürgen Schmidhuber. 2015.
“Highway Networks.” ArXiv:1505.00387.
https://arxiv.org/abs/1505.00387.
Strang, Gilbert. 1993. Introduction to Linear
Algebra. Wellesley–Cambridge
Press. https://math.mit.edu/~gs/linearalgebra/.
Su, Xiaoyuan, and Taghi M Khoshgoftaar. 2009. “A Survey of
Collaborative Filtering Techniques.” Advances in
Artificial Intelligence 2009. https://doi.org/10.1155/2009/421425.
Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. 2015.
“End-to-End Memory Networks.” Advances in
Neural Information Processing
Systems, 2440–48. https://doi.org/10.5555/2969239.2969426.
Sutskever, Ilya, James Martens, George Dahl, and Geoffrey Hinton. 2013.
“On the Importance of Initialization and Momentum in Deep
Learning.” International Conference
on Machine Learning, 1139–47. https://proceedings.mlr.press/v28/sutskever13.html.
Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. “Sequence to
Sequence Learning with Neural Networks.” Advances in
Neural Information Processing
Systems, 3104–12. https://doi.org/10.5555/2969033.2969173.
Szegedy, Christian, Sergey Ioffe, Vincent Vanhoucke, and Alexander A
Alemi. 2017. “Inception-V4,
Inception-ResNet and the Impact
of Residual Connections on Learning.” 31st AAAI
Conference on Artificial
Intelligence. https://doi.org/10.1609/aaai.v31i1.11231.
Szegedy, Christian, Wei Liu, Yangqing Jia, et al. 2015. “Going
Deeper with Convolutions.” Proceedings of the
IEEE Conference on Computer
Vision and Pattern
Recognition, 1–9. https://doi.org/10.1109/cvpr.2015.7298594.
Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
Zbigniew Wojna. 2016. “Rethinking the Inception
Architecture for Computer Vision.” Proceedings of the
IEEE Conference on Computer
Vision and Pattern
Recognition, 2818–26. https://doi.org/10.1109/cvpr.2016.308.
Tallec, Corentin, and Yann Ollivier. 2017. “Unbiasing Truncated
Backpropagation Through Time.”
ArXiv:1705.08209. https://arxiv.org/abs/1705.08209.
Tan, Mingxing, and Quoc Le. 2019. “EfficientNet:
Rethinking Model Scaling for Convolutional Neural Networks.”
International Conference on
Machine Learning, 6105–14. https://proceedings.mlr.press/v97/tan19a.html.
Tang, Jiaxi, and Ke Wang. 2018. “Personalized Top-N Sequential
Recommendation via Convolutional Sequence Embedding.”
Proceedings of the Eleventh ACM
International Conference on Web
Search and Data Mining,
565–73. https://doi.org/10.1145/3159652.3159656.
Taskar, Ben, Carlos Guestrin, and Daphne Koller. 2004. “Max-Margin
Markov Networks.” Advances in
Neural Information Processing
Systems 16: 25.
Tay, Yi, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020.
“Efficient Transformers: A Survey.”
ArXiv:2009.06732. https://arxiv.org/abs/2009.06732.
Taylor, Ross, Marcin Kardas, Guillem Cucurull, et al. 2022.
“Galactica: A Large Language Model for Science.”
ArXiv:2211.09085. https://arxiv.org/abs/2211.09085.
Teye, Mattias, Hossein Azizpour, and Kevin Smith. 2018. “Bayesian
Uncertainty Estimation for Batch Normalized Deep Networks.”
ArXiv:1802.06455. https://arxiv.org/abs/1802.06455.
Thomee, Bart, David A Shamma, Gerald Friedland, et al. 2016.
“YFCC100M: The New Data in Multimedia Research.”
Communications of the ACM 59 (2): 64–73. https://doi.org/10.1145/2812802.
Tieleman, Tijmen, and Geoffrey Hinton. 2012. “Divide the Gradient
by a Running Average of Its Recent Magnitude.” In
COURSERA: Neural Networks for
Machine Learning, Lecture 6.5: RMSProp. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
Tikhonov, A. N., and V. Y. Arsenin. 1977. Solutions of
Ill-Posed Problems.
W.H. Winston. https://doi.org/10.1137/1.9780898719741.
Tolstikhin, Ilya O, Neil Houlsby, Alexander
Kolesnikov, et al. 2021. “MLP-Mixer: An
All-MLP Architecture for Vision.” Advances in
Neural Information Processing
Systems 34. https://arxiv.org/abs/2105.01601.
Torralba, Antonio, Rob Fergus, and William T Freeman. 2008. “80
Million Tiny Images: A Large Data Set for Nonparametric Object and Scene
Recognition.” IEEE Transactions on
Pattern Analysis and Machine
Intelligence 30 (11): 1958–70. https://doi.org/10.1109/tpami.2008.128.
Töscher, Andreas, Michael Jahrer, and Robert M Bell. 2009. The
BigChaos Solution to the Netflix Grand Prize. https://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf.
Touvron, Hugo, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
Sablayrolles, and Hervé Jégou. 2021. “Training Data-Efficient
Image Transformers & Distillation Through Attention.”
International Conference on
Machine Learning, 10347–57. https://proceedings.mlr.press/v139/touvron21a.html.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, et
al. 2023a. “LLaMA: Open and
Efficient Foundation Language Models.”
ArXiv:2302.13971. https://arxiv.org/abs/2302.13971.
Touvron, Hugo, Louis Martin, Kevin Stone, et
al. 2023b. “LLaMA 2: Open
Foundation and Fine-Tuned Chat Models.”
ArXiv:2307.09288. https://arxiv.org/abs/2307.09288.
Tsoumakas, Grigorios, and Ioannis Katakis. 2007. “Multi-Label
Classification: An Overview.” International
Journal of Data Warehousing and
Mining 3 (3): 1–13. https://doi.org/10.4018/jdwm.2007070101.
Turing, Alan. 1950. “Computing Machinery and Intelligence.”
Mind 59 (236): 433–60. https://doi.org/10.1093/mind/lix.236.433.
Uijlings, Jasper RR, Koen EA Van De Sande, Theo Gevers, and Arnold WM
Smeulders. 2013. “Selective Search for Object Recognition.”
International Journal of Computer
Vision 104 (2): 154–71. https://doi.org/10.1007/s11263-013-0620-5.
Vapnik, V. 1995. The Nature of Statistical
Learning Theory. Springer.
Vapnik, V. 1998. Statistical Learning Theory. John Wiley & Sons.
Vapnik, V. N., and A. Y. Chervonenkis. 1974. “Ordered Risk
Minimization.” Automation and Remote Control 35:
1226–35, 1403–12.
Vapnik, V., and A. Chervonenkis. 1964. “A Note on One Class of
Perceptrons.” Automation and Remote Control 25.
Vapnik, V., and A. Chervonenkis. 1968. “Uniform Convergence of
Frequencies of Occurrence of Events to Their Probabilities.”
Dokl. Akad. Nauk SSSR 181: 915–18.
Vapnik, V., and A. Chervonenkis. 1971. “On the Uniform Convergence
of Relative Frequencies of Events to Their Probabilities.”
Theory of Probability and Its Applications 16 (2): 264–81.
Vapnik, V., and A. Chervonenkis. 1981. “The Necessary and
Sufficient Conditions for the Uniform Convergence of Averages to Their
Expected Values.” Teoriya Veroyatnostei i Ee Primeneniya
26 (3): 543–64.
Vapnik, V., and A. Chervonenkis. 1991. “The Necessary and
Sufficient Conditions for Consistency in the Empirical Risk Minimization
Method.” Pattern Recognition and
Image Analysis 1 (3): 283–305.
Vapnik, Vladimir. 1992. “Principles of Risk Minimization for
Learning Theory.” Advances in Neural
Information Processing
Systems, 831–38. https://doi.org/10.5555/2986916.2987019.
Vapnik, Vladimir, Esther Levin, and Yann Le Cun. 1994. “Measuring
the VC-Dimension of a Learning Machine.” Neural
Computation 6 (5): 851–76. https://doi.org/10.1162/neco.1994.6.5.851.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017.
“Attention Is All You Need.” Advances in
Neural Information Processing
Systems, 5998–6008. https://doi.org/10.5555/3295222.3295349.
Wahba, Grace. 1990. Spline Models for
Observational Data. SIAM. https://doi.org/10.1137/1.9781611970128.
Waibel, Alex, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and
Kevin J Lang. 1989. “Phoneme Recognition Using Time-Delay Neural
Networks.” IEEE Transactions on
Acoustics, Speech, and Signal
Processing 37 (3): 328–39. https://doi.org/10.1016/b978-0-08-051584-7.50037-1.
Wang, Haotao, Aston Zhang, Shuai Zheng, Xingjian Shi, Mu Li, and
Zhangyang Wang. 2022. “Removing Batch Normalization Boosts
Adversarial Training.” International
Conference on Machine
Learning, 23433–45. https://openreview.net/forum?id=2J8bBfGCPi.
Wang, Leyuan, Mu Li, Edo Liberty, and Alex J Smola. 2018. “Optimal
Message Scheduling for Aggregation.” Networks 2 (3):
2–3. https://arxiv.org/abs/1710.09465.
Wang, Qiang, Bei Li, Tong Xiao, et al. 2019. “Learning Deep
Transformer Models for Machine Translation.” Proceedings of
the 57th Annual Meeting of the
Association for Computational
Linguistics, 1810–22. https://doi.org/10.18653/v1/p19-1176.
Wang, Xuezhi, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny
Zhou. 2023. “Self-Consistency Improves Chain of Thought Reasoning
in Language Models.” International
Conference on Learning
Representations. https://openreview.net/forum?id=1PL1NIMMrw.
Wang, Yangzihao, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel,
and John D Owens. 2016. “Gunrock: A High-Performance Graph
Processing Library on the GPU.” ACM
SIGPLAN Notices 51: 11. https://doi.org/10.1145/2688500.2688538.
Warstadt, Alex, Amanpreet Singh, and Samuel R Bowman. 2019.
“Neural Network Acceptability Judgments.”
Transactions of the Association for
Computational Linguistics 7: 625–41. https://doi.org/10.1162/tacl_a_00290.
Wasserman, Larry. 2013. All of Statistics:
A Concise Course in
Statistical Inference. Springer. https://link.springer.com/book/10.1007/978-0-387-21736-9.
Watkins, Christopher JCH, and Peter Dayan. 1992.
“Q-Learning.” Machine Learning 8
(3–4): 279–92. https://doi.org/10.1007/bf00992698.
Watson, Geoffrey S. 1964. “Smooth Regression Analysis.”
Sankhyā: The Indian Journal
of Statistics, Series A,
359–72. https://doi.org/10.1007/bf02868765.
Wei, Jason, Maarten Bosma, Vincent Y Zhao, et al. 2021. “Finetuned
Language Models Are Zero-Shot Learners.”
ArXiv:2109.01652. https://arxiv.org/abs/2109.01652.
Wei, Jason, Yi Tay, Rishi Bommasani, et al.
2022a. “Emergent Abilities of Large Language Models.”
ArXiv:2206.07682. https://arxiv.org/abs/2206.07682.
Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022b. “Chain of
Thought Prompting Elicits Reasoning in Large Language Models.”
ArXiv:2201.11903. https://arxiv.org/abs/2201.11903.
Welling, Max, and Yee W Teh. 2011. “Bayesian Learning via
Stochastic Gradient Langevin Dynamics.”
Proceedings of the 28th International
Conference on Machine Learning
(ICML-11), 681–88. https://dl.acm.org/doi/10.5555/3104482.3104568.
Wengert, Robert Edwin. 1964. “A Simple Automatic Derivative
Evaluation Program.” Communications of the
ACM 7 (8): 463–64. https://doi.org/10.1145/355588.365726.
Werbos, Paul J. 1990. “Backpropagation Through Time: What It Does
and How to Do It.” Proceedings of the IEEE
78 (10): 1550–60. https://doi.org/10.1109/5.58337.
Wigner, Eugene P. 1958. “On the Distribution of the Roots of
Certain Symmetric Matrices.” Annals of Mathematics,
325–27. https://doi.org/10.2307/1970079.
Wilson, Andrew G, and Pavel Izmailov. 2020. “Bayesian Deep
Learning and a Probabilistic Perspective of Generalization.”
Advances in Neural Information
Processing Systems 33: 4697–708. https://arxiv.org/abs/2002.08791.
Wistuba, M., A. Rawat, and T. Pedapati. 2019. “A Survey on Neural
Architecture Search.” ArXiv:1905.01392. https://arxiv.org/abs/1905.01392.
Wistuba, M., N. Schilling, and L. Schmidt-Thieme. 2018. “Scalable
Gaussian Process-Based Transfer Surrogates for
Hyperparameter Optimization.” Machine
Learning 108: 43–78. https://doi.org/10.1007/s10994-017-5684-y.
Wolpert, David H, and William G Macready. 1995. No Free Lunch
Theorems for Search. Technical Report
SFI-TR-95-02-010, Santa Fe
Institute. https://www.santafe.edu/research/results/working-papers/no-free-lunch-theorems-for-search.
Wood, Frank, Jan Gasthaus, Cédric Archambeau, Lancelot James, and Yee
Whye Teh. 2011. “The Sequence Memoizer.” Communications
of the ACM 54 (2): 91–98. https://doi.org/10.1145/1897816.1897842.
Wu, Bichen, Alvin Wan, Xiangyu Yue, et al. 2018. “Shift: A Zero
Flop, Zero Parameter Alternative to Spatial Convolutions.”
Proceedings of the IEEE Conference on
Computer Vision and Pattern
Recognition, 9127–35. https://doi.org/10.1109/cvpr.2018.00951.
Wu, Yonghui, Mike Schuster, Zhifeng Chen, et
al. 2016. “Google’s Neural Machine Translation System:
Bridging the Gap Between Human and Machine Translation.”
ArXiv:1609.08144. https://arxiv.org/abs/1609.08144.
Xiao, Han, Kashif Rasul, and Roland Vollgraf. 2017.
“Fashion-MNIST: A Novel Image Dataset for
Benchmarking Machine Learning Algorithms.”
ArXiv:1708.07747. https://arxiv.org/abs/1708.07747.
Xiao, Lechao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz,
and Jeffrey Pennington. 2018. “Dynamical Isometry and a Mean Field
Theory of CNNs: How to Train 10,000-Layer Vanilla
Convolutional Neural Networks.” International
Conference on Machine
Learning, 5393–402. https://proceedings.mlr.press/v80/xiao18a.html.
Xie, Saining, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He.
2017. “Aggregated Residual Transformations for Deep Neural
Networks.” Proceedings of the IEEE
Conference on Computer Vision and
Pattern Recognition, 1492–500. https://doi.org/10.1109/cvpr.2017.634.
Xiong, Ruibin, Yunchang Yang, Di He, et al. 2020. “On Layer
Normalization in the Transformer Architecture.”
International Conference on
Machine Learning, 10524–33. https://proceedings.mlr.press/v119/xiong20b.html.
Xiong, Wayne, Lingfeng Wu, Fil Alleva, Jasha Droppo, Xuedong Huang, and
Andreas Stolcke. 2018. “The Microsoft 2017
Conversational Speech Recognition System.” 2018
IEEE International Conference on
Acoustics, Speech and Signal
Processing (ICASSP), 5934–38. https://doi.org/10.1109/TASLP.2018.2876459.
Yamaguchi, Kouichi, Kenji Sakamoto, Toshio Akabane, and Yoshiji
Fujimoto. 1990. “A Neural Network for Speaker-Independent Isolated
Word Recognition.” First International
Conference on Spoken Language
Processing. https://doi.org/10.21437/icslp.1990-282.
Yang, Zichao, Zhiting Hu, Yuntian Deng, Chris Dyer, and Alex Smola.
2016. “Neural Machine Translation with Recurrent Attention
Modeling.” ArXiv:1607.05108. https://arxiv.org/abs/1607.05108.
Yang, Zichao, Marcin Moczulski, Misha Denil, et al. 2015. “Deep
Fried Convnets.” Proceedings of the IEEE
International Conference on
Computer Vision, 1476–83. https://doi.org/10.1109/iccv.2015.173.
Ye, Mao, Peifeng Yin, Wang-Chien Lee, and Dik-Lun Lee. 2011.
“Exploiting Geographical Influence for Collaborative
Point-of-Interest Recommendation.” Proceedings of the 34th
International ACM SIGIR
Conference on Research and
Development in Information
Retrieval, 325–34. https://doi.org/10.1145/2009916.2009962.
You, Yang, Igor Gitman, and Boris Ginsburg. 2017. “Large Batch
Training of Convolutional Networks.”
ArXiv:1708.03888. https://arxiv.org/abs/1708.03888.
Yu, Jiahui, Yuanzhong Xu, Jing Yu Koh, et al. 2022. “Scaling
Autoregressive Models for Content-Rich Text-to-Image Generation.”
ArXiv:2206.10789. https://arxiv.org/abs/2206.10789.
Zaheer, Manzil, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv
Kumar. 2018. “Adaptive Methods for Nonconvex Optimization.”
Advances in Neural Information
Processing Systems, 9793–803. https://proceedings.neurips.cc/paper/2018/hash/90365351ccc7437a1309dc64e4db32a3-Abstract.html.
Zeiler, Matthew D. 2012. “ADADELTA: An Adaptive
Learning Rate Method.” ArXiv:1212.5701. https://arxiv.org/abs/1212.5701.
Zeiler, Matthew D, and Rob Fergus. 2013. “Stochastic Pooling for
Regularization of Deep Convolutional Neural Networks.”
ArXiv:1301.3557. https://arxiv.org/abs/1301.3557.
Zhang, Aston, Yi Tay, Shuai Zhang, et al. 2021. “Beyond
Fully-Connected Layers with Quaternions: Parameterization of
Hypercomplex Multiplications with 1/n Parameters.”
International Conference on
Learning Representations. https://openreview.net/forum?id=rcQdycl0zyk.
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol
Vinyals. 2021. “Understanding Deep Learning (Still) Requires
Rethinking Generalization.” Communications of the
ACM 64 (3): 107–15. https://doi.org/10.1145/3446776.
Zhang, Shuai, Lina Yao, Aixin Sun, and Yi Tay. 2019. “Deep
Learning Based Recommender System: A Survey and New
Perspectives.” ACM Computing
Surveys 52 (1): 5. https://doi.org/10.1145/3285029.
Zhang, Susan, Stephen Roller, Naman Goyal, et
al. 2022. “OPT: Open Pre-Trained Transformer
Language Models.” ArXiv:2205.01068. https://arxiv.org/abs/2205.01068.
Zhang, Wei, Jun Tanida, Kazuyoshi Itoh, and Yoshiki Ichioka. 1988.
“Shift-Invariant Pattern Recognition Neural Network and Its
Optical Architecture.” Proceedings of Annual
Conference of the Japan Society
of Applied Physics.
Zhang, Yifu, Peize Sun, Yi Jiang, et al. 2021.
“ByteTrack: Multi-Object Tracking by Associating
Every Detection Box.” ArXiv:2110.06864. https://arxiv.org/abs/2110.06864.
Zhang, Zhuosheng, Aston Zhang, Mu Li, and Alex Smola. 2023a.
“Automatic Chain of Thought Prompting in Large Language
Models.” International Conference
on Learning Representations. https://openreview.net/forum?id=5NTt8GFjUHkr.
Zhang, Zhuosheng, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex
Smola. 2023b. “Multimodal Chain-of-Thought Reasoning in Language
Models.” ArXiv:2302.00923. https://arxiv.org/abs/2302.00923.
Zhao, Zhong-Qiu, Peng Zheng, Shou-tao Xu, and Xindong Wu. 2019.
“Object Detection with Deep Learning: A Review.”
IEEE Transactions on Neural
Networks and Learning
Systems 30 (11): 3212–32. https://doi.org/10.1109/tnnls.2018.2876865.
Zhou, Denny, Nathanael Schärli, Le Hou, et al. 2023.
“Least-to-Most Prompting Enables Complex Reasoning in Large
Language Models.” International
Conference on Learning
Representations. https://openreview.net/forum?id=WZH7099tgfM.
Zhu, Jun-Yan, Taesung Park, Phillip Isola, and Alexei A Efros. 2017.
“Unpaired Image-to-Image Translation Using Cycle-Consistent
Adversarial Networks.” Proceedings of the IEEE
International Conference on
Computer Vision, 2223–32. https://doi.org/10.1109/iccv.2017.244.
Zhu, Yukun, Ryan Kiros, Rich Zemel, et al. 2015. “Aligning Books
and Movies: Towards Story-Like Visual Explanations by Watching Movies
and Reading Books.” Proceedings of the IEEE
International Conference on
Computer Vision, 19–27. https://doi.org/10.1109/iccv.2015.11.
Zoph, Barret, and Quoc V Le. 2016. “Neural Architecture Search
with Reinforcement Learning.”
ArXiv:1611.01578. https://arxiv.org/abs/1611.01578.