Computational Statistics in Data Science
Lee, Thomas C. M.; Piegorsch, Walter W.; Levine, Richard A.; Zhang, Hao Helen
John Wiley & Sons Inc
04/2022
672
Hardcover
English
9781119561071
15 to 20 days
1270
Preface xxix
Part I Computational Statistics and Data Science 1
1 Computational Statistics and Data Science in the Twenty-first Century 3
Andrew J. Holbrook, Akihiko Nishimura, Xiang Ji, and Marc A. Suchard
1 Introduction 3
2 Core Challenges 1-3 5
3 Model-Specific Advances 8
4 Core Challenges 4 and 5 12
5 Rise of Data Science 16
2 Statistical Software 23
Alfred G. Schissler and Alexander D. Knudson
1 User Development Environments 23
2 Popular Statistical Software 26
3 Noteworthy Statistical Software and Related Tools 30
4 Promising and Emerging Statistical Software 36
5 The Future of Statistical Computing 38
6 Concluding Remarks 39
3 An Introduction to Deep Learning Methods 43
Yao Li, Justin Wang, and Thomas C. M. Lee
1 Introduction 43
2 Machine Learning: An Overview 43
3 Feedforward Neural Networks 45
4 Convolutional Neural Networks 48
5 Autoencoders 52
6 Recurrent Neural Networks 54
7 Conclusion 57
4 Streaming Data and Data Streams 59
Taiwo Kolajo, Olawande Daramola, and Ayodele Adebiyi
1 Introduction 59
2 Data Stream Computing 61
3 Issues in Data Stream Mining 61
4 Streaming Data Tools and Technologies 64
5 Streaming Data Pre-Processing: Concept and Implementation 65
6 Streaming Data Algorithms 65
7 Strategies for Processing Data Streams 68
8 Best Practices for Managing Data Streams 69
9 Conclusion and the Way Forward 70
Part II Simulation-Based Methods 79
5 Monte Carlo Simulation: Are We There Yet? 81
Dootika Vats, James M. Flegal, and Galin L. Jones
1 Introduction 81
2 Estimation 83
3 Sampling Distribution 84
4 Estimating Σ 87
5 Stopping Rules 88
6 Workflow 89
7 Examples 90
6 Sequential Monte Carlo: Particle Filters and Beyond 99
Adam M. Johansen
1 Introduction 99
2 Sequential Importance Sampling and Resampling 99
3 SMC in Statistical Contexts 106
4 Selected Recent Developments 112
7 Markov Chain Monte Carlo Methods, A Survey with Some Frequent Misunderstandings 119
Christian P. Robert and Wu Changye
1 Introduction 119
2 Monte Carlo Methods 121
3 Markov Chain Monte Carlo Methods 128
4 Approximate Bayesian Computation 141
5 Further Reading 145
8 Bayesian Inference with Adaptive Markov Chain Monte Carlo 151
Matti Vihola
1 Introduction 151
2 Random-Walk Metropolis Algorithm 151
3 Adaptation of Random-Walk Metropolis 152
4 Multimodal Targets with Parallel Tempering 156
5 Dynamic Models with Particle Filters 157
6 Discussion 159
9 Advances in Importance Sampling 165
Victor Elvira and Luca Martino
1 Introduction and Problem Statement 165
2 Importance Sampling 167
3 Multiple Importance Sampling (MIS) 171
4 Adaptive Importance Sampling (AIS) 174
Part III Statistical Learning 183
10 Supervised Learning 185
Weibin Mo and Yufeng Liu
1 Introduction 185
2 Penalized Empirical Risk Minimization 186
3 Linear Regression 190
4 Classification 193
5 Extensions for Complex Data 200
6 Discussion 203
11 Unsupervised and Semisupervised Learning 209
Jia Li and Vincent A. Pisztora
1 Introduction 209
2 Unsupervised Learning 210
3 Semisupervised Learning 219
4 Conclusions 224
12 Random Forest 231
Peter Calhoun, Xiaogang Su, Kelly M. Spoon, Richard A. Levine, and Juanjuan Fan
1 Introduction 231
2 Random Forest (RF) 232
3 Random Forest Extensions 235
4 Random Forests of Interaction Trees (RFIT) 239
5 Random Forest of Interaction Trees for Observational Studies 243
6 Discussion 249
13 Network Analysis 253
Rong Ma and Hongzhe Li
1 Introduction 253
2 Gaussian Graphical Models for Mixed Partial Compositional Data 255
3 Theoretical Properties 257
4 Graphical Model Selection 260
5 Analysis of a Microbiome-Metabolomics Data 260
6 Discussion 261
14 Tensors in Modern Statistical Learning 269
Will Wei Sun, Botao Hao, and Lexin Li
1 Introduction 269
2 Background 270
3 Tensor Supervised Learning 272
4 Tensor Unsupervised Learning 276
5 Tensor Reinforcement Learning 282
6 Tensor Deep Learning 286
15 Computational Approaches to Bayesian Additive Regression Trees 297
Hugh Chipman, Edward George, Richard Hahn, Robert McCulloch, Matthew Pratola, and Rodney Sparapani
1 Introduction 297
2 Bayesian CART 298
3 Tree MCMC 302
4 The BART Model 308
5 BART Example: Boston Housing Values and Air Pollution 310
6 BART MCMC 311
7 BART Extensions 313
8 Conclusion 320
Part IV High-Dimensional Data Analysis 323
16 Penalized Regression 325
Seung Jun Shin and Yichao Wu
1 Introduction 325
2 Penalization for Smoothness 326
3 Penalization for Sparsity 328
4 Tuning Parameter Selection 330
17 Model Selection in High-Dimensional Regression 333
Hao H. Zhang
1 Model Selection Problem 333
2 Model Selection in High-Dimensional Linear Regression 335
3 Interaction-Effect Selection for High-Dimensional Data 339
4 Model Selection in High-Dimensional Nonparametric Models 342
5 Concluding Remarks 349
18 Sampling Local Scale Parameters in High-Dimensional Regression Models 355
Anirban Bhattacharya and James E. Johndrow
1 Introduction 355
2 A Blocked Gibbs Sampler for the Horseshoe 356
3 Sampling (ξ, σ², β) 359
4 Sampling η 360
5 Appendix: A. Newton-Raphson Steps for the Inverse-cdf Sampler for η 367
19 Factor Modeling for High-Dimensional Time Series 371
Chun Yip Yau
1 Introduction 371
2 Identifiability 372
3 Estimation of High-Dimensional Factor Model 373
4 Determining the Number of Factors 383
Part V Quantitative Visualization 387
20 Visual Communication of Data: It Is Not a Programming Problem, It Is Viewer Perception 389
Edward Mulrow and Nola du Toit
1 Introduction 389
2 Case Studies Part 1 391
3 Let StAR Be Your Guide 393
4 Case Studies Part 2: Using StAR Principles to Develop Better Graphics 394
5 Ask Colleagues Their Opinion 397
6 Case Studies: Part 3 398
7 Iterate 401
8 Final Thoughts 402
21 Uncertainty Visualization 405
Lace Padilla, Matthew Kay, and Jessica Hullman
1 Introduction 405
2 Uncertainty Visualization Theories 408
3 General Discussion 420
22 Big Data Visualization 427
Leland Wilkinson
1 Introduction 427
2 Architecture for Big Data Analytics 428
3 Filtering 430
4 Aggregating 430
5 Analyzing 436
6 Big Data Graphics 436
7 Conclusion 440
23 Visualization-Assisted Statistical Learning 443
Catherine B. Hurley and Katarina Domijan
1 Introduction 443
2 Better Visualizations with Seriation 444
3 Visualizing Machine Learning Fits 445
4 Condvis2 Case Studies 447
5 Discussion 453
24 Functional Data Visualization 457
Marc G. Genton and Ying Sun
1 Introduction 457
2 Univariate Functional Data Visualization 458
3 Multivariate Functional Data Visualization 461
4 Conclusions 465
Part VI Numerical Approximation and Optimization 469
25 Gradient-Based Optimizers for Statistics and Machine Learning 471
Cho-Jui Hsieh
1 Introduction 471
2 Convex Versus Nonconvex Optimization 472
3 Gradient Descent 473
4 Proximal Gradient Descent: Handling Nondifferentiable Regularization 475
5 Stochastic Gradient Descent 476
26 Alternating Minimization Algorithms 481
David R. Hunter
1 Introduction 481
2 Coordinate Descent 482
3 EM as Alternating Minimization 484
3.1 Finite Mixture Models 485
4 Matrix Approximation Algorithms 486
5 Conclusion 489
27 A Gentle Introduction to Alternating Direction Method of Multipliers (ADMM) for Statistical Problems 493
Shiqian Ma and Mingyi Hong
1 Introduction 493
2 Two Perfect Examples of ADMM 494
3 Variable Splitting and Linearized ADMM 496
4 Multiblock ADMM 499
5 Nonconvex Problems 501
6 Stopping Criteria 502
7 Convergence Results of ADMM 502
28 Nonconvex Optimization via MM Algorithms: Convergence Theory 509
Kenneth Lange, Joong-Ho Won, Alfonso Landeros, and Hua Zhou
1 Background 509
2 Convergence Theorems 510
3 Paracontraction 521
4 Bregman Majorization 523
Part VII High-Performance Computing 535
29 Massive Parallelization 537
Robert B. Gramacy
1 Introduction 537
2 Gaussian Process Regression and Surrogate Modeling 539
3 Divide-and-Conquer GP Regression 542
4 Empirical Results 548
5 Conclusion 552
30 Divide-and-Conquer Methods for Big Data Analysis 559
Xueying Chen, Jerry Q. Cheng, and Min-ge Xie
1 Introduction 559
2 Linear Regression Model 560
3 Parametric Models 561
4 Nonparametric and Semiparametric Models 567
5 Online Sequential Updating 568
6 Splitting the Number of Covariates 569
7 Bayesian Divide-and-Conquer and Median-Based Combining 570
8 Real-World Applications 571
9 Discussion 572
31 Bayesian Aggregation 577
Yuling Yao
1 From Model Selection to Model Combination 577
2 From Bayesian Model Averaging to Bayesian Stacking 580
3 Asymptotic Theories of Stacking 584
4 Stacking in Practice 586
5 Discussion 588
32 Asynchronous Parallel Computing 593
Ming Yan
1 Introduction 593
2 Asynchronous Parallel Coordinate Update 597
3 Asynchronous Parallel Stochastic Approaches 602
4 Doubly Stochastic Coordinate Optimization with Variance Reduction 604
5 Concluding Remarks 605