|SQL Server 2017 Machine Learning Services with R|
Chapter 7 Operationalizing R Code
We now look at how you put the R code you’ve created into use, taking advantage of SQL Server’s security model, optimizations, and proximity to the data.
The chapter opens with a look at the predictive modelling workflow, in which the data is prepared, the model trained, and then used. Each of these steps is illustrated with helpful R code. The model used here predicts the likelihood of a taxi driver receiving a tip, using regression. This is followed by an interesting diversion into SQL Server 2017’s new PREDICT function, which can be a substitute for using R, provided you have a trained model. The authors then look at how security, via roles and permissions, can be integrated into the running of R scripts.
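To give a flavour of the tip-prediction idea, here is a minimal sketch using a logistic regression on synthetic trip data. The book works with the NYC taxi dataset and the RevoScaleR functions; base R’s glm and made-up column names are used here purely for a self-contained illustration.

```r
# Hypothetical sketch: predict whether a trip is tipped, on synthetic data.
set.seed(42)
n <- 500
trips <- data.frame(
  fare_amount     = runif(n, 3, 60),
  trip_distance   = runif(n, 0.5, 15),
  passenger_count = sample(1:4, n, replace = TRUE)
)
# Synthetic label: pricier, longer trips are more likely to be tipped
trips$tipped <- rbinom(n, 1, plogis(-2 + 0.05 * trips$fare_amount +
                                      0.1 * trips$trip_distance))

model <- glm(tipped ~ fare_amount + trip_distance + passenger_count,
             data = trips, family = binomial)

# Probability of a tip for a new trip
new_trip <- data.frame(fare_amount = 25, trip_distance = 5, passenger_count = 1)
p_tip <- predict(model, newdata = new_trip, type = "response")
```

Once such a model is trained and serialized into a table, the PREDICT function can score new rows in T-SQL without invoking the R runtime.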
The next section looks at the main tools for operationalizing R code in T-SQL.
The tools are discussed with helpful examples. There’s a very brief section on monitoring and management via Dynamic Management Views (DMVs) and the Resource Governor – much too brief to be anything other than a starting point.
The chapter ends with a look at integration into other products, namely Powershell, SQL Agent, and SQL Server Integration Services (SSIS). It should be noted that each of the examples provides only the bare minimum to show R code integration.
This chapter provided a useful look at how R code integrates into the wider environment. I found the example of predicting the probability of tipping a taxi driver very interesting; it certainly makes you think of other potential uses for this approach.
Chapter 8 Deploying, Managing, and Monitoring Database Solutions containing R Code
Having looked at how to get R code production ready, this chapter focuses on various necessary features of Database Lifecycle Management (DLM) and related DevOps practices. These include version control, unit testing, Continuous Integration/Continuous Delivery (CI/CD), and monitoring the model’s accuracy.
To follow along with the chapter’s example DLM, you’ll need SQL Server Data Tools (SSDT) in Visual Studio 2017 and Visual Studio Team Services (VSTS). Helpfully the authors provide useful step-by-step instructions for the example (create a database project, import the database, add a stored procedure, publish changes, and add unit tests).
Next, the authors demonstrate how version control is used from within VSTS – a critical process for recovery and progress. This is followed with step-by-step instructions on how to set up Continuous Integration / Continuous Delivery – allowing automatic builds and deployment based on triggers. This automation allows for consistency and is less prone to human error.
The chapter ends with a look at monitoring the model’s accuracy, ensuring it remains relevant as new data is added. Other models should be examined, and the original replaced if a more accurate one is found. A useful technique for comparing the different models is illustrated.
I found this a very useful chapter. Although it’s important to know about R functionality to get the most from it, you also need to understand how R integrates into the wider software lifecycle, and this chapter does that admirably. I found it useful that the authors provided brief details, and links, to some alternate DLM tools, together with DLM references.
Chapter 9 Machine Learning Services with R for DBAs
Using R in-database offers various advantages, particularly in reference to real-time analysis and prediction. This chapter explores how R can be used by DBAs, for gathering, analysing, baselining, and providing predictions.
Various aspects of gathering data are discussed, including the impact on performance, storage, and the data to collect. These are illustrated in an example using delayed durability, which can improve performance at the cost of potential data loss. The example shows various tools to measure performance, namely: SQLQueryStress, Performance Monitor (PerfMon), and Activity Monitor. The examples compare the performance counters Log Flushes/sec and Transactions/sec with delayed durability enabled and disabled. Since PerfMon can be expensive on system resources, another example using lightweight Extended Events (XE) is also provided – useful SQL code is provided to create the relevant XE session.
All the data collected using the XE session is then analysed. In the example, there are 433 measure points from 32 different extended events. The R boxplot function is used to show the distribution of the measures, with the aim of highlighting any outliers so they can be cleaned or removed and don’t skew the data. Next, summary statistics and a correlation matrix are computed from the data. The correlation matrix heat map is an easy way to visually identify correlations. Factor analysis is then applied to the data to identify underlying variable relationships. Finally, the underlying factors are shown as a scatterplot.
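The cleaning-and-correlation workflow described above can be sketched in a few lines of base R. This is an illustrative example on synthetic performance measures, not the book’s code, and the column names are invented.

```r
# Synthetic performance measures with one deliberate outlier in each column
set.seed(1)
perf <- data.frame(
  log_flushes  = c(rnorm(50, 100, 10), 400),
  transactions = c(rnorm(50, 500, 50), 2000)
)

# boxplot.stats flags the same points a boxplot would draw as outliers
out   <- boxplot.stats(perf$log_flushes)$out
clean <- perf[!perf$log_flushes %in% out, ]

summary(clean)            # summary statistics per measure
cor_matrix <- cor(clean)  # correlation matrix, ready for a heat map
```

A package such as corrplot (or ggplot2’s geom_tile) can then render cor_matrix as the heat map the authors describe.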
The chapter continues with a look at how the collected data can be used as a baseline, allowing subsequent data to be compared and predictions made. Some useful typical data usage patterns are shown (e.g. ETL followed by ERP usage over time). The data can be collected as PerfMon counters (e.g. RAM, CPU), DMVs, Query Store, etc. The example shows how differences can be discerned using a Mahalanobis graph, from within Power BI.
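The baseline-comparison idea can be illustrated with R’s built-in mahalanobis function: distances measure how far new observations sit from the baseline’s centre, taking the correlations between counters into account. The data below is synthetic, and the book renders the result as a graph in Power BI rather than in R.

```r
# Baseline of collected counters (synthetic CPU % and RAM % values)
set.seed(7)
baseline <- data.frame(cpu = rnorm(200, 40, 5), ram = rnorm(200, 60, 8))
center <- colMeans(baseline)
covmat <- cov(baseline)

# Compare a typical new observation against an anomalous one
new_points <- rbind(c(cpu = 41, ram = 61), c(cpu = 90, ram = 95))
d <- mahalanobis(new_points, center, covmat)
# Larger distances indicate observations that deviate from the baseline
```

Plotting d over time makes deviations from the established usage pattern immediately visible.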
The chapter ends with an example of predicting disk usage, with useful example code provided.
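A simple version of disk-usage prediction can be sketched with a linear trend fitted by lm. The growth data below is synthetic and the 500 GB capacity is an assumed figure; the book’s own example code differs.

```r
# Synthetic daily disk usage: steady growth plus noise
days    <- 1:90
used_gb <- 200 + 1.5 * days + rnorm(90, 0, 5)
fit <- lm(used_gb ~ days)

# Forecast usage 30 days beyond the observed window
forecast <- predict(fit, newdata = data.frame(days = 120))

# Estimate when an assumed 500 GB disk fills, from the fitted trend
days_to_full <- (500 - coef(fit)[1]) / coef(fit)[2]
```

In practice the inputs would come from a scheduled collection of database or volume sizes rather than a simulated series.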
This chapter provided a useful demonstration of how R code can be used in practice by the DBA for gathering, analysing, and predicting performance. Useful template R code is provided throughout.
Chapter 10 R and SQL Server 2016/2017 Features Extended
Here the authors round off the book by looking at how R can be used with some of the more recent SQL Server features, namely: JSON, PolyBase, ColumnStore indexes, and in-memory OLTP.
The interplay between R and JSON code is examined with an interesting example relating to Emergency Medical Services (EMS) incidents, by area, and response times. R is used to forecast future EMS incidents.
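Forecasting incident counts can be sketched with base R’s exponential smoothing. The monthly counts below are synthetic, with an artificial seasonal cycle; the book derives its series from the EMS JSON data.

```r
# Synthetic monthly incident counts with a yearly seasonal pattern
set.seed(3)
incidents <- ts(round(100 + 10 * sin(2 * pi * (1:48) / 12) + rnorm(48, 0, 5)),
                frequency = 12, start = c(2014, 1))

# Holt-Winters exponential smoothing captures level, trend, and seasonality
fit <- HoltWinters(incidents)
fc  <- predict(fit, n.ahead = 6)   # forecast the next six months
```

Wrapped in sp_execute_external_script, such a forecast can be returned directly to T-SQL callers as a result set.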
Next, the authors show how PolyBase (which can access data in Hadoop or Azure blob storage), is used to obtain data, which is then processed by R and the output then displayed with SQL Operations Studio.
The chapter ends with examples showing how R can take advantage of various table storage options. In the example, the rxLinMod function is used (together with SET STATISTICS TIME) to show how performance changes across the storage options, including columnstore indexes and memory-optimized tables.
This chapter provided a useful look at how R can take advantage of some of the newer SQL Server features, and the newer table storage options. The instructions for using PolyBase itself were very helpful. I had to look up the meaning of the acronym EMS since it wasn’t provided.
The book aims to introduce you to using R with SQL Server and mostly succeeds. The book is generally well written, easy to read, with informative discussions, a good flow between the topics, and useful links for further material. I found some of the initial chapters a bit shaky in places, but the book gets better as you progress through it. Some of the later chapters were especially useful. A few of the code examples need checking, since some gave me errors. In some places better editing is needed.
To get the most out of this book you need a solid understanding of the various statistical functions used; this might cause problems for readers lacking this knowledge. It might have been useful for the authors to include an appendix on basic statistics or provide links for further basic information.
The book primarily focuses on SQL Server 2017, but much of it is applicable to SQL Server 2016. Similarly, much of the peripheral information is applicable to SQL Server 2017’s other Machine Learning Services language, Python.
Overall, if you can put up with its faults, it’s a useful tutorial on using R within SQL Server.
|Last Updated ( Tuesday, 02 July 2019 )|