SQL Server 2017 Machine Learning Services with R

Article Index
SQL Server 2017 Machine Learning Services with R
Chapters 3 - 6
Chapters 7 - 10, Conclusion

Page 2 of 3

Chapter 3 Managing Machine Learning Services for SQL Server 2017 and R

In many ways, this book really starts here, and looks at getting Machine Learning Services up and running.

The chapter opens with a brief look at the minimum requirements that R integration requires. Next, there’s a look at what the various SQL Server editions have in relation to basic or advanced R capabilities. There’s a helpful step-by-step guide on how to install R Machine Learning Services, with screenshots and discussions on user accounts, open version licence, and the update cycle.

A short script is provided to enable R code to run within SQL Server, this involves enabling ‘external scripts enabled’. There’s a brief note for a link to install “R Tools for Visual Studio (RTVS)”.

The chapter continues with a look at security, with code for creating a SQL login, database roles, and “EXECUTE AS” functionality. However, the script provided doesn’t work as-is (you can’t create the login for the default SQLR database until you’ve created that database). I note the created login is added to the sysadmin server role, this gives too much privilege - not a best practise.

The section continues with a look at how the Resource Governor can be used to control the amount of CPU and RAM that R uses, thus influencing performance. Next, the process of installing R packages is examined.

The chapter ends with a miscellaneous group of topics, relating to tools that can be used with Package Information. There’s a very useful section on ‘sp_execute_external_script’ together with its arguments – this is the central tool for SQL Server and R communication.

This chapter provides a useful walkthrough on how to install Machine Learning Services for SQL Server 2017, together with an ‘interesting’ look at the tools that can be used with packages.

While there’s a brief mention on installing “R Tools for Visual Studio (RTVS)”, the authors don’t say why you might want to do this (it’s because processing R code within SQL Server is a bit like running Dynamic SQL – difficult to maintain and debug, so running the R code first in Visual Studio can help). It also states: “…if MAXDOP is turned off…”, this needs better wording (MAXDOP can’t be turned on or off, but this setting helps control the degree of parallelism). Some of the topics under the heading are ‘unusual’ e.g. Resource Governor is under Security, but security isn’t mentioned.

Chapter 4 Data Exploration and Data Visualization

Examining and visualizing data for patterns and meaning is essential before predictive modelling can occur. The authors briefly discuss the relationship between the data types of R and T-SQL, important because implicit conversion can occur. The data frame is examined, being like a SQL table, and is used as the input/output parameters of sp_execute_external_script. There’s a useful tip concerning what computation should be done in R and what in SQL server.

There’s a useful walkthrough on using RTVS to access SQL Server data. This is followed with a brief overview of some useful R functions that allow you to explore a data frame. Next, various data munging techniques are examined, allowed data to be changed to make it more useful.

Having shown how to use RTVS to get the base R code working, this is then extended and integrated into a T-SQL example via the call to sp_execute_external_script routine.

The chapter now switches to Data Visualization, which aims to highlight salient content visually. Various R functions and libraries are explained, together with example R code for use within RTVS, including: Plot, Histogram, Tree diagram. Having described the function usage in RTVS, an example is now provided to use T-SQL and displayed in SQL Operations Studio (this needs to be downloaded, it is not part of SQL Server Management Studio [SSMS]).

The chapter ends with examples of integrating R into SSRS reports, and Power BI. I had various problems following the Power BI instructions as given.

This was a mixed chapter, it is both informative with useful examples and explanations, but also problematic. There are some problems with the text. The text “To check the data type, you can call the function call on the variable”, should read “To check the data type, you can call the function class on the variable”. The text “Or alternatively in RTVS go to the Variable Explorer window as shown in Figure 4 - 2” is incorrect, since Figure 4 -2 doesn’t contain the Variable Explorer. One code example has “orders_ discount” which should be “orders_discount”

While it is possible to understand the chapter by figuring out what the authors are trying to say, the reader should not have to work so hard, remember this is an introductory book from which some people are learning to use R! Better editing of the chapter is needed.

Chapter 5 RevoScaleR Package

This package provides parallelism and scalability, allowing faster processing on servers with plenty of memory and CPUs.

Embedded R Code is provided to get a list of all the functions within the RevoScaleR package, additionally, a more meaningful table of the various function descriptions by area is provided. However, these are not linked to the function names – which would have been useful.

The authors next move on to describing functions that are useful for data preparation, specifically for importing data from SAS, SPSS, and ODBC. In each case, useful code is provided to import and process the source data.

The chapter continues with a look at various functions and how they affect data transformation. The authors argue that most data transformation should be done inside the database using R scripts (this overcomes problems of data movement, potential disparate datasets, performance etc). Next, code is provided to illustrate the power of various common statistical functions (e.g. Cross Tabs). Running some of the script gave me errors.

This chapter provides an overview of the useful functions in the RevoScaleR package. The chapter didn’t show me how the RevoScaleR package is used for scalability and parallelism – its chief advantage. I wanted some metrics on how performance can be improved by using this package.

Often, the statistical functions are not explained (e.g. standard deviation), which might be a concern for readers lacking a maths background. The ODBC code provided gave me errors, because the server\instance name needs to escape the backslash character e.g. Server=TOMAZK\MSSQLSERVER2017 would need to be Server=TOMAZK\\MSSQLSERVER2017.

This chapter has some occasional problems with the use of English, and there are some spelling errors e.g. ODBS instead of ODBC. Again, better editing is required.

Chapter 6 Predictive Modeling

Forecasting is important, it allows businesses to plan for future demand. The RevoScaleR package is used to create models and run predictions against them.

The chapter opens with a look at data modelling, for me the explanation of data modelling was cumbersome. Example code explores the relationship between car rentals and holidays, these are explored graphically using SQL Operational Studio (this has recently been superseded by Azure Data Studio, and the book’s link picks up this change). The authors provide an impressive visual example of correlation using the corrplot R library.

Next, some advanced predictive R functions are briefly described, and an example usage of two of these (rxLinMod and rxGlm) is provided, and a helpful discussion of the predictive output provided.

The authors now look at deploying the solution. The importance of having a powerful scalable server is noted, as is data movement, together with configuration settings. There’s a brief diversion into supervised training/testing/validating datasets – which seems odd when explanations are not provided for other areas (ok, they’ve done it here as a precursor to explaining the sample code they’ll use). There’s a useful predictive code example provided.

The chapter ends with a look at integrating what’s been done so far into SQL Server, which largely means wrapping the code into Stored Procedures. The section now takes a look at using clustering (as opposed to supervised training) – organizationally this seems odd, it would appear to belong to the previous section, and not the section on using the code within SQL Server.

This chapter provided some useful example code for making predictions, but also seemed a bit muddled. The English usage in parts of this chapter is labored. Various modelling techniques and acronyms are referenced without any explanation (e.g. ORM, CRC). Similarly, various statistical features (e.g. F-statistics) have little or no explanation – but the book does assume the reader has an adequate knowledge of Data Science.

<< Prev - Next >>

Last Updated ( Tuesday, 02 July 2019 )