Highlights
sparklyr and friends have received some major updates over the past few months. Here are some highlights:
- spark_apply() now works on Databricks Connect v2
- sparkxgb is coming back to life
- Support for Spark 2.3 and earlier has ended
pysparklyr 0.1.4
spark_apply() now works on Databricks Connect v2. The latest pysparklyr release uses the rpy2 Python library as the backbone of the integration.
Databricks Connect v2 is based on Spark Connect. Currently, it supports Python user-defined functions (UDFs), but not R user-defined functions. Using rpy2 circumvents this limitation. As shown in the diagram, sparklyr sends the R code to the locally installed rpy2, which in turn sends it to Spark. Then the rpy2 installed on the remote Databricks cluster runs the R code.
Figure 1: R code via rpy2
A big advantage of this approach is that rpy2 supports Arrow. In fact, it is the recommended Python library to use when integrating Spark, Arrow and R. This means that data exchange between the three environments will be much faster!
As in the original implementation, schema inference works, and it still comes at a performance cost. Unlike the original, though, this implementation returns a ‘columns’ specification that you can reuse the next time you run the call.
spark_apply(
  tbl_mtcars,
  nrow,
  group_by = "am"
)
#> To increase performance, use the following schema:
#> columns = "am double, x long"
#> # Source: table<`sparklyr_tmp_table_b84460ea_b1d3_471b_9cef_b13f339819b6`> (2 x 2)
#> # Database: spark_connection
#> am x
#>
#> 1 0 19
#> 2 1 13
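The schema suggested in the message above can be passed back through the columns argument of spark_apply(). The following is a minimal sketch of that idea, not the package's prescribed workflow: count_rows_by_am is a hypothetical helper name used only for illustration, and the default schema string is copied from the output above.

```r
# Sketch: reuse the schema that spark_apply() suggested on the first run.
# `count_rows_by_am` is a hypothetical helper, not part of sparklyr.
count_rows_by_am <- function(tbl, schema = "am double, x long") {
  sparklyr::spark_apply(
    tbl,
    nrow,
    group_by = "am",
    columns = schema  # supplying the schema should avoid the inference step
  )
}

# Usage, assuming `tbl_mtcars` is the Spark table from the call above:
# count_rows_by_am(tbl_mtcars)
```

If the function passed to spark_apply() changes its output columns, the schema string has to change with it, so it is worth rerunning once without columns to get a fresh suggestion.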
A full article on this new capability is available here:
Run R inside Databricks Connect
sparkxgb
The sparkxgb package is an extension of sparklyr. It enables integration with XGBoost. The current CRAN release does not support the latest versions of XGBoost. This limitation recently prompted a full refresh of sparkxgb. Here is a summary of the improvements, which are currently in the development version of the package:
- The xgboost_classifier() and xgboost_regressor() functions no longer pass values of two arguments. These were deprecated by XGBoost and cause an error if used. In the R functions, the arguments remain for backward compatibility, but will generate an informative error if they are not left NULL.
- Updates the JVM version used during the Spark session. It now uses xgboost4j-spark version 2.0.3 instead of 0.8.1. This gives us access to XGBoost's most recent Spark code.
- Updates code that used deprecated functions from upstream R dependencies. It also stops using an unmaintained package as a dependency (forge). This eliminated all of the warnings that appeared when fitting a model.
- Major improvements to package testing. Unit tests were updated and expanded, along with the way sparkxgb automatically starts and stops the Spark session for testing, and the continuous integration tests have been restored. This will help ensure the package's health going forward.
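remotes::install_github("rstudio/sparkxgb")

library(sparkxgb)
library(sparklyr)
library(dplyr) # provides select() and starts_with() used below

# Start a local Spark session and copy the iris data into it
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)

# Fit a three-class XGBoost classifier on all predictors
xgb_model <- xgboost_classifier(
  iris_tbl,
  Species ~ .,
  num_class = 3,
  num_round = 50,
  max_depth = 4
)

# Score the training data and inspect the predicted labels and probabilities
xgb_model %>%
  ml_predict(iris_tbl) %>%
  select(Species, predicted_label, starts_with("probability_")) %>%
  dplyr::glimpse()
#> Rows: ??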
#> Columns: 5
#> Database: spark_connection
#> $ Species "setosa", "setosa", "setosa", "setosa", "setosa…
#> $ predicted_label "setosa", "setosa", "setosa", "setosa", "setosa…
#> $ probability_setosa 0.9971547, 0.9948581, 0.9968392, 0.9968392, 0.9…
#> $ probability_versicolor 0.002097376, 0.003301427, 0.002284616, 0.002284…
#> $ probability_virginica 0.0007479066, 0.0018403779, 0.0008762418, 0.000…
sparklyr 1.8.5
The new version of sparklyr does not have user-facing improvements. Internally, however, it has crossed an important milestone: support for Spark version 2.3 and earlier has effectively ended. The Scala code needed to support those versions is no longer part of the package. According to Spark’s versioning policy, found here, Spark 2.3 reached end of life in 2018.
This is part of a larger, ongoing effort to make the immense codebase of sparklyr a little easier to maintain, and therefore reduce the risk of failures. As part of the same effort, the number of upstream packages that sparklyr depends on has been reduced. This has been happening across multiple CRAN releases, and in this latest release tibble and rappdirs are no longer imported by sparklyr.
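Since support for Spark 2.3 and earlier has ended, a script that may run against older clusters could check the Spark version before doing any work. The sketch below is only an illustration: check_spark_supported is a hypothetical helper that is not part of sparklyr, and the 2.4.0 minimum is an assumption inferred from the statement that 2.3 and earlier are no longer supported.

```r
# Hypothetical helper (not part of sparklyr): stop early on unsupported Spark.
# The 2.4.0 floor is an assumption inferred from "2.3 and earlier has ended".
check_spark_supported <- function(spark_ver, minimum = "2.4.0") {
  spark_ver <- as.numeric_version(spark_ver)
  if (spark_ver < as.numeric_version(minimum)) {
    stop(
      "Spark ", as.character(spark_ver), " is no longer supported by sparklyr; ",
      "use ", minimum, " or later."
    )
  }
  invisible(TRUE)
}

check_spark_supported("3.5.0") # passes silently

# With a live connection `sc`, the version could come from sparklyr itself:
# check_spark_supported(sparklyr::spark_version(sc))
```

Comparing with as.numeric_version() rather than as plain strings keeps versions such as "2.10.0" ordered correctly relative to "2.4.0".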
Re-use
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources are not covered by this license and can be recognized by a note in their caption: “Figure from …”.
Citation
For attribution, please cite this work as
Ruiz (2024, April 22). Posit AI Blog: News from the sparkly-verse. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2024-04-22-sparklyr-updates/
BibTeX Citation
@misc{sparklyr-updates-q1-2024,
  author = {Ruiz, Edgar},
  title = {Posit AI Blog: News from the sparkly-verse},
  url = {https://blogs.rstudio.com/tensorflow/posts/2024-04-22-sparklyr-updates/},
  year = {2024}
}