Posit AI Blog: News from the sparkly-verse

By admin

September 27, 2024


Highlights

sparklyr and friends have received some major updates over the past few months. Here are some highlights:

spark_apply() now works on Databricks Connect v2
sparkxgb is coming back to life
Support for Spark 2.3 and earlier has ended

pysparklyr 0.1.4

spark_apply() now works on Databricks Connect v2. The latest pysparklyr
release uses the rpy2 Python library as the backbone of the integration.

Databricks Connect v2 is based on Spark Connect. At this time, it supports Python user-defined functions (UDFs), but not R user-defined functions. Using rpy2 circumvents this limitation. As shown in the diagram, sparklyr sends the R code to the locally installed rpy2, which in turn sends it to Spark. Then the rpy2 installed on the remote Databricks cluster runs the R code.

Figure 1: R code via rpy2
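
To make the flow concrete, here is a minimal sketch of how a session might be set up before calling spark_apply(). It assumes that pysparklyr and its Python environment are installed, and that the Databricks host and token are available to the session (for example through environment variables). The function name connect_and_copy and its cluster_id parameter are illustrative and do not come from the original post.

```r
# A sketch only: nothing here contacts Databricks until the function is
# called with a real cluster ID. `connect_and_copy` is an illustrative name.
connect_and_copy <- function(cluster_id) {
  # method = "databricks_connect" routes the connection through pysparklyr.
  sc <- sparklyr::spark_connect(
    method = "databricks_connect",
    cluster_id = cluster_id
  )
  # Copy a small local data frame to the cluster to experiment with.
  tbl_mtcars <- dplyr::copy_to(sc, mtcars)
  list(sc = sc, tbl_mtcars = tbl_mtcars)
}
```

Wrapping the calls in a function keeps the sketch loadable on a machine without a cluster; in an interactive session the two calls would normally be run directly.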

A big advantage of this approach is that rpy2 supports Arrow. In fact, it is the recommended Python library to use when integrating Spark, Arrow and R. This means that data exchange between the three environments will be much faster!

As in the original implementation, schema inference works, and it still comes at a performance cost. Unlike the original, however, this implementation returns a 'columns' specification that you can use the next time you run the call.

spark_apply(
  tbl_mtcars,
  nrow,
  group_by = "am"
)

#> To increase performance, use the following schema:
#> columns = "am double, x long"

#> # Source:   table<`sparklyr_tmp_table_b84460ea_b1d3_471b_9cef_b13f339819b6`> (2 x 2)
#> # Database: spark_connection
#>      am     x
#>    
#> 1     0    19
#> 2     1    13
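
The suggested schema can then be passed back through the columns argument so that the next run skips inference. The sketch below assumes tbl_mtcars is the same Spark table used above; the wrapper name apply_with_schema is illustrative only. The last lines reproduce the same per-group row counts on the local mtcars data with base R, as a sanity check that needs no cluster.

```r
# Illustrative wrapper; it only touches Spark when called with a Spark table.
apply_with_schema <- function(tbl_mtcars) {
  sparklyr::spark_apply(
    tbl_mtcars,
    nrow,
    group_by = "am",
    # Schema string copied from the message printed by the first run.
    columns = "am double, x long"
  )
}

# Local check with base R: number of rows per value of `am` in mtcars.
local_counts <- sapply(split(mtcars, mtcars$am), nrow)
local_counts
```

The local counts should agree with the x column returned by the Spark call, since both apply nrow to each am group.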

For more details, see the article "Run R inside Databricks Connect".

sparkxgb

sparkxgb is an extension of sparklyr that enables integration with
XGBoost. The current CRAN release is not compatible with the latest versions of XGBoost. This limitation recently prompted a full refresh of sparkxgb. Here is a summary of the improvements, which are currently in the development version of the package:

The xgboost_classifier() and xgboost_regressor() functions no longer pass values for two arguments. These were deprecated by XGBoost and cause an error if used. In the R functions, the arguments remain for backward compatibility, but they raise an informative error if they are not left NULL.
The JVM library version used during the Spark session was updated. sparkxgb now uses xgboost4j-spark version 2.0.3 instead of 0.8.1. This gives us access to XGBoost's most recent Spark code.
Code that relied on deprecated functions from upstream R dependencies was updated. sparkxgb also stopped using an unmaintained package (forge) as a dependency. This eliminated all of the warnings that appeared when fitting a model.
Package testing was improved substantially. Unit tests were updated and expanded, the way sparkxgb automatically starts and stops the Spark session for testing was modernized, and the continuous integration tests were restored. This will help keep the package healthy going forward.
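
The backward-compatibility pattern described in the first item can be sketched in plain R. This is not sparkxgb's actual code: fit_model and legacy_arg are hypothetical names standing in for a modeling function and an argument its backend no longer accepts.

```r
# Hypothetical wrapper: `legacy_arg` is kept in the signature so that old
# calls still parse, but any non-NULL value stops with a clear message.
fit_model <- function(data, max_depth = 6, legacy_arg = NULL) {
  if (!is.null(legacy_arg)) {
    stop(
      "`legacy_arg` is no longer supported by the backend; leave it as NULL.",
      call. = FALSE
    )
  }
  # Only the supported arguments are forwarded.
  list(n_rows = nrow(data), max_depth = max_depth)
}

# Leaving the legacy argument untouched works as before.
ok <- fit_model(mtcars, max_depth = 4)

# Supplying it triggers the informative error, captured here as a string.
result <- tryCatch(
  fit_model(mtcars, legacy_arg = 0.03),
  error = function(e) conditionMessage(e)
)
```

Keeping the argument in the signature means existing scripts fail with an explanation rather than with an "unused argument" error.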

remotes::install_github("rstudio/sparkxgb")

library(sparkxgb)
library(sparklyr)
library(dplyr) # provides select() and starts_with() used below

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)

xgb_model <- xgboost_classifier(
  iris_tbl,
  Species ~ .,
  num_class = 3,
  num_round = 50,
  max_depth = 4
)

xgb_model %>% 
  ml_predict(iris_tbl) %>% 
  select(Species, predicted_label, starts_with("probability_")) %>% 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 5
#> Database: spark_connection
#> $ Species                 "setosa", "setosa", "setosa", "setosa", "setosa…
#> $ predicted_label         "setosa", "setosa", "setosa", "setosa", "setosa…
#> $ probability_setosa      0.9971547, 0.9948581, 0.9968392, 0.9968392, 0.9…
#> $ probability_versicolor  0.002097376, 0.003301427, 0.002284616, 0.002284…
#> $ probability_virginica   0.0007479066, 0.0018403779, 0.0008762418, 0.000…

sparklyr 1.8.5

Support for Spark 2.3 and earlier has ended. Spark 2.3 reached end of life in 2018.

This is part of a larger, ongoing effort to make the immense codebase of
sparklyr a little easier to maintain, and therefore reduce the risk of failures. As part of the same effort, the number of upstream packages that sparklyr
depends on has been reduced. This has been happening across multiple CRAN releases, and in this latest release tibble and rappdirs are no longer imported by sparklyr.

Re-use

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources are not covered by this license and can be recognized by a note in their caption: "Figure from …".

Citation

For attribution, please cite this work as

Ruiz (2024, April 22). Posit AI Blog: News from the sparkly-verse. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2024-04-22-sparklyr-updates/

BibTeX Citation

@misc{sparklyr-updates-q1-2024,
  author = {Ruiz, Edgar},
  title = {Posit AI Blog: News from the sparkly-verse},
  url = {https://blogs.rstudio.com/tensorflow/posts/2024-04-22-sparklyr-updates/},
  year = {2024}
}
