
Posit AI Blog: News from the sparkly-verse


Highlights

sparklyr and friends have received some major updates over the past few months. Here are some highlights:

  • spark_apply() now works on Databricks Connect v2

  • sparkxgb is coming back to life

  • Support for Spark 2.3 and earlier has ended

pysparklyr 0.1.4

spark_apply() now works on Databricks Connect v2. The latest pysparklyr
release uses the rpy2 Python library as the backbone of the integration.

Databricks Connect v2 is based on Spark Connect. Currently, it supports Python user-defined functions (UDFs), but not R user-defined functions. Using rpy2 circumvents this limitation. As shown in the diagram, sparklyr
sends the R code to the locally installed rpy2, which in turn sends it to Spark. The rpy2 installed on the remote Databricks cluster then runs the R code.


Figure 1: R code via rpy2

A great advantage of this approach is that rpy2 supports Arrow. In fact, it is the recommended Python library to use when integrating Spark, Arrow and R. This means that data exchange between the three environments will be much faster!

As in the original implementation, schema inference works, and it still comes at a performance cost. Unlike the original, however, this implementation returns a ‘columns’ specification that you can pass the next time you run the call.
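The schema-inference workflow described above can be sketched as follows. This is a minimal illustration, not taken from this post: it assumes a reachable Databricks cluster, that DATABRICKS_HOST and DATABRICKS_TOKEN are set in the environment, and that the cluster ID is stored in an environment variable named DATABRICKS_CLUSTER_ID. The ‘columns’ string in the second call is illustrative only; use the specification reported by your own first call.

```r
library(sparklyr)

# Assumption: credentials are read from DATABRICKS_HOST / DATABRICKS_TOKEN,
# and the cluster ID is kept in DATABRICKS_CLUSTER_ID.
sc <- spark_connect(
  method = "databricks_connect",
  cluster_id = Sys.getenv("DATABRICKS_CLUSTER_ID")
)

tbl_mtcars <- copy_to(sc, mtcars)

# First call: no `columns` argument, so the schema is inferred.
# This is the slower path; the call reports a `columns` specification.
spark_apply(tbl_mtcars, nrow, group_by = "am")

# Later calls: pass the reported specification to skip inference.
# The string below is an illustrative placeholder, not guaranteed output.
spark_apply(
  tbl_mtcars,
  nrow,
  group_by = "am",
  columns = "am long, x long"
)

spark_disconnect(sc)
```

Supplying the specification up front is worthwhile mainly for calls that run repeatedly, since inference is paid on every call that omits it.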

More on this capability: Run R inside Databricks Connect

sparkxgb

sparkxgb is an extension of sparklyr. It enables integration with
XGBoost. The current CRAN version is not compatible with the latest versions of XGBoost. This limitation recently prompted a full refresh of sparkxgb. Here is a summary of the improvements, which are currently in the development version of the package:

  • The xgboost_classifier() and xgboost_regressor() functions no longer pass values of two arguments. These were deprecated by XGBoost and cause an error if used. In the R functions, the arguments remain for backward compatibility, but they generate an informative error if they are not left NULL.

  • Updates the JVM version used during the Spark session. It now uses xgboost4j-spark version 2.0.3 instead of 0.8.1. This gives us access to XGBoost's most recent Spark code.

  • Updates code that used deprecated features from previous R dependencies. Also stops using an unmaintained package as a dependency (forge). This eliminated all warnings that occurred when fitting a model.

  • Major improvements to package testing. Unit tests were updated and expanded, the way sparkxgb automatically starts and stops the Spark session for testing was improved, and the continuous integration tests were restored. This will help ensure the health of the package going forward.
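As a rough sketch of what fitting a model with the development version looks like, the example below trains a classifier on the iris data. It assumes a local Spark installation, and the GitHub repository name (rstudio/sparkxgb) and the use of the remotes package to install it are assumptions rather than details given in this post. The two deprecated arguments are simply left at their default of NULL.

```r
# Assumption: the development version lives at rstudio/sparkxgb on GitHub.
# remotes::install_github("rstudio/sparkxgb")

library(sparkxgb)
library(sparklyr)
library(dplyr)

# Assumes Spark is installed locally (e.g. via sparklyr::spark_install()).
sc <- spark_connect(master = "local")

# Copy the iris data into Spark.
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)

# Fit a three-class XGBoost classifier through the formula interface.
xgb_model <- xgboost_classifier(
  iris_tbl,
  Species ~ .,
  num_class = 3,
  num_round = 50,
  max_depth = 4
)

# Score the training data and inspect the predictions.
xgb_model %>%
  ml_predict(iris_tbl) %>%
  select(Species, predicted_label, starts_with("probability_")) %>%
  glimpse()

spark_disconnect(sc)
```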

Support for Spark 2.3 and earlier has ended. Spark 2.3 reached end of life in 2018.

This is part of a larger, ongoing effort to make the immense codebase of
sparklyr a little easier to maintain, and therefore reduce the risk of failures. As part of the same effort, the number of upstream packages that sparklyr
depends on has been reduced. This has been happening across multiple CRAN releases, and in this latest release tibble and rappdirs are no longer imported by sparklyr.

Re-use

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources are not covered by this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution, please cite this work as

Ruiz (2024, April 22). Posit AI Blog: News from the sparkly-verse. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2024-04-22-sparklyr-updates/

BibTeX Citation

@misc{sparklyr-updates-q1-2024,
  author = {Ruiz, Edgar},
  title = {Posit AI Blog: News from the sparkly-verse},
  url = {https://blogs.rstudio.com/tensorflow/posts/2024-04-22-sparklyr-updates/},
  year = {2024}
}
