Earlier in 2022, Sam Brockie was hired as a postdoctoral researcher in Jason K. Moore's lab at TU Delft to work on SymPy's code generation as part of the CZI grant.
For those not familiar with what code generation is, here's a quick explanation:
SymPy users often need to evaluate their symbolic expressions using numeric values. In simple cases, SymPy's subs and evalf methods can be used to substitute numeric values into symbolic expressions. However, this approach is slow. If your expressions are very large, or the numeric evaluations need to be done many times, then a different approach is required. Code generation is the process of automatically converting symbolic expressions into dedicated computer code for their numeric evaluation. SymPy offers a range of code generation tools, from the simple creation of numeric Python functions equivalent to a symbolic expression (e.g. lambdify) to the creation, compilation, wrapping, and/or importing of efficient numeric C/Fortran/<other language> callables. Code generation can also be used to generate other code, such as LaTeX representations of SymPy objects.
More information about code generation and numeric computation in SymPy can be found in the docs.
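As a rough illustration of the difference between substitution and code generation described above, the following minimal sketch evaluates the same expression both ways. The expression and the numeric values are made up for the example, and the math backend is selected only so that the sketch does not depend on NumPy.

```python
import math
import sympy as sp

x, y = sp.symbols("x y")
expr = sp.sin(x) * sp.exp(-y) + x**2  # illustrative expression, not from the survey

# Approach 1: substitute numeric values into the symbolic expression, then evaluate.
value_subs = float(expr.subs({x: 0.5, y: 2.0}).evalf())

# Approach 2: generate a plain Python function once, then call it as often as needed.
f = sp.lambdify((x, y), expr, modules="math")
value_lambdify = f(0.5, 2.0)

# Reference value computed directly with the standard library.
expected = math.sin(0.5) * math.exp(-2.0) + 0.5**2
```

Both approaches should agree with the reference value; the generated function avoids repeating the symbolic substitution on every call.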
A user survey was conducted between 19th September and 17th October 2022. The approach mirrored that of the SymPy user survey about documentation conducted in February 2022.
The survey consisted of five short questions, predominantly multiple choice with the option to provide additional free-form information, plus the option to give any other feedback. It was conducted using a Google Docs survey form and was advertised to SymPy users via the SymPy mailing list and SymPy Twitter account.
The primary purpose of this survey was to gather information about which of, and how, SymPy's codegen features are used by its users. This information is intended to help inform Sam Brockie's work programme on SymPy's codegen throughout the duration of his work under the CZI grant.
We would like to thank everyone who responded to and shared the survey. A total of 24 responses were received. While we acknowledge that this is a relatively small sample size, we believe that it has provided valuable feedback that is representative of the wider SymPy user base.
A detailed analysis is provided in the following sections, with a high-level summary provided directly below:
lambdify is the most used codegen interface in SymPy.
import collections
import textwrap
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('retina')  # Render inline plots at high-DPI ("retina") resolution
%matplotlib inline
df = pd.read_csv('responses.csv')
(
timestamp,
experience_level_user,
experience_level,
use_case_user,
use_case,
use_case_extra,
scientific_field,
code_type_user,
code_type,
code_type_extra,
target_language,
codegen_tool_user,
codegen_tool,
codegen_tool_extra,
lambdify_backend,
improve_user,
improve,
improve_extra,
*_,
) = df.columns
number_response = len(df)
Respondents were allowed to give a free-form response to this question. Responses were categorised during the analysis process into one of five categories based on which was the closest match:
Most respondents reported a moderate level of experience with SymPy.
EXPERIENCE_CATEGORY_MAPPING = {
    0: "No Response",
    1: "Beginner User",
    2: "Intermediate User",
    3: "Advanced User",
    4: "Major Contributor",
}
# Count missing answers separately so that they are reported as category 0
number_no_response = df[experience_level].isna().sum()
experience_category_no_response = pd.Series({"0": number_no_response})
experience_category_response = df[experience_level].dropna().astype(int).value_counts(sort=False).sort_index()
experience_category = pd.concat([experience_category_no_response, experience_category_response])
# Express each category as a percentage of all respondents for the bar labels
experience_category_proportion = (experience_category / number_response) * 100
experience_category_proportion = [f"{proportion:.1f}%" for proportion in experience_category_proportion]
ax = sns.countplot(x=df[experience_level].fillna(0).astype(int).astype("category"))
ax.bar_label(ax.containers[0], experience_category_proportion, label_type='center')
ax.set_xticklabels(list(EXPERIENCE_CATEGORY_MAPPING.values()))
ax.set_xlabel("SymPy Experience Level")
ax.set_ylabel("Number of Respondents")
# Wrap long tick labels so that they do not overlap
_ = ax.set_xticklabels(textwrap.fill(x.get_text(), 10) for x in ax.get_xticklabels())
This question was intended to find out how SymPy's code generation is used in practice. Respondents could select as many of the following five use cases as they wished:
The spectrum of options here could help inform where development should be focussed. For example, if the majority of respondents use code generation for debugging or in simple scripts, then it could indicate that improvements to error messaging would be a useful addition. Conversely, if the majority of respondents use code generation in production code or to generate code for use within another library, then it is likely that improvements to computational performance and numerical stability would be best received.
Libraries noted in responses included:
USE_CASE_MAPPING = {
    0: "No Response",
    1: "Debugging Symbolic Code",
    2: "In Notebooks/Scripts",
    3: "In Scientific Research",
    4: "In Production Code",
    5: "In Library Code",
}
# Each response is a comma-separated list of selected options; tally every option
use_case_count = collections.Counter()
for response in df[use_case].fillna(0).astype("str"):
    use_case_count.update(response.split(","))
use_case_df = pd.Series(use_case_count, name="use case").sort_index().to_frame()
# Percentages are relative to the number of respondents, not the number of selections
use_case_proportion = [f"{((user / number_response) * 100):.1f}%" for user in use_case_df["use case"]]
ax = sns.barplot(x=use_case_df.index, y=use_case_df["use case"])
ax.bar_label(ax.containers[0], use_case_proportion, label_type='center')
ax.set_xticklabels(list(USE_CASE_MAPPING.values()))
ax.set_xlabel("Codegen Use Cases")
ax.set_ylabel("Number of Respondents")
_ = ax.set_xticklabels(textwrap.fill(x.get_text(), 10) for x in ax.get_xticklabels())
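Because this question allowed multiple selections, each answer is split into its individual options before counting, and percentages are taken over the number of respondents rather than the number of selections. The following minimal sketch shows the idea on made-up data; the answers list is hypothetical and is not taken from the survey responses.

```python
import collections

# Hypothetical multi-select answers, one string per respondent ("0" = no response).
answers = ["2,3", "3", "1,2,3", "0"]

# Split each comma-separated answer and tally every selected option.
counts = collections.Counter()
for answer in answers:
    counts.update(answer.split(","))

# Percentage of respondents who selected each option.
percentages = {option: 100 * n / len(answers) for option, n in sorted(counts.items())}
```

Since one respondent can contribute to several options, the percentages can add up to more than 100%, which is also the case in the plots in this report.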
75% of respondents reported using SymPy's code generation in scientific research. The two most common scientific fields stated were mechanics and control. However, it should be noted that such a categorization is limited due to the potential overlap, or difference, between work in such areas. For example, it is possible that there is significant overlap in how respondents use SymPy for mechanics, biomechanics, and robotics research as all involve multibody modelling and likely leverage sympy.physics.mechanics.
scientific_field_count = collections.Counter()
for response in df[scientific_field].fillna("No Response").astype("str"):
    scientific_field_count.update(response.split(","))
scientific_field_df = pd.Series(scientific_field_count, name="scientific field").to_frame()
scientific_field_proportion = [
    f"{((user / number_response) * 100):.1f}%"
    for user in scientific_field_df["scientific field"]
]
ax = sns.barplot(x=scientific_field_df.index, y=scientific_field_df["scientific field"])
ax.bar_label(ax.containers[0], scientific_field_proportion, label_type='center')
ax.set_xticklabels(list(scientific_field_count.keys()))
ax.set_xlabel("Scientific Research Field")
ax.set_ylabel("Number of Respondents")
_ = ax.set_xticklabels(textwrap.fill(x.get_text(), 8) for x in ax.get_xticklabels())
This question was intended to find out what the code generated by SymPy is used for. Respondents could select as many of the following three cases as they wished:
For instances where non-Python code is being generated, the following languages were mentioned:
CODE_TYPE_MAPPING = {
    0: "No Response",
    1: "Call from Python",
    2: "Call from Another Language",
    3: "Copy-Paste into Non-Python Code",
}
code_type_count = collections.Counter()
for response in df[code_type].fillna(0).astype("str"):
    code_type_count.update(response.split(","))
code_type_df = pd.Series(code_type_count, name="code type").sort_index().to_frame()
code_type_proportion = [f"{((user / number_response) * 100):.1f}%" for user in code_type_df["code type"]]
ax = sns.barplot(x=code_type_df.index, y=code_type_df["code type"])
ax.bar_label(ax.containers[0], code_type_proportion, label_type='center')
ax.set_xticklabels(list(CODE_TYPE_MAPPING.values()))
ax.set_xlabel("Codegen Code Types")
ax.set_ylabel("Number of Respondents")
_ = ax.set_xticklabels(textwrap.fill(x.get_text(), 10) for x in ax.get_xticklabels())
The majority of respondents stated that they use SymPy's code generation to generate code that can be called from Python. This indicates that functions like lambdify, autowrap, and ufuncify are important to the majority of SymPy users that leverage code generation.
A significant portion of respondents also stated that they use code generation features to generate non-Python code that can be used outside Python. This indicates that SymPy's code printing features are also leveraged by a significant number of users.
The most common non-Python target language for code generation is C, with 25% of respondents stating that they had used SymPy's code generation to target it. Following C, the next most common language targets are C++ and Julia, with 16.7% and 12.5% of respondents stating that they had targeted them respectively.
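For readers unfamiliar with the code printers, the following minimal sketch shows one way a symbolic expression might be turned into C source text using SymPy's ccode function. The expression and the variable name result are chosen purely for illustration.

```python
import sympy as sp

x, y = sp.symbols("x y")
expr = sp.sin(x) * y + sp.sqrt(x)  # illustrative expression, not from the survey

# ccode returns a string containing a C expression equivalent to expr.
c_source = sp.ccode(expr)

# With assign_to, the printer emits an assignment statement instead.
# "result" is a hypothetical variable name in the target C program.
c_statement = sp.ccode(expr, assign_to="result")
```

The resulting strings can then be pasted into, or written out as part of, a C source file. This corresponds to the copy-paste use case from the question above.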
target_language_count = collections.Counter()
for response in df[target_language].fillna("No Response").astype("str"):
    target_language_count.update(response.split(","))
target_language_df = pd.Series(target_language_count, name="target language").to_frame()
target_language_proportion = [
    f"{((users / number_response) * 100):.1f}%"
    for users in target_language_df["target language"]
]
ax = sns.barplot(x=target_language_df.index, y=target_language_df["target language"])
ax.bar_label(ax.containers[0], target_language_proportion, label_type='center')
ax.set_xticklabels(list(target_language_count.keys()))
ax.set_xlabel("Target Languages")
ax.set_ylabel("Number of Respondents")
_ = ax.set_xticklabels(textwrap.fill(x.get_text(), 10) for x in ax.get_xticklabels())
This question was intended to find out which of SymPy's code generation interfaces are most commonly used. Respondents could select as many of the following five cases as they wished:
Additional information was requested about which Lambdify backends and code printers are most used by respondants. Lambdify backends mentioned include:
CODEGEN_TOOL_MAPPING = {
    0: "No Response",
    1: "Lambdify",
    2: "Autowrap",
    3: "Ufuncify",
    4: "Printers",
    5: "Subs/Evalf",
}
codegen_tool_count = collections.Counter()
for response in df[codegen_tool].fillna(0).astype("str"):
    codegen_tool_count.update(response.split(","))
codegen_tool_df = pd.Series(codegen_tool_count, name="codegen tool").sort_index().to_frame()
codegen_tool_proportion = [f"{((user / number_response) * 100):.1f}%" for user in codegen_tool_df["codegen tool"]]
ax = sns.barplot(x=codegen_tool_df.index, y=codegen_tool_df["codegen tool"])
ax.bar_label(ax.containers[0], codegen_tool_proportion, label_type='center')
ax.set_xticklabels(list(CODEGEN_TOOL_MAPPING.values()))
ax.set_xlabel("Codegen Tools")
ax.set_ylabel("Number of Respondents")
_ = ax.set_xticklabels(textwrap.fill(x.get_text(), 10) for x in ax.get_xticklabels())
The majority of respondents stated that they use lambdify as the primary interface into SymPy's code generation. This likely relates to the fact that most respondents also stated that they primarily generate code that can be called from Python.
NumPy was the most used backend for lambdify, with 33.3% of respondents stating that they use it. This was followed by JAX, newly added in June 2022, with 12.5% of respondents stating that they use it. It is likely that NumPy is the most commonly used backend for lambdify as it is the default option. Three respondents also stated that they use Numba to JIT compile the functions returned by lambdify.
lambdify_backend_count = collections.Counter()
for response in df[lambdify_backend].fillna("No Response").astype("str"):
    lambdify_backend_count.update(response.split(","))
lambdify_backend_df = pd.Series(lambdify_backend_count, name="lambdify backend").to_frame()
lambdify_backend_proportion = [
    f"{((user / number_response) * 100):.1f}%"
    for user in lambdify_backend_df["lambdify backend"]
]
ax = sns.barplot(x=lambdify_backend_df.index, y=lambdify_backend_df["lambdify backend"])
ax.bar_label(ax.containers[0], lambdify_backend_proportion, label_type='center')
ax.set_xticklabels(list(lambdify_backend_count.keys()))
ax.set_xlabel("Lambdify Backends")
ax.set_ylabel("Number of Respondents")
_ = ax.set_xticklabels(textwrap.fill(x.get_text(), 10) for x in ax.get_xticklabels())
This question was intended to find out which aspects of SymPy's code generation users would most like to see improved. Respondents could select as many of the following six cases as they wished:
Improvements to documentation (greater explanation and more worked examples) and improvements to support for code generation of derivatives (gradients, Jacobians, and Hessians) were the two most requested areas for improvement, with 58.3% of respondents stating they'd like these areas improved. This was significantly more than any other area. Following these two, the next most requested improvements were increasing the execution speed and numerical stability of generated code, with 25.0% and 20.8% of respondents stating they would like focus in these areas respectively.
The requests for improvements to code generation of derivatives likely highlight the fact that SymPy may increasingly be used for machine learning applications, where differentiable functions are important for training (optimization). It is not entirely clear what aspect of derivatives respondents would like improved, with only one comment mentioning that SymPy's symbolic differentiation can be slow for large expressions, and that differentiation and code generation must be conducted as two separate steps when code is generated for derivatives. This is in contrast to the workflow of packages like JAX, which the respondent may have been more familiar with, where the differentiation and code generation steps are interchangeable thanks to automatic differentiation and JIT compilation.
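The two-step workflow mentioned in that comment can be sketched as follows: the expression is first differentiated symbolically, and only then is numeric code generated from the result. This is a minimal illustration on a made-up function, assuming the math backend so that nothing beyond SymPy is required.

```python
import math
import sympy as sp

x, y = sp.symbols("x y")
f = x**2 * y + sp.sin(y)  # illustrative function, not from the survey

# Step 1: symbolic differentiation with respect to each variable.
gradient = [sp.diff(f, v) for v in (x, y)]

# Step 2: code generation for the already differentiated expressions.
grad_f = sp.lambdify((x, y), gradient, modules="math")
gx, gy = grad_f(1.0, 0.0)
```

For large expressions, the symbolic differentiation in step 1 is where the slowness mentioned by the respondent would be expected to show up.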
IMPROVE_MAPPING = {
    0: "No Response",
    1: "More Languages",
    2: "Documentation",
    3: "More Math Functions",
    4: "Improve Derivatives",
    5: "Execution Speed",
    6: "Numerical Stability",
}
improve_count = collections.Counter()
for response in df[improve].fillna(0).astype("str"):
    improve_count.update(response.split(","))
improve_df = pd.Series(improve_count, name="improve").sort_index().to_frame()
improve_proportion = [f"{((user / number_response) * 100):.1f}%" for user in improve_df["improve"]]
ax = sns.barplot(x=improve_df.index, y=improve_df["improve"])
ax.bar_label(ax.containers[0], improve_proportion, label_type='center')
ax.set_xticklabels(list(IMPROVE_MAPPING.values()))
ax.set_xlabel("Improvement Area")
ax.set_ylabel("Number of Respondents")
_ = ax.set_xticklabels(textwrap.fill(x.get_text(), 8) for x in ax.get_xticklabels())
Respondents were asked to provide any other comments. These included: