lakehouse_excel_read_as_spark

Public callable

Read an Excel file from a Fabric lakehouse Files path.

Spark does not natively read Excel files. This helper reads the Excel file as binary from the lakehouse, writes it to a temporary local file, loads it with pandas, then converts it into a Spark DataFrame.

This is intended for small reference files, mapping tables, and manually maintained business inputs. Large source datasets should be stored as Delta, Parquet, or CSV instead.
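For orientation, the core conversion pattern is sketched below in condensed form; it mirrors the full source at the end of this page and assumes a live notebook spark session, module-level tempfile and pandas imports, and an already-resolved file_path variable.

import tempfile

import pandas as pd

# Read the raw bytes of the matched file (binaryFile yields one row per file).
binary_df = spark.read.format("binaryFile").load(file_path)
content = binary_df.select("content").first()[0]

# Stage the bytes locally so pandas can parse the workbook.
with tempfile.NamedTemporaryFile(delete=False, suffix=".xlsx") as f:
    f.write(bytearray(content))
    local_path = f.name

# Parse with pandas, then hand the rows to Spark.
spark_df = spark.createDataFrame(pd.read_excel(local_path, sheet_name=0))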

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| lh | Housepath | Lakehouse path object returned by get_path. | required |
| relative_path | str | Path to the Excel file under the lakehouse root, for example "Files/reference/faculty_mapping.xlsx". | required |
| sheet_name | str or int | Worksheet name or index to read. Defaults to the first worksheet. | 0 |
| spark_session | object | Spark session to use. If omitted, the helper uses the notebook global spark. | None |

Returns:

| Type | Description |
|------|-------------|
| DataFrame | Spark DataFrame converted from the selected Excel worksheet. |

Raises:

| Type | Description |
|------|-------------|
| ValueError | If lh.root or relative_path is missing. |
| FileNotFoundError | If the Excel file cannot be found at the resolved lakehouse path. |
| RuntimeError | If no Spark session is available. |

Examples:

>>> lh_source = get_path("Sandbox", "Source", config=CONFIG)
>>> df_mapping = lakehouse_excel_read_as_spark(
...     lh_source,
...     "Files/reference/faculty_mapping.xlsx",
...     sheet_name="Mapping",
... )
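A second, illustrative call that reads the first worksheet by position and passes the session explicitly (the file name here is hypothetical):

>>> df_codes = lakehouse_excel_read_as_spark(
...     lh_source,
...     "Files/reference/campus_codes.xlsx",
...     sheet_name=0,
...     spark_session=spark,
... )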

Notes:

Side effects:
- Creates a temporary local file during conversion.
- Materializes rows through pandas before creating a Spark DataFrame.
Source code in src/fabricops_kit/fabric_io.py
def lakehouse_excel_read_as_spark(lh, relative_path, sheet_name=0, spark_session=None):
    """Read an Excel file from a Fabric lakehouse Files path.

    Spark does not natively read Excel files. This helper reads the Excel file
    as binary from the lakehouse, writes it to a temporary local file, loads it
    with pandas, then converts it into a Spark DataFrame.

    This is intended for small reference files, mapping tables, and manually
    maintained business inputs. Large source datasets should be stored as
    Delta, Parquet, or CSV instead.

    Parameters
    ----------
    lh : Housepath
        Lakehouse path object returned by `get_path`.
    relative_path : str
        Path to the Excel file under the lakehouse root, for example
        `"Files/reference/faculty_mapping.xlsx"`.
    sheet_name : str or int, default 0
        Worksheet name or index to read. Defaults to the first worksheet.
    spark_session : object, optional
        Spark session to use. If omitted, the helper uses the notebook global
        `spark`.

    Returns
    -------
    pyspark.sql.DataFrame
        Spark DataFrame converted from the selected Excel worksheet.

    Raises
    ------
    ValueError
        If `lh.root` or `relative_path` is missing.
    FileNotFoundError
        If the Excel file cannot be found at the resolved lakehouse path.
    RuntimeError
        If no Spark session is available.

    Examples
    --------
    >>> lh_source = get_path("Sandbox", "Source", config=CONFIG)
    >>> df_mapping = lakehouse_excel_read_as_spark(
    ...     lh_source,
    ...     "Files/reference/faculty_mapping.xlsx",
    ...     sheet_name="Mapping",
    ... )

    Notes
    -----
    Side effects:
    - Creates a temporary local file during conversion.
    - Materializes rows through pandas before creating a Spark DataFrame.
    """
    if not getattr(lh, "root", None):
        raise ValueError("lh.root is required.")
    if not relative_path:
        raise ValueError("relative_path is required.")

    spark_obj = _get_spark(spark_session)
    lakehouse_file_path = f"{lh.root.rstrip('/')}/{relative_path.lstrip('/')}"

    # binaryFile returns one row per matched file, with the raw bytes in `content`.
    bin_df = (
        spark_obj.read.format("binaryFile")
        .option("recursiveFileLookup", "false")
        .load(lakehouse_file_path)
    )

    if bin_df.count() == 0:
        raise FileNotFoundError(f"No file found at path: {lakehouse_file_path}")

    content = bin_df.select("content").collect()[0][0]

    # Stage the bytes in a local temp file so pandas can parse the workbook,
    # then remove the staged copy once the rows are in memory.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".xlsx") as temp_file:
        temp_file.write(bytearray(content))
        temp_file_path = temp_file.name

    try:
        pandas_df = pd.read_excel(temp_file_path, sheet_name=sheet_name)
    finally:
        os.remove(temp_file_path)

    return spark_obj.createDataFrame(pandas_df)
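
A caveat when adapting this pattern: spark.createDataFrame infers the schema from the pandas frame, and a worksheet column that comes back entirely empty (pandas dtype object containing only NaN) can fail type inference. One common workaround, sketched here with hypothetical column names, is to pass an explicit schema instead of relying on inference:

spark_df = spark_obj.createDataFrame(pandas_df, schema="campus_code STRING, campus_name STRING")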