lakehouse_excel_read_as_spark

Public callable

Read an Excel file from a Fabric lakehouse Files path.

Spark does not natively read Excel files. This helper reads the Excel file as binary from the lakehouse, writes it to a temporary local file, loads it with pandas, then converts it into a Spark DataFrame.

This is intended for small reference files, mapping tables, and manually maintained business inputs. Large source datasets should be stored as Delta, Parquet, or CSV instead.
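For orientation, the core conversion pattern is sketched below in condensed form; it mirrors the full source at the end of this page and assumes a live notebook spark session, module-level tempfile and pandas imports, and an already-resolved file_path variable.

import tempfile

import pandas as pd

# Read the raw bytes of the matched file (binaryFile yields one row per file).
binary_df = spark.read.format("binaryFile").load(file_path)
content = binary_df.select("content").first()[0]

# Stage the bytes locally so pandas can parse the workbook.
with tempfile.NamedTemporaryFile(delete=False, suffix=".xlsx") as f:
    f.write(bytearray(content))
    local_path = f.name

# Parse with pandas, then hand the rows to Spark.
spark_df = spark.createDataFrame(pd.read_excel(local_path, sheet_name=0))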

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| lh | Housepath | Lakehouse path object returned by get_path. | required |
| relative_path | str | Path to the Excel file under the lakehouse root, for example "Files/reference/faculty_mapping.xlsx". | required |
| sheet_name | str or int | Worksheet name or index to read. Defaults to the first worksheet. | 0 |
| spark_session | object | Spark session to use. If omitted, the helper uses the notebook global spark. | None |

Returns:

| Type | Description |
|------|-------------|
| DataFrame | Spark DataFrame converted from the selected Excel worksheet. |

Raises:

| Type | Description |
|------|-------------|
| ValueError | If lh.root or relative_path is missing. |
| FileNotFoundError | If the Excel file cannot be found at the resolved lakehouse path. |
| RuntimeError | If no Spark session is available. |

Examples:

>>> lh_source = get_path("Sandbox", "Source", config=CONFIG)
>>> df_mapping = lakehouse_excel_read_as_spark(
...     lh_source,
...     "Files/reference/faculty_mapping.xlsx",
...     sheet_name="Mapping",
... )
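A second, illustrative call that reads the first worksheet by position and passes the session explicitly (the file name here is hypothetical):

>>> df_codes = lakehouse_excel_read_as_spark(
...     lh_source,
...     "Files/reference/campus_codes.xlsx",
...     sheet_name=0,
...     spark_session=spark,
... )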

Notes:

Side effects:
- Creates a temporary local file during conversion.
- Materializes rows through pandas before creating a Spark DataFrame.
Source code in src/fabricops_kit/fabric_io.py
def lakehouse_excel_read_as_spark(lh, relative_path, sheet_name=0, spark_session=None):
    """Read an Excel file from a Fabric lakehouse Files path.

    Spark does not natively read Excel files. This helper reads the Excel file
    as binary from the lakehouse, writes it to a temporary local file, loads it
    with pandas, then converts it into a Spark DataFrame.

    This is intended for small reference files, mapping tables, and manually
    maintained business inputs. Large source datasets should be stored as
    Delta, Parquet, or CSV instead.

    Parameters
    ----------
    lh : Housepath
        Lakehouse path object returned by `get_path`.
    relative_path : str
        Path to the Excel file under the lakehouse root, for example
        `"Files/reference/faculty_mapping.xlsx"`.
    sheet_name : str or int, default 0
        Worksheet name or index to read. Defaults to the first worksheet.
    spark_session : object, optional
        Spark session to use. If omitted, the helper uses the notebook global
        `spark`.

    Returns
    -------
    pyspark.sql.DataFrame
        Spark DataFrame converted from the selected Excel worksheet.

    Raises
    ------
    ValueError
        If `lh.root` or `relative_path` is missing.
    FileNotFoundError
        If the Excel file cannot be found at the resolved lakehouse path.
    RuntimeError
        If no Spark session is available.

    Examples
    --------
    >>> lh_source = get_path("Sandbox", "Source", config=CONFIG)
    >>> df_mapping = lakehouse_excel_read_as_spark(
    ...     lh_source,
    ...     "Files/reference/faculty_mapping.xlsx",
    ...     sheet_name="Mapping",
    ... )

    Notes
    -----
    Side effects:
    - Creates a temporary local file during conversion.
    - Materializes rows through pandas before creating a Spark DataFrame.
    """
    if not getattr(lh, "root", None):
        raise ValueError("lh.root is required.")
    if not relative_path:
        raise ValueError("relative_path is required.")

    spark_obj = _get_spark(spark_session)
    lakehouse_file_path = f"{lh.root.rstrip('/')}/{relative_path.lstrip('/')}"

    # binaryFile returns one row per matched file, with the raw bytes in `content`.
    bin_df = (
        spark_obj.read.format("binaryFile")
        .option("recursiveFileLookup", "false")
        .load(lakehouse_file_path)
    )

    if bin_df.count() == 0:
        raise FileNotFoundError(f"No file found at path: {lakehouse_file_path}")

    content = bin_df.select("content").collect()[0][0]

    # Stage the bytes in a local temp file so pandas can parse the workbook,
    # then remove the staged copy once the rows are in memory.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".xlsx") as temp_file:
        temp_file.write(bytearray(content))
        temp_file_path = temp_file.name

    try:
        pandas_df = pd.read_excel(temp_file_path, sheet_name=sheet_name)
    finally:
        os.remove(temp_file_path)

    return spark_obj.createDataFrame(pandas_df)
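
A caveat when adapting this pattern: spark.createDataFrame infers the schema from the pandas frame, and a worksheet column that comes back entirely empty (pandas dtype object containing only NaN) can fail type inference. One common workaround, sketched here with hypothetical column names, is to pass an explicit schema instead of relying on inference:

spark_df = spark_obj.createDataFrame(pandas_df, schema="campus_code STRING, campus_name STRING")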