Lung cancer is the uncontrolled growth of abnormal cells in one or both of the lungs. Cigarette smoking causes most lung cancers when smoke gets into the lungs. Lung cancer kills 1.8 million people every year, more than any other cancer. It has a fatality rate of 80-90% and is the leading cause of cancer death in men, and the second leading cause of cancer death in women.

**Global cancer patterns by sex:**

Lung cancer is the most commonly diagnosed cancer in men (14.5% of all cases in men and 8.4% in women) and the leading cause of cancer death in men (22.0%, i.e., about one in five of all cancer deaths). In men, this is followed by prostate cancer (13.5%) and colorectal cancer (10.9%) for incidence, and liver cancer (10.2%) and stomach cancer (9.5%) for mortality. Breast cancer is the most commonly diagnosed cancer in women (24.2%, i.e., about one in four of all new cancer cases diagnosed in women worldwide are breast cancer), and it is the most common cancer in 154 of the 185 countries included in GLOBOCAN.

Source: https://www.iarc.who.int/wp-content/uploads/2018/09/pr263_E.pdf

**1.** **Importing the libraries:**

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import skew, probplot
import plotly.express as px

le = LabelEncoder()
```

1. GENDER: Gender of the person (M: Male, F: Female)

2. AGE: Age of the person

3. SMOKING: Smoking status (2: Yes, 1: No)

4. YELLOW_FINGERS: Presence of yellow fingers (2: Yes, 1: No)

5. ANXIETY: Anxiety level (2: High, 1: Low)

6. PEER_PRESSURE: Peer pressure level (2: High, 1: Low)

7. CHRONIC DISEASE: Presence of chronic disease (2: Yes, 1: No)

8. FATIGUE: Fatigue level (2: High, 1: Low)

9. ALLERGY: Allergy status (2: Yes, 1: No)

10. WHEEZING: Wheezing condition (2: Yes, 1: No)

11. ALCOHOL CONSUMING: Alcohol consumption status (2: Yes, 1: No)

12. COUGHING: Presence of coughing (2: Yes, 1: No)

13. SHORTNESS OF BREATH: Shortness of breath condition (2: Yes, 1: No)

14. SWALLOWING DIFFICULTY: Difficulty in swallowing (2: Yes, 1: No)

15. CHEST PAIN: Presence of chest pain (2: Yes, 1: No)

16. LUNG_CANCER: Lung cancer diagnosis (2: Yes, 1: No)

**2.** **Reading the data and saving it into variable df (here 2 = YES and 1 = NO):**

```python
# Load the dataset
df = pd.read_csv('/kaggle/input/lung-cancer/data2.csv')
df
```

This code uses the Pandas library in Python to read a CSV (Comma-Separated Values) file named 'data2.csv' and load its contents into a Pandas DataFrame.

pd.read_csv('data2.csv'): This line uses the Pandas library (pd) to read the contents of a CSV file named 'data2.csv'.

df: The result of reading the CSV file is stored in a Pandas DataFrame called df. A DataFrame is a two-dimensional, tabular data structure with labeled axes (rows and columns).

In summary, this code snippet is a concise way to read data from a CSV file and store it in a DataFrame using Pandas, a powerful data manipulation and analysis library in Python.

**3. Information about datatypes, ranges, and memory usage:**

```python
# Information about data types and missing values
df.info()

df.shape
```

Here we can see that the data has 309 rows and 16 columns, with float, integer, and object columns, and an index that ranges from 0 to 308.

**4. Summary statistics:**

```python
# Summary statistics
df.describe()
```

1. **Count**: The number of non-null observations for each variable. For example, 'AGE' has 308 non-null entries while the maximum count is 309, so we can say that 'AGE' has a missing value; 'WHEEZING' has 300 non-null entries, also indicating missing values.

2. **Mean**: The average value for each variable. For instance, the average age in the dataset is approximately 62.67 years.

3. **Standard Deviation (std)**: A measure of the amount of variation or dispersion in the data. It shows how much individual data points deviate from the mean. A higher standard deviation indicates more spread in the data.

4. **Minimum (min)**: The smallest value observed for each variable. We can see that the minimum age in the dataset is 21.

5. **25th Percentile (25%)**: Also known as the first quartile, it represents the value below which 25% of the data falls. For 'AGE', 25% of the individuals are below 57 years old.

6. **Median (50%)**: The middle value of the dataset. Half of the values are below, and half are above this value. For 'AGE', the median is 62 years.

7. **75th Percentile (75%)**: The third quartile, indicating the value below which 75% of the data falls. For 'AGE', 75% of the individuals are below 69 years.

8. **Maximum (max)**: The largest value observed for each variable. In this case, the maximum age in the dataset is 87 years.

**5. Dropping duplicate rows:**

```python
# Dropping duplicate rows
duplicate_rows = df[df.duplicated()]
print("Duplicate rows except first occurrence:")
print(duplicate_rows)

df = df.drop_duplicates()
print("\nDataFrame after dropping duplicates:")
print(df)
```

This code removes duplicate rows from a DataFrame and prints the resulting DataFrame. From this code, we learned that there were 33 duplicate rows, so the dimensions are now 276 x 16.
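A minimal sketch of how `duplicated()` and `drop_duplicates()` behave, on a hypothetical three-row frame (toy values, not the lung-cancer data):

```python
import pandas as pd

# Toy frame with one exact duplicate row
toy = pd.DataFrame({"GENDER": ["M", "F", "M"], "AGE": [60, 55, 60]})

# duplicated() marks every repeat after the first occurrence as True
print(toy.duplicated().tolist())   # [False, False, True]

# drop_duplicates() keeps only the first occurrence of each row
deduped = toy.drop_duplicates()
print(deduped.shape)               # (2, 2)
```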

**6. Label encode, or OHE (because there are only two distinct values in each categorical column):**

```python
# Label encode
df['GENDER'] = le.fit_transform(df['GENDER'])
df['LUNG_CANCER'] = le.fit_transform(df['LUNG_CANCER'])
df
```

· So, by using the label encoder, we are changing M & F in the gender column to 1 & 0 and, in the same way, YES and NO to 1 & 0.

· Categorical-to-numerical conversion: Machine learning algorithms generally work with numerical data, and label encoding is a simple way to convert categorical variables into a format that can be fed into these algorithms.

· Using the LabelEncoder: In this code, a LabelEncoder (represented by the variable le) transforms the categorical values into integers. The fit_transform method fits the encoder to the unique values in the specified columns ('GENDER' and 'LUNG_CANCER') and then transforms those values into the corresponding integer labels.

· Preserving relationships: Label encoding is suitable for ordinal categorical data, where there is a meaningful order among the categories. For instance, if 'GENDER' has categories like 'Male' and 'Female', label encoding assigns 1 to 'Male' and 0 to 'Female'.
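As a standalone sketch on toy values (not the dataset itself): LabelEncoder assigns integer codes in sorted order of the class labels, which is why 'F' maps to 0 and 'M' to 1.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
# Classes get integers in sorted order: 'F' -> 0, 'M' -> 1
codes = enc.fit_transform(["M", "F", "F", "M"])
print(codes.tolist())      # [1, 0, 0, 1]
print(list(enc.classes_))  # ['F', 'M']
```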

**7. Basic graphs for insights:**

```python
# Count plot for gender
sns.countplot(x='GENDER', data=df)
plt.title('Gender Distribution -- 0 for Female & 1 for Male')
plt.show()

df.GENDER.value_counts()
```

Here we can see that there are more males than females in the dataset.

```python
# Which gender has more lung cancer?
plt.figure(figsize=(8, 5))
sns.countplot(x='GENDER', hue='LUNG_CANCER', data=df)
plt.title('Distribution of Lung Cancer Cases by Gender -- 0 for Female & 1 for Male')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
```

The graph shows that there are 120 lung cancer cases among males and 80 among females. This means males have 50% more lung cancer cases than females.
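The counts behind a chart like this can also be read off numerically with `pd.crosstab`; here is a minimal sketch on hypothetical toy data (on the real data it would be `pd.crosstab(df['GENDER'], df['LUNG_CANCER'])`):

```python
import pandas as pd

# Hypothetical mini-sample standing in for df
toy = pd.DataFrame({
    "GENDER":      ["M", "M", "F", "F", "M"],
    "LUNG_CANCER": [1, 1, 1, 0, 0],
})
table = pd.crosstab(toy["GENDER"], toy["LUNG_CANCER"])
print(table.loc["M", 1])   # 2 male cases in this toy sample
```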

```python
# Which gender smokes more?
plt.figure(figsize=(8, 5))
sns.countplot(x='GENDER', hue='SMOKING', data=df)
plt.title('Distribution of Smoking Cases by Gender -- 0 for Female & 1 for Male')
plt.xlabel('Gender')
plt.ylabel('Count')  # the y-axis is a count, not age
plt.show()
```

This graph breaks down smoking status within each gender, letting us compare how many males and females in the dataset smoke.

```python
# Distribution of smoking cases by age
plt.figure(figsize=(25, 9))
sns.countplot(x='AGE', hue='SMOKING', data=df)
plt.title('Distribution of Smoking Cases by Age')
plt.xlabel('AGE')
plt.ylabel('Count')
plt.show()
```

The graph shows that the number of smoking cases is highest among people in their 40s and 50s. There is a gradual decline in smoking cases among people in their 60s and 70s, and then a sharp decline among people over the age of 80.

**8. Histogram for AGE:**

```python
# Histogram for age
sns.histplot(df['AGE'], bins=20, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

df.AGE.value_counts()
```

The histogram shows the age distribution of the sample. The distribution is roughly normal, centered in the early 60s (consistent with the mean of about 62.67 from the summary statistics), with only a slight skew.

Another way to interpret the histogram is that most of the adults in the sample are between 50 and 70 years old; fewer are younger than 40 or older than 80. The plot shows how many people there are at each particular age.

**9. Count plot for smoking:**

The graph is a count plot of the number of lung cancer cases based on smoking status. The x-axis represents the smoking status (smoker or non-smoker), and the y-axis represents the number of lung cancer cases.

The count plot shows that there are considerably more lung cancer cases among smokers than among non-smokers. This suggests that smoking is a major risk factor for lung cancer.

The count plot also shows that there is a small number of lung cancer cases among non-smokers. This is likely due to other factors.

```python
# Count plot for smoking
plt.figure(figsize=(8, 5))
sns.countplot(x='SMOKING', hue='LUNG_CANCER', data=df)
plt.title('Count of Lung Cancer Cases Based on Smoking Status')
plt.xlabel('Smoking Status')
plt.ylabel('Count')
plt.show()
```

**10. Univariate Analysis:**

```python
# Histograms for numeric variables
numeric_variables = df.select_dtypes(include='number').columns
df[numeric_variables].hist(bins=20, figsize=(15, 10))
plt.suptitle("Histograms of Numeric Variables")
plt.show()

# Visualize counts for categorical variables
categorical_variables = df.select_dtypes(include='object').columns
for column in categorical_variables:
    plt.figure(figsize=(8, 5))
    df[column].value_counts().plot(kind='bar', color='skyblue')
    plt.title(f"Count of {column}")
    plt.xlabel(column)
    plt.ylabel("Count")
    plt.show()
```

· Age: The age distribution is skewed, with more patients in their 60s and 70s than in their 20s and 30s.

· Gender: The gender distribution is roughly equal, with slightly more male patients than female patients.

· Smoking: The smoking distribution is skewed, with more patients who have never smoked than patients who currently smoke or have smoked in the past.

· Yellow fingers: The yellow fingers distribution is skewed, with more patients who do not have yellow fingers than patients who do.

· Chronic disease: The chronic disease distribution is skewed, with more patients who do not have chronic diseases than patients who do. The same applies to the other columns.

**11.** **Bivariate Analysis:**

This code performs bivariate analysis by creating bar plots for all possible pairs of columns in the DataFrame (**df**).

The count plot shows that the number of lung cancer cases increases with age, for both men and women. It also shows that there are more male lung cancer cases than female lung cancer cases, across all age groups.

```python
# Create bar plots for bivariate analysis of all column pairs
columns = df.columns
for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
        plt.figure(figsize=(10, 6))
        sns.countplot(x=df[columns[i]], hue=df[columns[j]])
        plt.title(f'Bivariate Analysis of {columns[i]} and {columns[j]}')
        plt.show()
```

The graph also shows that the lung cancer survival rate is higher for patients in their 40s and 50s than for patients in their 60s and 70s, suggesting there may be age-related factors that influence lung cancer survival. The graph shows that there are more male lung cancer cases than female lung cancer cases, for both smokers and non-smokers. It also shows that there are more lung cancer cases among smokers than among non-smokers, for both men and women.

**Pairplot for numerical variables:**

```python
# Pairplot for numerical variables
pairplot = sns.pairplot(df, vars=['AGE', 'SMOKING'], hue='LUNG_CANCER')
pairplot.fig.suptitle('Pairplot of Age and Smoking with Lung Cancer', y=1.02)
plt.show()
```

**Box plot for smoking:**

The box plot shows the distribution of age for people with lung cancer, grouped by smoking status. The median age is lower for smokers with lung cancer (69 years old) than for non-smokers with lung cancer (73 years old). The interquartile range (IQR) is also smaller for smokers (10 years) than for non-smokers (13 years). This suggests that smokers with lung cancer tend to be younger and have a more narrowly distributed age range than non-smokers with lung cancer.

One interpretation is that smoking accelerates the progression of lung cancer, meaning smokers are more likely to be diagnosed with lung cancer at a younger age. Additionally, smoking can make lung cancer more aggressive, leading to a shorter life expectancy for smokers with lung cancer.

```python
# Box plot for smoking
sns.boxplot(x='SMOKING', y='AGE', hue='LUNG_CANCER', data=df)
plt.title('Box Plot of Age for Different Smoking Status with Lung Cancer')
plt.show()
```
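The medians read off a box plot like this can be verified numerically with a groupby. A minimal sketch on hypothetical toy values (on the real data the check would be `df.groupby(['SMOKING', 'LUNG_CANCER'])['AGE'].median()`):

```python
import pandas as pd

# Hypothetical mini-sample standing in for df
toy = pd.DataFrame({
    "SMOKING": [1, 1, 2, 2, 2],     # 1 = non-smoker, 2 = smoker (per the data dictionary)
    "AGE":     [70, 76, 62, 69, 71],
})
# Median age within each smoking group
medians = toy.groupby("SMOKING")["AGE"].median()
print(medians.to_dict())   # {1: 73.0, 2: 69.0}
```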

**12. Handling missing values:**

```python
# Check for missing values
missing_values = df.isnull().sum()

# Print columns with missing values
columns_with_missing_values = missing_values[missing_values > 0].index
if not columns_with_missing_values.empty:
    print("Columns with missing values:")
    for column in columns_with_missing_values:
        print(column)
else:
    print("No missing values in the dataset.")

missing_values
```

The provided code checks for missing values in a DataFrame (df). It calculates the sum of missing values for each column and then prints out the columns with missing values. If there are no missing values, it prints "No missing values in the dataset."

· AGE: There is one missing value in the AGE column. This missing value should be addressed, since age is likely an important factor in health-related analyses.

· WHEEZING: There are 9 missing values in the WHEEZING column. The presence or absence of wheezing can be important in assessing respiratory health, and the missing values should be handled appropriately.

· For the AGE column, imputing the missing value using a reasonable method (such as mean or median imputation) would be appropriate.

**Considerations before dropping missing values:**

· Sample size: With only about 300 rows, dropping any observations might significantly reduce the sample size, potentially affecting the statistical power of the analysis.

· Bias: If the missing values are not completely random and are related to specific characteristics of the individuals, dropping them might introduce bias into the analysis.

**% of missing values**

```python
# % of missing values
mis_value_percent = 100 * df.isnull().sum() / len(df)
print(mis_value_percent)
```

**13. Visualizing missing values**

This shows a visual representation of the missing values as a bar graph; the 9 missing values in WHEEZING and 1 in AGE noted above are visible here.

```python
# Create a bar chart to visualize the missing values
msno.bar(df)

# Where are the missing values?
msno.matrix(df)
```

The **msno.matrix(df)** function, provided by the **missingno** library in Python, creates a matrix visualization of missing values in a DataFrame (**df**). This visualization is a grid where each row corresponds to a data point and each column corresponds to a feature (variable) in the dataset. The cells are colored to indicate the presence or absence of data.

```python
# Nullity correlation ranges from
# -1 (if one variable appears, the other definitely does not) to
#  0 (variables appearing or not appearing have no effect on one another) to
#  1 (if one variable appears, the other definitely also does)
msno.heatmap(df)
```

Here the heatmap shows white, the middle of the heatmap scale, representing a 0.0 correlation: there is no linear correlation between the missing values of AGE and WHEEZING. In other words, the presence or absence of missing values in one variable does not predict or influence the presence or absence of missing values in the other. A 0.0 correlation suggests that the occurrence of missing values in one variable is independent of the occurrence of missing values in the other.
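The nullity correlation that msno.heatmap visualizes can also be computed directly from the boolean missingness matrix. A small sketch with made-up columns whose gaps never coincide, so their nullity correlation is -1:

```python
import numpy as np
import pandas as pd

# Two columns whose missing entries never coincide
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [np.nan, 5.0, np.nan, 7.0],
})
# Pearson correlation of the True/False missingness indicators
nullity_corr = toy.isnull().corr()
print(nullity_corr.loc["A", "B"])   # -1.0
```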

```python
# A missingno dendrogram is a tree diagram of missingness;
# it groups highly correlated variables together.
msno.dendrogram(df)
```


**14. Filling missing values in the AGE column:**

```python
# Filling missing values in the AGE column
median_age = df['AGE'].median()
df['AGE'].fillna(median_age, inplace=True)
df
```

1. **median_age = df['AGE'].median()**: This line calculates the median of the 'AGE' column in the DataFrame **df**. The **median()** function computes the median of a series or list of numbers.

2. **df['AGE'].fillna(median_age, inplace=True)**: This line fills missing (NaN) values in the 'AGE' column with the calculated median. The **fillna()** method is used for replacing missing values. The **inplace=True** parameter modifies the DataFrame in place, meaning it changes the original DataFrame without the need for reassignment.

The reason for using the median to fill missing values in this case is generally related to handling outliers. The median is less sensitive to extreme values than the mean. If the 'AGE' column contains outliers (unusually high or low values), the median is a more robust choice for imputing missing values, as it is not influenced by extreme values.
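A tiny illustration of that robustness on toy numbers (not the dataset): one extreme value drags the mean far upward but barely moves the median.

```python
import pandas as pd

# One extreme outlier among otherwise similar ages
ages = pd.Series([30, 35, 40, 45, 300])
print(ages.mean())     # 90.0 -- pulled up by the outlier
print(ages.median())   # 40.0 -- unaffected
```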

**15. Filling missing values in the WHEEZING column:**

```python
# Filling missing values in the categorical column WHEEZING
mode_category = df['WHEEZING'].mode().iloc[0]

# Fill missing values in the categorical column with the mode
df['WHEEZING'].fillna(mode_category, inplace=True)
df
```

The reason for using the mode to fill missing values in a categorical column like 'WHEEZING' is that it is common practice for categorical data. The mode represents the most frequent category in the column and is often used to impute missing values for categorical data. It provides a simple and effective way to replace missing values with a value that is likely to be representative of the existing data in that column.

**mode_category = df['WHEEZING'].mode().iloc[0]**: This line calculates the mode of the 'WHEEZING' column using the **mode()** function. The result is a Pandas Series containing the mode(s). Since there can be multiple modes, **.iloc[0]** is used to select the first one.

**df['WHEEZING'].fillna(mode_category, inplace=True)**: This line fills the missing values in the 'WHEEZING' column with the calculated mode. The **fillna()** function is used for this purpose. The **inplace=True** argument ensures that the changes are applied directly to the original DataFrame **df**, rather than creating a new DataFrame.

**df**: This line displays the updated DataFrame with the missing values in the 'WHEEZING' column filled using the mode.
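A minimal sketch of the same mode-fill pattern on a toy series, with np.nan standing in for the missing entries:

```python
import numpy as np
import pandas as pd

# mode() returns a Series (there can be ties); .iloc[0] takes the first mode
wheeze = pd.Series([2.0, 2.0, 1.0, np.nan, 2.0])
mode_value = wheeze.mode().iloc[0]
print(mode_value)              # 2.0

filled = wheeze.fillna(mode_value)
print(filled.isnull().sum())   # 0 -- no missing values remain
```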

**16. Converting floats to integers:**

Since 'WHEEZING' is a categorical column, having fractional parts in the values may not make sense, and it can be desirable to have all values as integers. This is particularly true if the original values of the 'WHEEZING' column were integers and the column became floating point once it contained missing values. Converting back to integers is a step taken for consistency and to ensure that the 'WHEEZING' column maintains its categorical nature.

```python
# Converting floats to integers
float_columns = df.select_dtypes(include=['float']).columns

# Convert float columns to integers
df[float_columns] = df[float_columns].astype(int)
df
```

**17. Label Encode:**

· Label encoding is a technique where each category is assigned a unique integer. For binary categories like 'yes' and 'no', 'male' and 'female', or 1 and 2, this typically results in 1 and 0.

· Label encoding is a straightforward way to convert categorical values to numerical representations. It is easy to implement and understand.

· Label encoding can be more memory-efficient than storing string or text values, especially when dealing with large datasets. Integer values generally require less memory than their string counterparts.

Using label encoding is a reasonable choice here because 'yes' and 'no' often imply a binary, ordered relationship (0 for 'no' and 1 for 'yes'). This can be advantageous if the model can potentially benefit from understanding this order.

```python
# Simply label encode :) - easy stuff
df['SMOKING'] = le.fit_transform(df['SMOKING'])
df
```

```python
from sklearn.preprocessing import LabelEncoder

columns_to_encode = ['SMOKING', 'YELLOW_FINGERS', 'ANXIETY', 'PEER_PRESSURE', 'CHRONIC DISEASE',
                     'FATIGUE', 'ALLERGY', 'WHEEZING', 'ALCOHOL CONSUMING', 'COUGHING',
                     'SHORTNESS OF BREATH', 'SWALLOWING DIFFICULTY', 'CHEST PAIN', 'LUNG_CANCER']

label_encoder = LabelEncoder()
for column in columns_to_encode:
    df[column] = label_encoder.fit_transform(df[column])

df
```

· This code uses label encoding to transform the categorical columns in the specified list (**columns_to_encode**) into numerical values.

· The LabelEncoder from scikit-learn converts categorical labels into numerical values.

· The columns_to_encode list specifies which columns in the DataFrame should undergo label encoding.

· The code then iterates through each specified column, applies the **fit_transform** method of **LabelEncoder**, and replaces the original categorical values with their corresponding numerical representations.

· This process is done in place, meaning it modifies the DataFrame directly.

**Advantages:**

- Numeric representation: Many machine learning algorithms require numerical input. Label encoding provides a way to convert categorical data into a format that these algorithms can use.
- Simplicity: Label encoding is a simple and quick way to transform categorical data when there is an ordinal relationship among the categories. It assigns integers based on the sorted order of the category values.
- Reduced memory usage: Numerical representation typically requires less memory than storing string labels.
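The memory point can be checked directly with pandas. A small sketch comparing string labels against int8 codes (toy data and a hypothetical mapping, not the dataset's encoder):

```python
import pandas as pd

labels = pd.Series(["YES", "NO"] * 1000)                # object dtype, stores Python strings
codes = labels.map({"NO": 0, "YES": 1}).astype("int8")  # 1 byte per value

# deep=True counts the actual string payloads, not just the pointers
print(labels.memory_usage(deep=True) > codes.memory_usage(deep=True))  # True
```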

**18. Outliers:**

· Outlier handling is often specific to each numerical column. In this case, the code is tailored to address outliers in the 'AGE' column specifically.

· Identifying outliers in individual columns allows for a more targeted and nuanced approach to data cleaning, considering the nature of the data in each column.

Calculate the IQR (interquartile range):

The IQR is a measure of statistical dispersion, or in simple terms, the range where the middle 50% of the data lies. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1).

· In this case, age_iqr = iqr(df['AGE']) calculates the IQR for the 'AGE' column.

· Set lower and upper bounds for outliers:

· The lower bound is calculated as Q1 - 1.5 * IQR.

· The upper bound is calculated as Q3 + 1.5 * IQR.

· These bounds define a range beyond which data points are considered outliers.

· Identify outliers based on the bounds:

· The line outliers = df[(df['AGE'] < lower_bound) | (df['AGE'] > upper_bound)]['AGE'] selects rows where the 'AGE' values are less than the lower bound or greater than the upper bound.

```python
# Finding outliers for numerical column 'AGE'
from scipy.stats import iqr

sns.set(style="whitegrid")
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['AGE'])
plt.xlabel('Age')
plt.title('Box Plot of Age with Outliers')
plt.show()

age_iqr = iqr(df['AGE'])

# Set lower and upper bounds for outliers
lower_bound = df['AGE'].quantile(0.25) - 1.5 * age_iqr
upper_bound = df['AGE'].quantile(0.75) + 1.5 * age_iqr

outliers = df[(df['AGE'] < lower_bound) | (df['AGE'] > upper_bound)]['AGE']
print("Outlier values in 'AGE' column:")
print(outliers)
```

1. **39 years (Index 0):** This individual is relatively young compared to the typical age range for lung cancer. It is uncommon for someone as young as 39 to develop lung cancer. Further investigation is needed to ensure the accuracy of the data and to explore potential causes of early-onset lung cancer.

2. **21 years (Index 22):** A person with lung cancer at the age of 21 is extremely uncommon. This could be an error or a special case. It is essential to scrutinize this data point to confirm its accuracy and to explore whether any special circumstances contributed to lung cancer at such a young age.

3. **38 years (Index 238):** Similar to the first case, this individual is relatively young. While not as extreme as the previous outliers, it is still worth investigating further to understand the context and potential causes of lung cancer at this age.

4. **87 years (Index 277):** This is an outlier at the higher end of the age spectrum. Advanced age is a risk factor for lung cancer, but it is crucial to consider the overall health of the individual and whether they are a suitable candidate for surgical intervention given their age.

The outliers at 21 and 39 years old are particularly noteworthy due to their deviation from the typical age range associated with lung cancer. Lung cancer is usually diagnosed in older individuals, with the majority of cases occurring in people over the age of 45. The presence of individuals as young as 21 and 39 in the dataset is therefore of interest, as it is relatively uncommon.

**Replace outliers using the IQR method mentioned above:**

```python
# Replace outliers with values within the bounds
df.loc[df['AGE'] < lower_bound, 'AGE'] = lower_bound
df.loc[df['AGE'] > upper_bound, 'AGE'] = upper_bound
df
```

· **df['AGE'] < lower_bound**: This condition identifies rows in the 'AGE' column where the age is less than the calculated lower bound.

· **df.loc[df['AGE'] < lower_bound, 'AGE'] = lower_bound**: For the rows identified by the condition, it replaces the 'AGE' values with the lower bound.

· **df['AGE'] > upper_bound**: This condition identifies rows in the 'AGE' column where the age is greater than the calculated upper bound.

· **df.loc[df['AGE'] > upper_bound, 'AGE'] = upper_bound**: For the rows identified by the condition, it replaces the 'AGE' values with the upper bound.
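This pair of .loc assignments amounts to capping (winsorizing), and pandas offers the same operation as a one-liner via clip(). A sketch with hypothetical bounds chosen for illustration:

```python
import pandas as pd

ages = pd.Series([21, 39, 60, 62, 87])
lower_bound, upper_bound = 39, 85   # hypothetical bounds, not the computed IQR bounds

# Equivalent to the two .loc assignments: values outside the bounds are capped
capped = ages.clip(lower=lower_bound, upper=upper_bound)
print(capped.tolist())   # [39, 39, 60, 62, 85]
```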

**19. Skewness:**

```python
skewness_age = skew(df['AGE'])
print("Skewness of AGE column:", skewness_age)
```

**Why check skewness for the AGE column?**

In the context of the AGE column, checking skewness is important because it provides insight into whether the age values are symmetrically distributed or whether the data tends to be concentrated more on one side.

A skewness value of -0.0617 indicates a very slight left (negative) skewness. In simpler terms, the distribution of ages is slightly skewed towards the younger side but is generally close to symmetric. The negative skewness implies that the left tail is longer or fatter than the right one. However, the magnitude of the skewness is relatively small, suggesting that the distribution is not heavily skewed.

Given the skewness value of approximately -0.0617, we can conclude that the age distribution in the dataset is roughly symmetric, with a slight tendency towards younger ages. This information is valuable for understanding the central tendency of the age variable.

With a relatively small skewness like this one, there may be no immediate need for a transformation. The skewness value is close to zero, indicating only a very slight skewness, and in many cases a small degree of skewness does not significantly impact the results of statistical analyses.
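A quick sanity check of how skew() behaves, on toy numbers: a perfectly symmetric sample scores exactly zero, while a long right tail pushes the value positive.

```python
from scipy.stats import skew

print(skew([1, 2, 3, 4, 5]))        # 0.0 -- perfectly symmetric sample
print(skew([1, 2, 3, 4, 20]) > 0)   # True -- long right tail, positive skew
```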

**Q-Q plot:**

```python
# Create a Q-Q plot
probplot(df['AGE'], dist="norm", plot=plt)
plt.title("Q-Q Plot for AGE column")
plt.show()
```

Specifically, the Q-Q plot shows that the AGE data is skewed to the left. This means there are more people with lung cancer who are younger than expected, and fewer who are older than expected.

· Age-related changes in the risk of lung cancer: The risk of lung cancer increases with age, but the rate of increase is not linear; the risk increases more rapidly at younger ages than at older ages. This could explain the left skew in the AGE data.

· In the case of skewness, a value of -0.0617 (as mentioned above) is relatively small. When the skewness is close to zero, the distribution is approximately symmetric, and applying a transformation might not be necessary, as the data is already close to symmetric.

**20. Comparing the dependent variable LUNG_CANCER with its independent variables:**

```python
independent_vars = ['GENDER', 'AGE', 'SMOKING', 'YELLOW_FINGERS', 'ANXIETY', 'PEER_PRESSURE',
                    'CHRONIC DISEASE', 'FATIGUE', 'ALLERGY', 'WHEEZING', 'ALCOHOL CONSUMING',
                    'COUGHING', 'SHORTNESS OF BREATH', 'SWALLOWING DIFFICULTY', 'CHEST PAIN']

fig, axes = plt.subplots(nrows=len(independent_vars), ncols=1,
                         figsize=(8, 5 * len(independent_vars) + 2))
fig.tight_layout(h_pad=1.8)  # Adjust the vertical spacing between subplots

# Add space at the top for the overall figure title
plt.subplots_adjust(top=0.95)

# Loop through each independent variable and plot it against the dependent variable
for i, var in enumerate(independent_vars):
    ax = sns.countplot(x=var, hue='LUNG_CANCER', data=df, ax=axes[i])
    ax.set_title(f'{var} vs LUNG_CANCER', pad=10)  # Padding prevents title overlap
    ax.set_xlabel(var)
    ax.set_ylabel('Count')
    ax.tick_params(axis='x', rotation=45)  # Rotate x-axis labels for readability

# Adjust layout to prevent x-axis labels overlapping the next subplot's title
plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.suptitle('Comparison of Independent Variables with LUNG_CANCER', size=16)
plt.show()
```

The code above generates subplots comparing the distribution of each independent variable with respect to the 'LUNG_CANCER' variable. It uses count plots for the categorical variables and compares the counts of the categories within each variable for both classes of 'LUNG_CANCER'.

· The top graph compares the incidence of lung cancer between males and females. It clearly shows that males have a higher incidence of lung cancer than females.

· The bottom graph illustrates the relationship between age and lung cancer. It shows that the incidence of lung cancer increases with age, which is consistent with our understanding of cancer as a disease that often develops over a long period, with risk increasing as we age.

· The top section of the next graph shows the relationship between smoking and lung cancer. The x-axis represents smoking status and the y-axis the count of lung cancer cases; orange bars mark the presence of lung cancer and blue bars its absence. This section shows a higher count of lung cancer cases among smokers.

· The bottom section shows the relationship between yellow fingers and lung cancer. As above, the x-axis represents the presence of yellow fingers and the y-axis the count of cases. There is a higher count of lung cancer cases among individuals with yellow fingers.

· Together, these graphs suggest a positive association between smoking, yellow fingers, and the incidence of lung cancer: individuals who smoke or have yellow fingers are more likely to have lung cancer.

· The data shows that individuals with anxiety or peer pressure are more likely to have lung cancer than those without, suggesting a possible correlation between anxiety, peer pressure, and lung cancer.

· The first graph, titled "CHRONIC DISEASE vs LUNG_CANCER", compares the count of individuals with and without lung cancer who also have a chronic disease. The orange bars represent individuals with lung cancer, the blue bars those without. The graph shows a higher count of individuals with lung cancer who also have a chronic disease.

· The second graph, titled "FATIGUE vs LUNG_CANCER", compares the count of individuals with and without lung cancer who also experience fatigue. It shows a higher count of individuals with lung cancer who also experience fatigue.

· These graphs suggest a correlation between lung cancer and the presence of chronic disease and fatigue.

· The next pair of bar charts shows the relationship between allergy, wheezing, and lung cancer.

· The first graph, titled "ALLERGY vs LUNG_CANCER", compares the count of individuals with and without lung cancer who also have an allergy. It shows a higher count of individuals with lung cancer who also have an allergy.

· The second graph, titled "WHEEZING vs LUNG_CANCER", compares the count of individuals with and without lung cancer who also experience wheezing.

· From a data-analysis perspective, these graphs suggest a positive association between allergy, wheezing, and the incidence of lung cancer. However, correlation does not imply causation, and these findings would need further investigation through controlled studies.

· The top part of the following graph shows the relationship between alcohol consumption and lung cancer. The x-axis represents alcohol consumption and the y-axis the count of lung cancer cases; orange bars mark cases where lung cancer is present, blue bars where it is not. There is a higher count of lung cancer cases among individuals who consume alcohol.

· The bottom part shows the relationship between alcohol consumption and coughing: there is a higher count of coughing cases among individuals who consume alcohol.

· These graphs suggest a positive association between alcohol consumption and the incidence of lung cancer and coughing.


· The x-axis represents chest pain and the y-axis the count of individuals; orange bars represent individuals with lung cancer, blue bars those without. The graph shows a higher count of individuals with lung cancer who also experience chest pain.

· This suggests a positive association between chest pain and the incidence of lung cancer.
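The count plots suggest associations but do not test them; a chi-square test of independence on a contingency table is one standard way to do that. A minimal sketch with synthetic counts (illustrative numbers, not this dataset's):

```python
# Chi-square test of independence on a 2x2 table of smoking vs lung cancer.
import pandas as pd
from scipy.stats import chi2_contingency

# Rows: smoking status; columns: lung cancer status (synthetic counts)
table = pd.DataFrame([[40, 60], [20, 180]],
                     index=['no_smoking', 'smoking'],
                     columns=['no_cancer', 'cancer'])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4g}")  # a small p-value suggests a real association
```

With the real data, `pd.crosstab(df['SMOKING'], df['LUNG_CANCER'])` would supply the table.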

**21. Correlation:**

· Correlation values indicate the strength and direction of a linear relationship between two variables. Here is a brief interpretation of the values in the correlation matrix:

`corr=df.corr()`

corr

**GENDER and Other Variables:**

GENDER has a moderate positive correlation with ALCOHOL CONSUMING (0.434), indicating a tendency for one gender to consume more alcohol.

corr_matrix = df.corr()

# Extract the correlation value for ALCOHOL CONSUMING and GENDER
alcohol_gender_corr = corr_matrix.loc['ALCOHOL CONSUMING', 'GENDER']

# Determine which gender the correlation points to
if alcohol_gender_corr > 0:
    highest_corr_gender = 1
elif alcohol_gender_corr < 0:
    highest_corr_gender = 0
else:
    highest_corr_gender = "No correlation"

# Print the result
print(f"The GENDER with the highest correlation with ALCOHOL CONSUMING is: {highest_corr_gender}")

This code checks the sign of the correlation between GENDER and ALCOHOL CONSUMING. The output was 1, which means males consume more alcohol than females with respect to this data.
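The sign test above can be cross-checked more directly with a group mean; a minimal sketch on a tiny synthetic frame (hypothetical values, with GENDER coded 1 = Male, 0 = Female and ALCOHOL CONSUMING coded 2 = Yes, 1 = No, as in this dataset):

```python
# Compare mean alcohol-consumption code per gender group.
import pandas as pd

df_demo = pd.DataFrame({
    'GENDER':            [1, 1, 1, 0, 0, 0],
    'ALCOHOL CONSUMING': [2, 2, 1, 1, 1, 2],
})
means = df_demo.groupby('GENDER')['ALCOHOL CONSUMING'].mean()
print(means)  # a higher mean for GENDER=1 is consistent with a positive correlation
```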

· GENDER has a modest correlation with PEER_PRESSURE (0.261), suggesting a relationship between gender and susceptibility to peer pressure.

import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df is your DataFrame
corr_matrix = df[['ALCOHOL CONSUMING', 'GENDER']].corr()

# Create a heatmap for the correlation between ALCOHOL CONSUMING and GENDER
plt.figure(figsize=(5, 4))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap: ALCOHOL CONSUMING vs. GENDER')
plt.show()

The analogous sign check for PEER_PRESSURE returned 0, which means females experience more peer pressure than males with respect to this data.

**Correlation Heatmap**

corr = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='YlGnBu', fmt=".2f", linewidths=.5)
plt.title("Correlation Heatmap")
plt.show()

**AGE and Other Variables:**

· AGE has a weak negative correlation with GENDER (-0.023), i.e., essentially no relationship between age and gender.

· AGE has a weak negative correlation with SMOKING (-0.069), suggesting a slight tendency for older individuals to smoke less.

**SMOKING and Other Variables:**

· SMOKING has weak positive correlations with ANXIETY (0.153) and WHEEZING (0.134), indicating a slight tendency for smokers to experience more anxiety and wheezing.

· SMOKING has a weak negative correlation with ALCOHOL CONSUMING (-0.053), suggesting a slight tendency for smokers to consume less alcohol.

**LUNG_CANCER and Other Variables:**

· LUNG_CANCER has weak positive correlations with several variables, such as ALCOHOL CONSUMING (0.294), CHEST PAIN (0.195), and PEER_PRESSURE (0.195). These correlations indicate slight tendencies for individuals with lung cancer to exhibit these traits.

**YELLOW_FINGERS and Other Variables:**

· YELLOW_FINGERS has moderate positive correlations with ANXIETY (0.558) and SWALLOWING DIFFICULTY (0.478), suggesting a stronger association between yellow fingers and anxiety or swallowing difficulties.

**ANXIETY and Other Variables:**

· ANXIETY has a weak positive correlation with PEER_PRESSURE (0.210), indicating that individuals with anxiety may be somewhat more susceptible to peer pressure.

· ANXIETY has a moderate positive correlation with SWALLOWING DIFFICULTY (0.478), suggesting a connection between anxiety and difficulty swallowing.

**PEER_PRESSURE and Other Variables:**

· PEER_PRESSURE has a very weak positive correlation with CHRONIC DISEASE (0.043) and a very weak negative correlation with SMOKING (-0.030), indicating essentially no association with chronic disease or smoking.

**CHRONIC DISEASE and Other Variables:**

· CHRONIC DISEASE has a very weak positive correlation with WHEEZING (0.010) and a weak negative correlation with SMOKING (-0.149). Both associations are limited.

**FATIGUE and Other Variables:**

· FATIGUE has a weak positive correlation with COUGHING (0.148) and a moderate positive correlation with SHORTNESS OF BREATH (0.407), suggesting a stronger association between fatigue and shortness of breath than between fatigue and coughing.

**ALLERGY and Other Variables:**

· ALLERGY has a weak positive correlation with WHEEZING (0.165) and a weak-to-moderate positive correlation with SWALLOWING DIFFICULTY (0.245), indicating associations between allergy and wheezing or swallowing difficulties.

**WHEEZING and Other Variables:**

· WHEEZING has a weak-to-moderate positive correlation with ALCOHOL CONSUMING (0.260) and a moderate positive correlation with COUGHING (0.352), suggesting associations between wheezing and alcohol consumption or coughing.

**ALCOHOL CONSUMING and Other Variables:**

· ALCOHOL CONSUMING has a moderate positive correlation with GENDER (0.434) and a weak-to-moderate negative correlation with YELLOW_FINGERS (-0.274), suggesting that alcohol consumption varies with gender and is inversely related to yellow fingers.

**COUGHING and Other Variables:**

· COUGHING has a weak-to-moderate positive correlation with SHORTNESS OF BREATH (0.285) and a weak positive correlation with CHEST PAIN (0.078), suggesting a modest association between coughing and shortness of breath, and a weaker one with chest pain.

**SHORTNESS OF BREATH and Other Variables:**

· SHORTNESS OF BREATH has weak positive correlations with SWALLOWING DIFFICULTY (0.102) and CHEST PAIN (0.044). Both associations are relatively weak.

**SWALLOWING DIFFICULTY and Other Variables:**

· SWALLOWING DIFFICULTY has a weak positive correlation with CHEST PAIN (0.103), a relatively weak association.

**CHEST PAIN and LUNG_CANCER:**

· CHEST PAIN has a weak positive correlation with LUNG_CANCER (0.195), indicating a slight tendency for individuals with lung cancer to report chest pain.

**22.** **Dimensionality Reduction Techniques (DRT)**

Dimensionality reduction techniques are helpful in healthcare datasets, especially those with numerous categorical variables, for several reasons:

1. Curse of Dimensionality: high-dimensional datasets, especially those with many categorical features, can suffer from the "curse of dimensionality." This makes algorithms less efficient and may lead to overfitting, as the model can become too complex for the available data.

2. Computational Efficiency: large, high-dimensional datasets require more computational resources. Dimensionality reduction simplifies the dataset, making it computationally cheaper to process and analyze.

3. Improved Model Performance: reducing the number of features can improve model performance. High-dimensional datasets may contain redundant or irrelevant features that hurt machine learning algorithms; dimensionality reduction keeps the focus on the most relevant information.

4. Visualization: it is challenging to visualize data in high-dimensional spaces. Dimensionality reduction transforms the data into a lower-dimensional space, making it easier to visualize and interpret, which is particularly useful for understanding patterns and relationships in healthcare data.

5. Handling Categorical Variables: many machine learning algorithms, especially those designed for regression or classification, work best with numerical data. Dimensionality reduction can transform categorical variables into a format better suited to these algorithms.

6. Addressing Multicollinearity: in healthcare datasets, some features may be highly correlated. Dimensionality reduction can address multicollinearity, where one variable can be predicted from others, by capturing the shared information in a reduced set of dimensions.
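Point 6 can be made concrete: variance inflation factors (VIFs) quantify multicollinearity, and for standardized predictors they equal the diagonal of the inverse correlation matrix. A sketch on synthetic columns (names and values are illustrative, not from this dataset):

```python
# Compute VIFs as the diagonal of the inverse correlation matrix.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
X_demo = pd.DataFrame({
    'x1': x1,
    'x2': x1 * 0.9 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    'x3': rng.normal(size=200),                        # independent
})
vif = pd.Series(np.diag(np.linalg.inv(X_demo.corr().values)), index=X_demo.columns)
print(vif.round(1))  # x1 and x2 show inflated values; x3 stays near 1
```

A VIF above roughly 5 to 10 is a common flag for problematic collinearity.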

input_cols = df.columns[:-1]
target_col = df.columns[-1]

inputs_df = df[list(input_cols)].copy()
inputs_df.sample(5)

**Feature and Target Separation:** the code separates the features (input variables) and the target variable from the original DataFrame. This is a common step in preparing data for machine learning, where you need to distinguish between what the model learns from (features) and what it predicts (target).

**Input DataFrame Copy:** creating a copy of the input DataFrame (**inputs_df**) ensures that any modifications made to this subset of the data won't affect the original dataset.

Displaying a random sample of 5 rows from the input DataFrame allows a quick visual check that the data has been processed as expected.

This code is part of the standard data-preprocessing steps often carried out before applying machine learning algorithms.

scaler = MinMaxScaler()
scaler.fit(inputs_df[input_cols])
inputs_df[input_cols] = scaler.transform(inputs_df[input_cols])
inputs_df[input_cols].head()

The purpose of this code is to normalize, or scale, the input features. Normalization matters when the features in the dataset have different scales, as it helps algorithms converge faster and perform better. Here, **MinMaxScaler** is used to scale the feature values to a specific range (commonly between 0 and 1).

Scaling is especially important when working with machine learning algorithms that are sensitive to the scale of the input features.

**MinMaxScaler** in particular scales the data to a specified range, usually 0 to 1. This is useful when you want to preserve the shape of the original distribution while ensuring all values fall within a consistent, bounded range.
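As a sanity check, MinMaxScaler applies x' = (x - min) / (max - min) per column; a minimal sketch comparing it with the manual formula on a synthetic column:

```python
# Verify MinMaxScaler against the explicit min-max formula.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages_demo = np.array([[21.0], [40.0], [62.0], [87.0]])  # synthetic column
scaled = MinMaxScaler().fit_transform(ages_demo)
manual = (ages_demo - ages_demo.min()) / (ages_demo.max() - ages_demo.min())
print(scaled.ravel())  # values lie in [0, 1]: 0 at the column min, 1 at the max
```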

**Principal Component Analysis (PCA):**

Principal Component Analysis (PCA) is itself a dimensionality reduction technique, commonly used to reduce the dimensionality of a dataset by transforming the original features into a new set of uncorrelated features called principal components. PCA is usually applied directly to the original dataset, not necessarily after other dimensionality reduction techniques.

pca = PCA(n_components=13).fit(inputs_df[input_cols])
explained_variance = pca.explained_variance_ratio_
print(np.cumsum(explained_variance)[-1])

· **pca = PCA(n_components=13).fit(inputs_df[input_cols])**: this line initializes a Principal Component Analysis (PCA) model with the specified number of components (**n_components=13**) and fits it to the input data (**inputs_df[input_cols]**). The model identifies the principal components that capture the maximum variance in the input features.

· **explained_variance = pca.explained_variance_ratio_**: after fitting, this retrieves the explained variance ratio of each principal component, i.e., the proportion of the dataset's variance explained by that component.

· **print(np.cumsum(explained_variance)[-1])**: the cumulative sum of the explained variance ratios is computed with **np.cumsum()**; the **[-1]** index prints the cumulative explained variance across all the selected principal components.

· **Dimensionality Reduction:** PCA reduces the dimensionality of the dataset while retaining as much of the original information as possible. By selecting a specific number of principal components (here, 13), the goal is to represent the dataset in a lower-dimensional space.

· **Explained Variance:** the cumulative sum of explained variance ratios indicates how much of the total variance in the original data is retained by the selected components, helping assess the trade-off between dimensionality reduction and information loss. A higher cumulative explained variance means a larger portion of the original variability is captured by the reduced feature set.

pca = PCA().fit(inputs_df[input_cols])
explained_variance = np.cumsum(pca.explained_variance_ratio_)
components_range = range(1, len(explained_variance) + 1)

sns.lineplot(x=list(components_range), y=explained_variance, marker='o')
plt.ylabel("Cumulative explained variance")
plt.xlabel("Number of PCs")
plt.show()

· This code performs a Principal Component Analysis (PCA) on the input data and creates a scree plot, a commonly used visualization for choosing the number of principal components to retain.

· The graph shows that the explained variance grows as the number of PCs increases, meaning that adding more PCs captures more of the variance in the data. However, the curve levels off after around 10 PCs, suggesting that adding more than 10 PCs does not significantly increase the explained variance. This is a common observation in PCA, where the first few components capture most of the variance. The scree plot is a useful visualization for understanding how much of the data's variance each PC captures, and it helps decide how many PCs to retain for further analysis.

import numpy as np

# Note: cumulate the per-component ratios here; the explained_variance
# array from the scree-plot step is already cumulative
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = np.where(cumulative_variance >= 0.95)[0][0] + 1

pca_95 = PCA(n_components=n_components_95)
inputs_pca_95 = pca_95.fit_transform(inputs_df[input_cols])
pca_df_95 = pd.DataFrame(inputs_pca_95, columns=[f'PC{i+1}' for i in range(n_components_95)])
pca_df_95['target'] = df[target_col]

n_components_95

The output of this code is the variable n_components_95, which holds the number of principal components needed to explain 95% of the variance in the dataset. This is a common technique in Principal Component Analysis (PCA) for reducing dimensionality, as it keeps the components that carry the most information (as measured by explained variance).
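As an aside, scikit-learn can select the 95%-variance component count directly: passing a float in (0, 1) as n_components keeps just enough components to reach that variance share. A sketch on synthetic data (the real inputs_df is assumed earlier in the post):

```python
# PCA with a fractional n_components: keep components until 95% variance is explained.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 15))
X[:, 5:] *= 0.1  # later columns carry little variance

pca_auto = PCA(n_components=0.95).fit(X)
print(pca_auto.n_components_)  # number of components retained to reach >= 95%
```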

# Specify the relevant categories
categories = ['SMOKING', 'YELLOW_FINGERS', 'ANXIETY', 'PEER_PRESSURE', 'CHRONIC DISEASE',
              'FATIGUE', 'ALLERGY', 'WHEEZING', 'ALCOHOL CONSUMING', 'COUGHING',
              'SHORTNESS OF BREATH', 'SWALLOWING DIFFICULTY', 'CHEST PAIN']

# Create a new DataFrame with only the required columns
lung_cancer_yes = df[df['LUNG_CANCER'] == 1][categories + ['LUNG_CANCER']]

# Filter individuals with a maximum of three 'YES' in the specified categories
max_yes_threshold = 3
filtered_data = lung_cancer_yes[(lung_cancer_yes == 1).sum(axis=1) <= max_yes_threshold]

# Display the result
print("Columns with a maximum of", max_yes_threshold, "'YES' for individuals with 'YES' in Lung Cancer:")
print(filtered_data)

# Print the 'LUNG_CANCER' column with respect to the filtered columns
lung_cancer_column = filtered_data['LUNG_CANCER']
print("\nLUNG_CANCER column with respect to those columns:")
print(lung_cancer_column)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=1)
inputs_lda = lda.fit_transform(inputs_df[input_cols], df[target_col])
lda_df = pd.DataFrame(inputs_lda, columns=['LDA1'])
lda_df['target'] = df[target_col]

plt.figure(figsize=(10, 6))
sns.scatterplot(x='LDA1', y=[0] * len(lda_df), hue='target', data=lda_df, palette='viridis')
plt.show()
lda_df.head()

import plotly.express as px

px.scatter(lda_df, x='LDA1', y=[0] * len(lda_df), color='target')

`df.to_csv('LungCancerpreprocessed.csv',index=False)`

We now have a preprocessed dataset with reduced dimensionality, ready for further analysis and modeling.