How to Get an Overview of a Dataset with Python

Once the dataset is loaded, the next step is to get a grasp of its basic characteristics: the dataset summary, descriptive statistics, column details, data size, and more. The pandas library offers several easy-to-use DataFrame methods for exactly this purpose.

1. Display basic information about the DataFrame

If you want a concise overview of a DataFrame, the tool to use is the info() method. It prints the index type and range, each column with its non-null count and dtype, and the memory usage.

Let’s use the ‘tips’ dataset as an example. You can download the dataset to your device. The snippets below assume it has been loaded into a pandas DataFrame named df.
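As a minimal sketch of getting the data into a DataFrame named df: the download location is not given here, so the file name tips.csv mentioned in the comment is a hypothetical placeholder. To stay self-contained, the sketch reads only the first five rows of the dataset (the same rows shown in section 3) from an in-memory string.

```python
import io

import pandas as pd

# First five rows of the tips dataset, standing in for the downloaded file.
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4
"""

# With a real download you would pass the file path instead, for example
# pd.read_csv("tips.csv")  (hypothetical file name).
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names
```

The shape attribute is a quick way to check the data size before moving on to the fuller reports below.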

# Display basic information about the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB
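The "+" in the memory usage line means the figure is a lower bound: by default pandas does not measure the strings stored in object columns. The sketch below shows two optional parameters of info(), memory_usage="deep" for an exact measurement and buf for capturing the report as text. It assumes a small five-row sample built by hand rather than the full 244-row file.

```python
import io

import pandas as pd

# Five-row sample of the tips dataset (assumption: stands in for the full file).
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})

# Measure the real memory footprint, including the string values.
df.info(memory_usage="deep")

# info() prints and returns None; pass a buffer to keep the report as a string.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
print(report.splitlines()[1])  # the line describing the index
```

Deep introspection costs more time on large datasets, so it is usually worth enabling only when memory is the actual concern.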

2. Display summary statistics

With a preliminary understanding of the dataset in hand, what if we want more detail about its statistical characteristics? This is where the describe() method proves valuable: it summarizes the central tendency, dispersion, and shape of each column's distribution. By default it covers only the numeric columns, which is why just total_bill, tip, and size appear below.

# Display summary statistics
df.describe()
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000
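The text columns (sex, smoker, day, time) are left out of the default summary. As a hedged sketch, passing include="all" to describe() adds count, unique, top, and freq rows for them. The example again assumes a hand-built five-row sample, so its numbers differ from the full-dataset table above.

```python
import pandas as pd

# Five-row sample of the tips dataset (assumption: stands in for the full file).
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})

# Default behaviour: numeric columns only.
numeric_summary = df.describe()
print(list(numeric_summary.columns))

# include="all" also summarizes the text columns:
# unique = number of distinct values, top = most frequent value, freq = its count.
full_summary = df.describe(include="all")
print(full_summary.loc[["count", "unique", "top", "freq"], ["sex", "day"]])
```

In the combined table, statistics that do not apply to a column (for example mean for sex) are shown as NaN.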

3. Display first few rows

Sometimes the most intuitive way to get a feel for a dataset, especially for newcomers to data analysis, is simply to look at a few rows. This is where the head() method comes in handy. By default, it returns the first 5 rows of the dataset, providing a quick overview.

The number of rows is adjustable: pass an integer to head() to display exactly as many rows as you need.

df.head()  # show the first 5 rows of the dataset
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

df.head(2)  # show the first 2 rows of the dataset
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
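head() has a counterpart, tail(), which returns rows from the end of the DataFrame and takes the same optional row count. A minimal sketch, assuming a hand-built five-row sample in place of the full dataset:

```python
import pandas as pd

# Five-row sample of the tips dataset (assumption: stands in for the full file).
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})

first_three = df.head(3)  # first 3 rows
last_two = df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```

Both methods return a new DataFrame and leave df unchanged, so the result can be stored or passed on to other methods.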