Once the dataset is loaded, the next step is to understand its basic characteristics: its overall structure, descriptive statistics, column details, size, and more. pandas provides several easy-to-use methods for exactly this purpose.
1. Display basic information about the DataFrame
If you’re seeking a succinct overview of any DataFrame, the tool to use is the info() method. It prints a snapshot covering the index dtype, the columns with their dtypes and non-null value counts, and the memory usage.
Let’s use the ‘tips’ dataset as an example; you can download it to your device. The examples below assume it has been loaded into a DataFrame named df.
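As a minimal sketch of the loading step, the snippet below reads CSV text with pd.read_csv(). The article does not say where the file is saved, so the in-memory text here is only a stand-in built from the first five rows of 'tips'; the file name "tips.csv" mentioned in the comment is a hypothetical placeholder.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: the first five rows of 'tips' as CSV text.
# With a real download, pass the file path instead, for example
# pd.read_csv("tips.csv") ("tips.csv" is a hypothetical file name).
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4
"""

df = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple, a quick check of the data size
print(df.shape)
```

Checking df.shape right after loading is a cheap way to confirm that the file was parsed with the expected number of rows and columns.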
# Display basic information about the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB
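Note that info() prints its report and returns None, so the text cannot be assigned directly. The sketch below, using a small stand-in DataFrame rather than the full 'tips' data, shows one way to capture the report through the buf parameter, and how to read the same facts from attributes when you need them as values.

```python
import io

import pandas as pd

# Small stand-in DataFrame (three rows borrowed from 'tips')
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "sex": ["Female", "Male", "Male"],
})

# Capture the printed report in a string instead of sending it to stdout.
# memory_usage="deep" asks pandas to measure string contents as well.
buffer = io.StringIO()
df.info(buf=buffer, memory_usage="deep")
report = buffer.getvalue()

# The same facts as values rather than printed text
n_rows, n_cols = df.shape   # data size
column_dtypes = df.dtypes   # dtype of each column
non_null = df.count()       # non-null count of each column
```

The "+" in a line such as "memory usage: 13.4+ KB" signals a lower bound, because object columns are not measured in full by default; passing memory_usage="deep" requests the exact figure at the cost of a slower call.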
2. Display summary statistics
With a preliminary picture of the dataset in hand, we can look more closely at its statistical characteristics. This is where the describe() method proves valuable: for each numeric column it reports the count, mean, standard deviation, minimum, quartiles, and maximum, which together summarize the central tendency, dispersion, and shape of the distribution.
# Display summary statistics
df.describe()
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000
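The output above covers only the numeric columns, which is the default for describe(). The following sketch, on a small stand-in DataFrame, illustrates how text columns and custom percentiles could be summarized as well; the variable names are illustrative only.

```python
import pandas as pd

# Small stand-in DataFrame (five rows borrowed from 'tips')
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "size": [2, 3, 3, 2, 4],
})

# Default: numeric columns only
numeric_summary = df.describe()

# A text column yields count, unique, top (most frequent value) and freq
sex_summary = df["sex"].describe()

# include="all" summarizes every column; cells that do not apply are NaN
full_summary = df.describe(include="all")

# Request specific percentiles instead of the default quartiles
tail_summary = df.describe(percentiles=[0.1, 0.9])

print(numeric_summary)
print(sex_summary)
```

For categorical data such as sex or day, the unique and top entries are usually more informative than a mean would be.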
3. Display first few rows
Sometimes we want a more intuitive feel for the data itself, especially as newcomers to data analysis. This is where the head() method comes in handy. By default, it shows the first 5 rows of the dataset, providing a quick overview.
You can also pass a number to head() to display exactly that many rows, tailoring the output to your needs.
df.head() # show the first 5 rows of the dataset
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
df.head(2) # show the first 2 rows of the dataset
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
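Beyond head(n), a few closely related calls are worth knowing. The sketch below uses a small stand-in DataFrame to show a negative argument to head(), plus tail() and sample(), which are not covered above and are included here only as a possible complement.

```python
import pandas as pd

# Small stand-in DataFrame (five rows borrowed from 'tips')
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})

first_two = df.head(2)          # rows 0 and 1
all_but_last_two = df.head(-2)  # negative n: everything except the last 2 rows
last_two = df.tail(2)           # rows 3 and 4

# A random peek; random_state makes the selection repeatable
random_two = df.sample(2, random_state=0)

print(all_but_last_two)
```

Looking at the end of the data or at a random sample can reveal issues, such as trailing blank rows or unsorted records, that the first five rows alone would hide.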