The Million Song Dataset (MSD) provides metadata and precomputed audio features for one million songs; it does not include the audio itself. The full dataset is stored as one HDF5 file per track, alongside aggregate summary files. Here’s a breakdown of the most important fields (the example values are illustrative):


1. Metadata

| Column Name | Description | Example |
|---|---|---|
| song_id | Echo Nest identifier for the song | SOQHXMF12A6D4FD69A |
| title | Song title | Bohemian Rhapsody |
| artist_id | Echo Nest identifier for the artist | ARJ7KF01187FB3F5F1 |
| artist_name | Name of the artist | Queen |
| release | Album or single name | A Night at the Opera |
| year | Release year (0 when unknown) | 1975 |
| duration | Length of the song in seconds | 355.47 |
| track_id | Echo Nest identifier for the track (not a MusicBrainz ID) | TRMMMYQ128F932D901 |
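
As a small illustration of working with these fields, the sketch below formats `duration` as minutes and seconds and treats a `year` of 0 as missing. The helper names `format_duration` and `clean_year` are hypothetical and not part of any MSD tooling.

```python
from typing import Optional


def format_duration(seconds: float) -> str:
    """Format a duration in seconds as M:SS, truncating fractional seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def clean_year(year: int) -> Optional[int]:
    """Return None when the year is 0, which the MSD uses for 'unknown'."""
    return year if year > 0 else None


print(format_duration(355.47))
print(clean_year(1975), clean_year(0))
```

Mapping the 0 placeholder to `None` keeps unknown years from skewing statistics such as a mean release year.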

2. Audio Features

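| Column Name | Description | Example |
|---|---|---|
| tempo | Estimated tempo in BPM (beats per minute) | 120.5 |
| key | Estimated musical key as a pitch class (0 = C, 1 = C#, …, 11 = B) | 5 (F) |
| mode | 1 = Major, 0 = Minor | 1 |
| time_signature | Estimated number of beats per bar (e.g. 3, 4, or 5) | 4 (4/4 time) |
| loudness | Overall loudness in dB | -5.5 |
| danceability | Estimate of how danceable a song is (0-1); often 0 in the released data, meaning not analyzed | 0.8 |
| energy | Estimated energy level of the track (0-1); often 0 in the released data, meaning not analyzed | 0.9 |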
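
The `key` and `mode` fields are easier to read when combined into a key name. A minimal sketch of that decoding, assuming the 0 = C … 11 = B pitch-class encoding listed above; the function name `key_name` is hypothetical:

```python
# Pitch classes in the order used by the key field (0 = C, ..., 11 = B).
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def key_name(key: int, mode: int) -> str:
    """Combine a pitch-class key (0-11) and a mode (1 = major, 0 = minor)."""
    if not 0 <= key <= 11:
        raise ValueError(f"key must be in 0..11, got {key}")
    if mode not in (0, 1):
        raise ValueError(f"mode must be 0 or 1, got {mode}")
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"


print(key_name(5, 1))  # F major
```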

3. Genre & Similarity

| Column Name | Description | Example |
|---|---|---|
| artist_terms | List of genre-like tags for the artist | ['rock', 'classic rock'] |
| artist_familiarity | Familiarity (popularity) score (0-1) | 0.85 |
| artist_hotttnesss | Hotness score (0-1) | 0.9 |
| similar_artists | List of similar artists’ IDs | ['AR1', 'AR2', 'AR3'] |

4. Additional Features (Segment-Level Analysis)

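These are detailed audio features computed per segment (short, variable-length slices of the track, not fixed one-second windows) for deeper audio processing:

  • Timbre features (`segments_timbre`): 12 MFCC-like coefficients per segment, describing sound quality
  • Pitch features (`segments_pitches`): 12 chroma values per segment, one per pitch class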
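
For statistical work, these features are often reduced to one fixed-length summary per track. The sketch below computes the per-dimension mean and population standard deviation over a list of feature vectors. The function name `summarize_segments` and the toy input are made up for illustration; real MSD timbre and pitch vectors have 12 dimensions.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def summarize_segments(
    segments: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return per-dimension (means, population std devs) across all segments."""
    if not segments:
        raise ValueError("need at least one segment")
    columns = list(zip(*segments))  # transpose: one tuple per feature dimension
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds


# Made-up toy input: 3 segments with 2 dimensions each.
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = summarize_segments(toy)
print(means)
print(stds)
```

Mean and standard deviation are only one possible summary; medians or covariance matrices follow the same pattern.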

Would you like help fetching, analyzing, or visualizing this data for probability and statistics? 🚀