Documentation ¶
Overview ¶
Package list provides functionality for downloading PDFs from URL lists, CSV files, or TSV files.
The package supports three input formats:
1. Plain text files (.txt or no extension):
- One URL per line
- Lines starting with '#' are treated as comments
- Empty lines are ignored
2. CSV files (.csv):
- Comma-separated values with automatic column detection
- Intelligent parsing of paper metadata
- Generates meaningful filenames from metadata
3. TSV files (.tsv):
- Tab-separated values with automatic column detection
- Same parsing and naming features as CSV files
Column Detection:
The package automatically detects and uses the following columns if present:
- URL/Link columns: BestLink, BestURL, URL, Link, href (prioritizes "best" variants)
- DOI column: Automatically converts DOIs to resolvable URLs if no direct URL is found
- Title column: ArticleTitle, Article_Title, Paper_Title, Title
- Authors column: Authors, Creator, Contributor
- Year column: PublicationYear, Publication_Year, Year
- Journal column: SourceTitle, Source_Title, Journal, Source, Publication
- Abstract column: Abstract (for future use)
File Naming:
For CSV/TSV inputs, the package generates intelligent filenames using available metadata:
- Format: [Year]_[FirstAuthorLastName]_[TruncatedTitle].pdf
- Example: 2023_Smith_Climate_change_impacts.pdf
- Falls back to row ID if metadata is insufficient
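The naming scheme above can be sketched as a small helper. This is an illustrative reconstruction, not the package's actual code; `buildFilename` and its truncation length are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// buildFilename sketches the documented naming scheme
// [Year]_[FirstAuthorLastName]_[TruncatedTitle].pdf, falling back to the
// row ID when metadata is insufficient. Hypothetical helper, not the API.
func buildFilename(id, year, authors, title string) string {
	if year == "" || authors == "" || title == "" {
		return id + ".pdf" // fall back to row ID
	}
	// First author's last name: text before the first comma or semicolon.
	parts := strings.FieldsFunc(authors, func(r rune) bool { return r == ',' || r == ';' })
	last := id
	if len(parts) > 0 {
		last = strings.TrimSpace(parts[0])
	}
	// Replace non-alphanumeric runs in the title with underscores, then truncate.
	clean := regexp.MustCompile(`[^A-Za-z0-9]+`).ReplaceAllString(title, "_")
	clean = strings.Trim(clean, "_")
	if len(clean) > 40 {
		clean = clean[:40]
	}
	return fmt.Sprintf("%s_%s_%s.pdf", year, last, clean)
}

func main() {
	fmt.Println(buildFilename("row7", "2023", "Smith, J.; Jones, M.", "Climate change impacts"))
	// 2023_Smith_Climate_change_impacts.pdf
	fmt.Println(buildFilename("row8", "", "", "")) // row8.pdf
}
```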
Output:
For CSV/TSV inputs, the package generates a download report with:
- Original metadata
- Download success status
- Generated filename
- Error messages for failed downloads
The report is saved as [input_filename]_report.csv in the same directory.
Usage:
// Download from plain text URL list
err := list.DownloadURLList("urls.txt")
// Download from CSV with metadata
err := list.DownloadURLList("papers.csv")
// Download from TSV with metadata
err := list.DownloadURLList("papers.tsv")
Example CSV Input:
ArticleTitle,Authors,PublicationYear,BestLink,DOI
"Climate Change Impacts","Smith, J.; Jones, M.",2023,https://example.com/paper1.pdf,10.1234/abc
"Machine Learning Review","Brown, A.",2024,,10.5678/def
The package will:
1. Parse the CSV/TSV structure
2. Detect column mappings automatically
3. Use BestLink when available, or resolve DOI to URL
4. Generate filename like "2023_Smith_Climate_Change_Impacts.pdf"
5. Download PDFs to the same directory as the input file
6. Create a report showing success/failure for each paper
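The DOI fallback mentioned above can be sketched with the standard doi.org resolver. The function name is illustrative, not part of the package's API.

```go
package main

import (
	"fmt"
	"strings"
)

// doiToURL turns a bare DOI such as "10.5678/def" into a resolvable link
// via the doi.org resolver. Hypothetical helper for illustration only.
func doiToURL(doi string) string {
	doi = strings.TrimSpace(doi)
	// Leave DOIs that already carry a resolver prefix untouched.
	if strings.HasPrefix(doi, "http://") || strings.HasPrefix(doi, "https://") {
		return doi
	}
	return "https://doi.org/" + doi
}

func main() {
	fmt.Println(doiToURL("10.5678/def")) // https://doi.org/10.5678/def
}
```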
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DownloadURLList ¶
DownloadURLList processes a text, CSV, or TSV file containing URLs and attempts to download PDFs from each entry.
This function supports three input formats:
 1. Plain text file: One URL per line
 2. CSV file: Comma-separated values with intelligent column detection
 3. TSV file: Tab-separated values with intelligent column detection
For CSV/TSV files, the function automatically detects columns for:
 - URLs (BestLink, BestURL, URL, Link, etc.)
 - DOIs (converted to URLs if no direct URL is available)
 - Title, Authors, Year, Journal (used for intelligent file naming)
The function will:
 - Parse the input file based on its extension (.csv, .tsv, .txt)
 - Generate meaningful filenames using paper metadata when available
 - Create a download report (_report.csv) for CSV/TSV inputs
 - Save all files to the same directory as the input file
Parameters:
- path: The path to a file containing URLs or paper metadata
Returns an error if the function fails to open or read the input file, but continues processing even if individual URLs fail to download.
Types ¶
type ColumnMapping ¶ added in v0.9.2
type ColumnMapping struct {
URL int
DOI int
Title int
Authors int
Year int
Journal int
Abstract int
}
ColumnMapping holds the detected column indices for relevant fields
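The header scan behind ColumnMapping can be sketched as follows. This is an illustrative reconstruction of the documented behavior (case-insensitive matching, "Best" variants prioritized, -1 for a missing column); `detectColumns` is not the package's actual function.

```go
package main

import (
	"fmt"
	"strings"
)

// detectColumns maps a header row onto column indices for a few of the
// documented fields. Candidate names are ordered so "Best" URL variants
// win; -1 marks a column that was not found. Hypothetical sketch.
func detectColumns(header []string) map[string]int {
	candidates := map[string][]string{
		"URL":   {"bestlink", "besturl", "url", "link", "href"},
		"DOI":   {"doi"},
		"Title": {"articletitle", "article_title", "paper_title", "title"},
		"Year":  {"publicationyear", "publication_year", "year"},
	}
	out := map[string]int{}
	for field, names := range candidates {
		out[field] = -1
		for _, name := range names {
			for i, h := range header {
				if strings.ToLower(strings.TrimSpace(h)) == name {
					out[field] = i
					break
				}
			}
			if out[field] != -1 {
				break // keep the highest-priority match
			}
		}
	}
	return out
}

func main() {
	m := detectColumns([]string{"ArticleTitle", "Authors", "PublicationYear", "BestLink", "DOI"})
	fmt.Println(m["URL"], m["DOI"], m["Title"], m["Year"]) // 3 4 0 2
}
```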
type ConcurrentDownloader ¶ added in v0.9.3
type ConcurrentDownloader struct {
// contains filtered or unexported fields
}
ConcurrentDownloader manages concurrent downloads with per-host and global limits
func NewConcurrentDownloader ¶ added in v0.9.3
func NewConcurrentDownloader(maxGlobal, maxPerHost int64) *ConcurrentDownloader
NewConcurrentDownloader creates a new concurrent downloader with specified limits
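ConcurrentDownloader's fields are unexported, so the mechanism below is a guess at what "per-host and global limits" could look like, using buffered channels as counting semaphores. It is a sketch under those assumptions, not the package's implementation.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// limiter sketches combined global and per-host concurrency limits.
// Each buffered channel acts as a counting semaphore: sending blocks
// once the buffer (the limit) is full.
type limiter struct {
	global  chan struct{}
	mu      sync.Mutex
	perHost map[string]chan struct{}
	maxHost int
}

func newLimiter(maxGlobal, maxPerHost int) *limiter {
	return &limiter{
		global:  make(chan struct{}, maxGlobal),
		perHost: make(map[string]chan struct{}),
		maxHost: maxPerHost,
	}
}

// acquire blocks until both a global and a per-host slot are free, then
// returns a release function that frees both slots.
func (l *limiter) acquire(rawURL string) (release func(), err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	l.mu.Lock()
	host, ok := l.perHost[u.Host]
	if !ok {
		host = make(chan struct{}, l.maxHost)
		l.perHost[u.Host] = host
	}
	l.mu.Unlock()

	l.global <- struct{}{} // blocks when maxGlobal downloads are in flight
	host <- struct{}{}     // blocks when maxPerHost downloads hit this host
	return func() { <-host; <-l.global }, nil
}

func main() {
	l := newLimiter(4, 2)
	release, err := l.acquire("https://example.com/paper1.pdf")
	if err != nil {
		panic(err)
	}
	fmt.Println("acquired slot for example.com")
	release()
}
```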
type DownloadResult ¶ added in v0.9.3
type DownloadResult struct {
Task *DownloadTask
Success bool
Error error
}
DownloadResult represents the result of a download attempt
type DownloadTask ¶ added in v0.9.3
type DownloadTask struct {
URL string
PDFUrl string
Filename string
FullPath string
Paper *PaperMetadata
OriginalURL string
}
DownloadTask represents a download job
type PaperMetadata ¶ added in v0.9.2
type PaperMetadata struct {
ID string // Row index or identifier
URL string // Best URL or resolved DOI
DOI string // Digital Object Identifier
Title string // Article title
Authors string // Authors list
Year string // Publication year
Journal string // Source/Journal title
Abstract string // Abstract text (for future use)
Downloaded bool // Whether the file was successfully downloaded
Filename string // Generated filename for the download
ErrorMsg string // Error message if download failed
OriginalRecord []string // Original CSV/TSV row data for preserving all columns
}
PaperMetadata holds information about a paper to be downloaded
type RetryConfig ¶ added in v0.9.3
type RetryConfig struct {
MaxRetries int
BaseDelay time.Duration
MaxDelay time.Duration
Jitter bool
}
RetryConfig holds retry policy configuration
func DefaultRetryConfig ¶ added in v0.9.3
func DefaultRetryConfig() RetryConfig
DefaultRetryConfig returns a sensible default retry configuration
type UnpaywallResponse ¶ added in v0.9.3
type UnpaywallResponse struct {
DOI string `json:"doi"`
IsOA bool `json:"is_oa"`
BestOALocation struct {
HostType string `json:"host_type"`
URLForPDF string `json:"url_for_pdf"`
} `json:"best_oa_location"`
OALocations []struct {
HostType string `json:"host_type"`
URLForPDF string `json:"url_for_pdf"`
} `json:"oa_locations"`
}
UnpaywallResponse represents the response from the Unpaywall API