mcfile

package module
v0.0.0-...-715d5d7
Published: Apr 29, 2025 License: MIT Imports: 25 Imported by: 3

Documentation

Overview

Package mcfile defines a per-file structure [MCFile] that holds all relevant per-file information. This includes:

  • file path info
  • file content (UTF-8, of course)
  • file type information (MIME and more)
  • the results of markup-specific file analysis (in the most analysable case, i.e. XML, this comprises tokens, gtokens, gelms, gtree)

For a discussion of tree walk functions, see `doc_wfn.go`

Note that if we do not get an explicit XML DOCTYPE declaration, there is some educated guesswork required.

The first workflow was based on XML, and comprises: `text => XML tokens => GTokens => GTags => GTree`

First, package `gparse` gets as far as the `GToken`s, which can only be in a list: they have no tree structure. Then package `gtree` handles the rest.
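The first step of that workflow, stdlib tokenization, can be sketched as follows (a self-contained sketch; the conversion to `GToken`s is not shown, and `tokenNames` is a hypothetical helper, not part of this package):

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// tokenNames runs the stdlib XML tokenizer over a snippet and
// returns one short descriptor per token, i.e. the flat token
// list (no tree structure) that would then become GTokens.
func tokenNames(src string) []string {
	var out []string
	d := xml.NewDecoder(strings.NewReader(src))
	for {
		tok, err := d.Token()
		if err != nil {
			break // io.EOF ends the token stream
		}
		switch t := tok.(type) {
		case xml.StartElement:
			out = append(out, "start:"+t.Name.Local)
		case xml.EndElement:
			out = append(out, "end:"+t.Name.Local)
		case xml.CharData:
			out = append(out, "text")
		}
	}
	return out
}

func main() {
	fmt.Println(tokenNames("<p>hi</p>"))
}
```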

XML analysis starts off with tokenization (by the stdlib), so it makes sense to then have separate steps for making `GToken`s, `GTag`s, and the `GTree`.

MKDN and HTML analyses use higher-level libraries that deliver CSTs (Concrete Syntax Trees, i.e. parse trees). We choose to do this processing in `package gparse` rather than in `package gtree`.

MKDN gets a tree of `yuin/goldmark/ast/Node`, and HTML gets a tree of stdlib `golang.org/x/net/html/Node`. Since a CST is delivered fully-formed, it makes sense to have Step 1 that attaches to each node its `GToken` and `GTag`, and then Step 2 that builds a `GTree`.

There are three major types of `MCFile`, corresponding to how we process the file content:

  • "XML"
    - (§1) Use stdlib `encoding/xml` to get `[]XU.XToken`
    - (§1) Convert `[]XU.XToken` to `[]gparse.GToken`
    - (§2) Build `GTree`
  • "MKDN"
    - (§1) Use `yuin/goldmark` to get a tree of `yuin/goldmark/ast/Node`
    - (§1) From each Node make a `MkdnToken` (in a list?) incl. `GToken` and `GTag`
    - (§2) Build `GTree`
  • "HTML"
    - (§1) Use `golang.org/x/net/html` to get a tree of `html.Node`
    - (§1) From each Node make a `HtmlToken` (in a list?) incl. `GToken` and `GTag`
    - (§2) Build `GTree`

In general, all go files in this protocol stack should be organised as:

  • struct definition(s)
  • constructors (named `New*`)
  • printf stuff (Raw(), Echo(), String())

Some characteristic methods:

  • Raw() returns the original string passed from the golang XML parser (with whitespace trimmed)
  • Echo() returns a string of the item in normalised form, altho be aware that the presence of terminating newlines is not treated uniformly
  • String() returns a string suitable for runtime monitoring and debugging

NOTE The use of shorthand in variable names: Doc, Elm, Att.

NOTE We use `godoc2md`, so we can use Markdown in these code comments.

Index

Constants

This section is empty.

Variables

var GlobalAttCount int
var GlobalTagCount int
var LwDitaAttsForGLinks = []string{
	"name",
	"href",
	"id",
	"idref",
	"idrefs",
	"conref",
	"data-conref",
	"keys",
	"data-keys",
	"keyref",
	"data-keyref",
}
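A typical use of this list is a membership test while scanning a tag's attributes. A minimal sketch; `isGLinkAtt` is a hypothetical helper, not part of the package API:

```go
package main

import "fmt"

var LwDitaAttsForGLinks = []string{
	"name", "href", "id", "idref", "idrefs", "conref",
	"data-conref", "keys", "data-keys", "keyref", "data-keyref",
}

// isGLinkAtt reports whether an attribute name is link-related
// in LwDITA, per the list above. Hypothetical helper.
func isGLinkAtt(att string) bool {
	for _, a := range LwDitaAttsForGLinks {
		if a == att {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isGLinkAtt("href"), isGLinkAtt("class"))
}
```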

Functions

func AddInXName

func AddInXName(ElmT StringTally, AttT StringTally, gT *gtoken.GToken)

func DumpGElm

func DumpGElm(p AST.Node) string

func KidsAsSlice

func KidsAsSlice(p AST.Node) []AST.Node

func ListKids

func ListKids(p AST.Node) string

func NormalizeTextLeaves

func NormalizeTextLeaves(rootNode AST.Node)

Types

type Contentity

type Contentity struct {

	// Nord provides hierarchical structure, only.
	ON.Nord
	// ContentityRow includes all fields that get persisted
	// to the DB. It contains the field Raw (deeply embedded),
	// and also an FSItem that contains an Errer.
	m5db.ContentityRow

	// LogInfo is (the index of the Contentity in
	// the larger slice) + (the processing stage ID)
	LogInfo

	// ParserResults is parseutils.ParserResults_ffs
	// (ffs = file format -specific = "html" or "mkdn" but not
	// "xml" cos Go's XML parser does not produce a tree structure)
	ParserResults interface{}

	GTokens      []*gtoken.GToken
	GTags        []*gtree.GTag
	*gtree.GTree // maybe not need GRootTag or RootOfASTptr
	GTknsWriter, GTreeWriter,
	GEchoWriter io.Writer
	GLinks

	// GEnts is "ENTITY" directives (both with "%" and without).
	GEnts map[string]*gparse.GEnt
	// DElms is "ELEMENT" directives.
	DElms map[string]*gtree.GTag

	TagTally StringTally
	AttTally StringTally
}

Contentity is awesome. It includes a ContentityRow, which includes an FSItem, which includes an Errer. .

func NewContentity

func NewContentity(aPath string) *Contentity

NewContentity returns a Contentity Nord (i.e. a node with content and ordered children) that can NOT be the root of a Contentity tree. If there is an error, it is returned in the embedded Errer.

It should accept either an absolute or a relative filepath, altho relative is preferred, for various reasons, mainly because of the preferences of the path and filepath stdlibs.

TODO: Maybe it needs two boolean arguments:

  • One to say whether to be strict about security (using os.Root and ValidPath/IsLocal), and
  • One to say whether to follow symlinks.

These two flags might have some interesting interactions. Since this func could (but does not) use os.Root, these can be left as calling options, rather than implementing higher security using funcs io/fs.ValidPath and path/filepath.IsLocal.
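The higher-security checks mentioned here can be sketched with the two stdlib funcs (an illustration of the option, not what NewContentity currently does):

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

// isSafeRelPath applies the two stdlib checks named above:
// io/fs.ValidPath (slash-separated, no "." or ".." elements,
// not rooted) and path/filepath.IsLocal (stays within the
// directory it is relative to).
func isSafeRelPath(p string) bool {
	return fs.ValidPath(p) && filepath.IsLocal(p)
}

func main() {
	fmt.Println(isSafeRelPath("docs/a.dita"))   // true
	fmt.Println(isSafeRelPath("../escape.xml")) // false
}
```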

We want everything to be in a nice tree of Nords, and it means that we have to create Contentities for directories too (where `Raw_type == SU.Raw_type_DIRLIKE`), so we have to handle that case too. .

func (*Contentity) DoBlockList

func (p *Contentity) DoBlockList() *Contentity

DoBlockList makes a list of all the nodes that are blocks, so that they can be traversed for rendering, and targeted for references. .

func (*Contentity) DoEntitiesList

func (p *Contentity) DoEntitiesList() error

DoEntitiesList collects all entity definitions. Note that each Token has been normalized:

  • rtType:ENTITY string1:foo string2:"FOO" entityIsParameter:false
  • rtType:ENTITY string1:bar string2:"BAR" entityIsParameter:true

func (p *Contentity) DoGLinks() *Contentity

DoGLinks gathers links. .

func (*Contentity) DoTableOfContents

func (p *Contentity) DoTableOfContents() *Contentity

DoTableOfContents makes a ToC. .

func (*Contentity) DoValidation

func (p *Contentity) DoValidation(pXCF *XU.XmlCatalogFile) (dtdS string, docS string, errS string)

DoValidation TODO: if there is no DOCTYPE, make a guess based on Filext, but it can't be fatal.

func (*Contentity) ExecuteStages

func (p *Contentity) ExecuteStages() *Contentity

ExecuteStages processes a Contentity to completion in an isolated thread, and can easily be converted to run as a goroutine. Summary:

  • st0_Init()
  • st1_Read()
  • st2_Tree()
  • st3_Refs()
  • st4_Done() (not currently called, but will work on all input files at once!)

An interesting question is: how can we indicate an error and terminate a thread prematurely? The method currently chosen is to use the interface github.com/fbaube/miscutils/Errer. It has to be checked for at the start of a func, but then we can chain functions by writing them left-to-right. Winning!

(If functions accept and return a ptr+error pair then they chain right-to-left, which is a big fail for readability.)

We could also pass in a `Context` and use its cancellation capability. Yet another way might be simply to `panic`, and so this function already has code to catch panics. .
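The left-to-right chaining described above can be sketched as methods that check a stored error and return the receiver (a sketch of the Errer pattern with hypothetical types, not the package's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// entity sketches the pattern: each stage checks the stored
// error first, so stages chain left-to-right and a failure
// short-circuits every later stage.
type entity struct {
	err  error
	done []string
}

func (e *entity) stage(name string, fail bool) *entity {
	if e.err != nil {
		return e // a prior stage failed: skip this one
	}
	if fail {
		e.err = errors.New(name + " failed")
		return e
	}
	e.done = append(e.done, name)
	return e
}

func main() {
	e := new(entity).
		stage("st0_Init", false).
		stage("st1_Read", true). // simulate a failure here
		stage("st2_Tree", false)
	fmt.Println(e.done, e.err)
}
```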

func (p *Contentity) GatherLinks() error

GatherLinks is: @conref to reuse block-level content, @keyref to reuse phrase-level content. TODO Each type of link (i.e. elm/att where it occurs) has to be categorised. TODO Each format of link target has to be categorised.

  • Cross ref : <xref> : <a href> : [link](/URI "title")
  • Key def : <keydef> : <div data-class="keydef"> : <div data-class="keydef"> in HDITA syntax
  • Map : <map> : <nav> : See Example of an MDITA map (20)
  • Topic ref : <topicref> : <a href> inside a <li> : [link](/URI "title") inside a list item

TODO Stuff to get:

  • XDITA map - topicref @href (w @format) - task @id
  • HDITA - article @id - span @data-keyref - p @data-conref
  • MDITA - has YAML "id" - uses <p @data-conref> - uses <span @data-keyref> - uses MD [link_text](link_target.dita) - uses ![The remote](../images/remote-control-callouts.png "The remote")
  • XDITA - topic @id - ph @keyref - image @href - p @id - video/source @value - section @id - p @conref

func (p *Contentity) GatherXmlGLinks() *Contentity

GatherXmlGLinks gathers XmlItems: (in DOCS) IDs & IDREFs, and (in DTDs) Elm defs (incl. Att defs) & Ent defs. See *xmlfile.XmlItems // *IDinfo

func (*Contentity) IsDir

func (p *Contentity) IsDir() bool

func (*Contentity) IsDirlike

func (p *Contentity) IsDirlike() bool

func (*Contentity) L

func (p *Contentity) L(level LL, format string, a ...interface{})

func (*Contentity) LogPrefix

func (p *Contentity) LogPrefix(mid string) string

func (*Contentity) NewEntitiesList

func (p *Contentity) NewEntitiesList() (gEnts map[string]*gparse.GEnt, err error)

NewEntitiesList collects all entity definitions. Note that each Token is normalized:

  • rtType:ENTITY string1:foo string2:"FOO" entityIsParameter:false
  • rtType:ENTITY string1:bar string2:"BAR" entityIsParameter:true

CALLED BY ProcessEntities only.

func (*Contentity) ProcessEntities_

func (p *Contentity) ProcessEntities_() error

func (*Contentity) RefineDirectives

func (p *Contentity) RefineDirectives() error

RefineDirectives scans to patch Directives with correct keyword.

func (Contentity) String

func (p Contentity) String() string

String is developer output. It has to dump: FU.InputFile, FU.OutputFiles, GTree, GRefs, *XmlFileMeta, *XmlItems, *DitaInfo

func (*Contentity) SubstituteEntities

func (p *Contentity) SubstituteEntities() error

SubstituteEntities does replacement in Entities for simple (single-token) entity references, i.e. that begin with "%" or "&".
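Single-token entity substitution of the kind described can be sketched on raw strings (a hypothetical helper; the real method operates on parsed directives, not raw text):

```go
package main

import (
	"fmt"
	"strings"
)

// substituteEntities replaces simple &name; and %name;
// references using the supplied definitions. A sketch only:
// it does no recursion and no error checking.
func substituteEntities(s string, defs map[string]string) string {
	for name, val := range defs {
		s = strings.ReplaceAll(s, "&"+name+";", val)
		s = strings.ReplaceAll(s, "%"+name+";", val)
	}
	return s
}

func main() {
	defs := map[string]string{"foo": "FOO", "bar": "BAR"}
	fmt.Println(substituteEntities("x &foo; y %bar; z", defs))
}
```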

func (*Contentity) TallyTags

func (p *Contentity) TallyTags()

func (*Contentity) WrapError

func (p *Contentity) WrapError(s string, e error)

type ContentityEngine

type ContentityEngine struct {
	// contains filtered or unexported fields
}

ContentityEngine tracks the (oops, global) state of a ContentityFS tree being assembled, for example when a directory is specified for recursive analysis.

FIXME: ID assignment should be offloaded to the DB? .

CntyEng is a package global, which is dodgy and not re-entrant. The solution probably involves currying.

NOTE: Is the call to new(..) unnecessary? This variable should NOT be reinitialized for every new ContentityFS.

type ContentityError

type ContentityError struct {
	PE fs.PathError
	*Contentity
}

ContentityError is Contentity + SrcLoc (in source code) + PathError struct { Op, Path string; Err error }

Maybe use the format pkg.filename.methodname.Lnn

In code where package `mcfile` is not available, try a fileutils.PathPropsError

func NewContentityError

func NewContentityError(ermsg string, op string, cty *Contentity) ContentityError

func WrapAsContentityError

func WrapAsContentityError(e error, op string, cty *Contentity) ContentityError

func (ContentityError) Error

func (ce ContentityError) Error() string

func (*ContentityError) String

func (ce *ContentityError) String() string

type ContentityFS

type ContentityFS struct {
	// FS will be set from func [os.DirFS]
	fs.FS
	// contains filtered or unexported fields
}

ContentityFS is an instance of an fs.FS where every node is an mcfile.Contentity.

Note that directories ARE included in the tree, because the instances of [orderednodes.Nord] in each Contentity must properly interconnect in forming a complete tree.

Note that the file system is stored as a tree AND as a slice AND as a map. If any of these is modified without also modifying the others to match, there WILL be problems. For that reason, [asSlice] and [asMapOfAbsFP] are unexported instance variables that are accessible only via getters.

It ain't bulletproof tho. In any case, users of a ContentityFS should feel free to use the functions on the embedded [Nord] ordered nodes. .

func NewContentityFS

func NewContentityFS(aPath string, okayFilexts []string) (*ContentityFS, error)

NewContentityFS proceeds as follows:

  • initialize
  • create an os.DirFS
  • FIXME: an os.Root
  • walk the DirFS, creating Contentities and appending them to a slice
  • process the list to identify and make parent/child links

The path argument should probably be an absolute filepath, because a relative filepath might cause problems. Altho this is the opposite of the advice for lower-level items.

It uses the global [CntyFS], which precludes re-entrancy and concurrency.

Note that when we use os.DirFS, it appears to make no difference whether path

  • is relative or absolute
  • ends with a trailing slash or not
  • is a directory or a symlink to a directory

The only error returns for this func are:

  • a bad path, rejected by func FU.NewFilepaths
  • the path is not a directory (altho it can be a symlink to a directory?)
  • TBD: What happens if os.Root barfs on something?

ContentityFS does not embed Errer and cannot itself return an error. FIXME: change this?

TODO: Maybe it needs two boolean arguments:

  • One to say whether to be strict about security (using os.Root and ValidPath/IsLocal), and
  • One to say whether to follow symlinks.

These two flags might have some interesting interactions. OTOH since this func can (and does?) use os.Root, it can easily (and should probably) also default to higher security using funcs io/fs.ValidPath and path/filepath.IsLocal.

Accumulated NewContentity errors are counted in the field ContentityFS.nErrors .

func (*ContentityFS) AsSlice

func (p *ContentityFS) AsSlice() []*Contentity

func (*ContentityFS) DirCount

func (p *ContentityFS) DirCount() int

func (*ContentityFS) DoForEvery

func (p *ContentityFS) DoForEvery(stgprocsr ContentityStage)

func (*ContentityFS) FileCount

func (p *ContentityFS) FileCount() int

func (*ContentityFS) ItemCount

func (p *ContentityFS) ItemCount() int

func (*ContentityFS) RootAbsPath

func (p *ContentityFS) RootAbsPath() string

func (*ContentityFS) RootContentity

func (p *ContentityFS) RootContentity() *RootContentity

func (*ContentityFS) Size

func (p *ContentityFS) Size() int

type ContentityStage

type ContentityStage func(*Contentity) *Contentity

type Flags

type Flags int
const (
	IsRef      Flags = 1 << iota // 1 << 0 i.e. 0000 0001
	IsExtl                       // 1 << 1 i.e. 0000 0010
	IsURI                        // 1 << 2 i.e. 0000 0100
	IsKey                        // 1 << 3 i.e. 0000 1000
	IsResolved                   // 1 << 4 i.e. 0001 0000
)
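Minimal method bodies consistent with the Flags API above (the actual implementations are not shown on this page, but a standard bitmask sketch looks like this):

```go
package main

import "fmt"

type Flags int

const (
	IsRef      Flags = 1 << iota // 1 << 0 i.e. 0000 0001
	IsExtl                       // 1 << 1 i.e. 0000 0010
	IsURI                        // 1 << 2 i.e. 0000 0100
	IsKey                        // 1 << 3 i.e. 0000 1000
	IsResolved                   // 1 << 4 i.e. 0001 0000
)

// IsSet reports whether all bits of flag are set in b.
func (b Flags) IsSet(flag Flags) bool { return b&flag == flag }

// Set returns b with the bits of flag turned on.
func (b Flags) Set(flag Flags) Flags { return b | flag }

// Reset returns b with the bits of flag turned off
// (&^ is Go's AND NOT, i.e. bit clear).
func (b Flags) Reset(flag Flags) Flags { return b &^ flag }

func main() {
	f := Flags(0).Set(IsRef | IsKey)
	fmt.Println(f.IsSet(IsRef), f.IsSet(IsURI))
	fmt.Println(f.Reset(IsKey).IsSet(IsKey))
}
```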

func (Flags) IsSet

func (b Flags) IsSet(flag Flags) bool

func (Flags) Reset

func (b Flags) Reset(flag Flags) Flags

func (Flags) Set

func (b Flags) Set(flag Flags) Flags

func (Flags) String

func (f Flags) String() string
type GLink struct {
	// IsRefnc - else is Refnt (Referents are much more numerous)
	IsRefnc bool
	// IsExtl - else is Intl (which are more numerous)
	IsExtl bool
	// AddressMode is "http", "key", "idref", "uri"
	AddressMode string
	// Att is the XML attribute - id, idref, href, xref, keyref, etc.
	Att string
	// Tag is the tag that has this link-related attribute of interest
	Tag string
	// Link_raw as read in during parsing
	Link_raw string
	// RelFP can be a URI or the resolution of a keyref.
	// "" if target is in same file; NOTE This is relative to the
	// sourcing file, NOT to the current working directory during parsing!
	RelFP string
	// AbsFP can be a URI or the resolution of a keyref.
	// "" if target is in same file
	AbsFP FU.AbsFilePath
	// TopicID iff present (but isn't it mandatory?)
	TopicID string
	// FragID is peeled off from Raw
	FragID string
	// Resolved is used to narrow in on difficult cases
	Resolved bool
	// LinkedFrom is the GTag where the GLink is defined
	LinkedFrom *gtree.GTag
	// Original can be nil: it is the tag where the GLink is resolved to,
	// i.e. the REFERENT, and is quite possibly in another file, which we
	// hope we also have available in memory.
	Original *gtree.GTag
}

GLink summarizes a link (or key) (or reference) found in markup content. It is either URI-based (`href conref id`) or key-based (`key keyref`). It applies to all LwDITA formats, but not all fields apply to all LwDITA formats.

type GLinks struct {
	// OwnerP points back to the owning struct, so that
	// `GLink`s can be processed easily as simple data structures.
	OwnerP interface{}
	// KeyRefncs are outgoing key-based links/references
	KeyRefncs []*GLink // (Extl|Intl)KeyReferences
	// KeyRefnts are unique key-based definitions that are possible
	// referents (resolution targets) of same or other files' [KeyRefncs]
	KeyRefnts []*GLink // (Extl|Intl)KeyDefs
	// UriRefncs are outgoing URI-based links/references
	UriRefncs []*GLink // (Extl|Intl)UriReferences
	// UriRefnts are unique URI-based definitions that are possible
	// referents (resolution targets) of same or other files' [UriRefncs]
	UriRefnts []*GLink // (Extl|Intl)UriDefs
}

GLinks is used for (1) intra-file ref resolution, (2) inter-file ptr resolution, (3) ToC generation.

type LL

type LL LU.Level
var LDebug, LInfo, LOkay, LWarning, LError, LPanic LL

type LinkInfo

type LinkInfo struct {
}

type LinkInfos

type LinkInfos struct {
	Conrefs  []LinkInfo
	Keyrefs  []LinkInfo
	Datarefs []LinkInfo
	// contains filtered or unexported fields
}

LinkInfos is: @conref to reuse block-level content, @keyref to reuse phrase-level content. TODO Each type of link (i.e. elm/att where it occurs) has to be categorised. TODO Each format of link target has to be categorised.

  • Cross ref : <xref> : <a href> : [link](/URI "title")
  • Key def : <keydef> : <div data-class="keydef"> : <div data-class="keydef"> in HDITA syntax
  • Map : <map> : <nav> : See Example of an MDITA map (20)
  • Topic ref : <topicref> : <a href> inside a <li> : [link](/URI "title") inside a list item

TODO Stuff to get:

  • XDITA map - topicref @href (w @format) - task @id
  • HDITA - article @id - span @data-keyref - p @data-conref
  • MDITA - has YAML "id" - uses <p @data-conref> - uses <span @data-keyref> - uses MD [link_text](link_target.dita) - uses ![The remote](../images/remote-control-callouts.png "The remote")
  • XDITA - topic @id - ph @keyref - image @href - p @id - video/source @value - section @id - p @conref

In GFile: LinkInfos:

type LogInfo

type LogInfo struct {
	Lindex int
	Lstage string
	W      io.Writer
}

LogInfo exists mainly to provide a grep'able string: for example "(01:4a)", where 01 is the index of the Contentity and 4a is the processing stage. This is obviously a candidate for replacement by stdlib's slog.

The io.Writer field W exists outside of the github.com/fbaube/mlog logging subsystem and should only be used if `mlog` is not. .
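The grep'able prefix could be produced like so (a sketch; `prefix` is a hypothetical helper, and the real String method's exact format is not documented here):

```go
package main

import "fmt"

type LogInfo struct {
	Lindex int
	Lstage string
}

// prefix formats the grep'able marker described above,
// e.g. index 1 at stage "4a" -> "(01:4a)". Hypothetical helper.
func prefix(li LogInfo) string {
	return fmt.Sprintf("(%02d:%s)", li.Lindex, li.Lstage)
}

func main() {
	fmt.Println(prefix(LogInfo{Lindex: 1, Lstage: "4a"}))
}
```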

func (*LogInfo) String

func (p *LogInfo) String() string

type NodeStringser

type NodeStringser interface {
	NodeEcho(int) string
	NodeInfo(int) string
	NodeDebug(int) string
	NodeCount() int
}

type RootContentity

type RootContentity Contentity

RootContentity makes assignments to/from root node explicit.

func NewRootContentity

func NewRootContentity(aRootPath string) *RootContentity

NewRootContentity returns a RootContentity Nord (i.e. node with ordered children) that can be the root of a new Contentity tree. It requires that argument aRootPath is an absolute filepath and is a directory. .

type StringTally

type StringTally map[string]int
var GlobalAttTally StringTally
var GlobalTagTally StringTally

func (StringTally) StringSortedValues

func (st StringTally) StringSortedValues() string
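A StringTally is just a map of counts. A sketch of tallying and producing stable sorted output (`sortedKeys` is hypothetical; the real StringSortedValues output format is not shown on this page):

```go
package main

import (
	"fmt"
	"sort"
)

type StringTally map[string]int

// sortedKeys returns the tallied strings in sorted order,
// one way a method like StringSortedValues might stabilize
// its output despite Go's randomized map iteration.
func (st StringTally) sortedKeys() []string {
	keys := make([]string, 0, len(st))
	for k := range st {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	tally := StringTally{}
	for _, tag := range []string{"p", "ph", "p", "xref"} {
		tally[tag]++ // tally each tag occurrence
	}
	fmt.Println(tally.sortedKeys(), tally["p"])
}
```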
